Text Analysis Tool¶
The tools/analyze_text.py
script is an internal development tool for examining
the context-aware language detection algorithm. It helps developers understand
how the language detection behaves with and without context.
Run it via:
just analyze /path/to/data.txt
or:
uv run tools/analyze_text.py /path/to/data.txt
Features¶
- Two-pass language detection:
- Line-by-line analysis showing individual language detection results
- Context-aware analysis showing how context affects language detection
- Shows original file line numbers for easy reference
- Highlights ambiguous language detections
- Shows confidence scores for each detection
- Indicates when context changes the detected language
- Skips analysis of comments and blank lines (but shows them in output)
Usage¶
File Analysis¶
Analyze a text file:
python tools/analyze_text.py input.txt
Example output:
Text Analysis Results
====================
Basic Statistics:
- Characters: 42
- Lines: 3
- Words: 8
- Paragraphs: 2
Character Categories:
- Letters: 32 (76.2%)
- Uppercase: 5
- Lowercase: 27
- Numbers: 2 (4.8%)
- Punctuation: 4 (9.5%)
- Whitespace: 4 (9.5%)
Unicode Scripts:
- Latin: 28 (66.7%)
- Han: 10 (23.8%)
- Common: 4 (9.5%)
Word Boundaries:
- Word breaks: 8
- Sentence breaks: 2
- Line breaks: 3
Interactive Mode¶
Run in interactive mode for quick analysis:
python tools/analyze_text.py -i
Example session:
Enter text to analyze (Ctrl+D or Ctrl+C to exit)
Text> Hello, 世界!
Character analysis:
- ASCII: Hello, (6 chars)
- CJK: 世界 (2 chars)
- Punctuation: , ! (2 chars)
- Total: 10 characters
Text> 你好,world!
Character analysis:
- ASCII: world (5 chars)
- CJK: 你好 (2 chars)
- Punctuation: , ! (2 chars)
- Total: 9 characters
Example Output¶
Given a file test.txt
with mixed Chinese and English:
# Test file with mixed languages
你好。
Hello, how are you?
# Another section
我很好,谢谢。
Running the tool produces:
Analyzing 3 non-empty, non-comment lines from test.txt
=== LINE-BY-LINE ANALYSIS ===
┏━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓
┃ # ┃ Text ┃ Language ┃ Confidence ┃ Status ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩
│ 1 │ # Test file with... │ │ │ │
│ 2 │ 你好。 │ ZH │ 0.980 │ OK │
│ 3 │ Hello, how are you? │ EN │ 0.950 │ OK │
│ 4 │ │ │ │ │
│ 5 │ # Another section │ │ │ │
│ 6 │ 我很好,谢谢。 │ ZH │ 0.930 │ OK │
└────┴────────────────────┴──────────┴───────────┴──────────┘
=== CONTEXT-AWARE RESULTS ===
┏━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓
┃ # ┃ Original ┃ Resolved ┃ Confidence ┃ Status ┃
┡━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩
│ 2 │ ZH │ │ 0.980 │ OK │
│ 3 │ EN │ │ 0.950 │ OK │
│ 6 │ ZH │ │ 0.930 │ OK │
└────┴──────────┴──────────┴───────────┴──────────┘
Implementation Details¶
Line Filtering¶
- Lines starting with
#
are treated as comments - Blank lines are skipped for language analysis
- Both comments and blank lines are shown in the LINE-BY-LINE table
- Only content lines appear in the CONTEXT-AWARE table
Language Detection¶
The tool performs language detection in two passes:
- Line-by-Line Analysis:
- Each non-comment, non-blank line is analyzed independently
- Shows the most likely language and its confidence score
-
Marks detections as AMBIGUOUS if confidence is low
-
Context-Aware Analysis:
- Analyzes content lines as a group
- Uses surrounding text to improve accuracy
- Shows when context changes the detected language
- Only includes non-comment, non-blank lines
Output Format¶
The tool produces two tables:
- LINE-BY-LINE ANALYSIS:
- Shows all lines from the file
- Includes line numbers for reference
- Empty cells for comments and blank lines
-
Language, confidence, and status for content lines
-
CONTEXT-AWARE RESULTS:
- Shows only content lines
- Original detected language
- Changes made by context (if any)
- Confidence scores
- Detection status
Dependencies¶
- Python's built-in `