Language Detection Tool¶
The tools/detect_languages.py script is an internal development tool for exploring fast-langdetect language detection behavior. I use it during the development of this package to:
- Understand how the language detection models behave with different inputs
- Debug cases where language detection might be giving unexpected results
- Compare the behavior of small (fast) and large (accurate) models
- Identify potential edge cases or ambiguous text
- Verify language detection accuracy for different scripts and language combinations
The script provides language detection capabilities using the fast-langdetect library. It can analyze text either from a file or interactively, and can use either a small (fast) or large (more accurate) model.
Run it via:
just detect /path/to/data.txt
or:
uv run tools/detect_languages.py /path/to/data.txt
Features¶
- Detect languages in text files or interactive input
- Compare results between small (fast) and large (accurate) models
- Display results in a formatted table with highlighted highest scores
- Handle multiple languages per sentence with confidence scores
- Skip comments and blank lines in input files
Usage¶
File Analysis¶
Analyze a text file using either the small or large model:
# Using small model (default)
python tools/detect_languages.py input.txt
# Using large model
python tools/detect_languages.py input.txt --model=large
Example output:
Language Detection Results
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Text ┃ ZH ┃ EN ┃ Other ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 你好。 │ 0.98 │ │ yue:0.02 │
│ Hello, how are you? │ │ 0.95 │ fr:0.03 de:0.02 │
│ 我很好,谢谢。 │ 0.93 │ │ yue:0.07 │
└───────────────────────────────────┴──────┴──────┴──────────────────────────┘
The table shows:
- The input text in the first column
- Columns for languages that appear frequently or with high confidence
- An "Other" column showing additional detected languages
- The highest-confidence score on each line in bold
Interactive Mode¶
Run in interactive mode to compare small and large models side by side:
python tools/detect_languages.py -i
Example session:
Enter text to analyze (Ctrl+D or Ctrl+C to exit)
Text> 你好,世界!
Small model: zh:0.98 yue:0.02
Large model: zh:0.99 yue:0.01
Text> Hello, world!
Small model: en:0.95 fr:0.03 de:0.02
Large model: en:0.98 fr:0.01 de:0.01
Text> Bonjour le monde!
Small model: fr:0.92 en:0.05 de:0.03
Large model: fr:0.97 en:0.02 de:0.01
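The invocations in this section suggest a command-line interface roughly like the one below. This is a minimal sketch with `argparse`, not the script's actual parser: the names `build_parser` and `parse_args` are hypothetical, and the rule that either a file path or `-i` must be given is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the commands shown above (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="detect_languages.py")
    # Optional positional: the text file to analyze.
    parser.add_argument("file", nargs="?", help="text file to analyze")
    # The small model is the documented default.
    parser.add_argument("--model", choices=["small", "large"], default="small")
    # Interactive mode compares both models side by side.
    parser.add_argument("-i", dest="interactive", action="store_true")
    return parser


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: with neither a file nor -i there is nothing to do.
    if not args.interactive and args.file is None:
        parser.error("provide a file path or pass -i")
    return args
```

Using `choices` lets `argparse` reject an unknown model name with a usage message before any model is loaded.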
Implementation Details¶
Language Selection¶
The script identifies "major languages" for column display based on two criteria:
1. Languages that appear in at least 25% of the sentences with a score ≥ 0.2
2. Languages that have a score at least twice as high as any other language in a sentence
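The two criteria can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: the function name `select_major_languages` and the per-sentence `dict[str, float]` representation are hypothetical, a language is assumed to qualify if it meets either criterion, and criterion 2 is read as "the top score is at least twice the runner-up".

```python
from collections import Counter


def select_major_languages(
    results: list[dict[str, float]],
    min_share: float = 0.25,
    min_score: float = 0.2,
    dominance: float = 2.0,
) -> set[str]:
    """Return the languages that get their own column (hypothetical helper)."""
    frequent = Counter()
    major: set[str] = set()
    for scores in results:
        # Criterion 1: count the sentences where a language scores >= min_score.
        for lang, score in scores.items():
            if score >= min_score:
                frequent[lang] += 1
        # Criterion 2: the top language scores at least `dominance` times
        # the runner-up (assumed reading of "twice as high as any other").
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        if ranked:
            top_lang, top_score = ranked[0]
            runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
            if top_score >= dominance * runner_up:
                major.add(top_lang)
    # Criterion 1, continued: keep languages seen in enough sentences.
    threshold = min_share * len(results)
    major.update(lang for lang, n in frequent.items() if n >= threshold)
    return major
```

With the three sentences from the example table, this sketch selects `zh` and `en`, which matches the columns shown there.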
Score Display¶
- Scores below 0.01 are filtered out
- The highest score for each line is shown in bold
- Languages are sorted by total score across all sentences for consistent column ordering
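These display rules could look roughly like the sketch below. The helper names `format_scores` and `column_order` are hypothetical, and the `[bold]...[/bold]` tags assume the output is rendered through rich's console markup; the script itself may build its table cells differently.

```python
from collections import defaultdict

MIN_SCORE = 0.01  # scores below this are filtered out


def format_scores(scores: dict[str, float]) -> str:
    """Render one line's scores, highest first, with the top score in bold."""
    kept = {lang: s for lang, s in scores.items() if s >= MIN_SCORE}
    if not kept:
        return ""  # empty detection result: nothing to show
    best = max(kept, key=kept.get)
    parts = []
    for lang, s in sorted(kept.items(), key=lambda kv: kv[1], reverse=True):
        cell = f"{lang}:{s:.2f}"
        parts.append(f"[bold]{cell}[/bold]" if lang == best else cell)
    return " ".join(parts)


def column_order(results: list[dict[str, float]]) -> list[str]:
    """Order languages by total score across all sentences, ties by name."""
    totals: dict[str, float] = defaultdict(float)
    for scores in results:
        for lang, s in scores.items():
            totals[lang] += s
    return sorted(totals, key=lambda lang: (-totals[lang], lang))
```

Sorting on the total score (with the language code as a tie-breaker) keeps the column order stable between runs on the same input.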
Input File Format¶
- Lines starting with # are treated as comments and skipped
- Blank lines are ignored
- All other lines are treated as text to analyze
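A minimal sketch of these input rules, assuming lines are stripped of surrounding whitespace before the checks (so an indented `#` also counts as a comment). The name `iter_sentences` is hypothetical and not taken from the script.

```python
from collections.abc import Iterable, Iterator


def iter_sentences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the lines to analyze, skipping comments and blank lines."""
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and lines starting with '#'.
        if not line or line.startswith("#"):
            continue
        yield line
```

Because it accepts any iterable of strings, the same function works on an open file handle, for example `list(iter_sentences(open(path, encoding="utf-8")))`, or on a list of test strings.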
Dependencies¶
- fast-langdetect: Language detection library
- rich: Terminal formatting and tables
Error Handling¶
- Gracefully handles Ctrl+C and Ctrl+D in interactive mode
- Validates command-line arguments
- Skips invalid input lines
- Handles empty detection results
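The interactive-mode handling can be sketched as a small prompt loop. This is an illustration, not the script's code: `interactive_loop` and its `analyze`, `read`, and `write` parameters are hypothetical, with `read` and `write` injectable so the loop can be exercised without a terminal. In Python, Ctrl+D at an `input()` prompt raises `EOFError` and Ctrl+C raises `KeyboardInterrupt`.

```python
from collections.abc import Callable


def interactive_loop(
    analyze: Callable[[str], str],
    read: Callable[[str], str] = input,
    write: Callable[[str], None] = print,
) -> int:
    """Prompt until Ctrl+D or Ctrl+C; return the number of lines analyzed."""
    count = 0
    write("Enter text to analyze (Ctrl+D or Ctrl+C to exit)")
    while True:
        try:
            text = read("Text> ").strip()
        except (EOFError, KeyboardInterrupt):
            write("")  # end the prompt line cleanly before exiting
            return count
        if not text:
            continue  # ignore empty input instead of calling the detector
        write(analyze(text))
        count += 1
```

Catching both exceptions in one place means either key combination ends the session without a traceback.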
Development Use Cases¶
Debugging Ambiguous Cases¶
Use the interactive mode to quickly test phrases that might be ambiguous between languages:
Text> 我系学生
Small model: zh:0.45 yue:0.55
Large model: yue:0.75 zh:0.25
Model Comparison¶
Compare how the small and large models handle edge cases:
Text> Je suis étudiant
Small model: fr:0.85 en:0.10 de:0.05
Large model: fr:0.95 en:0.03 de:0.02
Batch Analysis¶
Analyze test files containing known problematic or edge cases:
python tools/detect_languages.py tests/data/mandarin-wu-ambiguous.txt --model=large