Language Detection Tool

The tools/detect_languages.py script is an internal development tool for exploring how fast-langdetect behaves. I use it while developing this package to:

Understand how the language detection models behave with different inputs
Debug cases where language detection might be giving unexpected results
Compare the behavior of small (fast) and large (accurate) models
Identify potential edge cases or ambiguous text
Verify language detection accuracy for different scripts and language combinations

The script is a thin wrapper around the fast-langdetect library. It analyzes text either from a file or interactively, using either the small (fast) or the large (more accurate) model.

Run it via:

just detect /path/to/data.txt

or:

uv run tools/detect_languages.py /path/to/data.txt

Features

Detect languages in text files or interactive input
Compare results between small (fast) and large (accurate) models
Display results in a formatted table with highlighted highest scores
Handle multiple languages per sentence with confidence scores
Skip comments and blank lines in input files

Usage

File Analysis

Analyze a text file using either the small or large model:

# Using small model (default)
python tools/detect_languages.py input.txt

# Using large model
python tools/detect_languages.py input.txt --model=large

Example output:

                             Language Detection Results
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Text                              ┃   ZH ┃   EN ┃ Other                    ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 你好。                             │ 0.98 │      │ yue:0.02                 │
│ Hello, how are you?               │      │ 0.95 │ fr:0.03 de:0.02          │
│ 我很好，谢谢。                      │ 0.93 │      │ yue:0.07                 │
└───────────────────────────────────┴──────┴──────┴──────────────────────────┘

The table shows:

- The input text in the first column
- Columns for languages that appear frequently or with high confidence
- An "Other" column showing additional detected languages
- Bold scores indicate the highest confidence for each line
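The row layout described above can be sketched as follows. This is a minimal illustration, not the script's actual code: build_row is a hypothetical helper, it assumes each sentence's result is a dict mapping language code to score, and it marks the highest score with rich-style [bold] markup.

```python
def build_row(text, scores, major_langs):
    """Return the cells for one table row: text, major-language columns, Other."""
    # Highest score on this line (None when nothing was detected).
    top = max(scores.values(), default=None)

    def mark(label, score):
        # Wrap the line's highest score in rich-style bold markup.
        return f"[bold]{label}[/bold]" if score == top else label

    # One cell per major language; empty when the language was not detected.
    major_cells = [
        mark(f"{scores[lang]:.2f}", scores[lang]) if lang in scores else ""
        for lang in major_langs
    ]
    # Everything else goes into the "Other" column, highest score first.
    others = sorted(
        ((lang, score) for lang, score in scores.items() if lang not in major_langs),
        key=lambda item: item[1],
        reverse=True,
    )
    other_cell = " ".join(mark(f"{lang}:{score:.2f}", score) for lang, score in others)
    return [text, *major_cells, other_cell]


row = build_row("Hello, how are you?", {"en": 0.95, "fr": 0.03, "de": 0.02}, ["zh", "en"])
print(row)  # ['Hello, how are you?', '', '[bold]0.95[/bold]', 'fr:0.03 de:0.02']
```

Each returned list could then be passed to a table's row-adding call, one cell per column.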

Interactive Mode

Run in interactive mode to compare small and large models side by side:

python tools/detect_languages.py -i

Example session:

Enter text to analyze (Ctrl+D or Ctrl+C to exit)

Text> 你好，世界！
Small model: zh:0.98 yue:0.02
Large model: zh:0.99 yue:0.01

Text> Hello, world!
Small model: en:0.95 fr:0.03 de:0.02
Large model: en:0.98 fr:0.01 de:0.01

Text> Bonjour le monde!
Small model: fr:0.92 en:0.05 de:0.03
Large model: fr:0.97 en:0.02 de:0.01
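The per-model lines in the session above could be produced by a helper like the one below. It is only a sketch: format_scores and comparison_lines are hypothetical names, and it assumes the results from both models are already available as dicts mapping language code to score.

```python
def format_scores(scores):
    """Render scores as 'lang:score' pairs, highest score first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return " ".join(f"{lang}:{score:.2f}" for lang, score in ranked)


def comparison_lines(small_scores, large_scores):
    """Return the two output lines shown for one piece of input text."""
    return [
        f"Small model: {format_scores(small_scores)}",
        f"Large model: {format_scores(large_scores)}",
    ]


for line in comparison_lines({"zh": 0.98, "yue": 0.02}, {"zh": 0.99, "yue": 0.01}):
    print(line)
```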

Implementation Details

Language Selection

The script identifies "major languages" for column display based on two criteria:

1. Languages that appear in at least 25% of the sentences with a score ≥ 0.2
2. Languages that have a score at least twice as high as any other language in a sentence
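The two criteria can be sketched in a few lines. This is one possible reading rather than the script's real implementation: select_major_languages and its parameter names are hypothetical, results are assumed to be one dict of language code to score per sentence, and the second criterion is interpreted as "at least twice every other score within the same sentence" (so a language detected alone in a sentence counts as dominant).

```python
from collections import Counter


def select_major_languages(results, min_share=0.25, min_score=0.2, dominance=2.0):
    """Pick the languages that deserve their own column."""
    if not results:
        return set()

    frequent = Counter()  # sentences in which a language scores >= min_score
    major = set()
    for scores in results:
        for lang, score in scores.items():
            if score >= min_score:
                frequent[lang] += 1
            # Criterion 2: the score is at least `dominance` times every other
            # score in this sentence.
            others = [s for other, s in scores.items() if other != lang]
            if all(score >= dominance * s for s in others):
                major.add(lang)

    # Criterion 1: frequent enough across the whole input.
    threshold = min_share * len(results)
    major.update(lang for lang, count in frequent.items() if count >= threshold)
    return major
```

With five sentences, for example, a language needs a score of at least 0.2 in two of them to satisfy the first criterion, because 25% of five is 1.25.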

Score Display

Scores below 0.01 are filtered out
The highest score for each line is shown in bold
Languages are sorted by total score across all sentences for consistent column ordering
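The filtering and column-ordering rules above might look like this. The function names are hypothetical, per-sentence results are assumed to be dicts of language code to score, and the alphabetical tie-break is an added assumption to keep the ordering deterministic.

```python
def drop_low_scores(scores, floor=0.01):
    """Remove scores below the display floor."""
    return {lang: score for lang, score in scores.items() if score >= floor}


def column_order(results):
    """Order languages by their total score across all sentences."""
    totals = {}
    for scores in results:
        for lang, score in scores.items():
            totals[lang] = totals.get(lang, 0.0) + score
    # Highest total first; ties broken alphabetically (an assumption).
    return sorted(totals, key=lambda lang: (-totals[lang], lang))


results = [{"zh": 0.98, "yue": 0.02}, {"en": 0.95}, {"zh": 0.93, "yue": 0.07}]
print(column_order(results))  # ['zh', 'en', 'yue']
```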

Input File Format

Lines starting with # are treated as comments and skipped
Blank lines are ignored
All other lines are treated as text to analyze
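A reader that follows these three rules can be very small. The sketch below uses a hypothetical iter_sentences helper and assumes surrounding whitespace is stripped before a line is checked for a leading #.

```python
def iter_sentences(lines):
    """Yield the lines that should be analyzed, skipping comments and blanks."""
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        yield line


sample = ["# Mandarin examples\n", "\n", "你好。\n", "Hello, how are you?\n"]
print(list(iter_sentences(sample)))  # ['你好。', 'Hello, how are you?']
```

Because it accepts any iterable of strings, the same function works on an open file object or on a list built in a test.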

Dependencies

fast-langdetect: Language detection library
rich: Terminal formatting and tables

Error Handling

Gracefully handles Ctrl+C and Ctrl+D in interactive mode
Validates command-line arguments
Skips invalid input lines
Handles empty detection results
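The interactive-mode behaviors listed above (exiting cleanly on Ctrl+C or Ctrl+D, skipping blank input, coping with empty results) can be sketched as a loop like the following. Everything here is illustrative: interactive_loop is a hypothetical name, analyze stands in for whatever function returns a dict of language code to score, and the "No language detected" message is an assumption.

```python
def interactive_loop(analyze, read_line=input, write=print):
    """Prompt for text until the user presses Ctrl+D or Ctrl+C."""
    write("Enter text to analyze (Ctrl+D or Ctrl+C to exit)")
    while True:
        try:
            text = read_line("Text> ")
        except (EOFError, KeyboardInterrupt):
            # Ctrl+D raises EOFError, Ctrl+C raises KeyboardInterrupt.
            write("")
            break
        text = text.strip()
        if not text:
            continue  # nothing to analyze
        scores = analyze(text)
        if not scores:
            write("No language detected")  # hypothetical message
            continue
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        write(" ".join(f"{lang}:{score:.2f}" for lang, score in ranked))
```

Passing read_line and write as parameters keeps the loop testable without a terminal.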

Development Use Cases

Debugging Ambiguous Cases

Use the interactive mode to quickly test phrases that might be ambiguous between languages:

Text> 我系学生
Small model: zh:0.45 yue:0.55
Large model: yue:0.75 zh:0.25

Model Comparison

Compare how the small and large models handle edge cases:

Text> Je suis étudiant
Small model: fr:0.85 en:0.10 de:0.05
Large model: fr:0.95 en:0.03 de:0.02

Batch Analysis

Analyze test files containing known problematic inputs or edge cases:

python tools/detect_languages.py tests/data/mandarin-wu-ambiguous.txt --model=large
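The command-line forms used on this page (a file path, --model=large, and -i) could be parsed with argparse roughly as shown below. This is a sketch under assumptions, not the script's actual parser: the long option name --interactive, the help texts, and the rule that a path is required unless -i is given are all guesses.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Explore language detection results")
    parser.add_argument("path", nargs="?", help="text file to analyze")
    parser.add_argument(
        "--model",
        choices=["small", "large"],
        default="small",
        help="model to use for file analysis",
    )
    # The long name --interactive is an assumption; only -i appears above.
    parser.add_argument("-i", "--interactive", action="store_true")
    return parser


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumed validation rule: file mode needs a path.
    if not args.interactive and args.path is None:
        parser.error("provide a file path or use -i")
    return args


args = parse_args(["tests/data/mandarin-wu-ambiguous.txt", "--model=large"])
print(args.path, args.model)
```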