contextual-langdetect

A context-aware language detection library that improves accuracy by considering document-level language patterns.

Use Case

This library is designed for processing corpora where individual lines or sentences might be in different languages, but with a strong prior that there are only one or two primary languages. It uses document-level context to improve accuracy in cases where individual sentences might be ambiguously detected.

For example, in a primarily Chinese corpus:

  • Some sentences might be detected in isolation as Japanese, but if they contain no kana characters they are more likely Chinese (see the sketch after this list)
  • Some sentences might be detected as Wu Chinese (wuu), but in a Mandarin context they're likely Mandarin
  • The library uses the dominant language(s) in the corpus to resolve these ambiguities
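
The kana heuristic can be pictured with a short sketch. This is illustrative only, not the library's internal implementation; the helper names contains_kana and resolve_japanese are hypothetical.

# Illustrative sketch of the "Japanese without kana" heuristic described above.
# Not the library's actual implementation.
def contains_kana(text: str) -> bool:
    """Return True if the text contains any hiragana or katakana characters."""
    return any(
        "\u3040" <= ch <= "\u309f"  # hiragana block
        or "\u30a0" <= ch <= "\u30ff"  # katakana block
        for ch in text
    )

def resolve_japanese(text: str, detected: str, dominant: str = "zh") -> str:
    """If a sentence was detected as Japanese but contains no kana,
    fall back to the corpus's dominant language (assumed here to be Mandarin)."""
    if detected == "ja" and not contains_kana(text):
        return dominant
    return detected

print(resolve_japanese("很好。", "ja"))  # -> "zh"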

This is particularly useful for:

  • Transcriptions of bilingual conversations
  • Language instruction texts and transcriptions
  • Mixed-language documents where the majority language should inform ambiguous cases

Features

  • Accurate language detection with confidence scores
  • Context-aware detection that uses surrounding text to disambiguate
  • Special case handling for commonly confused languages (e.g., Wu Chinese, Japanese without kana)
  • Support for mixed language documents

Installation

pip install contextual-langdetect

Usage

from contextual_langdetect import contextual_detect

# Process a document with context-awareness
sentences = [
    "你好。",  # Detected as ZH
    "你好吗?",  # Detected as ZH
    "很好。",  # Detected as JA when model=small
    "我家也有四个,刚好。",  # Detected as ZH
    "那么现在天气很冷,你要开暖气吗?",  # Detected as WUU
    "Okay, fine I'll see you next week.",  # English
    "Great, I'll see you then.",  # English
]

# Context-unaware language detection
languages = contextual_detect(sentences, context_correction=False)
print(languages)
# Output: ['zh', 'zh', 'ja', 'zh', 'wuu', 'en', 'en']

# Context-aware language detection
languages = contextual_detect(sentences)
print(languages)
# Output: ['zh', 'zh', 'zh', 'zh', 'zh', 'en', 'en']

# Context-aware detection with language biasing
# Specify expected languages to improve detection in ambiguous cases
languages = contextual_detect(sentences, languages=["zh", "en"])
print(languages)
# Output: ['zh', 'zh', 'zh', 'zh', 'zh', 'en', 'en']

# Force a specific language for all sentences
languages = contextual_detect(sentences, languages=["en"])
print(languages)
# Output: ['en', 'en', 'en', 'en', 'en', 'en', 'en']
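
Building on the example above, the following sketch shows one way to apply contextual_detect to a whole transcript and group lines by detected language. The file name mixed.txt is hypothetical; only the contextual_detect call demonstrated above is assumed.

from collections import defaultdict

from contextual_langdetect import contextual_detect

# Read non-empty lines from a (hypothetical) mixed-language transcript.
with open("mixed.txt", encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]

# Detect languages with document-level context, then group lines by language.
by_language: dict[str, list[str]] = defaultdict(list)
for line, lang in zip(lines, contextual_detect(lines)):
    by_language[lang].append(line)

for lang, grouped in by_language.items():
    print(f"{lang}: {len(grouped)} lines")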

Dependencies

This library builds upon:

  • fast-langdetect for base language detection
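
For reference, a minimal sketch of calling fast-langdetect directly is shown below; it produces a single per-sentence prediction with a confidence score, which is the baseline signal that contextual-langdetect adjusts using document-level context. The exact return shape is an assumption based on fast-langdetect's published API and may vary between versions.

from fast_langdetect import detect

# Context-free detection of a single sentence.
# Typically returns a dict such as {"lang": "zh", "score": 0.87} (assumed shape).
result = detect("很好。")
print(result["lang"], result["score"])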

Development

For development instructions, see DEVELOPMENT.md.

Related Projects

  • audio2anki - Extract audio from video files for creating Anki language flashcards
  • add2anki - Browser extension to add words and phrases to Anki language learning decks

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Oliver Steele (@osteele on GitHub)