contextual-langdetect

A context-aware language detection library that improves accuracy by considering document-level language patterns.

Use Case

This library is designed for processing corpora where individual lines or sentences might be in different languages, but with a strong prior that there are only one or two primary languages. It uses document-level context to improve accuracy in cases where individual sentences might be ambiguously detected.

For example, in a primarily Chinese corpus:

  • Some sentences might be detected in isolation as Japanese, but if they contain no kana characters, they're likely Chinese
  • Some sentences might be detected as Wu Chinese (wuu), but in a Mandarin context they're likely Mandarin
  • The library uses the dominant language(s) in the corpus to resolve these ambiguities

This is particularly useful for:

  • Transcriptions of bilingual conversations
  • Language instruction texts and transcriptions
  • Mixed-language documents where the majority language should inform ambiguous cases

Features

  • Accurate language detection with confidence scores
  • Context-aware detection that uses surrounding text to disambiguate
  • Special case handling for commonly confused languages (e.g., Wu Chinese, Japanese without kana)
  • Support for mixed language documents

Installation

pip install contextual-langdetect

Usage

contextual_detect

from contextual_langdetect import contextual_detect

# Process a document with context-awareness
sentences = [
    "你好。",  # Detected as ZH
    "你好吗?",  # Detected as ZH
    "很好。",  # Detected as JA when model=small
    "我家也有四个,刚好。",  # Detected as ZH
    "那么现在天气很冷,你要开暖气吗?",  # Detected as WUU
    "Okay, fine I'll see you next week.",  # English
    "Great, I'll see you then.",  # English
]

# Context-unaware language detection
languages = contextual_detect(sentences, context_correction=False)
print(languages)
# Output: ['zh', 'zh', 'ja', 'zh', 'wuu', 'en', 'en']

# Context-aware language detection
languages = contextual_detect(sentences)
print(languages)
# Output: ['zh', 'zh', 'zh', 'zh', 'zh', 'en', 'en']

# Context-aware detection with language biasing
# Specify expected languages to improve detection in ambiguous cases
languages = contextual_detect(sentences, languages=["zh", "en"])
print(languages)
# Output: ['zh', 'zh', 'zh', 'zh', 'zh', 'en', 'en']

# Force a specific language for all sentences
languages = contextual_detect(sentences, languages=["en"])
print(languages)
# Output: ['en', 'en', 'en', 'en', 'en', 'en', 'en']
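
Since contextual_detect accepts any sequence of strings, you can run it over a whole file one line at a time. A minimal sketch (the filename transcript.txt is just an illustration):

from contextual_langdetect import contextual_detect

# Read non-empty lines from a transcript and detect each line's
# language using document-level context.
with open("transcript.txt", encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]

for line, lang in zip(lines, contextual_detect(lines)):
    print(f"{lang}: {line}")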

count_by_language

def count_by_language(
    sentences: Sequence[str],
    languages: Sequence[Language] | None = None,
    model: ModelSize = ModelSize.SMALL,
    context_correction: bool = True,
) -> dict[Language, int]

Given a batch of sentences, returns a dict mapping language codes to the number of sentences assigned to each language, using the contextual detection algorithm.

Example:

from contextual_langdetect.detection import count_by_language

sentences = [
    "Hello world.",
    "Bonjour le monde.",
    "Hallo Welt.",
    "Hello again.",
]
counts = count_by_language(sentences)
# Example output: {'en': 2, 'fr': 1, 'de': 1}
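
A rough mental model for count_by_language (not necessarily the library's internal implementation) is a Counter over contextual_detect's per-sentence labels:

from collections import Counter
from contextual_langdetect import contextual_detect

sentences = ["Hello world.", "Bonjour le monde.", "Hallo Welt.", "Hello again."]
# Roughly: count_by_language(sentences) == dict(Counter(contextual_detect(sentences)))
counts = dict(Counter(contextual_detect(sentences)))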

get_languages_by_count

def get_languages_by_count(
    sentences: Sequence[str],
    languages: Sequence[Language] | None = None,
    model: ModelSize = ModelSize.SMALL,
    context_correction: bool = True,
) -> list[tuple[Language, int]]

Given a batch of sentences, returns a list of (language, count) tuples sorted by decreasing count, using the contextual detection algorithm.

Example:

from contextual_langdetect.detection import get_languages_by_count

sentences = [
    "Hello world.",
    "Bonjour le monde.",
    "Hallo Welt.",
    "Hello again.",
]
language_counts = get_languages_by_count(sentences)
# Example output: [('en', 2), ('fr', 1), ('de', 1)]
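
Equivalently, this is the sorted view of count_by_language's result; a minimal sketch (ordering among equal counts may differ):

from contextual_langdetect.detection import count_by_language

sentences = ["Hello world.", "Bonjour le monde.", "Hallo Welt.", "Hello again."]
pairs = sorted(count_by_language(sentences).items(), key=lambda kv: kv[1], reverse=True)
# e.g. [('en', 2), ('fr', 1), ('de', 1)]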

get_majority_language

def get_majority_language(
    sentences: Sequence[str],
    languages: Sequence[Language] | None = None,
    model: ModelSize = ModelSize.SMALL,
    context_correction: bool = True,
) -> Language | None

Given a batch of sentences, returns the language code with the highest count (the majority language), or None if there are no sentences.

Example:

from contextual_langdetect.detection import get_majority_language

sentences = [
    "Hello world.",
    "Bonjour le monde.",
    "Hallo Welt.",
    "Hello again.",
]
majority_language = get_majority_language(sentences)
# Example output: 'en'
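
Per the definitions above, the majority language is the first entry of get_languages_by_count (up to tie-breaking), so the following sketch behaves the same way:

from contextual_langdetect.detection import get_languages_by_count

sentences = ["Hello world.", "Bonjour le monde.", "Hallo Welt.", "Hello again."]
ranked = get_languages_by_count(sentences)
majority = ranked[0][0] if ranked else None  # matches get_majority_language(sentences)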

Dependencies

This library builds upon:

  • LlmKira/fast-langdetect for base language detection
  • zafercavdar/fasttext-langdetect (transitively), which fast-langdetect builds on
  • FastText by Facebook, which both of these projects wrap

Development

For development instructions, see DEVELOPMENT.md.

Related Projects

  • add2anki - Browser extension to add words and phrases to Anki language learning decks. contextual-langdetect was extracted from this.
  • audio2anki - Extract audio from video files for creating Anki language flashcards. add2anki was developed to support this and other tools.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Oliver Steele (@osteele on GitHub)