contextual-langdetect

A context-aware language detection library that improves accuracy by considering document-level language patterns.

Use Case

This library is designed for processing corpora where individual lines or sentences might be in different languages, but with a strong prior that there are only one or two primary languages. It uses document-level context to improve accuracy in cases where individual sentences might be ambiguously detected.

For example, in a primarily Chinese corpus:

  • Some sentences might be detected in isolation as Japanese, but if they contain no kana characters, they're likely Chinese
  • Some sentences might be detected as Wu Chinese (wuu), but in a Mandarin context they're likely Mandarin
  • The library uses the dominant language(s) in the corpus to resolve these ambiguities

This is particularly useful for:

  • Transcriptions of bilingual conversations
  • Language instruction texts and transcriptions
  • Mixed-language documents where the majority language should inform ambiguous cases

Features

  • Accurate language detection with confidence scores
  • Context-aware detection that uses surrounding text to disambiguate
  • Special case handling for commonly confused languages (e.g., Wu Chinese, Japanese without kana)
  • Support for mixed language documents

Installation

pip install contextual-langdetect

Usage

contextual_detect

from contextual_langdetect import contextual_detect

# Process a document with context-awareness
sentences = [
    "你好。",  # Detected as ZH
    "你好吗?",  # Detected as ZH
    "很好。",  # Detected as JA when model=small
    "我家也有四个,刚好。",  # Detected as ZH
    "那么现在天气很冷,你要开暖气吗?",  # Detected as WUU
    "Okay, fine I'll see you next week.",  # English
    "Great, I'll see you then.",  # English
]

# Context-unaware language detection
languages = contextual_detect(sentences, context_correction=False)
print(languages)
# Output: ['zh', 'zh', 'ja', 'zh', 'wuu', 'en', 'en']

# Context-aware language detection
languages = contextual_detect(sentences)
print(languages)
# Output: ['zh', 'zh', 'zh', 'zh', 'zh', 'en', 'en']

# Context-aware detection with language biasing
# Specify expected languages to improve detection in ambiguous cases
languages = contextual_detect(sentences, languages=["zh", "en"])
print(languages)
# Output: ['zh', 'zh', 'zh', 'zh', 'zh', 'en', 'en']

# Force a specific language for all sentences
languages = contextual_detect(sentences, languages=["en"])
print(languages)
# Output: ['en', 'en', 'en', 'en', 'en', 'en', 'en']
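
Since contextual_detect accepts any sequence of strings, you can run it over a whole file one line at a time. A minimal sketch (the filename transcript.txt is just an illustration):

from contextual_langdetect import contextual_detect

# Read non-empty lines from a transcript and detect each line's
# language using document-level context.
with open("transcript.txt", encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]

for line, lang in zip(lines, contextual_detect(lines)):
    print(f"{lang}: {line}")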

count_by_language

def count_by_language(
    sentences: Sequence[str],
    languages: Sequence[Language] | None = None,
    model: ModelSize = ModelSize.SMALL,
    context_correction: bool = True,
) -> dict[Language, int]

Given a batch of sentences, returns a dict mapping language codes to the number of sentences assigned to each language, using the contextual detection algorithm.

Example:

from contextual_langdetect.detection import count_by_language

sentences = [
    "Hello world.",
    "Bonjour le monde.",
    "Hallo Welt.",
    "Hello again.",
]
counts = count_by_language(sentences)
# Example output: {'en': 2, 'fr': 1, 'de': 1}
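
A rough mental model for count_by_language (not necessarily the library's internal implementation) is a Counter over contextual_detect's per-sentence labels:

from collections import Counter
from contextual_langdetect import contextual_detect

sentences = ["Hello world.", "Bonjour le monde.", "Hallo Welt.", "Hello again."]
# Roughly: count_by_language(sentences) == dict(Counter(contextual_detect(sentences)))
counts = dict(Counter(contextual_detect(sentences)))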

get_languages_by_count

def get_languages_by_count(
    sentences: Sequence[str],
    languages: Sequence[Language] | None = None,
    model: ModelSize = ModelSize.SMALL,
    context_correction: bool = True,
) -> list[tuple[Language, int]]

Given a batch of sentences, returns a list of (language, count) tuples sorted by decreasing count, using the contextual detection algorithm.

Example:

from contextual_langdetect.detection import get_languages_by_count

sentences = [
    "Hello world.",
    "Bonjour le monde.",
    "Hallo Welt.",
    "Hello again.",
]
language_counts = get_languages_by_count(sentences)
# Example output: [('en', 2), ('fr', 1), ('de', 1)]
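
Equivalently, this is the sorted view of count_by_language's result; a minimal sketch (ordering among equal counts may differ):

from contextual_langdetect.detection import count_by_language

sentences = ["Hello world.", "Bonjour le monde.", "Hallo Welt.", "Hello again."]
pairs = sorted(count_by_language(sentences).items(), key=lambda kv: kv[1], reverse=True)
# e.g. [('en', 2), ('fr', 1), ('de', 1)]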

get_majority_language

def get_majority_language(
    sentences: Sequence[str],
    languages: Sequence[Language] | None = None,
    model: ModelSize = ModelSize.SMALL,
    context_correction: bool = True,
) -> Language | None

Given a batch of sentences, returns the language code with the highest count (the majority language), or None if there are no sentences.

Example:

from contextual_langdetect.detection import get_majority_language

sentences = [
    "Hello world.",
    "Bonjour le monde.",
    "Hallo Welt.",
    "Hello again.",
]
majority_language = get_majority_language(sentences)
# Example output: 'en'
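
Per the definitions above, the majority language is the first entry of get_languages_by_count (up to tie-breaking), so the following sketch behaves the same way:

from contextual_langdetect.detection import get_languages_by_count

sentences = ["Hello world.", "Bonjour le monde.", "Hallo Welt.", "Hello again."]
ranked = get_languages_by_count(sentences)
majority = ranked[0][0] if ranked else None  # matches get_majority_language(sentences)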

Dependencies

This library builds upon:

  • LlmKira/fast-langdetect for base language detection
  • zafercavdar/fasttext-langdetect (transitively), which fast-langdetect builds on
  • FastText by Facebook, which both of these projects wrap

Development

For development instructions, see DEVELOPMENT.md.

Related Projects

  • add2anki - Browser extension to add words and phrases to Anki language learning decks. contextual-langdetect was extracted from this.
  • audio2anki - Extract audio from video files for creating Anki language flashcards. add2anki was developed to support this and other tools.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Oliver Steele (@osteele on GitHub)