audio2anki

Convert audio and video files into Anki flashcard decks with translations.

audio2anki helps language learners create study materials from audio and video content. It automatically:

  • Transcribes audio using OpenAI Whisper
  • Segments the audio into individual utterances
  • Translates each segment using OpenAI or DeepL
  • Generates pronunciation (currently supports pinyin for Mandarin)
  • Creates Anki-compatible flashcards with audio snippets

For related language learning resources, visit Oliver Steele's Language Learning Resources.

Features

  • ๐ŸŽต Process audio files (mp3, wav, etc.) and video files
  • ๐Ÿค– Automatic transcription using OpenAI Whisper
  • ๐Ÿ”ค Automatic translation and pronunciation
  • โœ‚๏ธ Smart audio segmentation
  • ๐Ÿ“ Optional manual transcript input
  • ๐ŸŽด Anki-ready output with embedded audio
  • ๐Ÿˆท๏ธ Intelligent sentence filtering for better learning materials:
  • Removes one-word segments and incomplete sentences
  • Eliminates duplicates and maintains language consistency

Requirements

  • Python 3.11 or 3.12
  • ffmpeg installed and available in your system's PATH
  • OpenAI API key (set as OPENAI_API_KEY environment variable)

Optional requirements:

  • DeepL API token (set as DEEPL_API_TOKEN environment variable). If this is set, DeepL will be used for translation. OpenAI will still be used for Chinese and Japanese pronunciation.
  • ElevenLabs API token (set as ELEVENLABS_API_TOKEN environment variable). If this is set, the --voice-isolation flag can be used for short (less than an hour) audio files.

Installation

You can install audio2anki using uv, pipx, or pip:

Using uv

  1. Install uv if you don't have it already.
  2. Install audio2anki:
    uv tool install audio2anki
    

Using pipx

  1. Install pipx if you don't have it already.

  2. Install audio2anki:

    pipx install audio2anki
    

Using pip

This method doesn't require a third-party tool, but it is not recommended, as it installs audio2anki into the current Python environment, which may cause conflicts with other packages.

    pip install audio2anki

Usage

Basic Usage

Create an Anki deck from an audio file:

export OPENAI_API_KEY=your-api-key-here
audio2anki audio.mp3

Use an existing transcript:

export OPENAI_API_KEY=your-api-key-here
audio2anki audio.mp3 --transcript transcript.txt

Specify which translation service to use:

# Use OpenAI for translation (default)
audio2anki audio.mp3 --translation-provider openai

# Use DeepL for translation
export DEEPL_API_TOKEN=your-deepl-token-here
audio2anki audio.mp3 --translation-provider deepl

For a complete list of commands, including cache and configuration management, see the CLI documentation.

Common Use Cases

Process a noisy recording with more aggressive silence removal:

audio2anki audio.mp3 --silence-thresh -30

Process a quiet recording or preserve more background sounds:

audio2anki audio.mp3 --silence-thresh -50

Process a podcast with custom segment lengths and silence detection:

audio2anki podcast.mp3 --min-length 2.0 --max-length 20.0 --silence-thresh -35
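The silence threshold is expressed in decibels relative to full scale, a logarithmic scale where 0 dB is maximum loudness, so -30 is a less sensitive threshold than -50. As a rough illustration of what these values mean as linear amplitude ratios (this is general audio math, not a description of audio2anki's internals):

```python
import math

def db_to_ratio(db: float) -> float:
    """Convert a dB value to a linear amplitude ratio (1.0 = full scale)."""
    return 10 ** (db / 20)

# Higher (less negative) thresholds classify more of the audio as silence:
print(f"-30 dB ~ {db_to_ratio(-30):.4f} of full scale")  # ~0.0316
print(f"-40 dB ~ {db_to_ratio(-40):.4f} of full scale")  # ~0.0100
print(f"-50 dB ~ {db_to_ratio(-50):.4f} of full scale")  # ~0.0032
```

Each 20 dB step is a factor of 10 in amplitude, which is why -30 suits noisy recordings (more is treated as silence) and -50 suits quiet ones.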

Process an audio file with voice isolation:

audio2anki --voice-isolation input.m4a

Voice isolation (optional, via ElevenLabs API) can be enabled with the --voice-isolation flag. This uses approximately 1000 ElevenLabs credits per minute of audio (free plan: 10,000 credits/month). By default, transcription uses the raw (transcoded) audio. Use --voice-isolation to remove background noise before transcription.
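Given the approximate rates above (1,000 credits per minute, 10,000 free credits per month), you can budget how much audio the free plan covers. A quick sketch, using only the figures quoted above (check your actual ElevenLabs plan for current rates):

```python
# Rates as quoted above; verify against your ElevenLabs plan.
CREDITS_PER_MINUTE = 1_000
FREE_PLAN_MONTHLY_CREDITS = 10_000

def minutes_available(credits: int, per_minute: int = CREDITS_PER_MINUTE) -> float:
    """How many minutes of voice-isolated audio a credit balance covers."""
    return credits / per_minute

print(minutes_available(FREE_PLAN_MONTHLY_CREDITS))  # → 10.0
```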

Command Line Options

audio2anki <input-file> [options]

Options:
  --transcript FILE                       Use existing transcript
  --output DIR                            Output directory (default: ./output)
  --model MODEL                           Whisper model (tiny, base, small, medium, large)
  --debug                                 Generate debug information
  --min-length SEC                        Minimum segment length (default: 1.0)
  --max-length SEC                        Maximum segment length (default: 15.0)
  --language LANG                         Source language (default: auto-detect)
  --silence-thresh DB                     Silence threshold (default: -40)
  --translation-provider {openai,deepl}   Translation service to use (default: openai)
  --voice-isolation                       Enable voice isolation (via ElevenLabs API)

Environment Variables

Required:

  • OPENAI_API_KEY - OpenAI API key (always required: used for transcription and pronunciation, and for translation unless DeepL is configured)

Optional:

  • DEEPL_API_TOKEN - DeepL API key (recommended for higher quality translations)
  • ELEVENLABS_API_TOKEN - ElevenLabs API key (required for --voice-isolation)

Translation Services

The tool supports two translation services:

  1. DeepL
    • Higher quality translations, especially for European languages
    • Get an API key from DeepL Pro
    • Set environment variable: export DEEPL_API_TOKEN=your-api-key
    • Use with: --translation-provider deepl

  2. OpenAI (Default)
    • Used by default, or when DeepL is not configured or fails
    • Get an API key from OpenAI
    • Set environment variable: export OPENAI_API_KEY=your-api-key
    • Use with: --translation-provider openai

Note: OpenAI is always used for generating pronunciations (Pinyin, Hiragana), even when DeepL is selected for translation.

Output

The script creates:

  1. A tab-separated deck file (deck.txt) containing:
    • Original text (e.g., Chinese characters)
    • Pronunciation (e.g., Pinyin with tone marks)
    • English translation
    • Audio reference
  2. A media directory containing the audio segments
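Each line of deck.txt holds those four fields separated by tabs. As a sketch of the format (the row content below is invented for illustration, not actual audio2anki output):

```python
# One illustrative deck.txt row; fields are tab-separated in the order:
# original text, pronunciation, translation, audio reference.
row = "你好\tnǐ hǎo\thello\t[sound:example_0001.mp3]"

text, pronunciation, translation, audio = row.split("\t")
print(text, pronunciation, translation, audio)
```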

Importing into Anki

  1. Import the Deck:
    • Open Anki
    • Click File > Import
    • Select the generated deck.txt file
    • In the import dialog:
      • Set the Type to "Basic"
      • Check that fields are mapped correctly:
        • Field 1: Front (Original text)
        • Field 2: Pronunciation
        • Field 3: Back (Translation)
        • Field 4: Audio
      • Set "Field separator" to "Tab"
      • Check "Allow HTML in fields"

  2. Import the Audio:
    • Copy all files from the media directory
    • Paste them into your Anki media collection

  3. Verify the Import:
    • The cards should show:
      • Front: Original text
      • Back: Pronunciation, translation, and a play button for audio
    • Test the audio playback on a few cards

Note: The audio filenames include a hash of the source file to prevent conflicts when importing multiple decks.
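One illustrative way to build such hash-suffixed names is to append a short digest of the source file's contents; this is a sketch of the idea, not audio2anki's actual naming scheme:

```python
import hashlib
from pathlib import Path

def media_name(source: Path, index: int) -> str:
    """Illustrative scheme: source stem + short content digest + segment index.

    The digest depends on the file's bytes, so segments cut from different
    source files cannot collide even if the files share a name.
    """
    digest = hashlib.sha256(source.read_bytes()).hexdigest()[:8]
    return f"{source.stem}_{digest}_{index:04d}.mp3"
```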

If you have add2anki version >=0.1.2 installed, you can import directly:

add2anki deck.csv --tags audio2anki

To check your installed version:

add2anki --version

If your version is older than 0.1.2, upgrade with:

uv tool upgrade add2anki
# or, if you installed with pipx:
pipx upgrade add2anki

If you don't have add2anki, or your version is too old, and you have uv installed, you can run:

uvx add2anki deck.csv --tags audio2anki

See the deck README.md for more details.

API Usage Reporting

audio2anki reports per-run API usage for each model, including:

  • Number of API calls
  • Input and output tokens
  • Character cost (for DeepL)
  • Minutes of audio processed (for Whisper)

After processing, a usage report table is displayed. Only columns with nonzero values are shown for clarity.

Example usage report:

OpenAI Usage Report
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”“
โ”ƒ Model      โ”ƒ Calls โ”ƒ Input Tokens โ”ƒ Minutes       โ”ƒ Character Cost โ”ƒ
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
โ”‚ gpt-3.5    โ”‚ 12    โ”‚ 3456         โ”‚ 10.25         โ”‚            โ”‚
โ”‚ ElevenLabs โ”‚ 3     โ”‚              โ”‚ 2.50          โ”‚ 1200       โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

This helps you monitor your API consumption and costs across different services.

Limitations

  • Voice Isolation: The voice isolation feature provided by ElevenLabs is limited to audio files that are less than 500 MB after transcoding and less than 1 hour in duration. Processing larger or longer files may result in an error indicating that ElevenLabs did not return any results.

License

This project is licensed under the MIT License - see the LICENSE file for details.