Dictaker is a command-line tool that helps you build a personalized English dictionary by analyzing subtitle files (in .xlsx format) downloaded using the Language Reactor browser extension.
It processes subtitle data to identify unknown words and provides detailed information to help you learn and manage your vocabulary effectively.
# Clone the repository
git clone https://github.com/Artem1k/DictionaryApp.git
cd dictaker
# Use your preferred tool to create virtual environment
# Install via pip from source
pip install .
# Run the CLI tool
dictaker --help
# Or
dictaker- The program reads
.xlsxsubtitle files exported via Language Reactor. - Extracts unknown or deferred words.
- Displays words with detailed information.
- Lets you interactively manage your vocabulary (mark as known, ignore, defer, etc.).
- Saves words to a persistent SQLite database, which can be viewed, managed, imported, and exported.
- 🚀 Make dictionary (
make_dict) - 📖 View your personal dictionary (
show_dict) - 🕘 Show history(processed files)
- 📥 Import dictionary
- 📤 Export dictionary
- 🌐 Change target language
This command analyzes .xlsx subtitle files — either a single file, multiple files separated by spaces, or an entire folder. It counts the subtitles, identifies unknown or deferred words, and displays word details:
- Word
- Translation
- Part of speech (POS)
- Frequency in the subtitles
- Frequency already recorded in the database
- Total combined frequency
- Sort by
word,pos,count,count_in_db,total_count,category(ascending or descending). - Mark words as:
- Known
- Ignore
- Defer (temporarily skip during current session)
- Toggle display of known words that appear in subtitles. The category will show up when known words are showing.
- Save word lists to separate files.
- Save and exit.
- Exit without saving.
💡 Tip: You can use multiple indexes or index ranges when marking words. Example:
1 2 3-5 9-6.
View your personal dictionary and continue marking words as known, ignored, or deferred.
Unlike in make_dict, ignored and deferred words still appear in this session view.
This command shows:
- Word itself
- Translation
- Part of speech (POS)
- Frequency already recorded in the database
- Category
- Sort by
id,word,pos,count,category(ascending or descending). - Mark words as:
- Known
- Ignore
- Defer (you can treat this as "Unknown")
- Save and exit.
- Exit without saving.
💡 Tip: Index ranges work here too — e.g., 1 2 3-5.
Showing the history of processed .csv files and the number of unknown words found in each.
Words are saved with a unique ID in the database, so to review the added words, use show_dict.
Imports a dictionary from a .csv file.
If a word already exists, its count is increased accordingly.
Exports your current dictionary to a .csv file - just provide the path to the folder.
Run this command without arguments to see available language options.
This project uses the spaCy library and the en_core_web_md model.
It’s fast and lightweight but not always accurate in part-of-speech (POS) tagging.
Words with a named entity type (ent_type), such as names of people or places, are excluded automatically. However, this isn’t always reliable, so the Ignore category allows you to manually clean up such entries.
Translations use GoogleTranslator. Some modal verbs like "can" or "should" may not translate properly — especially to Ukrainian — maybe because translation is done out of context.
Future versions may use better models or APIs to improve accuracy.
- Add built-in learning tools
- GUI
- Browser extension (maybe)
Contributions are welcome! Feel free to fork, create pull requests, or report issues.
Licensed under the MIT License. See the LICENSE file for more information.