TL;DR: Vocabulary tests are a useful way to assess language proficiency. This repository lets you create vocabulary tests for many languages. 📝 Try out your vocabulary knowledge
Clone the repository and install the requirements:

```shell
git clone https://github.com/polvanrijn/VocabTest
cd VocabTest
REPO_DIR=$(pwd)
python3.9 -m venv env  # Set up a virtual environment; I used Python 3.9.18 on macOS
source env/bin/activate
pip install -r requirements.txt
pip install -e .
```

### Optionally: Install dictionaries
Make sure you have either hunspell or myspell installed.

```shell
DIR_DICT=~/.config/enchant/hunspell  # if you use hunspell
DIR_DICT=~/.config/enchant/myspell   # if you use myspell
mkdir -p $DIR_DICT
```

Download the LibreOffice dictionaries:
```shell
cd $DIR_DICT
git clone https://github.com/LibreOffice/dictionaries
find dictionaries/ -type f -name "*.dic" -exec mv -i {} . \;
find dictionaries/ -type f -name "*.aff" -exec mv -i {} . \;
rm -Rf dictionaries/
```

Manually install the missing dictionaries:
```shell
# Manually install dictionaries
function get_dictionary() {
  f="$(basename -- "$1")"
  wget "$1" --no-check-certificate
  unzip "$f" "*.dic" "*.aff"
  rm -f "$f"
}
# Urdu
get_dictionary https://versaweb.dl.sourceforge.net/project/aoo-extensions/2536/1/dict-ur.oxt
# Western Armenian
get_dictionary https://master.dl.sourceforge.net/project/aoo-extensions/4841/0/hy_am_western-1.0.oxt
# Galician
get_dictionary https://extensions.libreoffice.org/assets/downloads/z/corrector-18-07-para-galego.oxt
# Welsh
get_dictionary https://master.dl.sourceforge.net/project/aoo-extensions/1583/1/geiriadur-cy.oxt
mv dictionaries/* .
rm -Rf dictionaries/
# Belarusian
get_dictionary https://extensions.libreoffice.org/assets/downloads/z/dict-be-0-58.oxt
# Marathi
get_dictionary https://extensions.libreoffice.org/assets/downloads/73/1662621066/mr_IN-v8.oxt
mv dicts/* .
rm -Rf dicts/
```

Check that all dictionaries are installed:
```shell
python3 -c "import enchant
broker = enchant.Broker()
print(sorted(list(set([lang.split('_')[0] for lang in broker.list_languages()]))))"
```

### Optionally: Install FastText
```shell
cd $REPO_DIR
git clone https://github.com/facebookresearch/fastText.git
cd fastText
pip3 install .
wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
cd ..
```

### Optionally: Install local UDPipe
Install TensorFlow:

```shell
pip install tensorflow
```

Make sure a GPU is available:

```shell
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```

Install UDPipe:
```shell
cd $REPO_DIR
git clone https://github.com/ufal/udpipe
cd udpipe
git checkout udpipe-2
git clone https://github.com/ufal/wembedding_service
pip install .
```

Download the models:
```shell
curl --remote-name-all https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4804{/udpipe2-ud-2.10-220711.tar.gz}
tar -xvf udpipe2-ud-2.10-220711.tar.gz
rm udpipe2-ud-2.10-220711.tar.gz
```

I had to make one change to the code to make it work locally: change line 375 in udpipe2_server.py to:
```python
if not hasattr(socket, 'SO_REUSEPORT'):
    socket.SO_REUSEPORT = 15  # the value of SO_REUSEPORT on Linux
```

### Optionally: Word alignment
```shell
cd $REPO_DIR/vocabtest/bible/
mkdir dependencies
cd dependencies
git clone https://github.com/clab/fast_align
cd fast_align
mkdir build
cd build
cmake ..
make
```
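fast_align reads parallel text with one sentence pair per line, source and target separated by ` ||| `. A tiny helper to produce that format (the function name is hypothetical, not part of this repo):

```python
def to_fast_align_format(pairs):
    """Render (source, target) sentence pairs in fast_align's
    expected input format: 'source ||| target', one pair per line."""
    return "\n".join(f"{src} ||| {tgt}" for src, tgt in pairs)

# Example: one aligned German-English verse pair
print(to_fast_align_format([("im anfang war das wort", "in the beginning was the word")]))
# → im anfang war das wort ||| in the beginning was the word
```

A file in this format can then be aligned with `build/fast_align -i pairs.txt -d -o -v > forward.align`.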
### Optionally: Uromanize
```shell
cd $REPO_DIR/vocabtest/bible/dependencies/
git clone https://github.com/isi-nlp/uroman
```

## Supported languages

- Afrikaans 🇿🇦
- Arabic (many countries)
- Belarusian 🇧🇾
- Bulgarian 🇧🇬
- Catalan 🇪🇸
- Czech 🇨🇿
- Welsh 🇬🇧
- Danish 🇩🇰
- German 🇩🇪🇨🇭🇦🇹
- Greek 🇬🇷
- English (many countries)
- Spanish (many countries)
- Estonian 🇪🇪
- Basque 🇪🇸
- Persian 🇮🇷🇦🇫🇹🇯
- Finnish 🇫🇮
- Faroese 🇩🇰
- French (many countries)
- Irish 🇮🇪
- Gaelic (Scottish) 🏴󠁧󠁢󠁳󠁣󠁴󠁿
- Galician 🇪🇸
- Gothic (dead)
- Hebrew 🇮🇱
- Hindi 🇮🇳
- Croatian 🇭🇷
- Hungarian 🇭🇺
- Armenian 🇦🇲
- Western Armenian
- Indonesian 🇮🇩
- Icelandic 🇮🇸
- Italian 🇮🇹
- Japanese 🇯🇵
- Korean 🇰🇷
- Latin (dead)
- Lithuanian 🇱🇹
- Latvian 🇱🇻
- Marathi 🇮🇳
- Maltese 🇲🇹
- Dutch 🇳🇱🇧🇪
- Norwegian Nynorsk 🇳🇴
- Norwegian Bokmål 🇳🇴
- Polish 🇵🇱
- Portuguese 🇵🇹
- Romanian 🇷🇴
- Russian 🇷🇺
- Sanskrit 🇮🇳
- Northern Sami 🇳🇴
- Slovak 🇸🇰
- Slovenian 🇸🇮
- Serbian 🇷🇸
- Swedish 🇸🇪
- Tamil 🇮🇳🇱🇰🇸🇬
- Telugu 🇮🇳
- Turkish 🇹🇷
- Uyghur 🇨🇳
- Ukrainian 🇺🇦
- Urdu 🇵🇰🇮🇳
- Vietnamese 🇻🇳
- Wolof 🇸🇳
- Chinese 🇨🇳
## Creating your own vocabulary test

Creating your own vocabulary test is easy. All you need is a large amount of text in a language and implementations of two functions:

- `vocabtest.<your_dataset>.download`: downloads the dataset and stores it in a subfolder called `data`
- `vocabtest.<your_dataset>.filter`: filters and cleans the dataset and stores the following files in the `database` subfolder:
  - `{language_id}-filtered.csv`: a table with the `word` and `count` of all words that pass the filter
  - `{language_id}-clean.txt`: a text file with all cleaned words, used for training the compound word splitter
  - `{language_id}-all.txt`: a text file with all words occurring in the corpus, used to reject pseudowords that already appear in the corpus
You can now build your vocabulary test with:

```shell
vocabtest download <your_dataset> <language_id>
vocabtest filter <your_dataset> <language_id>
vocabtest create-pseudowords <your_dataset> <language_id>
vocabtest create-test <your_dataset> <language_id>
```

## Citation

```bibtex
@misc{vanrijn2023wikivocab,
  title={Around the world in 60 words: A generative vocabulary test for online research},
  author={Pol van Rijn and Yue Sun and Harin Lee and Raja Marjieh and Ilia Sucholutsky and Francesca Lanzarini and Elisabeth André and Nori Jacoby},
  year={2023},
  eprint={2302.01614},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2302.01614},
}
```
