
📖 VocabTest

Vocabulary tests for an open-ended number of languages

TL;DR: Vocabulary tests are a useful tool for assessing language proficiency. This repository lets you create vocabulary tests for any language with enough text. 📝 Test your vocabulary knowledge


(Example items from WikiVocab)

Setup

Clone the repository and install the requirements:

git clone https://github.com/polvanrijn/VocabTest
cd VocabTest
REPO_DIR=$(pwd)
python3.9 -m venv env  # set up a virtual environment; tested with Python 3.9.18 on macOS
source env/bin/activate
pip install -r requirements.txt
pip install -e .

Developer requirements

Optionally: Install dictionaries

Make sure you have either hunspell or myspell installed.

DIR_DICT=~/.config/enchant/hunspell  # if you use hunspell
DIR_DICT=~/.config/enchant/myspell   # if you use myspell
mkdir -p $DIR_DICT

Download the Libreoffice dictionaries:

cd $DIR_DICT
git clone https://github.com/LibreOffice/dictionaries
find dictionaries/ -type f -name "*.dic" -exec mv -i {} .  \;
find dictionaries/ -type f -name "*.aff" -exec mv -i {} .  \;
rm -Rf dictionaries/

Manually install missing dictionaries:

# Manually install dictionaries
function get_dictionary() {
  f="$(basename -- "$1")"
  wget "$1" --no-check-certificate
  unzip "$f" "*.dic" "*.aff"
  rm -f "$f"
}

# Urdu
get_dictionary https://versaweb.dl.sourceforge.net/project/aoo-extensions/2536/1/dict-ur.oxt

# Western Armenian
get_dictionary https://master.dl.sourceforge.net/project/aoo-extensions/4841/0/hy_am_western-1.0.oxt

# Galician
get_dictionary https://extensions.libreoffice.org/assets/downloads/z/corrector-18-07-para-galego.oxt

# Welsh
get_dictionary https://master.dl.sourceforge.net/project/aoo-extensions/1583/1/geiriadur-cy.oxt
mv dictionaries/* .
rm -Rf dictionaries/

# Belarusian
get_dictionary https://extensions.libreoffice.org/assets/downloads/z/dict-be-0-58.oxt

# Marathi
get_dictionary https://extensions.libreoffice.org/assets/downloads/73/1662621066/mr_IN-v8.oxt
mv dicts/* .
rm -Rf dicts/

Check all dictionaries are installed:

python3 -c "import enchant
broker = enchant.Broker()
print(sorted(list(set([lang.split('_')[0] for lang in broker.list_languages()]))))"
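Each dictionary is just a pair of files: an .aff affix file and a .dic word list whose first line is the entry count and whose entries may carry affix flags after a slash. As a minimal sketch (using an inline, hypothetical sample rather than a real file from $DIR_DICT), a .dic word list can be read like this:

```python
# Minimal sketch: parse a hunspell-style .dic word list into a set of words.
# The sample below is inline and hypothetical; real files live in $DIR_DICT.
sample_dic = """3
hello
world/S
test/AB
"""

lines = sample_dic.strip().splitlines()
# The first line is the entry count; affix flags follow a '/' and are dropped here.
words = {line.split("/")[0] for line in lines[1:]}
print(sorted(words))  # → ['hello', 'test', 'world']
```

enchant picks these files up automatically once they sit in $DIR_DICT, which is why the snippet above checks the broker's language list.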
Optionally: Install FastText

cd $REPO_DIR
git clone https://github.com/facebookresearch/fastText.git
cd fastText
pip3 install .
wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
cd ..

Optionally: Install local UDPipe

Install tensorflow:

pip install tensorflow

Make sure a GPU is available:

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

Install UDPipe:

cd $REPO_DIR
git clone https://github.com/ufal/udpipe
cd udpipe
git checkout udpipe-2
git clone https://github.com/ufal/wembedding_service
pip install .

Download the models:

curl --remote-name-all https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4804{/udpipe2-ud-2.10-220711.tar.gz}
tar -xvf udpipe2-ud-2.10-220711.tar.gz
rm udpipe2-ud-2.10-220711.tar.gz

I had to make one change to the code to make it work locally. Change line 375 in udpipe2_server.py to:

if not hasattr(socket, 'SO_REUSEPORT'):
    socket.SO_REUSEPORT = 15

Optionally: Word alignment
cd $REPO_DIR/vocabtest/bible/
mkdir dependencies
cd dependencies
git clone https://github.com/clab/fast_align
cd fast_align
mkdir build
cd build
cmake ..
make

Optionally: Uromanize

cd $REPO_DIR/vocabtest/bible/dependencies/
git clone https://github.com/isi-nlp/uroman

Tests

WikiVocab: Validated vocabulary test for 60 languages

  1. Afrikaans 🇿🇦
  2. Arabic (many countries)
  3. Belarusian 🇧🇾
  4. Bulgarian 🇧🇬
  5. Catalan 🇪🇸
  6. Czech 🇨🇿
  7. Welsh 🇬🇧
  8. Danish 🇩🇰
  9. German 🇩🇪🇨🇭🇦🇹
  10. Greek 🇬🇷
  11. English (many countries)
  12. Spanish (many countries)
  13. Estonian 🇪🇪
  14. Basque 🇪🇸
  15. Persian 🇮🇷🇦🇫🇹🇯
  16. Finnish 🇫🇮
  17. Faroese 🇩🇰
  18. French (many countries)
  19. Irish 🇮🇪
  20. Scottish Gaelic 🏴󠁧󠁢󠁳󠁣󠁴󠁿
  21. Galician 🇪🇸
  22. Gothic (dead)
  23. Hebrew 🇮🇱
  24. Hindi 🇮🇳
  25. Croatian 🇭🇷
  26. Hungarian 🇭🇺
  27. Armenian 🇦🇲
  28. Western Armenian
  29. Indonesian 🇮🇩
  30. Icelandic 🇮🇸
  31. Italian 🇮🇹
  32. Japanese 🇯🇵
  33. Korean 🇰🇷
  34. Latin (dead)
  35. Lithuanian 🇱🇹
  36. Latvian 🇱🇻
  37. Marathi 🇮🇳
  38. Maltese 🇲🇹
  39. Dutch 🇳🇱🇧🇪
  40. Norwegian Nynorsk 🇳🇴
  41. Norwegian Bokmål 🇳🇴
  42. Polish 🇵🇱
  43. Portuguese 🇵🇹
  44. Romanian 🇷🇴
  45. Russian 🇷🇺
  46. Sanskrit 🇮🇳
  47. Northern Sami 🇳🇴
  48. Slovak 🇸🇰
  49. Slovenian 🇸🇮
  50. Serbian 🇷🇸
  51. Swedish 🇸🇪
  52. Tamil 🇮🇳🇱🇰🇸🇬
  53. Telugu 🇮🇳
  54. Turkish 🇹🇷
  55. Uyghur 🇨🇳
  56. Ukrainian 🇺🇦
  57. Urdu 🇵🇰🇮🇳
  58. Vietnamese 🇻🇳
  59. Wolof 🇸🇳
  60. Chinese 🇨🇳

BibleVocab: Vocabulary test for more than 2000 languages

Create your own vocabulary test

Creating your own vocabulary test is easy. All you need is a large amount of text in the language; then implement two functions:

  • vocabtest.<your_dataset>.download: downloads the dataset and stores it in a subfolder called data
  • vocabtest.<your_dataset>.filter: filters and cleans the dataset and stores the following files in the database subfolder:
    • {language_id}-filtered.csv: a table with word and count columns for all words that pass the filter
    • {language_id}-clean.txt: a text file with all cleaned words, used to train the compound-word splitter
    • {language_id}-all.txt: a text file with all words occurring in the corpus, used to reject pseudowords that already exist in the corpus
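As an illustration of what the filter step produces, here is a hypothetical, stdlib-only sketch for an imaginary language id xx; the toy rule "keep words seen at least twice" stands in for the real, dataset-specific filtering:

```python
import collections
import csv
import pathlib
import re
import tempfile

# Hypothetical sketch of a vocabtest.<your_dataset>.filter step.
corpus = "The cat sat on the mat. The cat, again!"
tokens = re.findall(r"[A-Za-z]+", corpus.lower())
counts = collections.Counter(tokens)

# Toy filter rule: keep words seen at least twice.
filtered = {w: c for w, c in counts.items() if c >= 2}

out = pathlib.Path(tempfile.mkdtemp())
# word/count table of words passing the filter
with open(out / "xx-filtered.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["word", "count"])
    for w, c in sorted(filtered.items()):
        writer.writerow([w, c])
# cleaned words (compound-splitter training) and the full corpus vocabulary
(out / "xx-clean.txt").write_text("\n".join(sorted(filtered)))
(out / "xx-all.txt").write_text("\n".join(sorted(counts)))
print(sorted(filtered))  # → ['cat', 'the']
```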

You can now generate your vocabulary test with:

vocabtest download <your_dataset> <language_id>
vocabtest filter <your_dataset> <language_id>
vocabtest create-pseudowords <your_dataset> <language_id>
vocabtest create-test <your_dataset> <language_id>
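To illustrate the idea behind the create-pseudowords step (this is not the actual method used by VocabTest), the sketch below samples character transitions from a handful of real words and rejects any candidate that already occurs in the word list, mirroring the role of {language_id}-all.txt:

```python
import random

# Hypothetical illustration: build word-like pseudowords from character
# bigram transitions, then reject candidates that are real words.
real_words = {"water", "winter", "wander", "wonder", "waiter"}
random.seed(0)

# Collect character transitions, with ^ and $ as start/end markers.
transitions = {}
for word in real_words:
    padded = f"^{word}$"
    for a, b in zip(padded, padded[1:]):
        transitions.setdefault(a, []).append(b)

def make_pseudoword():
    ch, out = "^", []
    while True:
        ch = random.choice(transitions[ch])
        if ch == "$":
            return "".join(out)
        out.append(ch)

pseudowords = []
while len(pseudowords) < 3:
    cand = make_pseudoword()
    # Reject real words and very short candidates.
    if cand not in real_words and len(cand) >= 4:
        pseudowords.append(cand)
print(pseudowords)
```

In the real pipeline the rejection set is the full corpus vocabulary from {language_id}-all.txt rather than a five-word toy list.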

Citation

@misc{vanrijn2023wikivocab,
      title={Around the world in 60 words: A generative vocabulary test for online research}, 
      author={Pol van Rijn and Yue Sun and Harin Lee and Raja Marjieh and Ilia Sucholutsky and Francesca Lanzarini and Elisabeth André and Nori Jacoby},
      year={2023},
      eprint={2302.01614},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2302.01614}, 
}
