The Russian language has a vast and nuanced vocabulary. One approach to learning the vocabulary is to approach it in frequency order. The Nicholas Brown book seems dated and the frequency ordering methodology is not clear to me. Some words seem to be clustered by the beginning letter, which seems statistically unlikely. However, it’s a convenient list and I’m slowly building a table that cross-correlates the Nicholas Brown list with the methodologically-superior Russian National Corpus. To do that I harvested the data from the Corpus and built a Python application to search the database and report the rank and frequency data from it.
Creating a sqlite3 version of the Russian National Corpus
There is a CSV version of the Corpus, but the data is not useful for ordering in a meaningful way. Instead, I took the rank ordered tabular data from the page Частотный список лемм (Frequency list of lemmas) and simply pasted it into a Numbers spreadsheet. Since Numbers is extremely slow even on a fast-performing machine, it beachballed for nearly a minute during the paste operation. After that, I exported it as CSV. To get the CSV file into a sqlite3 database, I created a new table with the following schema:
After mapping the column names to those in the CSV, the import was simple.
Accessing the sqlite3 version of the corpus using Python
Next I wrote a little Python application to access the data and return the rank or frequency of any Russian word. The only trick is that the Russian letter ë is rendered as e in the database; so any word containing ë must be altered before the search. Regular expressions to the rescue! To use the application, just launch it with
-h help flag and you’ll see the calling format.
And a little AppleScript
Finally, I wrote an AppleScript wrapper that I can launch with a Quicksilver keystroke trigger. The wrapper takes the word off the clipboard and calls the Python app above, replacing the contents of the clipboard with the rank order of the word. For a little fun, it speaks the rank order number in Russian! Here’s the code for the AppleScript wrapper:
If you want a pre-built sqlite3 version of the Russian National Corpus, here it is.