(N.B. 2018-09-16 A change in the URL scheme for the audio content at Wiktionary required me to change the regular expression that I use to capture the source file URL. See comments in the relevant section below.)
For the last two years, I’ve been working through a 10,000-word Russian vocabulary ordered by frequency. I have a goal of finishing the list before the end of 2019. This requires not only stubborn persistence but also an efficient process for collecting the information that goes onto my Anki flash cards.
My manual process has been to work from a Numbers spreadsheet. As I collect information about each word from several websites, I log it in this table.
For each word, I do the following:
- From Open Russian I obtain an example sentence or two.
- From Wiktionary I obtain the definition, more example phrases, any particular grammatical information I need, and audio of the pronunciation if it is available. I also capture the URL from this site onto my flash card.
- From the Russian National Corpus I capture the frequency according to their listing in case I want to reorder my frequency list in the future.
This involves lots of cutting, pasting and tab-switching, so I devised an automated approach to loading up this information. The most complicated part was downloading the Russian pronunciation from Wiktionary. I did this with Python.
Downloading pronunciation files from Wiktionary
First, we initialize a WikiPage object by building the main page URL using the Russian word we want to capture. We can capture the page source and look for the direct link to the audio file that we want:
Edit 2018-09-16: Note that the regular expression above should now be
Edit 2018-09-17: Still not right. Try this
audioLink returns a link to the .ogg file that we want to download. Now we just have to download the file:
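As a rough illustration of the two steps described above (finding the audio link in the page source, then fetching the file), here is a minimal Python 3 sketch. It is not the code from the gist: the function names, the choice of the English-language Wiktionary, the User-Agent string, and in particular the regular expression are all assumptions, and the pattern will likely need adjusting whenever Wiktionary changes its markup.

```python
import re
import urllib.request
from typing import Optional
from urllib.parse import quote

# Assumption: the English-language Wiktionary; the post does not say which edition.
WIKI_BASE = "https://en.wiktionary.org/wiki/"

# Assumption: the audio file lives on upload.wikimedia.org and the link ends in
# ".ogg" immediately before the closing quote of an HTML attribute. Links to
# transcoded variants (e.g. "...ogg/...ogg.mp3") are deliberately not matched.
AUDIO_PATTERN = re.compile(
    r"""(?:https?:)?//upload\.wikimedia\.org/[^"'\s<>]+\.ogg(?=["'])"""
)


def build_page_url(word: str) -> str:
    """Build the entry URL, percent-encoding the Cyrillic characters."""
    return WIKI_BASE + quote(word)


def find_audio_link(html: str) -> Optional[str]:
    """Return the first .ogg link found in the page source, or None."""
    match = AUDIO_PATTERN.search(html)
    if match is None:
        return None
    link = match.group(0)
    # Scheme-relative links ("//upload...") need an explicit scheme.
    return "https:" + link if link.startswith("//") else link


def fetch(url: str) -> bytes:
    """Fetch a URL, sending a descriptive User-Agent header."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "vocab-audio-sketch/0.1"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def download_audio(word: str, destination: str) -> bool:
    """Save the pronunciation file for `word`; return False if none is found."""
    html = fetch(build_page_url(word)).decode("utf-8", errors="replace")
    link = find_audio_link(html)
    if link is None:
        return False
    with open(destination, "wb") as out:
        # Encode any raw non-ASCII characters, leaving existing %XX escapes alone.
        out.write(fetch(quote(link, safe=":/%")))
    return True
```

Keeping the link extraction in its own function means the pattern can be checked against a saved copy of the page source without touching the network, which helps when the URL scheme changes again.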
Now, to kick off the process, we just have to get the word from the macOS pasteboard, instantiate a WikiPage object and call downloadAudio on it:
word = xerox.paste().encode('utf-8')
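The line above is written for Python 2, where the pasted text has to be encoded by hand. Under Python 3, one possible alternative is to read the pasteboard with the pbpaste command that ships with macOS instead of going through xerox. The helper below is only a sketch: read_pasteboard is a hypothetical name, and it assumes pbpaste emits UTF-8.

```python
import subprocess
from typing import Sequence


def read_pasteboard(command: Sequence[str] = ("pbpaste",)) -> str:
    """Return the pasteboard text with surrounding whitespace removed.

    `command` defaults to macOS's pbpaste; it is a parameter only so that the
    helper can be exercised on other systems with a stand-in command.
    """
    result = subprocess.run(list(command), capture_output=True, check=True)
    # Assumption: the command writes UTF-8 bytes to standard output.
    return result.stdout.decode("utf-8").strip()
```

The string it returns could then be handed to WikiPage in whatever form the constructor in the gist expects.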
If you’d like to see the entire Python script, the gist is here.
Automating Google Chrome
Next we want to automate Chrome to pull up the word in the reference websites. We’ll do this in AppleScript.
set searchTerm to the clipboard as text
There we grab the word off the clipboard and build the URL for both sites. Next we’ll look for a tab that contains the Russian National Corpus site and execute a page search for our target word. That way I can easily grab the word frequency from the page.
tell application "Google Chrome" to activate
Then we need to load the word definition pages using the URLs that we built earlier:
-- load word definitions
Using do shell script, we can fire off the Python script to download the audio. Actually, I have the AppleScript do that first to allow time to process the audio, as I’ve described previously. Finally, I created a Quicksilver trigger to start the entire process from a single keystroke.
Granted, I have a very specific use case here, but hopefully you’ve been able to glean something useful about process automation of Chrome and using Python to download pronunciation files from Wiktionary. Cheers.