For the last two years, I’ve been working through a 10,000-word Russian vocabulary list ordered by frequency. I have a goal of finishing the list before the end of 2019. This requires not only stubborn persistence but an efficient process of collecting the information that goes onto my Anki flash cards.

My manual process has been to work from a Numbers spreadsheet. As I collect information about each word from several websites, I log it in this table.

numbers-sheet-ru.png

For each word, I do the following:

  1. From Open Russian I obtain an example sentence or two.
  2. From Wiktionary I obtain the definition, more example phrases, any particular grammatical information I need, and audio of the pronunciation if it is available. I also capture the URL from this site onto my flash card.
  3. From the Russian National Corpus I capture the frequency according to their listing in case I want to reorder my frequency list in the future.

This involves lots of cutting, pasting, and tab-switching, so I devised an automated approach to loading this information. The most complicated part was downloading the Russian pronunciation audio from Wiktionary, which I did with Python.

Downloading pronunciation files from Wiktionary

class WikiPage(object):
    """Wiktionary page - source for the extraction"""
    def __init__(self, ruWord):
        super(WikiPage, self).__init__()
        self.word = ruWord
        self.baseURL = u'http://en.wiktionary.org/wiki/'
        self.anchor = u'#Russian'

    def url(self):
        return self.baseURL + self.word + self.anchor

First, we initialize a WikiPage object by building the main page URL using the Russian word we want to capture. We can capture the page source and look for the direct link to the audio file that we want:

def page(self):
    return requests.get(self.url())

def audioLink(self):
    searchObj = re.search("commons(\\/.+\\/.+\\/Ru-.+\\.ogg)", self.page().text, re.M)
    return searchObj.group(1)

def fullAudioLink(self):
    # prepend the Wikimedia upload host to the captured Commons path
    return u'https://upload.wikimedia.org/wikipedia/commons' + self.audioLink()

The function audioLink returns a link to the .ogg file that we want to download. Now we just have to download the file:

def downloadAudio(self):
    path = join(expanduser("~"), 'Downloads', self.word + '.ogg')
    try:
        mp3file = urllib2.urlopen(self.fullAudioLink())
    except AttributeError:
        print "There appears to be no audio."
        notify("No audio", "Wiktionary has no pronunciation", "Pronunciation is not available for download.", sound=True)
    else:
        with open(path, 'wb') as output:
            output.write(mp3file.read())

Now, to kick off the process, we just have to get the word from the macOS pasteboard, instantiate a WikiPage object, and call downloadAudio on it:

word = xerox.paste().encode('utf-8')
wikipage = WikiPage(word)
if DEBUG:
    print wikipage.url()
    print wikipage.fullAudioLink()
wikipage.downloadAudio()

If you’d like to see the entire Python script, the gist is here.

Automating Google Chrome

Next we want to automate Chrome to pull up the word in the reference websites. We’ll do this in AppleScript.

set searchTerm to the clipboard as text
set openRussianURL to "https://en.openrussian.org/ru/" & searchTerm
set wiktionaryURL to "https://en.wiktionary.org/wiki/" & searchTerm & "#Russian"

There we grab the word off the clipboard and build the URLs for both sites. Next we’ll look for a tab that contains the Russian National Corpus site and execute a page search for our target word. That way I can easily grab the word frequency from the page.

tell application "Google Chrome" to activate

-- initiate the word find process in dict.ruslang.ru
tell application "Google Chrome"
    -- find the tab with the frequency list
    set i to 0
    repeat with t in (every tab of window 1)
        set i to i + 1
        set searchURLText to (URL of t) as text
        if searchURLText begins with "http://dict.ruslang.ru/" then
            set active tab index of window 1 to i
            exit repeat
        end if
    end repeat
end tell

delay 1

tell application "System Events"
    tell process "Google Chrome"
        keystroke "f" using command down
        delay 0.5
        keystroke "v" using command down -- lowercase v so we send plain cmd-V (paste)
        delay 0.5
        key code 36 -- return key
    end tell
end tell

Then we need to load the word definition pages using the URLs that we built earlier:

-- load word definitions
tell application "Google Chrome"
    activate
    set i to 0
    set tabList to every tab of window 1
    repeat with theTab in tabList
        set i to i + 1
        set textURL to (URL of theTab) as text
        -- load the word in open russian
        if textURL begins with "https://en.openrussian.org" then
            set URL of theTab to openRussianURL
        end if
        -- load the word in wiktionary
        if textURL begins with "https://en.wiktionary.org" then
            set URL of theTab to wiktionaryURL
            -- make the wiktionary tab the active tab
            set active tab index of window 1 to i
        end if
    end repeat
end tell

Finally, using do shell script, we can fire off the Python script to download the audio. (In practice, I have the AppleScript do that first to allow time to process the audio as I’ve described previously.) Then I create a Quicksilver trigger to start the entire process from a single keystroke.

Granted, I have a very specific use case here, but hopefully you’ve been able to glean something useful about process automation of Chrome and using Python to download pronunciation files from Wiktionary. Cheers.

I wrote a piece previously about using JavaScript in Anki cards. Although I haven’t found many uses for this idea, it does come up from time to time, including the recent use case I’m writing about now.

After downloading a popular French frequency list deck for my daughter to use, I noticed that it omits the gender of nouns in the French prompt. In school, I was always taught to memorize the gender along with the noun. For example, when you memorize the word for law, “loi”, you should memorize it with either the definite article “la” or the indefinite article “une” so that the feminine gender of the noun is inseparable from the noun itself. But this deck has only the bare noun prompt, and I was afraid that my daughter would fail to memorize each noun’s gender. JavaScript to the rescue.

Since the gender is encoded in a field, we can capitalize on that to insert the right article. My preference is to use the definite articles “le” or “la” where possible, but it gets increasingly complex from there. Nouns that begin with a vowel, such as “avocat”, require “l’avocat”, which obscures the gender. In that case, I’d prefer the indefinite article: “un avocat”. Then there’s the “h”. Most words beginning with “h” behave like those beginning with vowels, but some words have an h aspiré; with those, we keep the full definite article without the apostrophe.

So we start with a couple of easy preliminaries, such as detecting vowels:

// returns true if the character
// is a vowel
function vowelTest(s) {
    return (/^[aeiou]$/i).test(s);
}

Now we turn our attention to whether a word would need an apostrophe with the definite article. I’m not actually going to use the apostrophe; instead, we’ll fall back to the indefinite article “un/une” in this case.

// returns true if the word would need
// an apostrophe if used with the
// definite article
function needsApostrophe(str) {
    if (str[0] == 'h') {
        // h aspiré words that do not need an apostrophe
        var aspire = ["hache","hachisch","haddock","haïku",
                      "haillon","haine","hall",
                      "halo","halte","hamac",
                      "hamburger","hameau","hammam",
                      "hampe","hamster","hanche",
                      "hand-ball","handicap","hangar",
                      "harde","hareng","hargne",
                      "haricot","harpail","harpon",
                      "hasard","hauteur","havre","hère",
                      "hérisson","hernie","héron",
                      "héros","herse","hêtre",
                      "hiatus","hibou","hic",
                      "hickory","hiérarchie","hiéroglyphe",
                      "hobby","Hollande","homard",
                      "Hongrie","honte","hoquet",
                      "houe","houle","hooligan",
                      "houppe","housse","houx",
                      "houblot","huche","huguenot"
                     ];
        return (aspire.indexOf(str) == -1);
    }
    return vowelTest(str[0]);
}

Now we can wrap this up in a function that adds an article, either definite or indefinite, to the noun:

// adds either a definite or indefinite article
function addArticle(str, genderstr) {
    if (needsApostrophe(str)) {
        return (genderstr == "nm") ? "un " + str : "une " + str;
    }
    return (genderstr == "nm") ? "le " + str : "la " + str;
}
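The same logic is easy to sanity-check outside of Anki. Here is a rough Python port; note that only a small subset of the h aspiré list is included for brevity, and the function names are mine, not part of the card template:

```python
import re

# Subset of the h aspire list from the card template above.
ASPIRE = {"haricot", "hasard", "hibou", "honte", "homard"}

def vowel_test(ch):
    """True if the character is a vowel."""
    return re.match(r'^[aeiou]$', ch, re.I) is not None

def needs_apostrophe(word):
    """True if the word would elide the definite article (l')."""
    if word[0] == 'h':
        return word not in ASPIRE   # h aspire keeps le/la
    return vowel_test(word[0])

def add_article(word, gender):
    """Prefix un/une when elision would hide the gender, else le/la."""
    if needs_apostrophe(word):
        return ("un " if gender == "nm" else "une ") + word
    return ("le " if gender == "nm" else "la ") + word
```

A quick check: add_article("avocat", "nm") yields "un avocat" because elision would hide the gender, while add_article("haricot", "nm") yields "le haricot" because of the h aspiré.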

The first step is to make sure that the part of speech field is visible to the script. We do this by inserting it into the card template.

<span id="pos">{{Part of Speech}}</span>

Don’t worry, we’ll hide it in a minute.

Then we can obtain the contents of the field and add the gender-specific article accordingly.

var content = document.getElementById("pos").innerHTML;
var fword = document.getElementsByClassName("frenchwordless")[0].innerHTML;
var artword = addArticle(fword, content);
document.getElementsByClassName("frenchwordless")[0].innerHTML = artword;

And we can hide the gender sentinel field:

document.getElementById("pos").style.visibility = "hidden";

Ideally, French Anki decks would be constructed in such a way that the gender is embedded in the noun to be memorized, but with a little creative use of JavaScript, we can retool it on-the-fly.

ghgraph.jpg

Spurious sensor data can wreak havoc in an otherwise finely-tuned home automation system. I use temperature data from an Aeotech Multisensor 6 to monitor the environment in our greenhouse. Living in Canada, I cannot rely solely on passive systems to maintain the temperature, particularly at night. So, using the temperature and humidity measurements transmitted back to the controller over Z-wave, I control devices inside the greenhouse that heat and humidify the environment.

But spurious temperature and humidity data mean that I often falsely trigger the heating and humidification devices. After dealing with this for several weeks, I came up with a workable solution that can be applied to other sensor data. It’s important to note that the solution I developed uses time-averaging of the data. If it’s important to react to the data quickly, then the averaging window needs to be shortened or you may need to look for a different solution.

I started by trying to ascertain exactly what the spurious temperature data were. It turns out that the spurious data points were usually 0’s, but occasionally odd non-zero data would crop up. In all cases, the values were lower than the actual value, and always by a lot (40 or more degrees F difference).

In most cases with Indigo, for simplicity, we trigger events based on absolute values. When spurious data are present, for whatever reason, false triggers will result. My approach takes advantage of the fact that Indigo keeps a database of sensor data. By default, it logs these data points to a SQLite database at /Library/Application Support/Perceptive Automation/Indigo 7/Logs/indigo_history.sqlite. I used Base, a GUI SQLite client for macOS, to explore the structure a bit. Each device has a table named device_history_xxxxxxxx; you simply need to know the device identifier, which you can easily find in the Indigo application. Exploring the table, you can see how the data are stored.
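If you’d rather explore the database from a script than a GUI client, a minimal sketch along these lines lists the per-device history tables. The path and table-naming pattern are as described above; device_history_tables is just an illustrative helper name:

```python
import sqlite3

def device_history_tables(conn):
    """List Indigo's per-device history tables in the log database."""
    cur = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name LIKE 'device_history_%'")
    return [row[0] for row in cur.fetchall()]

# Against the real database:
# conn = sqlite3.connect('/Library/Application Support/'
#                        'Perceptive Automation/Indigo 7/Logs/indigo_history.sqlite')
# print(device_history_tables(conn))
```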

base.jpg

To employ a strategy of time-averaging and filtering the data, I decided to pull the last 10 values from the SQLite database. As I get data about every 30 seconds from the sensor, my averaging window is about 5 minutes. It turns out this is quite easy:

import sqlite3

SQLITE_PATH = ('/Library/Application Support/Perceptive Automation/'
               'Indigo 7/Logs/indigo_history.sqlite')

SQLITE_TN = 'device_history_114161618'
SQLITE_TN_ALIAS = 'gh'

conn = sqlite3.connect(SQLITE_PATH)
c = conn.cursor()
SQL = "SELECT gh.sensorvalue FROM {tn} AS {alias} " \
      "ORDER BY ts DESC LIMIT 10".format(tn=SQLITE_TN, alias=SQLITE_TN_ALIAS)

c.execute(SQL)
all_rows = c.fetchall()

Now all_rows contains a list of single-item tuples. In the next step, I filter out obviously spurious values and compact the tuples into a simple list of readings:

tempsF = filter(lambda a: a > 1, [i[0] for i in all_rows])

But some spurious data remain. Remember that many of the errant values are 0.0, but some are just lower than the actual values. To catch these, I create a list of the differences from one value to the next and search for significant deviations (5°F in this case). Having found which value creates the large difference, I exclude it from the list.[1]

diffs = [abs(x[1] - x[0]) for x in zip(tempsF[1:], tempsF[:-1])]
idx = 0
for diff in diffs:
    if diff > 5:
        break
    else:
        idx = idx + 1
filtTempsF = tempsF[:idx + 1] + tempsF[idx + 2:]

Finally, since it’s a moving average, I need to actually average the data:

avgTempsF = reduce(lambda x, y: x + y, filtTempsF) / len(filtTempsF)

In summary, this gives me a filtered, time-averaged dataset that excludes spurious data. For applications that are very time-sensitive, this approach won’t work as is. But for most environmental controls, it’s a workable solution to identifying and filtering wonky sensor data.

For reference, the entire script follows:

# Update the greenhouse temperature in degrees C.
# The sensor reports values in F, so we update the
# value whenever the primary data change.

import sqlite3

# device and variable definitions
IDX_CURRENT_TEMP = 1822850463
IDX_FORMATTED = 1778207310
DEV_GH_TEMP = 114161618
SQLITE_PATH = '/Library/Application Support/Perceptive Automation/Indigo 7/Logs/indigo_history.sqlite'
SQLITE_TN = 'device_history_114161618'
SQLITE_TN_ALIAS = 'gh'

DEBUG_GH = True

def F2C(ftemp):
    """Convert degrees F to degrees C, rounded to one decimal."""
    return round((ftemp - 32) / 1.8, 1)

def CDeviceTemp(deviceID):
    """Current temperature of an Indigo device in degrees C."""
    device = indigo.devices[deviceID]
    tempF = device.sensorValue
    return F2C(tempF)

def movingAverageF():
    """Filtered moving average of the last 10 readings, in degrees F."""
    conn = sqlite3.connect(SQLITE_PATH)
    c = conn.cursor()
    SQL = "SELECT gh.sensorvalue FROM {tn} AS {alias} ORDER BY ts DESC LIMIT 10".format(tn=SQLITE_TN, alias=SQLITE_TN_ALIAS)
    c.execute(SQL)
    all_rows = c.fetchall()
    # drop the obviously spurious (near-zero) readings
    tempsF = filter(lambda a: a > 1, [i[0] for i in all_rows])
    # locate a single large jump and exclude the offending value
    diffs = [abs(x[1] - x[0]) for x in zip(tempsF[1:], tempsF[:-1])]
    idx = 0
    for diff in diffs:
        if diff > 5:
            break
        else:
            idx = idx + 1
    filtTempsF = tempsF[:idx + 1] + tempsF[idx + 2:]
    avgTempsF = reduce(lambda x, y: x + y, filtTempsF) / len(filtTempsF)
    return avgTempsF

def movingAverageC():
    return F2C(movingAverageF())

# compute moving average
avgC = F2C(movingAverageF())

# current greenhouse temperature in degrees C
ghTempC = F2C(indigo.devices[DEV_GH_TEMP].sensorValue)
indigo.server.log("GH temp: raw={0}C, filtered moving avg={1}C".format(ghTempC, avgC))

# update the server variables (temperature in degrees C and a formatted string)
indigo.variable.updateValue(IDX_CURRENT_TEMP, value=unicode(avgC))
indigo.variable.updateValue(IDX_FORMATTED, value="{0}°C".format(avgC))

  1. As I was preparing this post, I realized that this approach misses the possibility of a dataset having more than one spurious data point. Empirically, I did not notice any occurrence of that, but it's possible. I will have to account for it in the future.
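One way to cover the multiple-outlier case the footnote mentions is to filter on deviation from the median rather than on consecutive differences, since the median stays stable even when several readings are spurious. This is a possible refinement, not what the script above does; the 5°F threshold matches the one used earlier:

```python
def median_filter(values, threshold=5.0):
    """Drop readings more than `threshold` degrees from the sample median.

    Robust to several spurious points at once, unlike the
    consecutive-difference approach.
    """
    ordered = sorted(values)
    n = len(ordered)
    median = (ordered[n // 2] if n % 2
              else (ordered[n // 2 - 1] + ordered[n // 2]) / 2.0)
    return [v for v in values if abs(v - median) <= threshold]
```

For example, median_filter([68.0, 68.5, 0.0, 67.9, 0.0, 68.2]) drops both zeros, which the single-exclusion approach above would not.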

With Trump, the usual advice to “follow the money” doesn’t work because Congress refuses to force him to disclose his conflicts of interest. As enormous and material as those conflicts must be, I’m just going to focus on what I can see with my own eyes: the man’s apparent intent.

In his public life, Donald Trump has never done anything that did not personally and directly benefit him. Most of us, as we go through life, assemble a collection of acts that are variously self-serving and other-serving. This is the way of life. Normal life. With Trump, not so. Even his meager philanthropic acts are tainted with controversy. The man simply cannot act in a sacrificial way. He is incurable.[1]

As a corollary, when considering his dismissal of FBI Director Comey yesterday, I plan to apply that principle until a special prosecutor is appointed. Since Trump acts only in his own personal best interest, I’m going to assume that in firing Mr. Comey, he is personally benefitting from it.

trumpclassact.jpg

Since the evidence suggests that Trump’s concern was over the Russia investigation, it’s safest to presume the firing was about the Russia investigation, notwithstanding the feeble excuses of his staff, who were caught off-guard by the event.

We would all do well to re-read Masha Gessen’s piece in the New York Review of Books, “Autocracy: Rules for Survival.” Her Rule #1: “Believe the autocrat. He means what he says.” remains applicable. If Trump is fuming about the Russia investigation, he probably fired the very man investigating his Administration’s ties to Russia because of it.


  1. In a campaign event in Fort Dodge, Iowa on November 12, 2015, Trump claimed that rival Ben Carson was "pathological" and that "...if you're pathological, there's no cure for that, folks, okay? There's no cure for that." Since Trump's own psychopathology is widely questioned, one wonders if he, too, is incurable. Given that narcissistic personality disorder is almost certainly among the potential diagnoses, he probably is incurable.

You're wrong, Donnie

In an effort to strip protesters of their legitimacy, Trump and Fox News claim that protesters are simply there because they’re paid by powerful oppositional interests. Never mind that Trump has no evidence for his claim; he has no evidence for practically anything that emerges from his loud mouth. What is more interesting to me is that if money delegitimizes authenticity then presumably we can use this effect to come to additional conclusions.

If, as Trump claims, money discredits and taints the political process, then:

  1. Trump’s own motives, as an extremely wealthy man, should be suspect.
  2. For that matter, wealthy family members, such as Ivanka Trump™ and Jared Kushner who have been appointed to high positions in the Administration but who have no political experience should be viewed with skepticism.
  3. The Republicans pushed the Citizens United decision allowing unlimited amounts of money in politics. If money is good for the political process, as they presumably believe, then the mythical existence of paid protest should give them no pause. It’s just free market capitalism in action, right?
  4. Considering the degree to which the NRA, the brothers Koch and other deep-pocket interests fund the campaigns of GOP politicians, maybe we should just consider all Republican politicians paid protesters.

As Simon Maloy points out in Salon:

The irony here is that there’s one person we know for sure has paid people to show up and voice a prefabricated political message: Donald Trump. In the summer of 2015, Trump arranged for actors to show up at Trump Tower and cheer and wave signs as he announced his candidacy for president, offering them $50 for their services. And, in typical Trump fashion, he tried to stiff the agency that set up his Potemkin campaign launch.

Classic.

And, no; no one paid me to write this. All protest is my own.

Someday, when I have time to burn, I’m going to write a Twitter bot that takes all of Trump’s vacuous tweets and translates them into Russian. It’ll look like this:

trumptapp.jpg

There’s something ludicrous about the idea of Trump, who is distractible, impatient, and incurious, being able to learn Russian, an incredibly difficult language.

marking time,
eyes glazed, pupils constricted
to the head of a pin
from facing the blue white sterile light
for too long
a zombie tribe
numbering in the millions
if not more
waits.

this throng, agitated
in a subdued anesthetized
way,
crowns one of its own
a clown of sorts
knowing little of the past
less of the present
and practically nothing
of the future.
“why not? it could be worse.”

in a strange unreality
a vaudeville show becomes
its own rehearsal,
a dreamish state from which
only an atomic flash
can awaken a person.

canadianflag.jpg

On January 1, 2016 we packed up all our earthly goods and headed south to Canada. (Yes, it’s true. When you live in Minnesota, it’s possible to move south to Canada. Look at the map!) Having lived here for a little over a year, here are some thoughts about living here, in no particular order:

  1. “Sorry” is more of a greeting than just an apology.
  2. Canadians really are polite; but put them behind the wheel of a car and all bets are off.
  3. Universal healthcare works. Americans love to go on and on about socialized medicine; but I’m here to tell you: it works.
  4. Bumper stickers are rare here.
  5. People don’t really talk politics. Well, they talk about U.S. politics.
  6. Left turn arrows on traffic lights are rare. It makes for interesting moments when the light changes.
  7. The electric utility is called “hydro”, which given the Greek origin of the word makes little sense until you realize that it stands for “hydroelectric.”
  8. Youth music is well-supported - both through private and public funding.
  9. State-church separation is fuzzier. For example, the Catholic school system is taxpayer-funded, but only the Catholic schools. It has something to do with the Canadian Charter (a.k.a. the Constitution); it was apparently some sort of historical compromise in the 1800s.
  10. Don’t order iced tea in Canada. It’s way too sweet.
  11. As a practical matter, you can’t be elected Prime Minister unless you speak both English and French fluently. This is a really good thing.[1]
  12. Speaking of politics, campaigns are time-limited to 6 weeks before an election. How cool is that?
  13. Poutine sounds horrible, but it’s actually pretty good.

  1. How many languages does Donald Trump speak fluently, for example?

I’ve written previously about extracting and processing mp3 files from web pages. The use case that I described, obtaining Russian word pronunciations for Anki cards, is basically the same, although I’m now obtaining many of my words from Forvo. However, Forvo doesn’t seem to apply any dynamic range processing or normalization to its audio files. While many of the pronunciation mp3s are excellent as-is, some need post-processing, chiefly because the amplitude is too low. Being lazy by nature, I set out to find a way of improving the audio quality automatically before inserting the mp3 file into my new vocabulary cards.

As before, the workflow depends heavily on Hazel to identify and process files coming out of Forvo. The download button on their website sends the mp3 files to the Downloads directory. The first rule in the workflow just grabs downloaded mp3 files and moves them to ~/Documents/mp3 so that I can work on them directly there.

hazel01.png

Another Hazel rule renames the verbosely-titled files to just the single Russian word being pronounced. It’s just neater that way.

hazel02.png
rename 's/(pronunciation_ru_)(.*)/$2/' *.mp3

This uses the convenient rename command that you can obtain via Homebrew.
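If you’d rather not install anything, a rough Python equivalent of that rename rule might look like this. The helper name stripped_name is mine, and the regex mirrors the pattern above:

```python
import glob
import os
import re

def stripped_name(filename):
    """Drop the verbose 'pronunciation_ru_' prefix, as the rename rule does."""
    return re.sub(r'^pronunciation_ru_(.*)$', r'\1', filename)

if __name__ == '__main__':
    # rename every mp3 in the current directory
    for path in glob.glob('*.mp3'):
        new_name = stripped_name(path)
        if new_name != path:
            os.rename(path, new_name)
```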

The final rule grabs the newly renamed mp3 file and performs a series of audio processing steps:

ffmpeg -i "$1" tmp.wav;
sox tmp.wav temp_out.wav norm gain compand 0.02,0.20 5:-60,-40,-10 -5 -90 0.1;
ffmpeg -i temp_out.wav -codec:a libmp3lame -qscale:a 2 tmp.mp3;
lame --mp3input -b 64 --resample 22.50 tmp.mp3 tmp;
mv tmp "$1";
rm tmp.mp3;
rm tmp.wav;
rm temp_out.wav

The first line, ffmpeg -i "$1" tmp.wav, simply writes a temporary .wav file that we can process using sox. The second line invokes sox with options that normalize and improve the dynamic range of the audio. Then we use ffmpeg to convert the .wav file back to .mp3, use lame to compress and resample it, and clean up the temporary files.
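The same pipeline can also be driven from Python if you prefer one script to a shell rule. This sketch only builds the command lists, with the flags copied from the rule above, and runs them in order; processing_steps is an illustrative helper, and ffmpeg, sox, and lame are assumed to be on your PATH:

```python
import subprocess

def processing_steps(mp3_path):
    """Build the ffmpeg/sox/lame commands used by the Hazel rule."""
    return [
        ['ffmpeg', '-i', mp3_path, 'tmp.wav'],
        ['sox', 'tmp.wav', 'temp_out.wav', 'norm', 'gain', 'compand',
         '0.02,0.20', '5:-60,-40,-10', '-5', '-90', '0.1'],
        ['ffmpeg', '-i', 'temp_out.wav', '-codec:a', 'libmp3lame',
         '-qscale:a', '2', 'tmp.mp3'],
        ['lame', '--mp3input', '-b', '64', '--resample', '22.50',
         'tmp.mp3', 'tmp'],
    ]

def process(mp3_path):
    """Run each step, raising if any tool exits non-zero."""
    for cmd in processing_steps(mp3_path):
        subprocess.check_call(cmd)
```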

Now I have excellent normalized audio for my cards, with no work for me!

See also: