Scraping Russian word definitions from Wiktionary: utility for Anki

While my Russian Anki deck contains around 27,000 cards, I’m always making more. (There are a lot of words in the Russian language!) Over the years, I’ve become more and more efficient with card production, but one of the missing pieces was finding a code-readable source of word definitions. There’s no shortage of dictionary sites, but scraping data from any site is complicated by the ways in which front-end developers spread the semantic content across multiple HTML tags arranged in deep and cryptic hierarchies. Yes, we can cut-and-paste, but my goal is to automate quality card production almost completely. This is a quick post on a method for scraping word definitions from Wiktionary.

The project relies on the wiktionaryparser module available for Python. Although it’s not feature-complete, it’s pretty good. With a little extra processing, it can do a lot of the heavy lifting of extracting word definitions.

#!/usr/bin/env python3

from wiktionaryparser import WiktionaryParser
import re

parser = WiktionaryParser()
parser.set_default_language('russian')
wp = parser.fetch('картонный')
def_list = []
for parsed in wp:
    for definition in parsed['definitions']:
        for text in definition['text']:
            # keep only lines with no residual Cyrillic (drops headword, pronunciation, etc.)
            if not re.search('[а-яА-Я]', text):
                def_list.append(text)
print(', '.join(def_list))

The code is largely self-explanatory, but I would just point out that the regex check re.search('[а-яА-Я]', text) is there to exclude any line that contains residual Cyrillic characters. Out of the box, the WiktionaryParser module seems to capture the headword, IPA pronunciation, and so on, so we need a way of excluding all of that before we compress the definition lines into a single string of comma-delimited text.
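If you want to fold this into a card-production pipeline, the same logic wraps naturally into a function. Here’s a minimal sketch along the lines of the script above; the function name get_en_definitions is just for illustration, not something provided by wiktionaryparser:

#!/usr/bin/env python3

import re
from wiktionaryparser import WiktionaryParser

parser = WiktionaryParser()
parser.set_default_language('russian')

def get_en_definitions(word):
    """Return the English definition lines for a Russian word as comma-delimited text."""
    def_list = []
    for parsed in parser.fetch(word):
        for definition in parsed['definitions']:
            for text in definition['text']:
                # drop lines that still contain Cyrillic (headword, inflected forms, etc.)
                if not re.search('[а-яА-Я]', text):
                    def_list.append(text)
    return ', '.join(def_list)

print(get_en_definitions('картонный'))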

Encoding of the Cyrillic letter й - a UTF-8 gotcha

In the process of writing and maintaining a service that checks Russian word frequencies, I noticed a peculiar phenomenon: certain words could not be located in an SQLite database that I knew contained them. For example, a query for the word английский consistently failed, whereas other words would succeed. Eventually the commonality among the failures became obvious: they all contained the letter й, which led me down a rabbit hole of character encoding and this specific case where it can go astray.

Background on Unicode support for Cyrillic characters

Cyrillic characters are found in the U+0400–U+04FF block of the Unicode table.

The Russian letter й is related to its cousin и but is regarded as a separate letter. In other words, it’s not an и plus a combining diacritical mark; it has its own entry in the Unicode table, U+0439 (CYRILLIC SMALL LETTER SHORT I).

A brief digression on Unicode vs UTF-8

True confession: I’ve harboured some confusion between the concepts of Unicode and UTF-8. Solving this problem meant digging into the distinction a little deeper. First, a really basic distinction:

  • Unicode is a character set that aims to encompass all possible characters.
  • UTF-8 is a flexible way of encoding Unicode characters as bytes in strings; the quick check below makes the difference concrete.
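A quick way to see the difference in Python: ord() gives a character’s Unicode code point (its entry in the table), while encode() gives the bytes that UTF-8 actually uses to store it.

>>> ord('й')               # the Unicode code point
1081
>>> hex(ord('й'))
'0x439'
>>> 'й'.encode('utf-8')    # how UTF-8 stores that code point
b'\xd0\xb9'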

Looking at the й entry above, we see that the Unicode value is U+0439, which is the hexadecimal value 0x0439, and it has a UTF-8 encoding of D0 B9. How do we get D0 B9 from U+0439? Herein lies the distinction between Unicode and UTF-8. In the old days, when computers just had to deal with the Latin alphabet, ASCII was sufficient. It was capable of encoding 128 characters, which was more than enough to store English-language text. But how to deal with characters in other languages? Clearly one byte is not going to be enough, so we give every character its own entry in the much larger Unicode table. Problem solved? Not quite. We could just store all text as multi-byte sequences; that’s certainly necessary for some Asian languages, notably Chinese. But it would be inefficient for English, where most of what we write and store is rendered in the Latin alphabet (which resides in the lowest 128 entries of the Unicode table).

One of the answers to this dilemma is UTF-8, which flexibly encodes characters across the full range of Unicode values. I’ll simplify a bit here. For characters with Unicode values 0–127, we just store the character as a single byte. Nice and compact. For Unicode characters with larger values, we have to distribute their bits into bytes that have fixed identifying “header bits” (my terminology, don’t quote me). For Unicode values up to 0x07FF, we take the 11 bits of the Unicode value and insert them into two bytes with a fixed format:

Byte 1     Byte 2     Byte 3     Byte 4     # free bits      Maximum Unicode value
0xxxxxxx   -          -          -          7                0x7F
110xxxxx   10xxxxxx   -          -          5+6 → 11         0x7FF
1110xxxx   10xxxxxx   10xxxxxx   -          4+6+6 → 16       0xFFFF
11110xxx   10xxxxxx   10xxxxxx   10xxxxxx   3+6+6+6 → 21     0x10FFFF

From here, you could just trust me, but it’s worth checking. If you distribute the bits of U+0439 into the 11 free bits of the two-byte UTF-8 sequence, you end up with D0 B9.
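The bit arithmetic is easy to verify in Python. This sketch places the top 5 of those 11 bits after the 110 marker of the first byte, and the bottom 6 bits after the 10 marker of the second byte:

cp = 0x0439                       # Unicode code point for й
byte1 = 0b11000000 | (cp >> 6)    # 110xxxxx: top 5 of the 11 bits
byte2 = 0b10000000 | (cp & 0x3F)  # 10xxxxxx: bottom 6 bits
print(hex(byte1), hex(byte2))     # 0xd0 0xb9
print('й'.encode('utf-8'))        # b'\xd0\xb9', so Python agrees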

Now to why this matters for my problem of finding й-containing words.

A discovery

After it dawned on me that some errant encoding might be the cause of the problem, I decided to do a little print debugging and capture what my API was seeing for each character:

for l in word:
    print(l.encode('utf-8'))

So for the word численный, for example, what were we seeing?

b'\xd1\x87'
b'\xd0\xb8'
b'\xd1\x81'
b'\xd0\xbb'
b'\xd0\xb5'
b'\xd0\xbd'
b'\xd0\xbd'
b'\xd1\x8b'
b'\xd0\xb8'
b'\xcc\x86'

Whoa, where is the D0 B9 (U+0439) that we would expect? The problem is right in front of us now. The last two code points in the UTF-8 output are \xd0\xb8 and \xcc\x86. What are those guys? Well, it turns out that the letter и has a Unicode value of U+0438 and therefore a UTF-8 encoding of D0 B8. What follows it is probably a diacritical mark. Which one? Knowing now how UTF-8 is constructed, we can work backwards. CC 86 is 11001100 10000110 in binary. If we strip out the fixed header bits of the two-byte sequence, we’re left with 01100000110, which is 0x306. Looking up U+0306 in the Unicode table, we find COMBINING BREVE.
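Python can confirm the working-backwards exercise for us; unicodedata will name the character that those two bytes decode to:

import unicodedata

mark = b'\xcc\x86'.decode('utf-8')
print(hex(ord(mark)))            # 0x306
print(unicodedata.name(mark))    # COMBINING BREVE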

Therein lies the problem. Somewhere the й got turned into и plus a combining breve diacritical mark.

The solution

Since I’m not able to control the process that led to this bungling of the encoding, all I can do is fix it on the back end. Therefore, we just do a character substitution:

word = word.replace(u"\u0438\u0306", u"\u0439")

Any time we have this stupid encoding, we just fix it before using the word. Now I’m willing to bet that ё is going to get handled the same way.
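If other decomposed letters like ё do start showing up, a more general alternative to replacing each pair by hand is Unicode normalization: unicodedata.normalize('NFC', ...) recomposes any canonically decomposed character, including е plus a combining diaeresis back into ё. A sketch of the idea:

import unicodedata

word = 'численныи\u0306'                    # и + combining breve instead of й
fixed = unicodedata.normalize('NFC', word)
print(fixed)                                # численный
print('\u0439' in fixed)                    # True: й is back as a single code point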

Problem solved!


Dynamic DNS - auto-updating from macOS

To run a little project (that I’ll describe at some point in the future) I have to run a small web server from my home computer, one that happens to run macOS. More than anything else, this is just a replay of what I did to get it running in case: a) I have to do it again, or b) someone else finds it useful. Sign up for a dynamic DNS service: I signed up for service with dynv6 because I saw it recommended elsewhere and it didn’t look creepy like some of the other options.

No sir, I do not want Big Sur

Maybe I’m just getting cranky after over a year of on-again-off-again pandemic lockdowns, but I’ve had it with Apple’s heavy-handed attempts to get me to upgrade to Big Sur. Mind you, I have nothing against it. It’s just an operating system. I don’t particularly like its translucent, bubbly iOS look, but I could live with it. I just don’t want it. I depend on a very unorthodox setup: I have a lot of infrastructure tools that depend on certain versions of Python being in just the right place.

Dynamically loading Javascript in Anki card templates

The ability to execute Javascript in Anki card templates offers users flexibility in displaying data. In Anki 2.1, though, the asynchronous execution of Javascript means that user script functionality is not entirely predictable. This post on r/Anki discusses an approach for dynamically loading Javascript resources and ensuring that they are available when the card is displayed. Since I modularize my Javascript code so that it can be flexibly deployed to different card types, I extended this method to allow the template developer to load multiple scripts in one <script> block.

Fixing CodeRunner jQuery injection

CodeRunner is one of my favourite development environments on macOS. I use it for small one-off projects or for testing concepts for integration into larger projects. But in version 4.0.3, jQuery injection in a custom HTML page is broken, giving an error. It’s probably due to some unescaped bit of code in their minified jQuery, but I didn’t have time to work that hard. Instead I reported the error to the developer and fixed it myself.

Extending the Anki Cloze Anything script for language learners

It’s possible to use cloze deletion cards within standard Anki note types using the Anki Cloze Anything setup. But additional scripts are required to allow it to function seamlessly in a typical language-learning environment. I’ll show you how to flexibly display a sentence with or without Anki Cloze Anything markup and also not break AwesomeTTS. Anki’s built-in cloze deletion system: the built-in cloze deletion feature in Anki is an excellent way for language learners to actively test their recall.

Complete fix for broken Knowclip .apkg files

I think this is the last word on fixing Knowclip .apkg files. I’ve developed the fix in bits and pieces, but hopefully this wraps up the subject. See my previous articles, here and here, for the details. The issue, again, is that Knowclip gives these notes and cards sequential id values starting at 1, but Anki uses the note.id and the card.id as the creation date. I logged it as an issue on GitHub, but as of 2021-04-15 no action has been taken.

Fixing Knowclip .apkg files: one more thing

(N.B. A much-improved version of this script is published in a later post.) Fixing the Knowclip note files as I described previously turns out to be only half of the fix for the broken .apkg files. You also need to fix the cards table. Why? Same reason: the rows are numbered sequentially from 1, but since Anki uses the card id field as the date added, the added field is always wrong.

Fixing Knowclip Anki apkg creation dates

(N.B. A much-improved version of this script is published in a later post.) Language learners who want to develop their listening comprehension skills often turn to YouTube for videos that feature native-language content. Often these videos have subtitles in the original language. A handful of applications allow users to take these videos along with their subtitles and chop them up into sentence-length bites that are suitable for Anki cards. One such application is Knowclip.