Ojisan Seiuchi

Normalizing spelling in Russian words containing the letter ё

June 07, 2021

The Russian letters ё and е have a complex and troubled relationship. The two letters are pronounced differently, but ё is usually written as е in ordinary text. This presents complications for Russian learners and for text-to-speech systems. In several recent projects, I have needed to normalize the spelling of Russian words. For example, if I have the written word определенно, is the word actually определенно? Or is it определённо?

This was a larger challenge than I imagined. Apart from udar¹, I failed to find any off-the-shelf solution for what I call normalizing the spelling of words that should be spelled with ё. It turns out that the Russian-language Wiktionary resolves page URLs whether the word is spelled with ё or е. Therefore, one way of normalizing the spelling is to query Wiktionary and grab the headword from the page. Normally I don’t like creating this sort of dependency, but it’s the only solution that has presented itself so far. Here’s the approach I took:

#!/usr/bin/env python3

from typing import Optional

import requests
from lxml import html

word = 'еще'

def normalize(word: str) -> Optional[str]:
    # don't bother searching if there's no е or if
    # there *is* a ё
    lowered = word.lower()
    if 'е' not in lowered or 'ё' in lowered:
        return word
    url = f'https://ru.wiktionary.org/wiki/{word}'
    try:
        page = requests.get(url, timeout=10)
    except requests.RequestException:
        # network trouble: fall back to the word as given
        return word
    # lxml's HTML parser tolerates markup that is not well-formed XML
    tree = html.fromstring(page.content)
    block = tree.xpath('//h1[@id="firstHeading"]')
    if not block:
        return word
    # text_content() also works if the headword sits inside a child element
    return block[0].text_content().strip()

if __name__ == "__main__":
    print(normalize(word))
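
Since every call to normalize() costs a round trip to Wiktionary, it may be worth caching results when working through a long word list. What follows is only a sketch of that idea, not part of the script above: make_cached_normalizer and its lookup parameter are hypothetical names, and any function that maps a word to its headword (normalize, for instance) could be passed in.

```python
from typing import Callable, Dict, Optional


def make_cached_normalizer(
    lookup: Callable[[str], Optional[str]],
) -> Callable[[str], str]:
    """Wrap a lookup function so each distinct word is fetched only once."""
    cache: Dict[str, str] = {}

    def cached(word: str) -> str:
        lowered = word.lower()
        # Same shortcut as normalize(): skip words without е or with a ё.
        if 'е' not in lowered or 'ё' in lowered:
            return word
        if word not in cache:
            # Fall back to the original spelling if the lookup returns None.
            cache[word] = lookup(word) or word
        return cache[word]

    return cached


if __name__ == "__main__":
    # Stand-in for a real lookup; a dict plays the role of Wiktionary here.
    fake_headwords = {'еще': 'ещё'}
    cached_normalize = make_cached_normalizer(fake_headwords.get)
    print(cached_normalize('еще'))
```

Passing the lookup in as a parameter keeps the cache testable without network access.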

¹ udar can work, but its installation is non-trivial and it has substantial dependencies that may make it less appealing in some applications.

Scraping Russian word definitions from Wiktionary: utility for Anki

May 13, 2021

Programming

While my Russian Anki deck contains around 27,000 cards, I’m always making more. (There are a lot of words in the Russian language!) Over the years, I’ve become more and more efficient at card production, but one of the missing pieces was a machine-readable source of word definitions. There’s no shortage of dictionary sites, but scraping data from any site is complicated by the ways in which front-end developers spread the semantic content across multiple HTML tags arranged in deep and cryptic hierarchies. Yes, we can cut and paste, but my quest is to automate quality card production almost completely. This is a quick post on a method for scraping word definitions from Wiktionary.

The project relies on the wiktionaryparser module available for Python. Although it’s not feature complete, it’s pretty good. With a little extra processing, it can do a lot of the heavy lifting of extracting word definitions.

#!/usr/bin/env python3

from wiktionaryparser import WiktionaryParser
import re

parser = WiktionaryParser()
parser.set_default_language('russian')
wp = parser.fetch('картонный')
def_list = []
for parsed in wp:
   for definition in parsed['definitions']:
      for text in definition['text']:
         if not bool(re.search('[а-яА-ЯёЁ]', text)):
            def_list.append(text)
print(', '.join(def_list))

The code is largely self-explanatory, but I would just point out that line 13 (the re.search check) is there to exclude any line that contains residual Cyrillic characters. Out of the box, the WiktionaryParser module seems to capture the headword, IPA pronunciation, etc., so we need a way of excluding all of that before we join the definition lines into a single comma-delimited string.
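
To make the filtering step easier to try without hitting Wiktionary, here is a minimal sketch of the same logic pulled into a function. The name extract_definitions is hypothetical, and the sample data is hand-written in the shape the loop above iterates over; it is not real parser output. The character class lists ё and Ё explicitly because they fall outside the а-я range.

```python
import re
from typing import Any, Dict, List

# Cyrillic letters; ё and Ё sit outside the а-я range, so list them explicitly.
CYRILLIC = re.compile('[а-яА-ЯёЁ]')


def extract_definitions(entries: List[Dict[str, Any]]) -> str:
    """Join the Cyrillic-free definition lines into one comma-delimited string."""
    kept = [
        text
        for entry in entries
        for definition in entry.get('definitions', [])
        for text in definition.get('text', [])
        if not CYRILLIC.search(text)
    ]
    return ', '.join(kept)


if __name__ == "__main__":
    # Hand-written sample for illustration only, not real parser output.
    sample = [{'definitions': [{'text': [
        'картонный (kartonnyj)',
        'cardboard (attributive)',
    ]}]}]
    print(extract_definitions(sample))
```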

In the process of writing and maintaining a service that checks Russian word frequencies, I noticed a peculiar phenomenon: certain words could not be found in a SQLite database that I knew contained them. For example, a query for the word английский consistently failed, whereas queries for other words succeeded. Eventually the common thread among the failures became obvious: they all contained the letter й, which led me down a rabbit hole of character encoding and the specific way it can go astray in this case.
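
The excerpt does not spell out the cause, so treat the following as an assumption: one well-known way for two visually identical copies of й to differ is that one is the single composed code point U+0439 while the other is и (U+0438) followed by a combining breve (U+0306). SQLite compares text byte for byte, so a decomposed query will miss a composed row. The sketch below normalizes to NFC before querying; the table name freq and the helpers find_word and demo are hypothetical.

```python
import sqlite3
import unicodedata
from typing import Optional


def find_word(conn: sqlite3.Connection, word: str) -> Optional[str]:
    """Look up a word after normalizing it to composed (NFC) form."""
    key = unicodedata.normalize('NFC', word)
    row = conn.execute('SELECT word FROM freq WHERE word = ?', (key,)).fetchone()
    return row[0] if row else None


def demo() -> sqlite3.Connection:
    """Build a throwaway in-memory table holding one composed word."""
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE freq (word TEXT PRIMARY KEY)')
    conn.execute('INSERT INTO freq VALUES (?)',
                 (unicodedata.normalize('NFC', 'английский'),))
    return conn


if __name__ == "__main__":
    conn = demo()
    decomposed = unicodedata.normalize('NFD', 'английский')
    # A raw query with the decomposed string finds nothing...
    print(conn.execute('SELECT word FROM freq WHERE word = ?',
                       (decomposed,)).fetchone())
    # ...while the normalized lookup succeeds.
    print(find_word(conn, decomposed))
```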

To run a little project (that I’ll describe at some point in the future), I have to run a small web server from my home computer, which happens to run macOS. More than anything else, this is just a replay of what I did to get it running, in case (a) I have to do it again or (b) someone else finds it useful.

I signed up for service with dynv6 because I saw it recommended elsewhere and it didn’t look creepy like some of the other options. I just signed up with email - through an email proxy anonymizer, because I’m paranoid. After verifying my email, I was able to create a new “zone”, basically a record of my public IP address linked to custom DNS.

Maybe I’m just getting cranky after over a year of on-again-off-again pandemic lockdowns, but I’ve had it with Apple’s heavy-handed attempts to get me to upgrade to Big Sur. Mind you, I have nothing against it. It’s just an operating system. I don’t particularly like its translucent, bubbly iOS look. But I could live with it.

But I don’t want it. I depend on a very unorthodox setup. I have a lot of infrastructure tools that depend on certain versions of Python to be in just the right place. Every single macOS major upgrade breaks all of this and I spend days picking up the pieces. I’m tired of Apple messing with it. So when my system launched into what seems like an unbidden upgrade process today, I lost it.

The ability to execute Javascript in Anki card templates offers users flexibility in displaying data. In Anki 2.1, though, the asynchronous execution of Javascript means that user script functionality is not entirely predictable. This post on r/Anki discusses an approach for dynamically loading Javascript resources and ensuring that they are available when the card is displayed. Since I modularize my Javascript code so that it can be flexibly deployed to different card types, I extended this method to allow the template developer to load multiple scripts in one <script> block.

CodeRunner is one of my favourite development environments on macOS. I use it for small one-off projects or for testing concepts for integration into larger projects. But in version 4.0.3, jQuery injection in a custom HTML page is broken, giving the error:

It’s probably due to some unescaped bit of code in their minified jQuery, but I didn’t have time to work that hard. Instead I reported the error to the developer and fixed it myself. The original (default) run script for jQuery is:

It’s possible to use cloze deletion cards within standard Anki note types using the Anki Cloze Anything setup. But additional scripts are required to allow it to function seamlessly in a typical language-learning environment. I’ll show you how to flexibly display a sentence with or without Anki Cloze Anything markup and also not break AwesomeTTS.

Anki’s built-in cloze deletion system

The built-in cloze deletion feature in Anki is an excellent way for language learners to actively test their recall. For example, a cloze deletion note type with the following content requires the learner to supply the missing word:
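
The example in this excerpt is cut off, so the sentence below is made up for illustration. Anki’s built-in cloze markup wraps the hidden text as {{c1::answer}}, optionally followed by ::hint. As a rough sketch of what the front of such a card shows, and not Anki’s own code, the hypothetical helper render_cloze_front hides the active deletion and reveals the others:

```python
import re

# {{c<number>::answer}} or {{c<number>::answer::hint}}
CLOZE = re.compile(r'\{\{c(\d+)::(.*?)(?:::(.*?))?\}\}')


def render_cloze_front(text: str, active: int = 1) -> str:
    """Hide the active cloze deletion and show the text of all others."""

    def replace(match: "re.Match[str]") -> str:
        number, answer, hint = match.groups()
        if int(number) == active:
            # Show the hint in brackets if there is one, else an ellipsis.
            return f"[{hint or '...'}]"
        return answer

    return CLOZE.sub(replace, text)


if __name__ == "__main__":
    # Made-up sentence: "I am reading a book."
    print(render_cloze_front('Я {{c1::читаю}} книгу.'))
```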

I think this is the last word on fixing Knowclip .apkg files. I’ve developed the fix in bits and pieces, but hopefully this settles the subject. See my previous articles, here and here, for the details.

The issue, again, is that Knowclip gives these notes and cards sequential id values starting at 1. But Anki uses the note.id and the card.id as the creation date. I logged it as an issue on GitHub, but as of 2021-04-15 no action has been taken.
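
To see why sequential ids are a problem, it helps to decode one the way Anki does. Anki stores these ids as milliseconds since the Unix epoch, so an id of 1 lands at the very start of 1970. A small sketch (the helper name anki_id_to_datetime is hypothetical):

```python
from datetime import datetime, timezone


def anki_id_to_datetime(anki_id: int) -> datetime:
    """Interpret an Anki note or card id as epoch milliseconds (UTC)."""
    return datetime.fromtimestamp(anki_id / 1000, tz=timezone.utc)


if __name__ == "__main__":
    # A Knowclip-style sequential id decodes to the start of the epoch.
    print(anki_id_to_datetime(1))
    # An id generated on 2021-04-15 decodes to that date.
    print(anki_id_to_datetime(1618444800000))
```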

(N.B. A much-improved version of this script is published in a later post)

Fixing the Knowclip note files as I described previously, it turns out, is only half of the fix for the broken .apkg files. You also need to fix the cards table. Why? Same reason. The rows are numbered sequentially from 1. But since Anki uses the card id field as the date added, the added field is always wrong. Again, the fix is simple:
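
The script itself is not reproduced in this excerpt. As a rough sketch of the idea only, and assuming simplified notes and cards tables that keep just the id columns, the hypothetical function renumber_ids below swaps the small sequential ids for epoch-millisecond values and keeps each card’s nid pointing at its note. A real collection has many more columns and other tables that may reference these ids, so this is not a drop-in fix.

```python
import sqlite3
import time
from typing import Optional


def renumber_ids(conn: sqlite3.Connection, base_ms: Optional[int] = None) -> None:
    """Replace small sequential note/card ids with epoch-millisecond ids."""
    if base_ms is None:
        base_ms = int(time.time() * 1000)
    # Assumes the old ids are far smaller than base_ms, so new ids never collide.
    note_ids = [row[0] for row in conn.execute('SELECT id FROM notes ORDER BY id')]
    for offset, old_id in enumerate(note_ids):
        new_id = base_ms + offset
        conn.execute('UPDATE notes SET id = ? WHERE id = ?', (new_id, old_id))
        # Keep each card pointing at its renumbered note.
        conn.execute('UPDATE cards SET nid = ? WHERE nid = ?', (new_id, old_id))
    card_ids = [row[0] for row in conn.execute('SELECT id FROM cards ORDER BY id')]
    for offset, old_id in enumerate(card_ids):
        conn.execute('UPDATE cards SET id = ? WHERE id = ?',
                     (base_ms + offset, old_id))
    conn.commit()
```

Adding the row offset to the base timestamp keeps the ids unique while preserving the original order.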

Normalizing spelling in Russian words containing the letter ё

Scraping Russian word definitions from Wiktionary: utility for Anki

Encoding of the Cyrillic letter й - a UTF-8 gotcha

Dynamic DNS - auto-updating from macOS

No sir, I do not want Big Sur

Dynamically loading Javascript in Anki card templates

Fixing CodeRunner jQuery injection

Extending the Anki Cloze Anything script for language learners

Anki’s built-in cloze deletion system

Complete fix for broken Knowclip .apkg files

Fixing Knowclip .apkg files: one more thing

Sign up for dynamic DNS service
