Ojisan Seiuchi

Accessing Anki collection models from Python

January 22, 2022

For one-off projects that target Anki collections, I often use Python in a standalone application rather than an Anki add-on. Since I’m not going to distribute these little creations that are specific to my own needs, there’s no reason to create an add-on. These are just a few notes - nothing comprehensive - on the process.

One thing to be aware of is that the version of the Python anki module must exactly match the version of the Anki desktop application, down to the point release. If you are running Anki 2.1.48 on the desktop but have the Python module built for 2.1.49, it will not work. This is a huge irritation, and there’s no backwards compatibility; the versions must match precisely.

Anyway, here’s a little application to illustrate the simple process of finding all of the models (also known as note types).

#!/usr/bin/env python3

import os
from anki.collection import Collection
from anki.models import ModelManager
import anki.errors

COLLECTION_PATH = os.environ.get('ANKI_RU_COL_PATH')

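def anki_utf8_tr(ustr: str) -> str:
   # Re-read a name whose UTF-8 bytes were mistaken for Latin-1.
   # Names that don't fit that pattern are returned unchanged.
   try:
      return ustr.encode('latin1').decode('utf8')
   except (UnicodeEncodeError, UnicodeDecodeError):
      return ustr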

if __name__ == "__main__":
   try:
      col = Collection(COLLECTION_PATH)
   except anki.errors.DBError:
      print("ERROR: Anki should be closed")
      quit()
   model_mgr = ModelManager(col)
   for m in model_mgr.all_names_and_ids():
      m_id = m.id
      m_name = anki_utf8_tr(m.name)
      print(f'{m_id} - {m_name}')

An interesting caveat in reading the model names: if you’ve used names with character sets other than Latin, then the output of model.name looks a little strange, e.g. name: "\320\240\321\203\321\201\321\201\320\272\320\270\320\271 enhanced", which took a bit of time to figure out. It’s actually just the UTF-8 bytes written as octal escapes. The anki_utf8_tr function is meant to provide the translation to the original representation. I’ve written more about this previously.
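To make the octal reading concrete, here is a minimal sketch that turns such an escaped string back into text. The helper name decode_octal_escapes is my own, purely for illustration, and is not part of the anki module; it assumes every escape is a backslash followed by exactly three octal digits.

```python
import re

def decode_octal_escapes(escaped: str) -> str:
    """Turn backslash-octal escapes into bytes, then read the bytes as UTF-8."""
    # Replace each \NNN escape with the character whose code point is that byte value.
    as_latin1 = re.sub(r'\\([0-7]{3})', lambda m: chr(int(m.group(1), 8)), escaped)
    # Every character is now in the range 0-255, so Latin-1 maps each one back to a byte.
    return as_latin1.encode('latin1').decode('utf8')

# Raw string, so the backslashes are literal characters (hypothetical sample input).
escaped_name = r'\320\240\321\203\321\201\321\201\320\272\320\270\320\271 enhanced'
print(decode_octal_escapes(escaped_name))  # Русский enhanced
```

For instance, \320 is octal for 0xD0 and \240 is octal for 0xA0, and the byte pair D0 A0 is the UTF-8 encoding of Р.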

Converting Cyrillic UTF-8 text encoded as Latin-1

November 12, 2021

programming

This may be obvious to some, but recognizing a character encoding at a glance is not always easy.

For example, pronunciation files downloaded from Forvo have the following appearance:

pronunciation_ru_Ð¾ÑÐ±ÑÐ²Ð°Ð½Ð¸Ðµ.mp3

How can we extract the actual word from this gibberish? Optimally, the filename should reflect the actual word uttered in the pronunciation file, after all.

Step 1 - Extracting the interesting bits

The gibberish begins after the pronunciation_ru_ and ends before the file extension. Any regex tool can tease that out.

This is what I did in the shell:

echo $fn | perl -CSD -pe 's/pronunciation_ru_(.*)\.mp3/$1/gm;'
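The same extraction could be sketched in Python with the re module. The function name extract_encoded_word is hypothetical, and returning the input unchanged when the pattern is absent is my own choice rather than something the shell version does.

```python
import re

def extract_encoded_word(filename: str) -> str:
    """Return the text between 'pronunciation_ru_' and '.mp3', or the input if no match."""
    match = re.search(r'pronunciation_ru_(.*)\.mp3$', filename)
    return match.group(1) if match else filename

# '\u00d0\u00be' stands in for the first two garbled characters of a real filename.
print(extract_encoded_word('pronunciation_ru_\u00d0\u00be.mp3'))
```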

Now we are left with Ð¾ÑÐ±ÑÐ²Ð°Ð½Ð¸Ðµ and the question of what kind of strange encoding this is.

Step 2 - Figuring out character encoding

Obviously, this uses some Latin character set. Since the Russian language does not, we have some work to do. The task of unraveling this is easier when you can visualize the hex character codes laid out. A simple Python script makes that easy:

#!/usr/bin/env python3

word = 'Ð¾ÑÐ±ÑÐ²Ð°Ð½Ð¸Ðµ'

print(":".join("{:02x}".format(ord(c)) for c in word))

Running this little script, we see d0:be:4e:303:82:d0:b1:4e:303:8b:d0:b2:d0:b0:d0:bd:d0:b8:d0:b5.

Immediately we can begin to discern a pattern - lots of D0 codes followed by something else. It’s beginning to look like Unicode. So, on macOS, I fired up the character viewer from the menu bar and drilled down to the Cyrillic (or Unicode) section. Look up any Cyrillic character, for example ж:

Aha! The Cyrillic range includes characters whose first byte is D0, so now it’s just a matter of lining up 2-byte groups and reading them as UTF-8. So the first character would be D0 BE - which, according to the table, is a lowercase Cyrillic о.
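If no character viewer is at hand, the same lookup can be done in Python. This is only a small sketch to confirm the byte values; the variable names are mine.

```python
# UTF-8 bytes (as hex) for two Cyrillic letters: both start with D0.
utf8_hex = {letter: letter.encode('utf8').hex() for letter in 'жо'}
print(utf8_hex)  # {'ж': 'd0b6', 'о': 'd0be'}

# Going the other way: the byte pair D0 BE decodes to a single letter.
first_char = bytes.fromhex('d0be').decode('utf8')
print(first_char)  # о
```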

However, one complication remains. What is happening when the sequence is broken? There is an interruption in the two-byte reading frame that begins with the sequence 4e:303:82, then happens again with the sequence 4e:303:8b. The first step is to figure out the common portion 4e:303. Back to the character viewer table, we find 4E is the Latin capital letter N. So what about 303? Using the search feature of the Character Viewer, we easily see that U+0303 is a combining tilde. It’s a symbol that combines with the character that immediately precedes it. So what we have is not just Cyrillic UTF-8 characters encoded in Latin symbols, but with the additional oddity of a decomposed Ñ character. If we search for that character, we find that it is D1. So, the sequence isn’t really interrupted; it’s just an issue of how Ñ is composed.
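The standard unicodedata module can stand in for the Character Viewer here. A short sketch that names the three code points involved and confirms the Latin-1 byte for the precomposed character:

```python
import unicodedata

# Print the official Unicode name of each code point in question.
for ch in ('N', '\u0303', '\u00d1'):
    print(f'U+{ord(ch):04X}', unicodedata.name(ch))

# In Latin-1 the precomposed character is the single byte D1.
n_tilde_byte = '\u00d1'.encode('latin1')
print(n_tilde_byte.hex())  # d1
```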

Step 3 - Reading `N` + combining tilde as `u'\u00D1'`

This just requires substituting one character sequence for another. In Python, this will work:

# strange issue where Ñ (\u00D1) is intended
# but is encoded as N + tilde. Obviously this 
# is meaningless in terms of UTF-8 encoding
# but we have to deal with it before the decoding
# takes place.
word = word.replace(u'\u004E\u0303',u'\u00D1')
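An alternative to the targeted replace, not what the post uses, would be Unicode NFC normalization, which recomposes N + combining tilde into the single code point. A minimal sketch, with the caveat that NFC also recomposes any other decomposed sequences in the string, so it is a broader change:

```python
import unicodedata

decomposed = 'N\u0303'   # capital N followed by a combining tilde
composed = unicodedata.normalize('NFC', decomposed)

print(len(decomposed), len(composed))  # 2 1
print(composed == '\u00d1')            # True
```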

Step 4 - Putting it all together

After correcting the odd Ñ composition, we can simply decode the text as UTF-8, but we have one little twist first. .decode('utf8') requires a sequence of bytes (class <bytes>), not a string. So we have to make a trip through encoding in ‘latin1’ first, then decode the result as UTF-8.

tr_word = word.encode('latin1').decode('utf8')
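To check the whole chain, here is a round-trip sketch that first damages a word the same way and then repairs it. The function names are hypothetical. The sample word is what the hex dump in Step 2 decodes to, and the NFD decomposition step is only my assumption about how the N + tilde pair arose; the post does not say.

```python
import unicodedata

def to_mojibake(text: str) -> str:
    """Misread UTF-8 bytes as Latin-1, then decompose (which splits Ñ into N + tilde)."""
    return unicodedata.normalize('NFD', text.encode('utf8').decode('latin1'))

def from_mojibake(garbled: str) -> str:
    """Undo the damage: recompose Ñ, recover the bytes, and read them as UTF-8."""
    garbled = garbled.replace('\u004E\u0303', '\u00D1')
    return garbled.encode('latin1').decode('utf8')

word = 'отбывание'
garbled = to_mojibake(word)
hex_dump = ":".join("{:02x}".format(ord(c)) for c in garbled)
print(hex_dump)                # same dump as in Step 2
print(from_mojibake(garbled))  # отбывание
```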

rfndecode - a Python script to decode this form of encoding

#!/usr/bin/env python3

# 
# rfndecode
#
# When downloading files from Forvo, we get file
# names that look like: .ÐºÐ¾Ñ.mp3
# This puts the text into ordinary utf-8
#
# Input: Text to translate as argument or 
#        on stdin
# Output: Re-encoded text
#

import sys

# accept word as either argument or on stdin
try:
   word = sys.argv[1]
except IndexError:
   word = sys.stdin.read()

# check if this word is in the expected encoding
if word.find(u'\u00D0') == -1:
   print(word.strip())
   exit()

# strange issue where Ñ (\u00D1) is intended
# but is encoded as N + tilde. Obviously this 
# is meaningless in terms of UTF-8 encoding
# but we have to deal with it before the decoding
# takes place.
word = word.replace(u'\u004E\u0303',u'\u00D1')

# convert string to bytes in latin script
# then decode it as UTF-8
tr_word = word.encode('latin1').decode('utf8')
print(tr_word.strip())

Shell script to extract the unencoded text and rename

Now it’s just a matter of connecting all the components, which I did in a small shell script.

#!/usr/local/bin/zsh

# extract the really messed-up name of the 
# pronunciation file
if [ "$#" -gt 0 ]; then
  fn=$1
else
  read fn
fi

tr_fn=$(echo $fn | perl -CSD -pe 's/pronunciation_ru_(.*)\.mp3/$1/gm;' | rfndecode ).mp3
tr_fn=$(basename $tr_fn)
printf "*** tr_fn = %s\n" $tr_fn >> $HOME/wtf.txt
mv $fn $HOME/Documents/mp3/$tr_fn

Undoubtedly, the mysterious encoding might have been obvious to some, but for me it was an illustration of how to approach technical problems by taking them apart into the smallest discernible pieces, then applying what you know - even if limited in scope - to assemble the pieces into a comprehensive solution.

I’ve written a lot about applying and removing syllabic stress marks in Russian text because I use it a lot when making Anki cards. This iteration is a command line tool for applying the stress mark at a particular character index. The advantage of these little shell tools is that they can be composable, integrating into different tools as the need arises.

#!/usr/local/bin/zsh

while getopts i:w: flag
do
  case "${flag}" in
    i) index=${OPTARG};;
    w) word=${OPTARG};;
  esac
done

if [ $word ]; then
  temp=$word
else
  read temp
fi

outword=""
for (( i=0; i<${#temp}; i++ )); do
  thischar="${temp:$i:1}"
  if [ $i -eq $index ]; then
    thischar=$(echo $thischar | perl -C -pe 's/(.

Introducing sterilize-ng [GitHub link] - a URL sterilizer made to work flexibly on the command line. Background The surveillance capitalist economy is built on the relentless tracking of users. Imagine going about town running errands but everywhere you go, someone is quietly following you. When you pop into the grocery, they examine your receipt. They look into the bags to see what you bought. Then they hop in the car with you and keep careful records of where you go, how fast you drive, whom you talk with on the phone.

One of the things that I love about Keyboard Maestro is the ability to chain together disparate technologies to achieve some automation goal on macOS. In most of my previous posts about Keyboard Maestro macros, I’ve used Python or shell scripts, but I decided to draw on some decades-old experience with Perl to do a little text processing for a specific need. Background I want this text from Wiktionary: to look like this:

Russian text intended for learners sometimes contains marks that indicate the syllabic stress. It is usually rendered as a vowel + a combining diacritical mark, typically the combining acute accent U+0301. Here are a couple ways of stripping these marks on the command line:

First is a version using Perl:

#!/bin/bash
f='покупа́ешья́'; echo $f | perl -C -pe 's/\x{301}//g;'

And then another using the sd tool:

#!/bin/bash
f='покупа́ешья́'; echo $f | sd "\u0301" ""

Both rely on finding the combining diacritical mark and removing it with regex.

It seems like the command line is one of those places where you can accomplish crazy efficient things with one-liners. Here’s a perfect use case for a CLI one-liner: In Anki, I often add lists of synonyms and antonyms to my vocabulary cards, but I like them formatted as a bulleted list. My usual route to that involves Markdown. But how to convert this:

известный, точный, определённый, достоверный

to

- известный
- точный
- определённый
- достоверный

After trying to come up with a single text replacement strategy to make this work, the best I could do was this:

Often when I import a pronunciation file into Anki, from Forvo for example, the volume isn’t quite right or there’s a lot of background noise; and I want to edit the sound file. How? The solution for me, as is often the case, is a Keyboard Maestro macro. Prerequisites Keyboard Maestro - if you are a macOS power user and don’t have KM, then you’re missing out on a lot. Audacity - the multi-platform FOSS audio editor Outline of the approach Since Keyboard Maestro won’t know the path to our file in Anki’s collection.

When the Anki application is open on the desktop, it places a lock on the sqlite3 database such that it can’t be queried by another process. One workaround is to try to open the database and if it fails, then make a temporary copy and query that. Of course, this only works with read-only queries. Here’s the basic strategy:

#!/usr/local/bin/python3
# -*- coding: utf-8 -*-

# requires python >= 3.8 to run because of anki module

from anki import Collection, errors

if __name__ == "__main__":
   try:
      col = Collection(path_to_anki_db)
   except (errors.

The Russian letters ё and e have a complex and troubled relationship. The two letters are pronounced differently, but usually appear the same in written text. This presents complications for Russian learners and for text-to-speech systems. In several recent projects, I have needed to normalize the spelling of Russian words. For example, if I have the written word определенно, is the word actually определенно? Or is it определённо?

accentchar: a command-line utility to apply Russian stress marks

sterilize-ng: a command-line URL sterilizer

Using Perl in Keyboard Maestro macros

Stripping Russian stress marks from text from the command line

Splitting a string on the command line - the search for the one-liner

A Keyboard Maestro macro to edit Anki sound file

Querying the Anki database when the application is running

Normalizing spelling in Russian words containing the letter ё
