Splitting text into sentences: Russian edition

Splitting text into sentences is one of those tasks that looks simple but, on closer inspection, is more difficult than you might think. A common approach is to use regular expressions to divide up the text on punctuation marks. But without adding layers of complexity, that method fails on a sentence such as:

“Trapper John, M.D. was as fine as any Ph.D.”

It’s obviously only one sentence, but try splitting it with a regex and the difficulty is immediately apparent.
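To make the failure concrete, here is a minimal sketch of the naive approach using only Python's standard library. The pattern is my own illustration of "split after sentence-ending punctuation", not one taken from the Stack Overflow thread.

```python
import re

text = "Trapper John, M.D. was as fine as any Ph.D."

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

# The abbreviation "M.D." is mistaken for the end of a sentence.
for sentence in naive_sentences:
    print(repr(sentence))
```

The single sentence comes back in two pieces, split right after "M.D.". Patching this with a list of known abbreviations is exactly the kind of added complexity mentioned above.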

A solution suggested on Stack Overflow is to use the spaCy natural language processing module along with its ‘sentencizer’ pipeline component to do the heavy lifting. The recommended solutions are all based on English language processing, so I was anxious to see if it would work on Russian text. The short answer is “yes.” This post is just to document the solution.

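from spacy.lang.ru import Russian

# Sample input for illustration only; substitute your own Russian text
text = "Это первое предложение. А это второе."

# Blank Russian pipeline with just the rule-based sentencizer added
nlp_simple = Russian()
nlp_simple.add_pipe('sentencizer')

doc = nlp_simple(text)
# doc.sents yields one span per detected sentence
sentences = [str(sent).strip() for sent in doc.sents]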

What's up with Pinboard? And an alternative

Beginning somewhere around April 2022, the bookmarking web application Pinboard began to suffer prolonged outages without any substantive commentary from the developer. Reports on Hacker News reveal a pattern of frequently broken functionality. As of this writing, the API is no longer functioning.

One of the great things about the HN community is that you can almost always find an open-source tool to get the job done. That’s how I discovered Espial. It’s a minimalist open-source self-hosted bookmarking tool that looks and works like Pinboard. It also imports the Pinboard export JSON format.

Espial installed readily for me on macOS and seems very usable. My advice is to export your Pinboard bookmarks while you can and spin up an instance of Espial.

My favourite Cyrillic font

I’ve tried a lot of fonts for Cyrillic. My favourite is Georgia. As a non-native Russian speaker, I find that serif fonts, either on-screen or in print, make the text so much more legible.

The cancellation of Russian music

Free speech in Russia has never been particularly favoured. The Romanov dynasty remained in power long past its expiration date by suppressing waves of free thought, from the ideals of the Enlightenment to the anti-capitalist ideals of Marx and Engels. At least, until the 1917 Revolution. And even then, the Bolsheviks continued to suppress dissent for the entire seventy-something year history of the Soviet Union. Perestroika and the collapse of the Soviet Union promised change.

Bash variable scope and pipelines

I alluded to this nuance involving variable scope in my post on automating PDF processing, but I wanted to expand on it a bit. Consider this little snippet:

i=0
printf "foo:bar:baz:quux" | grep -oE '[^:]+' | while read -r line ; do
    printf "Inner scope: %d - %s\n" "$i" "$line"
    ((i++))
    [ "$i" -eq 3 ] && break
done
printf "====\nOuter scope\ni = %d\n" "$i"

If you run this as a script, rather than interactively in the shell, what will i be in the outer scope?

Automating the handling of bank and financial statements

In my perpetual effort to get out of work, I’ve developed a suite of automation tools to help file statements that I download from banks, credit cards and others. While my setup described here is tuned to my specific needs, any of the ideas should be adaptable for your particular circumstances. For the purposes of this post, I’m going to assume you already have Hazel. None of what follows will be of much use to you without it.

Bulk rename tags in DEVONthink 3

In DEVONthink, I tag a lot. It’s an integral part of my strategy for finding things in my paperless environment. As I wrote about previously, hierarchical tags are a big part of my organizational system in DEVONthink. For many years, I tagged subject matter with tags that emanate from a single tag named topic_, but it was really an unnecessary top-level complication. So, the first item on my to-do list was to get rid of all the tags with a topic_ first level.

Stripping Russian syllabic stress marks in Python

I have written previously about stripping syllabic stress marks from Russian text using a Perl-based regex tool. But I needed a means of doing it solely in Python, so this just extends that idea.

#!/usr/bin/env python3

def strip_stress_marks(text: str) -> str:
    b = text.encode('utf-8')
    # correct error where latin accented ó is used
    b = b.replace(b'\xc3\xb3', b'\xd0\xbe')
    # correct error where latin accented á is used
    b = b.replace(b'\xc3\xa1', b'\xd0\xb0')
    # correct error where latin accented é is used
    b = b.
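The excerpt above is cut off, so the rest of the byte-replacement function is not shown here. As a separate illustration, and not the author's method, here is a minimal sketch of one other way to drop stress marks, assuming they are stored as the combining acute accent (U+0301) after the vowel. The function name `strip_combining_stress` is hypothetical.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # assumed encoding of the stress mark


def strip_combining_stress(text: str) -> str:
    """Remove combining acute accents while keeping й and ё intact."""
    # NFD splits precomposed characters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop only the acute accent; breve (й) and diaeresis (ё) are untouched.
    stripped = decomposed.replace(COMBINING_ACUTE, "")
    # NFC recomposes the remaining marks, restoring й and ё.
    return unicodedata.normalize("NFC", stripped)


print(strip_combining_stress("замо\u0301к"))
```

Note that this sketch does not perform the correction shown in the original function: a Latin accented ó would simply lose its accent and stay a Latin o rather than being mapped to Cyrillic о.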

Accessing Anki collection models from Python

For one-off projects that target Anki collections, I often use Python in a standalone application rather than an Anki add-on. Since I’m not going to distribute these little creations that are specific to my own needs, there’s no reason to create an add-on. These are just a few notes - nothing comprehensive - on the process. One thing to be aware of is that there must be a perfect match between the Anki major and minor version numbers for the Python anki module to work.

Converting Cyrillic UTF-8 text encoded as Latin-1

This may be obvious to some, but recognizing a character encoding at a glance is not always easy. For example, pronunciation files downloaded from Forvo have the following appearance:

pronunciation_ru_оÑ‚бывание.mp3

How can we extract the actual word from this gibberish? Optimally, the filename should reflect the actual word uttered in the pronunciation file, after all.

Step 1 - Extracting the interesting bits

The gibberish begins after the pronunciation_ru_ and ends before the file extension.
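The remaining steps are not shown in this excerpt, so what follows is only a rough sketch of the general idea. It assumes the filename was produced by decoding UTF-8 bytes as Latin-1, as the post title suggests. The helper name `recover_word` is hypothetical, and the garbled name is built inside the example so that the round trip is exact.

```python
from pathlib import Path

PREFIX = "pronunciation_ru_"


def recover_word(filename: str) -> str:
    """Recover the Cyrillic word from a mis-decoded pronunciation filename."""
    # Drop the file extension, then the fixed prefix.
    stem = Path(filename).stem
    garbled = stem[len(PREFIX):] if stem.startswith(PREFIX) else stem
    # Reverse the wrong decoding: back to the raw bytes, then read as UTF-8.
    return garbled.encode("latin-1").decode("utf-8")


# Simulate the damage: UTF-8 bytes wrongly interpreted as Latin-1.
garbled_name = PREFIX + "отбывание".encode("utf-8").decode("latin-1") + ".mp3"
print(recover_word(garbled_name))
```

If the text was mangled through a different code page (Windows-1252 is a common culprit), the `encode` step would need that codec instead, and it can fail on bytes the codec does not define.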