Normalizing spelling in Russian words containing the letter ё

June 07, 2021

The Russian letters ё and e have a complex and troubled relationship. The two letters are pronounced differently, but usually appear the same in written text. This presents complications for Russian learners and for text-to-speech systems. In several recent projects, I have needed to normalize the spelling of Russian words. For examples, if I have the written word определенно , is the word actually определенно ? Or is it определённо ?

This was a larger challenge than I imagined. Apart from udar¹, I failed to find any off-the-shelf solutions to what I call normalizing the spelling of words that should be spelled with ё . It turns out that the Russian language Wiktionary respects URLs whether spelled with ё or e . Therefore, one way of normalizing the spelling is to query Wiktionary and grab the headword from the page. Normally I don’t like creating this sort of dependency; but it’s the only solution that presented itself so far. Here’s the approach I took:

#!/usr/bin/env python3

from lxml import html
from lxml import etree
import requests
import re
from typing import Optional

word = 'еще'

def normalize(word:str) -> Optional[str]:
    # don't bother searching if there's no е or if
    # there *is* a ё
    if not bool(re.search(r'[её]', word)) or bool(re.search(r'[ё]', word)):
        return word
    url = f'https://ru.wiktionary.org/wiki/{word}'
    page = requests.get(url)
    content = page.content.decode()
    tree = etree.fromstring(content.replace('--lang--', ''))
    block = tree.xpath('//h1[@id="firstHeading"]')
    try:
        return block[0].text
    except:
        return word

if __name__ == "__main__":
    print(normalize(word))

udar can work but the installation is non-trivial and it has substantial dependencies that may make it less appealing in some applications. ↩︎