Beginning to experiment with Stanza for natural language processing

After installing Stanza as a dependency of UDAR, which I recently described, I decided to play around with what it can do.

Installation

The installation is straightforward and is documented on the Stanza getting started page.

First,

sudo pip3 install stanza

Then install a model. For this example, I installed the Russian model:

#!/usr/local/bin/python3
import stanza
stanza.download('ru')

Usage

Part-of-speech (POS) and morphological analysis

Here’s a quick example of POS analysis for Russian. I used PrettyTable to clean up the presentation, but it’s not strictly necessary.

#!/usr/local/bin/python3
import stanza
from prettytable import PrettyTable

tab = PrettyTable()
tab.field_names = ["word","lemma","upos","xpos","features"]
for field_name in tab.field_names:
    tab.align[field_name] = "l"

nlp = stanza.Pipeline(lang='ru', processors='tokenize,pos,lemma')
doc = nlp('Моя собака внезапно прыгнула на стол.')
for sent in doc.sentences:
    for word in sent.words:
        tab.add_row([word.text, word.lemma, word.upos,
                     word.xpos, word.feats if word.feats else "_"])
print(tab)

Note that upos values are the universal part-of-speech tags, whereas xpos values are language-specific part-of-speech tags.
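The features column holds a single string of Key=Value pairs joined by "|" (the Universal Dependencies convention), for example Case=Nom|Gender=Fem|Number=Sing. A minimal sketch of turning that string into a dictionary follows; parse_feats is a hypothetical helper name, not part of Stanza.

```python
def parse_feats(feats):
    """Turn a UD-style feature string into a dict.

    None, an empty string, or "_" all mean "no features".
    parse_feats is a hypothetical helper, not a Stanza function.
    """
    if not feats or feats == "_":
        return {}
    # Each pair looks like "Case=Nom"; split on the first "=" only.
    return dict(pair.split("=", 1) for pair in feats.split("|"))


print(parse_feats("Case=Nom|Gender=Fem|Number=Sing"))
# {'Case': 'Nom', 'Gender': 'Fem', 'Number': 'Sing'}
```

With the pipeline above, this could be applied as parse_feats(word.feats) to look up a single feature such as Case.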

Named-entity recognition

Stanza can also recognize named entities (persons, organizations, and locations) in the text it analyzes:

import stanza
from prettytable import PrettyTable

tab = PrettyTable()
tab.field_names = ["Entity","Type"]
for field_name in tab.field_names:
    tab.align[field_name] = "l"

nlp = stanza.Pipeline(lang='ru', processors='tokenize,ner')
doc = nlp("Владимир Путин живёт в Москве и является Президентом России.")
for sent in doc.sentences:
    for ent in sent.ents:
        tab.add_row([ent.text, ent.type])
print(tab)

which prints a table of the recognized entities and their types.

I’m excited to see what can be built from this for language-learning purposes.
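As one small step in that direction, recognized entities can be grouped by type, for instance to list every place name in a text. This is only a sketch: group_entities and FakeEntity are hypothetical names, the stand-in objects mimic the .text and .type attributes used above, and the type labels are illustrative rather than actual Stanza output.

```python
from collections import defaultdict, namedtuple


def group_entities(entities):
    """Group entity texts by type.

    Accepts any iterable of objects with .text and .type attributes.
    """
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent.type].append(ent.text)
    return dict(grouped)


# Stand-in for entity objects so the sketch runs without a downloaded model.
# The labels "PER" and "LOC" are assumptions made for illustration.
FakeEntity = namedtuple("FakeEntity", ["text", "type"])
sample = [
    FakeEntity("Владимир Путин", "PER"),
    FakeEntity("Москве", "LOC"),
    FakeEntity("России", "LOC"),
]
print(group_entities(sample))
```

With a real pipeline, the same function could be called as group_entities(ent for sent in doc.sentences for ent in sent.ents).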

Automated marking of Russian syllabic stress

One of the challenges that Russian learners face is the placement of syllabic stress, an essential determinant of pronunciation. Although most pedagogical texts for students have marks indicating stress, practically no texts intended for native speakers do. The placement of stress is inferred from memory and context.

I was delighted to discover Dr. Robert Reynolds' work on natural language processing of Russian text to mark stress based on grammatical analysis of the text. What follows is a brief description of the installation and use of this work. The project page on GitHub has installation instructions, but I found a number of items that needed to be addressed that were not covered there. For example, this project (UDAR) depends on Stanza, which in turn requires a language-specific (Russian) model.

Installation

The first step is to install a few dependencies:

  1. Install the pexpect module:
sudo pip3 install pexpect
  2. Install Stanza:
sudo pip3 install stanza
  3. Install Stanza’s Russian model:
#!/usr/local/bin/python3
import stanza
stanza.download('ru')

Note that my python3 is the Homebrew version, so your hashbang may be different.

  4. The project depends on hfst [1] and vislcg3 [2], which can be installed with a script; I had to download the script and run it in CodeRunner.

  5. Install udar:

sudo pip3 install --user git+https://github.com/reynoldsnlp/udar

Basic usage

See the project page on GitHub for more comprehensive details, but I was quickly able to create my own example by following the documentation. For example:

#!/usr/local/bin/python3
import udar
doc1 = udar.Document('Моя собака внезапно прыгнула на стол.')
print(doc1.stressed())

which prints the correctly-marked Моя соба́ка внеза́пно пры́гнула на сто́л.
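For language-learning material it can be handy to go the other way, removing the marks again or counting them. The sketch below assumes the stress marks are encoded as the combining acute accent (U+0301) following the stressed vowel; strip_stress and count_stress_marks are hypothetical helper names, not part of UDAR.

```python
COMBINING_ACUTE = "\u0301"  # combining acute accent; assumed to be the stress mark


def strip_stress(text):
    """Remove stress marks, leaving the plain spelling."""
    return text.replace(COMBINING_ACUTE, "")


def count_stress_marks(text):
    """Count how many stress marks the text contains."""
    return text.count(COMBINING_ACUTE)


# The stressed sentence from above, with the accents written as escapes.
stressed = "Моя соба\u0301ка внеза\u0301пно пры\u0301гнула на сто\u0301л."
print(strip_stress(stressed))        # Моя собака внезапно прыгнула на стол.
print(count_stress_marks(stressed))  # 4
```

Replacing only U+0301 leaves letters such as й and ё untouched, which a blanket removal of all diacritics would not.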

I’m looking forward to exploring the capabilities of this NLP tool further.

References


  1. Helsinki Finite-State Transducer.

  2. Constraint grammar - implementation CG-3.

sed matching whitespace on macOS

sed is such a useful pattern-matching and substitution tool for work on the command line. But there’s a little quirk on macOS that will trip you up. It tripped me up. On most platforms, \s is the character class for whitespace. It’s ubiquitous in regexes. But on macOS, it doesn’t work. In fact, it silently fails. Consider this bash one-liner which looks like it should work but doesn’t:

should print I am corrupt (W.
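One portable alternative is the POSIX bracket class [[:space:]], which both the BSD sed on macOS and GNU sed understand. The sketch below is an illustration rather than the one-liner from the post: it calls sed from Python, and sed_replace_whitespace is a hypothetical helper name. It assumes sed is on the PATH and that the replacement contains no characters special to sed (such as / or &).

```python
import subprocess


def sed_replace_whitespace(text, replacement="_"):
    """Replace every whitespace character using sed and [[:space:]].

    Assumes `replacement` is a plain string with no sed metacharacters.
    """
    result = subprocess.run(
        ["sed", f"s/[[:space:]]/{replacement}/g"],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


print(sed_replace_whitespace("I am corrupt\n"), end="")
```

Because sed works line by line, the trailing newline is not part of the pattern space and is left alone.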

Partitioning a large directory into subdirectories by size

Since I’m not fond of carrying around all my photos on a cell phone where they’re perpetually at risk of loss, I periodically dump the image and video files to a drive on my desktop for later burning to optical disc. Saving these images in archival form is a hedge against the bet that my existing backup system won’t fail someday. I’m using Blu-ray optical discs to archive these image and video files, and each stores 25 GB of data.
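One way to think about the partitioning is as a bin-packing problem: each disc is a bin with a fixed capacity. The sketch below uses a greedy first-fit-decreasing heuristic, which is a swapped-in illustration and not necessarily the approach the post takes; partition_by_size, DISC_BYTES, and the sample file names and sizes are all made up for the example.

```python
def partition_by_size(files, capacity):
    """Assign (name, size) pairs to bins no larger than `capacity`.

    Greedy first-fit-decreasing: take files largest first and put each
    into the first bin that still has room, opening a new bin if none does.
    Returns a list of bins, each a list of file names.
    """
    bins = []  # each entry: {"free": bytes remaining, "files": [names]}
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        if size > capacity:
            raise ValueError(f"{name} does not fit on a single disc")
        for b in bins:
            if b["free"] >= size:
                b["files"].append(name)
                b["free"] -= size
                break
        else:
            # No existing bin had room, so start a new one.
            bins.append({"free": capacity - size, "files": [name]})
    return [b["files"] for b in bins]


DISC_BYTES = 25 * 10**9  # nominal 25 GB; usable capacity is likely a bit less
sample = [
    ("a.mov", 14 * 10**9),
    ("b.mov", 12 * 10**9),
    ("c.jpg", 9 * 10**9),
    ("d.jpg", 3 * 10**9),
]
print(partition_by_size(sample, DISC_BYTES))
# [['a.mov', 'c.jpg'], ['b.mov', 'd.jpg']]
```

In practice the sizes would come from something like os.path.getsize, and each resulting bin would become one subdirectory.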

More chorus repetition macros for Audacity

In a previous post I described macros to support certain tasks in generating source material for L2 chorus repetition practice. Today, I’ll describe two other macros that automate this practice by slowing the playback speed of the repetition.

Background

I’ve described the rationale for chorus repetition practice in previous posts. The technique I describe here is to slow the sentence playback speed to give the learner time to build speed by practicing slower repetitions.

Audacity macros to support chorus repetition practice

Achieving fluid, native-quality speech in a second language is a difficult task for adult learners. For several years, I’ve used Dr. Olle Kjellin’s method of “chorus repetition” for my Russian language study. In this post, I’m presenting a method for scripting Audacity to facilitate the development of audio source material to support his methodology.

Background

For detailed background on the methodology, I refer you to Kjellin’s seminal paper “Quality Practise Pronunciation with Audacity - The Best Method!

Scripting Apple Music on macOS for chorus repetition practice

This is an update to my previous post on automating iTunes on macOS to support chorus repetition practice. You can read the original post for the theory behind the idea; but in short, one way of developing prosody and quality pronunciation in a foreign language is to do mass repetitions in chorus with a recording of a native speaker. Because in macOS 10.15, iTunes is no more, I’ve updated the script to work with the new Music app.

A meritocracy reading list

Meritocracy has been on everyone’s minds lately, it seems. Reading Daniel Markovits' “The Meritocracy Trap,” I was fully ready to condemn the concept completely. I may be still; but I need to take a moment to think about it more fully. Here’s the problem with condemning meritocracy outright: if we look at ability on a case-by-case basis, would you rather have a well-trained, accomplished pilot or a mediocre one? Would you rather go to a concert performed by a scratchy third-rate violinist or someone whose pedigree includes Juilliard, Curtis, or the like?

A folder-based image gallery for Hugo

Hugo is the platform I use to publish this weblog. Occasionally I have the need to include a collection of images in a post. Mostly this comes up on other sites that I publish. Fancybox can do this; but it wasn’t immediately clear how to direct Fancybox to create a gallery of images in a page based on images in a directory. Previously, I’ve solved this in different ways, but I was anxious to find a simple shortcode-based method.