Beginning to experiment with Stanza for natural language processing

After installing Stanza as a dependency of UDAR, which I recently described, I decided to play around with what it can do.

Installation

The installation is straightforward and is documented on the Stanza getting started page.

First,

sudo pip3 install stanza

Then install a model. For this example, I installed the Russian model:

#!/usr/local/bin/python3
import stanza
stanza.download('ru')

Usage

Part-of-speech (POS) and morphological analysis

Here’s a quick example of POS analysis for Russian. I used PrettyTable to clean up the presentation, but it’s not strictly necessary.

#!/usr/local/bin/python3
import stanza
from prettytable import PrettyTable

tab = PrettyTable()
tab.field_names = ["word","lemma","upos","xpos","features"]
for field_name in tab.field_names:
    tab.align[field_name] = "l"

nlp = stanza.Pipeline(lang='ru', processors='tokenize,pos,lemma')
doc = nlp('Моя собака внезапно прыгнула на стол.')
for sent in doc.sentences:
    for word in sent.words:
        tab.add_row([word.text, word.lemma, word.upos,
                     word.xpos, word.feats if word.feats else "_"])
print(tab)

Note that upos values are the universal part-of-speech tags, whereas xpos values are language-specific part-of-speech tags.
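The features column holds morphological features in the Universal Dependencies style, a single string of `Key=Value` pairs joined by `|`. A minimal sketch of turning that string into a dictionary follows; `parse_feats` is a hypothetical helper of my own, not part of Stanza, and the sample string is made up for illustration rather than copied from real output.

```python
def parse_feats(feats):
    """Turn a UD-style feature string such as 'Case=Nom|Number=Sing' into a dict.

    Returns an empty dict for None, an empty string, or the "_" placeholder.
    """
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Hypothetical feature string, for illustration only
example = "Case=Acc|Gender=Masc|Number=Sing"
print(parse_feats(example))
```

With a dictionary in hand, it becomes easy to filter words by a single feature, for example collecting every word whose `Case` is `Acc`.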

Named-entity recognition

Stanza can also recognize named entities (persons, organizations, and locations) in the text it analyzes:

import stanza
from prettytable import PrettyTable

tab = PrettyTable()
tab.field_names = ["Entity","Type"]
for field_name in tab.field_names:
	tab.align[field_name] = "l"

nlp = stanza.Pipeline(lang='ru', processors='tokenize,ner')
doc = nlp("Владимир Путин живёт в Москве и является Президентом России.")
for sent in doc.sentences:
	for ent in sent.ents:
		tab.add_row([ent.text, ent.type])
print(tab)

which tells us:

I’m excited to see what can be built from this for language-learning purposes.

Automated marking of Russian syllabic stress

One of the challenges that Russian learners face is the placement of syllabic stress, an essential determinant of pronunciation. Although most pedagogical texts for students have marks indicating stress, practically no texts intended for native speakers do. The placement of stress is inferred from memory and context.

I was delighted to discover Dr. Robert Reynolds’ work on natural language processing of Russian text to mark stress based on grammatical analysis of the text. What follows is a brief description of the installation and use of this work. The project page on GitHub has installation instructions, but I found a number of items that needed to be addressed that were not covered there. For example, this project (UDAR) depends on Stanza, which in turn requires a language-specific (Russian) model.

Installation

The first step is to install a few dependencies:

  1. Install the pexpect module:
sudo pip3 install pexpect
  2. Install Stanza:
sudo pip3 install stanza
  3. Install Stanza’s Russian model:
#!/usr/local/bin/python3
import stanza
stanza.download('ru')

Note that my python3 is the Homebrew version, so your hashbang may be different.

  4. The project depends on hfst [1] and vislcg3 [2], which can be installed with a script. I had to download the script and run it in CodeRunner.

  5. Install udar:

sudo pip3 install --user git+https://github.com/reynoldsnlp/udar

Basic usage

See the project page on GitHub for more comprehensive details, but I was quickly able to create my own example following the documentation. For example:

#!/usr/local/bin/python3
import udar
doc1 = udar.Document('Моя собака внезапно прыгнула на стол.')
print(doc1.stressed())

which prints the correctly-marked Моя соба́ка внеза́пно пры́гнула на сто́л.
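Stress marks like these are usually written as a combining acute accent (U+0301) placed after the stressed vowel. Assuming that is what the stressed output contains (I have not checked UDAR's internals), here is a small sketch for removing the marks again and for finding which syllable carries the stress. Both helper functions are my own hypothetical names, not part of udar.

```python
COMBINING_ACUTE = "\u0301"   # combining acute accent used as a stress mark
VOWELS = "аеёиоуыэюя"        # Russian vowel letters


def strip_stress(text):
    """Remove every combining acute accent from the text."""
    return text.replace(COMBINING_ACUTE, "")


def stressed_syllable(word):
    """Return the 1-based number of the stressed syllable, or None if unmarked."""
    syllable = 0
    for i, ch in enumerate(word):
        if ch.lower() in VOWELS:
            syllable += 1
            if i + 1 < len(word) and word[i + 1] == COMBINING_ACUTE:
                return syllable
    return None


marked = "соба" + COMBINING_ACUTE + "ка"   # the word for "dog", stress on the second syllable
print(strip_stress(marked))
print(stressed_syllable(marked))
```

Stripping the marks is handy when you want to compare stressed output against the original text or look a word up in a dictionary.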

I’m looking forward to exploring the capabilities of this NLP tool further.

References


  1. Helsinki Finite-State Transducer.

  2. Constraint grammar - implementation CG-3.

Speaking of regex: sd - a tool to search and displace

With all its peculiarities of syntax, sed leaves a bit to be desired. That’s why I was pleased to find sd, or “Search and Displace”, a tool that does what sed does but with less arcane syntax.

For example:

# prints "Corruption" (W.Barr)
echo "\"Corruption\" by W.Barr" | sd '(.+)\sby\s(.+)' '$1 ($2)'

Or just leave out the \s metacharacter in favour of a literal string:

sd '(.+) by (.+)' '$1 ($2)'

It does all the tricks you’d expect from a sed replacement. For example, named capture groups:

sed matching whitespace on macOS

sed is such a useful pattern-matching and substitution tool for work on the command line. But there’s a little quirk on macOS that will trip you up. It tripped me up. On most platforms, \s is the character class for whitespace. It’s ubiquitous in regexes. But on macOS, it doesn’t work. In fact, it silently fails.

Consider this bash one-liner which looks like it should work but doesn’t:

# should print I am corrupt (W.Barr)
# instead it prints I am corrupt by W.Barr
echo "I am corrupt by W.Barr" | sed -E 's|^(.+)\sby\s(.+)|\1 (\2)|g'

What does work is the character class [:space:]:
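In sed, a POSIX class like this goes inside a bracket expression, so it is written `[[:space:]]` in the pattern. For comparison, here is a sketch of the same substitution in Python, where `\s` behaves as expected regardless of platform. The function name is hypothetical and only there for illustration.

```python
import re


def reformat_attribution(line):
    """Rewrite '<text> by <author>' as '<text> (<author>)' using \\s for whitespace."""
    return re.sub(r"^(.+)\sby\s(.+)", r"\1 (\2)", line)


print(reformat_attribution("I am corrupt by W.Barr"))  # I am corrupt (W.Barr)
```

Note that Python's `re` module does not support POSIX classes such as `[:space:]`, so the two syntaxes are not interchangeable.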

Partitioning a large directory into subdirectories by size

Since I’m not fond of carrying around all my photos on a cell phone where they’re perpetually at risk of loss, I periodically dump the image and video files to a drive on my desktop for later burning to optical disc. Saving these images in archival form is a hedge against the bet that my existing backup system won’t fail someday.

I’m using Blu-ray optical discs to archive these image and video files, and each stores 25 GB of data. So my plan was to split the iPhone image dump into 24 GB partitions.
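One way to approach this kind of split is a greedy "first fit" pass over the files, largest first. The sketch below is not necessarily the method used in the full post; the function names, the 24 GB limit expressed in decimal gigabytes, and the example sizes are all assumptions made for illustration.

```python
import os

GB = 1000 ** 3  # decimal gigabyte; an assumption about how disc capacity is counted


def partition_by_size(sizes, limit):
    """Group (name, size_in_bytes) pairs into partitions whose totals stay within limit.

    Greedy first-fit on sizes sorted in decreasing order. Returns a list of
    partitions, each a list of names.
    """
    bins = []  # each entry is [remaining_capacity, [names]]
    for name, size in sorted(sizes, key=lambda item: item[1], reverse=True):
        if size > limit:
            raise ValueError(f"{name} is larger than the partition limit")
        for entry in bins:
            if entry[0] >= size:
                entry[0] -= size
                entry[1].append(name)
                break
        else:
            # No existing partition has room, so start a new one
            bins.append([limit - size, [name]])
    return [names for _, names in bins]


def file_sizes(directory):
    """Return (name, size_in_bytes) for every regular file directly in directory."""
    return [(e.name, e.stat().st_size) for e in os.scandir(directory) if e.is_file()]


# Small made-up example with a limit of 14 units
print(partition_by_size([("a", 10), ("b", 8), ("c", 6), ("d", 4)], 14))
```

For a real photo dump, something like `partition_by_size(file_sizes(path), 24 * GB)` would give the groups, which could then be moved into numbered subdirectories. First fit is not guaranteed to use the minimum number of discs, but it is simple and usually close.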

A comprehensive system for generating chorus repetition source material from Glossika sentence tracks

Generate iterated tracks at normal, 75% and 50% speeds

In previous posts I described a group of Audacity macros that create iterated tracks for chorus repetition practice. As Dr. Kjellin described, we use Audacity to create a total of six repetitions per track.

Automating the generation of tracks with AppleScript

On macOS, AppleScript is the core technology used to automate routine tasks. In this case, we make extensive use of UI scripting since Audacity is not a scriptable application.

More chorus repetition macros for Audacity

In a previous post I described macros to support certain tasks in generating source material for L2 chorus repetition practice. Today, I’ll describe two other macros that automate this practice by slowing the playback speed of the repetition.

Background

I’ve described the rationale for chorus repetition practice in previous posts. The technique I describe here is to slow the sentence playback speed by applying the Change Tempo... effect in Audacity (see the Change Tempo effect in the Audacity manual), so that the learner can build speed by starting with slower repetitions. In my own practice, I will often begin complex Russian sentences at -50% speed and progress to -25% speed before practicing the pronunciation at native-level speed. Practicing at slow speeds gives the learner time to appreciate how syllables are connected to each other. The prosody is more apparent.
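To get a feel for how much longer a slowed track becomes, here is a small sketch of the arithmetic. It assumes that a percent change of p scales the tempo by (1 + p/100), so that -50% halves the tempo and doubles the length. The function name and the 4-second sentence are hypothetical.

```python
def duration_after_tempo_change(seconds, percent_change):
    """Estimate the new length of a clip after a tempo change given in percent.

    Assumes tempo scales by (1 + percent_change / 100), so -50 doubles the length.
    """
    if percent_change <= -100:
        raise ValueError("percent_change must be greater than -100")
    return seconds / (1 + percent_change / 100)


sentence_length = 4.0  # hypothetical sentence length in seconds
for change in (-50, -25, 0):
    print(change, round(duration_after_tempo_change(sentence_length, change), 2))
```

This kind of estimate is useful when planning how long a full practice track of slowed repetitions will run.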

Audacity macros to support chorus repetition practice

Achieving fluid, native-quality speech in a second language is a difficult task for adult learners. For several years, I’ve used Dr. Olle Kjellin’s method of “chorus repetition” for my Russian language study. In this post, I’m presenting a method for scripting Audacity to facilitate the development of audio source material to support his methodology.

Background

For detailed background on the methodology, I refer you to Kjellin’s seminal paper “Quality Practise Pronunciation with Audacity - The Best Method!” on the subject of chorus repetition practice. The first half of the paper outlines the neurophysiologic rationale for the method, and the second half describes the practical use of the cross-platform tool Audacity to generate source material for this practice.

Scripting Apple Music on macOS for chorus repetition practice

This is an update to my previous post on automating iTunes on macOS to support chorus repetition practice. You can read the original post for the theory behind the idea; but in short, one way of developing prosody and quality pronunciation in a foreign language is to do mass repetitions in chorus with a recording of a native speaker.

Because in macOS 10.15, iTunes is no more, I’ve updated the script to work with the new Music app. It turns out that it’s a lot simpler. No need to dive into the application classes.