Beginning to experiment with Stanza for natural language processing

After installing Stanza as a dependency of UDAR, which I recently described, I decided to play around with what it can do.

Installation

The installation is straightforward and is documented on the Stanza getting started page.

First, install the package:

sudo pip3 install stanza

Then install a model. For this example, I installed the Russian model:

#!/usr/local/bin/python3
import stanza
stanza.download('ru')

Usage

Part-of-speech (POS) and morphological analysis

Here’s a quick example of POS analysis for Russian. I used PrettyTable to clean up the presentation, but it’s not strictly necessary.

#!/usr/local/bin/python3
import stanza
from prettytable import PrettyTable

tab = PrettyTable()
tab.field_names = ["word","lemma","upos","xpos","features"]
for field_name in tab.field_names:
    tab.align[field_name] = "l"

nlp = stanza.Pipeline(lang='ru', processors='tokenize,pos,lemma')
doc = nlp('Моя собака внезапно прыгнула на стол.')
for sent in doc.sentences:
    for word in sent.words:
        tab.add_row([word.text, word.lemma, word.upos,
                     word.xpos, word.feats if word.feats else "_"])
print(tab)

Note that upos holds the universal part-of-speech tags, whereas xpos holds the language-specific ones.
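The features column is a single string of Name=Value pairs separated by "|", following the Universal Dependencies convention. If you want to query individual features, such as case or gender, it can help to split that string into a dictionary. Here is a minimal sketch; the helper name parse_feats and the sample string are my own illustrations, not part of Stanza:

```python
def parse_feats(feats):
    """Turn a UD feature string such as 'Case=Nom|Number=Sing' into a dict.

    Returns an empty dict when there are no features (None, '' or '_').
    """
    if not feats or feats == "_":
        return {}
    # Each pair looks like 'Name=Value'; split on the first '=' only.
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Hand-written sample in the same format as word.feats (not real output)
sample = "Case=Nom|Gender=Fem|Number=Sing"
features = parse_feats(sample)
print(features.get("Case"))
```

Inside the loop above, you could call parse_feats(word.feats) to filter words by a particular feature rather than just printing the raw string.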

Named-entity recognition

Stanza can also recognize named entities, such as persons, organizations, and locations, in the text it analyzes:

import stanza
from prettytable import PrettyTable

tab = PrettyTable()
tab.field_names = ["Entity","Type"]
for field_name in tab.field_names:
    tab.align[field_name] = "l"

nlp = stanza.Pipeline(lang='ru', processors='tokenize,ner')
doc = nlp("Владимир Путин живёт в Москве и является Президентом России.")
for sent in doc.sentences:
    for ent in sent.ents:
        tab.add_row([ent.text, ent.type])
print(tab)

This prints a table listing each recognized entity alongside its type.
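For anything beyond a quick look, it may be handier to group the entities by type than to print them row by row. The sketch below assumes only that each entity has text and type attributes, as in the loop above. The Entity namedtuple and the sample values are hypothetical stand-ins written by hand so the snippet runs without loading a model; they are not actual Stanza output.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for the entity objects used in the loop above;
# it only mimics the two attributes we need (text and type).
Entity = namedtuple("Entity", ["text", "type"])


def group_by_type(entities):
    """Collect entity texts into lists keyed by entity type."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent.type].append(ent.text)
    return dict(grouped)


# Hand-written sample data, for illustration only
sample_ents = [
    Entity("Владимир Путин", "PER"),
    Entity("Москве", "LOC"),
    Entity("России", "LOC"),
]
grouped = group_by_type(sample_ents)
print(grouped)
```

With a real pipeline, the same function should accept doc.ents directly, since those objects expose the same two attributes.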

I’m excited to see what can be built from this for language-learning purposes.