Beginning to experiment with Stanza for natural language processing
After installing Stanza as a dependency of UDAR, which I recently described, I decided to play around with what it can do.
Installation
The installation is straightforward and is documented on the Stanza getting started page.
First, install the package with pip:
sudo pip3 install stanza
Then install a model. For this example, I installed the Russian model:
#!/usr/local/bin/python3
import stanza
stanza.download('ru')
Usage
Part-of-speech (POS) and morphological analysis
Here’s a quick example of POS analysis for Russian. I used PrettyTable to clean up the presentation, but it’s not strictly necessary.
#!/usr/local/bin/python3
import stanza
from prettytable import PrettyTable
# Set up a left-aligned table for the output
tab = PrettyTable()
tab.field_names = ["word","lemma","upos","xpos","features"]
for field_name in tab.field_names:
    tab.align[field_name] = "l"
# Build a pipeline with only the processors needed for POS tags and lemmas
nlp = stanza.Pipeline(lang='ru', processors='tokenize,pos,lemma')
doc = nlp('Моя собака внезапно прыгнула на стол.')
# One row per word; "_" stands in for words that have no features
for sent in doc.sentences:
    for word in sent.words:
        tab.add_row([word.text, word.lemma, word.upos,
                     word.xpos, word.feats if word.feats else "_"])
print(tab)
Note that upos holds the universal part-of-speech tags, whereas xpos holds the language-specific ones.
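The features column arrives as a single string in Universal Dependencies notation: key=value pairs separated by a vertical bar. If you want to look up one feature, such as case or gender, it may help to split that string into a dictionary. Here is a minimal sketch. The helper name parse_feats and the sample string are my own illustrations and are not part of Stanza.

```python
def parse_feats(feats):
    """Split a feature string such as 'Case=Nom|Number=Sing' into a dict.

    parse_feats is a hypothetical helper for illustration, not a Stanza API.
    """
    # Stanza leaves feats empty for words without features; the table
    # example substitutes "_", so treat both as "no features".
    if not feats or feats == "_":
        return {}
    result = {}
    for item in feats.split("|"):
        # partition() tolerates an item with no "=": the value is then ""
        key, _, value = item.partition("=")
        result[key] = value
    return result


# Hand-written sample string in Universal Dependencies notation
features = parse_feats("Case=Nom|Gender=Fem|Number=Sing")
print(features["Case"])
```

In the loop over sent.words, the same helper could be called as parse_feats(word.feats).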
Named-entity recognition
Stanza can also recognize named entities (persons, organizations, and locations) in the text it analyzes:
import stanza
from prettytable import PrettyTable
# Set up a left-aligned table for the output
tab = PrettyTable()
tab.field_names = ["Entity","Type"]
for field_name in tab.field_names:
    tab.align[field_name] = "l"
# The NER processor needs tokenized input, so tokenize runs first
nlp = stanza.Pipeline(lang='ru', processors='tokenize,ner')
doc = nlp("Владимир Путин живёт в Москве и является Президентом России.")
# One row per recognized entity
for sent in doc.sentences:
    for ent in sent.ents:
        tab.add_row([ent.text, ent.type])
print(tab)
This prints a table with one row per recognized entity, alongside its type.
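The entity and type pairs can also be collected into a dictionary rather than a table, which may be easier to reuse later. Below is a rough sketch, assuming the pairs have already been pulled out of the document. The function group_entities is a hypothetical helper, and the sample pairs and their type labels are typed in by hand for illustration, not actual Stanza output.

```python
from collections import defaultdict


def group_entities(entities):
    """Group (text, type) pairs by type, dropping repeated mentions.

    group_entities is a hypothetical helper for illustration.
    """
    grouped = defaultdict(list)
    for text, ent_type in entities:
        # Keep first-seen order and skip entities already recorded
        if text not in grouped[ent_type]:
            grouped[ent_type].append(text)
    return dict(grouped)


# Hand-typed sample pairs (assumed labels, not real model output).
# With a real document they could be built as:
#   [(ent.text, ent.type) for sent in doc.sentences for ent in sent.ents]
sample = [("Анна", "PER"), ("Казань", "LOC"), ("Анна", "PER")]
by_type = group_entities(sample)
print(by_type)
```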
I’m excited to see what can be built from this for language-learning purposes.