Interlinear glossing: dealing with punctuation

In a previous post I presented a CSS-only solution to interlinear glossing. It may be preferable to others such as leipzig.js or interlinear.js because both of those libraries assume a different annotation purpose than what I envision for my app. But whereas those libraries deal with punctuation gracefully, my CSS-only approach does not. So we end up with something like this:

where the PUNCT nodes end up standing alone. These extra punctuation nodes add nothing to the understanding of the text and look ragged.

What I would really like is for the punctuation marks to live with the previous element and for the markup to go away. A little jQuery helps here. The basic strategy is this:

  1. Find the p.pos nodes and select the ones containing PUNCT.
  2. Loop over the p.pos punctuation nodes and find their parent node, which we’re going to delete.
  3. Find the previous sibling of the punctuation div.
  4. Append the punctuation mark onto the p.ru of the previous sibling div.
  5. Remove the punctuation div from the DOM.

The result looks like this:

The visual appearance is much better now, I think.

The CSS and HTML example code are as presented previously. Here’s the jQuery code we use to move the punctuation around.

$(function() {
    // runs once the DOM is ready
    $('p.pos').filter(function() {
        // keep only the part-of-speech labels that read PUNCT
        return $(this).text().trim().toLowerCase() === 'punct';
    }).each(function() {
        // `this` is a punctuation p.pos; its parent div is the one we will delete
        const punctDiv = $(this).parent();
        // get the exact punctuation mark in use
        const punctMark = punctDiv.children('.ru').first().text();
        // find the previous div because that's where we need to
        // add back the punctuation mark
        const punctPrevDiv = punctDiv.prev();
        // the p.ru child of that previous div
        const punctRuP = punctPrevDiv.children('.ru').first();
        // if there is no previous word, leave this mark where it is
        if (punctRuP.length === 0) {
            return;
        }
        // glom the punctuation mark onto previous p.ru (as text, not HTML)
        punctRuP.append(document.createTextNode(punctMark));
        // remove the PUNCT div from the DOM
        punctDiv.remove();
    });
});

There is a JSFiddle to play with if this is helpful. There’s still much more to do in my project, integrating various pieces, but it’s beginning to take shape.

Three-line (though non-standard) interlinear glossing

I’m still thinking about interlinear glossing for my language learning project. The leipzig.js library is great but my use case isn’t really what the author had in mind. I really just need to display a unit consisting of the word as it appears in the text, the lemma for that word form, and (possibly) the part of speech. For academic linguistics purposes, what I have in mind is completely non-standard.

The other issue with leipzig.js for my use case is that I need to be able to respond to click events on individual words so that they can be tagged, defined or otherwise worked with. It’s straightforward how I could apply CSS id attributes to word-level elements to support that functionality.

So I’m back to a CSS-only solution.

Here’s what a three-line CSS-only interlinear glossing display might look like:

You can find the code (in progress, as always) in a JSFiddle.

One of my priorities is going to be dealing with punctuation. It looks messy and unrefined right now. First, the punctuation marks need to be glommed onto the previous word rather than standing alone. Second, there’s no need to display either a lemma or a POS for punctuation marks. It’s going to need either JavaScript running on the page to dynamically deal with the UI, or something on the backend. Most likely the former.
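As a rough sketch of the second option, the merge could happen on the token data before any markup is generated, rather than in the DOM. The token shape used here (`ru`, `lemma`, `pos`) and the function name `mergePunctuation` are assumptions made for illustration, not part of the project's actual code.

```javascript
// Hypothetical token shape: { ru: surface form, lemma: dictionary form, pos: POS tag }.
// Returns a new list in which each PUNCT token is folded into the preceding word.
function mergePunctuation(tokens) {
    const merged = [];
    for (const token of tokens) {
        const isPunct = token.pos.trim().toLowerCase() === 'punct';
        if (isPunct && merged.length > 0) {
            // attach the mark to the previous word's surface form;
            // the punctuation token's own lemma and POS are dropped
            merged[merged.length - 1].ru += token.ru;
        } else {
            // ordinary word (or leading punctuation with nothing before it):
            // copy it so the caller's objects are not mutated
            merged.push({ ...token });
        }
    }
    return merged;
}

const sample = [
    { ru: 'Я', lemma: 'я', pos: 'PRON' },
    { ru: 'читаю', lemma: 'читать', pos: 'VERB' },
    { ru: 'книгу', lemma: 'книга', pos: 'NOUN' },
    { ru: '.', lemma: '.', pos: 'PUNCT' },
];
console.log(mergePunctuation(sample).map((t) => t.ru));
// → [ 'Я', 'читаю', 'книгу.' ]
```

Doing the merge on data keeps the rendered markup free of PUNCT units entirely, at the cost of needing access to the tokens before the page is built.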

Splitting text into sentences: Russian edition

Splitting text into sentences is one of those tasks that looks simple but on closer inspection is more difficult than you think. A common approach is to use regular expressions to divide up the text on punctuation marks. But without adding layers of complexity, that method fails on some sentences. This is a method using spaCy.
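To illustrate the kind of failure meant here, this is a minimal sketch of a naive regex splitter tripping over a Russian abbreviation. The function name `naiveSplit` and the sample text are made up for the example; the abbreviation "г." (год, "year") ends in a period but does not end the sentence.

```javascript
// Naive splitter: break wherever ., ! or ? is followed by whitespace.
function naiveSplit(text) {
    return text.split(/(?<=[.!?])\s+/);
}

const text = 'Он родился в 1990 г. в Москве. Сейчас он живёт в Казани.';
console.log(naiveSplit(text));
// → [ 'Он родился в 1990 г.', 'в Москве.', 'Сейчас он живёт в Казани.' ]
```

The text has two sentences, but the splitter returns three pieces because it cannot tell an abbreviation's period from a sentence-final one. Handling that with regular expressions alone means maintaining lists of abbreviations and exceptions, which is the added complexity mentioned above.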

My favourite Cyrillic font

I’ve tried a lot of fonts for Cyrillic. My favourite is Georgia. As a non-native Russian speaker, there’s something about serif fonts, either on-screen or in print, that makes the text so much more legible.

The cancellation of Russian music

Free speech in Russia has never been particularly favoured. The Romanov dynasty remained in power long past their expiration date by suppressing waves of free thought, from the ideals of the Enlightenment, to the anti-capitalist ideals of Marx and Engels. At least, until the 1917 Revolution. And even then, the Bolsheviks continued to suppress dissent for the entire seventy-something year history of the Soviet Union. Perestroika and the collapse of the Soviet Union promised change.

Bash variable scope and pipelines

I alluded to this nuance involving variable scope in my post on automating PDF processing, but I wanted to expand on it a bit. Consider this little snippet:

i=0
# -E so that + means "one or more"; in basic regex mode it is a literal +
printf "foo:bar:baz:quux" | grep -oE '[^:]+' | while read -r line ; do
    printf "Inner scope: %d - %s\n" "$i" "$line"
    ((i++))
    [ "$i" -eq 3 ] && break
done
printf "====\nOuter scope\ni = %d\n" "$i"

If you run this not in interactive mode in the shell, but as a script, what will i be in the outer scope?

Automating the handling of bank and financial statements

In my perpetual effort to get out of work, I’ve developed a suite of automation tools to help file statements that I download from banks, credit cards and others. While my setup described here is tuned to my specific needs, any of the ideas should be adaptable for your particular circumstances. For the purposes of this post, I’m going to assume you already have Hazel. None of what follows will be of much use to you without it.