sterilize-ng: a command-line URL sterilizer

Introducing sterilize-ng [GitHub link] - a URL sterilizer made to work flexibly on the command line.

Background

The surveillance capitalist economy is built on the relentless tracking of users. Imagine going about town running errands but everywhere you go, someone is quietly following you. When you pop into the grocery, they examine your receipt. They look into the bags to see what you bought. Then they hop in the car with you and keep careful records of where you go, how fast you drive, whom you talk with on the phone. This is surveillance capitalism - the relentless “digital exhaust” left by our actions online.

The techniques employed by surveillance capitalists are manifold, but one of the easiest to fix is the pollution of URLs with tracking parameters. When you click a link on Facebook, you are giving up a wealth of information about yourself unnecessarily. Here’s a typical outgoing link that you would find on Facebook:
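It looks something like this (the fbclid value below is a made-up placeholder for the real tracking token, which is a long opaque string):

https://www.playsmart.ca/social-hub/the-missing-millions/?fbclid=XXXXXXXXXXXXXXXXXXXXXXXX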

What is all this extra garbage that Facebook attaches to the actual link? Who knows? Somehow Facebook uses this to track your online behaviour. Otherwise, they would just display the actual link, which is quite simply: https://www.playsmart.ca/social-hub/the-missing-millions/

When you click on a link in Facebook or on Google[1] search results, these surveillance capitalists use tracking parameters to follow you around the web, serve ads to you and generally spy on you. The problem in avoiding this sort of surveillance is that they don’t show you their god-awful links transparently. Instead they silently attach all this garbage and hope you won’t notice.

Prerequisites

This was developed on macOS; much of the code should work as-is on Linux, but I don’t have a system to test it on. On macOS, I would suggest installing Homebrew so that you can install proxychains-ng. Using proxychains, you can anonymize the expansion of shortened links. If proxychains-ng is not installed, the script will just expand the shortened links without hiding behind proxies.
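The expansion step can be done along these lines - a sketch, not necessarily how sterilize-ng.sh implements it, and the shortened URL here is a made-up example:

#!/bin/bash

short='https://bit.ly/xxxxxx'   # hypothetical shortened link

if command -v proxychains4 >/dev/null 2>&1; then
    # proxychains-ng is available - expand the link from behind the configured proxies
    proxychains4 -q curl -sIL -o /dev/null -w '%{url_effective}\n' "$short"
else
    # no proxychains-ng - expand the link directly
    curl -sIL -o /dev/null -w '%{url_effective}\n' "$short"
fi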

Usage

I’ve made use of sterilize-ng.sh by installing it in a Keyboard Maestro macro. I right-click on a link, copy it to the clipboard, and invoke the KM macro. Then I just paste the sterilized link into a browser. It’s not as easy as just clicking links, but it’s safer and I feel like I’m doing my part to thwart surveillance capitalism.

This is a work in progress. You can find the repository at GitHub: https://github.com/NSBum/sterilize-ng. Feel free to fork the repo and adapt it to your needs. Pull requests are welcome.

Testing

You can run a test suite of sorts using the links in the test_links_sterilize.csv file. These are just pairs of URLs - original (unsterile) and sterilized. To use the testing facility, run test_sterilize_links.sh.
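For reference, a minimal sketch of what such a driver can look like - it assumes sterilize-ng.sh takes the URL as a command-line argument, which may not match the actual script’s interface:

#!/bin/bash

# Sketch of a test driver: each CSV line is "original,expected".
# Assumes sterilize-ng.sh accepts the URL as an argument and prints the sterilized URL.
while IFS=',' read -r original expected; do
    result=$(./sterilize-ng.sh "$original")
    if [[ "$result" == "$expected" ]]; then
        echo "PASS $original"
    else
        echo "FAIL $original -> $result (expected $expected)"
    fi
done < test_links_sterilize.csv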


  1. Why are you still using Google? Seriously, change your search engine to DuckDuckGo. ↩︎

Using Perl in Keyboard Maestro macros

One of the things that I love about Keyboard Maestro is the ability to chain together disparate technologies to achieve some automation goal on macOS.

In most of my previous posts about Keyboard Maestro macros, I’ve used Python or shell scripts, but I decided to draw on some decades-old experience with Perl to do a little text processing for a specific need.

Background

I want this text from Wiktionary:

to look like this:

- `по проше́ствии пяти́ лет` - after five years had elapsed; five years later

so that I can then render this Markdown into HTML on my Anki cards.

That’s it. Simple: I would just highlight the block in the browser, copy it, and let Keyboard Maestro reformat the text.

Splitting the text into lines

Not knowing how the lines were split, I started by analyzing the string on the clipboard character-by-character.

# $str holds the clipboard text passed in from Keyboard Maestro
my @chars = split("", $str);
foreach (@chars) {
   printf("%02x ", ord($_));
}

which shows:

With that information in hand, we know that the line separator is \x0A.

Now we can easily split the string on that character and reformat. So the core of the macro will be:

#!/usr/bin/perl

# Keyboard Maestro passes its variables to scripts as KMVAR_-prefixed environment variables
my $str = $ENV{KMVAR_ruword};
my @lines = split("\x0A", $str);
printf("- `%s` - %s", $lines[0], $lines[2]);

Now we just need to get the clipboard into the variable ruword, pass the results of the Perl script back to the clipboard, and paste.

Stripping Russian stress marks from text on the command line

Russian text intended for learners sometimes contains marks that indicate the syllabic stress. The stress is usually rendered as a vowel plus a combining diacritical mark, typically the combining acute accent U+0301. Here are a couple of ways of stripping these marks on the command line.

First, a version using Perl:

#!/bin/bash

f='покупа́ешья́';
echo $f | perl -C -pe 's/\x{301}//g;'

And then another using the sd tool:

#!/bin/bash

f='покупа́ешья́';
echo $f | sd "\u0301" ""

Both rely on finding the combining diacritical mark and removing it with regex.

Splitting a string on the command line - the search for the one-liner

It seems like the command line is one of those places where you can accomplish crazy efficient things with one-liners.

Here’s a perfect use case for a CLI one-liner:

In Anki, I often add lists of synonyms and antonyms to my vocabulary cards, but I like them formatted as a bulleted list. My usual route to that involves Markdown. But how to convert this:

известный, точный, определённый, достоверный

to

- `известный`
- `точный`
- `определённый`
- `достоверный`

After trying to come up with a single text replacement strategy to make this work, the best I could do was this:
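One possible version, as a sketch (the original one-liner may well differ), uses Perl’s -C switch to keep the Cyrillic text in UTF-8:

#!/bin/bash

# Split the comma-separated list into lines, then wrap each line as a Markdown bullet with backticks
echo 'известный, точный, определённый, достоверный' \
    | perl -C -pe 's/\s*,\s*/\n/g; s/^(.+)$/- `$1`/mg'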

A Keyboard Maestro macro to edit an Anki sound file

Often when I import a pronunciation file into Anki, from Forvo for example, the volume isn’t quite right or there’s a lot of background noise, and I want to edit the sound file. How?

The solution for me, as is often the case, is a Keyboard Maestro macro.

Prerequisites

  • Keyboard Maestro - if you are a macOS power user and don’t have KM, then you’re missing out on a lot.
  • Audacity - the multi-platform FOSS audio editor

Outline of the approach

Since Keyboard Maestro won’t know the path to our file in Anki’s collection.media directory, we have to find it. But the first task is to extract the filename. In the Anki note field, it’s going to have this format:
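[sound:file.mp3]

That is, Anki’s standard sound tag. Extracting the filename is then a matter of capturing whatever sits between the colon and the closing bracket; a sketch (the field contents below are a made-up example):

#!/bin/bash

# The note field holds an Anki sound tag like [sound:file.mp3]; capture the filename inside it
field='[sound:pronunciation_ru_спасибо.mp3]'   # hypothetical field contents
fname=$(echo "$field" | perl -C -ne 'print $1 if /\[sound:([^]]+)\]/')
echo "$fname"   # -> pronunciation_ru_спасибо.mp3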

Querying the Anki database when the application is running

When the Anki application is open on the desktop, it places a lock on the sqlite3 database such that it can’t be queried by another process. One workaround is to try to open the database and, if that fails, make a temporary copy and query that. Of course, this only works with read-only queries. Here’s the basic strategy:

#!/usr/local/bin/python3
# -*- coding: utf-8 -*-

# requires python >= 3.8 to run because of anki module

from anki import Collection, errors

# placeholder - set this to the path of your collection.anki2 file
COLLECTION_PATH = '/path/to/collection.anki2'

if __name__ == "__main__":
    try:
        col = Collection(COLLECTION_PATH)
    except errors.DBError:
        # anki is open and holds the lock - copy the db to a temp file and query that
        import tempfile
        import shutil
        import os

        with tempfile.TemporaryDirectory() as tmpdir:
            dst = os.path.join(tmpdir, 'collectiontemp.anki2')
            shutil.copy(COLLECTION_PATH, dst)
            col = Collection(dst)
            # do something with the Anki db

Note that the tempfile context manager will discard the temporary database when the with block exits. If there are actions on the collection that are common to the Anki-is-open and Anki-is-not-open paths, those should be abstracted into a separate function.

Normalizing spelling in Russian words containing the letter ё

The Russian letters ё and е have a complex and troubled relationship. The two letters are pronounced differently, but usually appear the same in written text. This presents complications for Russian learners and for text-to-speech systems. In several recent projects, I have needed to normalize the spelling of Russian words. For example, if I have the written word определенно, is the word actually определенно? Or is it определённо?

This was a larger challenge than I imagined. Apart from udar, I failed to find any off-the-shelf solutions to what I call normalizing the spelling of words that should be spelled with ё. It turns out that the Russian-language Wiktionary respects URLs whether spelled with ё or е. Therefore, one way of normalizing the spelling is to query Wiktionary and grab the headword from the page. Normally I don’t like creating this sort of dependency, but it’s the only solution that has presented itself so far. Here’s the approach I took:
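In outline: request the word’s page on ru.wiktionary.org using the е spelling, let Wiktionary resolve it, and read the headword back out of the returned page. A minimal sketch - it assumes the normalized spelling can be read from the page’s firstHeading element, which is where MediaWiki puts the page title:

#!/bin/bash

# Ask Russian Wiktionary for the е-spelled word and read back the page title,
# which carries the canonical ё spelling (assumes the title is in the firstHeading element)
word='определенно'
curl -sL "https://ru.wiktionary.org/wiki/${word}" \
    | perl -C -0777 -ne 'if (m{<h1[^>]*id="firstHeading".*?</h1>}s) { ($h = $&) =~ s/<[^>]+>//g; print "$h\n"; }'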

Scraping Russian word definitions from Wiktionary: utility for Anki

While my Russian Anki deck contains around 27,000 cards, I’m always making more. (There are a lot of words in the Russian language!) Over the years, I’ve become more and more efficient with card production, but one of the missing pieces was finding a code-readable source of word definitions. There’s no shortage of dictionary sites, but scraping data from any site is complicated by the ways in which front-end developers spread the semantic content across multiple HTML tags arranged in deep and cryptic hierarchies. Yes, we can cut and paste, but my quest is about nearly completely automating quality card production. This is a quick post on a method for scraping word definitions from Wiktionary.

Encoding of the Cyrillic letter й - a UTF-8 gotcha

In the process of writing and maintaining a service that checks Russian word frequencies, I noticed a peculiar phenomenon: certain words could not be located in a sqlite database that I knew actually contained them. For example, a query for the word английский consistently failed, whereas other words would succeed. Eventually the commonality between the failures became obvious: all of the failures contained the letter й, which led me down a rabbit hole of character encoding and this specific case where it can go astray.
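The gotcha lurking here: й can be stored either as the single precomposed code point U+0439 or, in decomposed form, as и followed by the combining breve U+0306. The two render identically but compare as different strings. If that is the mismatch, normalizing everything to NFC before querying makes the lookups succeed; a sketch of the normalization step on the command line:

#!/bin/bash

# й may be precomposed (U+0439) or decomposed (и + U+0306); they look the same but differ byte-wise
# Normalize to NFC before using the word in a database query
w='английский'
echo "$w" | perl -C -MUnicode::Normalize -pe '$_ = NFC($_)'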

Dynamic DNS - auto-updating from macOS

To run a little project (that I’ll describe at some point in the future) I have to run a small web server from my home computer, one that happens to run macOS. More than anything else, this is just a record of what I did to get it running in case: a) I have to do it again, or b) someone else finds it useful.

Sign up for dynamic DNS service

I signed up for service with dynv6 because I saw it recommended elsewhere and it didn’t look creepy like some of the other options. I just signed up with email - through an email proxy anonymizer, because I’m paranoid. After verifying my email, I was able to create a new “zone”, basically a record of my public IP address linked to custom DNS.