Scraping Forvo pronunciations

November 07, 2022

Most language learners are familiar with Forvo, a site that allows users to download and contribute pronunciations for words and phrases. For my Russian studies, I make daily use of the site. In fact, to facilitate my Anki card-making workflow, I am a paid user of the Forvo API. But that’s where the trouble started.

When the Forvo API works, it works OK, though it is often extremely slow. Lately, though, it has been down more often than up. To patch my workflow and keep downloading Russian word pronunciations, I wrote this little scraper. I’d prefer to use the API, but experience has shown that it is slow and unreliable. I’ll keep paying for API access because I support what the company does. As often as not, a company that offers a free service is involved in surveillance capitalism, so I’d rather companies offer a reliable product at a reasonable price.

There are other such projects in the open-source world. One interesting feature of this one is that it attempts to rank pronunciations with a scoring system based on whether the contributing user is a favourite and how many votes the pronunciation has gained.¹

If you just want to get started with the scraper, it’s up on GitHub. I’m open to pull requests if you have something interesting to contribute, or honestly, you can do whatever you would like with it. I’d appreciate a little acknowledgement if you adopt the code in your own work.

If you want to stick around and see how I did things, feel free to follow along.

Approach

This scraper uses Selenium for Python, which drives a full (non-headless) browser to load the page and capture the HTML that we’re going to scrape. The idea is to future-proof the script against attempts by the company to detect script-based access. It was also a chance to learn how to integrate Selenium with Beautiful Soup 4, my go-to scraping technology for Python. First, we’re going to need to import those dependencies:

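import time                   # used for the short pause in get_forvo_page
from typing import Optional   # used in the pronunciation_for_li signature

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support.ui import WebDriverWait

from bs4 import BeautifulSoup, element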

We use command-line arguments in the script, so we will need the argparse infrastructure to get our download destination and the word we’re researching:

import argparse

if __name__ == "__main__":
   prog_desc = "Download pronunciation file from Forvo"
   # The text is a description of the program, not its name
   parser = argparse.ArgumentParser(description=prog_desc)
   parser.add_argument('--dest',
                       help="The directory for download",
                       type=str)
   parser.add_argument('--word',
                       help="Word to research",
                       type=str)

   args = parser.parse_args()
   word = args.word
   dest = args.dest
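
One step not shown above is turning the word into the URL of its Forvo page. Here is a minimal sketch of how that might look. The URL pattern and the helper name forvo_url are assumptions for illustration; they are not taken from the script itself.

```python
from urllib.parse import quote

# Assumed URL pattern; the post does not show how the page URL is built.
FORVO_WORD_URL = "https://forvo.com/word/{word}/#{lang}"


def forvo_url(word: str, lang: str = "ru") -> str:
    """Build the pronunciation-page URL for a word (hypothetical helper)."""
    # Percent-encode the word so Cyrillic letters and spaces are safe in the path
    return FORVO_WORD_URL.format(word=quote(word.strip(), safe=""), lang=lang)


print(forvo_url("test"))
```

Percent-encoding matters here because most of the words I look up are Cyrillic, and phrases contain spaces.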

Capturing the Forvo page

We’re using Selenium to capture the HTML content of the pronunciation list page so that we can analyze it:

def get_forvo_page(url: str) -> BeautifulSoup:
   """Get the bs4 object from Forvo page
   
   url: str - the Forvo pronunciation page

   Returns
   -------
   BeautifulSoup 4 object for the page
   """
   driver = webdriver.Safari()
   try:
      driver.get(url)
      driver.implicitly_wait(30)
      # Dismiss the privacy-consent dialog; querySelector returns None if absent
      agree_button = driver.execute_script("""return document.querySelector("button[mode='primary']");""")
      try:
         agree_button.click()
      except AttributeError:
         pass
      # Close any other pop-up that may be covering the page
      try:
         close_button = driver.execute_script("""return document.querySelector("button.mfp-close");""")
         close_button.click()
      except AttributeError:
         pass
      time.sleep(1)
      soup = BeautifulSoup(driver.page_source, 'html.parser')
   finally:
      # Always release the browser session, even if something above fails
      driver.quit()
   return soup

Note that this uses Safari because I’m on macOS; if you are on another platform, substitute a different driver such as webdriver.Chrome() or webdriver.Firefox().

Note also that we have to respond to the privacy and other pop-ups that try to ruin our day. We could probably do it without the JavaScript, but that’s what came to me in the moment.
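
For the curious, here is a rough sketch of what the non-JavaScript route might look like, using Selenium's own find_elements lookup. The helper names click_if_present and dismiss_popups are hypothetical, and the locator string is written out by hand only so that the sketch runs without Selenium installed; in real code you would pass By.CSS_SELECTOR.

```python
# "css selector" is the string value of selenium's By.CSS_SELECTOR; it is
# spelled out here only so this sketch has no third-party imports.
CSS_SELECTOR = "css selector"


def click_if_present(driver, selector: str) -> bool:
    """Click the first element matching selector; return True if one was clicked."""
    # find_elements returns an empty list (no exception) when nothing matches
    matches = driver.find_elements(CSS_SELECTOR, selector)
    if not matches:
        return False
    matches[0].click()
    return True


def dismiss_popups(driver) -> None:
    """Try the same two buttons that the JavaScript version looks for."""
    for selector in ("button[mode='primary']", "button.mfp-close"):
        click_if_present(driver, selector)
```

One caveat: with an implicit wait of 30 seconds configured, find_elements waits that long before giving up when a pop-up is absent, so you may want a shorter wait if you go this way.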

<ul class="pronunciations-list pronunciations-list-ru" id="pronunciations-list-ru">
   <li class="pronunciation li-active">
      <!-- detail for this pronunciation -->
   </li>
   <!-- more <li> elements, one per pronunciation -->
</ul>

All of the Russian-language pronunciations are in the unordered list with the id pronunciations-list-ru, so our next task is to find that ul element and enumerate it. Fortunately, Beautiful Soup makes that incredibly easy:

ru_pronunciation_list = soup.find("ul", {"id": "pronunciations-list-ru"})
if ru_pronunciation_list is None:
   exit('ERROR - this word may not exist on Forvo!')

Then we can loop over ru_pronunciation_list to find all of the pronunciation list items (li) and accumulate them as our custom Pronunciation objects:

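pronunciations = []
for li in ru_pronunciation_list.find_all("li"):
   pronunciation = pronunciation_for_li(li)
   # Skip list items that lack the required info (the helper returns None)
   if pronunciation is not None:
      pronunciations.append(pronunciation)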
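
If you want to try the lookup without launching a browser, you can feed Beautiful Soup a small hand-written fragment modelled on the markup above. The second list item and the list id ending in -xx are made up for the demonstration.

```python
from bs4 import BeautifulSoup

# Hand-written sample modelled on the Forvo markup; not a real page capture
SAMPLE_HTML = """
<ul class="pronunciations-list pronunciations-list-ru" id="pronunciations-list-ru">
   <li class="pronunciation li-active"><!-- detail --></li>
   <li class="pronunciation"><!-- detail --></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Find the list by id, then collect its pronunciation items
ru_list = soup.find("ul", {"id": "pronunciations-list-ru"})
items = ru_list.find_all("li", class_="pronunciation")
print(len(items))

# A list that is not on the page comes back as None rather than raising
missing = soup.find("ul", {"id": "pronunciations-list-xx"})
print(missing is None)
```

Filtering on class_="pronunciation" is optional, but it keeps any nested li elements from being picked up, since find_all searches all descendants.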

Next, we’ll take a look at what pronunciation_for_li does with each of those <li> elements:

def pronunciation_for_li(li: element.Tag) -> Optional[Pronunciation]:
   """Pronunciation object from its <li> element

   Returns an optional Pronunciation object from a
   <li> element that contains the required info.

   Returns
   -------
   Pronunciation object, or None
   """
   info_span = li.find("span", {"class": "info"})
   if info_span is None:
      # Without the info span there is no user to attribute
      return None
   user = user_from_info_span(info_span)
   votes = num_votes_from_li(li)
   url = audio_link_for_li(li)
   if url is None:
      return None
   return Pronunciation(user, votes, url)

Here we’re just extracting the vote count, username and audio file link from the deeper levels of the hierarchy, which is left as an exercise for the reader. One detail that bears mentioning is how we extract the link to the .ogg file. Each pronunciation has a play button with an onclick attribute. The JavaScript in that attribute carries a base64-encoded value which, once decoded, is a component of the audio file path.
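
To make the base64 step concrete, here is a minimal sketch under some loud assumptions: the shape of the onclick string, the position of the argument that holds the .ogg path, and the base URL are all invented for illustration. Check them against the real page before relying on any of it. The helper name audio_url_from_onclick is hypothetical too.

```python
import base64
import re
from typing import Optional

# Hypothetical base URL; the real audio host and path are not given here
AUDIO_BASE = "https://audio.example.invalid/ogg/"

# Captures each single-quoted argument inside the onclick handler
QUOTED_ARG = re.compile(r"'([^']*)'")


def audio_url_from_onclick(onclick: str, arg_index: int = 1) -> Optional[str]:
    """Decode one base64 argument of an onclick handler into an audio URL.

    arg_index selects which quoted argument holds the .ogg path; the
    default is an assumption made for this sketch.
    """
    args = QUOTED_ARG.findall(onclick)
    if arg_index >= len(args) or not args[arg_index]:
        return None
    try:
        path = base64.b64decode(args[arg_index], validate=True).decode("utf-8")
    except ValueError:
        # Covers both invalid base64 and bytes that are not valid UTF-8
        return None
    return AUDIO_BASE + path


# Build a made-up onclick value so the example is self-checking
encoded = base64.b64encode(b"1234/5678/sample.ogg").decode("ascii")
onclick = f"Play(42,'bXAz','{encoded}',false);return false;"
print(audio_url_from_onclick(onclick))
```

Returning None for anything that doesn't decode cleanly fits the Optional pattern used by pronunciation_for_li, so a malformed item is simply skipped.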

Ranking pronunciations

As mentioned, we rank pronunciations by two variables: the username and the number of votes. But we need a method for ordering them in the list of pronunciations. We use functools.total_ordering for this.

from functools import total_ordering

@total_ordering
class Pronunciation:
   def __init__(self, uname: str, positive: int, path: str):
      self.user_name: str = uname
      self.positive: int = positive
      self.path: str = path

By decorating the Pronunciation class, we can later use max against our list of pronunciations to select the highest-rated item. But we do have to implement the methods that total_ordering requires, namely __eq__ and at least one ordering method such as __lt__:

# These are methods of the Pronunciation class;
# FAVS is a module-level collection of favourite usernames.
@property
def score(self) -> int:
   subscore = 0
   if self.user_name in FAVS:
      subscore = 2
   return self.positive + subscore

def __eq__(self, other):
   if not isinstance(other, type(self)):
      return NotImplemented
   return self.score == other.score

def __lt__(self, other):
   if not isinstance(other, type(self)):
      return NotImplemented
   return self.score < other.score

The scoring algorithm is entirely arbitrary. If you want to give a higher weight to favourite users, that’s something you can certainly implement.
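
Here is a small self-contained sketch of the same idea that you can run on its own to experiment with the weighting. The class name ScoredPronunciation, the FAV_BONUS constant and the usernames are all made up for the demonstration; it is a stand-in for the real class, not a copy of it.

```python
from functools import total_ordering

FAVS = {"alice_ru"}   # hypothetical favourite usernames
FAV_BONUS = 2         # raise this to give favourites more weight


@total_ordering
class ScoredPronunciation:
    """Stand-in for the Pronunciation class, reduced to what scoring needs."""

    def __init__(self, user_name: str, positive: int):
        self.user_name = user_name
        self.positive = positive

    @property
    def score(self) -> int:
        # Votes plus a fixed bonus when the contributor is a favourite
        bonus = FAV_BONUS if self.user_name in FAVS else 0
        return self.positive + bonus

    def __eq__(self, other):
        if not isinstance(other, type(self)):
            return NotImplemented
        return self.score == other.score

    def __lt__(self, other):
        if not isinstance(other, type(self)):
            return NotImplemented
        return self.score < other.score


candidates = [
    ScoredPronunciation("bob", 3),
    ScoredPronunciation("alice_ru", 2),
    ScoredPronunciation("carol", 1),
]
best = max(candidates)
print(best.user_name, best.score)  # alice_ru 4
```

With a bonus of 2, the favourite with two votes edges out the non-favourite with three; with the bonus set to 0, the ranking falls back to votes alone.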

Selecting a pronunciation

Having implemented the comparison functions in the Pronunciation class, we can select the one with the highest score:

# max handles a single-item list fine; only an empty list needs a guard
if not pronunciations:
   exit('ERROR - no usable pronunciations found!')
use_p = max(pronunciations)

And that’s it! I hope this example is helpful to you. If you have the means, and if the Forvo API improves a lot, using it would be the most ethical way to automate the process of grabbing pronunciations. But until then, here’s an alternative. If you have questions, I’m not on Twitter², so please just use my contact page.

¹ Of course, the latter variable is not entirely reliable, because the oldest pronunciations will have had the longest opportunity to garner votes; but the idea is that we can at least look to our favourites as a way of nudging the choice in the desired direction.
² I refuse to be part of Elon Musk’s attempt to impose his authoritarian world-view through his acquisition of Twitter. I have no accounts on the platform.