Pulling A List of Animals From Wikipedia

Beginning

As part of my pursuit of randomness I'm creating lists of things to select from. In this case I wanted a list of animals to create a Visual Alphabet and I decided to grab one from wikipedia. This is what I did.

Set Up

Imports

# python
from functools import partial
from pathlib import Path
from string import ascii_uppercase

import os
import re

# pypi
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from expects import (
    be_true,
    expect,
)
from selenium import webdriver
from tabulate import tabulate

import pandas

Set Up Tabulate

This is just to make printing the pandas data-frames a little easier. emacs-jupyter actually does a pretty good job of rendering them as HTML like a regular jupyter notebook, but this makes it a little more readable while in emacs, plus I can control things better.

TABLE = partial(tabulate, showindex=False, tablefmt="orgtbl", headers="keys")

Set Up Selenium

I started this using Requests-HTML because wikipedia uses JQuery to render the tables and Requests-HTML it uses pyppeteer to render javascript (as opposed to the reqular requests that just does an HTTP GET and so leaves the tables un-rendered). Unfortunately I can't get it to work in a jupyter notebook. One problem is that both jupyter and Requests-HTML use asyncio - using nest_asyncio allows pypeteer to run in the notebook but then I get this error:

AttributeError: 'coroutine' object has no attribute 'newPage'

Which looks like it might be a pypeteer error, but tracking all this down is taking too long so I'm going to switch to using selenium. Since I'm running this on a remote machine we have to run it headless. I used to do this using XVFB but selenium has a headless mode so I'll try that instead.

options = webdriver.FirefoxOptions()
options.headless = True
browser = webdriver.Firefox(firefox_options=options)

Middle

Grabbing Some HTML

Just let selenium render the JQuery tables for us.

URL = "https://en.wikipedia.org/wiki/List_of_animal_names"
browser.get(URL)

Building the Table

After giving up on Requests-HTML I was using Beautiful Soup to pull out the table so I could give it to pandas, but then I looked at the pandas documentation and it turns out they already do this, so you don't need to, but you do need to know how many tables there are and which one you want (it returns a list of dataframes). In this case I want the last one.

table = pandas.read_html(browser.page_source)[-1]
print(TABLE(table.head(1)))
('Animal', 'A') ('Young', 'A') ('Female', 'A') ('Male', 'A') ('Collective noun', 'A') ('Collateral adjective', 'A') ('Culinary noun for meat', 'A')
Aardvark cub sow boar nan orycteropodian nan

That's a little odd looking.

Cleaning Up the Table

First, what's with all the A entries in the header?

print(table.columns)
MultiIndex([(                'Animal', 'A'),
            (                 'Young', 'A'),
            (                'Female', 'A'),
            (                  'Male', 'A'),
            (       'Collective noun', 'A'),
            (  'Collateral adjective', 'A'),
            ('Culinary noun for meat', 'A')],
           )

Hmm..

table.columns = [column[0] for column in table.columns]
print(table.columns)
Index(['Animal', 'Young', 'Female', 'Male', 'Collective noun',
       'Collateral adjective', 'Culinary noun for meat'],
      dtype='object')
print(TABLE(table.head(1)))
Animal Young Female Male Collective noun Collateral adjective Culinary noun for meat
Aardvark cub sow boar nan orycteropodian nan

Meats

At this point I don't know what I really want to do with this. I was originally going to just save the animal names but now the other stuff looks kind of interesting. As an aside I had to look up Collateral Adjective on wikipedia:

A collateral adjective is an adjective that is identified with a particular noun in meaning, but that is not derived from that noun.

Good to know.

meats = table.dropna(subset=["Culinary noun for meat"])
print(len(meats))
49

So, now I noticed something else wrong.

print(TABLE(meats.iloc[-1:]))
Animal Young Female Male Collective noun Collateral adjective Culinary noun for meat
Z Z Z Z Z Z Z

Oops. Maybe I should have looked at the table more closely.

Again, But With Soup This Time

Let's see if using Beautiful Soup will help clean up the dataset a little.

soup = BeautifulSoup(browser.page_source)
soup_table = soup.find_all("table")[-1]

The read_html method always returns a list, even though I'm only passing in one table.

table = pandas.read_html(str(soup_table))[0]
print(TABLE(table.iloc[-1:]))
('Animal', 'A') ('Young', 'A') ('Female', 'A') ('Male', 'A') ('Collective noun', 'A') ('Collateral adjective', 'A') ('Culinary noun for meat', 'A')
Zebra foal colt (male) filly (female) mare stallion herd[11][107] cohort[107]dazzle [108] zeal[11][107] zebrine hippotigrine nan

Actually, that looks worse.

I went back and looked at the page and the table is broken up by sub-headers indicating the first letter of the animal's name.

print(str(soup_table).split("\n")[10])
<th colspan="7"><span id="A"></span><b>A</b>

So, maybe I can clean that up a little before creating the table - here's where Beautiful Soup comes in. First I'll find all the tags whose ID matches a letter of the alphabet and then I'll use the decompose method to destroy the grand-parent of that tag - decompose destroys both the tag it's called on and all the descendants of that tag so this will destroy the grand-parent (table-row tag), parent (table-header) and matching tag (span with the ID matching the letter) and it's child (the bold tag). I'm also calling smooth and encode to clean the tree up once the tags are destroyed, although, to be honest, I don't know if that's really needed here, it just seemed like a good idea.

for character in ascii_uppercase:
    element = soup_table.find(id=character)
    if element is None:
        print(f"Element for {character} not found")
    else:
        element.parent.parent.decompose()

soup_table.smooth()
soup_table.encode()
table = pandas.read_html(str(soup_table))[0]
print(TABLE(table.head(1)))
Animal Young Female Male Collective noun Collateral adjective Culinary noun for meat
Aardvark cub sow boar nan orycteropodian nan

So, now our first row is for Aardvark instead of the letter A. And now back to the meats.

meats = table.dropna(subset=["Culinary noun for meat"])
print(len(meats))
24

There are twenty-four entries in our table that have a noun referring to the animal as meat. Let's see what they are.

meats = meats.rename(columns={"Culinary noun for meat": "Meat"})
for row in meats.itertuples():
    print(f"{row.Animal}: {row.Meat}")
Cattle[note 3] (list): beef veal
Chicken (list): poultry
Deer: venison humble (organ meat)
DogfishAlso see Shark: Spiny dogfish capeshark (USA) flake (UK, AUS) huss (UK) kahada (CAN)[50] rigg (UK)rock salmon (UK)
Duck (list) Also see Mallard: poultry
Elk (wapiti): venison
Gaur: garabeef
Goat (list): chevon cabrito mutton
Goose: poultry
Guinea fowl: poultry
Guinea pig: cuy
MallardAlso see Duck: poultry
Moose: venison
Peafowl: poultry
Pig (list) Also see Boar: pork ham bacon
Pigeon (list): squab
RamAlso see Sheep: lamb mutton
Red deer: venison
Shark: flake (AUS)
Sheep (list) Also see Ram: lamb mutton hogget
Snail: escargot
Turkey (list): poultry
Water buffalo: carabeef
Whale: blubber

So now there's another problem - there are footnotes and links that makes things a bit messy. We could just clear out all link references in the HTML but that would destroy the animal names which are themselves links so I'll just do some string substitution in pandas instead to erase them.

FOOTNOTES = r"\[\w*\s*\d\]"
FOOTNOTES_2 = r"\s" + FOOTNOTES
EMPTY = ""
SPACE = " "
LIST = r"\(list\)"

for column in table.columns:
    table[column] = table[column].str.replace(FOOTNOTES_2, EMPTY)
    table[column] = table[column].str.replace(FOOTNOTES, SPACE)
    table[column] = table[column].str.replace(LIST, EMPTY)
print()
print(table.loc[32:32])

      Animal  Young Female  Male                       Collective noun  \
32  Cattle    calf    cow   bull  herd  drove  yoke (oxen) team (oxen)   

                                 Collateral adjective Culinary noun for meat  
32  bovine  taurine (male) vaccine (female) vituli...              beef veal  
meats = table.dropna(subset=["Culinary noun for meat"])
meats = meats.rename(columns={"Culinary noun for meat": "Meat"})
for row in meats.itertuples():
    print(f"{row.Animal}: {row.Meat}")
Cattle  : beef veal
Chicken : poultry
Deer: venison humble (organ meat)
DogfishAlso see Shark: Spiny dogfish capeshark (USA) flake (UK, AUS) huss (UK) kahada (CAN)  rigg (UK)rock salmon (UK)
Duck  Also see Mallard: poultry
Elk (wapiti): venison
Gaur: garabeef
Goat : chevon cabrito mutton
Goose: poultry
Guinea fowl: poultry
Guinea pig: cuy
MallardAlso see Duck: poultry
Moose: venison
Peafowl: poultry
Pig  Also see Boar: pork ham bacon
Pigeon : squab
RamAlso see Sheep: lamb mutton
Red deer: venison
Shark: flake (AUS)
Sheep  Also see Ram: lamb mutton hogget
Snail: escargot
Turkey : poultry
Water buffalo: carabeef
Whale: blubber

And now we have another prolbem - there's a weird smashing together of words (e.g. DogfishAlso) which appears to happen because wikipedia inserts <br/> tags to create a sub-row within the table rows and pandas appears to smash the rows together. Time to try again - I'm going to reload the soup and add a replace call to get rid of the breaks.

soup = BeautifulSoup(browser.page_source)
soup_table = soup.find_all("table")[-1]
for character in ascii_uppercase:
    element = soup_table.find(id=character)
    if element is None:
        print(f"Element for {character} not found")
    else:
        element.parent.parent.decompose()

for tag in soup_table.find_all("br"):
    tag.replace_with(SPACE)

soup_table.smooth()
soup_table.encode()
table = pandas.read_html(str(soup_table))[0]
print(TABLE(table[table.Animal.str.startswith("Ram")]))
Animal Young Female Male Collective noun Collateral adjective Culinary noun for meat
Ram Also see Sheep lamb ewe ram flock arietine ovine lamb mutton

So that fixes the smashing together, time to get rid of the footnotes and links to lists.

print(table.loc[32:32])
for column in table.columns:
    table[column] = table[column].str.replace(FOOTNOTES_2, EMPTY)
    table[column] = table[column].str.replace(FOOTNOTES, SPACE)
    table[column] = table[column].str.replace(LIST, EMPTY)
print()
print(table.loc[32:32])
                   Animal     Young   Female  Male  \
32  Cattle[note 3] (list)  calf[31]  cow[32]  bull   

                               Collective noun  \
32  herd[11] drove[11] yoke (oxen) team (oxen)   

                                 Collateral adjective Culinary noun for meat  
32  bovine[note 4] taurine (male) vaccine (female)...              beef veal  

      Animal  Young Female  Male                       Collective noun  \
32  Cattle    calf    cow   bull  herd  drove  yoke (oxen) team (oxen)   

                                 Collateral adjective Culinary noun for meat  
32  bovine  taurine (male) vaccine (female) vituli...              beef veal  

And once again, the meats.

meats = table.dropna(subset=["Culinary noun for meat"])
meats = meats.rename(columns={"Culinary noun for meat": "Meat"})
for row in meats.itertuples():
    print(f"{row.Animal}: {row.Meat}")
Cattle  : beef veal
Chicken : poultry
Deer: venison humble (organ meat)
Dogfish Also see Shark: Spiny dogfish capeshark (USA) flake (UK, AUS) huss (UK) kahada (CAN)  rigg (UK) rock salmon (UK)
Duck  Also see Mallard: poultry
Elk (wapiti): venison
Gaur: garabeef
Goat : chevon cabrito mutton
Goose: poultry
Guinea fowl: poultry
Guinea pig: cuy
Mallard Also see Duck: poultry
Moose: venison
Peafowl: poultry
Pig  Also see Boar: pork ham bacon
Pigeon : squab
Ram Also see Sheep: lamb mutton
Red deer: venison
Shark: flake (AUS)
Sheep  Also see Ram: lamb mutton hogget
Snail: escargot
Turkey : poultry
Water buffalo: carabeef
Whale: blubber

This isn't quite the nice set that I was hoping for, but, oh well. Maybe just one more clean-up to put paretheses around the Also see statements and onward.

PATTERN = "Also see (?P<name>\w+)"
def see_also(match):
    name = match["name"]
    return f"(Also see {name})"

table["Animal"] = table.Animal.str.replace(PATTERN, see_also)

for column in table.columns:
    table[column] = table[column].str.strip()

meats = table.dropna(subset=["Culinary noun for meat"])
meats = meats.rename(columns={"Culinary noun for meat": "Meat"})
for row in meats.itertuples():
    print(f"{row.Animal}: {row.Meat}")
Cattle: beef veal
Chicken: poultry
Deer: venison humble (organ meat)
Dogfish (Also see Shark): Spiny dogfish capeshark (USA) flake (UK, AUS) huss (UK) kahada (CAN)  rigg (UK) rock salmon (UK)
Duck  (Also see Mallard): poultry
Elk (wapiti): venison
Gaur: garabeef
Goat: chevon cabrito mutton
Goose: poultry
Guinea fowl: poultry
Guinea pig: cuy
Mallard (Also see Duck): poultry
Moose: venison
Peafowl: poultry
Pig  (Also see Boar): pork ham bacon
Pigeon: squab
Ram (Also see Sheep): lamb mutton
Red deer: venison
Shark: flake (AUS)
Sheep  (Also see Ram): lamb mutton hogget
Snail: escargot
Turkey: poultry
Water buffalo: carabeef
Whale: blubber

End

And now to save it for later.

load_dotenv()
path = Path(os.environ["WIKIPEDIA"]).expanduser()
if not path.is_dir():
    path.mkdir()

path = Path(os.environ["WIKIPEDIA_ANIMALS"])
table.to_csv(path, index=False)

And for an example use.

animals = pandas.read_csv(
    Path(os.environ["WIKIPEDIA_ANIMALS"]).expanduser())

print(animals.sample(1).iloc[0])
Animal                    Marmoset
Young                       infant
Female                           ?
Male                             ?
Collective noun                  ?
Collateral adjective        simian
Culinary noun for meat         NaN
Name: 155, dtype: object
print(meats.sample(1).iloc[0])
Animal                                           Dogfish (Also see Shark)
Young                                                                   ?
Female                                                                  ?
Male                                                                    ?
Collective noun                                                     troop
Collateral adjective                                   selachian squaloid
Meat                    Spiny dogfish capeshark (USA) flake (UK, AUS) ...
Name: 59, dtype: object

So, there we go. What have we learned?

  • How to pull a page from wikipedia using Selenium
  • How to find and clean a wikipedia table using Beautiful Soup
  • How to convert the Beautiful Soup object to a pandas DataFrame
  • How to further clean up the data frame using string methods and save the table for later as a CSV

Not so exciting, but it took longer than I thought it would. I blame the dog-days of August - the heat, it drives one mental.