Pulling Legendary Creatures From Wikipedia

The Cloistered Monkey

2020-08-06 15:04

Beginning

Set Up

Imports

# python
from argparse import Namespace
from functools import partial
import re

# pypi
from bs4 import BeautifulSoup, UnicodeDammit
from selenium import webdriver
from tabulate import tabulate

import pandas

Tabulate

TABLE = partial(tabulate, headers="keys", tablefmt="orgtbl", showindex=False)

Selenium

I'll use a headless instance of selenium-firefox to grab and render the Wikipedia page.

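options = webdriver.FirefoxOptions()
# "-headless" is a Firefox command-line flag, so it doesn't depend on the selenium version
options.add_argument("-headless")
# the "firefox_options" keyword was deprecated; selenium expects "options"
browser = webdriver.Firefox(options=options)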

Middle

Grab the Page and Make Some Soup

URL = "https://en.wikipedia.org/wiki/List_of_legendary_creatures_by_type"

browser.get(URL)

This page is a little odd: it uses tables, but the tables contain unordered lists, which seem to give pandas a hard time. So I'll poke around a little with Beautiful Soup first to figure out the structure.
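To make the problem a little more concrete, here's a minimal sketch using a made-up fragment shaped like the page's tables (the HTML and the name sample_soup are my own stand-ins, not copied from Wikipedia). Reading a cell as a single string runs the list items together, while asking for the "li" tags keeps them apart. This might be part of what trips pandas up, but I haven't confirmed that.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the page's layout: one table cell holding a list.
SAMPLE = (
    "<table><tr><td><ul>"
    "<li>Kelpie (Scottish) - water horse</li>"
    "<li>Selkie (Scottish) - shapeshifting seal people</li>"
    "</ul></td></tr></table>"
)

sample_soup = BeautifulSoup(SAMPLE, "html.parser")
cell = sample_soup.find("td")

# The whole cell as a single string: the two entries run together.
flattened = cell.get_text()

# One string per list item: the entries stay separate.
entries = [item.get_text() for item in cell.find_all("li")]

print(flattened)
print(entries)
```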

soup = BeautifulSoup(browser.page_source, "html.parser")

Looking at the Headings

The level-2 headlines describe different ways to catalog the creatures. Each one wraps its text in a "span" tag with the "mw-headline" class, so we can grab those tags using Beautiful Soup's select method, which takes CSS selectors.

headlines = soup.select("h2 span.mw-headline")
for headline in headlines:
    print(f" - {headline.text}")

Animals, creatures associated with
Artificial creatures
Body parts, creatures associated with
Concepts, creatures associated with
Demons
Elements, creatures associated with
Habitats, creatures associated with
Humanoids
Hybrids
Astronomical objects, creatures associated with
World
Creatures associated with Plants
Shapeshifters
Creatures associated with Times
Undead
Miscellaneous
References

We don't need the references, but the other headlines might be helpful in categorizing the creatures. Unfortunately the headings sit above the sections with the stuff we want rather than being their parents, so to get the actual parts we're going to need to do something else.
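One possible "something else", which I don't use in the rest of this post, is to start from each table and search backwards for the nearest heading. This is only a sketch on a made-up fragment: the HTML, the placeholder creature, and the names sample_soup and sections are all assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: headings and tables are siblings, not parent and child.
SAMPLE = """
<h2><span class="mw-headline">Animals, creatures associated with</span></h2>
<table><tr><td><ul><li>Kelpie</li></ul></td></tr></table>
<table><tr><td><ul><li>Anansi</li></ul></td></tr></table>
<h2><span class="mw-headline">Artificial creatures</span></h2>
<table><tr><td><ul><li>Placeholder creature</li></ul></td></tr></table>
"""

sample_soup = BeautifulSoup(SAMPLE, "html.parser")

sections = {}
for table in sample_soup.find_all("table"):
    # find_previous searches backwards through the document from this table
    heading = table.find_previous("h2")
    name = heading.select_one("span.mw-headline").get_text(strip=True)
    creatures = [item.get_text(strip=True) for item in table.find_all("li")]
    # several tables can share one heading, so collect them under the same key
    sections.setdefault(name, []).extend(creatures)

print(sections)
```

Whether this holds up on the real page depends on every table having a level-2 heading somewhere before it, which I haven't checked.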

Tables

I'll grab all the tables then filter out the ones I don't want - the first three tables are admonitions that the page might not be up to snuff and the last four are references and links to other pages.

tables = soup.find_all("table")
tables = tables[3:-4]
print(len(tables))
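Slicing by position will break if Wikipedia adds or removes a banner, so here's a sketch of filtering on content instead. The class names "ambox" and "navbox" are my assumption about how Wikipedia marks its notice and navigation tables, and the fragment and the is_content helper are made up for the example, so check them against the real page before relying on this.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the class names are assumptions about Wikipedia's markup.
SAMPLE = """
<table class="ambox"><tr><td>This article has issues</td></tr></table>
<table><tr><td><ul><li>Kelpie</li></ul></td></tr></table>
<table><tr><td><ul><li>Anansi</li></ul></td></tr></table>
<table class="navbox"><tr><td><ul><li>Related page</li></ul></td></tr></table>
"""

# classes that mark tables we don't want (assumed, not verified)
SKIP = {"ambox", "navbox"}

sample_soup = BeautifulSoup(SAMPLE, "html.parser")


def is_content(table):
    """True if the table holds list items and has none of the skipped classes."""
    classes = set(table.get("class", []))
    return table.find("li") is not None and not (classes & SKIP)


content_tables = [table for table in sample_soup.find_all("table")
                  if is_content(table)]
print(len(content_tables))
```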

The Table-Indices

Table = Namespace(
    aquatic=0,
    arthropods=1,
    bears=2,
)
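The Namespace is just a bag of named attributes, so Table.bears reads better than a bare 2 when indexing the list of tables. Here's a small sketch with stand-in strings in place of the real tables (sample_tables is made up for the example).

```python
from argparse import Namespace

Table = Namespace(aquatic=0, arthropods=1, bears=2)

# Stand-ins for the parsed tables, in page order.
sample_tables = ["aquatic table", "arthropod table", "bear table"]

# Index by name instead of by a magic number.
print(sample_tables[Table.bears])

# vars() exposes the name-to-index mapping, handy for looping over every table.
for name, index in vars(Table).items():
    print(f"{name}: {sample_tables[index]}")
```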

Animals Associated With

Aquatic and Marine Animals

PATTERN = r"\w+"
# the non-breaking space that sits in front of the dash in each entry
WEIRD = "\xa0"
ERASE = ""

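table = tables[Table.aquatic]
items = table.find_all("li")
# the dash between origin and description sits at index 23 of the first entry
DASH = items[0].text[23]
animals = []
origin = []
description = []
for item in items:
    # drop the non-breaking space and the dash so only the words are left
    text = item.text.replace(WEIRD, ERASE)
    text = text.replace(DASH, ERASE)
    # "name (origin) description" -> ["name", "origin) description"]
    tokens = text.split(" (")
    # "origin) description" -> ["origin", " description"]
    two_tokens = tokens[1].split(")")
    animals.append(tokens[0].strip())
    origin.append(two_tokens[0].strip())
    description.append(two_tokens[1].strip())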

beasts = pandas.DataFrame.from_dict(dict(
    animal=animals,
    origin=origin,
    description=description))
beasts["type"] = "Aquatic"
print(TABLE(beasts))

animal	origin	description	type
Bake-kujira	Japanese	ghost whale	Aquatic
Ceffyl Dŵr	Welsh	water horse	Aquatic
Encantado	Brazil	shapeshifting trickster dolphins	Aquatic
Kelpie	Scottish	water horse	Aquatic
Kushtaka	Tlingit	shapeshifting "land otter man"	Aquatic
Selkie	Scottish	shapeshifting seal people	Aquatic
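The splitting above leans on a hard-coded index to find the dash. As an alternative sketch, a single regular expression could pull out all three parts, assuming every entry looks like "name (origin)", then whitespace and some kind of dash, then the description. The ENTRY pattern and the parse_entry function are hypothetical names, and the sample strings are rebuilt from the table output rather than copied from the page.

```python
import re

# name, then "(origin)", then any run of whitespace/dashes, then the description.
# \s also matches the non-breaking space; \u2012-\u2015 covers the common dashes.
ENTRY = re.compile(
    r"^\s*(?P<animal>[^(]+?)\s*"
    r"\((?P<origin>[^)]*)\)"
    r"[\s\u2012-\u2015-]*"
    r"(?P<description>.*?)\s*$"
)


def parse_entry(text):
    """Return a dict with animal, origin and description, or None if no match."""
    match = ENTRY.match(text)
    return match.groupdict() if match else None


samples = [
    "Bake-kujira (Japanese)\xa0\u2013 ghost whale",
    "Myrmecoleon (Christian)\xa0\u2013 ant-lion",
]
for sample in samples:
    print(parse_entry(sample))
```

Hyphens inside a name or a description survive because the pattern only skips dashes that come right after the closing parenthesis.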

Arthropods

table = tables[Table.arthropods]
animal = []
origin = []
description = []
for item in table.find_all("li"):
    first, last = item.text.split("(")
    second, third = last.split(f"{WEIRD}{DASH} ")
    first = first.strip()
    second = second.replace(")", ERASE).strip()
    third = third.strip()
    animal.append(first)
    origin.append(second)
    description.append(third)

to_append = pandas.DataFrame.from_dict(
    dict(
        animal=animal,
        origin=origin,
        description=description
    )
)
to_append["type"] = "Arthropod"
beasts = pandas.concat([beasts, to_append], ignore_index=True)
print(TABLE(to_append))

animal	origin	description	type
Anansi	West African	trickster spider	Arthropod
Arachne	Greek	weaver cursed into a spider	Arthropod
Khepri	Ancient Egyptian	beetle who pushes the sun	Arthropod
Tsuchigumo	Japanese	shapeshifting giant spider	Arthropod
Myrmecoleon	Christian	ant-lion	Arthropod
Myrmidons	Greek	warriors created from ants by Zeus	Arthropod
Jorōgumo	Japanese	ghost woman who shapeshifts into a spider	Arthropod
Karkinos	Greek	Cancer the crab	Arthropod
Mothman	American cryptid	man with moth wings and features	Arthropod
Pabilsag	Babylonian	Sagittarius-like creature with scorpion tail	Arthropod
Scorpion man	Babylonian	protector of travellers	Arthropod
Selket	Ancient Egyptian	scorpion death/healing goddess	Arthropod

Bears

table = tables[Table.bears]
animal, origin, description = [], [], []
for item in table.find_all("li"):
    first, right = item.text.split("(")
    second, third = right.split(f"){WEIRD}{DASH} ")
    animal.append(first.strip())
    origin.append(second.strip())
    description.append(third.strip())
to_append = pandas.DataFrame.from_dict(
    dict(animal=animal,
         origin=origin,
         description=description)
)
# tag the rows like the other groups so the combined frame has no missing types
to_append["type"] = "Bear"
beasts = pandas.concat([beasts, to_append], ignore_index=True)
print(TABLE(to_append))

animal	origin	description	type
Bugbear	Celtic	child-eating hobgoblin	Bear
Callisto	Greek	A nymph who was turned into a bear by Hera.	Bear
