Pulling Legendary Creatures From Wikipedia

Beginning

Set Up

Imports

# python
from argparse import Namespace
from functools import partial
import re

# pypi
from bs4 import BeautifulSoup, UnicodeDammit
from selenium import webdriver
from tabulate import tabulate

import pandas

Tabulate

TABLE = partial(tabulate, headers="keys", tablefmt="orgtbl", showindex=False)

Selenium

I'll use a headless instance of selenium-firefox to grab and render the Wikipedia page.

options = webdriver.FirefoxOptions()
# newer Selenium releases dropped the headless attribute and the firefox_options keyword
options.add_argument("-headless")
browser = webdriver.Firefox(options=options)

Middle

Grab the Page and Make Some Soup

URL = "https://en.wikipedia.org/wiki/List_of_legendary_creatures_by_type"

browser.get(URL)

This page is a little odd: it uses tables, but the tables contain unordered lists, which seem to give pandas a hard time. So I'll poke around with Beautiful Soup first to figure out the structure.

# name the parser explicitly so Beautiful Soup doesn't warn and guess one
soup = BeautifulSoup(browser.page_source, "html.parser")

Looking at the Headings

The level-2 headlines describe different ways to catalog the creatures. They also contain "span" tags with the "mw-headline" class, so we can grab those tags using Beautiful Soup's select method, which takes CSS selectors.

headlines = soup.select("h2 span.mw-headline")
for headline in headlines:
    print(f" - {headline}")
  • <span class="mw-headline" id="Animals,_creatures_associated_with">Animals, creatures associated with</span>
  • <span class="mw-headline" id="Artificial_creatures">Artificial creatures</span>
  • <span class="mw-headline" id="Body_parts,_creatures_associated_with">Body parts, creatures associated with</span>
  • <span class="mw-headline" id="Concepts,_creatures_associated_with">Concepts, creatures associated with</span>
  • <span class="mw-headline" id="Demons">Demons</span>
  • <span class="mw-headline" id="Elements,_creatures_associated_with">Elements, creatures associated with</span>
  • <span class="mw-headline" id="Habitats,_creatures_associated_with">Habitats, creatures associated with</span>
  • <span class="mw-headline" id="Humanoids">Humanoids</span>
  • <span class="mw-headline" id="Hybrids">Hybrids</span>
  • <span class="mw-headline" id="Astronomical_objects,_creatures_associated_with">Astronomical objects, creatures associated with</span>
  • <span class="mw-headline" id="World">World</span>
  • <span class="mw-headline" id="Creatures_associated_with_Plants">Creatures associated with Plants</span>
  • <span class="mw-headline" id="Shapeshifters">Shapeshifters</span>
  • <span class="mw-headline" id="Creatures_associated_with_Times">Creatures associated with Times</span>
  • <span class="mw-headline" id="Undead">Undead</span>
  • <span class="mw-headline" id="Miscellaneous">Miscellaneous</span>
  • <span class="mw-headline" id="References">References</span>
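Printing the whole tag is noisy. As a minimal sketch, here is the same selector run against two of the headlines from the output above, pulling out just the "id" attribute and the text. The SNIPPET string and the names snippet_soup and labels are made up for the illustration and aren't part of the original notebook.

```python
from bs4 import BeautifulSoup

# Two headlines copied from the output above, wrapped in h2 tags (illustrative only)
SNIPPET = """
<h2><span class="mw-headline" id="Demons">Demons</span></h2>
<h2><span class="mw-headline" id="Creatures_associated_with_Plants">Creatures associated with Plants</span></h2>
"""

snippet_soup = BeautifulSoup(SNIPPET, "html.parser")

# map each span's id attribute to its visible text
labels = {
    span["id"]: span.get_text()
    for span in snippet_soup.select("h2 span.mw-headline")
}
for identifier, label in labels.items():
    print(f" - {identifier}: {label}")
```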

We don't need the references, but the other headlines might be helpful in categorizing these creatures. Unfortunately the headings sit above the sections with the stuff we want (they aren't parents of the sections), so to get the actual entries we're going to need to do something else.
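One possible "something else" is to walk forward through each heading's siblings and collect the tables that come before the next heading. This is only a sketch on a toy document: it assumes the tables are direct siblings of the h2 tags, which may not hold for the real page, and SAMPLE and tables_by_headline are hypothetical names I'm introducing here.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the page layout (an assumption, not the real markup)
SAMPLE = """
<h2><span class="mw-headline" id="Demons">Demons</span></h2>
<table><tr><td><ul><li>Imp</li></ul></td></tr></table>
<h2><span class="mw-headline" id="Humanoids">Humanoids</span></h2>
<table><tr><td><ul><li>Giant</li></ul></td></tr></table>
<table><tr><td><ul><li>Troll</li></ul></td></tr></table>
"""


def tables_by_headline(soup):
    """Map each level-2 headline's text to the tables that follow it."""
    sections = {}
    for headline in soup.select("h2 span.mw-headline"):
        found = []
        # walk the h2's following siblings, stopping at the next h2
        for sibling in headline.find_parent("h2").find_next_siblings():
            if sibling.name == "h2":
                break
            if sibling.name == "table":
                found.append(sibling)
        sections[headline.get_text()] = found
    return sections


sample_soup = BeautifulSoup(SAMPLE, "html.parser")
sections = tables_by_headline(sample_soup)
for name, found in sections.items():
    entries = [item.get_text() for table in found for item in table.find_all("li")]
    print(name, entries)
```

If the real page wraps the tables in other elements, the sibling walk would need to look inside those wrappers instead.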

Tables

I'll grab all the tables and then filter out the ones I don't want: the first three are admonitions that the page might not be up to snuff, and the last four are references and links to other pages.

tables = soup.find_all("table")
tables = tables[3:-4]
print(len(tables))

The Table-Indices

Table = Namespace(
    aquatic=0,
    arthropods=1,
    bears=2,
)

Animals Associated With

  • Aquatic and Marine Animals
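    PATTERN = r"\w+"
    WEIRD = "\xa0"  # non-breaking space
    ERASE = ""
    
    table = tables[Table.aquatic]
    items = table.find_all("li")
    # the dash between the origin and the description sits at index 23 of the first entry
    DASH = items[0].text[23]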
    animals = []
    origin = []
    description = []
    for item in items:
        text = item.text.replace(WEIRD, ERASE)
        text = text.replace(DASH, ERASE)
        tokens = text.split(" (")
        two_tokens = tokens[1].split(")")
        animals.append(tokens[0].strip())
        origin.append(two_tokens[0].strip())
        description.append(two_tokens[1].strip())
    
    beasts = pandas.DataFrame.from_dict(dict(
        animal=animals,
        origin=origin,
        description=description))
    beasts["type"] = "Aquatic"
    print(TABLE(beasts))
    
    | animal      | origin   | description                      | type    |
    |-------------+----------+----------------------------------+---------|
    | Bake-kujira | Japanese | ghost whale                      | Aquatic |
    | Ceffyl Dŵr  | Welsh    | water horse                      | Aquatic |
    | Encantado   | Brazil   | shapeshifting trickster dolphins | Aquatic |
    | Kelpie      | Scottish | water horse                      | Aquatic |
    | Kushtaka    | Tlingit  | shapeshifting "land otter man"   | Aquatic |
    | Selkie      | Scottish | shapeshifting seal people        | Aquatic |
  • Arthropods
    table = tables[Table.arthropods]
    animal = []
    origin = []
    description = []
    for item in table.find_all("li"):
        first, last = item.text.split("(")
        second, third = last.split(f"{WEIRD}{DASH} ")
        first = first.strip()
        second = second.replace(")", ERASE).strip()
        third = third.strip()
        animal.append(first)
        origin.append(second)
        description.append(third)
    
    to_append = pandas.DataFrame.from_dict(
        dict(
            animal=animal,
            origin=origin,
            description=description
        )
    )
    to_append["type"] = "Arthropod"
    beasts = pandas.concat([beasts, to_append], ignore_index=True)
    print(TABLE(to_append))
    
    | animal       | origin           | description                                  | type      |
    |--------------+------------------+----------------------------------------------+-----------|
    | Anansi       | West African     | trickster spider                             | Arthropod |
    | Arachne      | Greek            | weaver cursed into a spider                  | Arthropod |
    | Khepri       | Ancient Egyptian | beetle who pushes the sun                    | Arthropod |
    | Tsuchigumo   | Japanese         | shapeshifting giant spider                   | Arthropod |
    | Myrmecoleon  | Christian        | ant-lion                                     | Arthropod |
    | Myrmidons    | Greek            | warriors created from ants by Zeus           | Arthropod |
    | Jorōgumo     | Japanese         | ghost woman who shapeshifts into a spider    | Arthropod |
    | Karkinos     | Greek            | Cancer the crab                              | Arthropod |
    | Mothman      | American cryptid | man with moth wings and features             | Arthropod |
    | Pabilsag     | Babylonian       | Sagittarius-like creature with scorpion tail | Arthropod |
    | Scorpion man | Babylonian       | protector of travellers                      | Arthropod |
    | Selket       | Ancient Egyptian | scorpion death/healing goddess               | Arthropod |
  • Bears
    table = tables[Table.bears]
    animal, origin, description = [], [], []
    for item in table.find_all("li"):
        first, right = item.text.split("(")
        second, third = right.split(f"){WEIRD}{DASH} ")
        animal.append(first)
        origin.append(second)
        description.append(third)
    to_append = pandas.DataFrame.from_dict(
        dict(animal=animal,
             origin=origin,
             description=description)
    )
    beasts = pandas.concat([beasts, to_append], ignore_index=True)
    print(TABLE(to_append))
    
    | animal   | origin | description                                 |
    |----------+--------+---------------------------------------------|
    | Bugbear  | Celtic | child-eating hobgoblin                      |
    | Callisto | Greek  | A nymph who was turned into a bear by Hera. |
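
The three blocks above repeat the same split-and-strip steps with small variations, and the bears never get a "type". As a rough sketch of how the parsing could be pulled into one helper, here is a regular-expression version. It assumes every entry looks like "Name (Origin)", then a non-breaking space, then a dash, then the description, which is what the splits above imply. The names ENTRY and parse_entry and the "Bear" label are my own inventions for illustration.

```python
import re

NON_BREAKING_SPACE = "\xa0"

# name, "(origin)", then a hyphen, en dash, or em dash, then the description
ENTRY = re.compile(
    r"^(?P<animal>[^(]+)\((?P<origin>[^)]*)\)\s*[-\u2013\u2014]\s*(?P<description>.+)$"
)


def parse_entry(text, kind):
    """Split one list item's text into a row for the creature table."""
    cleaned = text.replace(NON_BREAKING_SPACE, " ").strip()
    match = ENTRY.match(cleaned)
    if match is None:
        # fail loudly rather than silently mis-splitting an odd entry
        raise ValueError(f"Unexpected entry format: {text!r}")
    row = {key: value.strip() for key, value in match.groupdict().items()}
    row["type"] = kind
    return row


# "Bear" is an assumed label; the original bears table has no type column
row = parse_entry("Bugbear (Celtic)\xa0\u2013 child-eating hobgoblin", "Bear")
print(row)
```

A list of such rows can be handed straight to pandas.DataFrame, and entries that don't fit the pattern (extra parentheses, a missing dash) raise an error instead of ending up in the wrong column.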