Wikipedia List of Countries

The Cloistered Monkey

2020-09-18 20:27

Beginning

Set Up

Imports

# python
from argparse import Namespace
from functools import partial
from pathlib import Path

# pypi
from bokeh.models import HoverTool
from selenium import webdriver
from tabulate import tabulate

import hvplot.pandas
import pandas

# others
from graeae import EmbedHoloviews

Plotting

SLUG = "wikipedia-list-of-countries"
Embed = partial(EmbedHoloviews, folder_path=f"files/posts/{SLUG}")
Plot = Namespace(
    width=990,
    height=780,
    fontscale=2,
)

Set Up Tabulate

This is just to make printing the pandas data-frames a little easier. emacs-jupyter actually does a pretty good job of rendering them as HTML like a regular jupyter notebook, but this makes it a little more readable while in emacs, plus I can control things better.

TABLE = partial(tabulate, showindex=False, tablefmt="orgtbl", headers="keys")

Set Up Selenium

options = webdriver.FirefoxOptions()
options.headless = True
browser = webdriver.Firefox(firefox_options=options)

Middle

Grabbing Some HTML

Build the Table

The tables are created using javascript so we need to use Selenium (or something else that handles javascript) to get the page for us.

URL = "https://www.wikiwand.com/en/List_of_countries_and_dependencies_by_population_density"
browser.get(URL)

Now we can let pandas convert the table to a data-frame. In this case there's multiple tables on the page but I only want the first one.

table = pandas.read_html(browser.page_source)[0]

print(TABLE(table.head(1)))

| ('Rank', 'Rank')   | ('Country (or dependent territory)', 'Country (or dependent territory)')   |   ('Area', 'km2') |   ('Area', 'mi2') |   ('Population', 'Population') |   ('Density', 'pop./km2') |   ('Density', 'pop./mi2') | ('Date', 'Date')   | ('Population source', 'Population source')   |
|--------------------+----------------------------------------------------------------------------+-------------------+-------------------+--------------------------------+---------------------------+---------------------------+--------------------+----------------------------------------------|
| –                  | Macau (China)                                                              |             115.3 |                45 |                         696100 |                     21158 |                     54799 | September 30, 2019 | Official quarterly estimate                  |

That kind of worked but the column names are odd, and I don't want all of them anyway so I'll fix that.

table.columns = [column[0] for column in table.columns]
print(TABLE(table.head(1)))

Rank	Country (or dependent territory)	Area	Area	Population	Density	Density	Date	Population source
–	Macau (China)	115.3	45	696100	21158	54799	September 30, 2019	Official quarterly estimate

Now, I messed up the units for area and density, but I'm not good with the metric system and Density is a calculated column anyway, so maybe I should get rid of some columns to make it easier.

Columns = Namespace(
    rank="Rank",
    date="Date",
    source="Population source",
    area="Area (square miles)",
    area_cleaned = "Area",
    density="Density (population/square miles)",
    density_cleaned = "Density",
    country="Country (or dependent territory)",
    country_cleaned = "Country",
    population="Population",
    )

columns = list(table.columns)

columns[2] = "remove"
columns[3] = Columns.area
columns[5] = "remove2"
columns[6] = Columns.density
table.columns = columns

del(table["remove"])
del(table["remove2"])
print(TABLE(table.head(1)))

| Rank   | Country (or dependent territory)   |   Area (square miles) |   Population |   Density (population/square miles) | Date               | Population source           |
|--------+------------------------------------+-----------------------+--------------+-------------------------------------+--------------------+-----------------------------|
| –      | Macau (China)                      |                    45 |       696100 |                               54799 | September 30, 2019 | Official quarterly estimate |

The Last Two Rows

I guess because the table is so long Wikipedia repeats the header at the bottom of the table.

print(TABLE(table.iloc[-2:]))

Rank	Country (or dependent territory)	Area (square miles)	Population	Density (population/square miles)	Date	Population source
Rank	Country (or dependent territory)	mi2	Population	pop./mi2	Date	Population source
Rank	Country (or dependent territory)	Area	Population	Density	Date	Population source

So we have to remove them and also convert the numeric columns to numeric types since pandas had to render them all as objects (strings).

table = table.iloc[:-2]
print(TABLE(table.iloc[-2:]))

Rank	Country (or dependent territory)	Area (square miles)	Population	Density (population/square miles)	Date	Population source
–	Svalbard and Jan Mayen (Norway)	23706	2655	0.1	September 1, 2012	Official estimate
–	Greenland (Denmark)	836297	55877	0.08	January 1, 2018	Official estimate

I'm guessing that there's no fractional populations, but if you look at the areas some of them are non-integers.

print(table[Columns.area].min())

0.19

So only the population is an integer (I'm assuming).

for column in [Columns.area, Columns.density]:
    table[column] = table[column].astype(float)
table[Columns.population] = table[Columns.population].astype(int)

Plotting

This is just to take a look.

By Area

First I'll sort the table by area using sort_values.

plotter = table.sort_values(by=Columns.area)
plotter = plotter.rename(columns={
    Columns.area: Columns.area_cleaned,
    Columns.density: Columns.density_cleaned,
    Columns.country: Columns.country_cleaned,
})

country = f"@{Columns.country_cleaned}"
population = f"@{Columns.population}{{0,0}}"
density = f"@{{{Columns.density_cleaned}}}"
area =  f"@{Columns.area_cleaned}{{0,0.00}} (Square Miles)"
hover = HoverTool(tooltips=[
    ("Country", country),
    ("Area", area),
    ("Population", population),
    ("Density", density)
])

plot = plotter.hvplot.barh(x=Columns.country_cleaned,
                           y=Columns.area_cleaned,
                           hover_cols=[
                               Columns.population,
                               Columns.density_cleaned
                           ]).opts(
                               title="Countries by Area",
                               width=Plot.width,
                               height=Plot.height * 2,
                               tools=[hover],
                           )

outcome = Embed(plot=plot, file_name="countries_by_area")()

print(outcome)

Well, that's not so easy to read. Maybe just the top and bottom ten.

subset = plotter.iloc[-10:]
plot = subset.hvplot.barh(x=Columns.country_cleaned, y=Columns.area_cleaned,
                          hover_cols=[Columns.population,
                                      Columns.density_cleaned]).opts(
    title="Top Ten Countries by Area",
    width=Plot.width,
    height=Plot.height,
    fontscale=Plot.fontscale,
    xrotation=45,
    tools=[hover],
)

outcome = Embed(plot=plot, file_name="top_ten_countries_by_area")()

print(outcome)

So, this shows us a few things to fix. One, the first row is likely just the sum of the others, I would think… But even if it isn't it's so big that it kind of throws stuff off. Also we could probably suck out Antarctica from the second row to make the table match the one Wikipedia has of just countries by area.

The other thing is that there are reference to notes that I don't know that I want.

numeric = [Columns.area_cleaned, Columns.population, Columns.density_cleaned]
columns={
    Columns.area: Columns.area_cleaned,
    Columns.density: Columns.density_cleaned,
    Columns.country: Columns.country_cleaned,
}

world = plotter.iloc[-1][numeric]
without_antarctica = plotter.iloc[-2][numeric]
antarctica = world - without_antarctica
antarctica[Columns.rank] = "-"
antarctica[Columns.country] = "Antarctica"
antarctica[Columns.date] = plotter[Columns.date].iloc[-1]
antarctica[Columns.source] = plotter[Columns.source].iloc[-1]
antarctica = antarctica.rename(columns)
antarctica = antarctica[plotter.columns]
cleaned = plotter.iloc[:-2]
cleaned = cleaned.append(antarctica, ignore_index=True)
cleaned = cleaned.rename(columns=columns)
print(antarctica)

Rank                                             -
Country                                 Antarctica
Area                                      5.41e+06
Population                                       0
Density                                        -14
Date                            September 19, 2020
Population source    USCB's World population clock
dtype: object

print(f"Area Difference: {world[Columns.area_cleaned] - cleaned[Columns.area_cleaned].sum():,.2f}")
print(f"Population Difference: {world[Columns.population] - cleaned[Columns.population].sum():,}")

Area Difference: -554,000.57
Population Difference: 168,435,684

So, it doesn't actually match the way I thought. I guess there are parts of the world that aren't part of a country… That's a huge difference in population, though… I'll have to investigate that separately.

Just the Population Table

For some reason pandas is acting up here so I had to separate the building of the dataframe and re-running the code blocks a few times. It even said that pandas couldn't find html5lib, which is kind of weird since the earlier read_html call worked…

URL = "https://www.wikiwand.com/en/List_of_countries_and_dependencies_by_population"
browser.get(URL)

Now that I think about it, maybe it takes some time for the javascript to render so you need to have a pause, especially if the internet connection is acting up.

tables = pandas.read_html(browser.page_source)

by_population = tables[1]
print(TABLE(by_population.head(1)))

Rank	Country(or dependent territory)	Population	% of world	Date	Source(official or UN)
1	China[b]	1404513080	nan	20 Sep 2020	National population clock[3]

Pandas doesn't seem to be able to handle the "% of world" column - probably because they put the percent symbol in the rows for some reason.

population_2 = by_population[Columns.population].sum()
print(f"{world[Columns.population] - population_2:,}")
print(f"{cleaned[Columns.population].sum() - population_2:,}")

-7,693,448,437
-7,861,884,121

What?

print(f"{population_2:,}")

15,507,529,605

According to the Wikipedia page the world population is 7,814,267,000 - so something isn't right here.

print(len(by_population))
print(len(cleaned))

242
252

They also don't have the same number of countries (it should be off by 1 because of Antarctica, not by 10).

countries = set(cleaned[Columns.country_cleaned])

# the new table doesn't have a space between Country and the parenthesis
countries_2 = set(by_population["Country(or dependent territory)"])
print(countries - countries_2)
print()
print(countries_2 - countries)

{'Northern Cyprus[note 4]', 'Western Sahara[note 12]', 'Taiwan', 'British Virgin Islands (United Kingdom)', 'Morocco', 'Bonaire (Netherlands)', 'Jersey (United Kingdom)', 'Cyprus', 'Anguilla (United Kingdom)', 'Isle of Man (United Kingdom)', 'Russia[note 11]', 'Wallis & Futuna (France)', 'Saba (Netherlands)', 'United States', 'Transnistria[note 3]', 'Vatican City[note 1]', 'Somaliland[note 8]', 'United States Virgin Islands (United States)', 'France', 'Turks and Caicos Islands (United Kingdom)', 'Cook Islands (New Zealand)', 'Democratic Republic of the Congo', 'Mayotte (France)', 'Tanzania', 'China', 'Guam (United States)', 'Falkland Islands (United Kingdom)', 'Guernsey (United Kingdom)', 'Northern Mariana Islands (United States)', 'Netherlands', 'Serbia', 'Republic of the Congo', 'Pitcairn Islands (United Kingdom)', 'Timor-Leste', 'Kosovo[note 2]', 'Eswatini (Swaziland)', 'Niue (New Zealand)', 'Denmark', 'Gibraltar (United Kingdom)', 'Saint Helena, Ascension and Tristan da Cunha (United Kingdom)', 'Finland', 'Pakistan', 'American Samoa (United States)', 'Sint Eustatius (Netherlands)', 'Bermuda (United Kingdom)', 'India', 'Abkhazia[note 6]', 'United Kingdom', 'Cayman Islands (United Kingdom)', 'Puerto Rico (United States)', 'Moldova', 'French Guiana (France)', 'South Ossetia[note 9]', 'Guadeloupe (France)', 'Ukraine [note 5]', 'Georgia', 'Svalbard and Jan Mayen (Norway)', 'Somalia', 'Tokelau (New Zealand)', 'Réunion (France)', 'Uruguay[note 7][clarification needed]', 'Martinique (France)', 'Federated States of Micronesia', 'Montserrat (United Kingdom)', 'Artsakh[note 10]', 'Antarctica'}

{'Pakistan[e]', 'India[c]', 'DR Congo', 'Falkland Islands (UK)', 'United Kingdom[h]', 'Wallis and Futuna (France)', 'Puerto Rico (US)', 'East Timor', 'Republic of Artsakh[z]', 'China[b]', 'Taiwan[l]', 'South Ossetia[aa]', 'Tokelau (NZ)', 'World', 'Ukraine[j]', 'Netherlands[m]', 'Saint Helena, Ascensionand Tristan da Cunha (UK)', 'American Samoa (US)', 'Tanzania[i]', 'Finland[q]', 'Vatican City[ab]', 'U.S. Virgin Islands (US)', 'Morocco[k]', 'Transnistria[w]', 'Somalia[n]', 'Abkhazia[y]', 'Cayman Islands (UK)', 'Gibraltar (UK)', 'Serbia[o]', 'Uruguay', 'Cook Islands (NZ)', 'Northern Cyprus[x]', 'Eswatini', 'Guernsey (UK)', 'Congo', 'Russia[f]', 'Bermuda (UK)', 'Pitcairn Islands (UK)', 'Guam (US)', 'Montserrat (UK)', 'Moldova[s]', 'Kosovo[t]', 'Denmark[p]', 'Jersey (UK)', 'France[g]', 'Isle of Man (UK)', 'Cyprus[u]', 'Anguilla (UK)', 'Georgia[r]', 'Western Sahara[v]', 'Northern Mariana Islands (US)', 'F.S. Micronesia', 'United States[d]', 'Turks and Caicos Islands (UK)', 'Niue (NZ)', 'British Virgin Islands (UK)'}

Some of that is just their notes being added, and an inconsistency of naming, e.g. United Kingdom vs UK. but the population only set also seems to break up things a little differently - they consider the countries within the Netherlands to be separate, for instance. This is messier than I thought, and I don't know how to reconcile the population explosion.

Okay, I just noticed something.

print(TABLE(by_population.tail(1)))

Rank	Country(or dependent territory)	Population	% of world	Date	Source(official or UN)
nan	World	7814267000	100%	19 Sep 2020	UN Projection[202]

They copied the header here too, but pandas was able to parse the numbers (probably because the headers only have one row).

world_2 = by_population.iloc[-1]
by_population = by_population[:-1]

population_2 = by_population[Columns.population].sum()
print(f"{world[Columns.population] - population_2:,}")
print(f"{cleaned[Columns.population].sum() - population_2:,}")
print(f"{population_2:,}")

121,019,395
-47,416,289
7,693,262,605

The second table is short too, but it still has more than the original table… Maybe I just don't care about the numbers, as long as they're rankable. Although the fact that the tables have different numbers of countries isn't reassuring…

I think I'll see if making a combined table myself would be better. But first let's see what percentage of the population is missing.

print(f"By Population Table: {(by_population.Population/world_2.Population).sum() * 100:0.2f} %")
print(f"By Density Table: {(cleaned.Population/world_2.Population).sum() * 100:0.2f} %")

By Population Table: 98.45 %
By Density Table: 97.84 %

Countries by Area

URL = "https://www.wikiwand.com/en/List_of_countries_and_dependencies_by_area"
browser.get(URL)

tables = pandas.read_html(browser.page_source)

by_area = tables[0]
print(TABLE(by_area.head(1)))

Rank	Country / Dependency	Total in km2 (mi2)	Land in km2 (mi2)	Water in km2 (mi2)	% water	Notes
nan	World	510,072,000 (196,940,000)	148,940,000 (57,510,000)	361,132,000 (139,434,000)	70.8	nan

So once again we have a summation in there, but we have a bigger problem in that they combine the units in the same columns…

print(len(by_area))

Okay, so now we have another problem in that all three sets have differing numbers of entries. I'm guessing that trying to reconcile the country names is going to be a nightmare so I think I'll have to stick with the original table. I just looked at the Wikipedia page and this table includes "unrecognized states" which is probably why it has more countries. Maybe it's these unrecognized states that makes up the shortcoming in the population and area. Maybe.

Plotting Again

By Area

Biggest

cleaned = cleaned.sort_values(by=Columns.area_cleaned)
subset = cleaned.iloc[-10:]
plot = subset.hvplot.barh(x=Columns.country_cleaned, y=Columns.area_cleaned,
                          hover_cols=[Columns.population,
                                      Columns.density_cleaned]).opts(
                                          title="Top Ten Countries by Area",
                                          width=Plot.width,
                                          height=Plot.height,
                                          fontscale=Plot.fontscale,
                                          tools=[hover],
                                          xrotation=45,
)

outcome = Embed(plot=plot, file_name="top_ten_countries_by_area_cleaned")()

print(outcome)

Okay, so it looks like I forgot to clean out the note references in the country names. Also, Russia as bigger than Antarctica? Go figure. I'm also surprised that the United States has more area than China. Actually I was surprised by this whole thing. I guess I never really looked too closely at the relative sizes of countries before, and area's a tough thing to estimate visually.

Getting Rid of the Notes

names = cleaned[Columns.country_cleaned].str.split("[", expand=True)[0]
cleaned[Columns.country_cleaned] = names

Smallest

subset = cleaned.sort_values(by=Columns.area_cleaned, ascending=False).iloc[-10:]
plot = subset.hvplot.barh(x=Columns.country_cleaned, y=Columns.area_cleaned).opts(
    title="Bottom Ten Countries by Area",
    width=Plot.width,
    height=Plot.height,
    fontscale=Plot.fontscale,
)

outcome = Embed(plot=plot, file_name="bottom_ten_countries_by_area_cleaned")()

print(outcome)

By Population

Most Populated

subset = cleaned.sort_values(by=Columns.population).iloc[-10:]
population_columns = [Columns.area_cleaned,
                      Columns.density_cleaned]
plot = subset.hvplot.barh(x=Columns.country_cleaned, y=Columns.population,
                          hover_cols=population_columns).opts(
                                          title="Top Ten Countries by Population",
                                          width=Plot.width,
                                          height=Plot.height,
                                          fontscale=Plot.fontscale,
                                          tools=[hover],
                                          xrotation=45,
)

outcome = Embed(plot=plot, file_name="top_ten_countries_by_population")()

print(outcome)

I had encountered this list before, when reading Memory Craft by Lynne Kelly, but it still surprises me that the United States is third and that Bangladesh has more people than Russia.

Least Populous

subset = cleaned.sort_values(by=Columns.population, ascending=False).iloc[-10:]
plot = subset.hvplot.barh(x=Columns.country_cleaned, y=Columns.population,
                          hover_cols=population_columns).opts(
                              title="Bottom Ten Countries by Population",
                              width=Plot.width,
                              height=Plot.height,
                              fontscale=Plot.fontscale,
                              tools=[hover],
)

outcome = Embed(plot=plot, file_name="bottom_ten_countries_by_population")()

print(outcome)

By Density

Densest

density_columns=[Columns.population, Columns.area_cleaned]
subset = cleaned.sort_values(by=Columns.density_cleaned).iloc[-10:]
plot = subset.hvplot.barh(x=Columns.country_cleaned, y=Columns.density_cleaned,
                          hover_cols=density_columns).opts(
                              title="Top Ten Countries by Population Density",
                              width=Plot.width,
                              height=Plot.height,
                              fontscale=Plot.fontscale,
                              tools=[hover],
)

outcome = Embed(plot=plot, file_name="top_ten_countries_by_population_density")()

print(outcome)

It's sort of interesting that the two most dense countries are both known for their gambling.

Sparsest

First, because of the way I calculated Antarctica it currently has a negative number (-14) for density, which is a little odd so I'll set it to 0.

cleaned.loc[cleaned[Columns.country_cleaned]=="Antarctica", Columns.density_cleaned] = 0

subset = cleaned.sort_values(by=Columns.density_cleaned, ascending=False).iloc[-10:]
plot = subset.hvplot.barh(x=Columns.country_cleaned, y=Columns.density_cleaned,
                          hover_cols=density_columns).opts(
                              title="Bottom Ten Countries by Population Density",
                              width=Plot.width,
                              height=Plot.height,
                              fontscale=1.5,
                              tools=[hover],
                              yrotation=45,
)

outcome = Embed(plot=plot, file_name="bottom_ten_countries_by_population_density")()

print(outcome)

End

Well, the point wasn't really just to look at the list (there is a Wikipedia page there already, after all) but to have something that I could mess with locally, so I'll save it now, but first I'll do a little more cleaning.

Do I need the rank? Nah.

del(cleaned[Columns.rank])

I was going to remove the density and re-calculate it but for some reason they don't differ that much, but it might make a difference in the rankings for less dense countries so I'll keep the original column.

Since I'm going to save it to a csv I'll parse the date first.

cleaned[Columns.date] = pandas.to_datetime(cleaned[Columns.date])

Even though the table is meant for population density I think I'll save it sorted by area.

cleaned = cleaned.sort_values(by=Columns.area_cleaned, ascending=False)

And now to save it.

path = Path("apeiron/wikipedia")
assert path.is_dir()
cleaned.to_csv(path/"countries-by-area-population-density.csv", index=False)

Table of Contents

Beginning

Set Up

Imports

Plotting

Set Up Tabulate

Set Up Selenium

Middle

Grabbing Some HTML

Build the Table

The Last Two Rows

Plotting

By Area

Just the Population Table

Countries by Area

Plotting Again

By Area

Biggest

Getting Rid of the Notes

Smallest

By Population

Most Populated

Least Populous

By Density

Densest

Sparsest

End