International Olympiad in Informatics

Introduction

This is a demonstration of converting an HTML table into some (possibly) meaningful values. I'm going to pull the data from the Wikipedia page for the International Olympiad in Informatics. This is their description for it:

The International Olympiad in Informatics (IOI) is an annual competitive programming competition for secondary school students. It is the second largest olympiad, after International Mathematical Olympiad, in terms of number of participating countries (IOI 2014 saw participation of 84 countries). The first IOI was held in 1989 in Pravetz, Bulgaria.

The contest consists of two days of computer programming and problem-solving of algorithmic nature. To deal with problems involving very large amounts of data, it is necessary to have not only programmers, "but also creative coders, who can dream up what it is that the programmers need to tell the computer to do. The hard part isn't the programming, but the mathematics underneath it."[1] Students at the IOI compete on an individual basis, with up to four students competing from each participating country (with 81 countries in 2012). Students in the national teams are selected through national computing contests, such as the Australian Informatics Olympiad, British Informatics Olympiad, Indian Computing Olympiad or Bundeswettbewerb Informatik (Germany).

The International Olympiad in Informatics is one of the most prestigious computer science competitions in the world. UNESCO and IFIP are patrons.

The original assignment did some comparisons between the Summer and Winter Olympics, but since that doesn't apply here I'm going to compare it to the ACM International Collegiate Programming Contest.

Imports

# python standard library
from http import HTTPStatus
# from pypi
from requests_html import HTMLSession
from tabulate import tabulate
import pandas

Constants

class ACMPage:
    url = "https://en.wikipedia.org/wiki/ACM_International_Collegiate_Programming_Contest"
    by_year = "table.wikitable"
    by_year_index = 2
class Page:
    url = "https://en.wikipedia.org/w/index.php?title=International_Olympiad_in_Informatics&oldid=836053213"
    medals_table = 'table.wikitable.sortable'
    repeat_winners = "table.wikitable[style='margin:auto']"

Grabbing the Data

The IOI page Page

session = HTMLSession()
ioi_response = session.get(Page.url)
print(ioi_response.status_code)
assert ioi_response.status_code == HTTPStatus.OK
print(ioi_response.html.find("h1", first=True).text)

The Medals table

table = ioi_response.html.find(Page.medals_table, first=True)
medals_table = pandas.read_html(table.html, header=0,)[0]
print(tabulate(medals_table, headers='keys', showindex=False, tablefmt='orgtbl'))

We don't want the totals row so I'll delete it, along with the Total column and the Rank.

medals_table = medals_table[medals_table.Rank != "Total"]
del medals_table["Total"]
del medals_table["Rank"]
print(tabulate(medals_table, headers='keys', showindex=False, tablefmt='orgtbl'))

I guess I don't need the row totals either.

print(medals_table.describe())

Gold Silver Bronze Total count 10.000000 10.00000 10.000000 10.000000 mean 37.500000 36.60000 21.500000 95.600000 std 17.902514 8.97156 9.046178 17.715028 min 21.000000 20.00000 8.000000 49.000000 25% 24.250000 34.50000 12.750000 95.250000 50% 33.000000 37.00000 24.000000 99.500000 75% 44.000000 40.25000 28.750000 103.750000 max 77.000000 52.00000 34.000000 115.000000

The Repeat Winners Table

This table lists competitors that have won more than once.

repeat_table = ioi_response.html.find(Page.repeat_winners, first=True)
repeat_table = pandas.read_html(repeat_table.html, header=0)[0]
print(tabulate(repeat_table.head(), headers='keys', showindex=False, tablefmt='orgtbl'))

The ACM Page

acm_response = session.get(ACMPage.url)
assert acm_response.status_code == HTTPStatus.OK
print(acm_response.html.find("h1", first=True).text)

The Winner by Year

The tables on this page don't really have enough information to select them directly so you have to know which one of them to grab (e.g. the first, second, etc.).

acm_by_year = acm_response.html.find(ACMPage.by_year)[ACMPage.by_year_index]
acm_by_year = pandas.read_html(acm_by_year.html, header=0)[0]
print(tabulate(acm_by_year.head(), headers='keys', showindex=False,
	       tablefmt='orgtbl'))

Questions

Source

Wikipedia contributors. (2018, April 12). International Olympiad in Informatics. In Wikipedia, The Free Encyclopedia. Retrieved 19:21, April 19, 2018, from https://en.wikipedia.org/w/index.php?title=International_Olympiad_in_Informatics&oldid=836053213