Selenium To Soup To Pandas

Pandas has a read_html function that automatically converts any HTML tables it finds into DataFrames. If you need to clean up the tables first, or you're still exploring the HTML, it can be useful to work with it in Beautiful Soup beforehand. And if the table is rendered on the page with JavaScript, Selenium can grab the page for you after it has been rendered.

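Start Selenium (headless) and grab the page.

from selenium import webdriver

options = webdriver.FirefoxOptions()
# Run Firefox without opening a visible window.
options.add_argument("-headless")
browser = webdriver.Firefox(options=options)
# Load the page; url is a placeholder for the address you want to scrape.
browser.get(url)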


Give it to Beautiful Soup.

from bs4 import BeautifulSoup

# Name the parser explicitly so Beautiful Soup doesn't have to guess (and warn).
soup = BeautifulSoup(browser.page_source, "html.parser")

Do stuff with the soup (not shown).
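
As an illustration of the kind of cleanup this step might involve, here is a small sketch that strips footnote markers out of a table before it goes to pandas. The HTML is a made-up stand-in for a real page, and the names demo_html and demo_soup are only for this example.

```python
from bs4 import BeautifulSoup

# Made-up table with footnote markers that would pollute the cell values.
demo_html = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Springfield<sup>[1]</sup></td><td>30,000</td></tr>
  <tr><td>Shelbyville<sup>[2]</sup></td><td>25,000</td></tr>
</table>
"""

demo_soup = BeautifulSoup(demo_html, "html.parser")

# Remove every <sup> element (and the text inside it) from the tree in place.
for marker in demo_soup.find_all("sup"):
    marker.decompose()

# Collect the first cell of each data row to check that the markers are gone.
cities = [row.find("td").get_text() for row in demo_soup.find_all("tr")[1:]]
print(cities)
```

Because decompose changes the tree in place, anything you pull out of the soup afterwards already has the markers removed.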

Optionally grab the table you want - in this case I want the last one.

soup_table = soup.find_all("table")[-1]

If there's only one table, or you just want the first one, you can use find instead.

soup_table = soup.find("table")

Pass it to pandas.

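from io import StringIO

import pandas

# Wrap the HTML in StringIO: recent pandas versions warn when given a raw HTML string.
tables = pandas.read_html(StringIO(str(soup_table)))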

tables is a list of DataFrames, one for each table that pandas found, even if there's only one, so you still need to pick out the one you want.

table = tables[0]
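
To make the list behaviour concrete, here is a minimal sketch using a made-up HTML string with two tables instead of a live page. It assumes pandas has one of its HTML parsers installed (lxml, or Beautiful Soup with html5lib). The names demo_html, demo_tables, and last_table are only for this example.

```python
from io import StringIO

import pandas

# Made-up HTML containing two separate tables.
demo_html = """
<table><tr><th>a</th><th>b</th></tr><tr><td>1</td><td>2</td></tr></table>
<table><tr><th>c</th></tr><tr><td>3</td></tr><tr><td>4</td></tr></table>
"""

# read_html returns one DataFrame per <table>, in document order.
demo_tables = pandas.read_html(StringIO(demo_html))
print(len(demo_tables))

# Negative indexing works as on any list, so this picks the last table.
last_table = demo_tables[-1]
print(last_table)
```

Indexing the list this way does the same job as picking the table out with find_all in the soup, so which one you use mostly depends on whether you needed the soup for something else.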

From what I understand from the documentation, pandas parses the HTML itself (using lxml by default and falling back to Beautiful Soup with html5lib), so if the tables come out okay and you don't need to mess around with the HTML tree beforehand, you can skip the soup.

from io import StringIO

import pandas

table = pandas.read_html(StringIO(browser.page_source))[0]