Introduction To Altair: Countries Per Year

Setup

Imports

These first imports just make writing this post easier; the altair plots don't strictly need them.

# python
from functools import partial
from pathlib import Path
from pprint import pprint

import json
import os
import re

# pypi
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from expects import be, be_true, equal, expect
from tabulate import tabulate

# monkey
from graeae.visualization.altair_helpers import output_path, save_chart

These are the ones that are really needed for the plotting. I installed both of them through pypi.

import altair
import pandas

Some Setting Up

These are some convenience objects to save a little bit of coding when saving the chart.

SLUG = "introduction-to-altair-countries-per-year"
OUTPUT_PATH = output_path(SLUG)

HEIGHT, WIDTH = 600, 800
SAVE_IT = partial(save_chart, output_path=OUTPUT_PATH, height=HEIGHT + 100)

SOUPER = partial(BeautifulSoup, features="lxml")

This is to make printing out a pandas dataframe as a table a little nicer.

TABLE = partial(tabulate,
                headers="keys",
                tablefmt="orgtbl",
                showindex=False)

The Data

load_dotenv(override=True)
table_path = Path(os.getenv("WORLD_HAPPINESS_TABLE"))

expect(table_path.is_file()).to(be_true)

table = pandas.read_csv(table_path)
print(table.shape)

(2199, 11)

The Data Columns

def column_printer(table, headers=("Column", "Type")):
    print(TABLE(
        ((column, str(table[column].dtype))
         for column in table.columns),
        headers=headers))
    return

column_printer(table)

Column	Type
Country name	object
year	int64
Life Ladder	float64
Log GDP per capita	float64
Social support	float64
Healthy life expectancy at birth	float64
Freedom to make life choices	float64
Generosity	float64
Perceptions of corruption	float64
Positive affect	float64
Negative affect	float64

For this initial post I'll only use the year column.

class Column:
    year = "year"

Counting the Years

Using Pandas' `value_counts` Method

year_counts = table.year.value_counts().reset_index().sort_values("year")

table_counts = year_counts.T
table_counts.columns = table_counts.iloc[0]
table_counts = table_counts.drop(table_counts.index[0])
print(TABLE(table_counts, showindex=True))

	2005	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016	2017	2018	2019	2020	2021	2022
count	27	89	102	110	114	124	146	141	136	144	142	141	147	141	143	116	122	114
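One thing to watch for: the column names that come out of `value_counts().reset_index()` changed in pandas 2.0 (the counts column is now named "count"), so the sort above assumes a recent pandas. Here's a minimal sketch of an alternative that names the counts column explicitly. The `sample` frame is made-up data standing in for the real table.

```python
import pandas

# Made-up rows standing in for the happiness table (not the real data).
sample = pandas.DataFrame({
    "Country name": ["A", "B", "A", "B", "C", "A"],
    "year": [2006, 2006, 2005, 2007, 2006, 2007],
})

# `size` counts the rows in each year group (groupby sorts the years),
# and `reset_index(name=...)` names the counts column explicitly.
sample_counts = (sample.groupby("year")
                 .size()
                 .reset_index(name="count"))

print(sample_counts)
```

The result has one row per year with "year" and "count" columns, which is the same shape the bar chart expects.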

Now as a bar-chart.

value_counts_chart = altair.Chart(year_counts).mark_bar().encode(
    x="{}:N".format(Column.year),
    y="count").properties(height=HEIGHT, width=WIDTH)

VALUE_COUNTS_NAME = "value-counts-bar-chart"
VALUE_COUNTS_HTML = VALUE_COUNTS_NAME + ".html"
SAVE_IT(value_counts_chart, VALUE_COUNTS_NAME)

Using Altair's "count"

altair_counts_chart = altair.Chart(table).mark_bar().encode(
    x="{}:N".format(Column.year),
    y="count()").properties(height=HEIGHT, width=WIDTH)

ALTAIR_COUNTS_NAME = "altair-counts-bar-chart"
ALTAIR_COUNTS_HTML = ALTAIR_COUNTS_NAME + ".html"
SAVE_IT(altair_counts_chart, ALTAIR_COUNTS_NAME)

Comparing the File Sizes

The Files In Bytes

altair_counts_html = OUTPUT_PATH/(ALTAIR_COUNTS_HTML)
pandas_counts_html = OUTPUT_PATH/(VALUE_COUNTS_HTML)
print("Altair counts(): {:,} bytes".format(altair_counts_html.stat().st_size))
print("Pandas value_counts: {:,} bytes".format(pandas_counts_html.stat().st_size))

Altair counts(): 685,111 bytes
Pandas value_counts: 2,067 bytes

Here's one of the problems with altair: it passes along the entire dataset and then tells vega to work with it in the browser. So in this case it's passing all of our happiness data, even though the chart only needs the year column.

with altair_counts_html.open() as reader:
    altair_soup = SOUPER(reader)

with pandas_counts_html.open() as reader:
    pandas_soup = SOUPER(reader)

def data_printer(soup: BeautifulSoup, index:int=0) -> None:
    """Gets the data from the soup and prints the entry

    Params:

     - soup: BeautifulSoup with the HTML for the chart
     - index: which data row to show
    """
    EVERYTHING = ".*"
    EXTRA_BRACE = "(?=})"

    DATASETS_EXPRESSION = "datasets" + EVERYTHING + "}}"
    DATASET_EXPRESSION = "{" + EVERYTHING + "}" + EXTRA_BRACE

    script = soup.find_all("script")[-1].string
    dataset = re.search(DATASETS_EXPRESSION, script).group()
    dataset = re.search(DATASET_EXPRESSION, dataset).group()
    json_dataset = json.loads(dataset)
    data_key = list(json_dataset.keys())[0]
    data = json_dataset[data_key]

    print("'dataset' has {:,} data entries\n".format(len(data)))
    print("Entry {}:\n".format(index))
    pprint(data[index])
    return

data_printer(pandas_soup)

'dataset' has 18 data entries

Entry 0:

{'count': 27, 'year': 2005}
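The regular expressions in data_printer depend on how the saved HTML happens to be laid out. As a hedged alternative, here's a sketch that lets the json module find where the object ends instead. The `script` string and the "var spec =" marker are assumptions for illustration, so check them against the real file before relying on this.

```python
import json

# Hypothetical stand-in for the last <script> of a saved chart; the real
# file's layout may differ.
script = ('(function(vegaEmbed) { var spec = {"datasets": {"data-abc": '
          '[{"year": 2005, "count": 27}]}, "mark": "bar"}; '
          'vegaEmbed("#vis", spec); })(vegaEmbed);')


def json_after(text: str, marker: str) -> dict:
    """Decode the first JSON object that follows marker in text."""
    after_marker = text.index(marker) + len(marker)
    start = text.index("{", after_marker)
    # raw_decode stops at the end of the object and ignores what follows
    decoded, _end = json.JSONDecoder().raw_decode(text, start)
    return decoded


spec = json_after(script, marker="var spec =")
rows = next(iter(spec["datasets"].values()))
print("'dataset' has {:,} data entries".format(len(rows)))
print(rows[0])
```

The trade-off is that this needs a marker that reliably comes just before the spec, while the regular expressions need the closing braces to line up a certain way.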

def frame_print(frame: pandas.DataFrame, index: int=0) -> None:
    """print length and one row of frame

    Params:

     - frame: data-frame to query
     - index: index of row to print
    """
    print("Frame has {:,} rows.".format(len(frame)))
    print("\nRow {}:\n".format(index))
    print(frame.iloc[index])
    return

frame_print(year_counts)

Frame has 18 rows.

Row 0:

year     2005
count      27
Name: 17, dtype: int64

data_printer(altair_soup)

'dataset' has 2,199 data entries

Entry 0:

{'Country name': 'Afghanistan',
 'Freedom to make life choices': 0.718,
 'Generosity': 0.168,
 'Healthy life expectancy at birth': 50.5,
 'Life Ladder': 3.724,
 'Log GDP per capita': 7.35,
 'Negative affect': 0.258,
 'Perceptions of corruption': 0.882,
 'Positive affect': 0.414,
 'Social support': 0.451,
 'year': 2008}

frame_print(table)

Frame has 2,199 rows.

Row 0:

Country name                        Afghanistan
year                                       2008
Life Ladder                               3.724
Log GDP per capita                         7.35
Social support                            0.451
Healthy life expectancy at birth           50.5
Freedom to make life choices              0.718
Generosity                                0.168
Perceptions of corruption                 0.882
Positive affect                           0.414
Negative affect                           0.258
Name: 0, dtype: object

There's a project called vegafusion that is supposed to help reduce the size, but it requires a jupyter notebook for interactivity (it uses python to make a jupyter widget or some such), so it won't work for a static site like this one. So when using altair we have to think about what we're passing in if the size of the files is going to be a problem. In most cases it probably makes sense to do the transformations in pandas first and then pass altair only the data it needs to plot.

See the altair documentation on Large Datasets for more information.
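To get a rough feel for what trimming and pre-aggregating buys you, here's a small sketch that compares JSON payload sizes without involving altair at all. The `sample` frame is made-up data, and `payload_size` is a hypothetical helper that assumes the embedded data looks roughly like pandas' JSON records, so treat the numbers as relative rather than exact.

```python
import pandas

# Made-up stand-in for the happiness table (not the real data).
sample = pandas.DataFrame({
    "Country name": ["A", "B", "C"] * 4,
    "year": [2005, 2006, 2006, 2007] * 3,
    "Life Ladder": [3.7, 4.4, 4.8, 3.6] * 3,
    "Generosity": [0.17, 0.19, 0.12, 0.16] * 3,
})


def payload_size(frame: pandas.DataFrame) -> int:
    """Bytes needed to store the frame as JSON records."""
    return len(frame.to_json(orient="records").encode("utf-8"))


everything = payload_size(sample)
# keep only the column the chart uses
year_only = payload_size(sample[["year"]])
# count in pandas so there is only one row per year
counts = payload_size(
    sample.groupby("year").size().reset_index(name="count"))

print("All columns: {:,} bytes".format(everything))
print("Year column only: {:,} bytes".format(year_only))
print("Pre-counted: {:,} bytes".format(counts))
```

Dropping the unused columns helps, but counting first helps more because the number of rows shrinks too.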

A Chart, Part By Part

Altair's Chart

chart = altair.Chart(year_counts)
print(type(chart))
expect(chart.data).to(be(year_counts))

<class 'altair.vegalite.v5.api.Chart'>

The Chart class is defined in altair.vegalite.v5.api. This is its docstring description:

Create a basic Altair/Vega-Lite chart.

Although it is possible to set all Chart properties as constructor attributes, it is more idiomatic to use methods such as mark_point(), encode(), transform_filter(), properties(), etc. See Altair's documentation for details and examples: http://altair-viz.github.io/.

The attributes set by the Chart class' constructor (it also accepts other keyword parameters that are passed to its parent classes) are:

data
encoding
mark
width
height

By default they're set to Undefined which is an altair-defined object (see altair.utils.schemapi), and as noted, you don't normally set the attributes using the constructor (other than data which isn't mentioned in the docstring but appears to be passed to the Chart constructor by convention).
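My reading of why there's a separate Undefined object instead of just using None is that None can be a meaningful value in its own right (it becomes null in JSON), so something else has to mean "not set". Here's a minimal sketch of that sentinel pattern. `UndefinedType` and `to_spec` are hypothetical stand-ins, not altair's implementation.

```python
class UndefinedType:
    """Hypothetical sentinel type; a stand-in, not altair's implementation."""
    _instance = None

    def __new__(cls):
        # Always hand back the same object so `is` comparisons work.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __repr__(self) -> str:
        return "Undefined"


Undefined = UndefinedType()


def to_spec(mark=Undefined, width=Undefined, title=Undefined) -> dict:
    """Keep only the settings that the caller actually set."""
    settings = {"mark": mark, "width": width, "title": title}
    return {key: value for key, value in settings.items()
            if value is not Undefined}


# title=None is kept because None isn't the same as "not set"
print(to_spec(mark="bar", title=None))
```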

Here's a diagram of the Chart (defined in altair.vegalite.v5.api).

[Class diagram of Chart]

A Bar Chart

Once we have a chart object we tell altair that we want it to be a bar chart using the mark_bar method.

bar_chart = chart.mark_bar()
print(type(bar_chart))

<class 'altair.vegalite.v5.api.Chart'>

The mark_ methods are defined in the MarkMethodMixin class (a parent of Chart), which is defined in the altair.vegalite.v5.schema.mixins module.

MarkMethodMixin Class

Looking at the mark_bar method, there are a lot of arguments you could pass to it, but fundamentally all it's really doing is making a copy of the chart, setting the copy's mark attribute to bar, and then returning the copy.

print("Original Chart mark: '{}'".format(chart.mark))
print("Bar Chart mark: '{}'".format(bar_chart.mark))

expect(bar_chart).to_not(be(chart))

Original Chart mark: 'Undefined'
Bar Chart mark: 'bar'
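The copy-then-set behavior can be sketched with a toy class. `TinyChart` is a hypothetical stand-in, not altair's Chart, and the shallow copy is an assumption made to keep the sketch short.

```python
import copy


class TinyChart:
    """Hypothetical stand-in for a chart; not altair's Chart."""

    def __init__(self, data, mark=None):
        self.data = data
        self.mark = mark

    def mark_bar(self) -> "TinyChart":
        # Work on a copy so the chart this was called on stays unchanged.
        duplicate = copy.copy(self)
        duplicate.mark = "bar"
        return duplicate


original = TinyChart(data=[{"year": 2005, "count": 27}])
barred = original.mark_bar()
print(original.mark, barred.mark)
```

Because every call hands back a new object, the calls can be chained without any of them changing the chart you started with.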

SchemaBase

[Class diagram of SchemaBase, defined in altair.utils.schemapi]

There are many more methods in altair.utils.schemapi.SchemaBase, but I'm highlighting copy because the other classes use it quite a bit even though it's defined in this somewhat obscure place. It behaves the way you'd expect, so I don't see a need to go over it, but it's one of those mystery methods that pop up when you use deep inheritance like this and make you wonder where they came from, so I'll document it here, for now.

TopLevelUnitSpec

If you look at the parents of the Chart you might notice that it doesn't have the SchemaBase as one of its parents. So how does it end up with the copy method? Well, it does have the core.TopLevelUnitSpec as one of its parents and that in turn (eventually) inherits from the SchemaBase.

[Class diagram of Chart inheriting from TopLevelUnitSpec, which inherits from SchemaBase]

I didn't put in the modules for the core classes since they are fairly deep.
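If you'd rather not dig through the source to find where a method comes from, python's method resolution order will tell you. Here's a sketch using toy classes that only borrow the names from above (the real hierarchy has more layers). `defining_class` is a hypothetical helper.

```python
class SchemaBase:
    def copy(self):
        return "copied"


class TopLevelUnitSpec(SchemaBase):
    pass


class MarkMethodMixin:
    def mark_bar(self):
        return "bar"


class Chart(MarkMethodMixin, TopLevelUnitSpec):
    pass


def defining_class(cls: type, method: str) -> type:
    """Return the first class in cls's resolution order that defines method."""
    for klass in cls.__mro__:
        if method in vars(klass):
            return klass
    raise AttributeError(method)


print([klass.__name__ for klass in Chart.__mro__])
print(defining_class(Chart, "copy").__name__)
print(defining_class(Chart, "mark_bar").__name__)
```

With altair installed, passing altair.Chart and "copy" to the same helper should point at SchemaBase, going by what's described above.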

Encoded

The encode method is where we tell altair which columns match which parts of the chart. In this case we're only setting the x and y axes.

encoded = bar_chart.encode(
    x="{}:N".format(Column.year),
    y="count")

print(type(encoded))

<class 'altair.vegalite.v5.api.Chart'>

_EncodingMixin

The encode method is defined in the _EncodingMixin class, one of the Chart's parents.

[Class diagram of _EncodingMixin]

The encode method takes in whatever combination of positional and keyword arguments you pass into it and then:

copies the Chart
updates the chart's encoding attribute
sets the copy's encoding attribute to an instance of the altair.vegalite.v5.schema.FacetedEncoding class.
returns the copy
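The steps above can be sketched with a toy class that keeps its encoding in a plain dict. `TinyEncodedChart` is a hypothetical stand-in: it only takes keyword arguments and doesn't use anything like FacetedEncoding.

```python
import copy


class TinyEncodedChart:
    """Hypothetical stand-in; not altair's Chart or FacetedEncoding."""

    def __init__(self, encoding=None):
        self.encoding = encoding if encoding is not None else {}

    def encode(self, **channels) -> "TinyEncodedChart":
        duplicate = copy.copy(self)
        # Start from the encodings already set, then add the new ones,
        # without touching the original's dict.
        merged = dict(self.encoding)
        merged.update(channels)
        duplicate.encoding = merged
        return duplicate


base = TinyEncodedChart()
with_x = base.encode(x="year:N")
with_both = with_x.encode(y="count")
print(base.encoding, with_x.encoding, with_both.encoding)
```

In this sketch a second call to encode adds to what the first call set instead of replacing it, and the earlier objects keep their own encodings.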

print(encoded.encoding)

FacetedEncoding({
  x: X({
    shorthand: 'year:N'
  }),
  y: Y({
    shorthand: 'count'
  })
})

Properties

propertied = encoded.properties(height=HEIGHT, width=WIDTH)
print(type(propertied))

<class 'altair.vegalite.v5.api.Chart'>

[Class diagram showing the to_dict, save, and properties methods]

Note: This is a huge class with more methods than I'm showing here. The only ones we've encountered so far are to_dict, save and properties. I used to_dict to show that the chart has all the data from the pandas DataFrame and save is buried in the code that saves the chart to display it in this post - properties is the only one we're really interested in here.

The first thing to note about the properties method is that it doesn't define any named arguments; it accepts arbitrary keyword arguments (and only keyword arguments, no positional ones). Then:

it makes a copy of the chart
validates the arguments (unless the argument is the data)
sets the arguments as attributes of the copy.
returns the copy
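Here's a sketch of those steps with a toy class. `TinyPropertiedChart` and its `ALLOWED` table are hypothetical, and the type check is a much simpler stand-in for whatever validation altair really does.

```python
import copy


class TinyPropertiedChart:
    """Hypothetical stand-in for a chart; not altair's implementation."""

    # Assumed set of settable properties and their types, for illustration.
    ALLOWED = {"height": int, "width": int, "title": str}

    def __init__(self):
        self.height = None
        self.width = None
        self.title = None

    def properties(self, **kwargs) -> "TinyPropertiedChart":
        duplicate = copy.copy(self)
        for name, value in kwargs.items():
            if name not in self.ALLOWED:
                raise AttributeError("unknown property: {}".format(name))
            if not isinstance(value, self.ALLOWED[name]):
                raise TypeError("{} should be {}".format(
                    name, self.ALLOWED[name].__name__))
            setattr(duplicate, name, value)
        return duplicate


plain = TinyPropertiedChart()
sized = plain.properties(height=600, width=800)
print(sized.height, sized.width, plain.height)

# Only **kwargs is declared, so python itself rejects positional arguments.
try:
    plain.properties(600)
    positional_rejected = False
except TypeError:
    positional_rejected = True
print(positional_rejected)
```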

Since we passed in height and width to the properties method, we get back a copy of our bar chart with the height and width set on the copy (as well as the "mark" which we set earlier with mark_bar).

print(propertied.mark)
print(propertied.width)
print(propertied.height)
expect(propertied.mark).to(equal("bar"))
expect(propertied.width).to(equal(WIDTH))
expect(propertied.height).to(equal(HEIGHT))

bar
800
600

HVPlot

Links

The Posts In This Series

Starting Post

Tutorial Sources

Fundamentals of Visualization with Dr. Danielle Albers Szafir: Coursera course where the original tutorial came from.
InfoVis Course from Marian Dork: University of Applied Sciences Potsdam course that the Coursera course cites as the source for their tutorial.

The Data

World Happiness Report Dataset [Internet]. [cited 2023 Dec 11]. Available from: https://www.kaggle.com/datasets/unsdsn/world-happiness
Inc G. Gallup.com. 2009 [cited 2023 Dec 11]. Understanding How Gallup Uses the Cantril Scale. Available from: https://news.gallup.com/poll/122453/Understanding-Gallup-Uses-Cantril-Scale.aspx
World Happiness Report [Internet]. 2023 [cited 2023 Dec 11]. Available from: https://worldhappiness.report/

Altair

Layered and Multi-View Charts — Vega-Altair 5.2.0 documentation [Internet]. [cited 2023 Dec 12]. Available from: https://altair-viz.github.io/user_guide/compound_charts.html
Large Datasets — Vega-Altair 5.2.0 documentation [Internet]. [cited 2024 Mar 8]. Available from: https://altair-viz.github.io/user_guide/large_datasets.html
