High School Friendship Networks

The Departure

This looks at data provided by SocioPatterns that looks a the interactions between students at a High School in Marseilles, France, during the month of December in 2013. In particular, I'm going to look at the Friendship Networks data - wherein the students reported who they were friends with.

Imports

From Python
from argparse import Namespace
from collections import Counter
from functools import partial
from pathlib import Path
import os
From PyPi
from bokeh.models import HoverTool, TapTool
from dotenv import load_dotenv
from holoviews import dim, opts
from holoviews.operation.datashader import bundle_graph
import holoviews
import hvplot.pandas
import networkx
import pandas as pandas
My Stuff
from graeae.timers import Timer
from graeae.tables import CountPercentage
from graeae.visualization import EmbedHoloview

Load the Dotenv

load_dotenv(".env")

Build the Timer

TIMER = Timer()

Setup The Plotting

holoviews.extension("bokeh")
SLUG = "high-school-friendship-networks/"
output = Path("../../files/posts/networks/" + SLUG)
Embed = partial(EmbedHoloview, folder_path=output)
Plot = Namespace(
    width = 1000,
    height = 800,
    graph_width = 800,
    graph_height = 800,
    small = 550,
    fontsize = 18,
    edge_color = "RoyalBlue",
    padding = dict(x=(-1.1, 1.1),
                   y=(-1.1, 1.1)),
)

The Initiation

Meta-Data

Let's take a look at the meta-data before loading it into pandas.

HIGH_SCHOOL = Path(os.environ.get("HIGH_SCHOOL")).expanduser()
assert HIGH_SCHOOL.is_dir()

#+begin_src python :session highschool :results none
class Files:
    metadata = "metadata_2013.txt"
    contact_diaries = "Contact-diaries-network_data_2013.csv"
    facebook = "Facebook-known-pairs_data_2013.csv"
    friendship = "Friendship-network_data_2013.csv"
    high_school = "High-School_data_2013.csv"
metadata_path = HIGH_SCHOOL.joinpath(Files.metadata)
assert metadata_path.is_file()
with metadata_path.open() as reader:
    for line in range(5):
        print(reader.readline(), end="")
650     2BIO1   F
498     2BIO1   F
627     2BIO1   F
857     2BIO1   F
487     2BIO1   F

This file has the meta-data for the students. The three columns are the student's ID, class, and gender. I don't know what class refers to here. From an American perspective this could mean a particular subject, e.g. Biology 101, or it could mean the number of years that the student has been in attendance, e.g. Freshman. The names suggest that they are subject related, but I don't know this for certain.

meta_data = pandas.read_csv(metadata_path, sep="\t", 
                            names=["id", "class", "gender"])
meta_data.loc[:, "class"] = meta_data["class"].astype("category")
meta_data.loc[:, "gender"] = meta_data.gender.astype("category")
print(len(meta_data))
print(len(meta_data.id.unique()))
329
329

So there's one entry for each student, meaning the student's can't belong to more than one class.

Classes

First a table of the counts.

table = CountPercentage(meta_data["class"]).holoviews_table.opts(
    width=Plot.width,
    height=256,
    fontsize=Plot.fontsize,
)
Embed(plot=table, file_name="class_table", height_in_pixels=270)()

Figure Missing

The fact that there are nine classes makes it seem like it's not likely they are 'classes' in the sense of freshman, sophomores, etc. At the same time, since each student only has one class, it doesn't seem like they are 'classes' in the sense of "Biology 101".

Now a bar-plot to look at how the classes are distributed.

grouped = meta_data.groupby(["class", "gender"]).agg(
    {"class": "count", "gender": "count"})
grouped.columns = ["class_count", "gender_count"]
grouped = grouped.reset_index()
grouped.loc[:, "class"]= grouped["class"].astype(str)
plot = grouped.hvplot.bar("class", "class_count", title="Class Counts by Gender", 
                          stacked=True,
                          by="gender", height=Plot.height, 
                          width=Plot.width,
                          ylabel="Count",
                          xlabel="Class",
                          tools=["hover"],
                          fontsize=Plot.fontsize).opts(xrotation=90)
Embed(plot=plot, file_name="gender_counts_stacked", height_in_pixels=Plot.height)()

Figure Missing

Link to Plot

This is a look at the same thing except not stacked.

plot = grouped.hvplot.bar(title="Class Counts by Gender", x="class", 
                          y="class_count",
                          xlabel="Class",
                          ylabel="Count",
                          by="gender", height=Plot.height, width=Plot.width, 
                          tools=["hover"],
                          fontsize=Plot.fontsize).opts(xrotation=90)
Embed(plot=plot, file_name="gender_counts", height_in_pixels=Plot.height)()

Figure Missing

Link to Plot

Strangely, the classes that start with 2BIO are more female while the others are more male.

Gender

A stacked bar plot to get a sense of not just the distribution among genders but among classes.

plot = grouped.hvplot.bar(title="Gender Counts", x="gender", y="gender_count",
                          stacked=True,
                          by="class", 
                          xlabel="Count",
                          ylabel="Gender",
                          fontsize=Plot.fontsize,
                          width=Plot.width,
                          height=Plot.height).opts(
                              legend_position="top_right",
                              xrotation=90, 
                              xlabel="Gender and Class")
Embed(plot=plot, file_name="class_counts_stacked", height_in_pixels=Plot.height)()

Figure Missing

Link to Plot

A non-stacked bar plot to get a better sense of how the genders fill the different classes.

plot = grouped.hvplot.bar(title="Gender Counts", x="gender", y="gender_count",
                          xlabel="Gender",
                          ylabel="Count",
                          by="class", 
                          height=Plot.height,
                          width=Plot.width,
                          fontsize=Plot.fontsize).opts(
                              xrotation=90, xlabel="Gender and Class")
Embed(plot=plot, file_name="class_counts", height_in_pixels=Plot.height)()

Figure Missing

Link to Plot

It looks like there were a little more males than females, but not a whole lot more.

The Friendship Network

This is a dataset that shows whether a student identified another student as their friend.

friendship_path = HIGH_SCHOOL.joinpath(Files.friendship)
assert friendship_path.is_file()
with friendship_path.open() as reader:
    for line in range(5):
        print(reader.readline(), end="")
1 55
1 205
1 272
1 494
1 779

The first column is the person who reported who his or her friends were and the second column is the person that was identified as a friend.

friendship_data = pandas.read_csv(friendship_path, delimiter=" ", 
                                  names=["reporter", "friend"])
friendship_data = friendship_data.dropna()

Looking at the Friendship Network

with TIMER:
    friendship_graph = networkx.convert_matrix.from_pandas_edgelist(
        friendship_data, "reporter", "friend", 
        create_using=networkx.DiGraph)
Started: 2019-04-29 12:04:26.556414
Ended: 2019-04-29 12:04:26.558285
Elapsed: 0:00:00.001871
genders = dict(zip(meta_data.id, meta_data.gender))
classes = dict(zip(meta_data.id, meta_data["class"]))
for node in friendship_graph.nodes:
    friendship_graph.nodes[node]["gender"] = genders[node]
    friendship_graph.nodes[node]["class"] = classes[node]

Plotting

Friendship Network Circular

By Gender
hover = HoverTool(
    tooltips = [
        ("Student Number", "@index"),
        ("Gender", "@gender"),
        ("Class", "@class"),
    ],
)

plot = holoviews.Graph.from_networkx(friendship_graph,
                                     networkx.circular_layout).redim.range(**Plot.padding).options(
                                         node_color=dim("gender"), cmap="Set1",
                                         tools=[hover, TapTool()],
                                         fontsize=Plot.fontsize,
                                         width=Plot.graph_width,
                                         height=Plot.graph_height,
                                         edge_line_color=Plot.edge_color,
                                         title="Friendship Network by Gender",
                                         xaxis=None,
                                         yaxis=None,
                                         directed=True).redim.range(**Plot.padding)
Embed(plot=plot, file_name="friendship_network_circular")()

Figure Missing

It's a little hard to see what's going on here, other than to note that you can see some people are more popular than others. The red nodes are male, the green nodes are female, and the blue is "unknown". Strangely, when I did the distributions earlier there were seven "unknown" but there's only one here…

print(meta_data[meta_data.gender=="Unknown"])
      id  class   gender
320   34     MP  Unknown
321   41     MP  Unknown
322  243     MP  Unknown
323  420     MP  Unknown
324   58    PC*  Unknown
325  209    PC*  Unknown
326  979  2BIO2  Unknown

There are seven id's, so there are really are seven unknowns, but for some reason the circle graph doesn't expose any other than the first (student 34). Maybe not all the students are in the data?

students = set(meta_data.id.unique())
reporters = set(friendship_data.reporter.unique())
print(f"Number of students in the meta-data: {len(students)}")
print(f"Number of students who reported who their friends were: {len(reporters)}")
Number of students in the meta-data: 329
Number of students who reported who their friends were: 133

So, it looks like not everyone took part in the survey.

reported = set(friendship_data.friend.unique())
print(f"Students not in graph: {len(students - (reporters & reported))}")
Students not in graph: 199

Okay, so not all the students are part of the study.

By Class
hover = HoverTool(
    tooltips = [
         ("Gender", "@gender"),
         ("Class", "@class"),
    ],
)

plot = holoviews.Graph.from_networkx(friendship_graph,
                                     networkx.circular_layout).opts(
                                         node_color=dim("class"), cmap="Set1",
                                         tools=[hover],
                                         fontsize=Plot.fontsize,
                                         width=800,
                                         height=800,
                                         edge_line_color=Plot.edge_color,
                                         title="Friendship Network by Class",
                                         xaxis=None, yaxis=None, directed=True).redim.range(**Plot.padding)
Embed(plot=plot, file_name="friendship_network_circular_class")()

Figure Missing

Link to Plot

Unfortunately there's a bug in HoloViews so I can't show a legend with Graphs, but I suppose since I don't know what the classes are, that doesn't really mean anything here.

Spring Layout

Class
plot = holoviews.Graph.from_networkx(friendship_graph, networkx.spring_layout, ).opts(
    node_color=dim("class"), cmap="Set1",
    tools=["hover"],
    width=800,
    height=800,
    edge_line_color=Plot.edge_color,
    title="Friendship Network By Class",
    xaxis=None, yaxis=None, directed=True,
    legend_position="right"
).redim.range(**Plot.padding)
Embed(plot=plot, file_name="friendship_network_class_spring", height_in_pixels=810)()

Figure Missing

Link to Plot

Unlike the circular plot, this plot shows that there are disconnected neighborhoods within the network and there seems to be a clustering by class.

Gender
plot = holoviews.Graph.from_networkx(friendship_graph, networkx.spring_layout, ).opts(
                                         node_color=dim("gender"), cmap="Set1",
                                         tools=["hover"],
                                         width=800,
                                         height=800,
                                         edge_line_color=Plot.edge_color,
                                         title="Friendship Network By Gender",
                                         xaxis=None, yaxis=None, directed=True).redim.range(**Plot.padding)
Embed(plot=plot, file_name="friendship_network_gender_spring", height_in_pixels=810)()

Figure Missing

Link to Plot

Interestingly, this view seems to show that there is also some clustering by gender.

Kawada Kamai Layout

Class
plot = holoviews.Graph.from_networkx(friendship_graph, networkx.kamada_kawai_layout, ).opts(
    node_color=dim("class"), cmap="Set1",
    tools=["hover", TapTool()],
    width=Plot.graph_width,
    height=Plot.graph_height,
    edge_line_color=Plot.edge_color,
    title="Friendship Network By Class (Kamada-Kawai)",
    xaxis=None, yaxis=None, directed=True,
    legend_position="right"
).redim.range(**Plot.padding)
Embed(plot=plot, file_name="friendship_network_class_kawada_kamai", height_in_pixels=810)()

Figure Missing

Link to Plot

This has more space between the nodes so it's a little easier to see the groups. Strangely, there's no isolated neighborhoods the way there is with the spring layout.

Gender
plot = holoviews.Graph.from_networkx(friendship_graph, networkx.kamada_kawai_layout, ).opts(
                                         node_color=dim("gender"), cmap="Set1",
                                         tools=["hover"],
                                         width=800,
                                         height=800,
                                         edge_line_color=Plot.edge_color,
                                         title="Friendship Network By Gender (Kamada Kawai)",
                                         xaxis=None, yaxis=None, directed=True).redim.range(**Plot.padding)
Embed(plot=plot, file_name="friendship_network_gender_kamada_kawai", height_in_pixels=810)()

Figure Missing

Link to Plot

Gender and Class Side-By-Side

graph = holoviews.Graph.from_networkx(friendship_graph, networkx.kamada_kawai_layout)
class_plot = graph.opts(
    node_color=dim("class"), cmap="Set1",
    tools=["hover"],
    width=Plot.small,
    height=Plot.small,
    edge_color=Plot.edge_color,
    title="Class",
    xaxis=None, yaxis=None, directed=True).redim.range(**Plot.padding)
gender = holoviews.Graph.from_networkx(friendship_graph, networkx.kamada_kawai_layout)
gender = gender.opts(
    node_color=dim("gender"), cmap="Set1",
    tools=["hover"],
    width=Plot.small,
    height=Plot.small,
    title="Gender",
    edge_color=Plot.edge_color,
    xaxis=None, yaxis=None, directed=True).redim.range(**Plot.padding)
layout = (class_plot + gender).opts(title="Friendship Network")
Embed(plot=layout, file_name="friendship_network_class_vs_gender", height_in_pixels=Plot.small+200)()

Figure Missing

It looks like the Class might be more significant in clustering than the Gender.

Degree Distribution

Total Degrees

degree_sequence = sorted((degree for node, degree in friendship_graph.degree()))
degree_counts = Counter(degree_sequence)
degrees, counts = zip(*degree_counts.items())
table = holoviews.Table({"Degrees": degrees, "Count": counts}, ["Degrees"], ["Count"])
plot = table.to.bars(kdims=["Degrees"], vdims=["Count"]).opts(
    width=Plot.width,
    height=Plot.height,
    fontsize=Plot.fontsize,
    title="Degree Distribution",
    tools=["hover"],
)
Embed(plot=plot, file_name="degree_distribution")()

Figure Missing

The Total Degrees for a node is the number of edges attached to it (either coming in or going out).

In-Degree Distribution

in_degree_sequence = sorted((degree for node, degree in friendship_graph.in_degree))
in_degree_counts = Counter(in_degree_sequence)
in_degrees, in_counts = zip(*in_degree_counts.items())

in_data = pandas.DataFrame.from_dict({"Degrees": in_degrees, "Count": in_counts})
in_data["Direction"] = "in"

plot = in_data.hvplot.bar(x="Degrees", y="Count").opts(
    width=Plot.width,
    height=Plot.height,
    fontsize=Plot.fontsize,
    title="In-Degree Distribution",
    tools=["hover"],
)
Embed(plot=plot, file_name="in_degree_distribution")()

Figure Missing

The in-degree represents the number of times a student (the node) was identified by someone else as a friend. Three people weren't identified as friends at all and the most common count was 2, although someone was identified 15 times.

Out-Degree Distribution

out_degree_sequence = sorted((degree for node, degree in friendship_graph.out_degree))
out_degree_counts = Counter(out_degree_sequence)
out_degrees, out_counts = zip(*out_degree_counts.items())

out_data = pandas.DataFrame.from_dict({"Degrees": out_degrees, "Count": out_counts})
out_data["Direction"] = "out"
# table = holoviews.Table(, ["Degrees"], ["Count"])
plot = out_data.hvplot.bar(x="Degrees", y="Count").opts(
    width=Plot.width,
    height=Plot.height,
    fontsize=Plot.fontsize,
    title="Out-Degree Distribution",
    tools=["hover"],
)
Embed(plot=plot, file_name="out_degree_distribution")()

Figure Missing

The out-degree is the number of other students a student identified as a friend.

In and Out Degrees

in_out = pandas.concat([in_data, 
                        out_data]).sort_values(by="Degrees")
plot = in_out.hvplot.bar(x="Degrees", y="Count", by="Direction").opts(
    width=Plot.width,
    height=Plot.height,
    fontsize=Plot.fontsize,
    title="In and Out-Degree Distribution",
    tools=["hover"],
)
Embed(plot=plot, file_name="in_and_out_degree_distribution")()

Figure Missing

Despite the fact that I sorted the data by degrees, the actual plot seems to have also sort it by degrees but treating them as strings instead of integers. I'm going to add the in-degree, out-degree and in-degree - out-degree (in minus out) as data for the nodes so that they'll be available in the plots.

for node in friendship_graph.nodes:
    friendship_graph.nodes[node]["In-Degree"] = friendship_graph.in_degree[node]
    friendship_graph.nodes[node]["Out-Degree"] = friendship_graph.out_degree[node]
    friendship_graph.nodes[node]["In-Out"] = friendship_graph.in_degree[node] - friendship_graph.out_degree[node]

Bokeh seemed to indicate that you could set the thickness of the edges using weights, but this doesn't seem to work, HoloViews appears to have changed something and I couldn't figure out how to make it work.

for start, end in friendship_graph.edges:
    friendship_graph[start][end]["in_weight"] = friendship_graph.in_degree[end]
    friendship_graph[start][end]["out_weight"] = friendship_graph.out_degree[start]
    friendship_graph[start][end]["weight"] = friendship_graph.in_degree[end]

Popularity

plot = holoviews.Graph.from_networkx(friendship_graph, networkx.spring_layout)
plot = plot.opts(
    node_color=dim("In-Degree"), cmap="Plasma",
    tools=["hover"],
    width=800,
    height=800,
    edge_color_index="In-Degree",
    edge_cmap="Spectral",
    title="Friendship Network In-Degree",
    xaxis=None, yaxis=None, directed=True).redim.range(**Plot.padding)
Embed(plot=plot, file_name="friendship_network_in_degree_spring", height_in_pixels=810)()

Figure Missing

The color of the nodes is related to the number of in-degrees it has (which represents the number of other students that stated a node was their friend). If it is dark purple then there are fewer in-degrees. If it is yellow than there are many in-degrees. So the yellow nodes are popular and the dark purple nodes not so much.

plot = bundle_graph(holoviews.Graph.from_networkx(friendship_graph, networkx.spring_layout))
plot = plot.opts(
    node_color=dim("In-Degree"), cmap="Plasma",
    tools=["hover"],
    width=800,
    height=800,
    edge_color_index="In-Degree",
    edge_cmap="Spectral",
    title="Friendship Network In-Degree",
    xaxis=None, yaxis=None, directed=True).redim.range(**Plot.padding)
Embed(plot=plot, file_name="friendship_network_in_degree_bundled_spring", height_in_pixels=810)()

Figure Missing

plot = holoviews.Graph.from_networkx(friendship_graph, networkx.circular_layout)
plot = plot.opts(
    node_color=dim("In-Degree"), cmap="Plasma",
    tools=["hover"],
    width=800,
    height=800,
    edge_color_index="In-Degree",
    edge_cmap="Spectral",
    title="Friendship Network In-Degree",
    xaxis=None, yaxis=None, directed=True).redim.range(**Plot.padding)
Embed(plot=plot, file_name="friendship_network_in_degree_circular", height_in_pixels=810)()

Figure Missing

plot = bundle_graph(holoviews.Graph.from_networkx(friendship_graph, networkx.circular_layout))
plot = plot.opts(
    node_color=dim("In-Degree"), cmap="Plasma",
    tools=["hover"],
    width=800,
    height=800,
    edge_color_index="In-Degree",
    edge_cmap="Spectral",
    title="Friendship Network In-Degree",
    directed=True).redim.range(**Plot.padding)
Embed(plot=plot, file_name="friendship_network_in_degree_bundled_circular", height_in_pixels=810)()

Figure Missing

plot = bundle_graph(holoviews.Graph.from_networkx(friendship_graph, networkx.kamada_kawai_layout))
plot = plot.opts(
    node_color=dim("In-Degree"), cmap="Plasma",
    tools=["hover"],
    width=800,
    height=800,
    edge_color_index="In-Degree",
    edge_cmap="Spectral",
    title="Friendship Network In-Degree (Kamada-Kawai)",
    xaxis=None, yaxis=None, directed=True).redim.range(**Plot.padding)
Embed(plot=plot, file_name="friendship_network_in_degree_bundled_kamada_kawai", height_in_pixels=810)()

Figure Missing

Gregariousness

The out-degree is the number of times a student identified other students as friends. I'll interpret this as gregariousness (or maybe neediness).

plot = holoviews.Graph.from_networkx(friendship_graph, networkx.spring_layout)
plot = plot.opts(
    node_color=dim("Out-Degree"), cmap="Plasma",
    tools=["hover"],
    width=800,
    height=800,
    edge_color_index="Out-Degree",
    edge_cmap="Spectral",
    title="Friendship Network Out-Degree",
    xaxis=None, yaxis=None, directed=True).redim.range(**Plot.padding)
Embed(plot=plot, file_name="friendship_network_out_degree_spring", height_in_pixels=810)()

Figure Missing

The color of the nodes is related to the number of out-degrees it has (which represents the number of students that a stated node identified as their friend). If it is dark purple than there are fewer out-degrees (loners?). If it is yellow than there are many out-degrees (the consider many to be their friends).

plot = bundle_graph(holoviews.Graph.from_networkx(friendship_graph, networkx.spring_layout))
plot = plot.opts(
    node_color=dim("Out-Degree"), cmap="Plasma",
    tools=["hover"],
    width=800,
    height=800,
    edge_color_index="Out-Degree",
    edge_cmap="Spectral",
    title="Friendship Network Out-Degree",
    xaxis=None, yaxis=None, directed=True).redim.range(**Plot.padding)
Embed(plot=plot, file_name="friendship_network_out_degree_bundled_spring", height_in_pixels=810)()

Figure Missing

plot = holoviews.Graph.from_networkx(friendship_graph, networkx.circular_layout)
plot = plot.opts(
    node_color=dim("Out-Degree"), cmap="Plasma",
    tools=["hover"],
    width=800,
    height=800,
    edge_alpha=0.25,
    edge_color_index="Out-Degree",
    edge_cmap="Spectral",
    title="Friendship Network Out-Degree",
    xaxis=None, yaxis=None, directed=True).redim.range(**Plot.padding)
Embed(plot=plot, file_name="friendship_network_out_degree_circular", height_in_pixels=810)()

Figure Missing

plot = bundle_graph(holoviews.Graph.from_networkx(friendship_graph, networkx.circular_layout))
plot = plot.opts(
    node_color=dim("Out-Degree"), cmap="Plasma",
    tools=["hover"],
    width=800,
    height=800,
    edge_color_index="Out-Degree",
    edge_cmap="Spectral",
    title="Friendship Network Out-Degree",
    xaxis=None, yaxis=None, directed=True).redim.range(**Plot.padding)
Embed(plot=plot, file_name="friendship_network_out_degree_bundled_circular", height_in_pixels=810)()

Figure Missing

plot = bundle_graph(holoviews.Graph.from_networkx(friendship_graph, networkx.kamada_kawai_layout))
plot = plot.opts(
    node_color=dim("Out-Degree"), cmap="Plasma",
    tools=["hover"],
    width=800,
    height=800,
    edge_color_index="Out-Degree",
    edge_cmap="Spectral",
    title="Friendship Network Out-Degree (Kamada-Kawai)",
    xaxis=None, yaxis=None, directed=True).redim.range(**Plot.padding)
Embed(plot=plot, file_name="friendship_network_out_degree_bundled_kamada_kawai", height_in_pixels=810)()

Figure Missing

plot = holoviews.Graph.from_networkx(friendship_graph, networkx.kamada_kawai_layout)
plot = plot.opts(
    node_color=dim("Out-Degree"), cmap="Plasma",
    tools=["hover"],
    width=800,
    height=800,
    edge_color_index="Out-Degree",
    edge_cmap="Spectral",
    title="Friendship Network Out-Degree (Kamada-Kawai)",
    xaxis=None, yaxis=None, directed=True).redim.range(**Plot.padding)
Embed(plot=plot, file_name="friendship_network_out_degree_unbundled_kamada_kawai", height_in_pixels=810)()

Figure Missing

In Vs Out

out = holoviews.Graph.from_networkx(friendship_graph, networkx.kamada_kawai_layout)
out = out.opts(
    node_color=dim("Out-Degree"), cmap="Plasma",
    tools=["hover"],
    width=Plot.small,
    height=Plot.small,
    edge_color=Plot.edge_color,
    title="Out Degree",
    xaxis=None, yaxis=None, directed=True).redim.range(**Plot.padding)
in_degree = holoviews.Graph.from_networkx(friendship_graph, networkx.kamada_kawai_layout)
in_degree = in_degree.opts(
    node_color=dim("In-Degree"), cmap="Plasma",
    tools=["hover"],
    width=Plot.small,
    height=Plot.small,
    title="In Degree",
    edge_color=Plot.edge_color,
    xaxis=None, yaxis=None, directed=True).redim.range(**Plot.padding)
layout = (in_degree + out).opts(title="Friendship Network")
Embed(plot=layout, file_name="friendship_network_degrees_with_in_vs_out", height_in_pixels=Plot.small+200)()

Figure Missing

It looks like the student with the highest in-degree isn't the student with the highest out-degree. It might be coincidental, but the student with the highest in-degree was female while the student with the highest out-degree was male.

Perception

The In Degree - Out Degree tells us how a student's perception of how many friends she has compares to how many people really think she's their friend. If it's negative than she thinks she has more friends than she has (delusional? optimistic?) and if it's positive than she has more friends than she thinks she does (modest? low self-esteem?).

plot = holoviews.Graph.from_networkx(friendship_graph, networkx.spring_layout)
plot = plot.opts(
    node_color=dim("In-Out"), cmap="Plasma",
    tools=["hover"],
    width=800,
    height=800,
    edge_color_index="Out-Degree",
    edge_cmap="Spectral",
    title="Friendship Network In Degree - Out Degree",
    xaxis=None, yaxis=None, directed=True).redim.range(**Plot.padding)
Embed(plot=plot, file_name="friendship_network_in_minus_out_spring", height_in_pixels=810)()

Figure Missing

The dark-purple nodes have the most out-degrees compared to their in-degrees and the yellow-nodes have the most in-degrees compared to out-degrees.

plot = holoviews.Graph.from_networkx(friendship_graph, networkx.kamada_kawai_layout)
plot = plot.opts(
    node_color=dim("In-Out"), cmap="Plasma",
    tools=["hover"],
    width=800,
    height=800,
    edge_color_index="Out-Degree",
    edge_cmap="Spectral",
    title="Friendship Network In Minus Out (Kamada-Kawai)",
    xaxis=None, yaxis=None, directed=True).redim.range(**Plot.padding)
Embed(plot=plot, file_name="friendship_network_in_minus_out_kamada_kawac", height_in_pixels=810)()

Figure Missing

Just Degrees

for node in friendship_graph.nodes:
    friendship_graph.nodes[node]["Degree"] = friendship_graph.degree[node]
plot = holoviews.Graph.from_networkx(friendship_graph, networkx.kamada_kawai_layout)
plot = plot.opts(
    node_color=dim("Degree"), cmap="Plasma",
    tools=["hover"],
    width=800,
    height=800,
    edge_color="RoyalBlue",
    title="Friendship Network Degrees",
    xaxis=None, yaxis=None, directed=True).redim.range(**Plot.padding)
Embed(plot=plot, file_name="friendship_network_degrees", height_in_pixels=810)()

Figure Missing

In and Out Groups

I'll add another synthetic metric - this time I'm going to color the nodes based on the total degree plus the in-degree minus the out-degree. If a student says they have more friends than was indicated by the other students this will reduce the degree and if they say they have fewer then this will make the degree higher.

for node in friendship_graph.nodes:
    friendship_graph.nodes[node]["In+Out"] = (
        friendship_graph.degree[node] + (
            friendship_graph.in_degree[node] 
            - friendship_graph.out_degree[node]))
plot = holoviews.Graph.from_networkx(friendship_graph, networkx.kamada_kawai_layout)
plot = plot.opts(
    node_color=dim("In+Out"), cmap="Plasma",
    tools=["hover"],
    width=800,
    height=800,
    title="Friendship Network Degrees + (In - Out) (Kamada-Kawai)",
    xaxis=None, yaxis=None, directed=True).redim.range(**Plot.padding)
Embed(plot=plot, file_name="friendship_network_degrees_plus_in_minus_out_kamada_kawai", height_in_pixels=810)()

Figure Missing

The more yellow a node is, the higher the total degree (minus the over-counting of friends), indicating a larger friendship network, and the bluer the node is, the lower the total degree indicating fewer friends.

in_out = holoviews.Graph.from_networkx(friendship_graph, networkx.kamada_kawai_layout)
in_out = in_out.opts(
    node_color=dim("In+Out"), cmap="Plasma",
    tools=["hover"],
    width=Plot.small,
    height=Plot.small,
    edge_color=Plot.edge_color,
    title="Degree + (In - Out)",
    xaxis=None, yaxis=None, directed=True).redim.range(**Plot.padding)
degree = holoviews.Graph.from_networkx(friendship_graph, networkx.kamada_kawai_layout)
degree = degree.opts(
    node_color=dim("Degree"), cmap="Plasma",
    tools=["hover"],
    width=Plot.small,
    height=Plot.small,
    title="Degree",
    edge_color=Plot.edge_color,
    xaxis=None, yaxis=None, directed=True).redim.range(**Plot.padding)
layout = (in_out + degree).opts(title="Friendship Network")
Embed(plot=layout, file_name="friendship_network_degrees_with_in_out", height_in_pixels=Plot.small+200)()

Figure Missing

It looks like adding the penalty (which penalizes over-reporting friends and rewards under-reporting) reduced the brightness of some of the nodes and made one degree look the most sociable. With the penalty, the brightest node is the female student that had the highest in-degree, without the penalty the brightest node is the male student who had the highest out-degree.

plot = holoviews.Graph.from_networkx(friendship_graph, networkx.spring_layout)
plot = plot.opts(
    node_color=dim("In+Out"), cmap="Plasma",
    tools=["hover"],
    width=800,
    height=800,
    title="Friendship Network Degrees + (In - Out) (Spring Layout)",
    xaxis=None, yaxis=None, directed=True).redim.range(**Plot.padding)
Embed(plot=plot, file_name="friendship_network_degrees_plus_in_minus_out_spring", height_in_pixels=810)()

Figure Missing

The Return

The Submission

I think the single most-interesting plot is the Degrees plus (in-degree - out-degree), but that might just be because I like the metric I came up with. A more informative plot might just be to plot the in and out degrees side by side.

out = holoviews.Graph.from_networkx(friendship_graph, networkx.kamada_kawai_layout)
out = out.opts(
    node_color=dim("Out-Degree"), cmap="Plasma",
    tools=["hover"],
    width=Plot.small,
    height=Plot.small,
    edge_color=Plot.edge_color,
    title="Out Degree",
    xaxis=None, yaxis=None, directed=True).redim.range(**Plot.padding)
in_degree = holoviews.Graph.from_networkx(friendship_graph, networkx.kamada_kawai_layout)
in_degree = in_degree.opts(
    node_color=dim("In-Degree"), cmap="Plasma",
    tools=["hover"],
    width=Plot.small,
    height=Plot.small,
    title="In Degree",
    edge_color=Plot.edge_color,
    xaxis=None, yaxis=None, directed=True).redim.range(**Plot.padding)
layout = (in_degree + out).opts(title="Friendship Network")
Embed(plot=layout, file_name="friendship_network_degrees_with_in_vs_out", 
      height_in_pixels=Plot.small+200, 
      add_link=True)()

Figure Missing

Link to Plot

Citations

  • R. Mastrandrea, J. Fournet, A. Barrat,

Contact patterns in a high school: a comparison between data collected using wearable sensors, contact diaries and friendship surveys. PLoS ONE 10(9): e0136497 (2015)

Amazon Related Products

Beginning

Description

This is an examination of data from the Stanford Large Network Dataset Collection. There are two parts to the dataset:

From the site's description:

Network was collected by crawling Amazon website. It is based on Customers Who Bought This Item Also Bought feature of the Amazon website. If a product i is frequently co-purchased with product j, the graph contains a directed edge from i to j.

The data was collected in June 01 2003.

Setting Up

Imports

From Python
from collections import Counter
from functools import partial
from pathlib import Path
import os
From PyPi
from bokeh.models import HoverTool
from dotenv import load_dotenv
from holoviews import opts
from holoviews.operation.datashader import datashade, bundle_graph
import holoviews
import modin.pandas as pandas
import networkx
import pandas as pandas_og
My Stuff
from graeae.timers import Timer
from graeae.visualization import EmbedHoloview

Load the Dotenv

load_dotenv(".env")

Build the Timer

TIMER = Timer()

Setup The Bokeh

holoviews.extension("bokeh")
output = Path("../../files/posts/networks/amazon-related-products/")
Embed = partial(EmbedHoloview, folder_path=output)

Load The Data

Normally I'd load this stuff in pandas, but let's take a look at the data first.

data_path = Path(os.environ.get("AMAZON_DATA")).expanduser()
assert data_path.is_file()
with data_path.open() as reader:
    for line in range(6):
        print(reader.readline(), end="")
# Directed graph (each unordered pair of nodes is saved once): Amazon0601.txt 
# Amazon product co-purchaisng network from June 01 2003
# Nodes: 403394 Edges: 3387388
# FromNodeId    ToNodeId
0       1
0       2

So it isn't a CSV, it's just a lot of pairs of nodes (edge definitions) separated by tabs (a Tab-Separated-Values file). So I guess we can load it with pandas. What about the meta-data?

metadata_path = Path(os.environ.get("AMAZON_METADATA")).expanduser()
assert metadata_path.is_file()
with metadata_path.open() as reader:
    for line in range(10):
        print(reader.readline(), end="")
# Full information about Amazon Share the Love products
Total items: 548552

Id:   0
ASIN: 0771044445
  discontinued product

Id:   1
ASIN: 0827229534
  title: Patterns of Preaching: A Sermon Sampler

So, it looks like there might be some information here to make it more human-readable, which would be useful, but there are a lot of points, so I don't know how useful it would be, but it turns out NetworkX will create a graph from a pandas dataframe edge-list so we might as well make use of everything at first and throw it away later if it's not useful.

Load the TSV

Take One (with modin)
with TIMER:
    edges = pandas.read_csv(data_path, delimiter="\t", 
                            names=["Source", "Target"], 
                            skiprows=4)
print(edges.shape)
Started: 2019-03-23 15:18:59.565073
Ended: 2019-03-23 15:18:59.751270
Elapsed: 0:00:00.186197
(3387392, 2)

If we look at the header we can see that there should be 3,387,388 edges, not 3,387,392… it looks like we picked up four extras.

print(edges.head())
                                              Source    Target
0  # Directed graph (each unordered pair of nodes...       NaN
1  # Amazon product co-purchaisng network from Ju...       NaN
2                     # Nodes: 403394 Edges: 3387388       NaN
3                                       # FromNodeId  ToNodeId
4                                                  0         1

It looks lke the skiprows argument doesn't worlk like I thought it does. According to the documentation:

skiprows : list-like, int or callable, optional Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

So I don't know why that didn't work, but it didn't… oh, well.

edges = edges.iloc[4:]
assert len(edges) == 3387388

We have the edges (and it was fairly painless), now how do we add the labels? Since we have edge-pairs it doesn't really make sense, now that I think about it. Maybe later it will make sense to re-map the ID numbers to titles.

with TIMER:
    id_graph = networkx.convert_matrix.from_pandas_edgelist(
        edges, "Source", "Target", 
        create_using=networkx.DiGraph)
Started: 2019-03-23 15:30:07.750906
Ended: 2019-03-23 15:31:08.065932
Elapsed: 0:01:00.315026
print(id_graph.order())
464045

So, we have a problem here since, according to the header, there should be 403,394 nodes, not 464,045.

nodes = set(edges.Source.unique()) | set(edges.Target.unique())
print(len(nodes))
464045
print(edges.head())
print(edges.tail())
  Source Target
4      0      1
5      0      2
6      0      3
7      0      4
8      0      5
         Source  Target
3387387  403392  121379
3387388  403392  190663
3387389  403393  318438
3387390  403393  326962
3387391  403393  403383
print(edges.Source.dtype)
object

I think that that initial problem with skipping rows might have messed things up a little.

with TIMER:
    edges.loc[:, "Source"] = edges.Source.astype(int)
    edges.loc[:, "Target"] = edges.Target.astype(int)
    print(len(set(edges.Source.unique()) | set(edges.Target.unique())))

That produced an error:

Exception: Internal Error. Please email bug_reports@modin.org with the traceback and command that caused this error.

Maybe modin is still not ready for prime time.

Take Two (with pandas)
with TIMER:
    edges = pandas_og.read_csv(data_path, delimiter="\t", 
                               names=["source", "target"], 
                               skiprows=4)
assert len(edges) == 3387388
Started: 2019-03-24 13:59:14.697145
Ended: 2019-03-24 13:59:15.142454
Elapsed: 0:00:00.445309

So that fixes the skiprows problem, first off.

with TIMER:
    id_graph = networkx.convert_matrix.from_pandas_edgelist(
        edges, "source", "target", 
        create_using=networkx.DiGraph)
Started: 2019-03-24 13:59:19.883328
Ended: 2019-03-24 13:59:26.128711
Elapsed: 0:00:06.245383
print(id_graph.order())
assert id_graph.order() == 403394
403394

So, using pandas fixes both our problems…

Middle

Reducing Our Data Set

It turns out that this graph is just too big - it takes too long to create and the browser won't load it even if you do build it. We'll need to reduce it somehow.

Load the Metadata

I didn't show enough lines to really see what a metadata entry looks like. Here's the full entry for the node with ID 1.

with metadata_path.open() as reader:
    for line in range(7):
        reader.readline()

    for line in range(12):
        print(reader.readline(), end="")
Id:   1
ASIN: 0827229534
  title: Patterns of Preaching: A Sermon Sampler
  group: Book
  salesrank: 396585
  similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X
  categories: 2
   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]
   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]
  reviews: total: 2  downloaded: 2  avg rating: 5
    2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9
    2003-12-14  cutomer: A2VE83MZF98ITY  rating: 5  votes:   6  helpful:   5

So we may be able to cut it down using groups, categories, or maybe even the "cutomer" ratings.

with metadata_path.open() as lines:
    groups = Counter([line.split()[-1] for line in lines if "group:" in line])
print(set(groups))
{'CE', 'Toy', 'Book', 'DVD', 'Games', 'Product', 'Sports', 'Remediation', 'Software', 'Video', 'Music'}
for group, count in groups.items():
    print("{}: {:,}".format(group, count))
Book: 393,561
Music: 103,144
DVD: 19,828
Video: 26,131
Toy: 8
Games: 1
Software: 5
Product: 1
CE: 4
Sports: 1
Remediation: 1
hover = HoverTool(tooltips=[
    ("Group", "@Group"),
    ("Count", "@Count{0,0}"),
],
                  formatters={"Count": "numeral"},
                  mode="vline",
)
plot = holoviews.Bars(groups.items(),
                     holoviews.Dimension("Group"), "Count").opts(
                         width=1000,
                         height=800,
                         tools=[hover],
                         title="Product Groups")
Embed(plot=plot, file_name="product_groups")()

Figure Missing

It looks like just using the books alone would be too large, and maybe the music as well. I think I'll have to do a little more exploration.

with metadata_path.open() as lines:
    categories = Counter([line.strip() for line in lines if "|Book" in line])
for category, count in categories.most_common(10):
    print("{}: {:,}".format(category, count))
|Books[283155]|Subjects[1000]|Business & Investing[3]|General[2612]: 18,437
|Books[283155]|Subjects[1000]|Reference[21]|General[408268]: 13,703
|Books[283155]|Subjects[1000]|Biographies & Memoirs[2]|General[2375]: 12,243
|Books[283155]|Subjects[1000]|Nonfiction[53]|Social Sciences[11232]|Sociology[11288]|General[11289]: 11,779
|Books[283155]|Subjects[1000]|Literature & Fiction[17]|General[10125]|Contemporary[10129]: 11,448
|Books[283155]|Subjects[1000]|Children's Books[4]|Ages 4-8[2785]|General[170062]: 11,439
|Books[283155]|Subjects[1000]|Nonfiction[53]|Education[10605]|General[10635]: 8,762
|Books[283155]|Subjects[1000]|Computers & Internet[5]|General[657762]: 8,661
|Books[283155]|Subjects[1000]|Health, Mind & Body[10]|Psychology & Counseling[11119]|General[11175]: 8,064
|Books[283155]|Subjects[1000]|Children's Books[4]|Ages 9-12[2786]|General[170063]: 8,008

It turns out there's 11,463 categories. Maybe too much to look at. What if we only look at the third token?

with metadata_path.open() as lines:
    categories = Counter([line.split("|")[3] for line in lines if "|Book" in line])
for category, count in categories.most_common(10):
    print("{}: {:,}".format(category, count))
Children's Books[4]: 134,299
Nonfiction[53]: 106,977
Religion & Spirituality[22]: 93,690
Literature & Fiction[17]: 84,721
Business & Investing[3]: 74,125
Professional & Technical[173507]: 67,693
Computers & Internet[5]: 66,742
Health, Mind & Body[10]: 66,380
Reference[21]: 50,115
History[9]: 48,131
hover = HoverTool(tooltips=[
    ("Subject", "@Subject"),
    ("Count", "@Count{0,0}"),
],
                  formatters={"Count": "numeral"},
                  mode="vline",
)
plot = holoviews.Bars(categories.items(),
                     holoviews.Dimension("Subject"), "Count").opts(
                         width=1000,
                         height=800,
                         tools=[hover],
                         xrotation=90,
                         title="Book Subjects")
Embed(plot=plot, file_name="book_subjects")()

Figure Missing

One problem I just realized is that if I try to use a single group or category and it turns out that the edges cross-over to another group or category then I'll be missing nodes that I need… This is turning out to be more than I want to deal with at this point - since I'm just trying to figure out how to make a visualization…

End

Citation

  • J. Leskovec, L. Adamic and B. Adamic. The Dynamics of Viral Marketing. ACM Transactions on the Web (ACM TWEB), 1(1), 2007.

Walking Through the PyViz Network Graphs Examples

Beginning

This is a walk-through of the pyviz networks graphs examples so I can check that I can replicate them.

Imports

Python

from pathlib import Path
from functools import partial
import os

PyPi

from bokeh.models import HoverTool
from dotenv import load_dotenv
from holoviews import opts, dim
from holoviews.operation.datashader import bundle_graph

import holoviews
import networkx
import numpy
import pandas

My Stuff

from graeae.visualization import EmbedBokeh
from graeae.tables import CountPercentage

Setup

The Plotting

class Plot:
    width = 800
    height = 800
    padding = 0.1
    fontsize = 24
holoviews.extension("bokeh")
defaults = dict(width=Plot.width, height=Plot.height, padding=Plot.padding, xaxis=None, yaxis=None,
                fontsize=Plot.fontsize)
holoviews.opts.defaults(
    opts.EdgePaths(**defaults),
    opts.Graph(**defaults),
    opts.Nodes(**defaults),
)

The Bokeh Embedder

OUTPUT_FOLDER = Path("../../files/posts/libraries/"
                     "walking-through-the-pyviz-network-graphs-examples/")
Embed = partial(EmbedBokeh, folder_path=OUTPUT_FOLDER)

Middle

A Simple Graph

You can define a graph by giving two arrays which together represent pairs of nodes connected by an edge (they call the first array the nodes and the second the edges). As an example we can create a graph where all the nodes (other than the first node) have an edge to the first node (Node 0 in this case).

NODE_COUNT = 10
NODES = numpy.arange(NODE_COUNT)
EDGES = numpy.ones(NODE_COUNT) * 5

graph = holoviews.Graph(((EDGES, NODES),)).opts(title="We're All Connected to Five")
Embed(graph, "first_graph")()

The arguments to the Graph looks odd, and it's not obvious from the docstring why you need to pass in a tuple of tuples, but you do.

If you hover over the nodes you'll see that they all connect to the one with index 5, the value we gave to the edges array.

Re-Using the Nodes and Edges

The Graph keeps the Nodes and EdgePaths as separate objects that are themselves plotable.

dimensions = dict(width=400, height=400)
plot = (graph.nodes + graph.edgepaths).opts(title="Nodes and Edges", 
                                            fontsize=Plot.fontsize, 
                                            width=400, height=400)
Embed(plot, "nodes_and_edges")()

Note that changing the title worked (although it lost the font-size unless I explicitly passed it in) but changing the plot size didn't, even when I tried the redim method instead. This might be another one of those problematic HoloViews things that I need to figure out.

Explicit Paths

The next example is supposedly about suppyling explicit paths instead of letting HoloViews create them but there's so much unexplained that I'm not going to bother trying to understand it for now.

def bezier(start, end, control, steps=numpy.linspace(0, 1, 100)):
    return ((1 - steps)**2 * start 
            + 2 * (1 - steps) * steps * control 
            + steps**2 * end)

x, y = graph.nodes.array([0, 1]).T

paths = []
for node_index in NODES:
    ex, ey = x[node_index], y[node_index]
    paths.append(numpy.column_stack([bezier(x[0], ex, 0), bezier(y[0], ey, 0)]))

padding = dict(x=(-1.2, 1.2), y=(-1.2, 1.2))
bezier_graph = holoviews.Graph(((EDGES, NODES), (x, y, NODES), paths)).redim.range(**padding)
Embed(bezier_graph, "bezier_graph")()

Hover

By default the hover tool will give you information about the Node you mouse-over, but you can set it to give you the nodes connected to an edge instead.

graph = graph.options(inspection_policy="edges", 
                      width=Plot.width, 
                      height=Plot.height, 
                      title="Edge Hover")
Embed(graph, "edge_hove_tool")()

There were several things to note here:

  • Although I wasn't able to change the size when I plotted the nodes and edges side by side, this plot came out at the small size I had tried to use until I set it myself
  • The hover is now triggered by hovering over the edges
  • Although this is an undirected graph, they define the nodes as 'start' and 'end'

You can change the colors for hover activation as well.

graph = graph.options(node_hover_fill_color="red",
                      edge_hover_line_color="pink")
Embed(graph, "colored_hover")()

More Information

If we create the Nodes ourselves we can add labels to them.

First get the values that the Graph created for the x and y coordinates of the nodes.

X, Y = graph.nodes.array([0, 1]).T

There's no explanation for the method that I could find, other than the docstring, so I'll just take that on faith. I don't know how you would get the values yourself without first creating a Graph then taking the values and creating a new one, which doesn't seem right.

Now create the labels in the same order as the nodes that they are labeling.

node_labels = ["Barbara"] * 5 + ["Gloria"] + ["Barbara"] * 4

Now create the nodes and the new graph. Note that the first argument is a tuple and the nodes gets passed in as that mystery argument that I was wondering about earlier.

hover = HoverTool(
    tooltips=[
        ("Name", "@Name")
    ]
)
nodes = holoviews.Nodes((X, Y, NODES, node_labels), vdims="Name")
graph = holoviews.Graph(((EDGES, NODES), nodes)).opts( 
    title="Labeled Nodes", 
    node_color="Name",
    tools=[hover],
    cmap="Set1")
Embed(graph, "named_nodes")()

Edge Weights and Colors

We can also change the thickness and color of the edges based on their weights. First we need to create the edge-weights.

edge_weights = numpy.random.rand(len(EDGES))

numpy.random.rand creates an array of numbers ranging from 0 to 1 (but not 1).

Once again we have to create a new graph, this time passing in the edge-weights (and I'll add the nodes too).

graph = holoviews.Graph(((EDGES, NODES, edge_weights), nodes), vdims="Weight").opts(
    opts.Graph(
    title="Edge Weights",
    inspection_policy="edges",
    node_color="Name",
    cmap="Set1",
    edge_color="Weight", 
    edge_cmap="viridis", 
    edge_line_width=holoviews.dim("Weight") * 10
        )
)
Embed(graph, "edge_weights")()

Note that the edge-hover doesn't use our node-labels, which is sort of disappointing. I wonder if there's a way to fix that.

Using A Dataset

I previously created the Nodes using the values from a Graph that had the Nodes, which seemed kind of circular, it turns out that the way to do it is to pass in a Dataset with the labels.

node_labels = ("Anne Barbara Carol Donna Eleanor "
               "Francis Gloria Helen Iris Janet").split()
node_stuff = holoviews.Dataset(node_labels, vdims="Name")
graph = holoviews.Graph(
    (
        (EDGES, NODES, edge_weights), 
        node_stuff), 
    vdims="Weight").opts(
    opts.Graph(
        title="Nodes From a Dataset",
        node_color="Name",
        cmap="Set1",
        edge_color="Weight", 
        edge_cmap="plasma",
        edge_line_width=holoviews.dim("Weight") * 10,
        tools=[hover],
    )
)

Embed(graph, "dataset_node_labels")()

Using NetworkX

Karate Club

First we'll load in the included karate club graph that comes with networkx. This was a graph created from the members of a karate club that eventually splintered in two. The nodes are members and an edge between two nodes meant that they socialized outside of the club.

karate_graph = networkx.karate_club_graph()

This isn't explained in the example, but each node in the graph is a dict that has a some information set on it that you can reference.

print(karate_graph.node[0])
{'club': 'Mr. Hi'}

We're going to re-use these options again so I'll store them in a variable.

karate_options = opts.Graph(title="NetworkX Karate Club",
                            node_color=holoviews.dim("club"), cmap="Set1", 
                            xlabel="", ylabel="")

Now we can create the HoloViews graph from the karate club graph.

graph = holoviews.Graph.from_networkx(karate_graph, 
                                      networkx.layout.circular_layout).opts(
                                          karate_options,
                                      )

Embed(graph, "karate_club")()

Animating the Layout

numpy.random.seed(0)

def get_graph(iteration: int) -> holoviews.Graph:
    """Creates a graph laid out using Fruchterman-Reingold force-direction

    Args:
     iteration: maximum number of iterations to run the algorithm
    """
    return holoviews.Graph.from_networkx(karate_graph,
                                         networkx.spring_layout, 
                                         iterations=iteration)

holo_map = holoviews.HoloMap({iteration: get_graph(iteration).opts(karate_options)
                              for iteration in range(0, 55, 5)},
                             kdims="Iterations")


filename = "animated_spring.html"
path = OUTPUT_FOLDER.joinpath(filename)
holoviews.save(holo_map, path)

print('''#+begin_export html
<object type="text/html" data="{}" style="width:100%" height={}>
  <p>Failed to load</p>
</object>
#+end_export'''.format(filename, Plot.height
))

Failed to load

This is the first time I managed to get the holoviews widgets to work inside a post - I'm going to need to update how I do this.

Facebook

In this section we'll work with a graph of Facebook users who have been associated with a 'circle' (of friends).

Turn off the x and y axes.

options = dict(width=800, height=800, xaxis=None, yaxis=None)
opts.defaults(opts.Nodes(**options), opts.Graph(**options))

Read in the dataframes for the edges and nodes.

load_dotenv(".env")
edge_path = Path(os.environ.get("FACEBOOK_EDGES")).expanduser()
node_path = Path(os.environ.get("FACEBOOK_NODES")).expanduser()

assert edge_path.is_file()
assert node_path.is_file()

edge_data = pandas.read_csv(edge_path)
node_data = pandas.read_csv(node_path)

print(edge_data.shape)
print(node_data.shape)
print(node_data.iloc[0])

So you can see that the circle is part of the node data.

circle_counts = CountPercentage(node_data.circle)
circle_counts()
Value Count Percentage
circle15 119 35.74
None 56 16.82
circle16 31 9.31
circle11 29 8.71
circle0 15 4.50
circle19 13 3.90
circle4 10 3.00
circle2 9 2.70
circle6 8 2.40
circle17 8 2.40
circle9 7 2.10
circle20 6 1.80
circle13 5 1.50
circle10 4 1.20
circle23 3 0.90
circle3 3 0.90
circle14 2 0.60
circle7 2 0.60
circle12 1 0.30
circle18 1 0.30
circle22 1 0.30

For some reason the second most circle value is None.

Set up the colors.

black = "#000000"
colors = [black] + holoviews.Cycle("Category20").values

Setup the nodes and the graph.

nodes = holoviews.Nodes(node_data).sort()
graph = holoviews.Graph((edge_data, nodes), label="Facebook Circles").opts(
    cmap=colors,
    node_size=10,
    edge_line_width=1,
    node_line_color="gray",
    node_color="circle",
)

Embed(graph, "facebook_circles")()

It's interesting that the most populus circle circle15 is the first circle after we sort, is the sort working by value counts? I tried setting the second color to red, but this got assigned to circle0 which isn't the second most common, so apparently not, but then why did it work for circle15?

Bundling

bundled = bundle_graph(graph).opts(title="Bundled Facebook Circles")
Embed(bundled, "bundled_graph")()

This view helps to eliminate some of the noise from the edges, but I wonder if it does that at the expense of disproportionately diminishing the presence of circle15 (the black circles).

Circles

We can select circles from the graph instead of showing all of them.

circles = circle_counts.table.Value.iloc[:5]
circle_plots = [bundled.select(circle=name, selection_mode="nodes") for name in circles]
plot = circle_plots[0]
for next_plot in circle_plots[1:]:
    plot *= next_plot
plot = plot.opts(title="Top Five Bundled Circles")
Embed(plot, "bundled_graph_circles")()

This view helps to eliminate some of the noise from the edges, but I wonder if it does that at the expense of disproportionately diminishing the presence of circle15 (the black circles).

Amazon Related Products Problem Description

Requirements

  1. Create and upload a visualization
  2. Note if a subset of the data was used
  3. Note particular features of note in the visualization
  4. Describe what the data and visualization show

Rubric

Proximate Layout

How well are related items placed near each other? Do the edges cross or do items overlap when perhaps they do not need to? Are the crossings distracting?

Design of the Visulization

Dose the visualization effectively use the assignment of variables to elements and design of a visualization?

Contest

How interesting is the result? Does this represent an interesting choice of data and/or an interesting way to display the data?

Walking Through the Pyviz Workflow Example

Beginning

This is a replication of (most of) the pyviz workflow introduction to see if I can replicate it. Nothing new here.

Set Up

Imports

From Python.

from functools import partial
from pathlib import Path
import os

From PyPi.

from bokeh.models import CrosshairTool, HoverTool
from dotenv import load_dotenv
from tabulate import tabulate

import holoviews
import hvplot.pandas
import numpy
import pandas

My Stuff.

from graeae.visualization import EmbedBokeh

Presets

Load the dotenv.

load_dotenv(".env")

Set up the bokeh.

OUTPUT_FOLDER = Path("../../files/posts/libraries/walking-through-the-pyviz-workflow-example/")
Embed = partial(EmbedBokeh, folder_path=OUTPUT_FOLDER)
holoviews.extension("bokeh", width=800)

Set up the table printer.

TABLE = partial(tabulate, tablefmt="orgtbl", headers="keys", showindex=False)

Constants

class Plots:
    width = 1000
    height = 600
    font_size = 18

Load Some Data

Measles and pertusis by state.

diseases = pandas.read_csv(Path(os.environ.get("DISEASES")))

What does the data look like?

print(TABLE(diseases.head(5)))
Year Week State measles pertussis
1928 1 Alabama 3.67 nan
1928 2 Alabama 6.25 nan
1928 3 Alabama 7.95 nan
1928 4 Alabama 12.58 nan
1928 5 Alabama 8.03 nan
print(TABLE(diseases.sample(5)))
Year Week State measles pertussis
1945 18 Tennessee 2.26 0.66
1988 7 Mississippi nan nan
1973 14 North Dakota 0 nan
1963 29 Montana 5.55 nan
1932 13 Utah 0.39 nan

It looks like this data set has weekly measles… counts? I don't know why the values are fractions. It also looks like pertusis (whooping cough) is missing values.

Looking at the Missing Data

missing_measles = diseases[diseases.measles.isna()]
print("Missing Measles: {:.2f} %".format(
    100 * len(missing_measles)/len(diseases)))

missing_pertussis = diseases[diseases.pertussis.isna()]
print("Missing Pertussis: {:.2f} %".format(
    100 * len(missing_pertussis)/len(diseases)))
Missing Measles: 34.84 %
Missing Pertussis: 51.04 %

It looks like they both have missing values, but there are more missing pertussis values. I'll look at how the missing values are distributed.

  • Measles
    decade_hover = HoverTool(
        tooltips=[
            ("Missing", "@Year_count{0,0}"),
            ("Decade Center Year", "@Year{0}"),
        ],
        formatters={
            "Count": "numeral",
            "Year": "numeral",
        },
        mode="vline",
    )
    
    histogram = (missing_measles[(missing_measles.Year >1929) 
                                 & (missing_measles.Year < 2011)]
                 .hvplot.hist("Year", bins=8, tools=[decade_hover],
                              width=Plots.width, height=Plots.height,
                              ylabel="Count",
                              title="Missing Weekly Measles Reports by Decade")
                 .opts(fontsize=Plots.font_size))
    Embed(histogram, "missing_measles_histogram")()
    

    Since we have roughly eight decades of data I trimmed the years to exactly eight and then made a histogram with eight bins to get the counts per decade. Since this is a post about doing the plotting and not really about measles or pertussis I should probably mention that the way I got the Year_count variable name for the HoverTool was by printing the histogram object after I'd plotted it once.

    print(histogram)
    
    :Histogram   [Year]   (Year_count)
    
    year_hover = HoverTool(
        tooltips=[
            ("Missing", "@count{0,0}"),
            ("Year", "@Year{0}"),
        ],
        formatters={
            "Count": "numeral",
            "Year": "numeral",
        },
        mode="vline",
    )
    
    year_counts = (missing_measles.groupby("Year")
                   .agg({"Year": "count"})
                   .rename(columns={"Year": "count"}).reset_index())
    plot = (year_counts.hvplot(tools=[year_hover], x="Year", y="count",
                              width=Plots.width, height=Plots.height,
                              xlabel="Missing Count",
                              title="Missing Measles Reports by Year")
            .opts(fontsize=Plots.font_size))
    Embed(plot, "aggregated_missing_measles")()
    

    It looks like there are more missing values from 1981 onward. That kind of surprised me, but I guess that prior to this latest outbreak the reporting might have become less necessary since measles was less common (it was declared eliminated from the Americas in 2016).

    print(histogram)
    
    :Histogram   [Year]   (Year_count)
    
    print(len(diseases.State.unique()) * len(diseases.Week.unique()))
    
    2652
    

    By 2003 there are 2,652 missing values, which is our maxmimum value so it looks like there was no reporting in this data set from this year forward.

  • Pertussis
    histogram = (missing_pertussis[(missing_pertussis.Year > 1929) 
                                  & (missing_pertussis.Year < 2011)]
                 .hvplot.hist("Year", bins=8, 
                              ylabel="Count of Missing",
                              title="Missing Pertussis by Decade", 
                              tools=[decade_hover])
                 .opts(fontsize=Plots.font_size))
    Embed(histogram, "missing_pertussis_distribution")()
    

    Strangely the missing reports seem to peak in the 1960s. This seems problematic if you're going to look at the incident rates, but I'm only going to look at measles anyway.

Middle

Looking at Measles

By Year

hover = HoverTool(
    tooltips=[
        ("Measles", "@measles{0,0}"),
        ("Year", "@Year"),
    ],
    formatters={"measles": "numeral"},
    mode="vline",
)

measles_by_year = diseases[["Year", "measles"]].groupby("Year").agg(numpy.sum)
plot = measles_by_year.hvplot(title="Measles In the U.S. by Year", 
                              xlabel="Year", 
                              ylabel="Cases", 
                              width=1000,
                              tools=[hover])
Embed(plot, "measles_by_year")()

You can see from the plot that cases of measles have dropped dramatically over the years, with a particularly sharp drop in the 1960's.

Vaccines Enter The Picture

According to the CDC, the Edmonston-D vaccine was released in the United States in 1963 and the Edmonston-Enders vaccine (which is still currently in use) was released in 1968. Let's draw some lines to mark when certain events happened.

first = holoviews.VLine(1963).opts(color="black", alpha=0.5)
first_label = holoviews.Text(1963 - 1, 27000, "Vaccine Introduced", 
                             halign="right")
current = holoviews.VLine(1968).opts(color="black", alpha=0.5)
current_label = holoviews.Text(1968 + 1, 27000, "Newer Vaccine", halign="left")
highest = holoviews.VLine(int(measles_by_year.idxmax())).opts(color="red")
highest_label = holoviews.Text(int(measles_by_year.idxmax()) + 1, 27000, 
                               "Year of the Most Cases", halign="left")
lowest = holoviews.VLine(int(measles_by_year.idxmin())).opts(color="blue", 
                                                             alpha=0.5)
lowest_label = holoviews.Text(int(measles_by_year.idxmin()) - 1, 27000, 
                              "Zero Cases", halign="right")

plot_2 = (plot
          * first * first_label 
          * current * current_label 
          * highest * highest_label 
          * lowest * lowest_label).opts(fontsize=Plots.font_size)
Embed(plot_2, "measles_with_landmarks")()

It does look like the introduction of the vaccine(s) had a dramatic effect on the incidence of measles in the United States.

It appears that there were zero cases in 2002, but the actual value is 0.31, but my formatter cuts off the decimal place. 2003 is the first true zero, but as we saw above, this is also the first N/A value. Maybe N/A means zero, not missing. It's hard to say without some documentation about the data.

Measles By State

This creates a dropdown menu so we can see the states' measles cases separately. It doesn't work in this template so I'm saving it as a separate page.

measles_by_state = diseases.groupby(["Year", "State"])["measles"].sum()
states_plot = measles_by_state.hvplot(x="Year", groupby="State", width=800, dynamic=False)
file_name = "measles_by_state.html"
holoviews.save(states_plot, OUTPUT_FOLDER.joinpath(file_name))
print("[[file:{}][Link to plot]]".format(file_name))

Link to plot

Oregon Vs Hawaii

Instead of looking at all the states one at a time, we can choose two states and lay them out side by side to make them easier to compare. The addition sign is used to make plots next to each other.

hover = HoverTool(
    tooltips=[
        ("Measles Cases", "@measles{0,0}"),
        ("Year", "@Year"),
    ],
    formatters={"measles": "numeral"},
    mode="vline",
)

oregon_plot = states_plot["Oregon"].relabel("Measles in Oregon").opts(
    tools=[hover],
    width=550, 
    fontsize=Plots.font_size)
hawaii_plot = states_plot["Hawaii"].relabel("Measles in Hawaii").opts(
    tools=[hover],
    width=550, 
    fontsize=Plots.font_size)
plot = (oregon_plot * first * current * current_label * first_label
        + hawaii_plot * first * first_label * current * current_label)
Embed(plot, "oregon_vs_hawaii")()

I don't know why but the labels don't work. This is one of the problems with HoloViews, I think - they make some things really easy but the minute you step outside of what they have documented there's no way to figure out what's going on and how to fix it (or do it in the first place). It's an impressive programming feat but not documented enough to be as useful as it might be. More of a thing for quick sketching after which you have to switch back over to bokeh if you want anything other than the canned views (kind of like Excel… except with less documentation).

Surprisingly (to me), Hawaii had more cases in their peak years and huge swings. Perhaps since it's an island the sailors and other travelers introduced epidemics. Or maybe they weren't as good at keeping records back then.

According to the 1960 Census Count Oregon was quite a bit more populous than Hawaii.

Four States

Since Oregon had a much larger population that Hawaii I thought I'd plot the states closer to it in size. The four states nearest to Hawaii in population (from the same 1960 popurlation report) are (in descending order):

  • Montana (674,767)
  • Idaho (667,191)
  • Hawaii (632,772)
  • North Dakota (632,446)

While the side-by side plot lets you see groupings and heights you can't easily compare values for each year so this time I'll put them all on the same plot. HoloViews automatically creates a legend which lets you dim a line by clicking on it in the legend.

hover = HoverTool(
    tooltips=[
        ("Measles", "@measles{0,0}"),
        ("Year", "@Year"),
        ("State", "@State"),
    ],
    formatters={"measles": "numeral"},
)

states = ["Montana", "Idaho", "Hawaii", "North Dakota"]
start_year, end_year = 1930, 2005
plot = (measles_by_state.loc[start_year:end_year, states].hvplot(
    by="State",
    title="Measles 1930 - 2005",
    tools=[hover],
    fontsize=Plots.font_size,
    width=Plots.width) 
        * first * first_label 
        * current * current_label)
Embed(plot, "four_states_measles")()

Surprisingly Hawaii had the highest values in 1951 and 1955 and it looks like they had a really bad outbreak from 1955 through 1958, although Montana had higher values overall. Once again, the introduction of a vaccine seems to have a dramatic effect, although there continued to be mini outbreaks in the 1970s.

Faceting

Another way to compare the states is to plot them side-by side.

hover = HoverTool(
    tooltips=[
        ("Measles", "@measles{0,0}"),
        ("Year", "@Year"),
    ],
    formatters={"measles": "numeral"},
)

crosshairs = CrosshairTool()
plot = (measles_by_state.loc[start_year:end_year, states].hvplot(
    x="Year", col="State", width=300, height=200, rot=90, tools=[crosshairs])
        * first * first_label * current * current_label).opts(title="Measles By Year")
Embed(plot, "faceted_measles_states")()

This makes it harder to compare year by year, but it looks kind of elegant and it's clearer that Hawaii only had this one intense period recorded (starting in 1950) before the vaccine came out, while Idaho and North Dakota had this steady low-level amount of cases and something was much worse in Montana (although, really, their pattern looks closer to that of the United States as a whole) and all of them benefitted from the vaccines.

As far as the plot goes, it might not be obvious but the main difference with what I did to get this plot is that the State column was assigned to the col argument instead of the by argument. I also had to set the title using the opts for the total plot, setting it for the hvplot didn't do anything.

Bar Chart

This time I'll plot the counts for the states using a bar-chart instead of a line-plot.

hover = HoverTool(
    tooltips=[
        ("Measles", "@measles{0,0}"),
        ("Year", "@Year"),
        ("State", "@State"),
    ],
    formatters={"measles": "numeral"},
    mode="vline",
)

plot = measles_by_state.loc[1960:1970, states].hvplot.bar(
    "Year",
    height=Plots.height,
    width=Plots.width,
    fontsize=Plots.font_size,
    title="Measles Count by Year",
    by="State", 
    tools=[hover],
    rot=90)
Embed(plot, "measles_bar_chart")()

This gives pretty much the same information as the earlier line-plot, except it makes it easier to see a states' case-count for a given year. On the other hand it's harder to really see the year-to-year patterns and you can't turn off states to highlight other states. This looks like something business people would use more than what scientists would use. I think there is a certain aesthetic advantage to it which is traded off by the extra trend information given by a line plot. It does have a major advantage in the way it simplifies the output, though. It also allows you to use the "vline" option so the user only has to be on the same x-axis as a bar to trigger the pop-up. When I did this with the line plots the ones for the different states ended up covering each other up (since they all shared the same x-value).

I tried to add the previous vertical lines to indicate when vaccines were introduced but a bar-plot uses different inputs so it raised an error. It's probably less important for a bar-chart anyway, since you aren't emphasizing the time line quite so much as with a line plot.

print(first)
print(plot)
:VLine   [x,y]
:Bars   [Year,State]   (measles)

I think (guess) that it needs both the year and state, while the line plot only needed the year.

Nationwide Mean With Error Bars

error = diseases.groupby("Year").agg({"measles": [numpy.mean, numpy.std]}).xs(
    "measles", axis=1)
plot = (error.hvplot(y="mean", 
                     title="Mean National Measles Cases By Year") 
        * holoviews.ErrorBars(error, "Year").redim.range(mean=(0, None)) 
        * first * first_label * current * current_label).opts(
            fontsize=Plots.font_size,
            width=Plots.width,
        )
Embed(plot, "error_bars")()

So looking at this we can see that the national mean actually paints a slightly different picture from the raw counts for measles. My guess would be that the different populations for each state offset each other enough that the mean is flattened out.

Actually, in thinking about it, the mean is really a weekly mean for the year so… I'll have to think about this.

GISS Zone Anomalies

Introduction

This is a look at the Godard Institute for Space Studies' surface temperature data. I'll look at the mean difference between the yearly mean and the historical "normal" mean.

Set Up

Imports

Python

from functools import partial
from pathlib import Path
import os

PyPi

from dotenv import load_dotenv
from holoviews import opts
import holoviews
from holoviews.plotting.links import RangeToolLink
from bokeh.models import CrosshairTool, HoverTool
from bokeh.palettes import Category20

import hvplot.pandas
import pandas
import seaborn

This Project

from bartleby_the_penguin.tangles.embed_bokeh import EmbedBokeh
from graeae.tables import CountPercentage

Load Dotenv

load_dotenv(".env")

HoloViews Backend

holoviews.extension("bokeh", width=800)

The Embedder

OUTPUT_PATH = Path("../../files/posts/giss/giss-zone-anomalies/")
Embed = partial(EmbedBokeh, 
                folder_path=OUTPUT_PATH)

Some Settings

class Plots:
    width = 1200
    height = 600
    font = "Open Sans"
    font_size = "24pt"
    line_width = 3
    tools =  ["hover"]
    blue = seaborn.color_palette()[0]
    light_blue = Category20[3][1]
    red = seaborn.color_palette()[3]
    yellow = seaborn.color_palette()[1]
    green = seaborn.color_palette()[2]
    gray = seaborn.color_palette()[7]

The Data

The data is the Combined Land-Surface Air and Sea-Surface Water Temperature Anomolies' Zonal Annual Means which shows the different annual mean for each zone in a given year.

zone_path = Path(os.environ.get("ZONES")).expanduser()
assert zone_path.is_file()
with zone_path.open() as reader:
    giss = pandas.read_csv(reader)
print(giss.describe())
              Year        Glob        NHem        SHem     24N-90N  \
count   139.000000  139.000000  139.000000  139.000000  139.000000   
mean   1949.000000    0.032302    0.056043    0.008561    0.077698   
std      40.269923    0.336896    0.393435    0.301848    0.464606   
min    1880.000000   -0.490000   -0.540000   -0.490000   -0.580000   
25%    1914.500000   -0.200000   -0.220000   -0.235000   -0.280000   
50%    1949.000000   -0.070000   -0.010000   -0.080000    0.020000   
75%    1983.500000    0.215000    0.210000    0.265000    0.255000   
max    2018.000000    0.980000    1.260000    0.710000    1.500000   

          24S-24N     90S-24S     64N-90N     44N-64N     24N-44N     EQU-24N  \
count  139.000000  139.000000  139.000000  139.000000  139.000000  139.000000   
mean     0.036115   -0.018561    0.111079    0.117770    0.027698    0.027626   
std      0.331384    0.295695    0.917715    0.516729    0.356416    0.326111   
min     -0.650000   -0.470000   -1.640000   -0.710000   -0.590000   -0.720000   
25%     -0.215000   -0.250000   -0.545000   -0.270000   -0.200000   -0.230000   
50%     -0.030000   -0.100000    0.020000    0.000000   -0.070000    0.000000   
75%      0.255000    0.230000    0.660000    0.360000    0.135000    0.240000   
max      0.970000    0.700000    3.050000    1.440000    1.060000    0.930000   

          24S-EQU     44S-24S     64S-44S     90S-64S  
count  139.000000  139.000000  139.000000  139.000000  
mean     0.045683    0.020432   -0.069353   -0.078129  
std      0.343385    0.312688    0.269380    0.732359  
min     -0.580000   -0.430000   -0.540000   -2.570000  
25%     -0.210000   -0.220000   -0.265000   -0.490000  
50%     -0.030000   -0.080000   -0.090000    0.050000  
75%      0.290000    0.260000    0.180000    0.410000  
max      1.020000    0.780000    0.450000    1.270000  

So we have years from 1880 through 2018.

print(giss.iloc[0])
Year       1880.00
Glob         -0.18
NHem         -0.31
SHem         -0.06
24N-90N      -0.38
24S-24N      -0.17
90S-24S      -0.01
64N-90N      -0.97
44N-64N      -0.47
24N-44N      -0.25
EQU-24N      -0.21
24S-EQU      -0.13
44S-24S      -0.04
64S-44S       0.05
90S-64S       0.67
Name: 0, dtype: float64

Tidying the Data

I'm going to use the locations for names so I'll make them something more meaningful for me. I don't know if these are, strictly speaking, the right names, but it's close enough.

name_remap = {
    "Glob": "Global",
    "NHem": "Northern Hemisphere",
    "SHem": "Southern Hemisphere",
    "24N-90N": "Northern Extratropics",
    "24S-24N": "Tropics",
    "90S-24S": "Southern Extratropics",
    "64N-90N": "North Frigid",
    "44N-64N": "North Temperate",
    "24N-44N": "North Sub-Tropic",
    "EQU-24N": "Tropic of Cancer",
    "24S-EQU": "Tropic of Capricorn",
    "44S-24S": "South Sub-Tropic",
    "64S-44S": "South Temperate",
    "90S-64S": "South Frigid",
}
giss = giss.rename(columns=name_remap)
print(giss.columns)
Index(['Year', 'Global', 'Northern Hemisphere', 'Southern Hemisphere',
       'Northern Extratropics', 'Tropics', 'Southern Extratropics',
       'North Frigid', 'North Temperate', 'North Sub-Tropic',
       'Tropic of Cancer', 'Tropic of Capricorn', 'South Sub-Tropic',
       'South Temperate', 'South Frigid'],
      dtype='object')
giss_tidy = giss.melt(id_vars=["Year"], var_name="location", 
                      value_name="Difference From Normal")
print(giss_tidy.iloc[0])
Year                        1880
location                  Global
Difference From Normal     -0.18
Name: 0, dtype: object
location = CountPercentage(giss_tidy.location)
# plot = location.table.hvplot.table()
# Embed(plot, "location_table")()
location()
Value Count Percentage
Southern Extratropics 139 7.14
Tropics 139 7.14
North Temperate 139 7.14
North Frigid 139 7.14
Northern Extratropics 139 7.14
North Sub-Tropic 139 7.14
South Sub-Tropic 139 7.14
Southern Hemisphere 139 7.14
Global 139 7.14
South Frigid 139 7.14
South Temperate 139 7.14
Tropic of Cancer 139 7.14
Northern Hemisphere 139 7.14
Tropic of Capricorn 139 7.14

The Frigid North

This is a plot of the temperature anomalies for the Frigid North using a Range Finder. You can select a range of years using the mini-plot below the main one.

# The Tools
hover = HoverTool(
    tooltips = [
        ("Year", "@Year"),
        ("Temperature Anomaly (C)", "@{North Frigid}{0.00}"),
],
    formatters={"North Frigid": "numeral"},
    mode="vline",
)

# The Parts
curve = holoviews.Curve(
    giss, "Year",
    ("North Frigid", "Temperature Anomaly (C)"),
)

main = curve.relabel("North Frigid Zone").opts(width=Plots.width, 
                                               height=Plots.height, 
                                               labelled=["y"],
                                               fontsize=Plots.font_size,
                                               color=Plots.blue,
                                               tools=["pan", "zoom_in", 
                                                      hover, "reset"])
range_finder = curve.opts(width=Plots.width, height=100, yaxis=None, default_tools=[])
line = holoviews.HLine(0)
RangeToolLink(range_finder, main)

layout = (main * line + range_finder).cols(1)
layout = layout.opts(
    opts.Layout(shared_axes=False, merge_tools=False),
    opts.HLine(color="black", alpha=0.2, line_width=2))
Embed(layout, "north_frigid_zone")()

This plot is kind of unusual in that the column name has a space in it ("North Frigid") so in the hover tool you have to put braces around it.

If we were only looking at one zone I would think this is the most useful type, but when I tried to add multiple zones I couldn't get it to work. HoloViews… I also tried to change the font and couldn't figure out how to do it. I really don't get what bokeh and HoloViews are doing with their documentation.

All Zones Dropdown

This creates a plot with all the zones but only showing one at a time (you select which one using a dropdown). To get it you specify the dropdown column using groupby.

# the Tools
hover = HoverTool(
tooltips = [
    ("Year", "@Year"),
    ("Difference from Normal", "@Difference From Normal"),
],
    formatters={"Difference From Normal": "numeral"},
    mode="vline",
)

plot = giss_tidy.hvplot("Year", groupby="location", width=Plots.width, dynamic=False, tools=[hover])
line = holoviews.HLine(0)
layout = plot * line
file_name = "zones.html"

plot_file = OUTPUT_PATH.joinpath(file_name)
holoviews.save(layout, plot_file)
print("[[file:{}][Link To Plot]]".format(file_name))

Link To Plot

All Zones Overlay

The dropdown is a convenient way to look at different zones individually but make it hard to compare different zones. I'm going to put all of them on one plot first to get an overall picture.

plot = giss_tidy.hvplot("Year", "Difference From Normal", 
                        by="location", 
                        title="Difference by Zone", 
                        ylabel="Difference From Normal",
                        width=Plots.width, 
                        height=Plots.height).opts(
                            fontsize=Plots.font_size,
                        )
line = holoviews.HLine(0)
layout = plot * line
Embed(layout, "all_zones")()

This is kind of a messy plot, but it does show that the North Frigid and South Frigid zones seem to have the largest anomalies.

Hemisphere Spikes

hemisphere_map = {
    'Global': "Global", 
    'Northern Hemisphere': "Northern", 
    'Southern Hemisphere': "Southern",
    'Northern Extratropics': "Northern", 
    'Tropics': "Tropics", 
    'Southern Extratropics': "Southern",
    'North Frigid': "Northern", 
    'North Temperate': "Northern", 
    'North Sub-Tropic': "Northern",
    'Tropic of Cancer': "Northern", 
    'Tropic of Capricorn': "Southern", 
    'South Sub-Tropic': "Southern",
    'South Temperate': "Southern", 
    'South Frigid': "Southern",
}

giss_tidy["hemisphere"] = giss_tidy.location.map(hemisphere_map)

Separate out the hemispheres for plotting.

northern = giss_tidy[giss_tidy.hemisphere=="Northern"]
southern = giss_tidy[giss_tidy.hemisphere=="Southern"]
global_anomalies = giss_tidy[giss_tidy.location=="Global"]

Plot.

# The Tools
hover = HoverTool(
tooltips = [
    ("Year", "@Year"),
    ("Difference from Normal", "@{Difference From Normal}"),
    ("Hemisphere", "@Hemisphere"),
],
    formatters={"Difference From Normal": "numeral"},
)

northern_spikes = holoviews.Spikes(northern, "Year", ("Difference From Normal", 
                                                      "Difference From Normal"),
                                   group="Hemisphere",
                                   label="Northern")
southern_spikes = holoviews.Spikes(southern, "Year", ("Difference From Normal",
                                                      "Difference From Normal"), 
                                   group="Hemisphere",
                                   label="Southern")
overlay = (northern_spikes
           * holoviews.Scatter(northern_spikes)
           * southern_spikes
           * holoviews.Scatter(southern_spikes)
           )
overlay.opts(
    opts.Spikes("Hemisphere.Northern", color=Plots.blue),
    opts.Spikes("Hemisphere.Southern", 
                fontsize=Plots.font_size,
                color=Plots.red, 
                width=Plots.width,
                height=Plots.height),
    opts.Scatter(size=10, tools=[hover]),
)
line = holoviews.HLine(0)
layout = overlay * line
bokeh_layout = holoviews.render(layout)
bokeh_layout.title.text_font = "open sans"
bokeh_layout.legend.location = "top_left"
embed = Embed(bokeh_layout, "all_hemispheres")
embed._figure = bokeh_layout
embed()

This one is a little hard to understand at first. All of the zones are still being plotted, but they are grouped by their hemisphere, so they look different from just the Hemisphere temperatures that come later. Mostly what you're seeing is a highlightt of the most anomalous zone for each year.

I also couldn't figure out how to get the zone information into the hovertool. With hvPlot you can pass in extra columns, but the Spikes didn't like it when I tried… such a frustrating library.

Anyway, it looks like the Northern Hemisphere started out below average while the Southern started out above average, then during the baseline period it sort of flipped, and now they're both above average, but the Northern is even more so.

Hemispheres

This plot shows what happens when you average the zones by hemisphere.

hemispheres = ["Global", "Northern Hemisphere", "Southern Hemisphere"]
hemispheres = giss_tidy[giss_tidy.location.isin(hemispheres)]
plot = hemispheres.hvplot("Year", "Difference From Normal", 
                          ylabel="Difference From Normal (C)",
                          by="location", 
                          title="Difference by Hemisphere",
                          fontsize=Plots.font_size,
                          width=Plots.width, height=Plots.height)
line = holoviews.HLine(0)
layout = plot * line
Embed(layout, "hemispheres")()

This shows a different view than the Zones grouped by Hemisphere, as all the values start out below normal and move in sort of a linear fashion upwards. This probably is the result of the fact that the Frigid Zones are more extreme in their differences than the other zones, so averaging all of the zones supresses them. At least that's my initial thought. Looking at the previous plot, though, it looks quite different - the lowest value in the Hemispheres plot comes in 1917, while the zones grouped by hemisphere puts it at 1930 and 1917 actually doesn't look so special. Maybe they're calculated differently.

Extratropics

zones = ["Global", "Northern Extratropics", "Southern Extratropics", "Tropics"]
zones = giss_tidy[giss_tidy.location.isin(zones)]
plot = zones.hvplot("Year", "Difference From Normal", by="location",
                    ylabel="Difference From Normal (C)",
                    fontsize=Plots.font_size,
                    title="Tropics and Extratropics", 
                    width=Plots.width, height=Plots.height)
line = holoviews.HLine(0)
layout = plot * line
Embed(layout, "tropics")()

The Zones

I did a previous plot "by zone" but in that case I was including all the zones given in the data. In this case we're only looking at the non-overlapping zones - so this chops up the planet by the smallest latitudinal zones that don't overlapy.

zones = ["North Frigid", "North Temperate", "North Sub-Tropic", 
         "Tropic of Cancer", "Tropic of Capricorn",
         "South Frigid", "South Temperate", "South Sub-Tropic"]
zones =  giss_tidy[giss_tidy.location.isin(zones)]
plot = zones.hvplot("Year", "Difference From Normal", by="location",
                    ylabel="Difference From Normal (C)",
                    title="Difference By (Non-Overlapping) Zones", 
                    fontsize=Plots.font_size,
                    width=Plots.width, height=Plots.height)
line = holoviews.HLine(0)
layout = plot * line
Embed(layout, "zones")()

Tabbed Plots

giss_tidy = giss_tidy.rename(columns={"location": "Location"})
class PlotGroups:
    hemisphere = "Hemishpere"
    tropics = "Tropics"
    climate_zones = "Zones"

group_map = {
    "Global": PlotGroups.hemisphere,
    "Northern Hemisphere": PlotGroups.hemisphere,
    "Southern Hemisphere": PlotGroups.hemisphere,
    "Northern Extratropics": PlotGroups.tropics,
    "Southern Extratropics": PlotGroups.tropics,
    "Tropics": PlotGroups.tropics,
    "North Frigid": PlotGroups.climate_zones,
    "North Sub-Tropic": PlotGroups.climate_zones,
    "North Temperate": PlotGroups.climate_zones,
    "Tropic of Cancer": PlotGroups.climate_zones,
    "Tropic of Capricorn": PlotGroups.climate_zones,
    "South Temperate": PlotGroups.climate_zones,
    "South Sub-Tropic": PlotGroups.climate_zones,
    "South Frigid": PlotGroups.climate_zones,
}
giss_tidy["plot_group"] = giss_tidy.Location.map(group_map)
hemispheric = giss_tidy[giss_tidy.plot_group==PlotGroups.hemisphere]
tropical = giss_tidy[giss_tidy.plot_group==PlotGroups.tropics]
zones = giss_tidy[giss_tidy.plot_group==PlotGroups.climate_zones]

assert (len(hemispheric.Location.unique()) 
        + len(tropical.Location.unique()) 
        + len(zones.Location.unique())) == len(giss_tidy.Location.unique())
assert len(hemispheric.Location.unique()) == 3
assert len(tropical.Location.unique()) == 3
assert len(zones.Location.unique()) == 8, zones.Location.unique()

cross_hair = CrosshairTool()
plot = (hemispheric.hvplot(x="Year", y="Difference From Normal", 
                           ylabel="Difference From Historical Mean",
                           by="Location",
                           width=Plots.width, 
                           height=Plots.height,
                           tools=[cross_hair]
).relabel("Hemispheres")
        * tropical.hvplot(x="Year", y="Difference From Normal",
                          ylabel="Difference From Historical Mean",                          
                          by="Location",
                          width=Plots.width, 
                          height=Plots.height,
                          tools=[cross_hair]).relabel(
                              "Tropics vs Extra-Tropics")
        * zones.hvplot(x="Year", y="Difference From Normal",
                       ylabel="Difference From Historical Mean",                       
                       by="Location",
                       width=Plots.width, 
                       height=Plots.height,
                       tools=[cross_hair],
                       ).relabel("Climate Zones")
).opts(tabs=True, fontsize=Plots.font_size)

Embed(plot, "tabbed_plots")()

HoloViews Gridded Data

The Beginning

This is a reproduction of the HoloViews Gridded Datasets example in the Getting Started section of their documentation to make sure that I can do it myself. The Gridded Data that they're referring to is data with more than two dimensions (a table has two dimensions - columns and rows). One way to display this type of data might be to use three dimensions (assuming there's only three), but the suggestion here is to use a slider to step through the data instead (so the third dimension is time or something similar).

The Middle

Set Up

Imports

  • Python
    from functools import partial
    from pathlib import Path
    import os
    
  • PyPi
    from dotenv import load_dotenv
    import holoviews
    from holoviews import opts
    import numpy
    
  • This Project
    from bartleby_the_penguin.tangles.embed_bokeh import EmbedBokeh
    

The Dotenv

load_dotenv(".env")

Bokeh

#+begin_src python :session holoviews :results none
holoviews.extension("bokeh")

The Embedder

files_path = Path("../../files/posts/libraries/holoviews-gridded-data/")
Embed = partial(
    EmbedBokeh,
    folder_path=files_path)

The Output Path

This is where to store files and images that get created and need to get grabbed by nikola.

OUTPUT_PATH = Path("../../files/posts/libraries/holoviews-gridded-data")
assert OUTPUT_PATH.is_dir()

The Data

data_path = Path(os.environ.get("PHOTONS")).expanduser()
assert data_path.is_file()
data = numpy.load(str(data_path))
calcium = data["Calcium"]
print(calcium.shape)
(62, 111, 50)

The numpy.load function loads a pickled dict-like object that doesn't unpickle the data until you retrieve it (e.g. when I created the calcium variable). The data is a 3D array of images with the third dimension representing time (it's a set of images that change over time) so there are 50 time-steps (I don't know what the imagaes actually are or what the time-intervals are).

I'm going to load the data into a HoloViews Dataset, but is going to need time, x, and y coordinates to render the images. To make it simpler I'll just pass in integer arrays with zero-indexed indices for the images. Since this isn't really a dataset that I care about I don't know what the orientation of the image is, but the example set it up as a wide image.

data_set = holoviews.Dataset((numpy.arange(50), numpy.arange(111), numpy.arange(62), calcium),
                             ["Time", "x", "y"], "Fluoresence")
print(data_set)
:Dataset   [Time,x,y]   (Fluoresence)

Note that holoviews treats labels in lists differently than tuples - if it's a tuple then it thinks it represents a (<variable>, <label>) pair and a list is treated as just labels, which is what we want here.

The example says that really we should have used xarray to load the data, so this is how to convert it to and xarray.

print(data_set.clone(datatype=["xarray"]).data)
<xarray.Dataset>
Dimensions:      (Time: 50, x: 111, y: 62)
Coordinates:
  * Time         (Time) int64 0 1 2 3 4 5 6 7 8 9 ... 41 42 43 44 45 46 47 48 49
  * x            (x) int64 0 1 2 3 4 5 6 7 8 ... 103 104 105 106 107 108 109 110
  * y            (y) int64 0 1 2 3 4 5 6 7 8 9 ... 52 53 54 55 56 57 58 59 60 61
Data variables:
    Fluoresence  (y, x, Time) uint16 386 441 196 318 525 ... 806 801 899 583 774

But we want to use a Dataset, not an x-array so this was just for illustration…

Build It

Set Up the Plot

Before doing the plot I'll set the defaults for it.

opts.defaults(
    opts.GridSpace(shared_xaxis=True, shared_yaxis=True),
    opts.Image(cmap="viridis", width=400, height=400),
    opts.Labels(text_color="white", text_font_size="8pt", text_align="left", text_baseline="bottom"),
    opts.Path(color="white"),
    opts.Spread(width=600, tools=["hover"]),
    opts.Overlay(show_legend=False)
)

Note that if you don't setup the backend with holoviews.extension then the opts won't have any of the attributes like GridSpace, Image, etc.

plot = data_set.to(holoviews.Image, ["x", "y"]).hist()
file_name = "grid_image.html"
output = OUTPUT_PATH.joinpath(file_name)
holoviews.save(plot, output)
print("[[file:{}][Link to the plot.]]".format(file_name))

Link to the plot.

Zoom In

HoloViews provides a way to select out Regions of Interest (ROI). The pickle we loaded earlier has coordinates for rectangular bounding boxes in it (under the ROIs key).

regions_of_interest = data["ROIs"]
bounds = holoviews.Path([holoviews.Bounds(tuple(region)) for region in regions_of_interest])
print(regions_of_interest.shape)
(147, 4)
labels = holoviews.Labels([(roi[0], roi[1], i) for i, roi in enumerate(regions_of_interest)])
plot = (data_set[21].to(holoviews.Image, ['x', 'y']) * bounds * labels).relabel('Time: 21')
file_name = "bounds.html"
output = OUTPUT_PATH.joinpath(file_name)
Embed(plot, file_name)()

Select the Facet

x0, y0, x1, y1 = regions_of_interest[60]
roi = data_set.select(x=(x0, x1), y=(y0, y1), time=(250, 280)).relabel('ROI #60')
plot = roi.to(holoviews.Image, ['x', 'y'])
file_name = "selection.html"
output = OUTPUT_PATH.joinpath(file_name)
holoviews.save(plot, output)
print("[[file:{}][link]]".format(file_name))

link

Mean and Standard Deviation

agg = roi.aggregate('Time', numpy.mean, spreadfn=numpy.std)
plot = holoviews.Spread(agg) * holoviews.Curve(agg)
file_name = "spread.html"
output = OUTPUT_PATH.joinpath(file_name)
holoviews.save(plot, output)
print("[[file:{}][Link]]".format(file_name))

Link

Save It

The End

Aggregating Tabular Data

Set Up

Imports

Python

from functools import partial
from pathlib import Path
import os

PyPi

from dotenv import load_dotenv
from holoviews import opts
from tabulate import tabulate
import holoviews
import numpy
import pandas

My Projects

from bartleby_the_penguin.tangles.embed_bokeh import EmbedBokeh

Holoviews Bokeh

I don't know why but you have to specify that you're using bokeh, even though it looks like it's working when you don't.

holoviews.extension("bokeh")

The Embedder

files_path = Path("../../files/posts/libraries/aggregating-tabular-data/")
Embed = partial(
    EmbedBokeh,
    folder_path=files_path)

Dotenv

I have the path to the data-set in a .env file so I'll load it into the environment dictionary.

load_dotenv(".env")

Load the Data

This is the same measles/pertusis data that I used before.

path = Path(os.environ.get("DISEASES")).expanduser()
assert path.is_file()
with path.open() as reader:
    diseases = pandas.read_csv(path)

Convert the DataFrame to a Dataset

key_dimensions = "Year State".split()
value_dimensions = [("measles", "Measles Incidence"), ("pertussis", "Pertusis Incidence")]
dataset = holoviews.Dataset(diseases, key_dimensions, value_dimensions)

Aggregate the Data

While I had aggregated the data before, this time I'm going to pass in the "Year" column as an argument so it won't keep the states separate.

aggregator = dataset.aggregate("Year", function=numpy.mean, spreadfn=numpy.std)
error_bars = holoviews.ErrorBars(aggregator, vdims=["measles", "measles_std"]).iloc[::2]
overlay = (holoviews.Curve(aggregator) * error_bars).redim.range(measles=(0, None))
plot = overlay.opts(height=500, width=1000, tools=["hover"])
Embed(plot, "northwest_measles_aggregated")()

Selecting Tabular Data

Set Up

Imports

Python

from functools import partial
from pathlib import Path
import os

PyPi

from dotenv import load_dotenv
from holoviews import opts
from tabulate import tabulate
import holoviews
import numpy
import pandas

My Projects

from bartleby_the_penguin.tangles.embed_bokeh import EmbedBokeh

Holoviews Bokeh

I don't know why but you have to specify that you're using bokeh, even though it looks like it's working when you don't.

holoviews.extension("bokeh")

The Embedder

files_path = Path("../../files/posts/libraries/selecting-tabular-data/")
Embed = partial(
    EmbedBokeh,
    folder_path=files_path)

Dotenv

I have the path to the data-set in a .env file so I'll load it into the environment dictionary.

load_dotenv(".env")

Load the Data

This is the same measles/pertusis data that I used before.

path = Path(os.environ.get("DISEASES")).expanduser()
assert path.is_file()
with path.open() as reader:
    diseases = pandas.read_csv(path)
print(orgtable(diseases.head()))
Year Week State measles pertussis
1928 1 Alabama 3.67 nan
1928 2 Alabama 6.25 nan
1928 3 Alabama 7.95 nan
1928 4 Alabama 12.58 nan
1928 5 Alabama 8.03 nan

Convert the DataFrame to a Dataset

key_dimensions = "Year State".split()
value_dimensions = [("measles", "Measles Incidence"), ("pertussis", "Pertusis Incidence")]
dataset = holoviews.Dataset(diseases, key_dimensions, value_dimensions)

Aggregate The Data

dataset = dataset.aggregate(function=numpy.mean)
print(dataset)
print(dataset.shape)
:Dataset   [Year,State]   (measles,pertussis)
(4284, 4)

Plot a Subset

northwest = ["Oregon", "Washington", "Idaho"]
bars = dataset.select(State=northwest, Year=(1928, 1938)).to(
    holoviews.Bars, ["Year", "State"], "measles").sort()
plot = bars.opts(
    opts.Bars(width=800, height=400, tools=["hover"], xrotation=90, show_legend=False)
)
Embed(plot, "northwest_measles")()

As with all things HoloViews, there are many things that are unclear here - but the one that really tripped me up was the selection of years. Although I passed in a tuple it used it as a range (start, stop) so my original plot had a century instead of two years. Oh, well.

Interesting how Oregon spiked up in 1935 and 1936 then dropped down in 1937. According to the CDC, the measles vaccine didn't come out until 1963, so I guess those ebb-and-flows are just the normal way diseases cycle through populations. Oregon and Washington probably had more immigrants than Idaho, as well as higher population densities in their main cities, which might account for their higher rates. Hard to say without corellating data.

Facets

The bar-plot is okay for comparing the cities within any given year but they are hard to get trends from. Here's how to use selecting to plot lines for each city.

grouped = dataset.select(State=northwest, Year=(1928, 2011)).to(holoviews.Curve, "Year", "measles")
gridspace = grouped.grid("State")
plot = gridspace.opts(
    opts.Curve(width=200, color="crimson", tools=["hover"])
)
Embed(plot, "northwest_measles_grid")()

Overlays

While the side-by-side plots are clearer than the bar-plots, it's harder to compare the cities year-by-year, so it might be better to plot them over each other.

overlay = grouped.overlay("State")
plot = overlay.opts(
    opts.Curve(height=500, width=1000, color=holoviews.Cycle(values=["crimson", "slateblue", "cadetblue"]))
)
Embed(plot, "northwest_measles_overlay")()