Looking at random graphs

Imports

# python standard library
import os
import pickle

# from pypi
import networkx
import numpy
import pandas

from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    )
from sklearn.feature_selection import (
    RFECV,
    SelectFromModel,
)
from sklearn.model_selection import (
    GridSearchCV,
    StratifiedKFold,
    train_test_split,
    )
% matplotlib inline

Part 1 - Random Graph Identification

For the first part of this assignment you will analyze randomly generated graphs and determine which algorithm created them.

Load the data

part_one_graphs = pickle.load(open('A4_graphs','rb'))
print(len(part_one_graphs))
print(type(part_one_graphs[0]))

part_one_graphs is a list containing 5 networkx graphs. Each of these graphs was generated by one of three possible algorithms:

  • Preferential Attachment (`'PA'`)
  • Small World with low probability of rewiring (`'SW_L'`)
  • Small World with high probability of rewiring (`'SW_H'`)

Analyze each of the 5 graphs and determine which of the three algorithms generated the graph.

The `graph_identification` function should return a list of length 5 where each element in the list is either `'PA'`, `'SW_L'`, or `'SW_H'`.

Graph Identification

def graph_identification():
    """Identifies the type of graph each of the graphs is

    Returns:
     list: string identifiers for the type of graph
    """
    graph_types = []
    for graph in part_one_graphs:
        path = networkx.average_shortest_path_length(graph)
        coefficient = networkx.average_clustering(graph)
        if path > 6:
            if coefficient < 0.5:
                graph_types.append("SW_L")
            else:
                raise Exception("unexpected type")
        else:
            if coefficient < 0.5:
                graph_types.append("PA")
            else:
                graph_types.append("SW_H")
    return graph_types

This was marked wrong by the grader.
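Before revisiting the thresholds, another signal worth checking is the degree distribution: preferential-attachment graphs grow a few heavily-connected hubs, while small-world graphs keep every node near the same degree. This is only a hedged sketch of that check, not the graded answer, and it assumes `part_one_graphs` is already loaded.

for index, graph in enumerate(part_one_graphs):
    # dict() smooths over the networkx 1.x (dict) vs 2.x (DegreeView) return types
    degrees = list(dict(graph.degree()).values())
    print(index,
          "max degree:", max(degrees),
          "mean degree: {:.1f}".format(numpy.mean(degrees)),
          "clustering: {:.3f}".format(networkx.average_clustering(graph)))

A preferential-attachment graph should show a maximum degree far above the mean with low clustering, while the two small-world variants should differ mostly in their clustering coefficients.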

Part 2 - Company Emails

For the second part of this assignment you will be working with a company's email network where each node corresponds to a person at the company, and each edge indicates that at least one email has been sent between two people.

The network also contains the node attributes `Department` and `ManagementSalary`.

`Department` indicates the department in the company to which the person belongs, and `ManagementSalary` indicates whether that person is receiving a management position salary.

email = networkx.read_gpickle('email_prediction.txt')
print(networkx.info(email))

Part 2A - Salary Prediction

Using network `email`, identify the people in the network with missing values for the node attribute `ManagementSalary` and predict whether or not these individuals are receiving a management position salary.

To accomplish this, you will need to create a matrix of node features using networkx, train a sklearn classifier on nodes that have `ManagementSalary` data, and predict a probability of the node receiving a management salary for nodes where `ManagementSalary` is missing.

Your predictions will need to be given as the probability that the corresponding employee is receiving a management position salary.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).

Your grade will be based on the AUC score computed for your classifier. A model with an AUC of 0.75 or higher will receive full points.

Using your trained classifier, return a series of length 252 with the data being the probability of receiving a management salary, and the index being the node id.

1       1.0
2       0.0
5       0.8
8       1.0
    ...
996     0.7
1000    0.5
1001    0.0
Length: 252, dtype: float64

The Data Frame

if not os.path.isfile("email_data.h5"):
    data = pandas.DataFrame(index=email.nodes())
    data["department"] = pandas.Series(networkx.get_node_attributes(email, "Department"))
    data["management"] = pandas.Series(networkx.get_node_attributes(email, "ManagementSalary"))
    data["clustering"] = pandas.Series(networkx.clustering(email))
    data["degree"] = pandas.Series(email.degree())
    data["degree_centrality"] = pandas.Series(networkx.degree_centrality(email))
    data["closeness_centrality"] = pandas.Series(networkx.closeness_centrality(email))
    data["betweenness_centrality"] = pandas.Series(networkx.betweenness_centrality(email))
    data["pagerank"] = pandas.Series(networkx.pagerank(email))
    _, authority = networkx.hits(email)
    data["authority"] = pandas.Series(authority)
    data.to_hdf("email_data.h5","df" )
else:
    data = pandas.read_hdf('email_data.h5', "df")
print(data.head())    
print(data.management.unique())
print(data.department.unique())

Department Dummy Variables

Even though I don't think it will prove useful, the department feature is actually categorical despite being encoded as integers, so we'll use one-hot encoding to add dummy variables for it.

dummies_data = pandas.get_dummies(data, columns=["department"])
print(dummies_data.head(1))

Separating the Training and Prediction Sets

We're going to use the model to predict what the missing management values are, so I'm going to separate the missing and non-missing sets.

training_data = dummies_data[pandas.notnull(dummies_data.management)]
prediction_data = dummies_data[pandas.isnull(dummies_data.management)]
print(training_data.shape)
print(prediction_data.shape)

The problem description tells us that the answer should have 252 entries so this is a safe assertion.

assert len(prediction_data) == 252

Training and Target Data

To train the model we'll need to separate out the management column (and remove it entirely from the prediction set).

non_management = [column for column in training_data.columns if column != "management"]
y_train = training_data.management
x_train = training_data[non_management]
x_predict = prediction_data[non_management]

Scaling

I don't think the Random Forest model that I'm going to use needs it, but I'm going to standardize the data.

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_predict = pandas.DataFrame(scaler.transform(x_predict), index=x_predict.index)

Feature Selection

Since we now have so many features, I'm going to do some feature selection.

print(x_train.shape)
print(x_predict.shape)
trees = ExtraTreesClassifier(n_estimators=10)
eliminator = RFECV(estimator=trees, cv=StratifiedKFold(10), scoring="roc_auc")
eliminator.fit(x_train, y_train)
x_train_reduced = eliminator.transform(x_train)
x_predict_reduced = pandas.DataFrame(eliminator.transform(x_predict), index=x_predict.index)
print(x_train_reduced.shape)
print(x_predict_reduced.shape)

When I trained using a train-test split, the feature elimination left 17 columns. I wonder if using the whole training set changes the outcome.
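Here's a rough sketch of the comparison I have in mind: re-run the same elimination on a stratified split of the training data and see how many columns survive. It assumes the scaled `x_train`, `y_train`, and `x_train_reduced` from the cells above are still in scope; the `random_state` is just my choice to make the split repeatable.

x_sub_train, x_sub_holdout, y_sub_train, y_sub_holdout = train_test_split(
    x_train, y_train, stratify=y_train, random_state=0)
sub_eliminator = RFECV(estimator=ExtraTreesClassifier(n_estimators=10),
                       cv=StratifiedKFold(10), scoring="roc_auc")
sub_eliminator.fit(x_sub_train, y_sub_train)
print("columns kept on the split:", sub_eliminator.n_features_)
print("columns kept on the full training set:", x_train_reduced.shape[1])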

Logistic Regression

model = LogisticRegressionCV(penalty="l1", scoring="roc_auc",
                             solver="liblinear", cv=StratifiedKFold(10))
model.fit(x_train_reduced, y_train)
print(model.scores_[1.0].mean())
print(model.scores_[1.0].std())

It seems to be doing much worse than when I used the train-test split.

Random Forests

parameter_grid = dict(n_estimators=range(10, 100, 10))
search = GridSearchCV(RandomForestClassifier(), parameter_grid,
                      cv=StratifiedKFold(10), scoring="roc_auc")
search.fit(x_train_reduced, y_train)
print(search.best_score_)
class RandomForest(object):
    """builds the random forest

    Args:
     x_train(array): data to train on
     y_train(array): targets for training
     start (int): start value for number of estimators
     stop (int): upper value for range of estimators
     step (int): increment for range of estimators
     folds (int): K-folds for cross-validation    
    """
    def __init__(self, x_train, y_train,
                 start=10, stop=100, step=10, folds=10):
        self.x_train = x_train
        self.y_train = y_train
        self.start = start
        self.stop = stop
        self.step = step
        self.folds = folds
        self._parameters = None
        self._search = None
        self._model = None
        return

    @property
    def parameters(self):
        """parameters for the grid-search"""
        if self._parameters is None:
            self._parameters = dict(n_estimators=range(self.start,
                                                       self.stop,
                                                       self.step))
        return self._parameters

    @property
    def search(self):
        """fitted grid search to find hyper-parameters"""
        if self._search is None:
            self._search = GridSearchCV(RandomForestClassifier(),
                                        self.parameters,
                                        cv=StratifiedKFold(self.folds),
                                        scoring="roc_auc")
            self._search.fit(self.x_train, self.y_train)
        return self._search

    @property
    def model(self):
        """best model found by the grid search"""
        if self._model is None:
            self._model = self.search.best_estimator_
        return self._model

Data Loader

Since having all these org-babel blocks scattered around makes things hard to keep track of, I'm going to make a class that bundles everything together.

class DataLoader(object):
    """loads and transforms the data
    Args:
     estimators (int): number of trees to use for feature elimination
    """
    def __init__(self, estimators=10):
        self.estimators = estimators
        self._data = None
        self._dummies_data = None
        self._training_data = None
        self._prediction_data = None
        self._non_management = None
        self._y_train = None
        self._x_train = None
        self._x_predict = None
        self._scaler = None
        self._x_train_scaled = None
        self._x_predict_scaled = None
        self._eliminator = None
        self._x_train_reduced = None
        self._x_predict_reduced = None
        return

    @property
    def data(self):
        """The initial data"""
        if self._data is None:
            if not os.path.isfile("email_data.h5"):
                data = pandas.DataFrame(index=email.nodes())
                data["department"] = pandas.Series(networkx.get_node_attributes(email, "Department"))
                data["management"] = pandas.Series(networkx.get_node_attributes(email, "ManagementSalary"))
                data["clustering"] = pandas.Series(networkx.clustering(email))
                data["degree"] = pandas.Series(email.degree())
                data["degree_centrality"] = pandas.Series(networkx.degree_centrality(email))
                data["closeness_centrality"] = pandas.Series(networkx.closeness_centrality(email))
                data["betweenness_centrality"] = pandas.Series(networkx.betweenness_centrality(email))
                data["pagerank"] = pandas.Series(networkx.pagerank(email))
                _, authority = networkx.hits(email)
                data["authority"] = pandas.Series(authority)
                data.to_hdf("email_data.h5","df" )
                self._data = data
            else:
                self._data = pandas.read_hdf('email_data.h5', "df")
        return self._data

    @property
    def dummies_data(self):
        """one-hot-encoded data"""
        if self._dummies_data is None:
            self._dummies_data = pandas.get_dummies(self.data, columns=["department"])
        return self._dummies_data

    @property
    def training_data(self):
        """data with management information"""
        if self._training_data is None:
            self._training_data = self.dummies_data[pandas.notnull(
                self.dummies_data.management)]
        return self._training_data

    @property
    def prediction_data(self):
        """data missing management information"""
        if self._prediction_data is None:
            self._prediction_data = self.dummies_data[pandas.isnull(
                self.dummies_data.management)]
            assert len(self._prediction_data) == 252
        return self._prediction_data

    @property
    def non_management(self):
        """list of columns minus management"""
        if self._non_management is None:
            self._non_management = [
                column for column in self.training_data.columns
                if column != "management"]
        return self._non_management

    @property
    def y_train(self):
        """target-data for training"""
        if self._y_train is None:
            self._y_train = self.training_data.management
        return self._y_train

    @property
    def x_train(self):
        """data for training"""
        if self._x_train is None:
            self._x_train = self.training_data[self.non_management]
        return self._x_train

    @property
    def x_predict(self):
        """set to make predictions"""
        if self._x_predict is None:
            self._x_predict = self.prediction_data[self.non_management]
        return self._x_predict

    @property
    def scaler(self):
        """standard scaler"""
        if self._scaler is None:
            self._scaler = StandardScaler()
        return self._scaler

    @property
    def x_train_scaled(self):
        """training data scaled to 1 std, 0 mean"""
        if self._x_train_scaled is None:
            self._x_train_scaled = self.scaler.fit_transform(self.x_train)
        return self._x_train_scaled

    @property
    def x_predict_scaled(self):
        """prediction data with mean 0, std 1

        The answer requires the index so this is a dataframe
        instead of an array

        Returns:
         pandas.DataFrame: scaled data with index preserved
        """
        if self._x_predict_scaled is None:
            self._x_predict_scaled = pandas.DataFrame(
                self.scaler.transform(self.x_predict),
                index=self.x_predict.index)
        return self._x_predict_scaled

    @property
    def eliminator(self):
        """recursive feature eliminator"""
        if self._eliminator is None:
            trees = ExtraTreesClassifier(n_estimators=10)
            self._eliminator = RFECV(estimator=trees, cv=StratifiedKFold(10), 
                                     scoring="roc_auc")
            self._eliminator.fit(self.x_train_scaled, self.y_train)
        return self._eliminator

    @property
    def x_train_reduced(self):
        """training data with features eliminated"""
        if self._x_train_reduced is None:
            self._x_train_reduced = self.eliminator.transform(
                self.x_train_scaled)
        return self._x_train_reduced

    @property
    def x_predict_reduced(self):
        """prediction data with features eliminated"""
        if self._x_predict_reduced is None:
            self._x_predict_reduced = pandas.DataFrame(
                self.eliminator.transform(self.x_predict_scaled),
                index=self.x_predict_scaled.index)
        return self._x_predict_reduced

Submission

def salary_predictions():
    """Prediction that employee is management

    Calculates the probability that an employee is management

    Returns:
     pandas.Series: probability of receiving a management salary, indexed by node ID
    """
    data = DataLoader()
    forest = RandomForest(data.x_train_reduced, data.y_train)
    # probabilities is an array with rows of 
    # [<probability not management>, <probability management>]
    # see forest.model.classes_ to see what each entry represents
    probabilities = forest.model.predict_proba(data.x_predict_reduced)
    return pandas.Series(probabilities[:, 1], index=data.x_predict_reduced.index)
output = salary_predictions()
print(output.head())
assert all(output.index == DataLoader().prediction_data.index)
assert len(output) == 252

Part 2B - New Connections Prediction

For the last part of this assignment, you will predict future connections between employees of the network. The future connections information has been loaded into the variable `future_connections`. The index is a tuple indicating a pair of nodes that currently do not have a connection, and the `Future Connection` column indicates if an edge between those two nodes will exist in the future, where a value of 1.0 indicates a future connection.

future_connections = pandas.read_csv('Future_Connections.csv', index_col=0, converters={0: eval})
print(future_connections.head(10))
print(future_connections['Future Connection'].value_counts())

Using network `email` and `future_connections`, identify the edges in `future_connections` with missing values and predict whether or not these edges will have a future connection.

To accomplish this, you will need to create a matrix of features for the edges found in `future_connections` using networkx, train a sklearn classifier on those edges in `future_connections` that have `Future Connection` data, and predict a probability of the edge being a future connection for those edges in `future_connections` where `Future Connection` is missing.

Your predictions will need to be given as the probability of the corresponding edge being a future connection.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).

Your grade will be based on the AUC score computed for your classifier. A model with an AUC of 0.75 or higher will receive full points.

Using your trained classifier, return a series of length 122112 with the data being the probability of the edge being a future connection, and the index being the edge as represented by a tuple of nodes.

(107, 348)    0.35
(542, 751)    0.40
(20, 426)     0.55
(50, 989)     0.35
          ...
(939, 940)    0.15
(555, 905)    0.35
(75, 101)     0.65
Length: 122112, dtype: float64

Add Network Features

class Futures(object):
    target = "Future Connection"
    data_file = "Future_Connections.csv"
    graph_file = "email_prediction.txt"
    networkx_data_index = 2
    folds = 10
class DataNames(object):
    resource_allocation = 'resource_allocation'
    jaccard = 'jaccard_coefficient'
    adamic = "adamic_adar"
    preferential = "preferential_attachment"
def add_networkx_data(adder, name, graph=email, frame=future_connections):
    """Adds networkx data to the frame

    The networkx link-prediction functions return generators of triples:
     (first-node, second-node, value)

    This will use the index of the frame that's passed in as the source of 
    node-pairs for the networkx function (called `ebunch` in the networkx
    documentation) and then add only the value we want back to the frame

    Args:
     adder: networkx function to call to get the new data
     name: column-name to add to the frame
     graph: networkx graph to pass to the function
     frame (pandas.DataFrame): frame with node-pairs as index to add data to
    """
    frame[name] = [output[Futures.networkx_data_index]
                   for output in adder(graph, frame.index)]
    return frame

Adding A Resource Allocation Index

add_networkx_data(networkx.resource_allocation_index,
                  DataNames.resource_allocation)
print(future_connections.head(1))

Adding the Jaccard Coefficient

add_networkx_data(networkx.jaccard_coefficient, DataNames.jaccard)
print(future_connections.head(1))

Adamic Adar

add_networkx_data(networkx.adamic_adar_index, DataNames.adamic)
print(future_connections.head(1))

Preferential Attachment

add_networkx_data(networkx.preferential_attachment, DataNames.preferential)
print(future_connections.head(1))

Setup the Training and Testing Data

Separating the Edges Without 'Future Connection' Values

We are going to train on the rows that have 'Future Connection' values and then make predictions for the rows that don't.

prediction_set = future_connections[future_connections[Futures.target].isnull()]
training_set = future_connections[future_connections[Futures.target].notnull()]
print(prediction_set.shape)
print(training_set.shape)
assert len(prediction_set) + len(training_set) == len(future_connections)

Separate the Target and Training Sets

non_target = [column for column in future_connections.columns
              if column != Futures.target]
x_train = training_set[non_target]
y_train = training_set[Futures.target]
x_predict = prediction_set[non_target]
assert all(x_train.columns == x_predict.columns)
assert len(x_train) == len(y_train)

Scaling the Data

To enable the use of linear models I'm going to scale the data so the mean is 0 and the variance is 1.

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_predict_scaled = scaler.transform(x_predict)

x_train_frame = pandas.DataFrame(x_train_scaled, columns=x_train.columns)
x_predict_frame = pandas.DataFrame(x_predict_scaled, columns=x_predict.columns)
print(x_train_frame.describe())
print(x_predict_frame.describe())

Feature Selection

To reduce the dimensionality I'm going to use model-based selection with Extra Trees.

estimator = ExtraTreesClassifier()
estimator.fit(x_train_scaled, y_train)
selector = SelectFromModel(estimator, prefit=True)
x_train_trees_sfm = selector.transform(x_train_scaled)
x_predict_sfm = selector.transform(x_predict_scaled)
print(estimator.feature_importances_)
print(x_train_trees_sfm.shape)

Missing Future Connections

model = LogisticRegressionCV(n_jobs=-1, scoring='roc_auc', solver='liblinear',
                             cv=StratifiedKFold())
model.fit(x_train_trees_sfm, y_train)
for scores in model.scores_[1.0]:
    print(max(scores))
print(model.classes_)
def new_connections_predictions():    
    probabilities = model.predict_proba(x_predict_sfm)
    return pandas.Series(probabilities[:, 1], index=prediction_set.index)
outcome = new_connections_predictions()
assert len(outcome) == 122112, len(outcome)
print(outcome.head())

Selecting the E-Mail Model

Imports

# pypi
from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    )
from sklearn.feature_selection import (
    RFECV,
    SelectFromModel,
)
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import (
    GridSearchCV,
    StratifiedKFold,
    train_test_split,
    )

from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import matplotlib.pyplot as pyplot
import mglearn
import numpy
import pandas
import seaborn
% matplotlib inline
seaborn.set_style("whitegrid")

The Data

data = pandas.read_hdf("email_data.h5", "df")
cleaned_data = data[pandas.notnull(data.management)]
print(cleaned_data.head())

Department

Even though I don't think it will prove useful, the department feature is actually categorical despite being encoded as integers, so we'll add dummy variables for it.

cleaned_data = pandas.get_dummies(cleaned_data, columns=["department"])
print(cleaned_data.head(1))

Splitting the Data

For evaluation purposes I'll use the traditional train-test split.

x_data = cleaned_data.loc[:, cleaned_data.columns != "management"]

y_data = cleaned_data.management

print(x_data.head())
print(y_data.head())
print(y_data.value_counts())
seaborn.countplot(x='management', data=cleaned_data)

It looks like the management data is unbalanced, so I'll do a stratified split.

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, stratify=y_data)
print(x_train.shape)
print(y_test.shape)
seaborn.countplot(y_train)

Looks close enough for government work.

Standardizing the Data

The linear models expect the data to be standardized, so to make the comparisons fair I'll standardize the data first. First, a look at the data before scaling.

print(x_train.describe())

Now I'll scale it.

scaler = StandardScaler()
scaler.fit(x_train)
x_train = pandas.DataFrame(scaler.transform(x_train), columns=x_train.columns)
x_test = scaler.transform(x_test)

Now the means should be near 0 (very small) and the standard deviations should be around 1.

print(x_train.describe())

Dummy Classifier

As a baseline I'll use a Dummy Classifier which uses a simple rule rather than the input data to make predictions.

parameter_grid = dict(strategy=["stratified", 'most_frequent', 'prior', 'uniform'])

Now we'll do a grid search.

grid_search = GridSearchCV(DummyClassifier(), parameter_grid,
                           cv=StratifiedKFold(10), scoring="roc_auc")
grid_search.fit(x_train, y_train)
BASELINE = grid_search.score(x_test, y_test)
print(grid_search.best_params_)
print(BASELINE)

It looks like it chose the stratified strategy, which makes random predictions that match the training set's class proportions. Our baseline AUC score is 0.5 (0.47 now?).

results = pandas.DataFrame(grid_search.cv_results_)
print(results.head(1))
figure = pyplot.figure()
axe = figure.gca()
strategies = parameter_grid["strategy"]
x = pyplot.xticks(list(range(len(strategies))), strategies)
axe.plot(range(len(strategies)), results.mean_test_score)
axe.set_title("Dummy Classifier Strategy Vs AUC")
axe.set_xlabel("strategy")
axe.set_ylabel("AUC Score")

So it looks like all the strategies except stratified did the same - and even the stratified did basically the same if you round it off.

Feature Selection

I'm going to need to do some feature reduction, but figuring out what is important and what isn't is something I'm going to have to leave to the machine. I'm going to assume that the features thrown out by logistic regression with l1 penalization are unimportant.

logistic_model = LogisticRegressionCV(penalty='l1',
                                      solver='liblinear', scoring="roc_auc")
logistic_model.fit(x_train, y_train)
model = SelectFromModel(logistic_model, prefit=True)

x_train_positive = model.transform(x_train)
x_test_positive = model.transform(x_test)
print(logistic_model.score(x_test, y_test))

Logistic Regression with L1 penalty seems to do reasonably well even without feature selection.

logistic_model.fit(x_train_positive, y_train)
print(logistic_model.score(x_test_positive, y_test))

It looks like feature selection didn't really help here.

print(x_train.shape)
print(x_train_positive.shape)
print(model.get_support())

As a double-check I'll use a tree-based, recursive feature-elimination version.

trees = ExtraTreesClassifier(n_estimators=10)
eliminator = RFECV(estimator=trees, cv=StratifiedKFold(10), scoring="roc_auc")
eliminator.fit(x_train, y_train)
x_train_trees = eliminator.transform(x_train)
x_test_trees = eliminator.transform(x_test)
print(x_train_trees.shape)
print(eliminator.ranking_)

This eliminated many more columns than the Logistic Regression version did.

Warning: this seems to change every time you run it - the randomness changes the result. Only the elimination of the first column seems to do as well as not running it at all.
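One way to make the elimination repeatable is to pin down the randomness. This is just a sketch under that assumption: fixing `random_state` on both the trees and the cross-validation splitter should make RFECV pick the same columns on every run (the seed value of 0 is an arbitrary choice of mine).

seeded_trees = ExtraTreesClassifier(n_estimators=10, random_state=0)
seeded_eliminator = RFECV(estimator=seeded_trees,
                          cv=StratifiedKFold(10, shuffle=True, random_state=0),
                          scoring="roc_auc")
seeded_eliminator.fit(x_train, y_train)
print(seeded_eliminator.n_features_)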

Fit and Display

This is a convenience function so I can fit and display the scores for the models.

def fit_and_display(model, identifier):
    """Fit and display the scores

    Args:
     model: The instantiated model to fit
     identifier (str): something to output at the beginning
    """
    print(identifier)
    print("=" * len(identifier))
    model.fit(x_train, y_train)
    print("\nX-train")
    print("Score: {:.2f}".format(model.score(x_test, y_test)))
    print("\nX-Train Positive")
    model.fit(x_train_positive, y_train)
    print("Score: {:.2f}".format(model.score(x_test_positive, y_test)))
    print("\nX-Train Trees")
    model.fit(x_train_trees, y_train)
    print("Score: {:.2f}".format(model.score(x_test_trees, y_test)))
    print("\nBest Training Score: {}".format(search.best_score_))
    return

Logistic Regression

L1 Penalty

model = LogisticRegressionCV(penalty="l1", scoring="roc_auc", solver="liblinear")
fit_and_display(model, "Logistic Regression L1")

I've already run the Logistic Regression using an 'l1' penalty, but I'll try it again with 'l2' to see if it improves.

model = LogisticRegressionCV(scoring="roc_auc", solver="liblinear")
fit_and_display(model, "LogisticRegression")

L1 seems to do better than L2 overall, although it doesn't do as well with the recursively-eliminated data for some reason.

Random Forests

I'll try a Random Forest classifier next.

parameter_grid = dict(n_estimators=range(10, 100, 10))
search = GridSearchCV(RandomForestClassifier(), parameter_grid,
                      cv=StratifiedKFold(10), scoring="roc_auc")
fit_and_display(search, "Random Forest")

This seems to have done much better than the logistic regression did. My logistic-regression feature reduction doesn't seem to help.

class RandomForest(object):
    """trains a random forest on the x-test-trees set

    Args:
     start (int): first n-estimators value to use
     stop (int): last n-estimators value (minus step)
     step (int): amount to increment estimators
     folds (int): Cross-validation folds to use

    Returns:
     GridSearchCV: grid-search with the best estimator
    """

    def __init__(self, start, stop, step, folds=10):
        self.start = start
        self.stop = stop
        self.step = step
        self.folds = folds
        self._search = None
        self._parameter_grid = None
        return

    @property
    def parameter_grid(self):
        """dict of the number of estimators to use"""
        if self._parameter_grid is None:
            self._parameter_grid = dict(n_estimators=list(range(self.start,
                                                                self.stop,
                                                                self.step)))
        return self._parameter_grid

    @property
    def search(self):
        """grid-search cv object"""
        if self._search is None:
            self._search = GridSearchCV(RandomForestClassifier(),
                                        self.parameter_grid,
                                        cv=StratifiedKFold(self.folds),
                                        scoring="roc_auc")
        return self._search    

    def fit(self):
        """fits the model to the tree-based reduced-feature data"""
        self.search.fit(x_train_trees, y_train)
        print(self.search.score(x_test_trees, y_test))
        print(self.search.best_estimator_.feature_importances_)
        print(self.search.best_params_)
        return

    def plot(self):
        """Plots estimators vs AUC scores"""
        figure = pyplot.figure()
        axe = figure.gca()
        axe.plot(self.parameter_grid["n_estimators"],
                 self.search.cv_results_["mean_test_score"])
        axe.set_title("Estimator Count vs AUC")
        axe.set_xlabel("Number of estimators (trees)")
        axe.set_ylabel("Mean AUC Score")
        return
search = RandomForest(10, 100, 10)
search.fit()

Not a lot of variance in the importance of the features.

search.plot()

Would things get better with more trees?

search = RandomForest(150, 250, 10)
search.fit()
search.plot()

In this case the test-score was better, although the training scores don't look much better. I guess it's the randomness coming into play again. I'll try a long run instead.

search = RandomForest(10, 500, 10)
search.fit()
search.plot()

The test-score for the best estimator is actually a little worse than it was for the previous case, although it's quite a small difference.

K Nearest Neighbors

parameters = dict(n_neighbors=range(10, 20),
                  weights=["uniform", "distance"],
                  p=[1, 2],
                  leaf_size=range(10, 50, 10))

search = GridSearchCV(KNeighborsClassifier(), parameters, scoring="roc_auc")
search.fit(x_train_trees, y_train)
print(search.score(x_test_trees, y_test))
print(search.best_params_)

This doesn't seem to do so well, although I'm not as experienced at using it so I might be using bad parameters.

Support Vector Classifier (SVC)

parameters = dict(C=numpy.arange(.1, 1, 0.1), gamma=range(1, 10, 1),
                  kernel=["linear", 'rbf', 'sigmoid'])
search = GridSearchCV(SVC(class_weight='balanced'), parameters, scoring='roc_auc')
fit_and_display(search, "SVC")
print(search.score(x_test_trees, y_test))
print(search.best_params_)

Now that the data is scaled, the SVC does much better, although still not as well as the random forest.

Future E-Mail

This will select a model to predict whether an edge in the email-network that currently doesn't have an edge will have one in the future.

Tangle

<<imports>>

<<futures>>

<<data-names>>

<<files>>

<<training>>

<<load-graph>>

<<load-future>>

<<add-networkx-data>>

<<future-connections>>

<<jaccard-coefficient>>

<<adamic-adar>>

<<preferential-attachment>>

<<save-future-connections>>

<<split-future-connections>>

<<train-test-predict>>

<<train-test-split>>

<<scaled-data>>

<<pickle-it>>

<<unpickle-it>>

<<lr-rfs>>

<<trees-rfs>>

<<lr-sfm>>

<<trees-fsm>>

<<scores-identifiers>>

<<fit-and-print>>

<<data-sets>>

<<key-by-value>>

<<fit-and-print-all>>

<<logistic-model>>

<<fit-grid-search>>

<<fit-grid-searches>>

<<random-forests>>

<<extra-trees>>

Imports

# python standard library
import os
import pickle

# pypi
import networkx
import pandas
import seaborn

from numba import jit

from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    )
from sklearn.feature_selection import (
    RFECV,
    SelectFromModel,
    )
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import (
    train_test_split,
    GridSearchCV,
    StratifiedKFold,
    )
from sklearn.preprocessing import StandardScaler
% matplotlib inline
seaborn.set_style("whitegrid")

Constants

class Futures(object):
    target = "Future Connection"
    data_file = "Future_Connections.csv"
    graph_file = "email_prediction.txt"
    networkx_data_index = 2
    folds = 10
class DataNames(object):
    resource_allocation = 'resource_allocation'
    jaccard = 'jaccard_coefficient'
    adamic = "adamic_adar"
    preferential = "preferential_attachment"
class Files(object):
    """File-names for data persistence"""
    future_training_data = 'future_training_data.csv'
    future_selection_outcomes = 'future_selection_outcomes.pkl'
    future_model_selection = "future_model_cvs.pkl"
class Training(object):
    """data-pickles"""
    x_train_lr_rfs = "x_train_lr_rfs.pkl"
    x_test_lr_rfs = "x_test_lr_rfs.pkl"
    x_train_trees_rfs = "x_train_trees_rfs.pkl"
    x_test_trees_rfs = "x_test_trees_rfs.pkl"
    x_train_lr_sfm = "x_train_lr_sfm.pkl"
    x_test_lr_sfm = "x_test_lr_sfm.pkl"
    x_train_trees_sfm = "x_train_trees_sfm.pkl"
    x_test_trees_sfm = "x_test_trees_sfm.pkl"

The Email-Graph

To get the features for the models we'll need to use the email-graph.

email = networkx.read_gpickle(Futures.graph_file)

The Data

The Given Data

We're given a csv file with the training and prediction data in it ('Future_Connections.csv').

head Future_Connections.csv
echo

Org-mode converted it to a table, but it's actually a CSV. The first line of data looks like this.

"(6, 840)",0.0
future_connections_pre_loaded = os.path.isfile(Files.future_training_data)
if future_connections_pre_loaded:
    future_connections = pandas.read_csv(Files.future_training_data,
                                         index_col=0)
else:
    future_connections = pandas.read_csv(Futures.data_file,
                                         index_col=0,
                                         converters={0: eval})

So, we're loading the node-pairs (edges) as the index of the data-frame and explicitly telling pandas to run `eval` on that first column, which turns the pairs from strings into tuples. I wasn't sure the conversion was necessary, but it came with the problem statement so I'll leave it in in case there's some side-effect I'm not aware of.
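To see what the converter is actually doing, here's a tiny sketch using the first index entry shown above; without `converters={0: eval}` the node-pairs come back as plain strings rather than tuples.

raw = "(6, 840)"
print(type(raw).__name__, type(eval(raw)).__name__)  # str -> tuple
print(eval(raw))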

print(future_connections[Futures.target].value_counts())

This is a fairly big (and lopsided) data-set.

seaborn.countplot(x=Futures.target, data=future_connections)

Adding networkx features

To create features to train the model and make predictions, I'm going to use the networkx link prediction algorithms.

Add Networkx Data

This is a function to get networkx data and add it to the data-frame. It won't work for the community-based algorithms.

def add_networkx_data(adder, name, graph=email, frame=future_connections):
    """Adds networkx data to the frame

    The networkx link-prediction functions return generators of triples:
     (first-node, second-node, value)

    This will use the index of the frame that's passed in as the source of 
    node-pairs for the networkx function (called `ebunch` in the networkx
    documentation) and then add only the value we want back to the frame

    Args:
     adder: networkx function to call to get the new data
     name: column-name to add to the frame
     graph: networkx graph to pass to the function
     frame (pandas.DataFrame): frame with node-pairs as index to add data to
    """
    frame[name] = [output[Futures.networkx_data_index]
                   for output in adder(graph, frame.index)]
    return frame

Adding A Resource Allocation Index

if not future_connections_pre_loaded:
    add_networkx_data(networkx.resource_allocation_index,
                      DataNames.resource_allocation)
print(future_connections.head(1))

Adding the Jaccard Coefficient

if not future_connections_pre_loaded:
    add_networkx_data(networkx.jaccard_coefficient, DataNames.jaccard)
print(future_connections.head(1))

Adamic Adar

if not future_connections_pre_loaded:
    add_networkx_data(networkx.adamic_adar_index, DataNames.adamic)
print(future_connections.head(1))

Preferential Attachment

if not future_connections_pre_loaded:
    add_networkx_data(networkx.preferential_attachment, DataNames.preferential)
print(future_connections.head(1))

Community-Based Link Prediction

This requires identifying 'communities' first, so I'll defer it for now.

#add_networkx_data(networkx.cn_soundarajan_hopcroft, DataNames.common_neighbors)

These three all require communities to be defined before they will work (so I'm skipping them for now; a rough sketch of how they could be wired up follows the list):

  • cn_soundarajan_hopcroft
  • ra_index_soundarajan_hopcroft
  • within_inter_cluster
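Here is the rough sketch mentioned above. It's only my assumption that the `Department` attribute is a reasonable stand-in for the community label (the `community` node attribute these functions expect), and the column name and `sketch_frame` below are mine, not part of the assignment.

departments = networkx.get_node_attributes(email, "Department")
for node in email.nodes():
    # older networkx exposes node attributes through graph.node[...]
    email.node[node]["community"] = departments.get(node, -1)
sketch_frame = future_connections.copy()  # keep the real frame untouched
add_networkx_data(networkx.cn_soundarajan_hopcroft,
                  "cn_soundarajan_hopcroft",
                  frame=sketch_frame)
print(sketch_frame.head(1))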

Saving the Data

future_connections.to_csv(Files.future_training_data)

Setup the Training and Testing Data

Separating the Edges Without 'Future Connection' Values

We are going to train on the rows that have 'Future Connection' values and then make predictions for those that don't. For model selection we don't need the set with the missing values, but I'll separate it out anyway to be complete.

prediction_set = future_connections[future_connections[Futures.target].isnull()]
training_set = future_connections[future_connections[Futures.target].notnull()]
print(prediction_set.shape)
print(training_set.shape)
assert len(prediction_set) + len(training_set) == len(future_connections)

Separate the Target and Training Sets

non_target = [column for column in future_connections.columns
              if column != Futures.target]
training = training_set[non_target]
testing = training_set[Futures.target]
predictions = prediction_set[non_target]
assert all(training.columns == predictions.columns)
assert len(training) == len(testing)

Setting Up the Testing and Training Sets

x_train, x_test, y_train, y_test = train_test_split(training, testing, stratify=testing)
print(x_train.shape)
print(x_test.shape)
seaborn.countplot(y_train)
seaborn.countplot(y_test)

Scaling the Data

To enable the use of linear models I'm going to scale the data so the mean is 0 and the variance is 1.

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

x_train = pandas.DataFrame(x_train, columns=training.columns)
x_test = pandas.DataFrame(x_test, columns=training.columns)
print(x_train.describe())
print(x_test.describe())

Feature Selection

To reduce the dimensionality I'm going to use recursive feature selection and model-based selection.

def pickle_it(thing, name):
    """saves the thing as a pickle"""
    with open(name, "wb") as writer:
        pickle.dump(thing, writer)
def unpickle_it(name):
    """loads the object from the file-name

    Args:
     name (str): name of binary pickle file

    Returns:
     obj: unpickled object
    """
    with open(name, 'rb') as reader:
        thing = pickle.load(reader)
    return thing

RFECV with Logistic Regression

if os.path.isfile(Training.x_train_lr_rfs):
    x_train_lr_rfs = unpickle_it(Training.x_train_lr_rfs)
    x_test_lr_rfs = unpickle_it(Training.x_test_lr_rfs)
else:
    estimator = LogisticRegressionCV(n_jobs=-1)
    selector = RFECV(estimator, scoring='roc_auc',
                     n_jobs=-1,
                     cv=StratifiedKFold(Futures.folds))
    x_train_lr_rfs = selector.fit_transform(x_train, y_train)
    x_test_lr_rfs = selector.transform(x_test)
    pickle_it(x_train_lr_rfs, Training.x_train_lr_rfs)
    pickle_it(x_test_lr_rfs, Training.x_test_lr_rfs)
    print(selector.ranking_)

It looks like it only discarded preferential attachment.

RFECV with Extra Trees

if os.path.isfile(Training.x_train_trees_rfs):
    x_train_trees_rfs = unpickle_it(Training.x_train_trees_rfs)
    x_test_trees_rfs = unpickle_it(Training.x_test_trees_rfs)
else:
    estimator = ExtraTreesClassifier()
    selector = RFECV(estimator, scoring='roc_auc', n_jobs=-1, cv=StratifiedKFold(Futures.folds))
    x_train_trees_rfs = selector.fit_transform(x_train, y_train)
    x_test_trees_rfs = selector.transform(x_test)
    pickle_it(x_train_trees_rfs, Training.x_train_trees_rfs)
    pickle_it(x_test_trees_rfs, Training.x_test_trees_rfs)
    print(selector.ranking_)

Strangely, the Extra Trees Classifier didn't remove any columns…

Select Model Logistic Regression

if os.path.isfile(Training.x_train_lr_sfm):
    x_train_lr_sfm = unpickle_it(Training.x_train_lr_sfm)
    x_test_lr_sfm = unpickle_it(Training.x_test_lr_sfm)
else:
    estimator = LogisticRegressionCV(
        n_jobs=-1, scoring='roc_auc',
        cv=StratifiedKFold(Futures.folds)).fit(x_train,
                                               y_train)
    selector = SelectFromModel(estimator, prefit=True)
    x_train_lr_sfm = selector.transform(x_train)
    x_test_lr_sfm = selector.transform(x_test)
    pickle_it(x_train_lr_sfm, Training.x_train_lr_sfm)
    pickle_it(x_test_lr_sfm, Training.x_test_lr_sfm)
    print(estimator.coef_)
print(x_train_lr_sfm.shape)

This was more aggressive, cutting out half the features. It looks like it kept Jaccard Coefficient and Adamic Adar and got rid of Resource Allocation and Preferential Attachment.
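Rather than eyeballing the coefficients, the selection mask can be mapped back to the column names. A small sketch, assuming the `selector` and `x_train` from the cell above are still defined (that is, the pickles didn't already exist on disk):

kept = selector.get_support()
print([name for name, keep in zip(x_train.columns, kept) if keep])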

Select Model Extra Trees

if os.path.isfile(Training.x_train_trees_sfm):
    x_train_trees_sfm = unpickle_it(Training.x_train_trees_sfm)
    x_test_trees_sfm = unpickle_it(Training.x_test_trees_sfm)
else:
    estimator = ExtraTreesClassifier()
    estimator.fit(x_train, y_train)
    selector = SelectFromModel(estimator, prefit=True)
    x_train_trees_sfm = selector.transform(x_train)
    x_test_trees_sfm = selector.transform(x_test)
    pickle_it(x_train_trees_sfm, Training.x_train_trees_sfm)
    pickle_it(x_test_trees_sfm, Training.x_test_trees_sfm)
    print(estimator.feature_importances_)
print(x_train_trees_sfm.shape)

This is sometimes more aggressive, keeping only the Adamic Adar feature… But maybe that's all you need, we'll see. Then again, other times it isn't as aggressive, only trimming two columns, and this time it only trimmed one…

Fitting the Models

Persistent Storage

The outcomes will be stored in a dictionary called scores with descriptions of the best model and feature-selection mapped to their testing-score.

if os.path.isfile(Files.future_model_selection):
    with open(Files.future_model_selection, 'rb') as pkl:
        scores = pickle.load(pkl)
else:
    scores = {}
def fit_and_print(estimator, x_train, x_test):
    """fits the estimator to the data

    Args:
     estimator: model to fit
     x_train: scaled data to fit model to
     x_test: data to test the model with

    Returns:
     tuple: model fit to the data, test score
    """
    model = estimator.fit(x_train, y_train)
    test_score = model.score(x_test, y_test)
    print("Mean Cross-Validation Score: {:.2f}".format(model.scores_[1].mean()))
    print("Testing Score: {:.2f}".format(test_score))
    return model, test_score
data_sets = {("extra trees", 'select from model') : (x_train_trees_sfm, x_test_trees_sfm),
             ("extra trees", 'recursive feature selection') : (x_train_trees_rfs, x_test_trees_rfs),
             ('logistic regression', "recursive feature selection") : (x_train_lr_rfs, x_test_lr_rfs),
             ('logistic regression', "select from model") : (x_train_lr_sfm, x_test_lr_sfm)}
def key_by_value(source, search_value):
    """Find the key in a dict that matches a value

    Args:
     source (dict): dictionary with value to search for
     search_value: value to search for

    Returns:
     object: key in source that matched value
    """
    for key, value in source.items():
        if value == search_value:
            return key
    return
def fit_and_print_all(model, model_name):
    """Fits the model against all data instances

    Args:
     model: model to fit to the data sets
     model_name: identifier for the outcomes
    """
    for data_set, x in data_sets.items():
        selector, method = data_set
        train, test = x
        key = ','.join([model_name, selector, method])
        print("Training Shape: {}".format(train.shape))
        if key not in scores:
            print(key)
            fitted, score = fit_and_print(model, train, test)
            scores[key] = score
        else:
            score = scores[key]
            print("{}: {:.3f}".format(key, score))
        print()

    best_score = max(scores.values())
    best_key = key_by_value(scores, best_score)
    print("Best Model So Far: {}, Score={:.2f}".format(
        best_key,
        best_score))
    with open(Files.future_model_selection, 'wb') as writer:
        pickle.dump(scores, writer)
    return

Logistic Regression

logistic_model = LogisticRegressionCV(n_jobs=-1, scoring="roc_auc",
                                      solver='liblinear',
                                      cv=StratifiedKFold(Futures.folds))
fit_and_print_all(logistic_model, "Logistic Regression")

Fit Grid Search

Since the Logistic Regression had its own cross-validation I didn't use a grid search, but for the forests I'll use one to figure out the best number of estimators. I'll have to look into what the other parameters do to figure out whether they're going to be useful.

def fit_grid_search(estimator, parameters, x_train, x_test):
    """Fits the estimator using grid search

    Args:
     estimator: Model to fit
     parameters (dict): hyper-parameters for the grid search
     x_train (array): the training data input
     x_test (array): data to evaluate the best model with

    Returns: 
     tuple: Best Model, best model score
    """
    search = GridSearchCV(estimator, parameters, n_jobs=-1, scoring='roc_auc',
                          cv=StratifiedKFold(Futures.folds))
    search.fit(x_train, y_train)
    best_model = search.best_estimator_
    test_score = best_model.score(x_test, y_test)
    print("Mean of Mean Cross-Validation Scores: {:.2f}".format(
        search.cv_results_["mean_train_score"].mean()))
    print("Mean of Cross-Validation Score STDs: {:.2f}".format(
        search.cv_results_["std_train_score"].mean()))
    print("Testing Score: {:.2f}".format(test_score))
    return best_model, test_score
def fit_grid_searches(estimator, parameters, name, data_sets=data_sets):
    """Fits the estimator against all the data-sets

    Args:
     estimator: instance of model to test
     parameters: dict of grid-search parameters
     name: identifier for the model
    """
    for data_set, x in data_sets.items():
        selector, method = data_set
        train, test = x
        key = ",".join([name, selector, method])
        if key not in scores:
            print(key)
            fitted, score = fit_grid_search(estimator, parameters, train, test)
            scores[key] = score
        else:
            score = scores[key]
            print("{}: {:.2f}".format(key, score))
        print()
    best = max(scores.values())
    best_key = key_by_value(scores, best)
    print("Best Model So Far: {}, Score={:.2f}".format(best_key, best))
    with open(Files.future_model_selection, 'wb') as writer:
        pickle.dump(scores, writer)
    return

Random Forests

parameters = dict(n_estimators = list(range(10, 200, 10)))
forest = RandomForestClassifier()
fit_grid_searches(forest, parameters, "Random Forest")

Extra Trees

scores = {k:v for k,v in scores.items() if not k.startswith('Extra Trees,extra trees')}
parameters = dict(n_estimators = list(range(10, 200, 10)))
trees = ExtraTreesClassifier()
fit_grid_searches(trees, parameters, "Extra Trees")

Friends and Politics

Node Centrality is a measure of the importance of a node to a network. This will explore measures of centrality using two networks, a friendship network, and a blog network.

Part 1 - Friendships

This will look at a network of friendships at a university department. Each node corresponds to a person (identified by an integer node label), and an edge indicates friendship.

Imports

import networkx

Friendships data

friendships = networkx.read_gml('friendships.gml')
print(len(friendships))
print(networkx.is_connected(friendships))
print(networkx.is_directed(friendships))

There are 1,133 people in the friendship network, which is a connected, undirected graph.

Degree, Closeness, and Normalized Betweenness Centrality

Find the degree centrality, closeness centrality, and normalized betweenness centrality (excluding endpoints) of node 100.

  • Degree Centrality scores a node based on the number of links it has to other nodes. The assumption is that a node with more connections should be more important.
  • Closeness Centrality uses the lengths of shortest paths to decide importance. The shorter the distances between a node and the other nodes, the more important it is.
  • Betweenness Centrality counts the number of shortest paths between pairs of nodes that pass through a node (a small worked example on a toy graph follows this list).
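Here's that worked example: a four-node path graph 0 - 1 - 2 - 3 (my own toy data, not the friendship network), looking at node 1.

toy = networkx.path_graph(4)  # 0 - 1 - 2 - 3
print(networkx.degree_centrality(toy)[1])      # 2 links / 3 other nodes = 0.667
print(networkx.closeness_centrality(toy)[1])   # 3 other nodes / (1 + 1 + 2) total distance = 0.75
print(networkx.betweenness_centrality(toy)[1]) # node 1 sits on 2 of the 3 pairwise shortest paths = 0.667
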
DEGREE_CENTRALITY = networkx.degree_centrality(friendships)
CLOSENESS_CENTRALITY = networkx.closeness_centrality(friendships)
BETWEENNESS_CENTRALITY = networkx.betweenness_centrality(friendships)
def node_centrality(node=100):
    """gets measures of centrality for node

    Args:
     node (int): the number (key) for the node

    Returns:
     tuple: 
      - float: degree centrality
      - float: closeness centrality
      - float: normalized betweeness centrality
    """
    return (DEGREE_CENTRALITY[node],
            CLOSENESS_CENTRALITY[node], BETWEENNESS_CENTRALITY[node])
print("Node 101:")
degree, closeness, betweenness = node_centrality()
print("Degree Centrality: {0:.4f}".format(degree))
print("Closeness Centrality: {0:.2f}".format(closeness))
print("Normalized Betweenness Centrality: {0:.6f}".format(betweenness))
def largest_node(centrality):
    """gets the node with the best (largest) score

    Args:
     centrality (dict): one of the centrality score dicts

    Returns:
     int: name of the node with the best score
    """
    return list(
        reversed(sorted((value, node)
                        for (node, value) in centrality.items())))[0][1]

Most Connected Friend

We want to contact one person in our friendship network and have him or her contact all his or her immediate friends. To have the greatest impact, this person should have the most links in the network. Which node is this?

def most_connected_friend():
    """returns the node with the best degree centrality"""
    return largest_node(DEGREE_CENTRALITY)
MOST_CONNECTED = most_connected_friend()
print("Most Connected Friend: {}".format(MOST_CONNECTED))
connected = networkx.Graph()
friends = friendships[MOST_CONNECTED]
for friend in friends:
    connected.add_edge(MOST_CONNECTED, friend)
positions = networkx.spring_layout(connected)
networkx.draw(connected, positions, with_labels=False, node_color='b', node_size=50)
networkx.draw(connected, positions, nodelist=[MOST_CONNECTED], node_color="r")

Node 105 does appear to be well connected.

Fewest Hops

We want to reach everyone in the network by having one person pass messages to their friends, who then pass them on, and so forth (a six-degrees-of-separation type scenario), but we want the fewest number of transfers. Which friend is closest to all the people in the friendship network?

def closest_friend():
    """the node with the best closeness centrality

    Returns:
     int: Identifier for the node closest to all the other nodes
    """
    return largest_node(CLOSENESS_CENTRALITY)
CLOSEST_FRIEND = closest_friend()
print("Closest Friend: {}".format(CLOSEST_FRIEND))
positions = networkx.spring_layout(friendships)
networkx.draw(friendships, positions, node_size=1, alpha=0.25, node_color='b')
networkx.draw_networkx_nodes(friendships, positions, nodelist=[CLOSEST_FRIEND],
                             node_color='r', node_size=50)

Interesting to look at, if not the most informative.

Most Important Connection

Although the graph is connected, if you took out one person from the network, which one would cause the most disruption (which person sits on the most shortest paths)?

def betweenness_centrality():
    """the node with the highest betweenness centrality

    Returns:
     int: ID of the person who sits on the most shortest paths
    """
    return largest_node(BETWEENNESS_CENTRALITY)
MOST_BETWEEN = betweenness_centrality()
print("Most Between Friend: {}".format(MOST_BETWEEN))

Part 2 - Political Blogs

Now we're going to use PageRank and Hyperlink-Induced Topic Search (HITS) to look at a directed network of political blogs, where nodes correspond to a blog and edges correspond to links between blogs.

blogs = networkx.read_gml('blogs.gml')
print(len(blogs))
print(networkx.is_directed(blogs))
networkx.draw(blogs, alpha=0.5, node_size=1, node_color='r')

Scaled Page Rank of realclearpolitics.com

PageRank scores web-pages by the number of important nodes that link directly to them. It is possible for the algorithm to get stuck if there are no edges leading out of a directed subgraph, producing erroneous page-ranks, so Scaled PageRank uses random restarts to occasionally jump to a new node, an idea similar to the way Stochastic Gradient Descent avoids getting stuck in local minima. The networkx pagerank function uses a default damping factor of 0.85, which I will use, so it will do a random restart about 15% of the time.
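As a quick sanity check on what the damping factor does, here's a sketch comparing the default against an arbitrary lower value of my choosing; a lower alpha means more frequent random jumps, which tends to flatten the scores.

for alpha in (0.85, 0.5):
    ranks = networkx.pagerank(blogs, alpha=alpha)
    print("alpha={}: top score {:.4f}".format(alpha, max(ranks.values())))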

PAGE_RANK = networkx.pagerank(blogs)
def real_clear_politics_page_rank():
    """Page Rank of realclearpolitics.com

    Returns:
     float: The PageRank for the realclearpolitics blog.
    """
    return PAGE_RANK['realclearpolitics.com']
print("Real Clear Politics Page Rank: {0:.4f}".format(real_clear_politics_page_rank()))

Top Five Blogs by Page Rank

This time the PageRank scores will be used to find what it thinks are the most important blogs.

def top_five(ranks, count=5):
    """gets the top-five blogs by rank

    Args:
     ranks (dict): map of node to score
     count (int): number to return

    Returns:
     list [str]: names of the top blogs - most to least important
    """
    top = list(reversed(sorted((rank, node)
                               for node, rank in ranks.items())))[:count]
    return [node for rank, node in top]
def top_five_page_rank():
    """Top 5 nodes by page rank

    Returns:
     list [str]: top-five blogs by page-rank
    """
    return top_five(PAGE_RANK)
print("Top Five Blogs by PageRank")

for blog in top_five_page_rank():
    print("  - {}".format(blog))

HITS Score for Real Clear Politics

This uses the HITS algorithm to find the authority and hub scores for realclearpolitics.com. The algorithm tries to identify hubs, pages whose collections of links direct users to important pages, and authoritative pages, pages that are deemed important because of their relevant content (as indicated by the fact that hubs link to them).

HUBS, AUTHORITIES = networkx.hits(blogs)
def real_clear_politics_hits():
    """HITS score for realclearpolitics.com

    Returns:
     tuple (float, float): hub score, authority score
    """
    return HUBS['realclearpolitics.com'], AUTHORITIES['realclearpolitics.com']
hub, authority = real_clear_politics_hits()
print("Real Clear Politics")
print("Hub: {0:.5f}\nAuthority: {0:.5f}".format(hub, authority))

Top 5 Blogs by Hub Score

This will find the top five blogs based on their hub scores (meaning they are the ones who link to the most authoritative sites).

def top_five_hubs():
    """Top five blogs by hub scores

    Returns:
     list (str): Names of top-five hub blogs
    """
    return top_five(HUBS)
top_five_hub_blogs = top_five_hubs()
print('Top Five Hub Blogs')
for blog in top_five_hub_blogs:
    print(" - {}".format(blog))

Top Five Blogs By Authority

This will find the top five political blogs based on how many of the hub-blogs link to them.

def top_five_authorities():
    """the top 5 blogs by authorities score

    Returns:
     list (str): names of the most authoritative blogs
    """
    return top_five(AUTHORITIES)
print("Top Five Authoratative Blogs")
authoratative_blogs = top_five_authorities()
for blog in authoratative_blogs:
    print(" - {}".format(blog))

Company E-Mail

This will go through the process of importing and analyzing an internal email communication network between employees of a mid-sized manufacturing company. Each node represents an employee and each directed edge between two nodes represents an individual email. The left node represents the sender and the right node represents the recipient.

+------+      +--------+
|Sender| -->  |Receiver|
+------+      +--------+
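As a small sketch of what that means for the graph structure (the node names here are made up, not from the data set), repeated emails between the same pair of people become parallel edges in a MultiDiGraph:

# a minimal sketch with made-up node names: two emails from "1" to "2"
# become two parallel directed edges
import networkx
toy = networkx.MultiDiGraph()
toy.add_edge("1", "2", time=10)
toy.add_edge("1", "2", time=20)
print(toy.number_of_edges("1", "2"))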

Tangle

This is for troubleshooting.

<<imports>>

<<data>>

<<answer-one>>

Imports

import networkx
import pandas
import seaborn
% matplotlib inline
seaborn.set_style("whitegrid")

This line must be commented out when submitting to the autograder

email_network = pandas.read_table('email_network.txt', dtype={"#Sender": str, "Recipient": str})
print(email_network.head())

1 - Load the Directed Multigraph

Using networkx, load up the directed multigraph from `email_network.txt`. Make sure the node names are strings.

This function should return a directed multigraph networkx graph.

def answer_one():
    """Loads the email-network graph

    Returns:
     networkx.MultiDiGraph: the graph of the email network
    """
    # there's a bug in networkx loading MultiDiGraphs from pandas data-frames
    # so this is a work-around
    graph = networkx.MultiDiGraph()
    tuples = [(sender, recipient, {"time": time})
              for (sender, recipient, time) in email_network.values]
    graph.add_edges_from(tuples)
    return graph
one = answer_one()
networkx.draw_networkx(one)

Number of employees and emails

How many employees and emails are represented in the graph from Question 1?

This function should return a tuple (#employees, #emails).

def answer_two():
    """Counts the number of employees and emails

    Returns:
     tuple: count of employees, count of emails
    """
    one = answer_one()
    return (one.order(), one.size())
print(answer_two())

Information Routes

Part 1. Assume that information in this company can only be exchanged through email.

When an employee sends an email to another employee, a communication channel has been created, allowing the sender to provide information to the receiver, but not vice versa.

Based on the emails sent in the data, is it possible for information to go from every employee to every other employee?

Part 2. Now assume that a communication channel established by an email allows information to be exchanged both ways.

Based on the emails sent in the data, is it possible for information to go from every employee to every other employee?

This function should return a tuple of bools (part1, part2).

def answer_three():
    """decides connectivity based on emails

    First: information only flows in the direction of the email -
    based on the data, can information reach every employee from every other employee?

    Second: assume any email opens a two-way channel -
    can information reach every employee from every other employee?

    Returns:
     tuple: (graph is strongly connected, undirected graph is connected)
    """
    emails = answer_one()
    # part 1: information follows the direction of the edges, so every
    # employee must be reachable from every other along directed paths
    strongly_connected = networkx.is_strongly_connected(emails)
    # part 2: any email opens a two-way channel, so direction no longer
    # matters and the undirected version only needs to be connected
    weakly_connected = networkx.is_connected(emails.to_undirected())
    return strongly_connected, weakly_connected
print(answer_three())

Largest Weakly Connected Component

How many nodes are in the largest (in terms of nodes) weakly connected component?

This function should return an int.

def answer_four():
    """Count of nodes in the largest weakly connected component"""
    one = answer_one()
    return len(max(networkx.weakly_connected_component_subgraphs(one), key=len).nodes())

According to Wikipedia, a directed graph is weakly connected if replacing every directed edge with an undirected one creates a connected graph, so if the undirected graph in the next section is a connected graph, then the entire email graph is weakly connected.

print(answer_four())
undirected = one.to_undirected()
print(networkx.is_connected(undirected))
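networkx also has predicates for this directly, so (as a sketch) the same checks can be made without building the undirected copy by hand; these should agree with the output above.

# a minimal sketch: these should match the checks above
print(networkx.is_weakly_connected(one))
print(networkx.number_weakly_connected_components(one))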

def answer_five():
    return # Your Answer Here

def answer_six():
    return # Your Answer Here

def answer_seven():
    return # Your Answer Here

def answer_eight():
    return # Your Answer Here

def answer_nine():
    return # Your Answer Here

def answer_ten():
    return # Your Answer Here

def answer_eleven():
    return # Your Answer Here

def answer_twelve():
    return # Your Answer Here

def answer_thirteen():
    return # Your Answer Here

def answer_fourteen():
    return # Your Answer Here

Some Networkx Examples

The Departure

This is a look at different ways of creating and manipulating graphs using NetworkX.

Imports

Python

from functools import partial
from pathlib import Path
import os

PyPi

from bokeh.models import HoverTool
import holoviews
import hvplot.pandas
import networkx
import numpy
import pandas

My Stuff

from graeae.visualization import EmbedHoloview

Set Up

SLUG = "some-networkx-examples"
OUTPUT_PATH = Path("../../files/posts/networks/" + SLUG)
Embed = partial(EmbedHoloview, folder_path=OUTPUT_PATH)

The Initiation

Different Ways to Create Network Graphs

NetworkX has a few different ways to create graphs, some more flexible than others. This is a non-exhaustive showing of some of them.

Adding Edges To An Existing Graph

You can create a graph using the graph constructors. This is an example of an undirected Graph.

G1 = networkx.Graph()
G1.add_edges_from([(0, 1),
                   (0, 2),
                   (0, 3),
                   (0, 5),
                   (1, 3),
                   (1, 6),
                   (3, 4),
                   (4, 5),
                   (4, 7),
                   (5, 8),
                   (8, 9)])

networkx.draw_networkx(G1)


Adjacency List

with open("G_adjlist.txt") as reader:
    print(reader.read())
G2 = networkx.read_adjlist('G_adjlist.txt', nodetype=int)
G2.edges()

Adjacency Matrix

G_mat = numpy.array([[0, 1, 1, 1, 0, 1, 0, 0, 0, 0],
                     [1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
                     [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                     [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
                     [0, 0, 0, 1, 0, 1, 0, 1, 0, 0],
                     [1, 0, 0, 0, 1, 0, 0, 0, 1, 0],
                     [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
                     [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
                     [0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
                     [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]])
G_mat
G3 = networkx.Graph(G_mat)
G3.edges()

Edgelist

with open('G_edgelist.txt') as reader:
    print(reader.read())
G4 = networkx.read_edgelist('G_edgelist.txt', data=[('Weight', int)])

G4.edges(data=True)

Pandas DataFrame

G_df = pandas.read_csv('G_edgelist.txt', delim_whitespace=True, 
                       header=None, names=['n1', 'n2', 'weight'])
G_df
# note: networkx 2.x renamed this function to from_pandas_edgelist
G5 = networkx.from_pandas_dataframe(G_df, 'n1', 'n2', edge_attr='weight')
G5.edges(data=True)

A Chess Example

with open('chess_graph.txt') as reader:
    count = 0
    for line in reader:
        print(line.strip())
        count += 1
        if count == 5:
            break
chess = networkx.read_edgelist('chess_graph.txt', data=[('outcome', int), ('timestamp', float)], 
                         create_using=networkx.MultiDiGraph())
chess.is_directed(), chess.is_multigraph()
chess.edges(data=True)[:5]
games_played = chess.degree()
games_played
max_value = max(games_played.values())
max_key, = [i for i in games_played.keys() if games_played[i] == max_value]

print('player {}\n{} games'.format(max_key, max_value))
df = pandas.DataFrame(chess.edges(data=True), columns=['white', 'black', 'outcome'])
df.head()
df['outcome'] = df['outcome'].map(lambda x: x['outcome'])
df.head()
won_as_white = df[df['outcome']==1].groupby('white').sum()
won_as_black = df[df['outcome']==-1].groupby('black').sum()
win_count = won_as_white.add(won_as_black, fill_value=0)
print(win_count.head())
win_count.nlargest(5, 'outcome')

The Return

Company Movie Night

The Departure

This is a look at working with networks using networkx. Our scene - eight employees are trying to choose three movies to watch for company movie nights. We have two sources of data - the candidate movies and the relationship between pairs of employees. The relationships are on a scale from -100 to 100, with -100 being the strongest of enemies and 100 meaning they are best of friends. Zero either means they have no relationship (don't interact) or are indifferent about the other person.

Imports

From Python

from argparse import Namespace
from functools import partial
from pathlib import Path
import copy
import os

From PyPi

from dotenv import load_dotenv
from bokeh.models import (
    BoxSelectTool,
    Circle,
    HoverTool, 
    MultiLine,
    PanTool,
    Range1d,
    TapTool,
    WheelZoomTool,
)
from bokeh.models import Plot as BokehPlot
from bokeh.models.graphs import (from_networkx, 
                                 EdgesAndLinkedNodes, 
                                 NodesAndLinkedEdges)
from bokeh.palettes import RdBu, Spectral4
from bokeh.transform import linear_cmap
import holoviews
from holoviews import dim
import hvplot.pandas
import networkx
import pandas
import numpy
from networkx.algorithms import bipartite

My Stuff

from graeae.visualization import EmbedBokeh, EmbedHoloview

Set Up

Load Dotenv

load_dotenv(".env", override=True)

The Plotting

SLUG = 'company-movie-night/'
OUTPUT = Path("../../files/posts/networks/" + SLUG)
Embed = partial(EmbedHoloview, folder_path=OUTPUT)
EmbedB = partial(EmbedBokeh, folder_path=OUTPUT)
holoviews.extension("bokeh")
Plot = Namespace(
    node_size=15,
    edge_alpha=0.8,
    edge_color="#CCCCCC",
    edge_width=5,
    width=800,
    height=800,
    fontsize=18,
    padding=dict(x=(-1, 1), y=(-1, 1))
)

The Initiation

The Data

This Is the Set Of Employee-Relationships

employee_relationships_path = Path(os.environ.get("EMPLOYEE_RELATIONSHIPS"))
relationships_data = pandas.read_csv(
    employee_relationships_path, 
    delimiter="\t", 
    header=None,
    names="employee_1 employee_2 relationship".split())
table = holoviews.Table(relationships_data).opts(height=Plot.height)
Embed(plot=table, file_name="relationships_data")()

Figure Missing

employees = set(relationships_data.employee_1.unique()) | set(relationships_data.employee_2.unique())
print(employees)
{'Pablo', 'Georgia', 'Andy', 'Claude', 'Vincent', 'Joan', 'Lee', 'Frida'}
print(len(employees))
8
print(len(relationships_data))
28

We have eight employees and twenty-eight links. Is this a fully connected graph? The handshake problem says that the number of links in a fully-connected network is:

\[ links = \frac{n(n-1)}{2} \]

print(len(employees) * (len(employees) - 1)/2)
28.0

It looks like our relationships data creates a fully-connected network (unless there is a duplicate pair, which would be an error).
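As a quick sanity check on that caveat (a sketch using only the data already loaded), we can confirm there are no duplicate unordered pairs in the relationship table.

# a minimal sanity check: every unordered employee pair should appear once
pairs = relationships_data.apply(
    lambda row: frozenset((row.employee_1, row.employee_2)), axis=1)
assert pairs.nunique() == len(relationships_data)
print(pairs.nunique())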

These Are the Movie Choices

employee_movies_path = Path(os.environ.get("EMPLOYEE_MOVIE_CHOICES"))
movies_data = pandas.read_csv(
    employee_movies_path, 
    delimiter="\t", 
    header=None,
    skiprows=1,
    names="employee movie".split())
print(f"Employee To Movie Edges: {len(movies_data)}")
print(movies_data.head())
Employee To Movie Edges: 24
  employee                            movie
0     Andy                         Anaconda
1     Andy                       Mean Girls
2     Andy                       The Matrix
3   Claude                         Anaconda
4   Claude  Monty Python and the Holy Grail
movies = set(movies_data.movie.unique())
for movie in sorted(movies):
    print(f" - {movie}")
  • Anaconda
  • Forrest Gump
  • Kung Fu Panda
  • Mean Girls
  • Monty Python and the Holy Grail
  • Snakes on a Plane
  • The Dark Knight
  • The Godfather
  • The Matrix
  • The Shawshank Redemption
  • The Social Network

The eight employees chose 11 movies between them.
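A quick look (as a sketch) at how many movies each employee picked, using the movie data already loaded.

# a minimal sketch: number of movies chosen by each employee
print(movies_data.groupby("employee").movie.count())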

Converting the DataFrames to Graphs

The Relationship Graph
relationship_graph = networkx.from_pandas_edgelist(relationships_data, 
                                                   "employee_1", "employee_2", 
                                                   edge_attr="relationship")
for index, row in relationships_data.sample(5).iterrows():
    assert relationship_graph[row["employee_1"]][row["employee_2"]]["relationship"] == row["relationship"]
expected_edges = len(relationships_data)
expected_nodes = len(employees)
print("Expected Edges: {}".format(expected_edges))
print("Expected Nodes: {}".format(expected_nodes))
assert expected_nodes == len(relationship_graph.nodes)
assert expected_edges == len(relationship_graph.edges)
Expected Edges: 28
Expected Nodes: 8
The Movie Graph
movie_graph = networkx.from_pandas_edgelist(movies_data, "employee", "movie")
for index, row in movies_data.iterrows():
    movie_graph[row.employee][row.movie]["employee"] = row.employee
    movie_graph[row.employee][row.movie]["movie"] = row.movie

for employee in movies_data.employee:    
    movie_graph.nodes[employee]["type"] = "employee"
for movie in movies_data.movie:
    movie_graph.nodes[movie]["type"] = "movie"
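Since the movie graph is meant to be bipartite - employees on one side, movies on the other - a quick check (a sketch using networkx's bipartite test) confirms that before it gets projected later on.

# a minimal sketch: the movie graph should be bipartite, with the
# 'type' attribute separating employees from movies
print(networkx.is_bipartite(movie_graph))
employee_side = {node for node, data in movie_graph.nodes(data=True)
                 if data["type"] == "employee"}
print(len(employee_side), len(set(movie_graph) - employee_side))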

Plotting

def graph_plot(graph: networkx.Graph, 
               title:str,
               file_name: str,
               hover: HoverTool,
               layout=networkx.circular_layout) -> None:
    """Plot the graph in bokeh

    Args:
     graph: the graph to plot
     layout: function to layout the plot
     title: title for the plot
     hover: defined hover tool
     file_name: name to save the bokeh file (without extension)
    """
    plot = BokehPlot(plot_width=Plot.width,
                     plot_height=Plot.height,
                     x_range=Range1d(-1, 1),
                     y_range=Range1d(-1, 1)
    )

    plot.title.text = title
    plot.title.text_font_size = f"{Plot.fontsize}pt"
    plot.add_tools(hover, TapTool(), BoxSelectTool(), PanTool(), WheelZoomTool())

    # use the graph passed in, not the module-level relationship_graph
    renderer = from_networkx(graph, layout,
                             scale=1, center=(0,0))
    renderer.node_renderer.glyph = Circle(size=Plot.node_size, 
                                          fill_color=Spectral4[0])
    renderer.node_renderer.selection_glyph = Circle(size=Plot.node_size, 
                                                    fill_color=Spectral4[2])
    renderer.node_renderer.hover_glyph = Circle(size=Plot.node_size, 
                                                fill_color=Spectral4[1])

    color_map = linear_cmap(field_name="relationship", 
                            palette=RdBu[11], 
                            low=100, high=-100)
    renderer.edge_renderer.glyph = MultiLine(line_color=color_map, 
                                             line_alpha=0.5,
                                             line_width=3)
    renderer.edge_renderer.selection_glyph = MultiLine(line_color=color_map, 
                                                       line_width=Plot.edge_width)
    renderer.edge_renderer.hover_glyph = MultiLine(line_color=color_map, 
                                                   line_width=Plot.edge_width)
    renderer.selection_policy = NodesAndLinkedEdges()
    renderer.inspection_policy = EdgesAndLinkedNodes()
    plot.renderers.append(renderer)
    EmbedB(plot=plot, file_name=file_name)()
    return
class RelationshipGraphPlot:
    """Plots a graph and keeps the parts so you can inspect them

    Args:
     graph: the graph to plot
     layout: function to layout the plot
     title: title for the plot
     hover: defined hover tool
     settings: namespace with the plot settings
     file_name: name to save the bokeh file (without extension)
    """
    def __init__(self, graph: networkx.Graph, 
                 title:str,
                 file_name: str,
                 hover: HoverTool,
                 settings: Namespace=Plot,
                 layout=networkx.circular_layout) -> None:
        self.graph = graph
        self.title = title
        self.file_name = file_name
        self.hover = hover
        self.settings = settings
        self.layout = layout
        self._tap_tool = None
        self._box_select_tool = None
        self._pan_tool = None
        self._wheel_zoom_tool = None
        self._plot = None
        self._renderer = None
        self._color_map = None
        return

    @property
    def tap_tool(self) -> TapTool:
        if self._tap_tool is None:
            self._tap_tool = TapTool()
        return self._tap_tool

    @property
    def box_select_tool(self) -> BoxSelectTool:
        if self._box_select_tool is None:
            self._box_select_tool = BoxSelectTool()
        return self._box_select_tool

    @property
    def pan_tool(self) -> PanTool:
        if self._pan_tool is None:
            self._pan_tool = PanTool()
        return self._pan_tool

    @property
    def wheel_zoom_tool(self) -> WheelZoomTool:
        if self._wheel_zoom_tool is None:
            self._wheel_zoom_tool = WheelZoomTool()
        return self._wheel_zoom_tool

    @property
    def plot(self) -> BokehPlot:
        if self._plot is None:
            self._plot = BokehPlot(plot_width=self.settings.width,
                                   plot_height=self.settings.height,
                                   x_range=Range1d(-1, 1),
                                   y_range=Range1d(-1, 1)
            )

            self._plot.title.text = self.title
            self._plot.title.text_font_size = f"{Plot.fontsize}pt"
            self._plot.add_tools(self.hover, 
                                 self.tap_tool, 
                                 self.box_select_tool, 
                                 self.pan_tool,
                                 self.wheel_zoom_tool)
            self._plot.renderers.append(self.renderer)
        return self._plot

    @property
    def color_map(self):
        if self._color_map is None:
            self._color_map = linear_cmap(field_name="relationship", 
                                          palette=RdBu[11], 
                                          low=100, high=-100)
        return self._color_map

    @property
    def renderer(self):
        if self._renderer is None:
            self._renderer = from_networkx(self.graph, self.layout, 
                                          scale=1, center=(0,0))
            self._renderer.node_renderer.glyph = Circle(
                size=Plot.node_size, 
                fill_color=Spectral4[0])
            self._renderer.node_renderer.selection_glyph = Circle(
                size=Plot.node_size, 
                fill_color=Spectral4[2])
            self._renderer.node_renderer.hover_glyph = Circle(
                size=Plot.node_size, 
                fill_color=Spectral4[1])

            self._renderer.edge_renderer.glyph = MultiLine(
                line_color=self.color_map, 
                line_alpha=0.5,
                line_width=3)
            self._renderer.edge_renderer.selection_glyph = MultiLine(
                line_color=self.color_map, 
                line_width=Plot.edge_width)
            self._renderer.edge_renderer.hover_glyph = MultiLine(
                line_color=self.color_map,
                line_width=Plot.edge_width)
            self._renderer.selection_policy = NodesAndLinkedEdges()
            self._renderer.inspection_policy = EdgesAndLinkedNodes()
        return self._renderer

    def __call__(self) -> None:
        EmbedB(plot=self.plot, file_name=self.file_name)()
        return

The Employee Relationship Plot

HoloViews

The employee relationship graph consists of employees as nodes and their relationship level as weights on the edges.

plot = holoviews.Graph.from_networkx(relationship_graph,
                                     networkx.circular_layout).opts(
                                         cmap="Set1",
                                         fontsize=Plot.fontsize,
                                         width=800,
                                         height=800,
                                         title="Company Relationship Graph",
                                         xaxis=None, yaxis=None).options(
                                             edge_color_index="relationship", 
                                             edge_cmap="Spectral").redim.range(**Plot.padding)
renderer = holoviews.render(plot)
renderer.renderers[-1].selection_policy = EdgesAndLinkedNodes()
EmbedB(plot=renderer, file_name="employee_relationships")()
Bokeh

This is the same plot using bokeh directly instead of holoviews. I wanted both the nodes and edges to trigger the HoverTool but HoloViews doesn't support this. There might be a way to add it later, but their documentation is so opaque that I decided it wasn't worth it to keep trying to figure it out, since bokeh isn't that hard to use (although their documentation isn't the best either).

Since bokeh is so verbose I'm going to step through this instead of putting it into one block.

Hover Tool

When the EdgesAndLinkedNodes class is used, only the edge data is available to the HoverTool (or at least I couldn't figure out how to make the node attributes work), so anything the tooltip needs has to be stored on the edges. You can see what's available once you've created the graph renderer (the output of from_networkx) by looking at one of its data attributes:

renderer.edge_renderer.data_source.data.keys()

Which in this case returned this.

dict_keys(['relationship', 'start', 'end'])

relationship was a data-attribute I added through networkx; something else (presumably bokeh) created start and end.

hover = HoverTool(
    tooltips=[
        ("Employee", "@start"),
        ("Employee", "@end"),
        ("Relationship", "@relationship"),
    ]
)
The Plot

This is the bokeh plot (I don't know how it differs from a figure). It's normally called Plot but I already used that name for my settings object so I called it BokehPlot.

plot = BokehPlot(plot_width=Plot.width,
                 plot_height=Plot.height,
                 x_range=Range1d(-1, 1),
                 y_range=Range1d(-1, 1)
)

plot.title.text = "Company Relationships"
plot.title.text_font_size = f"{Plot.fontsize}pt"
plot.add_tools(hover, TapTool(), BoxSelectTool(), PanTool(), WheelZoomTool())
The Graph Renderer

This part converts the networkx Graph into a bokeh Graph. This is what I was referring to earlier when I talked about inspecting the renderer to look at the available edge attributes.

renderer = from_networkx(relationship_graph, networkx.circular_layout, 
                         scale=1, center=(0,0))
Nodes

You have to set up the shapes for the nodes - I think there might be defaults, but the few examples I found set them up explicitly, and it's probably not a bad idea in any case. The Spectral4 object is a list of four hex-colors. Here are the ones I used.

| Object          | Index | Color        |
| glyph           | 0     | Medium Blue  |
| selection_glyph | 2     | Orange       |
| hover_glyph     | 1     | Pastel Green |

renderer.node_renderer.glyph = Circle(size=Plot.node_size, fill_color=Spectral4[0])
renderer.node_renderer.selection_glyph = Circle(size=Plot.node_size, fill_color=Spectral4[2])
renderer.node_renderer.hover_glyph = Circle(size=Plot.node_size, fill_color=Spectral4[1])
Color Map

The linear_cmap maps a range of values to a palette of colors. In this case I'm mapping the relationship values to the red-blue palette (RdBu). Two things to note:

  • I chose a red-blue palette with 11 values because the odd number puts white at the center (it goes from blue to white to red).
  • Although the name suggests a palette from red to blue, it actually goes from blue to red, so I had to make -100 the 'high' value so that red would indicate a bad relationship.
color_map = linear_cmap(field_name="relationship", palette=RdBu[11], low=100, high=-100)
The Edges

Like the nodes you define the edges for the plot. This is where we get to use the color-map to make the edges match the relationship between the employees.

renderer.edge_renderer.glyph = MultiLine(line_color=color_map, 
                                         line_alpha=0.5,
                                         line_width=3)
renderer.edge_renderer.selection_glyph = MultiLine(line_color=color_map, 
                                                   line_width=Plot.edge_width)
renderer.edge_renderer.hover_glyph = MultiLine(line_color=color_map, 
                                               line_width=Plot.edge_width)
The Selection and Inspection Policies

This was the reason for doing it in bokeh in the first place. Adding these two lines makes both the edge and its attached nodes highlight when selected or hovered over.

renderer.selection_policy = NodesAndLinkedEdges()
renderer.inspection_policy = EdgesAndLinkedNodes()
Put It All Together

Now we just add the graph-renderer to the plot and have bokeh convert it to JavaScript and HTML.

plot.renderers.append(renderer)
EmbedB(plot=plot, file_name="company_relationships_bokeh")()

It looks like Andy might have some kind of personality problem (maybe he's the boss), while Georgia and Claude are unusually close.

Spring loaded

Bokeh raises an error if you try to re-use the hover-tool, for some reason, so I had to create a new one.

hover = HoverTool(
    tooltips=[
        ("Employee", "@start"),
        ("Employee", "@end"),
        ("Relationship", "@relationship"),
    ]
)

graph_plot(relationship_graph, 
           "Company Relationships", 
           "company_relationships_spring", 
           hover,
           networkx.spring_layout)

This didn't produce as interesting a result as I thought it would.

The Movie Plot

movie_hover = HoverTool(
    tooltips = [
        ("Employee", "@employee"),
        ("Movie", "@movie"),
    ]

)
plot = holoviews.Graph.from_networkx(movie_graph,
                                     networkx.circular_layout).opts(
                                         node_color=dim("type"),
                                         cmap="Set1",
                                         fontsize=Plot.fontsize,
                                         width=Plot.width,
                                         height=Plot.height,
                                         tools=[movie_hover, TapTool()],
                                         title="Company Movies Graph",
                                         xaxis=None, yaxis=None).options(
                                             inspection_policy="edges",
                                             edge_cmap="Spectral").redim.range(**Plot.padding)
Embed(plot=plot, file_name="company_movies_circle")()

Figure Missing

The blue nodes are employees and the red nodes are movies.

Question 2

Using the graph from the previous question, add node attributes named `'type'` where movies have the value `'movie'` and employees have the value `'employee'`, and return that graph.

This function should return a networkx graph with node attributes `{'type': 'movie'}` or `{'type': 'employee'}`

def answer_two():
    """Adds 'type' to nodes from movie-graph

    Returns:
     Graph: answer_one with 'type' attribute added (employee or movie)
    """
    graph = answer_one()
    new_graph = networkx.Graph()
    nodes = graph.nodes()
    employee_nodes = [node for node in nodes if node in employees]
    movie_nodes = [node for node in nodes if node in movies]
    new_graph.add_nodes_from(employee_nodes, bipartite=0, type='employee')
    new_graph.add_nodes_from(movie_nodes, bipartite=1, type="movie")
    new_graph.add_edges_from(graph.edges())
    return new_graph
two = answer_two()
two.nodes(data=True)
plot_graph(two)

Question 3

Find a weighted projection of the graph from `answer_two` which tells us how many movies different pairs of employees have in common.

This function should return a weighted projected graph.

def answer_three():
    """Projects the bipartite graph onto the employee nodes

    Returns:
     Graph: weighted projection where edge weights count shared movies
    """
    graph = answer_two()
    assert networkx.is_bipartite(graph)
    return bipartite.weighted_projected_graph(graph, employees)
three = answer_three()
plot_graph(three)

Question 4

Suppose you'd like to find out if people that have a high relationship score also like the same types of movies.

Find the Pearson correlation ( using `DataFrame.corr()` ) between employee relationship scores and the number of movies they have in common. If two employees have no movies in common it should be treated as a 0, not a missing value, and should be included in the correlation calculation.

This function should return a float.

def answer_four():
    """calculates the pearson correlation for data

    Returns:
     float: Pearson correlation for weight and relationship_score
    """
    three = answer_three()
    relationships = pandas.read_table(
        "Employee_Relationships.txt",
        names="employee_left employee_right relationship_score".split())
    relationships["employees"] = relationships.apply(
        lambda row: tuple(sorted((row["employee_left"],
                                  row['employee_right']))), axis=1)

    weights = pandas.DataFrame(
        three.edges(data=True),
        columns="employee_left employee_right weight".split())
    weights["weight"] = weights.weight.map(lambda row: row["weight"])
    weights["employees"] = weights.apply(lambda row: tuple(sorted(
        (row["employee_left"],
         row["employee_right"]))),
                                         axis=1)

    joined = pandas.merge(relationships, weights, how="outer", 
                          on=['employees'])
    assert len(joined) == len(relationships)
    joined['weight'] = joined["weight"].fillna(0)

    data = joined[["relationship_score", "weight"]]
    correlation = data.corr()
    return correlation.relationship_score.weight
print(answer_four())

The Return

Getting Started With Prophet

The Departure

This is a first look at Prophet, a time-series forecasting library. It depends on pystan, a Bayesian modelling platform, which in turn depends on numpy and cython.

Imports

Python

from pathlib import Path
import os

PyPi

from dotenv import load_dotenv
import modin.pandas as pandas

My Stuff

from graeae.timers import Timer

Setup

The Paths

load_dotenv(".env", override=True)
DATA = Path(os.environ.get("PEYTON_MANNING_WIKIPEDIA_VIEWS")).expanduser()
assert DATA.is_file()

The Timer

TIMER = Timer()

The Initiation

Load The Data

with TIMER:
    data = pandas.read_csv(DATA)
print(data.shape)
WARNING: Falling back to serializing objects of type <class 'pathlib.PosixPath'> by using pickle. This may be inefficient.
Started: 2019-04-05 15:23:56.038686
Ended: 2019-04-05 15:23:56.236422
Elapsed: 0:00:00.197736
(2905, 2)

High School Contact Networks

The Departure

This looks at data provided by SocioPatterns on the interactions between students at a high school in Marseilles, France, in December of 2013.

Imports

From Python

from collections import Counter
from functools import partial
from pathlib import Path
import os

From PyPi

from bokeh.models import HoverTool
from dotenv import load_dotenv
from holoviews import dim, opts
from holoviews.operation.datashader import datashade, bundle_graph
import holoviews
import hvplot.pandas
import networkx
import pandas

My Stuff

from graeae.timers import Timer
from graeae.visualization import EmbedHoloview

Load the Dotenv

load_dotenv(".env")

Build the Timer

TIMER = Timer()

Setup The Plotting

holoviews.extension("bokeh")
SLUG = "high-school-contact-networks/"
output = Path("../../files/posts/networks/" + SLUG)
Embed = partial(EmbedHoloview, folder_path=output)
class Plot:
    """Constants for plotting"""
    width = 1000
    height = 800
    fontsize = 18

Load The Data

Let's take a look at the data before loading it into pandas.

HIGH_SCHOOL = Path(os.environ.get("HIGH_SCHOOL")).expanduser()
assert HIGH_SCHOOL.is_dir()

class Files:
    metadata = "metadata_2013.txt"
    contact_diaries = "Contact-diaries-network_data_2013.csv"
    facebook = "Facebook-known-pairs_data_2013.csv"
    friendship = "Friendship-network_data_2013.csv"
    high_school = "High-School_data_2013.csv"

MetaData

metadata_path = HIGH_SCHOOL.joinpath(Files.metadata)
assert metadata_path.is_file()
with metadata_path.open() as reader:
    for line in range(5):
        print(reader.readline(), end="")
650     2BIO1   F
498     2BIO1   F
627     2BIO1   F
857     2BIO1   F
487     2BIO1   F

This first file has the meta-data for the students. The three columns are the student's ID, class, and gender.

meta_data = pandas.read_csv(metadata_path, sep="\t", 
                            names=["id", "class", "gender"])
meta_data.loc[:, "class"] = meta_data["class"].astype("category")
meta_data.loc[:, "gender"] = meta_data.gender.astype("category")
Classes

First a bar-plot to look at how the classes are distributed.

grouped = meta_data.groupby(["class", "gender"]).agg(
    {"class": "count", "gender": "count"})
grouped.columns = ["class_count", "gender_count"]
grouped = grouped.reset_index()
plot = grouped.hvplot.bar(title="Class Counts by Gender", 
                          x="class", y="class_count", 
                          stacked=True,
                          by="gender", height=Plot.height, 
                          width=Plot.width,
                          ylabel="Count",
                          xlabel="Class",
                          tools=["hover"],
                          fontsize=Plot.fontsize).opts(xrotation=90)
Embed(plot=plot, file_name="gender_counts_stacked", height_in_pixels=Plot.height)()

Figure Missing


This is a look at the same thing except not stacked.

plot = grouped.hvplot.bar(title="Class Counts by Gender", x="class", 
                          y="class_count",
                          xlabel="Class",
                          ylabel="Count",
                          by="gender", height=Plot.height, width=Plot.width, 
                          tools=["hover"],
                          fontsize=Plot.fontsize).opts(xrotation=90)
Embed(plot=plot, file_name="gender_counts", height_in_pixels=Plot.height)()

Figure Missing


Strangely, the classes that start with 2BIO are more female while the others are more male.

Gender

A stacked bar plot to get a sense of not just the distribution among genders but among classes.

plot = grouped.hvplot.bar(title="Gender Counts", x="gender", y="gender_count",
                          stacked=True,
                          by="class", 
                          xlabel="Count",
                          ylabel="Gender",
                          fontsize=Plot.fontsize,
                          width=Plot.width,
                          height=Plot.height).opts(
                              xrotation=90, 
                              xlabel="Gender and Class")
Embed(plot=plot, file_name="class_counts_stacked", height_in_pixels=Plot.height)()

Figure Missing


A non-stacked bar plot to get a better sense of how the genders fill the different classes.

plot = grouped.hvplot.bar(title="Gender Counts", x="gender", y="gender_count",
                          xlabel="Gender",
                          ylabel="Count",
                          by="class", 
                          height=Plot.height,
                          width=Plot.width,
                          fontsize=Plot.fontsize).opts(
                              xrotation=90, xlabel="Gender and Class")
Embed(plot=plot, file_name="class_counts", height_in_pixels=Plot.height)()

Figure Missing


It looks like there were a little more males than females, but not a whole lot more.

The Descent

The Contact Network

This is a dataset that shows whether a student logged contact with another student.

contact_path = HIGH_SCHOOL.joinpath(Files.contact_diaries)
assert contact_path.is_file()
with contact_path.open() as reader:
    for line in range(5):
        print(reader.readline(), end="")
3 28 2
3 106 1
3 147 4
3 177 1
3 295 4

The columns are the person who was making the report, the person who was identified as a contact, and the time spent together, encoded as one of four values.

| Code | Lower Limit (minutes) | Upper Limit (minutes) |
| 1    | 0                     | 5                     |
| 2    | 5                     | 15                    |
| 3    | 15                    | 60                    |
| 4    | 60                    | infinity              |

contact_data = pandas.read_csv(contact_path, delimiter=" ", 
                                  names=["reporter", "contact", "time"])
contact_data = contact_data.dropna()
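Since the time column is just a code, a small mapping makes it more readable. This is a sketch - the label text is my own wording for the ranges in the table above, not part of the dataset.

# a minimal sketch: map the duration codes to readable labels
# (the labels are my own wording, not from the dataset)
duration_labels = {1: "under 5 minutes", 2: "5 to 15 minutes",
                   3: "15 to 60 minutes", 4: "over an hour"}
contact_data["duration"] = contact_data.time.map(duration_labels)
print(contact_data.duration.value_counts())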

End

Citations

  • R. Mastrandrea, J. Fournet, A. Barrat, "Contact patterns in a high school: a comparison between data collected using wearable sensors, contact diaries and friendship surveys." PLoS ONE 10(9): e0136497 (2015)

High School Facebook Networks

The Departure

This looks at data provided by SocioPatterns on the interactions between students at a high school in Marseilles, France, in December of 2013.

Imports

From Python

from collections import Counter
from functools import partial
from pathlib import Path
import os

From PyPi

from bokeh.models import HoverTool
from dotenv import load_dotenv
from holoviews import dim, opts
from holoviews.operation.datashader import datashade, bundle_graph
import holoviews
import hvplot.pandas
import networkx
import pandas

My Stuff

from graeae.timers import Timer
from graeae.visualization import EmbedBokeh, EmbedHoloview

Load the Dotenv

load_dotenv(".env")

Build the Timer

TIMER = Timer()

Setup The Plotting

holoviews.extension("bokeh")
SLUG = "high-school-contact-and-friendship-networks/"
output = Path("../../files/posts/networks/" + SLUG)
Embed = partial(EmbedHoloview, folder_path=output)
EmbedB = partial(EmbedBokeh, folder_path=output)
class Plot:
    """Constants for plotting"""
    width = 1000
    height = 800
    fontsize = 18

Load The Data

Let's take a look at the data before loading it into pandas.

HIGH_SCHOOL = Path(os.environ.get("HIGH_SCHOOL")).expanduser()
assert HIGH_SCHOOL.is_dir()

class Files:
    metadata = "metadata_2013.txt"
    contact_diaries = "Contact-diaries-network_data_2013.csv"
    facebook = "Facebook-known-pairs_data_2013.csv"
    friendship = "Friendship-network_data_2013.csv"
    high_school = "High-School_data_2013.csv"

MetaData

metadata_path = HIGH_SCHOOL.joinpath(Files.metadata)
assert metadata_path.is_file()
with metadata_path.open() as reader:
    for line in range(5):
        print(reader.readline(), end="")
650     2BIO1   F
498     2BIO1   F
627     2BIO1   F
857     2BIO1   F
487     2BIO1   F

This first file has the meta-data for the students. The three columns are the student's ID, class, and gender.

meta_data = pandas.read_csv(metadata_path, sep="\t", 
                            names=["id", "class", "gender"])
meta_data.loc[:, "class"] = meta_data["class"].astype("category")
meta_data.loc[:, "gender"] = meta_data.gender.astype("category")
Classes

First a bar-plot to look at how the classes are distributed.

grouped = meta_data.groupby(["class", "gender"]).agg(
    {"class": "count", "gender": "count"})
grouped.columns = ["class_count", "gender_count"]
grouped = grouped.reset_index()
plot = grouped.hvplot.bar(title="Class Counts by Gender", 
                          x="class", y="class_count", 
                          stacked=True,
                          by="gender", height=Plot.height, 
                          width=Plot.width,
                          ylabel="Count",
                          xlabel="Class",
                          tools=["hover"],
                          fontsize=Plot.fontsize).opts(xrotation=90)
Embed(plot=plot, file_name="gender_counts_stacked", height_in_pixels=Plot.height)()

Figure Missing

This is a look at the same thing except not stacked.

plot = grouped.hvplot.bar(title="Class Counts by Gender", x="class", 
                          y="class_count",
                          xlabel="Class",
                          ylabel="Count",
                          by="gender", height=Plot.height, width=Plot.width, 
                          tools=["hover"],
                          fontsize=Plot.fontsize).opts(xrotation=90)
Embed(plot=plot, file_name="gender_counts", height_in_pixels=Plot.height)()

Figure Missing

Strangely, the classes that start with 2BIO are more female while the others are more male.

Gender

A stacked bar plot to get a sense of not just the distribution among genders but among classes.

plot = grouped.hvplot.bar(title="Gender Counts", x="gender", y="gender_count",
                          stacked=True,
                          by="class", 
                          xlabel="Count",
                          ylabel="Gender",
                          fontsize=Plot.fontsize,
                          width=Plot.width,
                          height=Plot.height).opts(
                              xrotation=90, 
                              xlabel="Gender and Class")
Embed(plot=plot, file_name="class_counts_stacked", height_in_pixels=Plot.height)()

Figure Missing

A non-stacked bar plot to get a better sense of how the genders fill the different classes.

plot = grouped.hvplot.bar(title="Gender Counts", x="gender", y="gender_count",
                          xlabel="Gender",
                          ylabel="Count",
                          by="class", 
                          height=Plot.height,
                          width=Plot.width,
                          fontsize=Plot.fontsize).opts(
                              xrotation=90, xlabel="Gender and Class")
Embed(plot=plot, file_name="class_counts", height_in_pixels=Plot.height)()

Figure Missing

It looks like there were a little more males than females, but not a whole lot more.

The Descent

The Facebook Network

This is a dataset that shows whether a student was facebook friends with another student.

facebook_path = HIGH_SCHOOL.joinpath(Files.facebook)
assert facebook_path.is_file()
with facebook_path.open() as reader:
    for line in range(5):
        print(reader.readline(), end="")
1 984 0
1 883 1
1 941 0
1 650 0
1 132 1

The columns are the first student, the second student, and whether they are Facebook friends.

The third column is 0 if they aren't Facebook friends and 1 if they are.

facebook_data = pandas.read_csv(facebook_path, delimiter=" ", 
                                names=["reporter", "other", "friend"])
facebook_data = facebook_data.dropna()
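A quick count (as a sketch) of how many of the reported pairs are and aren't Facebook friends.

# a minimal sketch: 1 means facebook friends, 0 means not
print(facebook_data.friend.value_counts())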

The Descent

Looking at the Facebook Network

with TIMER:
    facebook_graph = networkx.convert_matrix.from_pandas_edgelist(
        facebook_data, "reporter", "other", 
        create_using=networkx.DiGraph)
Started: 2019-03-27 23:05:04.495114
Ended: 2019-03-27 23:05:04.499622
Elapsed: 0:00:00.004508
genders = dict(zip(meta_data.id, meta_data.gender))
classes = dict(zip(meta_data.id, meta_data["class"]))
for node in facebook_graph.nodes:
    facebook_graph.nodes[node]["gender"] = genders[node]
    facebook_graph.nodes[node]["class"] = classes[node]
hover = HoverTool(
    tooltips = [
         ("Gender", "@gender"),
         ("Class", "@class"),
    ],
)

plot = holoviews.Graph.from_networkx(facebook_graph,
                                     networkx.circular_layout).opts(
                                         node_color=dim("gender"), cmap="Set1",
                                         tools=[hover],
                                         fontsize=Plot.fontsize,
                                         width=800,
                                         height=800,                                        
                                         title="Facebook Network by Gender",
                                         xaxis=None, yaxis=None, directed=True)
Embed(plot=plot, file_name="facebook_network_circular")()

Figure Missing

It's a little hard to see what's going on here, other than to note that you can see some people are more popular than others.

hover = HoverTool(
    tooltips = [
         ("Gender", "@gender"),
         ("Class", "@class"),
    ],
)

plot = holoviews.Graph.from_networkx(facebook_graph,
                                     networkx.circular_layout).opts(
                                         node_color=dim("class"), cmap="Set1",
                                         tools=[hover],
                                         fontsize=Plot.fontsize,
                                         width=800,
                                         height=800,                                        
                                         title="Facebook Network by Class",
                                         xaxis=None, yaxis=None, directed=True)
Embed(plot=plot, file_name="facebook_network_circular_class")()

Figure Missing

plot = holoviews.Graph.from_networkx(facebook_graph, networkx.spring_layout, ).opts(
                                         node_color=dim("class"), cmap="Set1",
                                         tools=["hover"],
                                         width=800,
                                         height=800,
                                         title="Facebook Network By Class",
                                         xaxis=None, yaxis=None, directed=True)
Embed(plot=plot, file_name="facebook_network_class_spring", height_in_pixels=810)()

Figure Missing

plot = holoviews.Graph.from_networkx(facebook_graph, networkx.spring_layout, ).opts(
                                         node_color=dim("gender"), cmap="Set1",
                                         tools=["hover"],
                                         width=800,
                                         height=800,
                                         title="Facebook Network By Gender",
                                         xaxis=None, yaxis=None, directed=True)
Embed(plot=plot, file_name="facebook_network_gender_spring", height_in_pixels=810)()

Figure Missing

End

Citations

  • R. Mastrandrea, J. Fournet, A. Barrat, "Contact patterns in a high school: a comparison between data collected using wearable sensors, contact diaries and friendship surveys." PLoS ONE 10(9): e0136497 (2015)