Exercise in Permutation Importance

Beginning

This is my re-do of the Machine Learning Explainability Permutation Importance exercise on Kaggle. It uses the data from Kaggle's New York City Taxi Fare Prediction dataset.

Imports

From Python

from argparse import Namespace
from functools import partial
from pathlib import Path

From PyPi

from eli5.sklearn import PermutationImportance

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from tabulate import tabulate

import eli5
import hvplot.pandas
import numpy
import pandas

Others

from graeae import EmbedHoloviews, EnvironmentLoader, Timer

Set Up

A Timer

TIMER = Timer()

Plotting

SLUG = "exercise-in-permutation-importance"
PATH = Path("../../files/posts/tutorials/")/SLUG
Plot = Namespace(
    width=1000,
    height=800,
    )
Embed = partial(EmbedHoloviews, folder_path=PATH)

The Environment

ENVIRONMENT = EnvironmentLoader()

Table Printer

TABLE = partial(tabulate, tablefmt="orgtbl", headers="keys", showindex=False)

Middle

The Dataset

Since this is about permutation importance we're just going to load a subset (there are over five million rows in the dataset) and use previously discovered values to get rid of outliers.

ROWS = 5 * 10**4
PATH = Path(ENVIRONMENT["NY-TAXI"]).expanduser()
assert PATH.is_dir()
data = pandas.read_csv(PATH/"train.csv", nrows=ROWS)
print(TABLE(data.iloc[0].reset_index().rename(columns={
    "index": "Column", 0: "Value"})))
| Column            | Value                       |
|-------------------+-----------------------------|
| key               | 2009-06-15 17:26:21.0000001 |
| fare_amount       | 4.5                         |
| pickup_datetime   | 2009-06-15 17:26:21 UTC     |
| pickup_longitude  | -73.844311                  |
| pickup_latitude   | 40.721319                   |
| dropoff_longitude | -73.84161                   |
| dropoff_latitude  | 40.712278000000005          |
| passenger_count   | 1                           |
  • Trim Outliers
    print(TABLE(data.describe(), showindex=True))
    
    |       | fare_amount | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
    |-------+-------------+------------------+-----------------+-------------------+------------------+-----------------|
    | count | 50000       | 50000            | 50000           | 50000             | 50000            | 50000           |
    | mean  | 11.3642     | -72.5098         | 39.9338         | -72.5046          | 39.9263          | 1.66784         |
    | std   | 9.68556     | 10.3939          | 6.22486         | 10.4076           | 6.01474          | 1.28919         |
    | min   | -5          | -75.4238         | -74.0069        | -84.6542          | -74.0064         | 0               |
    | 25%   | 6           | -73.9921         | 40.7349         | -73.9912          | 40.7344          | 1               |
    | 50%   | 8.5         | -73.9818         | 40.7527         | -73.9801          | 40.7534          | 1               |
    | 75%   | 12.5        | -73.9671         | 40.7674         | -73.9636          | 40.7682          | 2               |
    | max   | 200         | 40.7835          | 401.083         | 40.851            | 43.4152          | 6               |
    to_plot = data[[column for column in data.columns
                    if "latitude" in column or "longitude" in column]]
    plot = to_plot.hvplot.box().opts(title="Column Box-Plots",
                                     width=Plot.width,
                                     height=Plot.height)
    Embed(plot=plot, file_name="column_box_plots")()
    

    Figure Missing

    So you can see that there are negative fares, which seems wrong, and some outliers.

    This uses the pandas query method, which lets you write slightly more readable code for boolean slicing.

    print(f"{len(data):,}")
    data = data.query("pickup_latitude > 40.7 and pickup_latitude < 40.8 and " +
                      "dropoff_latitude > 40.7 and dropoff_latitude < 40.8 and " +
                      "pickup_longitude > -74 and pickup_longitude < -73.9 and " +
                      "dropoff_longitude > -74 and dropoff_longitude < -73.9 and " +
                      "fare_amount > 0"
                      )
    print(f"{len(data):,}")
    
    50,000
    31,289
    

Set Up the Training and Test Sets

y = data.fare_amount
base_features = ['pickup_longitude',
                 'pickup_latitude',
                 'dropoff_longitude',
                 'dropoff_latitude',
                 'passenger_count']

X = data[base_features]
x_train, x_validate, y_train, y_validate = train_test_split(X, y, random_state=1)

print(f"{len(x_train):,}")
print(f"{len(x_validate):,}")
23,466
7,823

Build and Train the Model

estimators = list(range(50, 200, 10))
max_depth = list(range(10, 100, 10)) + [None]

grid = dict(n_estimators=estimators,
            max_depth=max_depth)

model = RandomForestRegressor()
search = RandomizedSearchCV(estimator=model,
                            param_distributions=grid,
                            n_iter=40,
                            n_jobs=-1,
                            random_state=1)
with TIMER:
    search.fit(x_train, y_train)
first_model = search.best_estimator_
print(f"CV Training R^2: {search.best_score_:0.2f}")
print(f"Training R^2: {first_model.score(x_train, y_train): 0.2f}")
print(f"Validation R^2: {first_model.score(x_validate, y_validate):0.2f}")
print(search.best_params_)
2020-02-10 13:40:59,617 graeae.timers.timer start: Started: 2020-02-10 13:40:59.616742
/home/brunhilde/.virtualenvs/Visions-Voices-Data/lib/python3.7/site-packages/sklearn/model_selection/_split.py:1978: FutureWarning: The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.
  warnings.warn(CV_WARNING, FutureWarning)
/home/brunhilde/.virtualenvs/Visions-Voices-Data/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py:706: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  "timeout or by a memory leak.", UserWarning
2020-02-10 13:42:04,387 graeae.timers.timer end: Ended: 2020-02-10 13:42:04.387425
2020-02-10 13:42:04,390 graeae.timers.timer end: Elapsed: 0:01:04.770683
CV Training R^2: 0.45
Training R^2:  0.92
Validation R^2: 0.42
{'n_estimators': 170, 'max_depth': 70}

So it isn't really a great model, but we'll ignore that for now.

Questions

Question 1

The first model uses the following features:

  • pickup_longitude
  • pickup_latitude
  • dropoff_longitude
  • dropoff_latitude
  • passenger_count

Before running any code… which variables seem potentially useful for predicting taxi fares? Do you think permutation importance will necessarily identify these features as important?

I think that pickup and dropoff latitude might be important, since this would reflect where in the city the person was and wanted to go. Passenger count might make a difference as well, but I don't know if there's a greater charge for more people. Longitude might also be useful, but my guess would be that the North-South location is more indicative of the type of place you are in (uptown or downtown) and thus how far you have to travel (I have a vague notion that New York City is longer vertically than horizontally, but I don't know if this is true). This would be even more important if the fares change by location, but I don't know if that's the case.

with TIMER:
    permutor = PermutationImportance(first_model, random_state=1).fit(
        x_validate, y_validate)
ipython_html = eli5.show_weights(permutor,
                                 feature_names=x_validate.columns.tolist())
table = pandas.read_html(ipython_html.data)[0]
print(TABLE(table))
| Weight           | Feature           |
|------------------+-------------------|
| 0.8413 ± 0.0171  | dropoff_latitude  |
| 0.8135 ± 0.0223  | pickup_latitude   |
| 0.5723 ± 0.0370  | pickup_longitude  |
| 0.5324 ± 0.0257  | dropoff_longitude |
| -0.0014 ± 0.0015 | passenger_count   |

So it looks like latitude and longitude are important, with latitude a little more important than longitude, while passenger count isn't important.

A New Model

Question 4

Without detailed knowledge of New York City, it's difficult to rule out most hypotheses about why latitude features matter more than longitude.

A good next step is to disentangle the effect of being in certain parts of the city from the effect of total distance traveled.

The code below creates new features for longitudinal and latitudinal distance. It then builds a model that adds these new features to those you already had.

Feature Engineering

We're going to estimate the distance traveled using the differences in latitude and longitude from the pickup to the dropoff. Together these give us the components of a taxicab (Manhattan) distance estimate.

data['absolute_change_longitude'] = abs(
    data.dropoff_longitude - data.pickup_longitude)
data['absolute_change_latitude'] = abs(
    data.dropoff_latitude - data.pickup_latitude)

features_2  = ['pickup_longitude',
               'pickup_latitude',
               'dropoff_longitude',
               'dropoff_latitude',
               'absolute_change_latitude',
               'absolute_change_longitude']

X = data[features_2]
new_x_train, new_x_validate, new_y_train, new_y_validate = train_test_split(
    X, y, random_state=1)

estimators = list(range(100, 250, 10))
max_depth = list(range(10, 50, 10)) + [None]
# rebuild the grid - otherwise the search silently reuses the old one
grid = dict(n_estimators=estimators,
            max_depth=max_depth)
model = RandomForestRegressor()
search = RandomizedSearchCV(estimator=model,
                            param_distributions=grid,
                            n_jobs=-1,
                            random_state=1)
with TIMER:
    search.fit(new_x_train, new_y_train)

second_model = search.best_estimator_
print(f"Mean Cross-Validation Training R^2: {search.best_score_:0.2f}")
print(f"Training R^2: {second_model.score(new_x_train, new_y_train): 0.2f}")
print("Validation R^2: "
      f"{second_model.score(new_x_validate, new_y_validate):0.2f}")
print(search.best_params_)
2020-02-10 13:42:52,104 graeae.timers.timer start: Started: 2020-02-10 13:42:52.104493
/home/brunhilde/.virtualenvs/Visions-Voices-Data/lib/python3.7/site-packages/sklearn/model_selection/_split.py:1978: FutureWarning: The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.
  warnings.warn(CV_WARNING, FutureWarning)
2020-02-10 13:43:24,554 graeae.timers.timer end: Ended: 2020-02-10 13:43:24.554581
2020-02-10 13:43:24,556 graeae.timers.timer end: Elapsed: 0:00:32.450088
Mean Cross-Validation Training R^2: 0.48
Training R^2:  0.70
Validation R^2: 0.47
{'n_estimators': 190, 'max_depth': 10}

Still a pretty bad model, but that's not the point, I guess.

The Permutation Importance

permutor = PermutationImportance(second_model, random_state=1).fit(
    new_x_validate, new_y_validate)
ipython_html = eli5.show_weights(permutor,
                                 feature_names=new_x_validate.columns.tolist())
table = pandas.read_html(ipython_html.data)[0]
print(TABLE(table))
| Weight          | Feature                   |
|-----------------+---------------------------|
| 0.5976 ± 0.0306 | absolute_change_latitude  |
| 0.4429 ± 0.0496 | absolute_change_longitude |
| 0.0339 ± 0.0216 | pickup_latitude           |
| 0.0232 ± 0.0032 | dropoff_latitude          |
| 0.0214 ± 0.0068 | dropoff_longitude         |
| 0.0159 ± 0.0055 | pickup_longitude          |

The distance traveled seems to be the most important feature for the fare, even more than the actual locations, probably because taxis charge by distance.

Question 5

This question is about the scale of the features. Here's a sample.

print(TABLE(new_x_train.sample(random_state=1).iloc[0].reset_index()))
| index                     |    31975 |
|---------------------------+----------|
| pickup_longitude          | -73.9706 |
| pickup_latitude           |  40.7613 |
| dropoff_longitude         | -73.9806 |
| dropoff_latitude          |  40.7483 |
| absolute_change_latitude  |  0.01302 |
| absolute_change_longitude | 0.010067 |

And here's some statistics about each.

print(new_x_validate.describe())
       pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  \
count       7823.000000      7823.000000        7823.000000       7823.000000   
mean         -73.976957        40.756877         -73.975293         40.757591   
std            0.014663         0.018064           0.015877          0.018669   
min          -73.999977        40.700400         -73.999992         40.700293   
25%          -73.988180        40.745044         -73.987078         40.746345   
50%          -73.979933        40.757881         -73.978427         40.758602   
75%          -73.968008        40.769486         -73.966296         40.770561   
max          -73.900123        40.799865         -73.901790         40.799984   

       absolute_change_latitude  absolute_change_longitude  
count               7823.000000                7823.000000  
mean                   0.015091                   0.013029  
std                    0.012508                   0.011554  
min                    0.000000                   0.000000  
25%                    0.006089                   0.004968  
50%                    0.011745                   0.010110  
75%                    0.020781                   0.017798  
max                    0.084413                   0.087337  

A colleague observes that the values for absolute_change_longitude and absolute_change_latitude are pretty small (all values are between -0.1 and 0.1), whereas other variables have larger values. Do you think this could explain why those coordinates had larger permutation importance values in this case?

Consider an alternative where you created and used a feature that was 100X as large for these features, and used that larger feature for training and importance calculations. Would this change the outputted permutation importance values?

Why or why not?

for column in ("pickup_longitude pickup_latitude dropoff_longitude "
               "dropoff_latitude absolute_change_latitude "
               "absolute_change_longitude").split():
    print(f"{column}: {new_x_validate[column].max() - new_x_validate[column].min():0.3f}")
pickup_longitude: 0.100
pickup_latitude: 0.099
dropoff_longitude: 0.098
dropoff_latitude: 0.100
absolute_change_latitude: 0.084
absolute_change_longitude: 0.087

Intuitively I would think that the difference in scales would matter, so let's test it by adding rescaled copies of two of the features.

data["bigger_pickup_longitude"] = data.pickup_longitude * 100
data["bigger_absolute_change_longitude"] = data.absolute_change_longitude * 100
features_3  = ['pickup_longitude',
               'pickup_latitude',
               'dropoff_longitude',
               'dropoff_latitude',
               'absolute_change_latitude',
               'absolute_change_longitude',
               'bigger_pickup_longitude',
               'bigger_absolute_change_longitude'
               ]

X = data[features_3]
big_x_train, big_x_validate, big_y_train, big_y_validate = train_test_split(
    X, y, random_state=1)
model = RandomForestRegressor()
search = RandomizedSearchCV(estimator=model,
                            param_distributions=grid,
                            n_jobs=-1,
                            random_state=1)
with TIMER:
    search.fit(big_x_train, big_y_train)
big_model = search.best_estimator_
print(f"Mean Cross-Validation Training R^2: {search.best_score_:0.2f}")
print(f"Training R^2: {big_model.score(big_x_train, big_y_train): 0.2f}")
print("Validation R^2: "
      f"{big_model.score(big_x_validate, big_y_validate):0.2f}")
print(search.best_params_)
2020-02-09 15:06:45,693 graeae.timers.timer start: Started: 2020-02-09 15:06:45.693742
/home/athena/.virtualenvs/Visions-Voices-Data/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py:706: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  "timeout or by a memory leak.", UserWarning
2020-02-09 15:09:15,559 graeae.timers.timer end: Ended: 2020-02-09 15:09:15.559561
2020-02-09 15:09:15,560 graeae.timers.timer end: Elapsed: 0:02:29.865819
Mean Cross-Validation Training R^2: 0.49
Training R^2:  0.70
Validation R^2: 0.47
{'n_estimators': 190, 'max_depth': 10}
permutor = PermutationImportance(big_model, random_state=1).fit(
    big_x_validate, big_y_validate)
ipython_html = eli5.show_weights(permutor,
                                 feature_names=big_x_validate.columns.tolist())
table = pandas.read_html(ipython_html.data)[0]
print(TABLE(table))
| Weight          | Feature                          |
|-----------------+----------------------------------|
| 0.6034 ± 0.0436 | absolute_change_latitude         |
| 0.1794 ± 0.0126 | bigger_absolute_change_longitude |
| 0.1366 ± 0.0062 | absolute_change_longitude        |
| 0.0326 ± 0.0217 | pickup_latitude                  |
| 0.0242 ± 0.0040 | dropoff_latitude                 |
| 0.0194 ± 0.0083 | dropoff_longitude                |
| 0.0188 ± 0.0085 | pickup_longitude                 |
| 0.0116 ± 0.0018 | bigger_pickup_longitude          |

Making the pickup longitude bigger didn't change its ranking relative to the other features, so I wouldn't say that the scale had an effect. Note that duplicating a feature splits its importance between the two correlated copies, which is likely why the longitude-change weights here are lower than the single feature's weight in the previous table.

Question 6

You've seen that the feature importance for latitudinal distance is greater than the importance of longitudinal distance. From this, can we conclude whether travelling a fixed latitudinal distance tends to be more expensive than traveling the same longitudinal distance?

No. The feature importance indicates that latitudinal distance is useful for predicting fares, but it doesn't automatically mean that fares increase more per unit of latitude. It might be the case that the change in longitude affects the cost of a change in latitude as well, so the cost of a fixed latitudinal distance might vary depending on the longitude, or on the latitude and longitude combination.
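
As a rough follow-up check (my own addition, reusing the LinearRegression import from earlier - not part of the exercise), we could fit a plain linear model on just the two distance components and compare the per-degree coefficients. Even then, a larger latitude coefficient would only be suggestive, since the features may interact.

# sketch: per-degree fare estimates from a linear fit on the distance features
distance_model = LinearRegression().fit(
    new_x_train[["absolute_change_latitude", "absolute_change_longitude"]],
    new_y_train)
print(dict(zip(("fare per degree latitude", "fare per degree longitude"),
               distance_model.coef_)))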

Euclidean Distance

Instead of keeping the latitudinal and longitudinal distances separate, what if we used the Euclidean distance?

data["euclidean_distance"] = numpy.sqrt(data.absolute_change_latitude**2
                                        + data.absolute_change_longitude**2)
features = ['pickup_longitude',
            'pickup_latitude',
            'dropoff_longitude',
            'dropoff_latitude',
            'absolute_change_latitude',
            'absolute_change_longitude',
            'euclidean_distance',
            ]

X = data[features]
euclid_x_train, euclid_x_validate, euclid_y_train, euclid_y_validate = train_test_split(X, y, random_state=1)
model = RandomForestRegressor()
search = RandomizedSearchCV(estimator=model,
                            param_distributions=grid,
                            n_jobs=-1,
                            random_state=1)
with TIMER:
    search.fit(euclid_x_train, euclid_y_train)
euclidean_model = search.best_estimator_
print(f"Mean Cross-Validation Training R^2: {search.best_score_:0.2f}")
print(f"Training R^2: {euclidean_model.score(euclid_x_train, euclid_y_train): 0.2f}")
print("Validation R^2: "
      f"{euclidean_model.score(euclid_x_validate, euclid_y_validate):0.2f}")
print(search.best_params_)
2020-02-10 14:23:41,605 graeae.timers.timer start: Started: 2020-02-10 14:23:41.605310
/home/brunhilde/.virtualenvs/Visions-Voices-Data/lib/python3.7/site-packages/sklearn/model_selection/_split.py:1978: FutureWarning: The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.
  warnings.warn(CV_WARNING, FutureWarning)
2020-02-10 14:24:20,385 graeae.timers.timer end: Ended: 2020-02-10 14:24:20.385580
2020-02-10 14:24:20,388 graeae.timers.timer end: Elapsed: 0:00:38.780270
Mean Cross-Validation Training R^2: 0.48
Training R^2:  0.74
Validation R^2: 0.48
{'n_estimators': 190, 'max_depth': 10}

Interestingly, the training \(R^2\) score went up a little, and there was a slight improvement in the validation \(R^2\) as well.

permutor = PermutationImportance(euclidean_model, random_state=1).fit(
    euclid_x_validate, euclid_y_validate)
ipython_html = eli5.show_weights(
    permutor,
    feature_names=euclid_x_validate.columns.tolist())
table = pandas.read_html(ipython_html.data)[0]
print(TABLE(table))
| Weight          | Feature                   |
|-----------------+---------------------------|
| 1.3370 ± 0.0469 | euclidean_distance        |
| 0.3031 ± 0.0076 | absolute_change_latitude  |
| 0.1179 ± 0.0046 | absolute_change_longitude |
| 0.0261 ± 0.0045 | dropoff_latitude          |
| 0.0224 ± 0.0140 | dropoff_longitude         |
| 0.0219 ± 0.0041 | pickup_longitude          |
| 0.0183 ± 0.0051 | pickup_latitude           |

According to the permutation importance, euclidean distance is much more important than the separate distances.

End

The suggested next tutorial is about Partial Dependence Plots.

Permutation Importance

Beginning

These are some notes on the Kaggle tutorial on Permutation Importance. Permutation importance is a form of feature selection where you ask: if you randomly shuffle the values of one column in the dataset and leave the others in place, how does that affect the performance of the model? The idea is that if a column is important, then shuffling its values should make the model perform worse, so you can measure how much performance degrades after you shuffle each column and figure out which columns are contributing to the model. Here's the rough process (see the sketch after the list):

  1. Start with a trained model and a labeled dataset.
  2. Shuffle a single column
  3. Measure how much worse the model does predicting the target.
  4. Restore the column and move on to the next column, repeating the steps until you have covered all the columns.
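
Here's a minimal hand-rolled sketch of that process (my own illustration on synthetic data, not the tutorial's code - eli5 does this for us below):

import numpy
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

features, target = make_classification(random_state=1)
x_train, x_validate, y_train, y_validate = train_test_split(
    features, target, random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(
    x_train, y_train)

baseline = model.score(x_validate, y_validate)
generator = numpy.random.default_rng(1)
for column in range(5):
    shuffled = x_validate.copy()
    # permute just this column and see how much the validation score drops
    generator.shuffle(shuffled[:, column])
    drop = baseline - model.score(shuffled, y_validate)
    print(f"feature {column}: importance {drop:0.3f}")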

Imports

Python

from functools import partial
from pathlib import Path

PyPi

  • eli5

    eli5 (which is presumably short for explain it like I'm five) is a library to help with machine learning model debugging and visualization. You can read about the PermutationImportance class here.

    from eli5.sklearn import PermutationImportance
    import eli5
    
  • sklearn

    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    
  • others

    from tabulate import tabulate
    import hvplot.pandas
    import numpy
    import pandas
    

Others

from graeae import CountPercentage, EmbedHoloviews, EnvironmentLoader

Set Up

Plotting

SLUG = "permutation-importance"
PATH = Path("files/posts/tutorials/")/SLUG
Embed = partial(EmbedHoloviews, folder_path=PATH)

The Environment

ENVIRONMENT = EnvironmentLoader(path="posts/tutorials/.env")

The Table Printer

TABLE = partial(tabulate, tablefmt="orgtbl", headers="keys", showindex=False)

The Data

The dataset here is the Predict FIFA 2018 Man of the Match dataset from kaggle.

path = Path(ENVIRONMENT["FIFA-2018"]).expanduser()
data = pandas.read_csv(path)

Middle

Looking at the Dataset

print(TABLE(data.sample().iloc[0].reset_index(), headers="Column Value".split()))
| Column                 | Value       |
|------------------------+-------------|
| Date                   | 18-06-2018  |
| Team                   | Belgium     |
| Opponent               | Panama      |
| Goal Scored            | 3           |
| Ball Possession %      | 61          |
| Attempts               | 15          |
| On-Target              | 6           |
| Off-Target             | 7           |
| Blocked                | 2           |
| Corners                | 9           |
| Offsides               | 1           |
| Free Kicks             | 21          |
| Saves                  | 2           |
| Pass Accuracy %        | 89          |
| Passes                 | 544         |
| Distance Covered (Kms) | 102         |
| Fouls Committed        | 17          |
| Yellow Card            | 3           |
| Yellow & Red           | 0           |
| Red                    | 0           |
| Man of the Match       | Yes         |
| 1st Goal               | 47.0        |
| Round                  | Group Stage |
| PSO                    | No          |
| Goals in PSO           | 0           |
| Own goals              | nan         |
| Own goal Time          | nan         |

The Target

The target is "Man of the Match".

CountPercentage(data["Man of the Match"])()
| Value | Count | Percent (%) |
|-------+-------+-------------|
| Yes   | 64    | 50.00       |
| No    | 64    | 50.00       |

Not a particularly large dataset, but we aren't really interested in it per se, just in how to use permutation importance with it.

We want it to be a True/False value rather than a string value so let's change it.

data.loc[:, "Man of the Match"] = data["Man of the Match"] == "Yes"
CountPercentage(data["Man of the Match"])()
| Value | Count | Percent (%) |
|-------+-------+-------------|
| False | 64    | 50.00       |
| True  | 64    | 50.00       |

The Features

print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128 entries, 0 to 127
Data columns (total 27 columns):
Date                      128 non-null object
Team                      128 non-null object
Opponent                  128 non-null object
Goal Scored               128 non-null int64
Ball Possession %         128 non-null int64
Attempts                  128 non-null int64
On-Target                 128 non-null int64
Off-Target                128 non-null int64
Blocked                   128 non-null int64
Corners                   128 non-null int64
Offsides                  128 non-null int64
Free Kicks                128 non-null int64
Saves                     128 non-null int64
Pass Accuracy %           128 non-null int64
Passes                    128 non-null int64
Distance Covered (Kms)    128 non-null int64
Fouls Committed           128 non-null int64
Yellow Card               128 non-null int64
Yellow & Red              128 non-null int64
Red                       128 non-null int64
Man of the Match          128 non-null bool
1st Goal                  94 non-null float64
Round                     128 non-null object
PSO                       128 non-null object
Goals in PSO              128 non-null int64
Own goals                 12 non-null float64
Own goal Time             12 non-null float64
dtypes: bool(1), float64(3), int64(18), object(5)
memory usage: 26.2+ KB
None

As you can see, there are both numeric and non-numeric columns. For illustration purposes let's use just the integer columns.

columns = [column for column in data.columns if data[column].dtype == numpy.int64]
for column in sorted(columns):
    print(f" - {column}")
X = data[columns]

x_train, x_validate, y_train, y_validate = train_test_split(
    X,
    data["Man of the Match"], random_state=1)
  • Attempts
  • Ball Possession %
  • Blocked
  • Corners
  • Distance Covered (Kms)
  • Fouls Committed
  • Free Kicks
  • Goal Scored
  • Goals in PSO
  • Off-Target
  • Offsides
  • On-Target
  • Pass Accuracy %
  • Passes
  • Red
  • Saves
  • Yellow & Red
  • Yellow Card

Build and Train the Model

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(
    x_train, y_train)
print(f"Training Accuracy: {model.score(x_train, y_train)}")
print(f"Validation Accuracy: {model.score(x_validate, y_validate)}")
Training Accuracy: 1.0
Validation Accuracy: 0.6875

It didn't do particularly well on the validation set.

Permutation Importance

As I noted previously, you can read about the PermutationImportance class here. If you read the documentation you'll see that you don't have to pass it a prefit model, and in some cases you won't want to, but for our purposes we will.
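
As an aside, this is roughly what the non-prefit usage would look like (a sketch relying on eli5's cv parameter, which cross-validates and fits the estimator internally; we'll stick with the prefit model below):

# pass an unfit estimator and let PermutationImportance fit it via 3-fold CV
unfit_permutor = PermutationImportance(
    RandomForestClassifier(n_estimators=100, random_state=0),
    cv=3, random_state=1).fit(x_validate, y_validate)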

permutor = PermutationImportance(model, random_state=1).fit(
    x_validate, y_validate)

Now we can print out a table of the outcome.

ipython_html = eli5.show_weights(
    permutor, 
    feature_names=x_validate.columns.tolist())
table = pandas.read_html(ipython_html.data)[0]
print(TABLE(table))
| Weight           | Feature                |
|------------------+------------------------|
| 0.1750 ± 0.0848  | Goal Scored            |
| 0.0500 ± 0.0637  | Distance Covered (Kms) |
| 0.0437 ± 0.0637  | Yellow Card            |
| 0.0187 ± 0.0500  | Off-Target             |
| 0.0187 ± 0.0637  | Free Kicks             |
| 0.0187 ± 0.0637  | Fouls Committed        |
| 0.0125 ± 0.0637  | Pass Accuracy %        |
| 0.0125 ± 0.0306  | Blocked                |
| 0.0063 ± 0.0612  | Saves                  |
| 0.0063 ± 0.0250  | Ball Possession %      |
| 0 ± 0.0000       | Red                    |
| 0 ± 0.0000       | Yellow & Red           |
| 0.0000 ± 0.0559  | On-Target              |
| -0.0063 ± 0.0729 | Offsides               |
| -0.0063 ± 0.0919 | Corners                |
| -0.0063 ± 0.0250 | Goals in PSO           |
| -0.0187 ± 0.0306 | Attempts               |
| -0.0500 ± 0.0637 | Passes                 |

The table is ranked from most important feature to least important (based on the drop in accuracy after shuffling the column). Anything with a weight of 0 or less essentially contributed nothing to the model - although that doesn't mean those features might not be useful for more feature engineering.

The data is for the team as a whole, not an individual, so the "Man of the Match" column is telling you if any player on the team was awarded the "Budweiser Man of the Match".

Plotting the Importance

The numbers are okay, but let's take a look at a plot of the weights.

weights = table.Weight.str.split(expand=True)[0].astype(float)
table["weights"] = weights
plot = table.hvplot.bar(x="Feature", y="weights").opts(
    title="Permutation Importance (by Accuracy)",
    width=1000, height=800, xrotation=45)
output = Embed(plot=plot, file_name="permutation_importance")()
print(output)

Figure Missing

End

So that's a quick look at getting a sense of the importance of a feature using eli5 and permutation importance. There's a more in-depth look at it on their site, but next is another look at it with a different data set.

Decision Trees

Beginning

Imports

Python

from functools import partial
from typing import Any
from math import log2

PyPi

from expects import (
    be_within,
    expect,
)
from tabulate import tabulate
import attr
import pandas

Set Up

TABLE = partial(tabulate, headers="keys", tablefmt="orgtbl")

Splitting A Node

We choose which is the next node to split by checking the amount of information we would gain for each candidate node and picking the one that gives us the highest gain. We do this by measuring the impurity of the nodes, which is a measurement of how dissimilar the class labels are for a node.

Entropy

One measure of "impurity" is entropy, a measurement of the information associated with our nodes.

Node Entropy

Here's where we'll calculate the entropy for a node.

\[ Entropy = - \sum_{i=0}^{c-1} p_i(t)\log_2 p_i(t) \]

Where \(p_i(t)\) is the fraction of the training data (t) that has the classification i. We can translate that to a python function.

 def node_entropy(data: pandas.Series, debug: bool=False) -> float:
     """Calculate the entropy for a child node

     Args:
      data: target data filtered to match the child-node's class
      debug: emit values

     Returns:
      entropy for this node
     """
     if debug:
         print("calculating node-entropy")
     total = len(data)
     accumulated = 0
     for classification in data.unique():
         p = len(data[data == classification])/total
         if debug:
             print(f"\tclass: {classification}, p: {p} entropy: {p * log2(p)}")
         accumulated += p * log2(p)
     if debug:
         print(f"Node Entropy: {accumulated}")
     return -accumulated

Entropy of a Node's Children

We'll use the entropy formula to get the entropy for an individual node, but to get the total contribution from a possible split we'll take a weighted sum of the child-node entropies.

\[ I(children) = \sum_{j=1}^k \frac{N(v_j)}{N} I(v_j) \]

Where \(\frac{N(v_j)}{N}\) is the fraction of the child data in node j and \(I(v_j)\) is the entropy (Impurity) of node j.

Once again, in python.

def children_impurity(data: pandas.DataFrame, column: str, target: str,
                      impurity: object=node_entropy,
                      debug: bool=False) -> float:
    """Calculate the entropy for the parent of child nodes

    Args:
     data: the container for the values to check
     column: the column to get the entropy
     target: the target column
     impurity: the function to calculate the impurity of the child
     debug: whether to print some intermediate values

    Returns:
     entropy for the data
    """
    if debug:
        print("Calculating entropy for child nodes")
    total = len(data)
    accumulator = 0
    for classification in data[column].unique():
        child = data[data[column] == classification][target]
        if debug:
            print(f"\tI_({classification}) = ({len(child)}/{total}) "
                  f"x {impurity(child)}")
        accumulator += (len(child)/total) * impurity(child)
    if debug:
        print(f"Child node entropy: {accumulator}")
    return accumulator

Information Gain

The measurement of how much is gained is the difference between a parent node and its children. \[ \Delta = I(parent) - I(children) \]

 def information_gain(data: pandas.Series, column: str, target: str,
                      debug: bool=False) -> float:
     """See how much entropy is removed using this node

     Args:
      data: source to check
      column: name of the column representing the parent node
      target: name of the column we are trying to predict
      debug: emit messages
     """
     return node_entropy(data[target], debug=debug) - children_impurity(
         data, column=column, target=target, debug=debug)

Home Loans

To make this concrete we can look at a small dataset of people applying for a loan. We want to know if they are likely to default and we need to decide if we want our first split to be on whether they own a home or are married (we'll ignore income for this check because it's meant to illustrate splitting qualitative data).

 @attr.s(auto_attribs=True, slots=True, frozen=True)
 class LoanColumns:
     owner: str = "Home Owner"
     married: str = "Marital Status"
     income: str = "Annual Income"
     defaulted: str = "Defaulted"

 LOANS = LoanColumns()
 loans = pandas.DataFrame({
     LOANS.owner: [True, False, False, True, False, False, True, False, False, False],
     LOANS.married: ["Single", "Married", "Single", "Married", "Divorced", "Single", "Divorced", "Single", "Married", "Single"],
     LOANS.income: [125000, 100000, 70000, 120000, 95000, 60000, 220000, 85000, 75000, 90000],
     LOANS.defaulted: [False, False, False, False, True, False, False, True, False, True],
 })
 print(TABLE(loans))
|   | Home Owner | Marital Status | Annual Income | Defaulted |
|---+------------+----------------+---------------+-----------|
| 0 | True       | Single         | 125000        | False     |
| 1 | False      | Married        | 100000        | False     |
| 2 | False      | Single         | 70000         | False     |
| 3 | True       | Married        | 120000        | False     |
| 4 | False      | Divorced       | 95000         | True      |
| 5 | False      | Single         | 60000         | False     |
| 6 | True       | Divorced       | 220000        | False     |
| 7 | False      | Single         | 85000         | True      |
| 8 | False      | Married        | 75000         | False     |
| 9 | False      | Single         | 90000         | True      |

The first step is to calculate the entropy for the entire set using whether they defaulted or not as our classification.

 impurity_parent = node_entropy(loans[LOANS.defaulted])
 print(f"I_parent: {impurity_parent:0.3f}")

 expect(impurity_parent).to(be_within(0.8812, 0.8813))
I_parent: 0.881

The next step is to figure out which of our two chosen attributes gives us the most information gain - whether the person is a Home Owner or their Marital Status. We could just look at which has the lower child entropy, but the problem is stated so that we find the greatest difference between the children's entropy and the parent's entropy instead.

  • Home Owners

    We have two child nodes - one for homeowners and one for non-homeowners.

     print(loans[loans[LOANS.owner]][LOANS.defaulted].value_counts())
     print()
     print(loans[~loans[LOANS.owner]][LOANS.defaulted].value_counts())
    
    False    3
    Name: Defaulted, dtype: int64
    
    False    4
    True     3
    Name: Defaulted, dtype: int64
    
     impurity_home_owner = children_impurity(loans,
                                             column=LOANS.owner,
                                             target=LOANS.defaulted, debug=True)
     print(f"{impurity_home_owner: 0.3f}")
     expect(impurity_home_owner).to(be_within(0.689, 0.691))
    
    Calculating entropy for child nodes
            I_(True) = (3/10) x -0.0
            I_(False) = (7/10) x 0.9852281360342515
    Child node entropy: 0.6896596952239761
     0.690
    

    Odd that Python allows negative zero-values (it's IEEE 754 signed zero; -0.0 == 0.0, so it's harmless here)… Now we can see what the information gain will be.

     gain_home_owner = information_gain(loans, LOANS.owner, LOANS.defaulted)
     print(f"Delta Home-Owner: {gain_home_owner: 0.3}")
     expect(gain_home_owner).to(be_within(0.190, 0.19165))
    
    Delta Home-Owner:  0.192
    

    I seem to have precision differences with the book…

  • Married Applicants
     gain_married = information_gain(loans, LOANS.married, LOANS.defaulted, debug=True)
     print(f"Delta Married: {gain_married: 0.3f}")
     expect(gain_married).to(be_within(0.194, 0.196))
     choice = max(((gain_home_owner, LOANS.owner),
                   (gain_married, LOANS.married)))
     print(f"Next Node: {choice}")
    
    calculating node-entropy
            class: False, p: 0.7 entropy: -0.3602012209808308
            class: True, p: 0.3 entropy: -0.5210896782498619
    Node Entropy: -0.8812908992306927
    Calculating entropy for child nodes
            I_(Single) = (5/10) x 0.9709505944546686
            I_(Married) = (3/10) x -0.0
            I_(Divorced) = (2/10) x 1.0
    Child node entropy: 0.6854752972273344
    Delta Married:  0.196
    Next Node: (0.19581560200335835, 'Marital Status')
    

    Since we gain more information from checking whether someone was married or not, that would be the next node we would choose to split.

Binary Splitting of Qualitative Attributes Using the Gini Index

\[ \textit{Gini Index} = 1 - \sum_{i=0}^{c - 1} p_i(t)^2 \]

def gini(data: pandas.Series) -> float:
    """Calculate the gini index for the data"""
    total = len(data)
    accumulator = 0
    for classification in data.unique():
        accumulator += (len(data[data==classification])/total)**2
    return 1 - accumulator

Parent Impurity

First we get the gini index for the overall data.

parent_gini = gini(loans[LOANS.defaulted])
print(f"I_parent = {parent_gini: 0.3f}")
expect(parent_gini).to(be_within(0.420, 0.421))
I_parent =  0.420

Home Owner Impurity

Now the homeowner attribute.

homeowner_gini = children_impurity(loans, LOANS.owner, LOANS.defaulted, gini, debug=True)
expect(homeowner_gini).to(be_within(0.342, 0.344))
Calculating entropy for child nodes
        I_(True) = (3/10) x 0.0
        I_(False) = (7/10) x 0.48979591836734704
Child node entropy: 0.3428571428571429

Married Impurity

This is different from the entropy case because we want to do binary splits, but the marital status attribute has three values (Single, Married, and Divorced), so we need a different function that tests each attribute value against all the others (or we could add columns that turn the marital status into binary attributes, but this seems simpler).

def binary_gini(data: pandas.Series, column: str, target: str,
                classification: object, debug: bool=False) -> float:
    """Calculate the gini value for the data using one against many

    Args:
     data: source with qualitative values
     column: column with the classifications to test
     target: column with the classifications to predict
     classification: the classification to compare
     debug: whether to emit messages
    """
    total = len(data)
    classified = data[data[column] == classification]
    others = data[data[column] != classification]
    if debug:
        print(f"N({classification}/N) I({classification}) = {len(classified)/total * gini(classified[target]):0.3f}")
        print(f"N(!{classification}/N) I!({classification}) = {len(others)/total * gini(others[target]):0.3f}")
        print(f"I({classification}) = {((len(classified)/total) * gini(classified[target]) + (len(others)/total) * gini(others[target])):0.3f}")
    return ((len(classified)/total) * gini(classified[target])
            + (len(others)/total) * gini(others[target]))

@attr.s(auto_attribs=True, slots=True, frozen=True)
class MaritalStatus:
    single: str="Single"
    married: str="Married"
    divorced: str="Divorced"

status = MaritalStatus()
i_single = binary_gini(loans, LOANS.married, LOANS.defaulted, status.single,
                       debug=True)
print()
i_married = binary_gini(loans, LOANS.married, LOANS.defaulted, status.married,
                        debug=True)
print()
i_divorced = binary_gini(loans, LOANS.married, LOANS.defaulted,
                         status.divorced, debug=True)
expect(i_single).to(be_within(0.39, 0.41))
expect(i_divorced).to(be_within(0.39, 0.41))
expect(i_married).to(be_within(0.342, 0.344))
N(Single/N) I(Single) = 0.240
N(!Single/N) I!(Single) = 0.160
I(Single) = 0.400

N(Married/N) I(Married) = 0.000
N(!Married/N) I!(Married) = 0.343
I(Married) = 0.343

N(Divorced/N) I(Divorced) = 0.100
N(!Divorced/N) I!(Divorced) = 0.300
I(Divorced) = 0.400

best = 0
best_split = None
for candidate, label in ((homeowner_gini, "Homeowner"),
                         (i_single, "Single"),
                         (i_married, "Married"),
                         (i_divorced, "Divorced")):
    delta = parent_gini - candidate
    if delta > best:
        best = delta
        best_split = label
    print(f"Delta {label} = {delta:0.3f}")
print(f"Best Split: {best_split}")
Delta Homeowner = 0.077
Delta Single = 0.020
Delta Married = 0.077
Delta Divorced = 0.020
Best Split: Homeowner

Either home ownership or whether someone is married would be the best candidate for the next split - their deltas tie at 0.077.
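
As a sanity check (my own addition; it assumes scikit-learn is installed, which the other posts already use), we can fit a depth-one sklearn DecisionTreeClassifier with the Gini criterion on encoded versions of the two attributes and see which split it picks:

from sklearn.tree import DecisionTreeClassifier, export_text

# one-hot encode Marital Status; Home Owner is already boolean
encoded = pandas.get_dummies(loans[[LOANS.owner, LOANS.married]])
tree = DecisionTreeClassifier(criterion="gini", max_depth=1,
                              random_state=0).fit(encoded, loans[LOANS.defaulted])
print(export_text(tree, feature_names=list(encoded.columns)))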

Snorkel Data Labeling

Beginning

This is a walk-through of the Snorkel Data Labeling tutorial. It uses the YouTube Spam Collection data set (downloaded from the UCI Machine Learning Repository). The data was collected in 2015 and represents comments from five of the ten most popular videos on YouTube. It is a tabular dataset with the columns COMMENT_ID, AUTHOR, DATE, CONTENT, and CLASS. The CLASS column represents whether the comment was considered spam or not, so we'll pretend it isn't there for most of this walk-through.

Imports

Python

from argparse import Namespace
from functools import partial
from pathlib import Path
import re

PyPi

from snorkel.analysis import get_label_buckets
from snorkel.labeling import labeling_function, LabelingFunction, LFAnalysis, PandasLFApplier
from snorkel.preprocess import preprocessor
from snorkel.labeling.lf.nlp import nlp_labeling_function
from sklearn.model_selection import train_test_split
from tabulate import tabulate
from textblob import TextBlob

import pandas

Set Up

The Table Printer

TABLE = partial(tabulate, tablefmt="orgtbl", headers="keys")

Some Constants

Comment = Namespace(
    is_ambiguous = -1,
    is_ham = 0,
    is_spam = 1
)
Data = Namespace(
    test_artist = "Shakira",
    development_size = 200,
    validation_size = 0.5,
    random_seed = 666,
)
Columns = Namespace(
    comment = "CONTENT",
    classification = "CLASS",
    artist = "artist",
)

Middle

The Dataset

The data is split up into separate files - one per artist/video (the files are named after the artists, and each artist appears only once) - so I'm going to smash them back together and add an artist column.

path = Path("~/data/datasets/texts/you_tube_comments/").expanduser()
sets = []
for name in path.glob("*.csv"):
    artist = name.stem.split()[-1]
    data = pandas.read_csv(name)
    data["artist"] = artist
    sets.append(data)
    print(artist)
data = pandas.concat(sets)
KatyPerry
LMFAO
Eminem
Shakira
Psy

Splitting the Set

The tutorial takes a slightly different approach from the one I previously took. Here are their four data-set splits:

  • train: Comments from the first four videos
  • dev: 200 comments taken from train
  • valid & test: A 50-50 split of the last video (actually Shakira, not Psy as listed above)

test = data[data.artist==Data.test_artist]
train = data[data.artist != Data.test_artist]
train, development = train_test_split(train, test_size=Data.development_size)

validation, test = train_test_split(test, test_size=Data.validation_size)
print(f"Training: {train.shape}")
print(f"Development: {development.shape}")
print(f"Validation: {validation.shape}")
print(f"Testing: {test.shape}")
Training: (1386, 6)
Development: (200, 6)
Validation: (185, 6)
Testing: (185, 6)

Finding Labeling Functions

The place to start is with the development set - it's labeled (although in this case the training set is as well, but pretend it isn't) and we can inspect it to get ideas.

print(development.sample(random_state=Data.random_seed)[[Columns.comment, Columns.classification]])
                                               CONTENT  CLASS
216  Lol...I dunno how this joke gets a lot of like...      0
spam = development[development[Columns.classification]==Comment.is_spam]
for count in range(10):
    print(spam.sample(random_state=count).iloc[0][Columns.comment])
I #votekatyperry for the 2014 MTV #VMA Best Lyric Video! See who's in the  lead and vote:  http://on.mtv.com/Ut15kX
LIKE AND SUBSCRIB IF YOU WATCH IN 2015 ;)
 HI IM 14 YEAR RAPPER SUPPORT ME GUY AND CHECK OUT MY CHANNEL AND CHECK OUT MY SONG YOU MIGHT LIKE IT ALSO FOLLOW ME IN TWITTER @McAshim for follow back.
LIKE AND SUBSCRIB IF YOU WATCH IN 2015 ;)
HAPPY BIRTHDAY KATY :) http://giphy.com/gifs/birthday-flowers-happy-gw3JY2uqiaXKaQXS/fullscreen  (That´s not me)
plz check out fablife / welcome to fablife for diys and challenges so plz  subscribe thx!
CHECK OUT MY CHANNEL BOYS AND GIRLS ;)
HAPPY BIRTHDAY KATY :) http://giphy.com/gifs/birthday-flowers-happy-gw3JY2uqiaXKaQXS/fullscreen  (That´s not me)
*for 90&#39;s rap fans*  check out my Big Pun - &#39;Beware&#39; cover!  Likes n comments very much appreciated!
Who&#39;s watching in 2015 Subscribe for me !

You can already see that the spam has people asking viewers to check out their sites.

Check vs Check Out

Let's see which one of the strings (check or check out) does better for us.

  • The Labeling Functions
    @labeling_function()
    def check(row: pandas.Series) -> int:
        """sees if the word 'check' is in the comment"""
        return Comment.is_spam if "check" in row.CONTENT.lower() else Comment.is_ambiguous
    
    @labeling_function()
    def check_out(row: pandas.Series) -> int:
        """looks for phrase 'check out'"""
        return Comment.is_spam if "check out" in row.CONTENT.lower() else Comment.is_ambiguous
    
  • Applying the Functions

    The next step is to create some Labeling Matrices using our labeling functions by applying them to our training and development sets. Since our data is stored using pandas, we'll use the PandasLFApplier, but there are other types available as well.

    labeling_functions = [check, check_out]
    
    applier = PandasLFApplier(lfs=labeling_functions)
    train_labeling_matrix = applier.apply(df=train, progress_bar=False)
    development_labeling_matrix = applier.apply(df=development, progress_bar=False)
    print(f"Training Labeling Matrix: {train_labeling_matrix.shape}")
    print(f"Development Labeling Matrix: {development_labeling_matrix.shape}")
    
    Training Labeling Matrix: (1386, 2)
    Development Labeling Matrix: (200, 2)
    

    Each matrix has one column for each of our labeling functions (so two in this case) and one row for each of the rows in the set that the functions were applied to.

  • Evaluating the Labeling Functions

    Snorkel provides a LFAnalysis class to help you see how well the labeling functions do.

    analysis = LFAnalysis(L=train_labeling_matrix, lfs=labeling_functions)
    print(TABLE(analysis.lf_summary()))
    
    |           | j | Polarity | Coverage | Overlaps | Conflicts |
    |-----------+---+----------+----------+----------+-----------|
    | check     | 0 | [1]      | 0.257576 | 0.212843 | 0         |
    | check_out | 1 | [1]      | 0.212843 | 0.212843 | 0         |

    This is what the table is giving us for each of the labeling functions:

    • j: I think this is just an index
    • Polarity: the unique labels the function emitted (excluding -1, which is interpreted as an un-labeled row)
    • Coverage: the fraction of the data-set that the function labeled
    • Overlaps: the fraction of the data that the function labeled and at least one other function also labeled
    • Conflicts: the fraction of the data where the function's label disagreed with at least one other function's label

    So it looks like check covers slightly more than check_out, and they don't disagree with each other at all. This makes sense when you consider that check is a sub-string of check out - we can guess that all the overlaps are cases where "check out" was found in the comment.
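
    We can verify that guess directly (my own addition): every comment that check_out flagged should also have been flagged by check.

    spam_check = train_labeling_matrix[:, 0] == Comment.is_spam
    spam_check_out = train_labeling_matrix[:, 1] == Comment.is_spam
    # comments flagged by check_out but not by check - should be zero
    print((spam_check_out & ~spam_check).sum())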

    We can also pass it a set of labels and it will see how well the functions did. In this case we have labels for all the rows, but in most cases we won't just for the development set so we'll use it here.

    print(TABLE(LFAnalysis(
        L=development_labeling_matrix,
        lfs=labeling_functions).lf_summary(Y=development.CLASS.values)))
    
    |           | j | Polarity | Coverage | Overlaps | Conflicts | Correct | Incorrect | Emp. Acc. |
    |-----------+---+----------+----------+----------+-----------+---------+-----------+-----------|
    | check     | 0 | [1]      | 0.26     | 0.225    | 0         | 49      | 3         | 0.942308  |
    | check_out | 1 | [1]      | 0.225    | 0.225    | 0         | 45      | 0         | 1         |

    Note: The LFAnalysis class works with numpy arrays, so when I called the lf_summary method I had to pass in the values and not the CLASS Series.

    With our development set, the functions cover slightly more of the data than they did on the training set, and although check covers slightly more than check_out, it also has some false-positives, so we'd have to decide whether we care more about catching all the spam or about not accidentally labeling non-spam as spam.

    We can also check which ones were mis-labeled to get a better idea of how off they were.

    buckets = get_label_buckets(development.CLASS.values, development_labeling_matrix[:, 0])
    for key, value in buckets.items():
        print(key)
        print(value)
    
    (0, -1)
    [  0   1   2   3   4   7   9  10  11  12  13  15  16  17  19  22  27  28
      33  34  35  39  41  43  44  46  48  49  50  51  52  53  55  57  58  61
      62  65  66  68  78  79  81  82  86  88  89  92  94  95  98  99 100 103
     104 105 107 108 112 113 114 120 121 122 123 124 125 128 129 131 133 135
     141 142 144 146 148 150 153 154 155 162 165 166 167 168 171 173 174 179
     182 183 184 190 191 195 196 197]
    (1, 1)
    [  5  14  23  25  29  36  40  42  45  59  67  69  71  73  74  75  76  77
      80  83  87  90  91  93 101 109 110 116 117 126 127 134 138 139 140 143
     149 151 157 160 163 164 169 172 186 189 192 193 198]
    (1, -1)
    [  6   8  18  20  21  24  26  30  31  32  38  47  54  56  60  63  64  70
      72  84  85  96  97 102 106 111 115 119 130 132 136 137 145 147 152 156
     158 159 161 175 176 177 178 180 181 185 187 188 194 199]
    (0, 1)
    [ 37 118 170]
    

    Buckets is a dict whose keys are tuples of (actual class, predicted class) and whose values are the indices of the rows matching the keys (so the key (0, 1) returns the indices for rows where we labeled the comment as spam but it wasn't). Looking at the output you can see that the last key, (0, 1), holds the cases that we labeled as spam when they weren't. Let's take a look at them.

    for comment in development.iloc[buckets[(Comment.is_ham, Comment.is_spam)]]["CONTENT"]:
        print(comment)
    
    i turned it on mute as soon is i came on i just wanted to check the  views...
    i check back often to help reach 2x10^9 views and I avoid watching Baby
    Admit it you just came here to check the number of viewers 
    

    It's not obvious to me how you should handle those.

  • Check Out But Not Check

    What are some training examples that check labels but check_out doesn't? We can check by feeding in the columns of the labeling matrix for the check and check_out functions and seeing where check_out abstained and check didn't. I said earlier that the first argument to get_label_buckets is the actual label, but really you can feed it any two arrays and it will give you the indices for the permutations of their row-values.

    buckets = get_label_buckets(train_labeling_matrix[:, 0], train_labeling_matrix[:, 1])
    sampled = train.iloc[buckets[(Comment.is_spam, Comment.is_ambiguous)]].sample(10, random_state=Data.random_seed)
    for sample in sampled.itertuples():
        print(sample.CONTENT)
    
    Lil m !!!!! Check hi out!!!!! Does live the way you lie and many more ! Check it out!!! And subscribe
    https://soundcloud.com/artady please check my stuff; and make some feedback
    Hey guys can you check my channel out plz. I do mine craft videos. Let's  shoot for 20 subs
    ┏━━━┓┏┓╋┏┓┏━━━┓┏━━━┓┏┓╋╋┏┓  ┃┏━┓┃┃┃╋┃┃┃┏━┓┃┗┓┏┓┃┃┗┓┏┛┃  ┃┗━━┓┃┗━┛┃┃┃╋┃┃╋┃┃┃┃┗┓┗┛┏  ┗━━┓┃┃┏━┓┃┃┗━┛┃╋┃┃┃┃╋┗┓┏┛  ┃┗━┛┃┃┃╋┃┃┃┏━┓┃┏┛┗┛┃╋╋┃┃  ┗━━━┛┗┛╋┗┛┗┛╋┗┛┗━━━┛╋╋┗┛ CHECK MY VIDEOS AND SUBSCRIBE AND LIKE PLZZ
    if you like raw talent, raw lyrics, straight real hip hop Everyone check my newest sound  Dizzy X - Got the Juice (Prod by. Drugs the Model Citizen)   COMMENT TELL ME WHAT YOU THINK  DONT BE LAZY!!!!  - 1/7 Prophetz
    check it out free stuff for watching videos and filling surveys<br /><br /><a href="http://www.prizerebel.com/index.php?r=1446084">http://www.prizerebel.com/index.php?r=1446084</a>
    Hey! I'm NERDY PEACH and I'm a new youtuber and it would mean THE ABSOLUTE  world to me if you could check 'em out! &lt;3  Hope you like them! =D
    Check my first video out
    http://tankionline.com#friend=cd92db3f4 great game check it out!
    hi beaties! i made a new channel please go check it out and subscribe and  enjoy!
    

    I'm going to deviate from the tutorial a little and create a regular expression to match any comment with "check" and not "view" to avoid cases where the commenter is saying that they're checking out how many views the video had.

    EXPRESSION = re.compile(r"check(?!.*view)")
    
    assert EXPRESSION.search("everyone please come check our newest song in memories of Martin Luther  King Jr.")
    assert EXPRESSION.search("and u should.d check my channel and tell me what I should do next!")
    assert not EXPRESSION.search("Admit it you just came here to check the number of viewers ")
    
    @labeling_function()
    def re_check_out(row: pandas.Series) -> int:
        """match cases with 'check' but not view"""
        return Comment.is_spam if EXPRESSION.search(row.CONTENT.lower()) else Comment.is_ambiguous
    
    labeling_functions = [check, check_out, re_check_out]
    applier = PandasLFApplier(lfs=labeling_functions)
    train_labeling_matrix = applier.apply(df=train, progress_bar=False)
    development_labeling_matrix = applier.apply(df=development, progress_bar=False)
    
    analysis = LFAnalysis(L=train_labeling_matrix, lfs=labeling_functions)
    print(TABLE(analysis.lf_summary()))
    
    |              | j | Polarity | Coverage | Overlaps | Conflicts |
    |--------------+---+----------+----------+----------+-----------|
    | check        | 0 | [1]      | 0.257576 | 0.248196 | 0         |
    | check_out    | 1 | [1]      | 0.212843 | 0.212843 | 0         |
    | re_check_out | 2 | [1]      | 0.243146 | 0.243146 | 0         |

    Our re_check_out function has a little less coverage than check, as we'd expect, since it excludes comments with "view" in them, but it also covers a little more than check_out.

    print(TABLE(LFAnalysis(
        L=development_labeling_matrix,
        lfs=labeling_functions).lf_summary(Y=development.CLASS.values)))
    
    |              | j | Polarity | Coverage | Overlaps | Conflicts | Correct | Incorrect | Emp. Acc. |
    |--------------+---+----------+----------+----------+-----------+---------+-----------+-----------|
    | check        | 0 | [1]      | 0.26     | 0.245    | 0         | 49      | 3         | 0.942308  |
    | check_out    | 1 | [1]      | 0.225    | 0.225    | 0         | 45      | 0         | 1         |
    | re_check_out | 2 | [1]      | 0.245    | 0.245    | 0         | 49      | 0         | 1         |

    It looks like we were able to avoid the false-positives by adding our regular expression.

    buckets = get_label_buckets(development_labeling_matrix[:, 0], development_labeling_matrix[:,2])
    for comment in development.iloc[buckets[(Comment.is_spam, Comment.is_ambiguous)]]["CONTENT"]:
        print(comment)
    
    i turned it on mute as soon is i came on i just wanted to check the  views...
    i check back often to help reach 2x10^9 views and I avoid watching Baby
    Admit it you just came here to check the number of viewers 
    

    So it looks like the regular expression got rid of the false positives without losing any of the spam that check had caught. We could probably grab more spam by searching for "my" as well.

Using TextBlob with a Preprocessor

Here we'll use TextBlob's sentiment scorer to find comments that aren't spam. To do this we'll need snorkel's Preprocessor, which maps data through black-box functions before the labeling functions see it.

@preprocessor(memoize=True)
def textblob_sentiment(row: pandas.Series) -> pandas.Series:
    """Add the polarity and subjectivity of the comment's sentiment

    This adds two columns ('polarity' and 'subjectivity') based on the comment

    """
    blob = TextBlob(row.CONTENT)
    row["polarity"] = blob.sentiment.polarity
    row["subjectivity"] = blob.sentiment.subjectivity
    return row

The polarity is a value from -1.0 to 1.0 which reflects how negative or positive the text is believed to be. The subjectivity is a value from 0.0 to 1.0 which reflects whether the text is objective or subjective - whether it is a statement of fact or opinion.
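
For example (a quick illustration of the scores, not from the tutorial):

blob = TextBlob("This is the best song ever, I love it!")
print(blob.sentiment.polarity)      # positive, so close to 1.0
print(blob.sentiment.subjectivity)  # an opinion, so close to 1.0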

Now that we have the pre-processor we can use it with a labeling function.

Polarity

@labeling_function(pre=[textblob_sentiment])
def textblob_polarity(row: pandas.Series) -> int:
    """decides if the comment is ham based on the polarity of the sentiment"""
    return Comment.is_ham if row.polarity > 0.9 else Comment.is_ambiguous

Subjectivity

@labeling_function(pre=[textblob_sentiment])
def textblob_subjectivity(row: pandas.Series) -> int:
    """decides if the comment is ham based on the subjectivity"""
    return Comment.is_ham if row.subjectivity > 0.5 else Comment.is_ambiguous

Analyzing the Performance

Once again, now that we have labeling functions we need to analyze how well they do.

labeling_functions = [textblob_polarity, textblob_subjectivity]
applier = PandasLFApplier(lfs=labeling_functions)
train_label_matrix = applier.apply(train, progress_bar=False)
development_label_matrix = applier.apply(development, progress_bar=False)
print(TABLE(LFAnalysis(train_label_matrix, labeling_functions).lf_summary()))
|                       | j | Polarity | Coverage | Overlaps  | Conflicts |
|-----------------------+---+----------+----------+-----------+-----------|
| textblob_polarity     | 0 | [0]      | 0.033189 | 0.0122655 | 0         |
| textblob_subjectivity | 1 | [0]      | 0.32684  | 0.0122655 | 0         |
print(TABLE(LFAnalysis(development_label_matrix, labeling_functions).lf_summary(Y=development.CLASS.values)))
|                       | j | Polarity | Coverage | Overlaps | Conflicts | Correct | Incorrect | Emp. Acc. |
|-----------------------+---+----------+----------+----------+-----------+---------+-----------+-----------|
| textblob_polarity     | 0 | [0]      | 0.05     | 0.025    | 0         | 9       | 1         | 0.9       |
| textblob_subjectivity | 1 | [0]      | 0.3      | 0.025    | 0         | 32      | 28        | 0.533333  |

Subjectivity seems to have much better coverage, but it was also fairly inaccurate.

More Labeling Functions

We previously created a keyword-based labeling function for "check". Because using keywords is such a common thing Snorkel has a way to create them with a little less work than creating the labeling functions individually.

First we make a function that checks if any of a collection of keywords is in the comment.

def lookup_keyword(row: pandas.Series, keywords: list, label: int) -> int:
    """check if any of the keywords are in the comment

    Args:
     row: the series with the Comment
     keywords: collection of keywords indicating spam
     label: what to return if the keyword is in the comment

    Returns:
     label if keyword in comment else -1
    """
    return label if any(keyword in row.CONTENT.lower() for keyword in keywords) else Comment.is_ambiguous

Now we make the labeling-function creator that uses the lookup_keyword.

def make_keyword_labeling_function(keywords: list, label: int=Comment.is_spam) -> LabelingFunction:
    """Makes LabelingFunction objects that check keywords"""
    return LabelingFunction(
        name=f"keyword_{keywords[0]}",
        f=lookup_keyword,
        resources=dict(keywords=keywords, label=label)
    )
keyword_my = make_keyword_labeling_function(keywords=["my"])
keyword_subscribe = make_keyword_labeling_function(keywords=["subscribe"])
keyword_link = make_keyword_labeling_function(keywords=["http"])
keyword_please = make_keyword_labeling_function(keywords=["please", "plz"])
keyword_song = make_keyword_labeling_function(keywords=["song"], label=Comment.is_ham)
labeling_functions = [
    keyword_my,
    keyword_subscribe,
    keyword_link,
    keyword_please,
    keyword_song,
]
applier = PandasLFApplier(lfs=labeling_functions)
train_label_matrix = applier.apply(train, progress_bar=False)
development_label_matrix = applier.apply(development, progress_bar=False)
print(TABLE(LFAnalysis(development_label_matrix, labeling_functions).lf_summary(Y=development.CLASS.values)))
  j Polarity Coverage Overlaps Conflicts Correct Incorrect Emp. Acc.
keyword_my 0 [1] 0.18 0.115 0.05 33 3 0.916667
keyword_subscribe 1 [1] 0.125 0.075 0.015 25 0 1
keyword_http 2 [1] 0.09 0.03 0.005 16 2 0.888889
keyword_please 3 [1] 0.095 0.08 0.02 19 0 1
keyword_song 4 [0] 0.16 0.06 0.06 20 12 0.625

There are varying degrees of coverage and accuracy with these. Interestingly, the subscribe keyword was completely accurate and had pretty good coverage (compared to our check-out labelers).

Adding a Spacy Preprocessor

The purpose of the pre-processors is to do a little feature engineering: adding features that aren't in the original dataset but can be derived from it. Because spaCy is used so much for this, Snorkel comes with a labeling-function decorator that adds a doc attribute to each row (you can also create the preprocessor manually to get more control, as sketched after the next code block).

@nlp_labeling_function()
def short_with_person(row: pandas.Series) -> int:
    """Check if the comment is short and mentions a person"""
    return (Comment.is_ham if (len(row.CONTENT) < 20 and any(entity.label_ == "PERSON" for entity in row.doc.ents))
            else Comment.is_ambiguous)
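
As mentioned above, you can also build the spaCy preprocessor by hand instead of using the decorator. Here's a minimal sketch, assuming Snorkel's SpacyPreprocessor class and an installed en_core_web_sm model; short_with_person_manual is just my name for the re-done function.

from snorkel.labeling import labeling_function
from snorkel.preprocess.nlp import SpacyPreprocessor

# building the preprocessor manually lets you pick the text column,
# the name of the added attribute, and whether to cache the results
spacy_preprocessor = SpacyPreprocessor(
    text_field="CONTENT", doc_field="doc", memoize=True)

@labeling_function(pre=[spacy_preprocessor])
def short_with_person_manual(row: pandas.Series) -> int:
    """Check if the comment is short and mentions a person"""
    return (Comment.is_ham if (len(row.CONTENT) < 20 and any(entity.label_ == "PERSON" for entity in row.doc.ents))
            else Comment.is_ambiguous)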

Snorkel Example: Building a Spam Dataset

Beginning

This is a walk-through of the Snorkel Get Started tutorial which shows how you can use it to build a labeled dataset. It uses the YouTube Spam Collection data set (downloaded from the UCI Machine Learning Repository). The data was collected in 2015 and represents comments from five of the ten most popular videos on YouTube. It is a tabular dataset with the columns COMMENT_ID, AUTHOR, DATE, CONTENT, TAG. The tag represents whether it was considered Spam or not, so we'll pretend it isn't there for most of this walk-through.

Imports

Python

from argparse import Namespace
from functools import partial
from pathlib import Path
import random
import re

PyPi

from nltk.corpus import wordnet
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from snorkel.augmentation import ApplyOnePolicy, PandasTFApplier, transformation_function
from snorkel.labeling import labeling_function, LabelModel, PandasLFApplier
from snorkel.slicing import slicing_function
from textblob import TextBlob
import hvplot.pandas
import nltk
import pandas

Others

from graeae import CountPercentage, EmbedHoloviews

Set Up

The WordNet Corpus

nltk.download("wordnet", quiet=True)

Plotting

Embed = partial(EmbedHoloviews, folder_path="../../../files/posts/libraries/snorkel/snorkel-example-building-a-spam-dataset")

The Dataset

The data is split up into separate files - one per artist/video (they are named after the artist, and each artist appears to have only one video) - so I'm going to concatenate them back together and add an artist column.

path = Path("~/data/datasets/texts/you_tube_comments/").expanduser()
sets = []
for name in path.glob("*.csv"):
    artist = name.stem.split()[-1]
    data = pandas.read_csv(name)
    data["artist"] = artist
    sets.append(data)
    print(artist)
data = pandas.concat(sets)
KatyPerry
LMFAO
Eminem
Shakira
Psy

Splitting the Set

train, test = train_test_split(data)
print(train.shape)
print(test.shape)

train, development = train_test_split(train)
validation, test = train_test_split(test)
print(train.shape)
print(development.shape)
print(validation.shape)
print(test.shape)
(1467, 6)
(489, 6)
(1100, 6)
(367, 6)
(366, 6)
(123, 6)
grouped = train.groupby(["artist"]).agg({"COMMENT_ID": "count"}).reset_index().rename(columns={"COMMENT_ID": "Count"})
plot = grouped.hvplot.bar(x="artist", y="Count").opts(title="Comments by Artist", width=1000, height=800)
Embed(plot=plot, file_name="comments_by_artist")()

Figure Missing

grouped = train.groupby(["artist", "CLASS"]).agg({"COMMENT_ID": "count"}).reset_index().rename(columns={"COMMENT_ID": "Count"})
plot = grouped.hvplot.bar(x="artist", y="Count", by="CLASS").opts(title="Comments by Artist and Class", width=1000, height=800)
Embed(plot=plot, file_name="comments_by_artist_and_class")()

Figure Missing

I said earlier that the spam/not-spam column was named TAG, but it's named CLASS here; I don't know where the switch came from (it says TAG on the UCI page).

Middle

Labeling Functions

Labeling functions output a label for values in the training set.

Labels

Label = Namespace(
    abstain = -1,
    not_spam = 0,
    spam = 1,
)

The actual data set only has spam/not-spam classes, but the Snorkel tutorial adds the abstain class as well. Each labeling function is going to be passed a row from the training dataframe, so the column names you use (CONTENT here) have to match the dataframe's.

Keyword Matching

@labeling_function()
def labeling_by_keyword(comment: pandas.Series) -> int:
    """Assume if the author refers to something he/she owns it's spam

    Args: 
     row with comment CONTENT

    Returns:
     label for the comment
    """
    return Label.spam if "my" in comment.CONTENT.lower() else Label.abstain

Regular Expressions

@labeling_function()
def label_check_out(comment) -> int:
    """check my/it/the out will be spam"""
    return Label.spam if re.search(r"check.*out", comment.CONTENT, flags=re.I) else Label.abstain

Short Comments

@labeling_function()
def label_short_comment(comment) -> int:
    """if a comment is short it's probably not spam"""
    return Label.not_spam if len(comment.CONTENT.split()) < 5 else Label.abstain

Positive Comments

Here we'll use TextBlob to try and decide whether a comment is positive (TextBlob uses the pattern library to score the polarity).

@labeling_function()
def label_positive_comment(comment) -> int:
    """If a comment is positive, we'll accept it"""
    return Label.not_spam if TextBlob(comment.CONTENT).sentiment.polarity > 0.3 else Label.abstain

Combining the Functions and Cleaning the Labels

First I'll create a list of the labeling functions so that we can pass it to the label-applier class.

labeling_functions = [labeling_by_keyword, label_check_out, label_short_comment, label_positive_comment]

Now create the applier.

applier = PandasLFApplier(labeling_functions)

Now apply it to the training set.

label_matrix = applier.apply(train, progress_bar=False)

print(label_matrix.shape)
print(train.shape)
(1100, 4)
(1100, 6)

The label-matrix has one row for each of the comments in our training set and one column for each of our labeling functions.

label_frame = pandas.DataFrame(label_matrix, columns=["keyword", "check_out", "short", "positive"])
re_framed = {}

for column in label_frame.columns:
    # the original line here was left unfinished - tallying each
    # labeling function's outputs is an assumed completion
    re_framed[column] = label_frame[column].value_counts()
print(pandas.DataFrame(re_framed))

Training the Label Model

label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(label_matrix, n_epochs=500, log_freq=50, seed=0)
train["label"] = label_model.predict(L=label_matrix, tie_break_policy="abstain")
grouped = train.groupby(["label", "artist"]).agg({"COMMENT_ID": "count"}).reset_index().rename(columns={"COMMENT_ID": "count"})
plot = grouped.hvplot.bar(x="label", y="count", by="artist").opts(title="Label Counts", height=800, width=1000)
Embed(plot=plot, file_name="label_counts")()

Figure Missing

Most comments were labeled spam or not-spam, but the model abstained on some. In order to move on to the next section, we'll drop the rows where the model abstained.

train = train[train.label != Label.abstain]
CountPercentage(train.label)()
Value Count Percent (%)
0 419 53.51
1 364 46.49
matched = sum(train.label == train.CLASS)
print(f"{matched/len(train): .2f}")
0.51

Of the comments that were given a label, only a little more than half agree with the labels assigned by the dataset creators.

Data Augmentation

We're going to create new entries in the data by randomly replacing words with their synonyms.

Synonym Lookup Function

def synonyms_for(word: str) -> list:
    """get synonyms for word"""
    lemmas = set().union(*[synset.lemmas() for synset in wordnet.synsets(word)])
    return list(set(lemma.name().lower().replace("_" , " ") for lemma in lemmas) - {word})
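
As a quick check, here's what the lookup returns for an arbitrary word (the exact list depends on the installed WordNet corpus):

# "check" is just an arbitrary example word
print(synonyms_for("check"))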

The Transformation Function

@transformation_function()
def replace_word_with_synonym(comment: pandas.Series) -> pandas.Series:
    """Replace one of the words with a synonym

    Args:
     comment: row with a comment

    Returns:
     comment with a word replaced
    """
    tokens = comment.CONTENT.lower().split()
    index = random.choice(range(len(tokens)))
    synonyms = synonyms_for(tokens[index])
    if synonyms:
        comment.CONTENT = " ".join(tokens[:index] + [synonyms[0]] + tokens[index + 1 :])
    return comment
transform_policy = ApplyOnePolicy(n_per_original=2, keep_original=True)
transform_applier = PandasTFApplier([replace_word_with_synonym], transform_policy)
train_augmented = transform_applier.apply(train, progress_bar=False)
print(train_augmented[:3].CONTENT)
415           very good song:)
415    very respectable song:)
415           very good song:)
Name: CONTENT, dtype: object

Because it's random, we don't always end up with different content.

print(f"{len(train_augmented):,}")
train_augmented = train_augmented.drop_duplicates(subset="CONTENT")
print(f"{len(train_augmented):,}")
print(f"{len(train):,}")
2,349
1,357
783

So we added some content.

Slicing

A slice is a subset of the data; here we want to identify slices that might be more important than others. In this case we're going to assume we've discovered that shortened links are more likely to be malicious, so we want to pay extra attention to them.

@slicing_function()
def short_link(comment: pandas.Series) -> int:
    """checks for shortened links in the comment

    Args:
     comment: row with comment in it

    Returns:
     1 if short-link detected, 0 otherwise
    """
    return int(bool(re.search(r"\w+\.ly", comment.CONTENT)))
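
The tutorial stops at defining the slicing function, but you can pull out the rows it flags to eyeball them; a minimal sketch, assuming Snorkel's slice_dataframe helper:

from snorkel.slicing import slice_dataframe

# grab just the development-set comments that short_link flags
short_link_comments = slice_dataframe(development, short_link)
print(short_link_comments.CONTENT.head())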

End

Train A Classifier

training_text = train_augmented.CONTENT.tolist()
vectorizer = CountVectorizer(ngram_range=(1, 2))
x_train = vectorizer.fit_transform(training_text)

classifier = LogisticRegression(solver="lbfgs")
classifier.fit(x_train, train_augmented.label.values)
development_test = vectorizer.transform(development.CONTENT)
development["predicted"] = classifier.predict(development_test)

print(f"{sum(development.CLASS == development.predicted)/len(development):.2f}")
0.54

So our model is almost random.

print(metrics.classification_report(development.CLASS, development.predicted, target_names=["not spam", "spam"]))
              precision    recall  f1-score   support

    not spam       0.56      0.61      0.58       194
        spam       0.51      0.45      0.48       173

    accuracy                           0.54       367
   macro avg       0.53      0.53      0.53       367
weighted avg       0.53      0.54      0.53       367

Training on the Original Labels

vectorizer = CountVectorizer(ngram_range=(1, 2))
x_train = vectorizer.fit_transform(train.CONTENT)

classifier = LogisticRegression(solver="lbfgs")
classifier.fit(x_train, train.CLASS.values)

development_test = vectorizer.transform(development.CONTENT)
predicted = classifier.predict(development_test)

print(metrics.classification_report(development.CLASS, predicted, target_names=["not spam", "spam"]))
              precision    recall  f1-score   support

    not spam       0.91      0.97      0.94       194
        spam       0.97      0.90      0.93       173

    accuracy                           0.94       367
   macro avg       0.94      0.94      0.94       367
weighted avg       0.94      0.94      0.94       367

So our self-labeled data set really hurt the performance. This was the Getting Started tutorial, though, meant only to skim the basic procedure; hopefully tuning the labeling and transformation functions would improve the performance.

Trying out DABEST

Beginning

Imports

PyPi

from sklearn.datasets import load_iris
import dabest
import pandas
import seaborn

Set Up

The Data Set

iris = load_iris()
print(iris.DESCR)
Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...
data = pandas.DataFrame(iris.data, columns=iris.feature_names)
target = pandas.Series(iris.target)
names = dict(zip(range(len(iris.target_names)), iris.target_names))
data["species"] = target.map(names)

Middle

Petal Width

iris_dabest = dabest.load(data=data, x="species", y="petal width (cm)", idx=iris.target_names)
iris_dabest.mean_diff.plot()

Figure Missing

Hedges' G

Hedges' g is a measure of effect size, similar to Cohen's d but with better properties when the samples are small or the sample sizes differ.
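
To make the correction concrete, here's a hand computation of both statistics using the usual pooled-standard-deviation formula and the small-sample correction factor J = 1 - 3/(4(n_a + n_b) - 9). This is a plain-formula sketch, not what DABEST computes internally (it bootstraps), but the values should land close to its point estimates.

import numpy

def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled = numpy.sqrt(
        ((n_a - 1) * a.std(ddof=1)**2 + (n_b - 1) * b.std(ddof=1)**2)
        / (n_a + n_b - 2))
    return (b.mean() - a.mean()) / pooled

def hedges_g(a, b):
    """Cohen's d scaled by the small-sample correction factor."""
    correction = 1 - 3 / (4 * (len(a) + len(b)) - 9)
    return correction * cohens_d(a, b)

setosa = data[data.species == "setosa"]["petal width (cm)"]
versicolor = data[data.species == "versicolor"]["petal width (cm)"]
print(cohens_d(setosa, versicolor))
print(hedges_g(setosa, versicolor))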

iris_dabest.hedges_g.plot()

Figure Missing

print(iris_dabest.hedges_g)
DABEST v0.2.7
=============
             
Good afternoon!
The current time is Mon Dec 16 15:56:24 2019.

The unpaired Hedges' g between setosa and versicolor is 6.76 [95%CI 5.71, 7.86].
The two-sided p-value of the Mann-Whitney test is 2.28e-18.

The unpaired Hedges' g between setosa and virginica is 8.49 [95%CI 7.08, 9.77].
The two-sided p-value of the Mann-Whitney test is 2.43e-18.

5000 bootstrap samples were taken; the confidence interval is bias-corrected and accelerated.
The p-value(s) reported are the likelihood(s) of observing the effect size(s),
if the null hypothesis of zero difference is true.

To get the results of all valid statistical tests, use `.hedges_g.statistical_tests`

According to Wikipedia, an effect size of 2 is "huge", so with values of 6.76 (setosa vs versicolor) and 8.49 (setosa vs virginica) we might conclude that there is a substantial difference between the petal width of setosa and that of the other two species.

I don't think that's really what this is meant for, but I just wanted to see how it works.

Cohen's D

iris_dabest.cohens_d.plot()

Figure Missing

print(iris_dabest.cohens_d)
DABEST v0.2.7
=============
             
Good afternoon!
The current time is Mon Dec 16 16:46:25 2019.

The unpaired Cohen's d between setosa and versicolor is 6.82 [95%CI 5.76, 7.92].
The two-sided p-value of the Mann-Whitney test is 2.28e-18.

The unpaired Cohen's d between setosa and virginica is 8.56 [95%CI 7.13, 9.85].
The two-sided p-value of the Mann-Whitney test is 2.43e-18.

5000 bootstrap samples were taken; the confidence interval is bias-corrected and accelerated.
The p-value(s) reported are the likelihood(s) of observing the effect size(s),
if the null hypothesis of zero difference is true.

To get the results of all valid statistical tests, use `.cohens_d.statistical_tests`

In this case Cohen's d and Hedges' g look similar.

A First Look At Snorkel

Beginning

This is a walk-through of Snorkel's Get Started page.

Imports

Python

from argparse import Namespace
from datetime import datetime
from functools import partial
from pathlib import Path

PyPi

from sklearn.model_selection import train_test_split
import hvplot.pandas
import pandas

Others

from graeae import CountPercentage, EmbedHoloviews

Set Up

Some Constants

There are two classes in the data set - spam and not spam, and for the labeling that we're going to do we also need a third value for the cases where the code can't give it a label.

classified_as = Namespace(
    not_spam = 0,
    spam = 1,
    unknown = -1,
)

The Plotting

path = "../../files/posts/data/a-first-look-at-snorkel"
Embed = partial(EmbedHoloviews, folder_path=path)

The Data Set

This dataset is a set of comments taken from YouTube videos and hosted on this site. The comments are for music videos by five artists.

path = Path("~/data/datasets/texts/you_tube_comments/").expanduser()
assert path.is_dir()

parts = []
for name in path.glob("*csv"):
    data = pandas.read_csv(name)    
    data["artist"] = name.name.split()[-1].split(".")[0]
    parts.append(data)

data = pandas.concat(parts)

Rename the Columns

This just makes it easier for me since it matches my style.

Column = Namespace(
    comment_id = "comment_id",
    author = "author",
    datetime = "datetime",
    text = "text",
    label = "label",
    artist = "artist",
)
renames = {"COMMENT_ID": Column.comment_id,
           "AUTHOR": Column.author,
           "DATE": Column.datetime,
           "CONTENT": Column.text,
           "CLASS": Column.label}
data = data.rename(columns=renames)

Middle

Setting Up the Training and Testing Sets

x_train, x_test, y_train, y_test = train_test_split(data[[Column.comment_id,
                                                          Column.author,
                                                          Column.datetime,
                                                          Column.text,
                                                          Column.artist]],
                                                    data[Column.label],
                                                    test_size=0.2)
x_train, x_dev, y_train, y_dev = train_test_split(x_train, y_train,
                                                  test_size=0.1)
x_test, x_validation, y_test, y_validation = train_test_split(x_test, y_test,
                                                              test_size=0.1)

print(f"Training Size: {len(x_train):,}")
print(f"Development Size: {len(x_dev):,}")
print(f"Validation Size: {len(x_validation):,}")
print(f"Test Size: {len(x_test):,}")
Training Size: 1,407
Development Size: 157
Validation Size: 40
Test Size: 352

Looking at the Data

print(x_dev.sample().iloc[0])
If I get 100 subscribers, I will summon Freddy Mercury's ghost to whipe  from the face of earth One Direction and Miley Cirus.
print(f"{len(data):,}")
1,956

Spam and Ham

There are two classes in the dataset - SPAM (1) and not-spam (0), sometimes called HAM.

counter = CountPercentage(data.label)
counter()
Value Count Percent (%)
1 1,005 51.38
0 951 48.62
grouped = data.groupby([Column.artist]).agg({Column.label: "sum", Column.comment_id: "count"}).reset_index().rename(
    columns={Column.label: "spam", Column.comment_id: "total"})
grouped["ham"] = grouped.total - grouped.spam
plotter = grouped[[Column.artist, "spam", "ham"]]
plot = plotter.hvplot.bar(x=Column.artist, stacked=True, legend=True,).opts(
    title="Spam Counts",
    width=1000, height=800)
Embed(plot=plot, file_name="spam_counts")()

Figure Missing

The Dates

I'll look at when the comments were made, just to see.

print(len(data[data[Column.datetime].isna()]))
245
with_date = data[~data[Column.datetime].isna()]
with_date.loc[:, Column.datetime] = pandas.to_datetime(with_date[Column.datetime])
with_date.loc[:, "Month"] = with_date[Column.datetime].apply(lambda date: datetime(date.year, date.month, 1))
group = with_date.groupby(["Month", Column.artist, Column.label]).agg(
    {Column.comment_id: "count"}).reset_index().rename(
        columns={Column.comment_id: "Count",
                 Column.artist: "Artist"})
spam = group[group[Column.label] == classified_as.spam]
ham = group[group[Column.label] == classified_as.not_spam]
spam_plot = spam.hvplot(x="Month", y="Count", by="Artist", label="Spam")
plot = spam_plot.opts(title="Monthly Spam By Artist", width=1000, height=800)
Embed(plot=plot, file_name="monthly_spam_by_artist")()

Figure Missing

Some Samples

  • SPAM
    spam = data[data[Column.label]==classified_as.spam].sample(5)
    for index in range(len(spam)):
        print(f"({spam.iloc[index][Column.artist]}): {spam.iloc[index][Column.text]}")
    
    (Eminem): Do you need more instagram followers or photo likes? Check out IGBlast.com and get em in minutes!
    (Eminem): Check out my channel im 15 year old rapper!
    (Shakira): Part 5. Comforter of the afflicted, pray for us Help of Christians, pray for us Queen of Angels, pray for us Queen of Patriarchs, pray for us Queen of Prophets, pray for us Queen of Apostles, pray for us Queen of Martyrs, pray for us Queen of Confessors, pray for us Queen of Virgins, pray for us Queen of all Saints, pray for us Queen conceived without original sin, pray for us Queen of the most holy Rosary, pray for us Queen of the family, pray for us Queen of peace, pray for us 
    (Eminem): Hey guys I&#39;m 87 cypher im 11 years old and Rap is my life I recently made my second album desire ep . please take a moment to check out my album on YouTube thank you very much for reading every like comment and subscription counts
    (Eminem): Check out this video on YouTube:
    
  • Ham
    ham = data[data[Column.label]==classified_as.not_spam].sample(5)
    for index in range(len(ham)):
        print(f"({ham.iloc[index][Column.artist]}): {ham.iloc[index][Column.text]}")
    
    (Eminem): charlieee :DDDD (Those who saw Lost only will understand)
    (LMFAO): BEST PARTY SONG LITERALLY PARTY ROCK IS IN THE HOUSEE TONIGHT!!!!
    (LMFAO): I like how the robot shuffles he shuffles good
    (KatyPerry): ROAAAAARRRRRR 🐯🐯🐯
    (Shakira): like me
    

Labeling Functions

End

Citations

  • Alberto, T.C., Lochter J.V., Almeida, T.A. TubeSpam: Comment Spam Filtering on YouTube. Proceedings of the 14th IEEE International Conference on Machine Learning and Applications (ICMLA'15), 1-6, Miami, FL, USA, December, 2015. (preprint)

SQL Module 3

All the questions in this Quiz use the Chinook Database.

Question One

Using a subquery, find the names of all the tracks for the album "Californication".

SELECT
name
FROM tracks
LEFT JOIN albums ON tracks.albumid=albums.albumid
WHERE albums.title="Californication"

What is the 8th Track? Porcelain.

Question Two

Find the total number of invoices for each customer along with the customer's full name, city and email.

SELECT
COUNT(invoiceid), firstname, lastname, city, email
FROM invoices
LEFT JOIN customers ON invoices.customerid=customers.customerid
GROUP BY invoices.customerid

What is the email address of the 5th person, Frantisek Wichterlova?

frantisekw@jetbrains.com

Question Three

Retrieve the track name, album, artistID, and trackID for all the albums.

SELECT
name, title, artistid, trackid
FROM tracks
LEFT JOIN albums ON tracks.albumid=albums.albumid
WHERE title="For Those About To Rock We Salute You" AND trackid=12

What is the song title of trackID 12 from the "For Those About to Rock We Salute You" album? Enter the answer below

Breaking The Rules

Question Four

Retrieve a list with the manager's last name and the last names of the employees who report to him or her.

SELECT
A.LastName, B.LastName
FROM employees AS A
LEFT JOIN employees AS B ON A.ReportsTo=B.EmployeeId

After running the query described above, who are the reports for the manager named Mitchell (select all that apply)?

King and Callahan.

Question Five

Find the name and ID of the artists who do not have albums.

SELECT
name, ArtistId
FROM artists
WHERE ArtistId NOT IN (
  SELECT
  ArtistId
  FROM albums
  )

After running the query described above, two of the records returned have the same last name. Enter that name below.

Gilberto

Question Six

Use a UNION to create a list of all the employees' and customers' first and last names, ordered by last name in descending order.

SELECT
FirstName, LastName
FROM customers
UNION
SELECT
FirstName, LastName
FROM employees
ORDER BY LastName DESC

After running the query described above, determine what is the last name of the 6th record? Enter it below. Remember to order things in descending order to be sure to get the correct answer.

Taylor

Question Seven

See if there are any customers who have a different city listed in their billing city versus their customer city.

SELECT
customers.CustomerID, City, BillingCity
FROM customers
LEFT JOIN invoices ON customers.CustomerID=invoices.CustomerId
WHERE City != BillingCity

No, there aren't.

Chinook Questions

Beginning

The following questions use the SQLite Sample Database - called the Chinook database - found on the SQLite Tutorial site.

It is made up of 11 tables:

Table Description
employees stores employees data such as employee id, last name, first name, etc. It also has a field named ReportsTo to specify who reports to whom.
customers stores customers data.
invoices stores invoice header data.
invoice_items stores the invoice line items data.
artists stores artists data. It is a simple table that contains only artist id and name.
albums stores data about a list of tracks. Each album belongs to one artist. However, one artist may have multiple albums.
media_types stores media types such as MPEG audio and AAC audio file.
genres stores music types such as rock, jazz, metal, etc.
tracks stores the data of songs. Each track belongs to one album.
playlists stores data about playlists. Each playlist contains a list of tracks. Each track may belong to multiple playlists.
playlist_track The relationship between the playlists table and tracks table is many-to-many. The playlist_track table is used to reflect this relationship.

Middle

Question One

Pull a list of customer ids with the customer’s full name, and address, along with combining their city and country together. Be sure to make a space in between these two and make it UPPER CASE.

What is the city and country result for CustomerID 16?
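
A sketch of one way to write this in SQLite (the alias names are my own):

SELECT
CustomerId,
FirstName || ' ' || LastName AS FullName,
Address,
UPPER(City || ' ' || Country) AS CityCountry
FROM customers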

Question Two

Create a new employee user id by combining the first 4 letters of the employee’s first name with the first 2 letters of the employee’s last name. Make the new field lower case and pull each individual step to show your work.

What is the final result for Robert King?
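
A sketch showing each individual step as its own column (the alias names are my own):

SELECT
FirstName,
LastName,
SUBSTR(FirstName, 1, 4) AS FirstFour,
SUBSTR(LastName, 1, 2) AS FirstTwo,
LOWER(SUBSTR(FirstName, 1, 4) || SUBSTR(LastName, 1, 2)) AS UserId
FROM employees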

Question Three

Show a list of employees who have worked for the company for 15 or more years using the current date function. Sort by lastname ascending.

What is the lastname of the last person on the list returned?
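
A sketch using SQLite's date functions, comparing HireDate against a date fifteen years before now:

SELECT
LastName,
HireDate
FROM employees
WHERE HireDate <= DATE('now', '-15 years')
ORDER BY LastName ASC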

Question Four

Profiling the Customers table, answer the following question.

Are there any columns with null values? Indicate any below. Select all that apply.
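
One way to profile for nulls is to compare COUNT(*) with per-column counts, since COUNT(column) skips NULLs; a sketch over a few likely columns (the alias names are my own):

SELECT
COUNT(*) - COUNT(Company) AS CompanyNulls,
COUNT(*) - COUNT(State) AS StateNulls,
COUNT(*) - COUNT(PostalCode) AS PostalCodeNulls,
COUNT(*) - COUNT(Fax) AS FaxNulls
FROM customers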

Question Five

Find the cities with the most customers and rank in descending order.

Which of the following cities indicate having 2 customers?
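
A sketch of the grouping and ranking:

SELECT
City,
COUNT(CustomerId) AS Customers
FROM customers
GROUP BY City
ORDER BY Customers DESC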

Question Six

Create a new customer invoice id by combining a customer’s invoice id with their first and last name while ordering your query in the following order: firstname, lastname, and invoiceID.

Select all of the correct "AstridGruber" entries that are returned in your results below. Select all that apply.
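
A sketch of the concatenation and ordering (the NewId alias is my own):

SELECT
customers.FirstName || customers.LastName || invoices.InvoiceId AS NewId
FROM customers
JOIN invoices ON customers.CustomerId = invoices.CustomerId
ORDER BY customers.FirstName, customers.LastName, invoices.InvoiceId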

Question Seven

End