Linear Classification
1 Imports
import pandas
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
2 The Data
cancer = load_breast_cancer()
print(cancer.keys())
dict_keys(['target_names', 'feature_names', 'data', 'DESCR', 'target'])
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target)
3 Logistic Regression
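# The L1 penalty needs a solver that supports it. liblinear does; the lbfgs
# default in scikit-learn 0.22 and later does not.
logistic_model = LogisticRegression(penalty="l1", solver="liblinear")
logistic_model.fit(X_train, y_train)
print("Logistic Training Accuracy: {:.2f}".format(logistic_model.score(X_train, y_train)))
print("Logistic Testing Accuracy: {:.2f}".format(logistic_model.score(X_test, y_test)))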
Logistic Training Accuracy: 0.97
Logistic Testing Accuracy: 0.92
The train-test split is not seeded, so it changes from run to run, and on some splits the model does better on the testing set than on the training set.
coefficients = pandas.Series(logistic_model.coef_[0], index=cancer.feature_names)
print(coefficients)
mean radius                 2.257338
mean texture                0.058581
mean perimeter             -0.001644
mean area                  -0.009889
mean smoothness             0.000000
mean compactness            0.000000
mean concavity              0.000000
mean concave points         0.000000
mean symmetry               0.000000
mean fractal dimension      0.000000
radius error                0.000000
texture error               2.657975
perimeter error             0.000000
area error                 -0.118846
smoothness error            0.000000
compactness error           0.000000
concavity error             0.000000
concave points error        0.000000
symmetry error              0.000000
fractal dimension error     0.000000
worst radius                1.635063
worst texture              -0.412327
worst perimeter            -0.201013
worst area                 -0.022727
worst smoothness            0.000000
worst compactness           0.000000
worst concavity            -4.246229
worst concave points        0.000000
worst symmetry              0.000000
worst fractal dimension     0.000000
dtype: float64
features = len(cancer.feature_names)
non_zero = coefficients[coefficients != 0]
print(non_zero)
print(len(non_zero) / features)
mean radius        2.257338
mean texture       0.058581
mean perimeter    -0.001644
mean area         -0.009889
texture error      2.657975
area error        -0.118846
worst radius       1.635063
worst texture     -0.412327
worst perimeter   -0.201013
worst area        -0.022727
worst concavity   -4.246229
dtype: float64
0.36666666666666664
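The printed ratio is the fraction of features the model kept: 11 of the 30 coefficients (37%) are non-zero, so the L1 penalty removed the other 19 (63%).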
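The kept-versus-removed arithmetic can be checked with a small helper. This is only a sketch: `sparsity_summary` is a hypothetical name, not part of scikit-learn or pandas, and the example vector reuses the 11 non-zero coefficients printed above, padded with zeros to 30 features.

```python
import numpy as np


def sparsity_summary(coefficients):
    """Return (kept, removed, fraction_removed) for a coefficient vector."""
    coefficients = np.asarray(coefficients, dtype=float)
    kept = int(np.count_nonzero(coefficients))
    removed = coefficients.size - kept
    return kept, removed, removed / coefficients.size


# The non-zero coefficients from the output above.
non_zero_values = [2.257338, 0.058581, -0.001644, -0.009889, 2.657975,
                   -0.118846, 1.635063, -0.412327, -0.201013, -0.022727,
                   -4.246229]
# Pad with zeros so the vector has one entry per feature (30 in this dataset).
example = non_zero_values + [0.0] * (30 - len(non_zero_values))

kept, removed, fraction_removed = sparsity_summary(example)
print(kept, removed, round(fraction_removed, 2))  # 11 19 0.63
```

With a fitted model the same helper could be called as sparsity_summary(logistic_model.coef_[0]).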
model = LogisticRegression(C=100)
model.fit(X_train, y_train)
print("Training Accuracy: {0:.2f}".format(model.score(X_train, y_train)))
print("Testing Accuracy: {0:.2f}".format(model.score(X_test, y_test)))
Training Accuracy: 0.98
Testing Accuracy: 0.95
Setting C to 100 with the default L2 penalty improves the testing accuracy from 0.92 to 0.95. C is the inverse of the regularization strength, so increasing it means less regularization; in this case the improvement came from using a more complex model.
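One way to see that a larger C means weaker regularization is to watch the size of the learned weights grow with C. The sketch below makes some assumptions that are not in the original: it fits on the whole dataset rather than a train-test split (so the result does not depend on a random split), it standardizes the features, and it raises max_iter so the solver has room to converge.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()

coefficient_norms = {}
for C in (0.01, 1, 100):
    # Scaling is an added assumption; it helps the solver converge.
    model = make_pipeline(StandardScaler(),
                          LogisticRegression(C=C, max_iter=5000))
    model.fit(cancer.data, cancer.target)
    weights = model.named_steps["logisticregression"].coef_[0]
    # The Euclidean norm of the weights is a rough measure of model complexity.
    coefficient_norms[C] = float(np.linalg.norm(weights))
    print("C={}: coefficient norm {:.2f}".format(C, coefficient_norms[C]))
```

The norm should increase with C, since the L2 penalty pulls the weights toward zero less strongly as C grows.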
4 Support Vector Machine Classification
for power in range(-4, 4):
    penalty = 10**power
    svc = LinearSVC(C=penalty)
    svc.fit(X_train, y_train)
    print("C={}".format(penalty))
    print("Training Accuracy: {0:.2f}".format(svc.score(X_train, y_train)))
    print("Testing Accuracy: {0:.2f}".format(svc.score(X_test, y_test)))
    print()
C=0.0001
Training Accuracy: 0.93
Testing Accuracy: 0.93

C=0.001
Training Accuracy: 0.93
Testing Accuracy: 0.92

C=0.01
Training Accuracy: 0.70
Testing Accuracy: 0.71

C=0.1
Training Accuracy: 0.94
Testing Accuracy: 0.93

C=1
Training Accuracy: 0.92
Testing Accuracy: 0.92

C=10
Training Accuracy: 0.93
Testing Accuracy: 0.94

C=100
Training Accuracy: 0.86
Testing Accuracy: 0.84

C=1000
Training Accuracy: 0.92
Testing Accuracy: 0.92
Every time I run this it comes out slightly differently, but most values of C do pretty well: there are usually only one or two that score below 0.92 on the test set.
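The run-to-run variation can be removed by fixing the random seeds. The sketch below is one possible setup, not what was run above: the seed value, the `scores_for` helper, the feature scaling, and the larger max_iter are all assumptions added for illustration. Scaling is included because the solver behind LinearSVC tends to converge more reliably on standardized features, which may also explain some of the erratic scores above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

SEED = 0  # arbitrary value chosen for illustration


def scores_for(C, seed=SEED):
    """Fit a scaled LinearSVC on a seeded split; return (train, test) accuracy."""
    cancer = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(
        cancer.data, cancer.target, stratify=cancer.target, random_state=seed)
    model = make_pipeline(
        StandardScaler(),
        LinearSVC(C=C, random_state=seed, max_iter=10000))
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)


first = scores_for(1)
second = scores_for(1)
# With the same seed, both the split and the solver behave identically.
print(first == second)
```

Passing a different seed gives a different split, which is a simple way to see how much of the spread in the scores comes from the split rather than from C.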
5 Tuning the Penalty
The L2 penalty makes use of all of the features, so it will generally do better if they are all relevant. The L1 penalty drives some coefficients to exactly zero, so it is better for interpreting which features are important and will do better if some of the features are in fact not relevant. Unlike alpha for regression, C decreases the regularization as it gets bigger. When searching for the best value it can be useful to search a logarithmic space (e.g. 0.001, 0.01, 0.1, 1, 10, 100).
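A logarithmic search like the one suggested above might look like the following sketch. numpy.logspace builds the six example values, and GridSearchCV picks among them by cross-validation. The pipeline with feature scaling, the five folds, the seed, and max_iter are assumptions added here, not settings taken from the experiments above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Six values evenly spaced on a log scale: 0.001, 0.01, 0.1, 1, 10, 100.
C_grid = np.logspace(-3, 2, num=6)

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0)

# make_pipeline names the step after the lowercased class name,
# so the parameter to search is "logisticregression__C".
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
search = GridSearchCV(pipeline, {"logisticregression__C": C_grid}, cv=5)
search.fit(X_train, y_train)

best_C = search.best_params_["logisticregression__C"]
print("Best C: {}".format(best_C))
print("Testing Accuracy: {:.2f}".format(search.score(X_test, y_test)))
```

If the best value lands on either end of the grid, it is usually worth extending the range by another power of ten or two in that direction.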