Predicting Cancer (Course 3, Assignment 1)
This assignment uses the Breast Cancer Wisconsin (Diagnostic) Database to create a classifier that can help diagnose patients.
The data
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
This data set has 569 rows (cases) with 30 numeric features. The outcomes are either 0 - malignant or 1 - benign.
From their description:
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
The object returned by load_breast_cancer() is a scikit-learn Bunch object, which is similar to a dictionary but, like pandas, also supports dot-notation to retrieve attributes whenever the key is a valid attribute name (i.e. no spaces in the key).
print(cancer.keys())
dict_keys(['DESCR', 'target', 'feature_names', 'data', 'target_names'])
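As a quick illustration (not part of the assignment), both access styles reach the same underlying arrays, and target_names shows how the labels are encoded:

print(cancer.data.shape)        # (569, 30) via attribute access
print(cancer['data'].shape)     # (569, 30) via dictionary access
print(cancer.target_names)      # ['malignant' 'benign'] - so target 0 is malignant, 1 is benign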
Question 0 (Example)
How many features does the breast cancer dataset have?
def answer_zero():
    """number of feature names in the data

    Returns:
     int: count of feature names in the 'cancer' data-set
    """
    return len(cancer['feature_names'])
answer_zero()
30
Question 1
Scikit-learn works with lists, numpy arrays, scipy sparse matrices, and pandas DataFrames, so converting the dataset to a DataFrame is not necessary for training this model. A DataFrame does, however, make many tasks easier, such as munging the data, so let's practice creating a classifier with a pandas DataFrame.
import numpy
import pandas

def answer_one():
    """converts the sklearn 'cancer' bunch

    Returns:
     pandas.DataFrame: cancer data
    """
    data = numpy.c_[cancer.data, cancer.target]
    columns = numpy.append(cancer.feature_names, ["target"])
    return pandas.DataFrame(data, columns=columns)
frame = answer_one()
assert frame.shape == (len(cancer.target), 31)
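As a side check on the claim above that the conversion is optional (a sketch, not part of the graded answer), the DataFrame carries exactly the same values as the raw Bunch arrays that scikit-learn could consume directly:

# The DataFrame columns hold the same values as the raw numpy arrays in the Bunch.
assert (frame[list(cancer.feature_names)].values == cancer.data).all()
assert (frame.target.values == cancer.target).all()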
Question 2
What is the class distribution? (i.e. how many instances of malignant and how many benign?)
def answer_two():
    """calculates the number of malignant and benign cases

    Returns:
     pandas.Series: counts of each
    """
    cancerdf = answer_one()
    # ascending count-order puts malignant (212 cases, target 0) first,
    # then benign (357 cases, target 1), matching the labels assigned below
    counts = cancerdf.target.value_counts(ascending=True)
    counts.index = "malignant benign".split()
    return counts
output = answer_two()
assert output.malignant == 212
assert output.benign == 357
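For a quick cross-check against the raw data (not part of the graded answer), numpy.bincount gives the same distribution straight from cancer.target:

# counts[0] is the number of malignant cases (target 0), counts[1] the benign cases (target 1)
counts = numpy.bincount(cancer.target)
print(counts)  # [212 357]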
Question 3
Split the DataFrame into `X` (the data) and `y` (the labels).
def answer_three():
    """splits the data into data and labels

    Returns:
     (pandas.DataFrame, pandas.Series): data, labels
    """
    cancerdf = answer_one()
    X = cancerdf[cancerdf.columns[:-1]]
    y = cancerdf.target
    return X, y
x, y = answer_three()
assert x.shape == (569, 30)
assert y.shape == (569,)
Question 4
Using train_test_split(), split X and y into training and test sets (X_train, X_test, y_train, and y_test).
from sklearn.model_selection import train_test_split

def answer_four():
    """splits the data into training and testing sets

    Returns:
     tuple: X_train, X_test, y_train, y_test
    """
    X, y = answer_three()
    return train_test_split(X, y, train_size=426, test_size=143, random_state=0)
x_train, x_test, y_train, y_test = answer_four()
assert x_train.shape == (426, 30)
assert x_test.shape == (143, 30)
assert y_train.shape == (426,)
assert y_test.shape == (143,)
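The explicit sizes 426 and 143 are exactly what train_test_split's default 75%/25% split of 569 rows produces, so (as a sketch, not part of the graded answer) the same shapes fall out of the defaults:

# With no sizes given, 25% of the 569 rows (rounded up to 143) are held out for testing,
# leaving 426 rows for training - the same sizes as the explicit arguments above.
X, y = answer_three()
print([part.shape for part in train_test_split(X, y, random_state=0)])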
Question 5
Using KNeighborsClassifier, fit a k-nearest neighbors (knn) classifier with X_train, y_train and using one nearest neighbor (n_neighbors = 1).
from sklearn.neighbors import KNeighborsClassifier

def answer_five():
    """Fits a KNN-1 model to the data

    Returns:
     sklearn.neighbors.KNeighborsClassifier: trained classifier
    """
    X_train, X_test, y_train, y_test = answer_four()
    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(X_train, y_train)
    return model
knn = answer_five()
assert type(knn) == KNeighborsClassifier
assert knn.n_neighbors == 1
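One property worth keeping in mind (a side note, not part of the graded answer): with n_neighbors=1 every training point is its own nearest neighbor, so the classifier should score perfectly on the data it was fit on (barring identical rows with conflicting labels), which is why the test-set accuracy later on is the number that matters:

X_train, X_test, y_train, y_test = answer_four()
# With k=1 each training point retrieves itself, so the training accuracy
# is expected to be 1.0 (assuming no duplicate rows with different labels).
print(knn.score(X_train, y_train))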
Question 6
Using your knn classifier, predict the class label using the mean value for each feature.
You can use cancerdf.mean()[:-1].values.reshape(1, -1) which gets the mean value for each feature, ignores the target column, and reshapes the data from 1 dimension to 2 (necessary for the predict method of KNeighborsClassifier).
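To see what that reshape does (a small illustration, not part of the graded answer), the means start out as a 1-D vector of length 30 and become a single-row 2-D array, i.e. one sample with 30 features:

cancerdf = answer_one()
means = cancerdf.mean()[:-1].values
print(means.shape)                 # (30,) - a 1-D vector, which predict() will not accept
print(means.reshape(1, -1).shape)  # (1, 30) - one row (sample) with thirty features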
def answer_six():
    """Predicts the class label for the means of all features

    Returns:
     numpy.array: prediction (0 or 1)
    """
    cancerdf = answer_one()
    means = cancerdf.mean()[:-1].values.reshape(1, -1)
    model = answer_five()
    return model.predict(means)
answer_six()
array([ 1.])
Question 7
Using your knn classifier, predict the class labels for the test set X_test.
def answer_seven():
    """predicts the class labels (0 = malignant, 1 = benign) for the test set

    Returns:
     numpy.array: vector of predictions
    """
    X_train, X_test, y_train, y_test = answer_four()
    knn = answer_five()
    return knn.predict(X_test)
predictions = answer_seven()
assert predictions.shape == (143,)
assert set(predictions) == {0.0, 1.0}
print("no cancer: {0}".format(len(predictions[predictions==0]))) print("cancer: {0}".format(len(predictions[predictions==1])))
Question 8
Find the score (mean accuracy) of your knn classifier using X_test and y_test.
def answer_eight():
    """calculates the mean accuracy of the KNN model

    Returns:
     float: mean accuracy of the model predicting cancer
    """
    X_train, X_test, y_train, y_test = answer_four()
    knn = answer_five()
    return knn.score(X_test, y_test)
answer_eight()
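score() is just the fraction of test cases labelled correctly, so (as a sketch, not part of the graded answer) the same number comes out of accuracy_score applied to the Question 7 predictions:

from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = answer_four()
# Equivalent to knn.score(X_test, y_test): the fraction of correct test-set predictions.
print(accuracy_score(y_test, answer_seven()))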
Optional plot
Try using the plotting function below to visualize the different prediction scores between the training and test sets, as well as between malignant and benign cells.
%matplotlib inline

def accuracy_plot():
    import matplotlib.pyplot as plt

    X_train, X_test, y_train, y_test = answer_four()

    # Find the training and testing accuracies by target value (i.e. malignant, benign)
    mal_train_X = X_train[y_train==0]
    mal_train_y = y_train[y_train==0]
    ben_train_X = X_train[y_train==1]
    ben_train_y = y_train[y_train==1]

    mal_test_X = X_test[y_test==0]
    mal_test_y = y_test[y_test==0]
    ben_test_X = X_test[y_test==1]
    ben_test_y = y_test[y_test==1]

    knn = answer_five()

    scores = [knn.score(mal_train_X, mal_train_y),
              knn.score(ben_train_X, ben_train_y),
              knn.score(mal_test_X, mal_test_y),
              knn.score(ben_test_X, ben_test_y)]

    plt.figure()

    # Plot the scores as a bar chart
    bars = plt.bar(numpy.arange(4), scores,
                   color=['#4c72b0', '#4c72b0', '#55a868', '#55a868'])

    # directly label the score onto the bars
    for bar in bars:
        height = bar.get_height()
        plt.gca().text(bar.get_x() + bar.get_width()/2, height*.90,
                       '{0:.{1}f}'.format(height, 2),
                       ha='center', color='w', fontsize=11)

    # remove all the ticks (both axes), and tick labels on the Y axis
    plt.tick_params(top=False, bottom=False, left=False, right=False,
                    labelleft=False, labelbottom=True)

    # remove the frame of the chart
    for spine in plt.gca().spines.values():
        spine.set_visible(False)

    plt.xticks([0, 1, 2, 3], ['Malignant\nTraining', 'Benign\nTraining',
                              'Malignant\nTest', 'Benign\nTest'], alpha=0.8)
    plt.title('Training and Test Accuracies for Malignant and Benign Cells', alpha=0.8)

accuracy_plot()