
So for Part 2, let’s evaluate the accuracy of our k-nearest neighbors classifier on our dataset. The question we want to answer: will our model correctly classify normal and abnormal cases when presented with new data?
Before we use our model for real-life predictions, we have to train and test our classifier to get a numeric score that quantifies how accurate it is likely to be in practice. The higher the score, the better.
Using sklearn.model_selection.train_test_split we will split our dataset into random test and train subsets.
from sklearn.model_selection import train_test_split
# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)
Let’s explain what is happening here. The train_test_split() function returns four subsets:
- X_train, X_test, y_train and y_test are all returned by train_test_split() (as numpy arrays when X and y are numpy arrays). X_train and X_test are subsets of the X feature variables, and y_train and y_test are the corresponding subsets of the y labels.
- train_test_split() accepts the following:
- X: our feature variables in the original dataset.
- y: our target variable (the normal/abnormal labels) in the original dataset.
- test_size: the fraction of the dataset held out for testing, in this case 0.4. The split is then 60/40, with 60% of the dataset used for training the classifier and 40% for testing.
- random_state: seeds the shuffling applied to the data before it is split, so we get the same split every time we run the code.
- stratify: passing y here keeps the proportion of each label in the train and test sets as close as possible to the proportion in our original dataset.
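To see what test_size and stratify do in practice, here is a minimal sketch on made-up data. X_demo and y_demo are hypothetical stand-ins (100 rows, 70 labeled 0 and 30 labeled 1), not our actual dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples, 2 features,
# 70 "normal" (0) and 30 "abnormal" (1) labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = np.array([0] * 70 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42, stratify=y_demo
)

# 60% of the rows go to training, 40% to testing
print(X_tr.shape, X_te.shape)

# Fraction of "abnormal" labels in each subset; with stratify it
# should match the 30% found in the full demo dataset
print(y_tr.mean(), y_te.mean())
```

Dropping stratify=y_demo from the call would still give a 60/40 split, but the share of abnormal labels in each subset could then drift away from 30%.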
Once again we define our classifier knn as we did in Part 1:
#our classifier
knn = KNeighborsClassifier(n_neighbors=7)
Then we fit our classifier to the training data:
knn.fit(X_train,y_train)
We have to train our classifier before we can test it, so we supply the training data first. After that, it’s a simple matter of getting the accuracy of our classifier by supplying it with the test data and measuring how well it predicts normal vs abnormal compared to the actual labels:
print(knn.score(X_test, y_test))
#Output
#0.75
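If you are wondering what score() actually computes for a classifier: it is the fraction of test samples whose predicted label matches the true label. The sketch below checks this by hand. It runs on synthetic data from make_classification as a hypothetical stand-in for our dataset, so the number it prints will not be our 0.75:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical synthetic data standing in for the tutorial's X and y
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42, stratify=y_demo
)

knn_demo = KNeighborsClassifier(n_neighbors=7)
knn_demo.fit(X_tr, y_tr)

# Predict a label for every test sample, then count the matches
y_pred = knn_demo.predict(X_te)
manual_accuracy = np.mean(y_pred == y_te)

# Both values should be identical
print(manual_accuracy, knn_demo.score(X_te, y_te))
```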
So our classifier is 75% accurate at predicting normal vs abnormal cases on the test set or, put another way, it misclassified about 25% of the test cases. Not the best, especially for medical data. We can probably do better if we use another classifier algorithm or maybe a larger dataset, but we covered the basics for this tutorial so we are OK for now. 👍👍😉
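Before switching algorithms, one cheap thing to try is a different n_neighbors. The sketch below is only an illustration, not something we ran on our dataset: it uses synthetic data as a hypothetical stand-in and compares a few values of k with cross_val_score on the training portion, so the test set stays untouched for the final score:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical synthetic data standing in for the tutorial's X and y
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42, stratify=y_demo
)

# Mean 5-fold cross-validated accuracy on the training data for each k
cv_scores = {}
for k in (3, 5, 7, 9, 11):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=5)
    cv_scores[k] = scores.mean()
    print(k, round(cv_scores[k], 3))

# The k with the highest cross-validated accuracy
best_k = max(cv_scores, key=cv_scores.get)
print("best k:", best_k)
```

Whichever k comes out on top, we would then refit on the full training set and report knn.score(X_test, y_test) once, as we did above.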
Thanks for reading. We hope this tutorial was helpful. Find the full source code HERE.