In this Python k-Nearest Neighbors tutorial, we will give you a simple introduction to Machine Learning with Python using the k-nearest neighbors (KNN or k-NN) algorithm. We will use this algorithm to solve a simple classification problem. We will be using Python 3.8.10.
KNN is a simple, supervised machine learning algorithm that can be used to solve both classification and regression problems. For this tutorial, however, we will be doing classification.
Classification with KNN works by a simple majority vote mechanism. KNN predicts the label of any data point by looking at the K closest labeled data points and getting them to vote on what label the unlabeled point should have. The label that receives the most votes is assigned as the class of that data point.
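To make the voting idea concrete, here is a minimal sketch of the majority vote written with only the Python standard library. It is not how scikit-learn implements KNN internally; the function name knn_predict and the toy points are made up for illustration, and Euclidean distance is assumed.

```python
from collections import Counter
from math import dist  # Euclidean distance, available since Python 3.8


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Pair each training point's distance to the query with that point's label
    distances = [(dist(point, query), label)
                 for point, label in zip(train_points, train_labels)]
    # Keep the k closest labeled points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Count the votes and return the most common label
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two small clusters
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (2.0, 2.0), k=3))  # query near the first cluster
print(knn_predict(points, labels, (6.0, 6.5), k=3))  # query near the second cluster
```

The first query sits among the points labeled 0 and the second among the points labeled 1, so each query receives the label of the cluster it is closest to.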
Let’s begin. 🙌🙌🙌
STEP 1
We will use the scikit-learn module to perform our classification. If you don’t have it installed, you can install it by running the following command at the console:
pip install scikit-learn
STEP 2
Next we import our classifier and our dataset.
We import KNeighborsClassifier, which is the classifier that actually implements the k-nearest neighbors vote.
from sklearn.neighbors import KNeighborsClassifier
For this tutorial, we will be using the cardiac Single Photon Emission Computed Tomography (SPECT) images dataset, which you can find HERE. This dataset merges the original separate train and test datasets, which you can find HERE.
In the cardiac SPECT dataset, each of the 349 patients is classified into one of two categories (normal or abnormal) based on 44 continuous features. These features were extracted by processing the original SPECT images.
Let us import the dataset from a CSV file:
from sklearn.neighbors import KNeighborsClassifier
import requests
import pandas as pd
import io
import numpy as np
import matplotlib.pyplot as plt
url = 'https://raw.githubusercontent.com/devrescue/python/main/datasets/SPECTF.csv'
# download the raw CSV bytes
s = requests.get(url).content
# decode the bytes and parse the text into a DataFrame
df = pd.read_csv(io.StringIO(s.decode('utf-8')))
# some information about our dataset (df.info() prints its summary itself and returns None)
df.info()
We did a separate tutorial on how to import data from CSV files that you might want to check out.
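The decode-and-parse step used above can be tried offline with a few made-up bytes. This is only a sketch: the byte string below is a hypothetical stand-in for the downloaded content, and the feature column names feature_a and feature_b are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the bytes returned by requests.get(url).content
raw_bytes = b"OVERALL_DIAGNOSIS,feature_a,feature_b\n1,59,52\n0,72,62\n1,71,62\n"

# Decode the bytes to text, wrap the text in a file-like object, and parse it
df_demo = pd.read_csv(io.StringIO(raw_bytes.decode("utf-8")))

# shape is (rows, columns); the header line becomes column names, not a row
print(df_demo.shape)
print(df_demo.columns.tolist())
```

Four lines of text go in, but the resulting DataFrame has three rows, because pandas uses the first line as the column header by default.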
Doing a bit of Exploratory Data Analysis on our dataset, we can see that there are 45 columns and 349 rows.
pandas reads the first line of the CSV file as the column header and does not count it as a row, so all 349 rows are samples. Each row represents the data for one heart patient.
The OVERALL_DIAGNOSIS column is our TARGET VARIABLE. As you can see, the possible values are 0 = normal and 1 = abnormal.
We will use KNN to try and predict the class of the target variable, i.e. whether a patient is normal or abnormal.
We will base this prediction on the 44 PREDICTOR VARIABLES or FEATURE VARIABLES. We say 44 because we excluded our target variable column.
Let’s do a quick count of the NORMAL vs ABNORMAL cases in our dataset:
print(len(df[df.OVERALL_DIAGNOSIS == 0])) #normal
print(len(df[df.OVERALL_DIAGNOSIS == 1])) #abnormal
#output
95
254
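As an alternative to filtering the DataFrame once per class, pandas can tally every class in a single call with value_counts(). The sketch below uses a small made-up frame rather than the SPECT data, so the numbers it prints are not the tutorial's counts.

```python
import pandas as pd

# Small made-up frame standing in for the SPECT data
df_demo = pd.DataFrame({"OVERALL_DIAGNOSIS": [0, 1, 1, 0, 1, 1, 1]})

# value_counts() returns a Series indexed by class value
counts = df_demo["OVERALL_DIAGNOSIS"].value_counts()

print(counts.loc[0])  # number of normal cases in the made-up frame
print(counts.loc[1])  # number of abnormal cases in the made-up frame
```

Using .loc makes it explicit that 0 and 1 are looked up as class labels rather than as positions in the result.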
Based on a quick count, we see that we have 95 normal cases and 254 abnormal cases. We can go further and produce a bar plot that summarizes the NORMAL vs ABNORMAL cases visually. First, we install the matplotlib module:
pip install matplotlib
Then we write our code to produce the plot.
# plt.subplots() leaves room around the axes so the tick labels stay visible
fig, ax = plt.subplots()
diagnoses = ['NORMAL', 'ABNORMAL']
patients = [len(df[df.OVERALL_DIAGNOSIS == 0]), len(df[df.OVERALL_DIAGNOSIS == 1])]
ax.bar(diagnoses, patients)
ax.set_ylabel('Number of patients')
plt.show()

STEP 3
Now we begin the actual classification:
y = df['OVERALL_DIAGNOSIS'].values
X = df.drop('OVERALL_DIAGNOSIS', axis=1).values
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X,y)
As mentioned earlier, the target variable for classification is OVERALL_DIAGNOSIS. This is what we will use the KNN algorithm to help us predict for new samples or patients.
y holds the target variable: the first statement selects only that column from our dataset. These values are the labels for the data.
X holds the feature values, i.e. the training data without its labels. We used df.drop() so that the label column isn’t included in X.
KNeighborsClassifier() is the classifier that implements the k-nearest neighbors vote. The parameter n_neighbors is the number of neighbors that will vote on the label of each unlabeled data point.
knn.fit(X,y) fits the k-nearest neighbors classifier from the training dataset X and target values y.
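To see which neighbors actually cast the votes, a fitted KNeighborsClassifier exposes a kneighbors() method that returns the distances to, and the row indices of, the closest training points. The sketch below uses made-up 2-D points instead of the SPECT data, and the names X_demo, y_demo and knn_demo are chosen only for this illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up 2-D points: three of class 0 near the origin, three of class 1 farther away
X_demo = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

knn_demo = KNeighborsClassifier(n_neighbors=3)
knn_demo.fit(X_demo, y_demo)

query = np.array([[4, 4]])
# Distances to, and row indices of, the n_neighbors closest training points
distances, indices = knn_demo.kneighbors(query)
voter_labels = y_demo[indices[0]]

print(voter_labels)             # labels of the three voters
print(knn_demo.predict(query))  # the majority label among them
```

Here the three closest training points all belong to class 1, so the vote is unanimous. With a larger n_neighbors, points from the other cluster would start to take part in the vote.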
STEP 4
Now that we have fitted our knn classifier, we can use it to predict our target variable, also known as the label.
y_pred = knn.predict(X)
y_pred is a NumPy array of the predicted labels as determined by our KNN classifier. Recall that X holds the feature values of our training data, without the labels.
NEXT STEPS
Now that we have trained our KNN classifier on our dataset, the next step would be to measure the accuracy and performance of our classifier knn.
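One common way to measure accuracy is to hold back part of the data for testing, because predictions on the same samples the classifier was fitted on tend to look better than predictions on unseen patients. The sketch below is one possible approach, not necessarily the one used in a follow-up tutorial. It uses synthetic data from make_classification as a stand-in for the SPECT DataFrame, and the split size and random_state values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; in the tutorial, X and y come from the SPECT DataFrame
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# Hold out 30% of the samples; stratify keeps the class ratio similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

knn_demo = KNeighborsClassifier(n_neighbors=6)
knn_demo.fit(X_train, y_train)

# score() returns the mean accuracy on the data it is given
train_acc = knn_demo.score(X_train, y_train)
test_acc = knn_demo.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

The test accuracy is the more honest estimate of how the classifier would behave on new samples, since none of the held-out rows took part in fitting.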
You can find the entire source code for this tutorial HERE.