Developing a Machine Learning Model for Breast Cancer Prediction


Breast cancer is one of the most lethal and heterogeneous diseases of the present era and causes the deaths of an enormous number of women all over the world. It is the most common type of cancer found in women worldwide and is among the leading causes of death in women. Early diagnosis of breast cancer can significantly improve the prognosis and chance of survival, as it enables timely clinical treatment. Furthermore, accurate classification of benign tumors can prevent patients from undergoing unnecessary treatments.


The machine learning process is based on three main stages: preprocessing, feature selection or extraction, and classification. Feature extraction is a central part of the process and helps in the diagnosis and prognosis of cancer, because the extracted features are what allow the data set to be separated into benign and malignant tumors.

Our Objective:

To develop an efficient machine learning model that predicts, with high accuracy, whether a breast tumor is benign or malignant. To achieve this accuracy we will use the KNN algorithm.

What is the KNN Algorithm?

In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric classification method first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. It is used for classification and regression. In both cases, the input consists of the k closest training examples in the data set. The output depends on whether k-NN is used for classification or regression:

  • In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
  • In k-NN regression, the output is the property value for the object. This value is the average of the values of k nearest neighbors.
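The two output rules above can be sketched in a few lines. This is a minimal illustration, not the article's original code: the function names and the neighbor labels and values are made up for the example, and the k nearest neighbors are assumed to have been found already.

```python
from collections import Counter
from statistics import mean


def classify_by_vote(neighbor_labels):
    """Classification: return the most common label among the k neighbors."""
    return Counter(neighbor_labels).most_common(1)[0][0]


def regress_by_average(neighbor_values):
    """Regression: return the average of the k neighbors' values."""
    return mean(neighbor_values)


# Hypothetical labels and values of the k = 5 nearest neighbors of one query.
labels = ["B", "M", "B", "B", "M"]
values = [2.0, 4.0, 3.0, 5.0, 6.0]

predicted_class = classify_by_vote(labels)    # plurality vote over labels
predicted_value = regress_by_average(values)  # mean of the neighbor values
```

With two classes, choosing an odd k avoids tied votes.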

But why KNN?

The main reasons are:

  • It is simple to implement.
  • It is robust to noisy training data, provided k is not too small.
  • It can be more effective if the training data set is large.

KNN is widely used in pattern recognition, which makes it a good approach for our case. To recognize the pattern, each class is given equal importance. KNN retrieves the data points with the most similar features from a large data set, and on the basis of this feature similarity we classify each sample (that is, decide whether a tumor is benign or malignant).

Steps for implementing the KNN algorithm:

  1. Calculate the distance from each data point to a query point, say p.
  2. Find the k nearest neighbors of p.
  3. Perform voting (counting) among the class labels of those neighbors.
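The three steps can be sketched as a single function. This is a minimal from-scratch version under stated assumptions: `manhattan` and `knn_predict` are illustrative names, the two-feature points are toy data rather than the Wisconsin data set, and the distance is the Manhattan distance used later in this article.

```python
from collections import Counter


def manhattan(a, b):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_X, train_y, p, k):
    # Step 1: distance from every training point to the query point p.
    distances = [(manhattan(x, p), label) for x, label in zip(train_X, train_y)]
    # Step 2: keep the k training points closest to p.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: vote among the labels of those k neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data: three benign-like points and three malignant-like points.
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
train_y = ["B", "B", "B", "M", "M", "M"]

prediction = knn_predict(train_X, train_y, p=(2.0, 2.0), k=3)
```

Sorting all distances costs O(n log n) per query, which is acceptable for a data set of a few hundred samples.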

To calculate the distance we will use the Manhattan distance (L1 norm):

The Manhattan distance is a distance metric between two points in an N-dimensional vector space. It is the sum of the lengths of the projections of the line segment between the points onto the coordinate axes. In simple terms, it is the sum of the absolute differences between the two points across all dimensions. It is also known as the L1 norm or L1 metric. It is used extensively in fields ranging from regression analysis to frequency distributions.
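Written as a formula, using p and q for the two points and N for the number of dimensions (symbols chosen here for illustration), the definition above reads:

```latex
d_1(p, q) = \lVert p - q \rVert_1 = \sum_{i=1}^{N} \lvert p_i - q_i \rvert
```

For example, for p = (1, 2) and q = (4, 6) the distance is |1 - 4| + |2 - 6| = 3 + 4 = 7.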

We use the L1 norm because it is generally more robust than the L2 norm: differences are not squared, so a single extreme value in one feature dominates the distance less. This makes it preferable here.
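A small numeric comparison illustrates the point about extreme values. The two vectors below are made up for the example: they differ by 1 in three coordinates and by 10 in the last one, and the sketch measures what share of each distance comes from that one extreme coordinate.

```python
import math


def l1_distance(a, b):
    """Manhattan (L1) distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean (L2) distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = (0.0, 0.0, 0.0, 0.0)
b = (1.0, 1.0, 1.0, 10.0)  # the last coordinate is the extreme one

l1 = l1_distance(a, b)  # 1 + 1 + 1 + 10
l2 = l2_distance(a, b)  # sqrt(1 + 1 + 1 + 100)

# Share of each distance that is due to the extreme coordinate alone.
outlier_share_l1 = 10.0 / l1            # 10 / 13, roughly 0.77
outlier_share_l2 = 10.0 ** 2 / l2 ** 2  # 100 / 103, roughly 0.97
```

Under L2 the extreme coordinate accounts for nearly the whole (squared) distance, while under L1 the other coordinates still contribute noticeably.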

So let's get started on developing the required machine learning model.

Step 1: Data Collection

We will be using the Breast Cancer Wisconsin (Diagnostic) Data Set, which is available from the UCI Machine Learning Repository.

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
The linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: “Robust Linear Programming Discrimination of Two Linearly Inseparable Sets”, Optimization Methods and Software 1, 1992, 23–34].

Attribute Information:

1) ID number
2) Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter² / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension (“coastline approximation” - 1)

The mean, standard error and “worst” or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
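The field numbering described above can be reproduced programmatically. This is a sketch with assumed column names (`radius_mean`, `radius_se`, and so on), since the raw data file carries no header row. It assumes the fields are ordered as the ID, the diagnosis, then the ten means, the ten standard errors and the ten "worst" values.

```python
# The ten base features, in the order listed in the attribute information.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
# The three statistics computed for each base feature.
STATISTICS = ["mean", "se", "worst"]


def build_column_names():
    """Return 32 names: id, diagnosis, then 10 features x 3 statistics."""
    columns = ["id", "diagnosis"]
    for stat in STATISTICS:
        columns.extend(f"{feature}_{stat}" for feature in BASE_FEATURES)
    return columns


COLUMNS = build_column_names()


def field(number):
    """Look up a column name by its 1-based field number."""
    return COLUMNS[number - 1]
```

For instance, `field(3)`, `field(13)` and `field(23)` return the mean, standard-error and worst radius columns, matching the numbering in the text.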

Step 2: Data Preparation

In this step we replace any missing values that may be present in the data set with appropriate values and format the data according to our requirements.

We need to shuffle the data set so that the samples are in random order, which helps the training and test subsets have similar distributions and leads to better predictions. We also need to split the data. The data is usually split into training data and test data. The training set contains known outputs, and the model learns from it so that it can generalize to other data later on. The test data set (or subset) is held back so that we can test our model's predictions on it.
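One way to do this preparation is sketched below; it is not necessarily how the original project did it. The sketch assumes each row is a list of strings in the order ID, diagnosis, features, that missing entries appear as an empty string or "?", and that a missing value is replaced by the mean of its column. The function name `prepare` and the sample rows are hypothetical.

```python
from statistics import mean

MISSING = {"", "?"}  # assumed markers for a missing entry


def prepare(rows):
    """Drop the ID, encode the diagnosis as 1 (M) or 0 (B), and fill
    missing feature values with the mean of the corresponding column."""
    labels = [1 if row[1] == "M" else 0 for row in rows]
    raw = [row[2:] for row in rows]  # keep only the feature columns
    n_features = len(raw[0])

    # Column means computed over the values that are present.
    col_means = []
    for j in range(n_features):
        present = [float(r[j]) for r in raw if r[j] not in MISSING]
        col_means.append(mean(present))

    features = [
        [col_means[j] if r[j] in MISSING else float(r[j]) for j in range(n_features)]
        for r in raw
    ]
    return features, labels


# Hypothetical rows with two features each, for illustration only.
rows = [
    ["101", "M", "17.0", "10.0"],
    ["102", "B", "?", "14.0"],
    ["103", "B", "13.0", ""],
]
X, y = prepare(rows)
```

A column whose values are all missing would make `mean` raise an error, so such columns would need to be dropped beforehand.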
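Shuffling and splitting can be sketched with the standard library alone. The 30% test fraction and the fixed seed are assumptions made for this example (the article does not state the split it used), and `shuffle_split` is an illustrative name.

```python
import random


def shuffle_split(X, y, test_fraction=0.3, seed=42):
    """Shuffle samples and labels together, then split into train and test."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Ten toy samples whose label is the parity of their single feature.
X = [[float(i)] for i in range(10)]
y = [i % 2 for i in range(10)]
X_train, X_test, y_train, y_test = shuffle_split(X, y)
```

Shuffling a list of indices, rather than the two lists separately, keeps every sample paired with its own label.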

Step 3: Model Selection

The model we are going to use is KNN, also known as the K-Nearest Neighbor algorithm. We have already stated our reasons for choosing KNN to develop the required model.

Step 4: Training the Model

In this step we apply the KNN algorithm to our training data set and predict the outcome for a single data point to verify that the model has been applied correctly.
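Because KNN is a lazy learner, "training" amounts to storing the training data; the work happens at prediction time. The sketch below wraps this in a small class and runs the single-point check described above. The class name, its methods and the two-feature toy samples are assumptions for illustration, not the original project code.

```python
from collections import Counter


class KNNClassifier:
    """Minimal k-NN classifier using the Manhattan distance."""

    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y):
        # Training a lazy learner only memorizes the data.
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict_one(self, p):
        def dist(x):
            return sum(abs(a - b) for a, b in zip(x, p))

        # Sort training samples by distance to p and keep the k closest.
        nearest = sorted(zip(self.X_, self.y_), key=lambda pair: dist(pair[0]))[: self.k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy training data with two features per sample.
X_train = [[12.0, 15.0], [13.0, 14.0], [11.5, 16.0],
           [20.0, 25.0], [21.0, 24.0], [19.5, 26.0]]
y_train = ["B", "B", "B", "M", "M", "M"]

model = KNNClassifier(k=3).fit(X_train, y_train)
single_prediction = model.predict_one([20.5, 24.5])  # lies inside the "M" cluster
```

A quick sanity check is to predict a point whose class is obvious from the training data, as done here, before moving on to the full test set.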

Step 5: Model Evaluation

Now we run the test data through the developed model. We calculate the accuracy of the model and determine the value of k for which the accuracy is highest. To visualize this, we plot accuracy against k, the number of nearest neighbors.

According to the graph and the accuracy list, the model is most accurate at k = 5, where it reaches its maximum accuracy of 97.12%, which is very high.
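The search over k can be sketched as follows. The data here is a small made-up set containing one deliberately mislabeled training point, so the accuracies it produces illustrate the procedure only and are unrelated to the result reported for the real data set. `knn_predict` and `accuracy` are illustrative names.

```python
from collections import Counter


def knn_predict(train_X, train_y, p, k):
    """Classify p by a vote among its k nearest neighbors (Manhattan distance)."""
    def dist(x):
        return sum(abs(a - b) for a, b in zip(x, p))

    nearest = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train_X, train_y, test_X, test_y, k):
    """Fraction of test samples whose predicted label matches the true one."""
    correct = sum(
        knn_predict(train_X, train_y, x, k) == label
        for x, label in zip(test_X, test_y)
    )
    return correct / len(test_X)


# Toy training set; the point (1.5, 1.5) is a mislabeled, noisy sample.
train_X = [(1, 1), (2, 1), (1, 2), (2, 2), (1.5, 1.5),
           (8, 8), (9, 8), (8, 9), (9, 9)]
train_y = ["B", "B", "B", "B", "M", "M", "M", "M", "M"]
test_X = [(1.5, 1.6), (8.5, 8.5), (1, 1.2)]
test_y = ["B", "M", "B"]

ks = [1, 3, 5, 7, 9]  # odd values avoid tied votes between two classes
accuracies = {k: accuracy(train_X, train_y, test_X, test_y, k) for k in ks}
best_k = max(ks, key=lambda k: accuracies[k])  # first k with the highest accuracy
```

In this toy run k = 1 is misled by the noisy point and k = 9 reduces to a majority vote over the whole training set, so intermediate values do best. The `accuracies` dictionary is also exactly what an accuracy-versus-k plot would display.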


In this way we developed an efficient machine learning model for breast cancer prediction using the KNN algorithm, with 97.12% accuracy.

The GitHub Repository link for this model :


