Logistic regression classifier of breast cancer data

Message: Implement a logistic regression classifier.

Use the Breast Cancer Wisconsin dataset from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
Use the Newton-Raphson method.
Please implement this algorithm for logistic regression
(i.e., to minimize the cross-entropy loss), and run it over the Breast Cancer Wisconsin dataset.

Please randomly sample 80% of the training instances to train a classifier and then test it on the remaining 20%. Ten such random data splits should be performed, and the average over these 10 trials is used to estimate the generalization performance.
Please submit: (1) your source code that I should be able to (compile and) run, and the processed dataset if any; (2) a report on a program checklist, how you accomplished the project, and the result of your classification.

Solution 

Introduction

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

This database is also available through the UW CS FTP server (ftp ftp.cs.wisc.edu, then cd math-prog/cpo-dataset/machine-learn/WDBC/).

Attribute Information:

1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features computed for each cell nucleus:

  a) radius (mean of distances from center to points on the perimeter)
  b) texture (standard deviation of gray-scale values)
  c) perimeter
  d) area
  e) smoothness (local variation in radius lengths)
  f) compactness (perimeter^2 / area - 1.0)
  g) concavity (severity of concave portions of the contour)
  h) concave points (number of concave portions of the contour)
  i) symmetry
  j) fractal dimension ("coastline approximation" - 1)

The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
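The field numbering described above can be sketched as a small lookup. This assumes the 30 feature fields are laid out as ten means (fields 3-12), then ten standard errors (13-22), then ten "worst" values (23-32), which is what the three examples in the text imply. The names FEATURES, STATS and field_name, and the '<feature>_<statistic>' naming, are illustrative and not part of the dataset documentation.

```python
# Illustrative names only; the order follows the feature list a) to j) above.
FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
STATS = ["mean", "se", "worst"]  # assumed block order: fields 3-12, 13-22, 23-32


def field_name(field):
    """Map a 1-based field number (3..32) to '<feature>_<statistic>'."""
    if not 3 <= field <= 32:
        raise ValueError("feature fields are numbered 3 to 32")
    # Fields 1 and 2 are the ID and the diagnosis, so features start at offset 3.
    stat_index, feature_index = divmod(field - 3, 10)
    return f"{FEATURES[feature_index]}_{STATS[stat_index]}"


print(field_name(3), field_name(13), field_name(23))
# prints: radius_mean radius_se radius_worst
```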

All feature values are recoded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant 

Libraries Used

import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

Flow chart:

  • The model takes in 30 features along with one target variable. The target labels are either 1 or 0 (binary classification).
  • The data is pre-processed for further modelling by scaling it to zero mean and unit variance and splitting it into train and test sets.
  • The estimator, i.e. a logistic regression classifier with a Newton-type solver (newton-cg), is used with a grid search (10-fold CV) to obtain the best-fit hyperparameters.
  • Following the estimation of parameters for our logistic classifier, we move on to modelling the data. This is achieved by looping over 10 random train/test splits (test size 0.2), training our model on each of the 10 train sets, and predicting on the corresponding test sets.
  • The performance metric used to evaluate the results is the F1 score.
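The flow above relies on scikit-learn's newton-cg solver, which is a Newton conjugate-gradient method and not the plain Newton-Raphson update named in the assignment. For comparison, a hand-written Newton-Raphson fit of the cross-entropy loss might look roughly like the sketch below. It is not the code used to produce the results in this post. The function names, the synthetic demo data and the small ridge term (added so the Hessian stays invertible on separable data) are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    # Clip to avoid overflow in exp for extreme scores.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))


def fit_logistic_newton(X, y, n_iter=25, ridge=1e-3, tol=1e-8):
    """Newton-Raphson for logistic regression (cross-entropy loss).

    X: (n, d) features, y: (n,) labels in {0, 1}.
    ridge: small L2 penalty on all weights, intercept included
           (an assumption, not part of the assignment).
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ w)                        # predicted P(y = 1)
        grad = Xb.T @ (p - y) + ridge * w          # gradient of the loss
        s = p * (1.0 - p)                          # per-sample curvature
        hess = (Xb * s[:, None]).T @ Xb + ridge * np.eye(Xb.shape[1])
        step = np.linalg.solve(hess, grad)         # Newton step: H^{-1} g
        w -= step
        if np.max(np.abs(step)) < tol:
            break
    return w


def predict(X, w):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return (sigmoid(Xb @ w) >= 0.5).astype(int)


# Small synthetic demo (made-up data, not the breast cancer dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

w_demo = fit_logistic_newton(X_demo, y_demo)
acc_demo = (predict(X_demo, w_demo) == y_demo).mean()
print("weights:", w_demo, "training accuracy:", acc_demo)
```

Each iteration solves a linear system with the Hessian instead of taking a fixed-size gradient step, which is why a handful of iterations is usually enough on a problem with only 30 standardized features.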

Evaluation of performance

The F1 score on the test set for each of the 10 train/test splits is shown below:

 

Iteration   F1 score (on test)
1           0.989
2           0.949
3           0.989
4           1
5           0.959
6           0.949
7           1
8           0.969
9           0.959
10          0.979
Mean        0.98

With a mean F1 score of 0.98 over the 10 splits, the classifier comes close to perfect classification.
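For reference, the F1 score is the harmonic mean of precision and recall for the positive class, which here is malignant (label 1). The sketch below computes it from confusion counts and applies it to a trivial baseline that labels every sample malignant, using the class distribution quoted in the introduction (357 benign, 212 malignant). The helper name f1_score_from_counts is illustrative; the notebook itself uses sklearn's metrics.f1_score.

```python
def f1_score_from_counts(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall).

    tp, fp, fn are true positives, false positives and false negatives,
    counted with malignant (1) as the positive class.
    """
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Baseline: predict "malignant" for all 569 samples.
# All 212 malignant cases are hits, all 357 benign cases are false alarms.
baseline = f1_score_from_counts(tp=212, fp=357, fn=0)
print(round(baseline, 3))
# prints: 0.543
```

Against that baseline of about 0.54, the per-split scores in the table above show that the classifier is doing far more than exploiting the class imbalance.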

import os

%matplotlib inline

os.chdir('C://Users//RAJA  IIT//Desktop')  # change to your own directory to run the notebook on your system

import pandas as pd
import numpy as np
from sklearn import metrics  # provides f1_score, used in the evaluation loop
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

## Import data

df = pd.read_csv('data.csv')
y = df['diagnosis'].map({'M': 1, 'B': 0})  # malignant = 1, benign = 0
X = df.drop(['id', 'diagnosis'], axis=1)   # keep only the feature columns

## Scale and search for best modelling params

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # zero mean, unit variance per feature

C = [.001, .01, .1, 1, 10, 100, 1000]  # inverse regularization strengths to try
tuned_parameters = [{'C': C}]

clf = LogisticRegression(solver='newton-cg')
clf = GridSearchCV(clf, tuned_parameters, cv=10, scoring='f1')  # 10-fold CV, scored by F1

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)
clf.fit(X_train, y_train)

print('The best-fit value of C found by the grid search is: {}'.format(clf.best_params_))

## Fitting our model over 10 iterations

Score = []
for i in range(10):
    # draw a fresh random 80/20 split on every iteration
    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)
    # note: this refits with the default C (1.0), not the value found by the grid search
    clf = LogisticRegression(solver='newton-cg')
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    s = np.round(metrics.f1_score(y_test, pred), 2)
    Score.append(s)

print('The averaged-out F1 score on the dataset is: {}'.format(np.round(np.mean(Score), 2)))

## Graphing our results

plt.plot(np.arange(1, 11), Score, '-')  # x axis: split number 1..10
plt.xlabel('n-th train test split iteration')
plt.ylabel('F-1 Score on test set')
plt.title('F-1 scores over 10 iterations')