

Breast Cancer Detection


Model Overview

Accurate diagnosis is one of the most critical steps in breast cancer treatment. In this use case, I have performed an analysis of the Breast Cancer Wisconsin (Diagnostic) DataSet. The goal is to classify whether a breast cancer tumor is benign or malignant using ten real-valued features computed for each cell nucleus. Such an analysis can help doctors support their diagnoses.


This dataset was created by Dr. William H. Wolberg, a physician at the University of Wisconsin Hospitals in Madison, Wisconsin, USA. The dataset is available on Kaggle:


https://www.kaggle.com/uciml/breast-cancer-wisconsin-data


Objectives


This analysis aims to identify which features are most helpful in predicting whether a tumor is malignant or benign, and to build a model that classifies breast cancer as benign or malignant.

Dataset Description


  • Ten real-valued features are computed for each cell nucleus:



  1. radius (mean of distances from the center to points on the perimeter)

  2. texture (standard deviation of gray-scale values)

  3. perimeter

  4. area

  5. smoothness (local variation in radius lengths)

  6. compactness (perimeter² / area − 1.0)

  7. concavity (severity of concave portions of the contour)

  8. concave points (number of concave portions of the contour)

  9. symmetry

  10. fractal dimension (“coastline approximation” − 1)



  • The mean, standard error, and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.

  • The dataset contains 569 samples of malignant and benign tumor cells.

  • Columns 2-32 contain 30 real-valued features that have been computed from digitized images of the cell nuclei, which can be used to build a model to predict whether a tumor is benign or malignant (a loading and encoding sketch follows after this list). The diagnosis column is encoded as:




    • 1 = Malignant (Cancerous) - present (M)

    • 0 = Benign (Not Cancerous) - absent (B)
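For readers who want to reproduce the preprocessing, here is a minimal loading and encoding sketch. It assumes the Kaggle CSV has been saved locally as data.csv and uses the column names from the standard Kaggle release (id, diagnosis, Unnamed: 32); these names are assumptions, not shown in the original post.

import pandas as pd

# Load a local copy of the Breast Cancer Wisconsin (Diagnostic) CSV (assumed filename)
df = pd.read_csv("data.csv")

# Drop the identifier and the empty trailing column present in the Kaggle release
df = df.drop(columns=["id", "Unnamed: 32"], errors="ignore")

# Encode the target: M (malignant) -> 1, B (benign) -> 0
df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})

# Feature matrix and target vector used in the rest of the write-up
X = df.drop(columns=["diagnosis"])
y = df["diagnosis"]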





Data Visualisation


Visualization is an essential aspect of data science. It helps us understand the data through insights and also present it to others. I used Python visualization libraries such as Matplotlib and Seaborn to draw plots and gain specific insights.


Out of the 569 samples in the dataset, 357 are labeled B (benign) and 212 M (malignant), so the ‘M’ class is underrepresented (class imbalance).
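As a quick check of this distribution (a sketch; df and the diagnosis encoding come from the loading example above):

# Count benign (0) vs. malignant (1) samples; expected roughly 357 benign and 212 malignant
print(df["diagnosis"].value_counts())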

We can check for null values in the dataset using the following pandas call.
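The pandas call itself is not shown in the original text; the usual check looks like this (assuming the dataframe df from the loading sketch):

# Number of missing values per column; an all-zero result means no imputation is needed
print(df.isnull().sum())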


Boxplots show the distribution of each feature and help us identify outliers.
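A sketch of how such a boxplot might be drawn with Seaborn; the choice of feature (radius_mean) is illustrative, not taken from the original figures:

import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of one feature split by diagnosis, to inspect spread and outliers
sns.boxplot(x="diagnosis", y="radius_mean", data=df)
plt.title("radius_mean by diagnosis (0 = benign, 1 = malignant)")
plt.show()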


Plotting the correlation matrix shows that many columns are very highly correlated, which causes multicollinearity; we will use PCA to deal with these highly correlated features.
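The correlation plot is not reproduced here; a minimal sketch of how it can be drawn (assuming the feature matrix X from the loading example):

import matplotlib.pyplot as plt
import seaborn as sns

# Heatmap of pairwise feature correlations; many strong off-diagonal values indicate multicollinearity
plt.figure(figsize=(12, 10))
sns.heatmap(X.corr(), cmap="coolwarm")
plt.show()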


Feature Engineering


Because the ‘M’ class in the dataset is underrepresented, I performed oversampling using SMOTE (Synthetic Minority Oversampling Technique).


SMOTE is an oversampling technique where the synthetic samples are generated for the minority class. This algorithm helps to overcome the overfitting problem posed by random oversampling.


This gives us a 50/50 split of ‘M’ and ‘B’ values in the dataset.


from imblearn.over_sampling import SMOTE
import pandas as pd

# Oversample the minority (malignant) class so both classes are balanced
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)

# Wrap the resampled data back into dataframes
X_res = pd.DataFrame(X_res)
Y_res = pd.DataFrame(y_res)
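A quick sanity check (not in the original post) that the resampled target is now balanced:

# Both classes should now have 357 samples each
print(pd.Series(y_res).value_counts())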


Feature scaling was performed on the feature dataset using StandardScaler.


from sklearn.preprocessing import StandardScaler

# Standardize features to zero mean and unit variance; fit on the training set only
scaler = StandardScaler()
scaler.fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_train.columns)


PCA:


PCA was performed on the scaled dataset, and a new dataset was formed from n = 5 principal components.


from sklearn.decomposition import PCA

# Project the scaled features onto the first 5 principal components
n_components = 5
pca = PCA(n_components=n_components)
pca.fit(X_train)
pc_train = pca.transform(X_train)
pc_test = pca.transform(X_test)


Principal component analysis (PCA) is a statistical technique that identifies underlying linear patterns in a dataset so that it can be expressed in terms of another dataset of significantly lower dimension without much loss of information.



The reduced dataset should explain most of the variance of the original dataset with far fewer variables. These new variables are called principal components.


pca.explained_variance_ratio_


array([0.46960336, 0.18512009, 0.09477612, 0.06351298, 0.05197679])


pca.explained_variance_ratio_.cumsum()


array([0.46960336, 0.65472345, 0.74949957, 0.81301255, 0.86498934])

The five principal components together explain about 86.5% of the variance in the scaled features.




Splitting the dataset into the Training set and Test set


from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_res, Y_res, test_size=0.3, random_state=0)


Model Selection


I trained the following models on the dataset: Logistic Regression, Support Vector Classifier, Random Forest Classifier, and K-Nearest Neighbours.


Logistic regression is a method for fitting a regression curve, y = f(x) when y is a categorical variable. It is a classification algorithm used to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables.


Logistic Regression uses a more complex cost function than linear regression (the log loss), and its hypothesis passes a linear combination of the inputs through the ‘Sigmoid function’, which maps any real value into the range (0, 1).


Hypothesis function for logistic regression: hθ(x) = σ(θᵀx) = 1 / (1 + e^(−θᵀx))
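A small NumPy sketch of this hypothesis (illustrative only; the parameter values are made up):

import numpy as np

def sigmoid(z):
    # Maps any real value into (0, 1), interpreted as P(y = 1 | x)
    return 1.0 / (1.0 + np.exp(-z))

# Example parameter vector theta and feature vector x (hypothetical values)
theta = np.array([0.5, -1.2, 2.0])
x = np.array([1.0, 0.3, 0.8])
print(sigmoid(theta @ x))  # h_theta(x), the predicted probability of class 1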



from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Fit logistic regression on the PCA-transformed training data
pca_model = LogisticRegression()
pca_model.fit(pc_train, y_train)

# Accuracy on the PCA-transformed test data
pca_model_acc = accuracy_score(y_test, pca_model.predict(pc_test))


0.9720930232558139


Model Accuracy (PCA Data): 97.20930%


 


KNN


K-Nearest Neighbours (KNN) is one of the simplest algorithms: it calculates the distance between the query observation and each data point in the training dataset and finds the K closest observations.


The nearest-neighbour search compares the query point with every data point in the training set and returns the closest points.


There are many ways in which we can find the distance between two data points. The most commonly used distance metrics are “Euclidean distance” and “Hamming distance”.
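For illustration, Euclidean distance between two feature vectors can be computed as follows (a sketch with made-up points, not from the original post):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

# Euclidean distance: square root of the summed squared differences
print(np.sqrt(np.sum((a - b) ** 2)))
# Equivalent built-in
print(np.linalg.norm(a - b))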


from sklearn.neighbors import KNeighborsClassifier

# Fit KNN (default k = 5) on the PCA-transformed training data
knn = KNeighborsClassifier()
knn.fit(pc_train, y_train)
y_pred_knn = knn.predict(pc_test)

# Training accuracy, then test accuracy
print(accuracy_score(y_train, knn.predict(pc_train)))
knn_acc = accuracy_score(y_test, knn.predict(pc_test))


0.9720930232558139


SVC


SVM maximizes the margin between the classes by learning a suitable decision boundary (separating hyperplane).


from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Grid search over the RBF kernel's gamma and the regularization parameter C
svc = SVC()
parameters = {'gamma': [0.0001, 0.001, 0.01, 0.1],
              'C': [0.01, 0.05, 0.5, 0.1, 1, 10, 15, 20]}
grid_search = GridSearchCV(svc, parameters)
grid_search.fit(pc_train, y_train)

# Evaluate the best estimator found by the grid search on the test set
svc_acc = accuracy_score(y_test, grid_search.predict(pc_test))


0.9674418604651163
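The hyperparameters selected by the grid search can be inspected afterwards; a small sketch (standard scikit-learn attributes, output not shown in the original post):

# Best (gamma, C) combination found by GridSearchCV and its mean cross-validated accuracy
print(grid_search.best_params_)
print(grid_search.best_score_)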


Model Score Comparison 
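The comparison chart itself is not reproduced here; a minimal sketch of how the accuracies computed above could be collected into one table (variable names follow the earlier snippets):

import pandas as pd

# Test-set accuracies of the three models reported above
scores = pd.DataFrame({
    "Model": ["Logistic Regression (pca_model)", "KNN", "SVC (grid search)"],
    "Test accuracy": [pca_model_acc, knn_acc, svc_acc],
})
print(scores.sort_values("Test accuracy", ascending=False))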



*pca_model = logistic regression model