Hashwanth Gogineni's other Models Reports

Lung Cancer Detection

Model Overview

Lung cancer

Lung cancer is a cancer that starts in the lungs and can spread to other parts of the body. Your lungs are two spongy organs in your chest that take in oxygen when you inhale and release carbon dioxide when you exhale. Lung cancer is the leading cause of cancer deaths worldwide. It is most common in people who smoke, although it can also occur in people who have never smoked. The risk of lung cancer increases with the length of time and the number of cigarettes you have smoked. If you quit smoking, even after smoking for many years, you can significantly reduce your risk of developing lung cancer.


In its early stages, lung cancer usually causes no signs or symptoms. They typically appear only once the disease has progressed.

Signs and symptoms of lung cancer may include:

  • Cough that doesn't go away

  • Coughing up blood

  • Shortness of breath

  • Chest pain

  • Hoarseness

  • Losing weight without trying

  • Bone pain

  • Headache

Why Lung Cancer Detection?

This project can help healthcare organizations detect cancer in patients' lungs.


The dataset contains 3 classes, each with 5,000 images:

  • Lung adenocarcinoma

  • Lung benign

  • Lung squamous cell carcinoma

This gives a total of 15,000 images in the dataset.

Convolutional Neural Networks (ConvNets)

Convolutional Neural Networks are very similar to ordinary Neural Networks: they are made up of neurons with learnable weights and biases. Each neuron receives some inputs, computes a dot product, and optionally follows it with a non-linearity. The whole network still expresses a single differentiable score function, from the raw image pixels at one end to class scores at the other. ConvNets still have a loss function (e.g. SVM/Softmax) on the last (fully-connected) layer, and all of the tips and tricks developed for training ordinary Neural Networks still apply.

So, what's new? ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. As a result, the forward function is more efficient to implement, and the number of parameters in the network is greatly reduced.
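To make the parameter reduction concrete, here is a small sketch that compares a fully-connected layer with a convolutional layer on the same input. The 224 x 224 RGB input and the choice of 64 units/filters are assumptions picked for illustration, and the helper names are hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully-connected layer."""
    return n_inputs * n_units + n_units


def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus biases of a 2D convolution (weights are shared across positions)."""
    return kernel_h * kernel_w * in_channels * filters + filters


# Assumed input: a 224 x 224 image with 3 colour channels
n_inputs = 224 * 224 * 3

fc = dense_params(n_inputs, 64)      # every pixel connected to every unit
conv = conv2d_params(3, 3, 3, 64)    # 64 filters of size 3 x 3 x 3

print("fully-connected:", fc)
print("convolutional:  ", conv)
```

Because a convolutional filter reuses the same small set of weights at every image position, its parameter count does not depend on the image size at all.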

Understanding Code

First, let us import the required libraries for our project.

import os
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from sklearn.metrics import classification_report

Now, let us load the data into our system and convert the data into a dataframe.

image_dir = Path('/content/sample_data/lung_colon_image_set/lung_image_sets')

# Get filepaths and labels
filepaths = list(image_dir.glob(r'**/*.jpeg'))
labels = list(map(lambda x: os.path.split(os.path.split(x)[0])[1], filepaths))

filepaths = pd.Series(filepaths, name='Filepaths').astype(str)
labels = pd.Series(labels, name='Labels')

# Concatenate filepaths and labels
image_df = pd.concat([filepaths, labels], axis=1)

# Shuffle the DataFrame and reset index
image_df = image_df.sample(frac=1).reset_index(drop = True)

# Show the result
image_df.head()

As you can see, we collected the image file paths from the dataset directory, took each label from the name of the folder containing the image, and concatenated 'Filepaths' and 'Labels' into a dataframe.
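The label extraction above relies on each image sitting in a folder named after its class. A minimal sketch of the same idea using only pathlib is shown below; the example paths and folder names are hypothetical and only illustrate the layout.

```python
from pathlib import PurePosixPath


def label_from_path(filepath):
    """Return the name of the folder that directly contains the file."""
    return PurePosixPath(filepath).parent.name


# Hypothetical file paths; the folder names are illustrative only
example_paths = [
    "lung_image_sets/class_a/img_0001.jpeg",
    "lung_image_sets/class_b/img_0002.jpeg",
]

labels = [label_from_path(p) for p in example_paths]
print(labels)
```

This is equivalent in effect to the nested 'os.path.split' calls, and may be easier to read.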

Let us also split the dataframe for testing and training purposes.

# Separating train and test data
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(image_df, train_size=0.85, shuffle=True, random_state=1)

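As you can see, I used the "train_test_split" function to keep 85% of the dataframe for training and hold out the remaining 15% for testing. Setting 'random_state=1' makes the split reproducible.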

train_datagen = ImageDataGenerator(rescale=1./255, validation_split=0.15)

test_datagen = ImageDataGenerator(rescale=1./255)

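As you can see, I used the 'ImageDataGenerator' class to rescale pixel values to the range [0, 1] and to reserve 15% of the training data for validation. No augmentation transforms (such as flips or rotations) are applied here.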
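As a rough check on how many images end up in each subset, the arithmetic can be sketched as follows. This assumes all 15,000 images are found on disk and that the validation subset is carved out of the training dataframe via 'validation_split=0.15'; the exact counts may differ by one depending on how each library rounds.

```python
total_images = 15_000  # 3 classes x 5,000 images, as stated above

# train_test_split with train_size=0.85
n_train_df = total_images * 85 // 100
n_test = total_images - n_train_df

# validation_split=0.15 applied to the training dataframe
n_val = n_train_df * 15 // 100
n_train = n_train_df - n_val

print("training:  ", n_train)
print("validation:", n_val)
print("test:      ", n_test)
```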


train_images = train_datagen.flow_from_dataframe(

val_images = train_datagen.flow_from_dataframe(
target_size=(224, 224),

test_images = test_datagen.flow_from_dataframe(

I then created the training, validation, and test generators with the 'flow_from_dataframe' method, which reads the images listed in each dataframe and resizes them to 224 x 224.
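The 'flow_from_dataframe' calls above are shown only in part. As a hedged sketch of what the arguments might look like, the hypothetical helper below builds the keyword arguments as a plain dictionary. The column names and target size come from the code above; 'class_mode', 'batch_size', and the 'shuffle'/'subset' choices are assumptions, not the author's confirmed settings.

```python
def flow_kwargs(dataframe, subset=None, shuffle=True):
    """Build keyword arguments for ImageDataGenerator.flow_from_dataframe."""
    return {
        "dataframe": dataframe,
        "x_col": "Filepaths",          # column holding the image paths
        "y_col": "Labels",             # column holding the class names
        "target_size": (224, 224),     # matches the model's input shape
        "class_mode": "categorical",   # assumption: one-hot labels for a softmax output
        "batch_size": 32,              # assumption: the Keras default
        "shuffle": shuffle,
        "subset": subset,              # "training", "validation", or None
    }


# Hypothetical usage with the generators and dataframes defined earlier:
# train_images = train_datagen.flow_from_dataframe(**flow_kwargs(train_df, subset="training"))
# val_images = train_datagen.flow_from_dataframe(**flow_kwargs(train_df, subset="validation"))
# test_images = test_datagen.flow_from_dataframe(**flow_kwargs(test_df, shuffle=False))
```

Disabling shuffling for the test generator keeps the order of predictions aligned with the order of the true labels, which matters when computing a classification report later.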

Next, let us get into the modelling part of the project.

input_shape = (224, 224, 3)
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu', input_shape=input_shape),
    tf.keras.layers.MaxPool2D(pool_size=(2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPool2D(pool_size=(2, 2)),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPool2D(pool_size=(2, 2)),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPool2D(pool_size=(2, 2)),
    # Flatten the feature maps into a vector before the dense layers
    tf.keras.layers.Flatten(),
    # Dropout layers are omitted from this listing because their rates are not given
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax')
])


Here I used 4 'Conv2D' layers, each followed by a 'MaxPool2D' layer, then 1 'Flatten' layer, 2 'Dropout' layers, and 4 'Dense' layers. The final 'Dense' layer has 3 units with a softmax activation, one unit per class.
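To see how the image shrinks as it passes through the convolution and pooling stack, the sketch below traces the spatial size block by block. It assumes the Keras defaults for these layers ('valid' padding and stride 1 for 'Conv2D', pool stride equal to pool size for 'MaxPool2D') and that the flatten layer directly follows the last pooling layer.

```python
def conv_out(size, kernel=3):
    """Output size of a 'valid' convolution with stride 1."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Output size of max pooling with stride equal to the pool size."""
    return size // pool


size = 224                          # input height and width
filters_per_block = [64, 64, 32, 32]

shapes = []
for filters in filters_per_block:
    size = pool_out(conv_out(size))  # one Conv2D followed by one MaxPool2D
    shapes.append((size, size, filters))

# Number of features handed to the first Dense layer after flattening
flat_features = size * size * filters_per_block[-1]

for shape in shapes:
    print(shape)
print("flattened features:", flat_features)
```

Under these assumptions the 224 x 224 input is reduced to a 12 x 12 x 32 feature map, so the first dense layer receives 4,608 features rather than the 150,528 raw pixel values.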

Now, let us compile our model and fit the data.


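# The optimizer is not stated in the text, so the Keras default is used here
model.compile(loss='categorical_crossentropy', metrics=['accuracy'])

callback = tf.keras.callbacks.EarlyStopping(monitor='accuracy', patience=2)

history = model.fit(train_images, validation_data=val_images, epochs=25, callbacks=[callback])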

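As you can see, I used 'categorical_crossentropy' as the loss function and 'accuracy' as the metric. The 'EarlyStopping' callback stops training once the training accuracy has failed to improve for 2 consecutive epochs.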
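The effect of 'patience=2' can be illustrated with a simplified re-implementation of the stopping rule. This is only a sketch of the idea, not the Keras source code, and the accuracy values are made up for the example.

```python
def early_stopping_epoch(values, patience):
    """Return the 0-based epoch at which training would stop, or None if it never stops.

    Training stops once the monitored value has failed to beat its best
    value so far for `patience` consecutive epochs.
    """
    best = float("-inf")
    wait = 0
    for epoch, value in enumerate(values):
        if value > best:
            best = value
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up training accuracies, one per epoch
accuracy_per_epoch = [0.80, 0.90, 0.93, 0.92, 0.93, 0.95]
stop_epoch = early_stopping_epoch(accuracy_per_epoch, patience=2)
print("training would stop after epoch index:", stop_epoch)
```

In this example the accuracy peaks at 0.93, then fails to exceed it for two epochs in a row, so training stops before the later 0.95 is ever reached. Monitoring 'val_accuracy' instead of 'accuracy' is a common alternative when the goal is to limit overfitting.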

Now let us understand how our model performed.

get_acc = history.history['accuracy']
value_acc = history.history['val_accuracy']
get_loss = history.history['loss']
validation_loss = history.history['val_loss']

epochs = range(len(get_acc))
plt.plot(epochs, get_acc, 'r', label='Accuracy of Training data')
plt.plot(epochs, value_acc, 'b', label='Accuracy of Validation data')
plt.title('Training vs validation accuracy')
plt.legend()  # show the curve labels
plt.show()

Also, let us have a look at our model's classification report.

# Classification Report

# True class indices; they line up with the predictions only if the test generator does not shuffle
test_labels = test_images.classes

predictions = model.predict(test_images, verbose=1)
y_pred = np.argmax(predictions, axis=-1)
print(classification_report(test_labels, y_pred))

In the report, '0' represents 'Lung adenocarcinoma', '1' represents 'Lung benign', and '2' represents 'Lung squamous cell carcinoma'.
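The step from softmax probabilities to class indices and names can be sketched in plain Python as follows. The probability rows and true labels are made-up values for illustration, and the index order follows the mapping described above.

```python
class_names = [
    "Lung adenocarcinoma",            # index 0
    "Lung benign",                    # index 1
    "Lung squamous cell carcinoma",   # index 2
]


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])


# Made-up softmax outputs, one row per image
predictions = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
]
true_labels = [0, 1, 2, 2]  # made-up ground truth

y_pred = [argmax(row) for row in predictions]
predicted_names = [class_names[i] for i in y_pred]
accuracy = sum(p == t for p, t in zip(y_pred, true_labels)) / len(true_labels)

print(predicted_names)
print("accuracy:", accuracy)
```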

Thank you for your time.