Aarzoo Goel's other Models Reports


Used Car Price Prediction



Introduction:
Used car price prediction estimates the price of a used car from features such as the age of the car, kilometers driven, mileage, fuel type, owner type, and more.

New car prices are fixed by the manufacturer, with additional costs such as government taxes. Buying a new car therefore means investing a large amount of money, and customers sometimes do not have sufficient funds. As a result, used car sales are increasing, and there are many apps for buying and selling used cars. Used car price prediction uses trained models to estimate the best price for a car according to its market value.
This model helps a client estimate the market price of a car they want to sell or purchase.


Data:
The dataset used here was downloaded from Kaggle. It includes 14 features such as name, model, brand, owner type, mileage, kilometers driven, engine, power, and transmission.

You can download the dataset from: https://www.kaggle.com/avikasliwal/used-cars-price-prediction.


 


Data Pre Processing:
Data cleaning, data pre-processing, and exploratory data analysis have been applied to the data; each is discussed in more depth below.


Top 50 Car Names:
Maruti Swift, Honda City, Hyundai i20, Hyundai Verna, Toyota Innova, Hyundai i10, Maruti Wagon, Hyundai Grand, Volkswagen Polo, Maruti Alto, Mahindra XUV500, Volkswagen Vento, Honda Amaze, Toyota Fortuner, Ford Figo, BMW 3, Mercedes-Benz New, Hyundai Creta, Mercedes-Benz E-Class, Renault Duster, Audi A4, Hyundai Santro, Maruti Ertiga, Maruti Ciaz, BMW 5, Toyota Corolla, Maruti Ritz, Maruti Baleno, Hyundai EON, Mahindra Scorpio, Toyota Etios, Honda Brio, Land Rover, Hyundai Xcent, Maruti Celerio, Honda Jazz, Ford Ecosport, Audi A6, Skoda Superb, Skoda Rapid, Chevrolet Beat, Maruti Vitara, Ford EcoSport, Ford Fiesta, Tata Indica, Renault KWID, Ford Endeavour, Audi Q7, Maruti SX4, Nissan Micra, Other.

These are the top 50 car names in this use case, that is, the names with the most entries. Car names with fewer value counts are replaced with 'Other', which helps improve our mean squared error and percentage loss.
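The grouping described above can be sketched with pandas as follows. This is a minimal illustration, not the author's code: the helper name `keep_top_names`, the sample rows, and the step that shortens each name to its first two words (suggested by entries such as "Maruti Swift" and "Land Rover") are assumptions.

```python
import pandas as pd


def keep_top_names(names: pd.Series, top_n: int = 50, other: str = "Other") -> pd.Series:
    """Keep the top_n most frequent names and replace all others with `other`.

    Hypothetical helper written for illustration only.
    """
    top = names.value_counts().nlargest(top_n).index
    return names.where(names.isin(top), other)


# Made-up sample rows in the style of the dataset's Name column.
data = pd.DataFrame({
    "Name": [
        "Maruti Swift Dzire VDI",
        "Maruti Swift VXI",
        "Honda City 1.5 V",
        "Honda City ZX",
        "Tata Nano Cx",
    ]
})

# Assumption: shorten each name to its first two words (brand + model).
short_name = data["Name"].str.split().str[:2].str.join(" ")

# top_n=2 here only because the sample is tiny; the article uses 50.
data["Name"] = keep_top_names(short_name, top_n=2)
print(data["Name"].tolist())
```

With a real dataset, `top_n=50` would reproduce the cut-off used in this article.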


One more assumption is made for this dataset. Since Price is the actual price of the car and New_Price is a calculated price, Price was expected to be greater than New_Price; in this dataset, however, New_Price is greater than Price. So we treat Price as the price of the car and drop the New_Price values.
Data cleaning includes checking for null values, deleting duplicate rows, imputing null values, renaming columns, replacing blank spaces, and removing the string part from numeric columns, for example removing 'km/h' or 'kmpl' from the mileage column and 'CC' from the engine column, to obtain integer or float values that can be used in computation.


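# Strip the unit text; regex=False treats each pattern as a literal string
data['Mileage'] = data['Mileage'].str.replace('kmpl', '', regex=False)
data['Engine'] = data['Engine'].str.replace(' CC', '', regex=False)
data['Power'] = data['Power'].str.replace('bhp', '', regex=False)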
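After the unit text is removed, the columns still hold strings, so a conversion step is needed before they can be used in computation. A minimal sketch of that step, assuming `pd.to_numeric` is used and that unparseable entries should become missing values; the sample rows are made up for illustration:

```python
import pandas as pd

# Made-up sample rows; the real values come from the Kaggle dataset.
data = pd.DataFrame({
    "Mileage": ["26.6 kmpl", "19.67 kmpl", None],
    "Engine": ["998 CC", "1582 CC", "1199 CC"],
    "Power": ["58.16 bhp", "null bhp", "88.7 bhp"],
})

for col, unit in [("Mileage", "kmpl"), ("Engine", "CC"), ("Power", "bhp")]:
    # Remove the unit as a literal string and trim the leftover whitespace.
    stripped = data[col].str.replace(unit, "", regex=False).str.strip()
    # errors="coerce" turns anything non-numeric (e.g. "null") into NaN.
    data[col] = pd.to_numeric(stripped, errors="coerce")

print(data.dtypes)
```

Entries that become NaN here can then be handled by the null-value imputation mentioned above.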


Every dataset needs to be pre-processed first. Several pre-processing functions were written for this:

  • getCountOfMissingValuesPerColumn(): Gives the count of missing values per column.

          display(preprocessing.getCountOfMissingValuesPerColumn(data, exclude_Zero_percent=True))

  • convertObjectColumnsToCategory(): Converts every object-dtype column to the category type.

           preprocessing.convertObjectColumnsToCategory(cols)

  • convertFloatColumnsToInt64(): Converts every float-type column to integer.

  • dropMissingColumnsByThreshold(): Drops columns whose share of missing values exceeds a threshold; here, columns with more than 80% missing values are dropped.

  • printValueCountsOfCatagoricalColumns(): Prints the value counts of each categorical column.

  • LabelEncoding: Label encoding is applied to columns such as name, location, ownertype, fuel_type, and transmission.
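The label-encoding step in the list above could look like the following sketch. It assumes scikit-learn's `LabelEncoder` is used (the article does not say which implementation was chosen), and the column names and sample values are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample rows; column names are assumptions for illustration.
data = pd.DataFrame({
    "Fuel_Type": ["Diesel", "Petrol", "CNG", "Petrol"],
    "Transmission": ["Manual", "Automatic", "Manual", "Manual"],
})

encoders = {}
for col in ["Fuel_Type", "Transmission"]:
    encoder = LabelEncoder()
    # Classes are sorted alphabetically and mapped to 0, 1, 2, ...
    data[col] = encoder.fit_transform(data[col])
    # Keep the fitted encoder so the same mapping can be reused at prediction time.
    encoders[col] = encoder

print(data)
print(list(encoders["Fuel_Type"].classes_))
```

Keeping one fitted encoder per column matters when the model is later asked to price a new car, since the new row must be encoded with the same mapping.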






In the dataset, the New_Price feature has the most null values, so it is dropped by the threshold-based drop.
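A threshold-based drop of this kind can be sketched in a few lines of pandas. The two functions below are hypothetical stand-ins written for illustration; they are not the `preprocessing` module used in this article, and the sample rows are made up.

```python
import pandas as pd


def missing_percent_per_column(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values in each column (hypothetical helper)."""
    return df.isna().mean() * 100


def drop_columns_above_threshold(df: pd.DataFrame, threshold: float = 80.0) -> pd.DataFrame:
    """Drop columns whose missing percentage is above `threshold` (hypothetical helper)."""
    percent_missing = missing_percent_per_column(df)
    return df.loc[:, percent_missing <= threshold]


# Made-up sample: New_Price is missing in 5 of 6 rows (about 83%).
data = pd.DataFrame({
    "Price": [1.75, 12.5, 4.5, 6.0, 17.74, 2.35],
    "New_Price": [None, None, None, None, None, "8.61 Lakh"],
})

print(missing_percent_per_column(data))
cleaned = drop_columns_above_threshold(data, threshold=80.0)
print(list(cleaned.columns))
```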


EDA:

While exploring the data, we look at different combinations of features with the help of visuals. This helps us understand the dataset better and gives us some hints about patterns in the data.

Several kinds of graphs are used to understand the data, such as count plots, box plots, bar plots, and facet grids.




 


Feature Engineering:

Most of the feature engineering tasks are covered in data pre-processing. Here we updated the Name column.

Modeling:

We used several approaches here: Linear Regression, Random Forest Regressor, Gradient Boosting Regressor, and an ANN, with Grid Search CV used for hyperparameter tuning. The dataset is split into training and test sets of 80% and 20% respectively.
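An 80/20 split like this is commonly done with scikit-learn's `train_test_split`. The sketch below uses random placeholder arrays in place of the real features and prices, and the `random_state` value is an arbitrary choice made here for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the cleaned features (X) and prices (y).
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 4))
y = rng.uniform(size=100)

# test_size=0.2 keeps 80% of the rows for training and 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```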


Linear Regression:


Linear regression was the first type of regression analysis to be used extensively in practical applications. Models that depend linearly on their unknown parameters are easier to fit than models that depend non-linearly on their parameters, and the statistical properties of the resulting estimates are easier to determine. Regression can be used to identify the effect that one or more independent variables have on a dependent variable. It is used for problems such as predicting age, marketing spend, or income.
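As a small illustration of fitting a linear model, the sketch below generates synthetic data in which price falls linearly with age and recovers the slope and intercept with scikit-learn's `LinearRegression`. The data, the coefficients (10.0 and -0.5), and the variable names are invented for the example and do not come from the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: price = 10.0 - 0.5 * age + small noise (invented numbers).
rng = np.random.default_rng(0)
age = rng.uniform(0, 15, size=200)
price = 10.0 - 0.5 * age + rng.normal(0, 0.1, size=200)

# scikit-learn expects a 2-D feature matrix, hence the reshape.
model = LinearRegression().fit(age.reshape(-1, 1), price)

# The fitted slope shows the effect of the independent variable on the target.
print(model.coef_[0], model.intercept_)
```

The fitted coefficient should come out close to -0.5 and the intercept close to 10.0, since the noise is small relative to the signal.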


Grid Search CV:

This is a method for tuning the hyperparameters of a supervised learning model to improve its general performance. With Grid Search, we try all possible combinations of the hyperparameter values of interest and pick the best one. We first specify the parameters and values we want to search, and GridSearchCV then performs all the necessary model fits using cross-validation.
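A minimal sketch of this procedure with scikit-learn's `GridSearchCV` is shown below. The estimator (a gradient boosting regressor), the grid values, the 3-fold setting, and the synthetic data are all assumptions made for the example; the article does not state which estimator or grid was actually tuned.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the car features and prices.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(120, 3))
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, size=120)

# 2 x 2 = 4 combinations; each is fitted once per cross-validation fold.
param_grid = {"n_estimators": [20, 50], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` is the model refitted on all the data with the best combination, and it can be used directly for prediction.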


Randomized Search CV:

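Instead of trying every combination, Randomized Search samples a fixed number of hyperparameter combinations. This is very useful when there are many hyperparameter values to try and the training time is long.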
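The same idea can be sketched with scikit-learn's `RandomizedSearchCV`. As before, the estimator, the sampling ranges, `n_iter=5`, and the synthetic data are illustrative assumptions rather than the settings used in this article.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for the car features and prices.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(120, 3))
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, size=120)

# Distributions to sample from, instead of a fixed grid of values.
param_distributions = {
    "n_estimators": randint(10, 40),
    "max_depth": randint(2, 8),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=5,          # only 5 sampled combinations are evaluated
    cv=3,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
```

The cost is controlled by `n_iter` rather than by the size of the search space, which is what makes this approach practical when each fit is slow.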


 


We get better accuracy from Grid Search CV, 88%, so the resulting model can be used to predict the price of a car.



Random forest Regressor:

Random forests can be used for both regression and classification. A random forest reduces overfitting by building its trees on random subsets of the data, which is why it is a good model for this type of problem. The use case here is not a classification problem, since the price is continuous and changes for every new row. Here we calculate the MSE, MAE, RMSE, and R^2 score for the Random Forest Regressor.
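These four metrics can be computed with scikit-learn as in the sketch below. The synthetic data and the model settings are assumptions made for the example, so the printed numbers say nothing about the results obtained on the car dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the car features and prices.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions)   # mean of squared errors
mae = mean_absolute_error(y_test, predictions)  # mean of absolute errors
rmse = float(np.sqrt(mse))                      # same units as the price
r2 = r2_score(y_test, predictions)              # 1.0 is a perfect fit

print(f"MSE={mse:.3f} MAE={mae:.3f} RMSE={rmse:.3f} R^2={r2:.3f}")
```

RMSE and MAE are expressed in the same units as the target, which makes them easier to interpret as a typical pricing error than MSE.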

