PRE-OWNED-CAR-PRICE-PREDICTION

Model Overview

1. PROBLEM GOING TO SOLVE

Car price prediction has been a research area of high interest, as it requires noticeable effort and knowledge from a field expert. A considerable number of distinct attributes must be examined for a reliable and accurate prediction.

To build a model for predicting the price of used cars, we use regression algorithms.

The performances of the different algorithms were then compared to find the one that best suits the available data set. The final prediction model was integrated into a Python application. Furthermore, the model is evaluated on test data with suitable performance metrics.

Problem statement

Given the features of a car, predict the selling price of that used car.

2. WHO CAN USE IT

Price prediction for a car, especially when the vehicle is used or pre-owned rather than coming straight from the factory, is a critical and important task. With the increase in demand for used cars, more and more vehicle buyers are considering them as an alternative to buying new cars.

There is a need for an accurate price prediction mechanism for used cars.

Machine learning prediction techniques can be helpful in this regard.

This model can be used by industries or clients such as used-car selling platforms like CARS24, CarDekho, OLX, CarTrade, True Value, etc. It is useful not only for companies but also for individuals who want to buy or sell a used car and need to know its approximate price.

3. MODELING

3.1. DATA SOURCE

3.1.1. DATASET

The data is scraped from online used-car websites. Since the data is scraped from a live website, it is uncleaned.

3.1.2. DATASET FEATURES DESCRIPTION

1. NAME --> Car model name along with the purchase year of the car.
2. RATING --> Rating given during the car inspection by the Cars24 team.
3. CITY --> City where the car is advertised; the city is given in code format.
4. KILOMETRES --> How many kilometres the car had been driven before the advertisement was placed.
5. YEAR OF PURCHASE --> Original date of purchase of the car.
6. OWNER --> How many previous owners the car had before being sold.
7. FUEL TYPE --> Type of fuel the car runs on (petrol, diesel, ...).
8. TRANSMISSION --> Whether the car has automatic or manual transmission.
9. RTO --> The RTO under which the car is registered.
10. INSURANCE --> Expiry date of the insurance, if any.
11. INSURANCE_TYPE --> Type of insurance availed by the owner.
12. PRICE --> Price of the used car.

3.2. DATA PREPROCESSING

Data Analysis & Feature Engineering

DATA USED

- The given data contains a lot of noise and many missing values, and is not directly usable.
- All of the data was cleaned according to its data type by applying the relevant functions.
- All of the missing values were treated with imputers.


Since the data is scraped, it is uncleaned. Some of the features are in string format but contain numeric data, and some features contain inappropriate symbols and letters, so those features have to be cleaned.
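As a rough illustration of this cleaning step, the sketch below strips non-numeric characters from string columns and pulls the purchase year out of NAME. Only the column names come from the feature description above; the raw value formats, the file name used_cars_scraped.csv, and the helper to_numeric are assumptions.

    import pandas as pd

    # Hypothetical file name; the raw formats (e.g. "45,000 km") are assumptions.
    df = pd.read_csv("used_cars_scraped.csv")

    def to_numeric(series: pd.Series) -> pd.Series:
        # Strip every character except digits and the decimal point, then cast.
        return pd.to_numeric(
            series.astype(str).str.replace(r"[^0-9.]", "", regex=True),
            errors="coerce",
        )

    for col in ["KILOMETRES", "PRICE"]:
        df[col] = to_numeric(df[col])

    # NAME holds the model name together with the purchase year; pull the year out.
    df["PURCHASE_YEAR"] = df["NAME"].str.extract(r"(\d{4})", expand=False).astype(float)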

MISSING VALUES IN THE DATASET

[Figure: missing value counts per feature]


The missing values are treated with mean, median, or mode imputers, as per the data types of the features.
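A minimal sketch of that imputation step, assuming the cleaned frame df from the previous snippet: numeric columns get the median, and categorical columns get the mode (most frequent value).

    import numpy as np
    from sklearn.impute import SimpleImputer

    num_cols = df.select_dtypes(include=np.number).columns
    cat_cols = df.select_dtypes(exclude=np.number).columns

    # Median for numeric features, mode (most frequent) for categorical ones.
    df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])
    df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])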

PAIR PLOT

[Figure: pair plot of the numerical features]

HISTOGRAMS FOR NUMERICAL FEATURES

[Figure: histograms of the numerical features]


3.3. EVALUATED MODELS

Here we trained the data with different models along with hyperparameter tuning so that we could get the best out of each model; a minimal sketch of this workflow follows the list below.

The machine learning regression models used are:
1. Linear Regression
2. Polynomial Regression
3. Decision Tree Regressor
4. Random Forest Regressor
5. Ridge Regressor
6. Lasso Regressor
7. XGBoost Regressor
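The sketch below illustrates the kind of workflow this implies: each candidate model is tuned with a small grid search and scored with cross-validated R-square. The grids, the train/test split, and the feature matrix X / target y are illustrative assumptions rather than the exact settings used here; polynomial regression would be added via a PolynomialFeatures pipeline.

    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.linear_model import LinearRegression, Ridge, Lasso
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor
    from xgboost import XGBRegressor

    # X = engineered features (assumed already numerically encoded), y = PRICE target.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    candidates = {
        "linear":  (LinearRegression(), {}),
        "ridge":   (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
        "lasso":   (Lasso(), {"alpha": [0.001, 0.01, 0.1]}),
        "tree":    (DecisionTreeRegressor(), {"max_depth": [5, 10, 20]}),
        "forest":  (RandomForestRegressor(), {"n_estimators": [100, 300]}),
        "xgboost": (XGBRegressor(objective="reg:squarederror"),
                    {"n_estimators": [200, 500], "max_depth": [4, 6]}),
    }

    best = {}
    for name, (model, grid) in candidates.items():
        # 5-fold grid search on the training split, scored by R-square.
        search = GridSearchCV(model, grid, cv=5, scoring="r2")
        search.fit(X_train, y_train)
        best[name] = search.best_estimator_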

3.4. EVALUATION

After training the data with all of the above models, I evaluated them with suitable regression metrics in order to choose the best model.
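The write-up only names R-square explicitly; the sketch below assumes the usual regression metrics (R-square, MAE, RMSE) on the held-out test split, reusing the best dict from the previous snippet.

    import numpy as np
    from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

    for name, model in best.items():
        pred = model.predict(X_test)
        print(f"{name:8s}  R2={r2_score(y_test, pred):.3f}  "
              f"MAE={mean_absolute_error(y_test, pred):.0f}  "
              f"RMSE={np.sqrt(mean_squared_error(y_test, pred)):.0f}")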


BEST MODEL


I have used almost all of the common regression algorithms in my modelling, such as Linear Regression, Polynomial Regression, Lasso and Ridge Regression, XGBoost, Decision Tree, and Random Forest.


Using all of the above algorithms, I obtained an R-square above 85% on both train and test data with some of the algorithms, and the best R-square is with the "XG BOOST REGRESSOR" model, which does not suffer from high bias or high variance.
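A minimal sketch of fitting the selected model, reusing the split from the earlier sketch; the hyperparameter values are placeholders rather than the tuned values behind the reported scores.

    from xgboost import XGBRegressor

    xgb_model = XGBRegressor(
        objective="reg:squarederror",
        n_estimators=500,
        max_depth=6,
        learning_rate=0.05,
        subsample=0.8,
    )
    xgb_model.fit(X_train, y_train)

    # Comparable train/test R-square suggests neither high bias nor high variance.
    print("Train R2:", xgb_model.score(X_train, y_train))
    print("Test  R2:", xgb_model.score(X_test, y_test))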



3.5. USED MODEL


What is XGBoost?

XGBoost stands for eXtreme Gradient Boosting.

XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework. In prediction problems involving unstructured data (images, text, etc.), artificial neural networks tend to outperform all other algorithms or frameworks. However, when it comes to small-to-medium structured/tabular data, decision-tree-based algorithms are considered best-in-class right now. Please see the chart below for the evolution of tree-based algorithms over the years.

[Chart: evolution of tree-based algorithms over the years]

How to build an intuition for XGBoost?

Decision trees, in their simplest form, are easy-to-visualize and fairly interpretable algorithms, but building intuition for the next generation of tree-based algorithms can be a bit tricky. See below for a simple analogy to better understand the evolution of tree-based algorithms.

Imagine that you are a hiring manager interviewing several candidates with excellent qualifications. Each step of the evolution of tree-based algorithms can be viewed as a version of the interview process.

1. Decision Tree: Every hiring manager has a set of criteria such as education level, number of years of experience, and interview performance. A decision tree is analogous to a hiring manager interviewing candidates based on his or her own criteria.

2. Bagging: Now imagine that instead of a single interviewer, there is an interview panel where each interviewer has a vote. Bagging or bootstrap aggregating involves combining inputs from all interviewers for the final decision through a democratic voting process.

3. Random Forest: It is a bagging-based algorithm with a key difference wherein only a subset of features is selected at random. In other words, every interviewer will only test the interviewee on certain randomly selected qualifications (e.g. a technical interview for testing programming skills and a behavioral interview for evaluating non-technical skills).

4. Boosting: This is an alternative approach where each interviewer alters the evaluation criteria based on feedback from the previous interviewer. This ‘boosts’ the efficiency of the interview process by deploying a more dynamic evaluation process.

5. Gradient Boosting: A special case of boosting where errors are minimized by a gradient descent algorithm, e.g. the strategy consulting firms leverage by using case interviews to weed out less qualified candidates.

6. XGBoost: Think of XGBoost as gradient boosting on ‘steroids’ (well, it is called ‘Extreme Gradient Boosting’ for a reason!). It is a perfect combination of software and hardware optimization techniques to yield superior results using fewer computing resources in the shortest amount of time.

Why does XGBoost perform so well?

XGBoost and Gradient Boosting Machines (GBMs) are both ensemble tree methods that apply the principle of boosting weak learners (CARTs generally) using the gradient descent architecture. However, XGBoost improves upon the base GBM framework through systems optimization and algorithmic enhancements.


System Optimization:

1. Parallelization: XGBoost approaches the process of sequential tree building using a parallelized implementation. This is possible due to the interchangeable nature of the loops used for building base learners: the outer loop that enumerates the leaf nodes of a tree, and the second inner loop that calculates the features. This nesting of loops limits parallelization because without completing the inner loop (the more computationally demanding of the two), the outer loop cannot be started. Therefore, to improve run time, the order of loops is interchanged using initialization through a global scan of all instances and sorting using parallel threads. This switch improves algorithmic performance by offsetting any parallelization overheads in computation.

2. Tree Pruning: The stopping criterion for tree splitting within the GBM framework is greedy in nature and depends on the negative loss criterion at the point of split. XGBoost instead uses the ‘max_depth’ parameter as specified and starts pruning trees backward. This ‘depth-first’ approach improves computational performance significantly.

3. Hardware Optimization: This algorithm has been designed to make efficient use of hardware resources. This is accomplished by cache awareness, allocating internal buffers in each thread to store gradient statistics. Further enhancements such as ‘out-of-core’ computing optimize available disk space while handling big data frames that do not fit into memory.

Algorithmic Enhancements:

1. Regularization: It penalizes more complex models through both LASSO (L1) and Ridge (L2) regularization to prevent overfitting.

2. Sparsity Awareness: XGBoost naturally admits sparse features for inputs by automatically ‘learning’ the best missing value depending on training loss, and it handles different types of sparsity patterns in the data more efficiently.

3. Weighted Quantile Sketch: XGBoost employs the distributed weighted quantile sketch algorithm to effectively find the optimal split points among weighted datasets.

4. Cross-validation: The algorithm comes with a built-in cross-validation method at each iteration, taking away the need to explicitly program this search and to specify the exact number of boosting iterations required in a single run.
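A short sketch of how these knobs surface in the library's native API, assuming a DMatrix built from the training split above; the parameter values are illustrative only.

    import xgboost as xgb

    dtrain = xgb.DMatrix(X_train, label=y_train)
    params = {
        "objective": "reg:squarederror",
        "max_depth": 6,       # trees grown to max_depth, then pruned backward
        "eta": 0.05,          # learning rate
        "reg_alpha": 0.1,     # L1 (LASSO) regularization
        "reg_lambda": 1.0,    # L2 (Ridge) regularization
    }

    # Built-in cross-validation with early stopping picks the boosting rounds.
    cv_results = xgb.cv(params, dtrain, num_boost_round=1000, nfold=5,
                        metrics="rmse", early_stopping_rounds=25, seed=42)
    print("Boosting rounds kept:", len(cv_results))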

Where is the proof?

We used Scikit-learn's make_classification function to create a random sample of 1 million data points with 20 features (2 informative and 2 redundant). We tested several algorithms such as Logistic Regression, Random Forest, standard Gradient Boosting, and XGBoost.
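A rough sketch of that benchmark (with the sample size scaled down so it runs quickly); the timings and exact model settings of the original comparison are not reproduced here.

    from time import perf_counter
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from xgboost import XGBClassifier

    # Smaller sample than the original 1M-point benchmark, same feature setup.
    X_bench, y_bench = make_classification(n_samples=100_000, n_features=20,
                                           n_informative=2, n_redundant=2,
                                           random_state=0)

    for clf in [LogisticRegression(max_iter=1000), RandomForestClassifier(),
                GradientBoostingClassifier(), XGBClassifier()]:
        start = perf_counter()
        clf.fit(X_bench, y_bench)
        print(f"{type(clf).__name__}: fit in {perf_counter() - start:.1f}s")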

[Chart: prediction performance vs. training time for each algorithm]

As demonstrated in the chart above, the XGBoost model has the best combination of prediction performance and processing time compared to the other algorithms. Other rigorous benchmarking studies have produced similar results. No wonder XGBoost is widely used in recent data science competitions.


ADVANTAGES


The system is effective since it estimates prices from combinations of vehicle attributes.


Easy to use (user friendly).

DISADVANTAGES


It gives approximate values, not exact values.


Though it is easy to use, some features such as Rating require input from domain experts.



Conclusion


Vehicle price prediction can be a challenging task due to the large number of attributes that must be considered for an accurate prediction. The collection and preprocessing of data is the major step in prediction. In this project, data pre-processing is performed to clean the data, which avoids unnecessary noise for the machine learning algorithms, and the cleaning process improves prediction performance. However, the insufficient and complex data set is the drawback here: we obtain only about 85 percent when applying the machine learning algorithms. Therefore, we propose to use more features in the data to gain a higher R-square and to achieve 95 percent efficiency. Although this system has achieved valuable performance in vehicle price prediction, our aim for future work is to test this system on various data sets.

