About

About This Project

The data set was obtained from a non-profit Kaggle project that periodically scrapes Craigslist for every used vehicle listing within the United States. The scraped data is uploaded to Kaggle and contains most of the relevant information Craigslist provides on car sales, including columns such as 'price', 'odometer', 'year', 'model', 'manufacturer', 'condition', 'title status', 'latitude/longitude', and 17 other categories. Data scraping was most recently performed in January 2021. In total, the original .csv file contains 458,213 vehicle listings.


The following steps were then performed and shall be discussed:

  • Data Cleaning (Identifying Null Values, Filling-In Missing Values & Removing Outliers) Using pandas, NumPy, & seaborn
  • Data Preprocessing (Standardization or Normalization) & Splitting
  • Training & Testing the Data Using 8 Algorithm Models Obtained from the scikit-learn, yellowbrick, & XGBoost ML Libraries in Python
  • Comparison of Each ML Model's Performance
  • Raw Data Visual Analysis Using matplotlib
  • Conclusions - Determination of Accurate Price Prediction Model

Since this study concerns price prediction, it is important to note that the price distribution is skewed right, as shown in the distplot of the actual data set below. For any fixed value of X (the independent / predictor variable), predictions of Y (price, the dependent / target variable) would therefore be biased toward values higher than the actual prices and should be corrected. To address this, a log transformation is used to scale the price, which helps generate more accurate predictions of the actual target values. For this reason, ML model accuracies are evaluated using Root Mean Squared Log Error (RMSLE) and the Coefficient of Determination (R²).
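Below is a minimal sketch of this log transformation, assuming the listings have been loaded into a pandas DataFrame named df from a hypothetical vehicles.csv file; the later snippets build on these names.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed file name; the Kaggle download may be named differently
df = pd.read_csv("vehicles.csv")

# Log-transform the right-skewed price column; log1p handles zero prices safely
df["log_price"] = np.log1p(df["price"])

# Compare the raw and log-scaled distributions side by side
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.histplot(df["price"], ax=axes[0]).set_title("Raw price (right-skewed)")
sns.histplot(df["log_price"], ax=axes[1]).set_title("Log-scaled price")
plt.tight_layout()
plt.show()
```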

Data Preparation

Cleaning and Preprocessing

Irrelevant Feature Removal

Removal of 'url', 'region_url', 'vin', 'image_url', 'description', 'county', and 'state' columns
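A short sketch of this step, continuing with the df DataFrame assumed above:

```python
# Drop the columns that do not help predict price
drop_cols = ["url", "region_url", "vin", "image_url", "description", "county", "state"]
df = df.drop(columns=drop_cols)
```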

Missing Values Filled-In

The number of missing values in each column is shown below.

These values were filled using variants of the IterativeImputer method, which estimates each feature from all of the others using different regression models. Mean and median imputation, along with four of these estimators, were compared for the purpose of missing-value imputation. The estimators include 'BayesianRidge', which is based on regularized linear regression; 'DecisionTreeRegressor', which accounts for non-linear relationships; 'ExtraTreesRegressor', which effectively imputes missing values in mixed-type data that may involve continuous and/or categorical features with complex interactions and non-linear relations; and 'KNeighborsRegressor', a nearest-neighbor imputation method. Effectiveness was judged by the lowest MSE. As the figure below shows, 'BayesianRidge' produced the least error, so it was chosen to fill the missing values.
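The following sketch illustrates how such a comparison could be set up on the numeric columns; the hyperparameters shown are illustrative and the original notebook's exact configuration may differ.

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor

numeric = df.select_dtypes(include="number")

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "BayesianRidge": IterativeImputer(estimator=BayesianRidge(), random_state=0),
    "DecisionTree": IterativeImputer(estimator=DecisionTreeRegressor(max_depth=10), random_state=0),
    "ExtraTrees": IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=10), random_state=0),
    "KNeighbors": IterativeImputer(estimator=KNeighborsRegressor(n_neighbors=5), random_state=0),
}

# In the study, each imputer's MSE was compared (e.g. by masking known values);
# BayesianRidge gave the lowest error, so it is used to fill the missing values here.
df[numeric.columns] = imputers["BayesianRidge"].fit_transform(numeric)
```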

Outlier Removal

The interquartile range (IQR) for 'price', 'odometer', and 'year' was visualized using the boxplots seen below. Any value more than 1 ½ times the interquartile range above the third quartile or below the first quartile was deemed an outlier and eliminated. The box plots show that for price, any listing whose log price falls below 6.55 or above 11.55 is considered an outlier. The box plot for odometer does not visually show the interquartile range because of extreme outliers in this feature; those bounds were calculated and the outliers eliminated. The boxplot for year identifies 1996 as the Q1 boundary for older vehicles. Since no vehicles are newer than 2020/2021, only extremely old vehicles could act as outliers for this variable.
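A sketch of the IQR filter, applied to the log-scaled price as described above; the remove_iqr_outliers helper is a name introduced here for illustration.

```python
# Keep only rows within 1.5 * IQR of the first and third quartiles
def remove_iqr_outliers(frame, column):
    q1, q3 = frame[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return frame[(frame[column] >= lower) & (frame[column] <= upper)]

for col in ["log_price", "odometer", "year"]:
    df = remove_iqr_outliers(df, col)
```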

Cleaned Data Set

We began with 458,213 rows and 25 columns of data and removed 62,231 rows and 7 columns to end up with 395,982 rows and 18 columns, 16 columns of which were utilized by ML algorithms.

Label Encoding

Our data set contains 12 categorical variables and 4 numerical variables, excluding the price column. In order to apply the ML models, the categorical variables needed to be transformed into numerical variables. scikit-learn's LabelEncoder was applied for this purpose.
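A minimal sketch of the encoding step, assuming the remaining categorical columns are of pandas object dtype:

```python
from sklearn.preprocessing import LabelEncoder

# Encode each categorical column into integer labels
categorical_cols = df.select_dtypes(include="object").columns
for col in categorical_cols:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))
```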

Normalization

Since the data set is not normally distributed, the features all have different ranges. Feature scaling is essential for ML algorithms that calculate distances between data points; if features are not scaled, a feature with a larger value range starts to dominate the distance calculations. Therefore, the ranges of all features should be normalized so that each feature contributes approximately proportionately to the final distance. scikit-learn's MinMaxScaler was applied for this process.
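A sketch of the scaling step, assuming the log-scaled price is kept as the target and the remaining columns form the feature matrix:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Scale every feature into the [0, 1] range; the log-scaled price is the target
X = df.drop(columns=["price", "log_price"])
y = df["log_price"]
X = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)
```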

Split the Data

During this step, 90% of the data was used as training data and the remaining 10% was used as test data.
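A minimal sketch of the split, reusing the X and y defined above:

```python
from sklearn.model_selection import train_test_split

# 90% of the rows for training, 10% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42
)
```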

Machine Learning

Training and Testing the Data

Linear Regression

MSLE : 0.0027

Root MSLE : 0.0523

R² Score : 0.6316 or 63.16%


In statistics, linear regression is a linear approach to modeling the relationship between a scalar response (dependent variable) and one or more explanatory variables (independent variables). In linear regression, the relationships are modeled using linear predictor functions whose unknown model parameters are estimated from the data. Such models are called linear models.



The performance of linear regression is determined by the differences between the actual and predicted values. While the model presents a seemingly good fit, it only achieved an R² score of 63%, so other models may provide better prediction results. Additionally, linear regression considers 'year', 'odometer', 'fuel', and 'cylinders' to be the most important variables in predicting price, as viewed in the feature importance graph.
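A sketch of how this baseline could be fit and scored, reusing the train/test split from above; the exact metric computation in the original notebook may differ slightly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error, r2_score

lr = LinearRegression().fit(X_train, y_train)
pred = lr.predict(X_test)

# Evaluate with RMSLE and R² (targets and predictions are assumed non-negative here)
msle = mean_squared_log_error(y_test, pred)
print("Root MSLE:", np.sqrt(msle))
print("R²:", r2_score(y_test, pred))
```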

Ridge Regression

MSLE : 0.0027

Root MSLE : 0.0524

R² Score : 0.6316 or 63.16%



Ridge Regression is a technique used for analyzing a multiple regression model that suffers from multicollinearity, i.e., when two or more explanatory variables are highly linearly related. This commonly occurs in models with a large number of parameters. Ridge regression regularizes all parameters equally and provides improved efficiency in parameter estimation problems in exchange for a tolerable amount of bias. This makes it suited to multicollinearity, where ordinary least squares still yields unbiased regression coefficients (maximum likelihood estimates as observed in the data set) but with large variances.



The yellowbrick AlphaSelection visualizer, which uses cross-validation, found the best alpha value (20.336) for fitting the data set. The way our variables interact to predict used car price was not well suited to a Ridge model, which also gave a low R² score of 63%.
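A sketch of the alpha search with yellowbrick's AlphaSelection, reusing the same train/test split; the alpha grid is illustrative.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from yellowbrick.regressor import AlphaSelection

# Cross-validated alpha search, visualized with yellowbrick
alphas = np.logspace(-2, 3, 100)
viz = AlphaSelection(RidgeCV(alphas=alphas))
viz.fit(X_train, y_train)
viz.show()

# The fitted RidgeCV inside the visualizer exposes the selected alpha
print("best alpha:", viz.estimator.alpha_)
print("R²:", viz.estimator.score(X_test, y_test))
```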

Lasso

MSLE : 0.0027

Root MSLE : 0.0524

R² Score : 0.6316 or 63.16%

Lasso (least absolute shrinkage and selection operator) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting statistical model. It was originally formulated for linear regression models though Lasso regularization is easily extended to other statistical models including generalized linear models, generalized estimating equations, proportional hazards models, and M-estimators. Lasso’s ability to perform subset selection relies on the form of the constraint and has a variety of interpretations in terms of geometry, Bayesian statistics and convex analysis.

The Lasso model was not suited to our data set as indicated by a low R² score. After conducting our predictive analysis using three linear regression models, it appears that price is not impacted by any variable or variables in a clear linear fashion. We move on to other types of ML models for testing.

K Neighbors Regressor

MSLE : 0.0015

Root MSLE : 0.0392

R² Score : 0.8011 or 80.11%




The k-nearest neighbors algorithm (k-NN) is a non-parametric method in which the input consists of the k closest training examples in the data set. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until function evaluation. In other words, it stores the training data and predicts for new test data based on distance metrics. The quality of the predictions depends on the distance measure; if the features represent different physical units or come in vastly different scales, then normalizing the training data can improve the model's accuracy dramatically.


From both of the above figures, it can be observed that the RMSLE value is at its lowest when k is four, although there is no significant difference between the RMSLE values for k of three through six. By choosing the lowest MSLE occurrence, the data set was trained with n_neighbors = 5 and the 'euclidean' distance metric. Since we normalized the data set beforehand, the algorithm was able to achieve an 80% R² score, a considerably higher accuracy rating for used car price predictions than what was produced by the linear regression models.
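A sketch of the k search described above, reusing the earlier split; the range of k values tried here is illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_log_error

# Try several k values and keep the one with the lowest RMSLE on the test split
scores = {}
for k in range(2, 11):
    knn = KNeighborsRegressor(n_neighbors=k, metric="euclidean").fit(X_train, y_train)
    scores[k] = np.sqrt(mean_squared_log_error(y_test, knn.predict(X_test)))

best_k = min(scores, key=scores.get)
knn = KNeighborsRegressor(n_neighbors=best_k, metric="euclidean").fit(X_train, y_train)
print("best k:", best_k, "R²:", knn.score(X_test, y_test))
```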

Random Forest Regressor

MSLE : 0.0008

Root MSLE : 0.0284

R² Score : 0.8994 or 89.94%



A Random Forest is a meta estimator that fits a number of decision trees on various sub-samples of the data set and uses averaging to improve the predictive accuracy and control over-fitting. When bootstrap=True (the default), the sub-sample size is controlled with the max_samples parameter; otherwise the whole data set is used to build each tree. In random forests, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. Furthermore, when splitting each node during the construction of a tree, the best split is found either from all input features or from a random subset of size max_features. This randomness helps to prevent over-fitting and decreases the variance of the forest estimator.



In our model, 180 trees were created with max_features of 0.5. In general, the more trees, the better the results. Random Forest produced an R² score of 90%. The variable importance diagram above shows that 'year' and 'odometer' have the highest degree of predictive usefulness out of all the variables. The performance diagram above shows that actual versus predicted MSLEs were nearly identical for all but 3 of the 25 instances.
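A sketch of this configuration, with random_state and n_jobs added here for reproducibility and speed:

```python
from sklearn.ensemble import RandomForestRegressor

# 180 trees, each split considering half of the features, as described above
rf = RandomForestRegressor(n_estimators=180, max_features=0.5, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
print("R²:", rf.score(X_test, y_test))

# Sorted feature importances, mirroring the variable importance diagram
for name, imp in sorted(zip(X_train.columns, rf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name:15s} {imp:.3f}")
```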

Bagging Regressor

MSLE : 0.0015

Root MSLE : 0.0383

R² Score : 0.8124 or 81.24%

A Bagging Regressor is an ensemble meta-estimator that fits base regressors each on random subsets of the original data set and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it.


In our model, DecisionTreeRegressor is used as the base estimator with a max depth of 20, creating 50 decision trees and resulting in an R² score of 81%. Despite its complexity, the performance of Random Forest is much better than that of the Bagging Regressor. The fundamental difference between these two ML models is that in Random Forests, only a random subset of the features is considered and the best split feature from that subset is used to split each node in a tree, unlike in bagging, where all features are considered for splitting a node.
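A sketch of this configuration; note that scikit-learn 1.2+ names the wrapped model estimator, while older releases call it base_estimator.

```python
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# 50 depth-limited decision trees trained on random subsets of the data
bag = BaggingRegressor(
    estimator=DecisionTreeRegressor(max_depth=20),  # base_estimator= on older scikit-learn
    n_estimators=50,
    n_jobs=-1,
    random_state=42,
)
bag.fit(X_train, y_train)
print("R²:", bag.score(X_test, y_test))
```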

AdaBoost Regressor

MSLE : 0.0009

Root MSLE : 0.0295

R² Score : 0.8891 or 88.91%



The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data. The predictions from all of them are then combined through a weighted majority vote (or sum) to produce the final prediction. The data modifications at each so-called boosting iteration consist of applying weights w1, w2, ..., wN to each of the training samples. Initially, those weights are all set to wi = 1/N, so that the first step simply trains a weak learner on the original data. For each successive iteration, the sample weights are individually modified and the learning algorithm is reapplied to the reweighted data.

At a given step, those training examples that were incorrectly predicted by the boosted model induced at the previous step have their weights increased, whereas the weights are decreased for those that were predicted correctly. As iterations proceed, examples that are difficult to predict receive ever-increasing influence. Each subsequent weak learner is thereby forced to concentrate on the examples that are missed by the previous ones in the sequence.

In our model, the Decision Tree Regressor is used as the base estimator with a max depth of 24, 200 trees, and a learning rate of 0.6. This produced a strong R² score of 89%. The feature importance bar plot shows how all variables factor into used vehicle price, though it is most influenced by 'year', 'odometer', 'model', and 'lat/long'.
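A sketch of this configuration, under the same estimator-naming caveat noted for the Bagging Regressor:

```python
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# 200 boosted trees of depth 24 with a 0.6 learning rate, as described above
ada = AdaBoostRegressor(
    estimator=DecisionTreeRegressor(max_depth=24),  # base_estimator= on older scikit-learn
    n_estimators=200,
    learning_rate=0.6,
    random_state=42,
)
ada.fit(X_train, y_train)
print("R²:", ada.score(X_test, y_test))
```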

XGBoost

MSLE : 0.0007

Root MSLE : 0.0260

R² Score : 0.9146 or 91.46%



XGBoost is an ensemble learning method and a specific implementation of gradient boosting that uses more accurate approximations to find the best tree model. Sometimes it may not be sufficient to rely upon the results of just one machine learning model; ensemble learning offers a systematic way to combine the predictive power of multiple learners. The result is a single model that gives the aggregated output from several models. The models that form the ensemble, also known as base learners, can come from the same learning algorithm or from different learning algorithms. Bagging (previously discussed) and boosting are two widely used ensemble approaches. Though these techniques can be used with several statistical models, the most predominant usage has been with decision trees. XGBoost is built for scalability and fast learning through parallel and distributed computing, in addition to efficient memory usage. In order to fit the data set to the model, parameters were tuned with cross-validation to a max depth of 24, 200 decision trees (estimators), and a learning rate of 0.4.

Learning Rate: A key hyperparameter in gradient boosting (and in neural networks), which controls how much to change the model in response to the estimated error each time the model is updated. Choosing the learning rate is challenging, as a value too small may result in a long training process that could get stuck, whereas a value too large may result in learning a sub-optimal set of weights too quickly or an unstable training process.

n_estimators: The number of trees to build before taking the maximum vote or the average of the predictions. A higher number of trees gives better performance, with the drawback of longer run times.
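A sketch of the XGBoost configuration described above, using the scikit-learn-style XGBRegressor interface and the same train/test split:

```python
from xgboost import XGBRegressor

# Gradient-boosted trees with the parameters reported above
xgb = XGBRegressor(
    n_estimators=200,
    max_depth=24,
    learning_rate=0.4,
    n_jobs=-1,
    random_state=42,
)
xgb.fit(X_train, y_train)
print("R²:", xgb.score(X_test, y_test))
```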

The algorithm produced the highest R² score, at 91%. The performance chart above displays the strong accuracy of the prediction model.

Comparison

Machine Learning Model Performance


Of the eight different ML models we explored, three produced prediction accuracies of approximately 90%, which is a substantial result. These algorithms were RandomForestRegressor, AdaBoostRegressor, and XGBoost. By applying different models, we were able to learn more about the data set and the relative importance of the variables. As each ML algorithm was applied, information regarding the four statistical measures shown in the table above was gathered in order to assess the suitability of the ML model in meeting our study's objective of accurately predicting used car prices.

The first three models were based on linear regression and were all found to be unsuitable for our prediction needs based on their poor MSLE, Root MSLE, and R² scores. The relative feature importances generated by these early models nevertheless provided useful insights, since they indicated that several factors were influencing used car prices. These factors were (in decreasing order of importance) 'year', 'odometer', 'fuel', and 'cylinders'. As we applied more learning models to the data, a tendency became apparent: 'odometer' and 'year' were consistently the most important features, followed by a handful of others.

Higher price prediction accuracies were achieved when moving on to more complex algorithms that allowed for much larger decision trees and depth parameters. Our data set contained 12 categorical variables and 4 numerical variables, which, despite being encoded and scaled prior to use by the ML models, presented the need for a model that could be sensitive to the very broad range of importance values across many different variables.

The most suitable algorithm was found to be the XGBoost ensemble-based ML model. The advantage of the XGBoost model is its scalable and accurate implementation of gradient boosting, together with several advanced features for model tuning, computing environments, and algorithm enhancement. XGBoost can perform the three main forms of gradient boosting (Gradient Boosting (GB), Stochastic GB, and Regularized GB) and is robust enough to support fine tuning and the addition of regularization parameters. The 91% accuracy model established here provides a jumping-off point for further exploration.

Visualizations

Raw Data Visual Analysis

The pair plots of vehicle year, price, and odometer are shown above. Correlations between these variables are visually apparent from the distribution of the scatter plots. Firstly, vehicle prices tend to increase the newer the vehicle is. Secondly, the higher the odometer reading, the older the vehicle tends to be. A correlation between price and odometer is less noticeable, though prices still appear to increase for vehicles with lower mileage. These three variable interactions confirm already well-established concepts of car value.

The following bar plots performed on the raw data provide a means to visually consider any possible relationships between the different variables and used car prices.

These additional bar plots of vehicle count versus the different variables provide an idea of the makeup of the data set based on their distributions.


Conclusion

The process applied here has great potential to become an actual price prediction application that can be used by the public. Although the complex manner in which the different features impact used car pricing is constantly in flux, certain areas of importance were uncovered through exploration with eight distinct ML models. A 91% accuracy rating is substantial in theory, though differences between actual and predicted prices due to error with this model could amount to hundreds or even thousands of dollars. For buyers in the market, that would defeat the purpose of the prediction model. Ideally, a near 100% accuracy rating should be sought in order to achieve a true "working" model.

Machine learning appears to have the potential to change relationships between producers and consumers in positive ways. In this application, we have explored the potential for it to assist consumers in navigating the used car market. A highly accurate machine learning price prediction model could potentially revolutionize the used automotive industry by giving buyers the power to locate the most reasonable deal. Consequently, sellers, both private party and dealer, will have to adjust their pricing schemes in order to accurately represent true market values. This may attract an even higher percentage of buyers to the used car market and potentially impact the new car market in turn. Such a predicament would indeed be quite amusing.

In order to continue building towards a deployable model, more used car data from sources besides Craigslist should be obtained and incorporated into the ML model. These sources will have to be varied in order to comprehensively capture the used car sales market and will include other major online used car platforms, as well as dealerships, to represent the portion of used car sales that are not transacted online. Obtaining data on local private party sales will be inconsistent and difficult. A much larger data set would allow for more robust price predictions by the ML model. Although 16 different features, both numerical and categorical, were applied here, more features could potentially be included, given the broad range of importance values across the many variables that factor into used car value.

Additionally, scrutinizing certain aspects of the data cleaning and preprocessing could lead to more accurate methods of filling in missing values or scaling the data, which could provide a data set that the ML model optimally uses for predictive analysis. Immediate next steps include loading the current prediction model into a Heroku-hosted web application.

Behind the Scenes

Coding Snippets

About

The Researcher

Kiran Rangaraj

Aspiring Data Analyst / Former Biochemist