Result

Model Results

As expected, ensemble tree methods, and XGBoost in particular, had the best overall performance. This is not surprising, since ensemble methods are known to:

  • Have higher predictive accuracy than the individual models they combine.
  • Be useful when the dataset contains both linear and non-linear relationships.

Linear Regression and Linear SVR, as expected, did not perform well. Their R squared scores (0.68 and 0.65) show that both models left a substantial share of the variance in our data unexplained. Among the ensemble methods, XGBoost performs better than Random Forest.

Model                      R Squared  Train RMSE  Validation/Test RMSE  Test MAPE
Linear Regression          0.68       5222.94     5272.66               0.49
Linear SVR                 0.65       5613.04     5626.03               0.43
Random Forest              0.89       2971.65     3215.39               0.25
Gradient Boosting Machine  0.89       3111.86     3179.89               0.25
XGBoost                    0.92       2400.81     2674.25               0.208

A MAPE below 0.2 is generally considered a good score; XGBoost's 0.208 comes close to that threshold. We checked for overfitting by comparing train and test RMSE.
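As a rough illustration of the metrics and the train/test comparison described above, the sketch below defines RMSE and MAPE (as a fraction, matching the table) and computes how far each model's test RMSE sits above its train RMSE. The RMSE values are copied from the results table; the helper names (`rmse`, `mape`, `relative_gap`) are our own, and the source does not state what gap it treated as a sign of overfitting.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction (0.2 means 20%)."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def relative_gap(train_rmse, test_rmse):
    """How much larger the test RMSE is than the train RMSE, as a fraction."""
    return (test_rmse - train_rmse) / train_rmse

# (train RMSE, test RMSE) pairs copied from the results table.
reported = {
    "Linear Regression": (5222.94, 5272.66),
    "Linear SVR": (5613.04, 5626.03),
    "Random Forest": (2971.65, 3215.39),
    "Gradient Boosting Machine": (3111.86, 3179.89),
    "XGBoost": (2400.81, 2674.25),
}

gaps = {name: relative_gap(train, test) for name, (train, test) in reported.items()}
for name, gap in gaps.items():
    print(f"{name}: test RMSE is {gap:.1%} above train RMSE")
```

A small gap suggests the model generalizes about as well as it fits; how large a gap is acceptable is a judgment call that depends on the application.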

Code Reference

Final Models

Feature Importance

MSRP, odometer, year, and state are the most important features.
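The source does not say how these importances were computed. As one model-agnostic possibility, the sketch below uses permutation importance: shuffle one feature column at a time and measure how much the RMSE increases. Everything here is hypothetical, including the toy data, the `toy_price` rule standing in for a trained model, and the feature names; it is not the project's pipeline.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def permutation_importance(predict, rows, targets, seed=0):
    """Return the RMSE increase caused by shuffling each feature column."""
    rng = random.Random(seed)
    baseline = rmse(targets, [predict(r) for r in rows])
    increases = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild every row with feature j replaced by its shuffled value.
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        increases.append(rmse(targets, [predict(r) for r in shuffled]) - baseline)
    return increases

# Hypothetical toy data: [msrp, odometer, unrelated_noise]; not the project's dataset.
data_rng = random.Random(42)
rows = [
    [data_rng.uniform(15000, 60000), data_rng.uniform(0, 200000), data_rng.random()]
    for _ in range(200)
]

def toy_price(row):
    # Made-up pricing rule standing in for a trained model.
    msrp, odometer, _noise = row
    return 0.6 * msrp - 0.05 * odometer

targets = [toy_price(r) for r in rows]
importances = permutation_importance(toy_price, rows, targets)
for name, increase in zip(["msrp", "odometer", "unrelated_noise"], importances):
    print(f"{name}: RMSE increase {increase:.2f}")
```

A feature the model ignores produces no RMSE increase when shuffled, while features the prediction depends on produce a clear one. Tree ensembles also expose built-in importance scores, which may be what the original analysis used.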

Key Findings

MSRP, odometer reading, and production year are the three strongest determinants of used car prices.

This was expected from the initial EDA, where we observed these correlations.

The state where a car is listed affects its price range.

Price variance increases as years go by.

Some cars are not sold as advertised (e.g., vintage cars may be lemons).

Challenges/Areas of Improvement

Apply more advanced NLP to the textual data (listing descriptions), excluding ads; supplement the data with public reviews of each car; and add topic modeling to our features.

Research car models with missing values in more depth and perform more thorough anomaly detection.

Integrate image detection algorithms to check whether a car matches its description, and use their outputs as additional features for modeling (e.g., CNN image classification).

Recommendations

Proposed Business Applications for Problems of Information Asymmetry:

  1. Craigslist or other platforms can present predicted prices for used cars (using the predictive model) so that buyers can get a sense of what is reasonable and have a baseline for comparison.

  2. Craigslist should require sellers to fill in clearly defined forms for used cars so that information asymmetry can be mitigated. (Currently this is not mandatory. The "Condition" criterion is also not clearly defined, although it can be an important indicator.)

  3. Craigslist can also add exception criteria or a dedicated section for vintage cars.

  4. For reputation and quality assurance purposes, used car companies can use the predictions to identify and filter out sellers prone to selling lemons before listings are posted.

  5. Eventually, all of these adjustments can be expected to improve the quality of used car listings on Craigslist, which in turn can improve the transaction success rate.
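The first recommendation could be prototyped along the following lines: compare a listing's asking price with the model's predicted price and report whether it falls inside a tolerance band. This is only a sketch under stated assumptions. The function name `price_signal`, the labels, and the default tolerance of 0.2 (loosely motivated by the test MAPE of roughly 0.2 reported above) are all hypothetical, not part of the original project.

```python
def price_signal(asking_price, predicted_price, tolerance=0.2):
    """Classify a listing relative to the model's predicted price.

    Returns (deviation, label), where deviation is the fractional difference
    between the asking price and the predicted price.
    """
    if predicted_price <= 0:
        raise ValueError("predicted_price must be positive")
    deviation = (asking_price - predicted_price) / predicted_price
    if deviation > tolerance:
        label = "above expected range"
    elif deviation < -tolerance:
        label = "below expected range"
    else:
        label = "within expected range"
    return deviation, label

# Hypothetical listings: (asking price, model prediction).
examples = [(13000, 10000), (10500, 10000), (7000, 10000)]
for asking, predicted in examples:
    deviation, label = price_signal(asking, predicted)
    print(f"asking {asking}, predicted {predicted}: {deviation:+.0%} ({label})")
```

A listing far below the expected range is not necessarily a bargain; given the finding that some cars are not sold as advertised, it may equally be a prompt for closer inspection.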