Data Collection

Collected datasets to serve the project needs, pre-processed them and selected features.

Dataset

Our dataset is taken from Kaggle public dataset (1.45 GB)​. Based on resale car listings on Craigslist​, columns include selling price, car attributes, color, condition, VIN, mileage, make and model, etc.​ We enhanced and imputed some data with data from iSeeCars.

Used Car Dataset

MSRP Dataset

iSeeCar

Incorporating External Data

To impute some of the columns, we looked at the most common​ value for each car model​.

Assumptions:​

  • Cars of the same make and model share common attributes ​such as cylinders, type and size.​
  • For customizable attributes (drive type) we are assuming the most common values for each model​.

We collected data from iSeeCars through web scraping​. Data scraped include MSRP and car attributes for all makes and​ models available.

Problem:​

  • Due to user-input values, car model names may not match​. exactly with the clean names from iSeeCars​.
  • To overcome this problem, we used OpenRefine to try and fix some of the entries based on naming clusters​.
  • We also used SequenceMatcher from the difflib library to​ programmatically fix model names based on similarity index​ and assigned model with the highest name similarity.

Example of cases where difflib sequence matcher was needed:

Available Columns And Relevant Columns

After combining data and cleaning it up, we looked in each column. We determined the relevant columns by looking at distribution of data, correlation with price, missing values and other factors from inference.

Irrelevant

High Cardinality

  • ID
  • URL
  • Image_URL
  • VIN

Categorical

  • Region
  • Region_URL

Null

  • County(100% missing)
  • Size (>70%)

Geolocation:

  • Longitude
  • Latitude

Text

  • Description

Relevant

Categorical

  • Manufacturer​
  • Model ​
  • Condition​
  • Cylinders​
  • Fuel​
  • Title Status​
  • Transmission​
  • Drive​
  • Type​
  • Paint Color​
  • State

Continuous

  • Price
  • MSRP
  • Odometer

Time

  • Year
  • Posting Date

Initial Feature Selection And Imputation

Dropped Columns

ID, VIN, URL, Lattitude, Longitude, Image URL, Region, Region URL, County, Size, Description

Imputed Columns

Mode:

Title Status, Fuel, Color

Based on other features:

Condition

Based on external dataset:

Manufacturer, Model, Drive, Cylinder, Type