Data Collection

Collected datasets to serve the project needs, pre-processed them and selected features.

Dataset

Our dataset is taken from Kaggle public dataset (1.45 GB). Based on resale car listings on Craigslist, columns include selling price, car attributes, color, condition, VIN, mileage, make and model, etc. We enhanced and imputed some data with data from iSeeCars.

Used Car Dataset

MSRP Dataset

iSeeCar

👉

Since each listing data is 100% based on seller entry on Craiglist, there is high possibility sellers tend to input false data, resulting in datapoints that are anormal.

Incorporating External Data

To impute some of the columns, we looked at the most common value for each car model.

Assumptions:

Cars of the same make and model share common attributes such as cylinders, type and size.
For customizable attributes (drive type) we are assuming the most common values for each model.

We collected data from iSeeCars through web scraping. Data scraped include MSRP and car attributes for all makes and models available.

Problem:

Due to user-input values, car model names may not match. exactly with the clean names from iSeeCars.
To overcome this problem, we used OpenRefine to try and fix some of the entries based on naming clusters.
We also used SequenceMatcher from the difflib library to programmatically fix model names based on similarity index and assigned model with the highest name similarity.

Example of cases where difflib sequence matcher was needed:

Available Columns And Relevant Columns

After combining data and cleaning it up, we looked in each column. We determined the relevant columns by looking at distribution of data, correlation with price, missing values and other factors from inference.

Irrelevant

High Cardinality

ID
URL
Image_URL
VIN

Categorical

Region
Region_URL

Null

County(100% missing)
Size (>70%)

Geolocation:

Longitude
Latitude

Text

Description

Relevant

Categorical

Manufacturer
Model
Condition
Cylinders
Fuel
Title Status
Transmission
Drive
Type
Paint Color
State

Continuous

Price
MSRP
Odometer

Time

Year
Posting Date

Initial Feature Selection And Imputation

Dropped Columns

ID, VIN, URL, Lattitude, Longitude, Image URL, Region, Region URL, County, Size, Description

Imputed Columns

Mode:

Title Status, Fuel, Color

Based on other features:

Condition

Based on external dataset:

Manufacturer, Model, Drive, Cylinder, Type

Data Collection

Dataset #

Incorporating External Data #

Problem:​ #

Available Columns And Relevant Columns #

Irrelevant #

Relevant #

Initial Feature Selection And Imputation #

Dropped Columns #

Imputed Columns #