Data Collection
Collected datasets to serve the project needs, pre-processed them and selected features.
Dataset
Our dataset is taken from Kaggle public dataset (1.45 GB). Based on resale car listings on Craigslist, columns include selling price, car attributes, color, condition, VIN, mileage, make and model, etc. We enhanced and imputed some data with data from iSeeCars.
Incorporating External Data
To impute some of the columns, we looked at the most common value for each car model.
Assumptions:
- Cars of the same make and model share common attributes such as cylinders, type and size.
- For customizable attributes (drive type) we are assuming the most common values for each model.
We collected data from iSeeCars through web scraping. Data scraped include MSRP and car attributes for all makes and models available.
Problem:
- Due to user-input values, car model names may not match. exactly with the clean names from iSeeCars.
- To overcome this problem, we used OpenRefine to try and fix some of the entries based on naming clusters.
- We also used SequenceMatcher from the difflib library to programmatically fix model names based on similarity index and assigned model with the highest name similarity.
Example of cases where difflib sequence matcher was needed:
Available Columns And Relevant Columns
After combining data and cleaning it up, we looked in each column. We determined the relevant columns by looking at distribution of data, correlation with price, missing values and other factors from inference.
Irrelevant
High Cardinality
- ID
- URL
- Image_URL
- VIN
Categorical
- Region
- Region_URL
Null
- County(100% missing)
- Size (>70%)
Geolocation:
- Longitude
- Latitude
Text
- Description
Relevant
Categorical
- Manufacturer
- Model
- Condition
- Cylinders
- Fuel
- Title Status
- Transmission
- Drive
- Type
- Paint Color
- State
Continuous
- Price
- MSRP
- Odometer
Time
- Year
- Posting Date
Initial Feature Selection And Imputation
Dropped Columns
ID, VIN, URL, Lattitude, Longitude, Image URL, Region, Region URL, County, Size, Description
Imputed Columns
Mode:
Title Status, Fuel, Color
Based on other features:
Condition
Based on external dataset:
Manufacturer, Model, Drive, Cylinder, Type