Data Profile

Data Infrastructure

Data Profile

SOURCEDESCRIPTIONDATA SIZEFORMAT
ProductKaggleProducts listed on Sephora12.4MBcsv
BrandsPhantom BusterInformation on cosmetic brands and company information226KBjson/csv
ReviewsBazaarVoice APIUser product reviews on Sephora500MBjson
TwitterTweepy APIUsers’ tweets with hashtag of brand name7.3MBcsv
IngredientCalifornia Safe Cosmetics Program Product DBChemicals used in cosmetics that are known to be harmful2.2MBcsv

Data Quality Dimensions Assessment

CompletenessAccuracyConsistencyValidityUniquenessIntegrity
Original data did not contain all products up to this year and tweets data was limited to recent dataTweet data contained inaccurate feedback on brands as search query was simply limited to “#brand” and not advanced keywords such as product namesThe amount of data for product, reviews and tweets per brand was not consistentFor ingredient harmfulness, the data is perfectly valid as it has scientific backing. Using tweet sentiment and review data however, may not be perfectly representative of realityA customer may have repeated review based on multiple experience Sentiment analysis may be based on multiple tweets that could have been oneEntity, domain and referential integrities were considered with constraints

ETL Process: Extraction, Transformation & Load

Implementation Tools

Design Consideraton

Branding Original product contains products across all categories, but we limit our data to facial brands under Sephora’s makeup and skincare categories.

Outliers & Anomalies Tweet location data contains non-location data. Intensive clustering and replacing with NULL if no-location can be derived from.

Ingredients Intermediary lookup table linking harmful ingredients to the product’ ingredient. Dealing With NAs MySQL Workbench recognizes NULL instead of empty string so removed unused columns and filled NaNs as NULL.

Product Reviews The reviews have too many attributes that are sparsely populated due to the difference between product categories, making it hard to create table columns in SQL for it.

Thus, using NoSQL and store as key value pairs is what we decided to do.

Data Naming Convention Standardize the format for names across entities and attributes (snake case).

Normalization Using brand id across the tables, referring to the brand table to normalize.