Data Collection

We used XHR requests to collect Sephora product reviews, the Tweepy API to collect tweets per brand, and Phantom Buster to collect beauty company information.

XHR

XHR requests proved a better way to scrape Sephora product reviews than BS4 and Selenium.

Since scraping with both BS4 and Selenium did not work, we turned to XHR requests. This worked to our advantage: the XHR API was generally cleaner, offered a more complete product listing (the website itself had removed old products), and contained reviewer metadata.
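
As a rough illustration of the XHR approach, the sketch below requests one page of reviews as JSON and flattens it into rows. The endpoint URL, the query parameter names, and the JSON keys (`results`, `reviewer`, `skin_type`) are all placeholders invented for this example; the real ones are whatever the browser's network tab shows for the site's XHR calls.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint, for illustration only.
REVIEWS_URL = "https://example.com/api/reviews"


def build_request(product_id: str, offset: int = 0, limit: int = 100) -> Request:
    """Build a GET request for one page of reviews (parameter names assumed)."""
    query = urlencode({"product_id": product_id, "offset": offset, "limit": limit})
    headers = {"Accept": "application/json", "User-Agent": "Mozilla/5.0"}
    return Request(f"{REVIEWS_URL}?{query}", headers=headers)


def fetch_page(product_id: str, offset: int = 0, limit: int = 100) -> dict:
    """Call the endpoint and decode the JSON body."""
    with urlopen(build_request(product_id, offset, limit), timeout=30) as resp:
        return json.load(resp)


def flatten_reviews(payload: dict) -> list[dict]:
    """Turn one JSON page into flat rows, keeping reviewer metadata."""
    rows = []
    for review in payload.get("results", []):  # "results" is an assumed key
        reviewer = review.get("reviewer", {})  # assumed nested object
        rows.append({
            "review_id": review.get("id"),
            "rating": review.get("rating"),
            "text": review.get("text"),
            "reviewer_skin_type": reviewer.get("skin_type"),
        })
    return rows
```

Paging through a product would then mean calling `fetch_page` with an increasing `offset` until a page comes back empty, passing each page through `flatten_reviews`.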

Tweepy API

A Twitter developer account was created to use free-tier API calls to scrape tweet data per brand (brand, tweet_id, tweet_date, follower_count, retweets, location, tweet_text)
Iterated over the brand names in the brands table and appended the collected tweets to a Pandas DataFrame
For 186 brands, an average of 110 tweets per brand was collected, but the number of tweets varied per brand, which could reduce the value of the analysis
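
The per-brand loop described above can be sketched as follows. To keep the example runnable without credentials, the actual Tweepy call is passed in as a `fetch` function; `collect_tweets`, `tweets_per_brand`, and `fetch` are hypothetical names, and the assumption is that `fetch(brand)` wraps whichever Tweepy search call the project used and returns one dict per tweet.

```python
from collections import Counter
from typing import Callable, Iterable

# Column names taken from the field list above.
TWEET_FIELDS = (
    "brand", "tweet_id", "tweet_date", "follower_count",
    "retweets", "location", "tweet_text",
)


def collect_tweets(
    brands: Iterable[str],
    fetch: Callable[[str], Iterable[dict]],
) -> list[dict]:
    """Fetch tweets brand by brand and accumulate them as uniform rows."""
    rows = []
    for brand in brands:
        for tweet in fetch(brand):
            # Missing fields become None so every row has the same columns.
            row = {field: tweet.get(field) for field in TWEET_FIELDS}
            row["brand"] = brand
            rows.append(row)
    return rows


def tweets_per_brand(rows: Iterable[dict]) -> Counter:
    """Count collected tweets per brand to expose uneven coverage."""
    return Counter(row["brand"] for row in rows)
```

The resulting list of dicts can be handed to `pandas.DataFrame(rows)`, and `tweets_per_brand` gives a quick view of how unevenly the brands are covered before any analysis is run.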

Phantom Buster

Phantom Buster was a great substitute for BS4 for scraping brand data.

Attempted to scrape data from Wikipedia, but not all brands are listed on Wikipedia
Discovered that most brands have a company page on LinkedIn that provides valuable information
Chose Phantom Buster to extract brand information automatically from LinkedIn
The extraction result is in JSON/CSV format
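
Since the extraction result arrives as JSON or CSV, a small loader can normalise both into the same list-of-dicts shape. This is a minimal sketch using only the standard library; the function names are hypothetical, and it assumes the JSON export is a top-level array of objects and the CSV export has a header row.

```python
import csv
import io
import json
from pathlib import Path


def parse_json_export(text: str) -> list[dict]:
    """Parse a JSON export, assumed to be a top-level array of objects."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of brand records")
    return records


def parse_csv_export(text: str) -> list[dict]:
    """Parse a CSV export whose first row holds the column names."""
    return list(csv.DictReader(io.StringIO(text)))


def load_export(path: str) -> list[dict]:
    """Read an export file and pick the parser from its extension."""
    file = Path(path)
    text = file.read_text(encoding="utf-8")
    suffix = file.suffix.lower()
    if suffix == ".json":
        return parse_json_export(text)
    if suffix == ".csv":
        return parse_csv_export(text)
    raise ValueError(f"unsupported export format: {file.suffix}")
```

Note that CSV values always come back as strings, so numeric columns need converting before they are compared with values from a JSON export.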
