Data Collection

Sephora product reviews were collected through XHR requests, tweets per brand through the Tweepy API, and beauty company information through Phantom Buster.

XHR

XHR requests turned out to be a better way to scrape Sephora product reviews than BS4 (Beautiful Soup 4) and Selenium.

Since scraping with both BS4 and Selenium did not work, we turned to XHR requests. This worked to our advantage: the data returned by the XHR API was generally cleaner, included a more complete product listing (the website had removed old products), and contained reviewer metadata.
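
As a rough illustration of the XHR approach, the sketch below pages through a JSON endpoint the way a page's own background requests would. The URL, the query parameter names (product_id, limit, offset) and the "results" key are placeholders, not Sephora's actual API; the real values would have to be read from the browser's Network tab while a review page loads.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint: replace with the request URL seen in the browser's
# Network tab (XHR filter). It is NOT the real Sephora endpoint.
REVIEWS_URL = "https://example.com/api/reviews"


def fetch_json(url, params):
    """GET a JSON document, as the page's own XHR call would."""
    request = Request(f"{url}?{urlencode(params)}",
                      headers={"Accept": "application/json"})
    with urlopen(request, timeout=30) as response:
        return json.load(response)


def collect_reviews(product_id, fetch=fetch_json, page_size=100):
    """Page through the endpoint until it returns an empty batch."""
    reviews, offset = [], 0
    while True:
        # Parameter names and the "results" key are assumptions.
        payload = fetch(REVIEWS_URL, {"product_id": product_id,
                                      "limit": page_size,
                                      "offset": offset})
        batch = payload.get("results", [])
        if not batch:
            break
        reviews.extend(batch)
        offset += len(batch)
    return reviews
```

Passing the fetch function as an argument keeps the paging logic testable offline; a stub that returns canned JSON can stand in for the network call.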

Tweepy API

  • Created a Twitter developer account to use the free-tier API calls to collect tweet data per brand (brand, tweet_id, tweet_date, follower_count, retweets, location, tweet_text)
  • Iterated over the brand names in the brands table and appended the collected tweets to a Pandas dataframe
  • For 186 brands, an average of 110 tweets per brand was collected, but the number of tweets varied by brand, which could diminish the value of the analysis
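
The per-brand loop described above could look roughly like the sketch below. It assumes v1.1-style status objects (attributes such as created_at, retweet_count and user.followers_count); the function names and the injected search callable are illustrative and not taken from the original project.

```python
COLUMNS = ["brand", "tweet_id", "tweet_date", "follower_count",
           "retweets", "location", "tweet_text"]


def status_to_row(brand, status):
    """Flatten one status object into the columns listed in COLUMNS."""
    return {
        "brand": brand,
        "tweet_id": status.id,
        "tweet_date": status.created_at,
        "follower_count": status.user.followers_count,
        "retweets": status.retweet_count,
        "location": status.user.location,
        "tweet_text": status.text,
    }


def collect_tweets(brands, search, per_brand=100):
    """Call search(brand, per_brand) for each brand and gather flat rows.

    `search` is any callable returning an iterable of status objects, so
    the loop can be exercised without network access.
    """
    rows = []
    for brand in brands:
        for status in search(brand, per_brand):
            rows.append(status_to_row(brand, status))
    return rows
```

With Tweepy 4.x and v1.1 access, search might be wired as lambda q, n: api.search_tweets(q=q, count=n), where api is an authenticated tweepy.API instance; this is an assumption, since the original does not say which endpoint was used. The resulting rows can be turned into a dataframe with pandas.DataFrame(rows, columns=COLUMNS), and grouping that dataframe by brand shows how uneven the tweet counts are.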

Phantom Buster

Phantom Buster was a great substitute for BS4 for scraping brand data.

  • Attempted to scrape data from Wikipedia, but not all brands are listed there
  • Discovered that most brands have a company page on LinkedIn with valuable information
  • Chose Phantom Buster to extract brand information automatically from LinkedIn
  • The extraction results are available in JSON or CSV format
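
Because the export can arrive as either JSON or CSV, a small loader that normalises both into a list of records can simplify the downstream steps. The sketch below is one possible way to do this; the function name is hypothetical, and it assumes the JSON export is a list of records.

```python
import csv
import json
from pathlib import Path


def load_export(path):
    """Return an export file as a list of dicts, for JSON or CSV input."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".json":
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        # Assumption: the export is a list of records; wrap a lone object.
        return data if isinstance(data, list) else [data]
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))
    raise ValueError(f"unsupported export format: {suffix}")
```

Note that CSV values always load as strings, so numeric fields need an explicit conversion before analysis.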