Results

Lessons Learned

  • Larger datasets often require paid API calls, so data costs should be weighed carefully during the data planning process
  • Researching API restrictions before data collection is essential, since expectations and reality may differ (e.g., the Tweepy API returns only recent tweets; see the sketch after this list)
  • Heavy web scraping needs workarounds for access restrictions (e.g., a proper User-Agent header and referrer are required to access the Sephora website; see the sketch after this list)
  • JavaScript is not rendered by the Python requests library alone; dynamic pages need a headless browser or similar tool (see the sketch after this list)
  • Data clean-up is indispensable, yet it demands the largest time investment after data collection
  • Balancing MySQL and NoSQL storage according to data type improves data handling and makes analyses easier to produce (see the sketch after this list)
  • Expectations about data completeness should be grounded in actual data availability and managed early on
  • GCP provides a synchronized working environment for the team
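
As a concrete illustration of the Tweepy point above: the standard search endpoint only reaches roughly the last seven days of tweets, so older data is simply unavailable through it. Below is a minimal collection sketch, assuming Tweepy v4 and placeholder credentials; the query string and item count are illustrative.

```python
import tweepy

# Placeholder credentials -- substitute real keys from the Twitter developer portal.
auth = tweepy.OAuth1UserHandler(
    "CONSUMER_KEY", "CONSUMER_SECRET",
    "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET",
)
api = tweepy.API(auth, wait_on_rate_limit=True)

# search_tweets uses the standard search API, which only covers roughly
# the most recent 7 days of tweets -- anything older is absent.
tweets = [
    status._json
    for status in tweepy.Cursor(
        api.search_tweets, q="sephora -filter:retweets", lang="en", count=100
    ).items(500)
]
print(f"Collected {len(tweets)} recent tweets")
```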
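
For the header and referrer workaround, a sketch with the requests library follows; the User-Agent string and the Sephora URL are illustrative, and the exact headers a site accepts can change over time.

```python
import requests

# Sephora (like many sites) may reject the default requests User-Agent;
# a browser-like User-Agent plus a plausible Referer usually unblocks the page.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Referer": "https://www.sephora.com/",
}
resp = requests.get(
    "https://www.sephora.com/shop/skincare", headers=headers, timeout=30
)
resp.raise_for_status()

# Note: resp.text is the raw server HTML only -- content that the page
# builds with JavaScript after load will not appear here.
html = resp.text
```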
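
Because requests never executes JavaScript, pages that render client-side need a real browser engine. One common approach (not necessarily the one used in this project) is a headless browser via Selenium; the sketch below assumes Selenium 4.6+ and an installed Chrome, so the driver is resolved automatically.

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.sephora.com/shop/skincare")  # illustrative URL
    html = driver.page_source  # HTML *after* JavaScript has executed
finally:
    driver.quit()
```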
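
Finally, one way to balance the two stores is to keep flat, typed records in MySQL and raw nested documents in a NoSQL store such as MongoDB. The sketch below uses mysql-connector-python and pymongo; the database, table, and collection names and the credentials are hypothetical, and the products table is assumed to already exist.

```python
import mysql.connector
from pymongo import MongoClient

# Flat, typed product records fit naturally in a relational table.
conn = mysql.connector.connect(
    host="localhost", user="root", password="PASSWORD", database="beauty"
)
cur = conn.cursor()
cur.execute(
    "INSERT INTO products (name, brand, price) VALUES (%s, %s, %s)",
    ("Hydrating Cream", "ExampleBrand", 42.00),
)
conn.commit()
conn.close()

# Nested, variably shaped review JSON is easier to store as-is in MongoDB.
client = MongoClient("mongodb://localhost:27017")
client.beauty.reviews.insert_one(
    {
        "product": "Hydrating Cream",
        "rating": 4,
        "text": "Works well on dry skin.",
        "meta": {"source": "scrape", "tags": ["skincare", "moisturizer"]},
    }
)
```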

Recommendations

  • Budget for paid API calls when planning larger datasets, and factor data costs into the data plan early
  • Research API restrictions before committing to a data source, since expectations and reality may differ (e.g., the Tweepy API returns only recent tweets)
  • For heavy web scraping, prepare workarounds for access restrictions (e.g., send a proper User-Agent header and referrer when accessing the Sephora website)
  • Use a headless browser or similar tool for JavaScript-heavy pages, since Python requests alone does not render JavaScript
  • Reserve ample time for data clean-up; it is indispensable and consumes the most time after data collection
  • Split storage between MySQL and NoSQL according to data type to simplify data handling and analysis
  • Base expectations of data completeness on actual data availability and manage them early on
  • Use GCP to keep the team's working environment synchronized