Results

Lessons Learned

  • Larger datasets often require paid API calls, so data costs should be weighed carefully during the data planning process
  • Researching API restrictions before data collection is essential, since expectations and reality may differ (e.g., the Tweepy API returns only recent tweets; see the sketch after this list)
  • Heavy web scraping needs workarounds for access restrictions (e.g., a proper User-Agent header and referrer are required to access the Sephora website; see the sketch after this list)
  • JavaScript is not rendered by the Python requests library alone; dynamic pages need a headless browser or similar tool (see the sketch after this list)
  • Data clean-up is indispensable, yet it demands the largest time investment after data collection
  • Balancing MySQL and NoSQL storage according to data type improves data handling and makes analyses easier to produce (see the sketch after this list)
  • Expectations about data completeness should be grounded in actual data availability and managed early on
  • GCP provides a synchronized working environment for the team
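
As a concrete illustration of the Tweepy point above: the standard search endpoint only reaches roughly the last seven days of tweets, so older data is simply unavailable through it. Below is a minimal collection sketch, assuming Tweepy v4 and placeholder credentials; the query string and item count are illustrative.

```python
import tweepy

# Placeholder credentials -- substitute real keys from the Twitter developer portal.
auth = tweepy.OAuth1UserHandler(
    "CONSUMER_KEY", "CONSUMER_SECRET",
    "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET",
)
api = tweepy.API(auth, wait_on_rate_limit=True)

# search_tweets uses the standard search API, which only covers roughly
# the most recent 7 days of tweets -- anything older is absent.
tweets = [
    status._json
    for status in tweepy.Cursor(
        api.search_tweets, q="sephora -filter:retweets", lang="en", count=100
    ).items(500)
]
print(f"Collected {len(tweets)} recent tweets")
```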
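
For the header and referrer workaround, a sketch with the requests library follows; the User-Agent string and the Sephora URL are illustrative, and the exact headers a site accepts can change over time.

```python
import requests

# Sephora (like many sites) may reject the default requests User-Agent;
# a browser-like User-Agent plus a plausible Referer usually unblocks the page.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Referer": "https://www.sephora.com/",
}
resp = requests.get(
    "https://www.sephora.com/shop/skincare", headers=headers, timeout=30
)
resp.raise_for_status()

# Note: resp.text is the raw server HTML only -- content that the page
# builds with JavaScript after load will not appear here.
html = resp.text
```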
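
Because requests never executes JavaScript, pages that render client-side need a real browser engine. One common approach (not necessarily the one used in this project) is a headless browser via Selenium; the sketch below assumes Selenium 4.6+ and an installed Chrome, so the driver is resolved automatically.

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.sephora.com/shop/skincare")  # illustrative URL
    html = driver.page_source  # HTML *after* JavaScript has executed
finally:
    driver.quit()
```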
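
Finally, one way to balance the two stores is to keep flat, typed records in MySQL and raw nested documents in a NoSQL store such as MongoDB. The sketch below uses mysql-connector-python and pymongo; the database, table, and collection names and the credentials are hypothetical, and the products table is assumed to already exist.

```python
import mysql.connector
from pymongo import MongoClient

# Flat, typed product records fit naturally in a relational table.
conn = mysql.connector.connect(
    host="localhost", user="root", password="PASSWORD", database="beauty"
)
cur = conn.cursor()
cur.execute(
    "INSERT INTO products (name, brand, price) VALUES (%s, %s, %s)",
    ("Hydrating Cream", "ExampleBrand", 42.00),
)
conn.commit()
conn.close()

# Nested, variably shaped review JSON is easier to store as-is in MongoDB.
client = MongoClient("mongodb://localhost:27017")
client.beauty.reviews.insert_one(
    {
        "product": "Hydrating Cream",
        "rating": 4,
        "text": "Works well on dry skin.",
        "meta": {"source": "scrape", "tags": ["skincare", "moisturizer"]},
    }
)
```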

Recommendations

  • Budget for paid API calls when planning larger datasets, and factor data costs into the data plan early
  • Research API restrictions before committing to a data source, since expectations and reality may differ (e.g., the Tweepy API returns only recent tweets)
  • For heavy web scraping, prepare workarounds for access restrictions (e.g., send a proper User-Agent header and referrer when accessing the Sephora website)
  • Use a headless browser or similar tool for JavaScript-heavy pages, since Python requests alone does not render JavaScript
  • Reserve ample time for data clean-up; it is indispensable and consumes the most time after data collection
  • Split storage between MySQL and NoSQL according to data type to simplify data handling and analysis
  • Base expectations of data completeness on actual data availability and manage them early on
  • Use GCP to keep the team's working environment synchronized