Results
Lessons Learned #
- Larger datasets often require paid API calls, so data costs should be weighed more carefully during data planning
- Researching API restrictions before collecting data is essential, as expectations and reality may differ (e.g., the Tweepy API returns only recent tweets; see the first sketch after this list)
- Heavy web scraping requires working around access restrictions (e.g., a proper header and referrer are needed to access the Sephora website; see the second sketch after this list)
- JavaScript is not rendered by Python requests alone (see the third sketch after this list)
- Data clean-up is indispensable, yet it demands the largest time investment after data collection
- Balancing MySQL and NoSQL storage according to data type improves data handling and makes analyses easier to produce (see the final sketch after this list)
- Expectations about data completeness should be grounded in actual data availability and managed early on
- GCP provides a synchronized working environment for the team
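As an illustration of the API-restriction point, a minimal sketch using Tweepy's v2 client: `search_recent_tweets` covers only roughly the last seven days, so older tweets are simply unreachable through this endpoint. The bearer token and query string below are placeholders, not the project's actual values.

```python
import tweepy

# Placeholder credential; replace with a real bearer token.
BEARER_TOKEN = "YOUR_BEARER_TOKEN"

client = tweepy.Client(bearer_token=BEARER_TOKEN)

# search_recent_tweets only reaches back about 7 days --
# tweets older than that are not returned by this endpoint.
response = client.search_recent_tweets(
    query="sephora -is:retweet lang:en",  # example query
    max_results=100,  # per-request cap imposed by the API
)

for tweet in response.data or []:
    print(tweet.id, tweet.text[:80])
```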
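The Sephora access-restriction point can be sketched with the requests library; the header values and URL below are plausible placeholders rather than the project's exact configuration.

```python
import requests

# Without a browser-like User-Agent and Referer, the site may
# return 403 or an anti-bot page instead of the product HTML.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    "Referer": "https://www.sephora.com/",
}

resp = requests.get(
    "https://www.sephora.com/shop/skincare",  # example URL
    headers=headers,
    timeout=10,
)
resp.raise_for_status()

# requests returns raw HTML only; anything injected by JavaScript
# after page load will be missing from resp.text.
print(resp.status_code, len(resp.text))
```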
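Because requests returns only the raw HTML, JavaScript-injected content needs an actual browser engine. Below is a minimal sketch using headless Selenium as one possible workaround; the project only notes the limitation, not which tool (if any) it used to get around it.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.sephora.com/shop/skincare")  # example URL
    # page_source holds the DOM *after* JavaScript has executed,
    # unlike the raw HTML returned by requests.
    html = driver.page_source
    print(len(html))
finally:
    driver.quit()
```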
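Finally, the MySQL/NoSQL balance might look like the following sketch, assuming pymysql and pymongo are installed; the table, database, and connection details are hypothetical, not the project's actual schema.

```python
import pymysql
from pymongo import MongoClient

# Structured, fixed-schema records fit naturally in MySQL.
conn = pymysql.connect(host="localhost", user="user",
                       password="secret", database="products")
with conn.cursor() as cur:
    # Assumes a hypothetical `reviews` table already exists.
    cur.execute(
        "INSERT INTO reviews (product_id, rating) VALUES (%s, %s)",
        (1234, 5),
    )
conn.commit()
conn.close()

# Semi-structured documents with varying, nested fields fit NoSQL.
mongo = MongoClient("mongodb://localhost:27017")
mongo.scrape_db.raw_pages.insert_one(
    {"url": "https://example.com/item", "payload": {"tags": ["a", "b"]}}
)
```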
Recommendations #
- Budget for paid API calls during data planning, since larger datasets quickly exceed free tiers
- Verify each API's restrictions (rate limits, historical coverage) before committing to a data source, as the Tweepy recent-tweets-only limit showed
- Plan for access restrictions when scraping at scale, e.g., by setting a proper header and referrer as the Sephora website requires
- Use a browser-based tool for pages that rely on JavaScript, since Python requests alone does not render it
- Allocate generous time for data clean-up; expect it to be the largest post-collection effort
- Match the storage engine to the data type: MySQL for structured records, NoSQL for semi-structured documents
- Set expectations about data completeness early, based on what the data sources can actually provide
- Adopt a shared cloud environment such as GCP to keep the team's working environment synchronized