Data Preparation
Of all the datasets, the tweet text needed the heaviest cleanup.
Data Cleanup
- Remove NaN
- Remove Emojis from tweets
- REGEX to clean tweet text
REGEX: Tweet text contains emojis, hashtags, numbers, links, and similar noise, so regular expressions were used to clean each tweet.
NLTK: NLTK word tokenization was used to extract more meaningful words from the cleaned tweet text.
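The cleanup steps above could look roughly like the following sketch. It uses only the Python standard library. The patterns, the helper names `is_missing` and `clean_tweet`, and the sample tweet are illustrative assumptions, not the project's actual code. In particular, the emoji ranges are approximate, and keeping the hashtag word while dropping the `#` symbol is one possible reading of "cleaning hashtags".

```python
import math
import re

# Assumed patterns for illustration; the project's real expressions are not shown.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji coverage (pictographs, emoticons, dingbats, joiners); not exhaustive.
EMOJI_RE = re.compile("[\U0001F000-\U0001FAFF\u2600-\u27BF\uFE0F\u200D]+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")
WHITESPACE_RE = re.compile(r"\s+")


def is_missing(value):
    """Return True for None or a float NaN (a typical empty-cell value)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def clean_tweet(text):
    """Lowercase a tweet and strip links, emojis, '#', digits, and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = text.replace("#", " ")        # assumption: keep the hashtag word itself
    text = NON_LETTER_RE.sub(" ", text)  # drops digits and leftover punctuation
    return WHITESPACE_RE.sub(" ", text).strip()


tweets = ["Loving the #NBA finals 🏀🔥 https://t.co/abc123 2021!!", float("nan"), None]
cleaned = [clean_tweet(t) for t in tweets if not is_missing(t)]
print(cleaned)
```

The cleaned strings would then be passed to the tokenizer.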
Clustering On Location Data From Tweets
Tweet locations are unstructured, free-form user input, so cleaning them up required heavy clustering that involved:
- Key Collision Methods
- Nearest Neighbor Methods
- Fingerprinting
- N-Gram Fingerprint
- Phonetic Fingerprint (Metaphone3 was the most effective at finding and clustering entries that were not valid locations)
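As a rough illustration of the key collision idea, the sketch below computes a fingerprint and an n-gram fingerprint for each location string and groups values that share a key. It follows the commonly described recipe (normalize case, strip punctuation, sort and deduplicate tokens or n-grams). The function names, the sample locations, and the choice of `n=2` are assumptions for the example. The phonetic variant is not shown because Metaphone3 is not part of the Python standard library.

```python
import re
import string
import unicodedata
from collections import defaultdict

PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]")


def to_ascii(text):
    """Fold accented characters to their closest ASCII form."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


def fingerprint(value):
    """Key that ignores case, punctuation, accents, and token order."""
    text = PUNCT_RE.sub("", to_ascii(value.strip().lower()))
    return " ".join(sorted(set(text.split())))


def ngram_fingerprint(value, n=2):
    """Key built from the sorted, de-duplicated character n-grams."""
    text = re.sub(r"[\W_]+", "", to_ascii(value.lower()))
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    return "".join(sorted(grams))


def cluster(values, key_func):
    """Group values whose keys collide; keep groups with more than one spelling."""
    buckets = defaultdict(list)
    for value in values:
        buckets[key_func(value)].append(value)
    return [group for group in buckets.values() if len(set(group)) > 1]


# Hypothetical sample of user-entered locations.
locations = ["New York, NY", "new york ny", "NY, New York",
             "Los Angeles", "los  angeles.", "São Paulo", "Sao Paulo", "Chicago"]
for group in cluster(locations, fingerprint):
    print(group)
```

The n-gram fingerprint also catches differences in spacing that the plain fingerprint misses, for example "Los Angeles" versus "losangeles".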
Entries that were not classified as a location were assigned “ ”.
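The nearest neighbor methods listed above compare strings by distance rather than by a shared key. A minimal sketch, assuming Levenshtein edit distance and a simple greedy grouping, might look like this. The `radius` threshold, the function names, and the sample values are hypothetical, and the project may have used a different distance or tool.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def nearest_neighbor_clusters(values, radius=2):
    """Greedily add each value to the first cluster with a member within radius."""
    clusters = []
    for value in values:
        key = value.strip().lower()
        for group in clusters:
            if any(levenshtein(key, member.strip().lower()) <= radius for member in group):
                group.append(value)
                break
        else:
            clusters.append([value])
    return clusters


# Hypothetical sample of misspelled locations.
sample = ["Chicago", "Chicgo", "chicago ", "Boston", "Bostn", "Seattle"]
print(nearest_neighbor_clusters(sample))
```

A small radius keeps distinct place names apart, while a larger one merges more typos at the risk of false matches.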