We Rate Dogs

Role: Data Analyst, Data Visualization Designer
Tools used: Python, NumPy, pandas, Matplotlib, Jupyter notebook, Excel, Sketch

GOAL

Practice data wrangling using #WeRateDogs, a funny Twitter feed where dogs are ‘numerically objectified’, and present the findings visually. As with any data project, I began with questions. What determines the ‘best’ dog? What are the most common dog breeds? What are the most common dog names?

What I did

Examined three datasets (6,773 total entries)
Used Python to wrangle and analyze the data
Created a custom visualization to communicate observations

Introduction

Real-world data rarely comes to us clean. I will be wrangling, analyzing, and visualizing the Twitter archive of #WeRateDogs, an account that rates people's dogs and adds a humorous comment to each rating. The higher the rating, the 'better' the dog. WeRateDogs has over 4 million followers and has received international media coverage.

The Data Wrangling Process

Gather. I gathered data from three sources:

The #WeRateDogs Twitter archive, a CSV file that contains the tweet, rating, dog name, and dog 'stage' in life (such as puppy).
An 'image prediction' file that records which breed of dog appears in each tweet, according to a neural network. I downloaded this file programmatically from Udacity's servers using the Requests library.
Twitter's API, which supplied the retweet count and favorite count, two columns missing from the Twitter archive. I queried the API for each tweet's JSON data using Python's Tweepy library.
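
The API step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the `tweet_id` key, and the output file layout are my assumptions, and `api` is assumed to be an already authenticated Tweepy `API` object. Catching a bare `Exception` is a simplification to keep the sketch independent of the Tweepy version.

```python
import json


def fetch_tweet_counts(api, tweet_ids):
    """Return (rows, failed_ids) for the given tweet IDs.

    `api` is assumed to behave like a tweepy.API object: get_status()
    returns a status whose `_json` attribute holds the raw tweet dict.
    """
    rows, failed = [], []
    for tweet_id in tweet_ids:
        try:
            status = api.get_status(tweet_id, tweet_mode="extended")
        except Exception:  # e.g. deleted tweets; narrow this in real code
            failed.append(tweet_id)
            continue
        data = status._json
        rows.append({
            "tweet_id": data["id"],  # hypothetical key name for later joins
            "retweet_count": data["retweet_count"],
            "favorite_count": data["favorite_count"],
        })
    return rows, failed


def save_json_lines(rows, path):
    """Write one JSON object per line so the API only has to be queried once."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")
```

Recording the IDs that fail, instead of stopping at the first error, keeps a record of which tweets could not be retrieved.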

Assess. I assessed the data based on quality and tidiness.

Low quality (‘dirty’) data has content issues such as missing, invalid, inaccurate, and inconsistent data. I flagged the need to remove unnecessary columns, convert data types, make dog names title case, remove entries that were not really dogs, and remove outliers.
Untidy (‘messy’) data has structural issues. I flagged the need to gather dog stages from multiple columns into one, create a ‘prediction’ column (Dog, Maybe Dog, Not Dog), and combine the three datasets into one.
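
Two of the structural fixes listed above, gathering the dog stages into one column and combining the three datasets, can be sketched in plain Python (the project itself used pandas). The column names (`doggo`, `floofer`, `pupper`, `puppo`, `tweet_id`), the convention that a stage cell holds the stage name when it applies, and the choice of an inner join are all assumptions made for illustration.

```python
# Assumed layout: one column per stage, holding the stage name when it applies
# and some placeholder (such as "None") when it does not.
STAGE_COLUMNS = ("doggo", "floofer", "pupper", "puppo")


def collapse_stage(row):
    """Combine the one-column-per-stage layout into a single value."""
    stages = [col for col in STAGE_COLUMNS if row.get(col) == col]
    return ", ".join(stages) if stages else None


def tidy_row(row):
    """Return a copy of the row with the stage columns replaced by `stage`."""
    tidy = {key: value for key, value in row.items() if key not in STAGE_COLUMNS}
    tidy["stage"] = collapse_stage(row)
    return tidy


def merge_on_tweet_id(archive, predictions, counts):
    """Inner-join three lists of row dicts on the assumed `tweet_id` key."""
    predictions_by_id = {row["tweet_id"]: row for row in predictions}
    counts_by_id = {row["tweet_id"]: row for row in counts}
    merged = []
    for row in archive:
        tweet_id = row["tweet_id"]
        if tweet_id in predictions_by_id and tweet_id in counts_by_id:
            merged.append({**row, **predictions_by_id[tweet_id], **counts_by_id[tweet_id]})
    return merged
```

An inner join drops tweets that are missing from any one source; a left join would keep them with gaps, so the right choice depends on the analysis.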

Clean. I converted each assessment observation into an action item. I defined, coded, and tested each item to confirm the problem was fixed. I addressed missing data first, then structural issues, then quality issues. For example, there were many ‘rating’ outliers. Dogs were supposed to be rated from 1 to 10, but many ratings were well above 10. I converted the ones I could to a 10-point scale; for example, ‘165 out of 150’ became a rating of 11.
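
The rating conversion described above is simple arithmetic: 165 × 10 / 150 = 11. A minimal sketch follows. The function names, the regular expression for pulling a `numerator/denominator` pair out of tweet text, and the rule that only positive multiples of 10 count as convertible denominators are assumptions for illustration, not the project's actual rules.

```python
import re

# Assumed format: ratings appear in the tweet text as "numerator/denominator".
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def extract_rating(text):
    """Return (numerator, denominator) from the first rating-like match, else None."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


def to_ten_scale(numerator, denominator):
    """Rescale a rating to a denominator of 10.

    Returns None when the denominator is not a positive multiple of 10,
    which this sketch treats as 'cannot be converted'.
    """
    if denominator <= 0 or denominator % 10 != 0:
        return None
    # Multiply before dividing to avoid floating-point drift (165/150*10).
    return numerator * 10 / denominator


print(to_ten_scale(165, 150))  # 11.0
```

Ratings that return None would still need a manual decision: correct them by hand or drop the row.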

Conclusion

I analyzed the information in the clean, combined dataframe and created initial visuals using Matplotlib in Python. Finally, I created a custom visualization of my findings in Sketch.

Although ratings are supposed to be from 1-10, 14 is actually the ‘highest’ rating.
I defined the ‘best’ dog as the one with a combination of highest rating, most favorites, AND most retweets—Bo, a standard poodle.
The most favorited dog is a Women’s March supporter, a Lakeland Terrier.
The most retweeted dog is a Siberian Husky that can blow bubbles.
Goldens, Labs, and Corgis are the most common dog breeds.
Cooper and Lucy are the most common dog names.
The most popular day to post is Monday, and the most popular month is December. Dogs clearly prefer to heighten our spirits during more stressful human times.
Most people post from the Twitter app on their iPhone. The camera makes it so simple.
Most dogs in the Twitter feed are puppies (or ‘puppers’).
People prefer to ‘favorite’ a dog over a ‘retweet’, and both actions have decreased over time (2015-2017).
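
Counts such as the most common dog names can be reproduced with a few lines of Python. This is a rough sketch in plain Python, not the notebook's pandas code; the `name` key, the function name, and the sample rows are made up for illustration and are not taken from the real dataset.

```python
from collections import Counter


def most_common_names(rows, top_n=2):
    """Count dog names after normalizing them to title case.

    Rows without a name are skipped. `name` is an assumed column name.
    """
    names = [row["name"].strip().title() for row in rows if row.get("name")]
    return Counter(names).most_common(top_n)


# Made-up sample rows, only to show the call.
sample = [{"name": "rex"}, {"name": "Bella"}, {"name": "REX"}, {"name": None}]
print(most_common_names(sample, top_n=1))  # [('Rex', 2)]
```

Normalizing case before counting matters: without it, 'rex' and 'REX' would be counted as different names.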