We Rate Dogs

Role: Data Analyst, Data Visualization Designer
Tools used: Python, NumPy, pandas, Matplotlib, Jupyter Notebook, Excel, Sketch

GOAL

Practice data wrangling using #WeRateDogs, a funny Twitter feed where dogs are ‘numerically objectified’, and present the findings visually. As with any data project, I began with questions: What determines the ‘best’ dog? What are the most common dog breeds? What are the most common dog names?

What I did

  • Examine three datasets (6,773 total entries)

  • Use Python to wrangle and analyze the data

  • Create a custom visualization to communicate observations

  • View Python code

Introduction

Real-world data rarely comes to us clean. In this project I wrangle, analyze, and visualize the tweet archive of #WeRateDogs, a Twitter account that rates people's dogs and adds a humorous comment. The higher the rating, the 'better' the dog. WeRateDogs has over 4 million followers and has received international media coverage.

The Data Wrangling Process

Gather. I gathered data from three sources:

  • The #WeRateDogs Twitter archive, a CSV file that contains each tweet's text, rating, dog name, and dog 'stage' in life (such as puppy).

  • An 'image prediction' file, which records what breed of dog appears in each tweet according to a neural network. I downloaded this file programmatically from Udacity's servers using the Requests library.

  • Twitter's API, used to gather retweet count and favorite count, two columns missing from the Twitter archive. I queried the API for each tweet's JSON data using Python's Tweepy library (both the download and the API query are sketched below).
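
As a rough illustration of this gathering step, here is a minimal sketch. The URL, file names, and credentials are placeholders rather than the actual values, and the Tweepy calls follow the older v3-style interface.

```python
import pandas as pd
import requests
import tweepy

# Download the image-prediction file programmatically with Requests
# (placeholder URL standing in for the Udacity-hosted file).
url = 'https://example.com/image-predictions.tsv'
response = requests.get(url)
with open('image_predictions.tsv', 'wb') as f:
    f.write(response.content)
image_predictions = pd.read_csv('image_predictions.tsv', sep='\t')

# Query the Twitter API for each tweet's JSON data with Tweepy (v3-style),
# keeping the retweet and favorite counts that the archive lacks.
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')   # placeholder credentials
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')
api = tweepy.API(auth, wait_on_rate_limit=True)

archive = pd.read_csv('twitter-archive-enhanced.csv')           # assumed filename
counts = []
for tweet_id in archive['tweet_id']:
    try:
        status = api.get_status(tweet_id, tweet_mode='extended')
        counts.append({'tweet_id': tweet_id,
                       'retweet_count': status.retweet_count,
                       'favorite_count': status.favorite_count})
    except tweepy.TweepError:
        continue  # tweet may have been deleted
api_data = pd.DataFrame(counts)
```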

Assess. I assessed the data based on quality and tidiness.

  • Low-quality (‘dirty’) data has content issues such as missing, invalid, inaccurate, or inconsistent values. My assessment identified the need to remove unnecessary columns, convert data types, make dog names title case, drop entries that were not really dogs, and remove outliers.

  • Untidy (‘messy’) data has structural issues. My assessment identified the need to gather the dog stages spread across multiple columns into one, create a ‘prediction’ column (Dog, Maybe Dog, Not Dog), and combine the three datasets into one (a few of the assessment calls are sketched below).
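
A few of the pandas calls behind this assessment might look like the sketch below; the file and column names are assumed to match the Udacity-provided archive.

```python
import pandas as pd

# Assessment sketch: inspect the archive for quality and tidiness issues.
archive = pd.read_csv('twitter-archive-enhanced.csv')            # assumed filename

# Quality: structure, dtypes, missing values, duplicates, suspicious values
archive.info()                                    # column dtypes and non-null counts
print(archive['tweet_id'].duplicated().sum())     # duplicate tweets
print(archive['name'].value_counts().head(20))    # surfaces non-names like 'a', 'the'
print(archive['rating_numerator'].describe())     # reveals rating outliers

# Tidiness: the dog stage is spread across four separate columns
print(archive[['doggo', 'floofer', 'pupper', 'puppo']].sample(5))
```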

Clean. I converted each assessment observation into an action item. Each item was defined, coded, and tested to confirm the problem was fixed. I addressed missing data first, then structural issues, then quality issues. For example, there were many ‘rating’ outliers: dogs were supposed to be rated from 1-10, but many ratings were well above 10. Where possible, I rescaled them to a 10-point scale, so ‘165 out of 150’ became a rating of 11 (see the cleaning sketch below).
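
Here is a minimal sketch of that define-code-test pattern applied to the rating outliers, using a toy DataFrame in place of the real combined dataset (the column names are assumed to match the archive).

```python
import pandas as pd

# Toy frame standing in for the combined dataset.
df = pd.DataFrame({'rating_numerator': [12, 165, 9],
                   'rating_denominator': [10, 150, 10]})

# Define: ratings not on a 10-point scale (e.g. 165/150) should be rescaled.
# Code: rescale the numerator, then set every denominator to 10.
df['rating_numerator'] = df['rating_numerator'].astype(float)
mask = df['rating_denominator'] != 10
df.loc[mask, 'rating_numerator'] = (
    df.loc[mask, 'rating_numerator'] * 10 / df.loc[mask, 'rating_denominator']
)
df['rating_denominator'] = 10

# Test: every rating is now out of 10, and 165/150 became 11/10.
assert (df['rating_denominator'] == 10).all()
assert df.loc[1, 'rating_numerator'] == 11
```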

Conclusion

I analyzed the clean, combined DataFrame and created initial visuals using Matplotlib in Python (one such plot is sketched below). Finally, I created a custom visualization of my findings in Sketch.
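
One of those initial Matplotlib visuals might look roughly like the sketch below; the breed counts here are illustrative placeholders, not the project's actual figures.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder counts standing in for value_counts() on the breed-prediction
# column of the combined DataFrame; the numbers are illustrative only.
top_breeds = pd.Series({'golden_retriever': 150, 'labrador_retriever': 100,
                        'pembroke': 89, 'chihuahua': 83, 'pug': 57})

fig, ax = plt.subplots(figsize=(8, 5))
top_breeds.plot(kind='barh', ax=ax)
ax.invert_yaxis()                        # most common breed on top
ax.set_xlabel('Number of tweets')
ax.set_title('Most common dog breeds in #WeRateDogs')
fig.tight_layout()
plt.show()
```

Key findings: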

  • Although ratings are supposed to be from 1-10, 14 is actually the ‘highest’ rating.

  • I defined the ‘best’ dog as the one with the highest rating, most favorites, AND most retweets: Bo, a standard poodle.

  • The most favorited dog is a Women’s March supporter, a Lakeland Terrier.

  • The most retweeted dog is a Siberian Husky that can blow bubbles.

  • Goldens, Labs, and Corgis are the most common dog breeds.

  • Cooper and Lucy are the most common dog names.

  • The most popular day to post is Monday, and the most popular month is December. Dogs clearly prefer to heighten our spirits during more stressful human times.

  • Most people post from the Twitter app on their iPhone. The camera makes it so simple.

  • Most dogs in the Twitter feed are puppies (or ‘puppers’).

  • People prefer to ‘favorite’ a dog over a ‘retweet’, and both actions have decreased over time (2015-2017).

VISUALIZATION