Hollywood's Top 25

Role: Data Analyst, Data Visualization Designer
Tools Used: Python (NumPy, pandas, Matplotlib), Jupyter notebook, Excel, Sketch

GOAL

A director pitched a documentary to a Hollywood producer. The objective was to uncover characteristics that are associated with high box office revenues. Is this genre likely to do well? Will a great Rotten Tomatoes score predict higher revenues? Will higher revenues indicate a better return on investment?

What I Did

  • Examine The Movie Database (TMDb) CSV (10,866 rows, 21 columns)

  • Quantify relationships between revenue, popularity, return on investment, and genres

  • Design a custom visualization to communicate findings

  • View the Python code

THE PROCESS

Question. I began by posing questions about the data to direct my analysis towards meaningful insights. I was most interested in what properties are associated with higher box office revenues:

  • Are there common genres?

  • Are they also more popular? What does ‘popularity’ mean?

  • Are they more profitable? Do they have a higher return on investment (ROI)?

Wrangle. I loaded The Movie Database (TMDb) data into a Jupyter notebook and wrangled it into a useful format. I fixed missing and errant data, removed duplicates and irrelevant columns, renamed columns and converted data types as necessary, and split cells that held multiple values (such as genres, cast, and directors).
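A minimal sketch of what wrangling steps like these can look like in pandas. This is not the notebook's actual code: the tiny DataFrame, the column names, and the "|" separator between genres are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the TMDb CSV (column names and "|" separator are assumptions).
raw = pd.DataFrame({
    "original_title": ["Movie A", "Movie A", "Movie B", "Movie C"],
    "genres": ["Action|Adventure", "Action|Adventure", "Drama", None],
    "release_year": ["2015", "2015", "1999", "2005"],
    "homepage": ["a.example", "a.example", None, None],
})

def wrangle(df):
    """Deduplicate, trim, rename, retype, and split multi-value cells."""
    df = df.drop_duplicates()                            # remove exact duplicate rows
    df = df.drop(columns=["homepage"])                   # drop a column not needed for analysis
    df = df.rename(columns={"original_title": "title"})  # shorter column name
    df = df.dropna(subset=["genres"])                    # drop rows with no genre information
    df = df.assign(release_year=df["release_year"].astype(int))  # text -> integer
    # Split "Action|Adventure" into a list, then give each genre its own row.
    df = df.assign(genres=df["genres"].str.split("|")).explode("genres")
    return df.reset_index(drop=True)

clean = wrangle(raw)
print(clean)
print(clean["genres"].value_counts())
```

Putting each genre on its own row makes counting straightforward, but it also repeats each movie once per genre, so revenue totals should be computed before the split or on de-duplicated titles.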

Exploratory Data Analysis. I augmented the data with new columns to explore profitability (revenue – budget) and ROI (profit ÷ budget). Since popularity can be subjective, I pulled in ratings from Rotten Tomatoes and the Internet Movie Database (IMDb) to compare all three. I created a new dataframe with the top 100 box office movies to compare with the rest.
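The derived columns and the top-N split could be sketched roughly as follows. The sample values, column names, and helper functions are hypothetical, and treating a budget of 0 as "unknown" is an assumption rather than something stated in the analysis.

```python
import pandas as pd

# Made-up numbers purely for illustration; not taken from the TMDb data.
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "budget": [100.0, 10.0, 50.0, 0.0],
    "revenue": [300.0, 80.0, 40.0, 5.0],
})

def add_profit_and_roi(df):
    """Add profit (revenue - budget) and ROI (profit / budget) columns."""
    out = df.copy()
    out["profit"] = out["revenue"] - out["budget"]
    # Assumption: a budget of 0 means "unknown", so ROI is left as NaN there.
    known_budget = out["budget"].where(out["budget"] > 0)
    out["roi"] = out["profit"] / known_budget
    return out

def split_top(df, n):
    """Return (top n rows by revenue, all remaining rows)."""
    top = df.nlargest(n, "revenue")
    rest = df.drop(top.index)
    return top, rest

scored = add_profit_and_roi(movies)
top, rest = split_top(scored, n=2)
print(top[["title", "revenue", "profit", "roi"]])
```

In this toy data, movie A has the highest revenue and profit, while movie B has the higher ROI because its budget is much smaller.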

Draw Conclusions. Finally, I summarized the relationships I found and made predictions.

  • Higher revenue (box office) movies are more profitable in general but might not yield a higher ROI

  • Popularity has only a weak positive correlation with box office revenue, so it is not a good indicator

  • Most common genres in top grossing: Adventure / Action / Fantasy

  • Most common genres overall: Drama / Comedy / Thriller

  • Least common genres overall: TV Movies / Westerns / Foreign

  • Most popular genre by decade: Drama, except for the 80s when it was Comedy (Thank you, John Hughes!)

  • Average runtimes have decreased by about 18% from 1960 (avg. 118 mins) to 2015 (avg. 97 mins)

  • Fun Fact: Which actors have starred in the most movies?
    Robert De Niro (72), Samuel L. Jackson (71), Bruce Willis (62), Nicolas Cage (61), Michael Caine (53)

  • Fun Fact: Who has directed the most movies?
    Woody Allen (46), Clint Eastwood (34), Steven Spielberg (30), Martin Scorsese (30), Steven Soderbergh (23)
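As a rough illustration of how a relationship such as popularity versus revenue can be quantified, here is a small Pearson correlation sketch. The numbers are invented for the example and the column names are assumptions; the real analysis ran on the full dataset.

```python
import pandas as pd

# Invented sample values; only the technique is the point here.
sample = pd.DataFrame({
    "popularity": [1.0, 2.0, 3.0, 4.0, 5.0],
    "revenue": [10.0, 50.0, 20.0, 40.0, 30.0],
})

# Series.corr computes the Pearson correlation coefficient by default.
r = sample["popularity"].corr(sample["revenue"])
print(f"Pearson r = {r:.2f}")
```

A coefficient near 0 suggests little linear relationship and one near 1 a strong positive one. A small positive value like the one this toy data produces is what "weak positive correlation" refers to.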

Communicate Results. I designed a custom visualization using Sketch. I limited it to the top 25 movies by box office revenue and showed their genres, popularity, and ROI. I wireframed first to explore element positioning and colors. For the final visualization, I color-coded the highest values in each category (box office revenue, genres, popularity, and ROI) to draw attention without distracting from the design as a whole.

Conclusion

Doing better at the box office does not guarantee a higher return on investment. Not a single documentary is among the top 25 revenue-producing movies; action, adventure, and/or fantasy movies yield the highest box office revenues. Popularity is not a good revenue indicator: some of the top-grossing movies are popular and others are not. Also, with average runtimes now under 100 minutes, keeping a movie under that length seems best suited to today’s audience attention span.

Of the Top 25 box office revenue producing movies:

  • Top Revenue: Avatar ($2.8B)

  • Biggest Return on Investment: Minions (15x)

  • Most Popular: The Dark Knight (IMDb)

  • Most Popular: Toy Story 3 (Rotten Tomatoes)

  • Most Popular: Jurassic World (TMDb)

  • Most Common Genres: Action / Adventure / Sci-Fi / Fantasy

VISUALIZATION

movies-tmdb-final.png