IMDB Top 100K Movies – Analysis in Depth, Part 1

114 points by lauriswtf over 11 years ago

21 comments

sigil over 11 years ago

You mention scraping, but did you actually scrape these? I just discovered that IMDB publishes textual data dumps. They appear to be pretty complete. Terms of use are non-commercial, but I'd love to see more analyses like this!

http://www.imdb.com/interfaces

3rd3 over 11 years ago

The conclusion that runtime correlates with rating is not quite clear to me. It looks to me like the scatter plot just gets sparser as runtime increases, while the rating distribution remains more or less the same.

Somehow the number of movies exploded last(?) year.
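
One way to separate "the plot gets sparser" from "the ratings shift" is to bin movies by runtime and look at the count and the mean rating per bin. The following is only a sketch: the function name, the 30-minute bin width, and the sample values are all made up for illustration and are not taken from the article's data.

```python
from collections import defaultdict
from statistics import mean


def rating_by_runtime_bin(movies, bin_width=30):
    """Group (runtime_minutes, rating) pairs into fixed-width runtime bins.

    Returns {bin_start: (count, mean_rating)}, ordered by bin_start.
    """
    bins = defaultdict(list)
    for runtime, rating in movies:
        bin_start = (runtime // bin_width) * bin_width
        bins[bin_start].append(rating)
    return {start: (len(r), mean(r)) for start, r in sorted(bins.items())}


# Hypothetical sample, NOT real IMDB data.
BIN_WIDTH = 30
sample = [(85, 6.0), (88, 7.0), (95, 6.5), (100, 6.5),
          (110, 7.0), (118, 6.0), (125, 6.5), (170, 6.5)]

for start, (count, avg) in rating_by_runtime_bin(sample, BIN_WIDTH).items():
    print(f"{start}-{start + BIN_WIDTH - 1} min: n={count}, mean rating={avg:.2f}")
```

If the counts drop with runtime while the per-bin means stay flat, the apparent trend in the scatter plot is mostly sparsity rather than correlation.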

mcphilip over 11 years ago

Great work!

I did a quick and dirty project[1] involving IMDB and Neo4j when I had some time off between jobs over the holidays. I used screen scraping to get the list of IMDB ids for the AFI top 100 movies and then made calls to MyMovieAPI to pull down IMDB data about each AFI film. I wasn't aware of imdb.com/interfaces at that point, but it wasn't really my goal to do the "best" possible implementation since it was just a learning experience. For those interested, there's a simple overview of the project[2] that shows what (I thought) were interesting questions about the data: for instance, which actors, if any, have appeared in 2 or more of the top 25 AFI films?

After looking at imdb.com/interfaces, I'm not sure that it has what I'm looking for. My plan for expanding this project at some point in the future is to start with data from Freebase[3], since it's already presented in a normalized format, and then fill in missing details via IMDB as necessary.

My ultimate goal is to generalize the N-degrees-to-Bacon trivia question to work with any two actors, but that requires getting a lot more data to work with.

All in all, it's a fun dataset to play with.

[1] https://github.com/mcphilip/film-graph
[2] http://htmlpreview.github.io/?https://github.com/mcphilip/film-graph/blob/master/film-graph-overview.html
[3] http://www.freebase.com/film
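
The "N degrees between any two actors" question can be sketched as a breadth-first search over a co-star graph. This is a generic illustration in plain Python, not the Neo4j approach used in the linked project; the function names and the placeholder film and actor ids are hypothetical.

```python
from collections import defaultdict, deque


def build_costar_graph(casts):
    """Map each actor to the set of actors they share at least one film with.

    `casts` is {film_id: [actor, ...]}.
    """
    graph = defaultdict(set)
    for actors in casts.values():
        for actor in actors:
            graph[actor].update(a for a in actors if a != actor)
    return graph


def degrees_of_separation(graph, start, goal):
    """Return the fewest co-star hops from start to goal, or None if unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        actor, dist = queue.popleft()
        for costar in graph.get(actor, ()):
            if costar == goal:
                return dist + 1
            if costar not in seen:
                seen.add(costar)
                queue.append((costar, dist + 1))
    return None


# Placeholder data, not real casts.
casts = {
    "film_1": ["actor_a", "actor_b"],
    "film_2": ["actor_b", "actor_c"],
    "film_3": ["actor_d"],
}
graph = build_costar_graph(casts)
print(degrees_of_separation(graph, "actor_a", "actor_c"))
print(degrees_of_separation(graph, "actor_a", "actor_d"))
```

Breadth-first search visits actors in order of distance, so the first time the goal is reached the hop count is minimal. On a full dataset the in-memory graph gets large, which is one reason a graph database is attractive for this.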

facepalm over 11 years ago

Funny that the overrated movies list is dominated by Twilight. I suspect boyfriends who were forced to watch them together with their girlfriends are responsible.

brownbat over 11 years ago

Buğra talks about looking at directors and actors next.

I'd really like to see whether directors or writers have a bigger impact on quality of films. Like a smallish number of critics, including Pauline Kael, I'm deeply suspicious of the auteur theory that everyone kind of unquestioningly accepts.

“A filmgoer seeking out pictures written by, say, Eric Roth or Charlie Kaufman won’t always see a masterpiece, but he’ll see fewer clunkers than he would following even a brilliant director like John Boorman, or an intelligent actor like Jeff Goldblum. It’s all a matter of betting on the fastest horse, instead of the most highly touted or the prettiest.” - David Kipen

http://en.wikipedia.org/wiki/Schreiber_theory

Juha over 11 years ago

> Therefore, it may be safe to assume this ranking more or less holds true for non-top 250 movies as well.

This may not hold true. A while ago I was looking into it, and they seemed to use a more complex weighted average without publishing the exact details (possibly using internal user scoring). This may affect the final rating in many ways. More detailed analysis here: http://www.quora.com/Movies/What-algorithm-does-IMDB-use-for-ranking-the-movies-on-its-site?show=1
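
For context, one commonly cited form of such a weighted average is the Bayesian-style estimate IMDB has described for its Top 250 chart: WR = (v / (v + m)) * R + (m / (v + m)) * C, where R is the movie's mean rating, v its vote count, m a minimum-votes threshold, and C the mean rating across all movies. The sketch below only illustrates that shape. Whether anything like it is applied outside the Top 250 is not known, and the values passed for m and C are placeholders, not IMDB's.

```python
def weighted_rating(mean_rating, votes, min_votes, global_mean):
    """Shrink a movie's mean rating toward the global mean.

    With few votes the result stays close to global_mean; with many votes
    it approaches the movie's own mean_rating.
    Assumes votes + min_votes > 0.
    """
    v, m = votes, min_votes
    return (v / (v + m)) * mean_rating + (m / (v + m)) * global_mean


# Placeholder numbers chosen for illustration only.
M = 25000   # assumed minimum-votes threshold
C = 7.0     # assumed mean rating across all movies

print(weighted_rating(9.0, 500, M, C))        # few votes: pulled toward C
print(weighted_rating(9.0, 1_000_000, M, C))  # many votes: close to 9.0
```

Under a scheme like this, two movies with the same raw average can rank very differently depending on vote count, which is one way the displayed rating and the ranking order could diverge.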

vikp over 11 years ago

Interesting stuff, although it would be nice to see more analysis and fewer tables/charts. Some regression lines would also be good, and would help in interpreting correlations.

I was wondering how your post got 3 million Facebook shares, then I realized that you left in the default data-href attribute from the Facebook docs. You might want to change that.
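
A regression line of the kind suggested here can be fitted with ordinary least squares in a few lines. This is a minimal sketch using only the standard library; the function name and the sample points are invented for illustration and are not the article's data.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Also returns Pearson's r. Raises ZeroDivisionError if all xs (or all ys)
    are identical, since the fit or r is undefined in that case.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


# Hypothetical (runtime in minutes, rating) points, NOT real IMDB data.
runtimes = [80, 95, 110, 125, 140, 160]
ratings = [5.9, 6.4, 6.2, 6.8, 6.6, 7.1]

slope, intercept, r = fit_line(runtimes, ratings)
print(f"rating = {slope:.4f} * runtime + {intercept:.2f}  (r = {r:.2f})")
```

The slope and intercept can be drawn over an existing matplotlib scatter plot with a single `plot` call on two endpoints, and r gives readers a number to attach to the word "correlates".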

anjc over 11 years ago

I'd love it if someone (who isn't as lazy as me) could figure out a sophisticated way to show the actual good movies from a year, rather than the popularly good ones. Sentiment analysis? Trend recognition? I don't know, but I feel like IMDB and Rotten Tomatoes are now effectively useless for new movie reviews.

infinitybeyond over 11 years ago

I am having two problems with your site. In FF the data isn't centered. In Chrome and FF I don't see anything in the preformatted code block. Newest version of FF and Chrome on Win 8.1. FF is on the left, Chrome on the right.

http://i.imgur.com/udHv4pH.png

Implicated over 11 years ago

Would be very interested to see the correlation of rating to director/actor/actress/budget?

eCa over 11 years ago

Interesting!

A couple of comments:

* The first two tables could be joined, with the movies from the first table bolded to distinguish them as "best rated".
* Should be: "not average runtimes(>70 and <120)" (not the other way around)
* The labels of the certificate graphs are on the wrong axis.

roshansingh over 11 years ago

Gangs of Wasseypur was released in two parts as two separate movies and IMDB has added the runtime of both the movies. However both movies were equally good :)

JFrolich over 11 years ago

Great analysis, and nice matplotlib visualisations. Would it be possible to share the 'in' code to produce the graphs for learning purposes? :)

llimllib over 11 years ago

ipython notebook rocks, these sorts of analyses are super easy to cook up.

matdrewin over 11 years ago

Curious to know what tools you used to gather and build out the stats?

JoeAltmaier over 11 years ago

Runtime vs Rating is essentially a heatmap; hard to draw conclusions.

JoeAltmaier over 11 years ago

Best rated movies are bimodal: war movies, and gangster movies.

bemmu over 11 years ago

Where can one get a list of more top movies than 250?

chaddeshon over 11 years ago

Melancholia is not 450 minutes long.

matiasb over 11 years ago

nice!

flibertgibit over 11 years ago

I'd like to see more breakdown by release year, e.g. # of movies in each category by release year.

I think it would be interesting to look at those stats next to economic stats, etc.

I'd also like to see a more granular breakdown of attributes of each movie (movies relating to technology, movies with a workers' union being a strong component of the film, race relations, international relations, etc.) and the # of each of those per year, but that would be much more work.
