Breaking Bad, Mad Men, How I Met Your Mother or Game of Thrones - predicting ratings with big data

Mo Data stashed this in Analysis Tips and Tricks

http://www.r-bloggers.com/using-sentiment-analysis-to-predict-ratings-of-popular-tv-series/

After some minor data cleaning, I was able to plot the evolution of IMDB user ratings for some of the most popular TV series. Breaking Bad looks like the highest rated series, followed closely by Game of Thrones. It is also interesting to note the big drop in ratings for shows such as Family Guy, South Park and How I Met Your Mother. The same goes for The Simpsons, which (I’ve been told) used to be excellent and is now much less fun to watch.

Since I’ve recently taken an interest in NLP and some of the challenges associated with it, I also decided to perform a sentiment analysis of the TV series under study. In this case, we can use the AFINN list of positive and negative words in the English language, which provides 2477 words weighted in a range of [-5, 5] according to their “negativeness” or “positiveness”. For example, the phrase below would be scored as -3 (terrible) - 2 (mistake) + 4 (wonderful) = -1:

"There is a terrible mistake in this work, but it is still wonderful!"

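The scoring rule in that example can be sketched in a few lines of Python. This is only an illustration: the lexicon below holds just the three words scored above (the full AFINN list would be loaded from its file), and the names `AFINN_EXCERPT` and `sentiment_score` are made up for this sketch.

```python
import re

# Tiny excerpt of AFINN-style weights: only the three words scored in the
# example phrase. The real list has 2477 entries; loading it is left out here.
AFINN_EXCERPT = {"terrible": -3, "mistake": -2, "wonderful": 4}


def sentiment_score(text, lexicon=AFINN_EXCERPT):
    """Sum the weights of all known words; unknown words count as 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)


phrase = "There is a terrible mistake in this work, but it is still wonderful!"
print(sentiment_score(phrase))  # -1
```

Words missing from the lexicon simply contribute nothing, so a long transcript with few weighted words ends up with a score near zero.
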
I used a Python scraper (for any mildly sophisticated scraping purposes, the BeautifulSoup Python library still has no equal in R) to retrieve the transcripts of all episodes in each TV series and computed their overall sentiment score, which produced the figure below. Here, the higher the sentiment score, the more “positive” the episode, and vice versa.
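
As a rough sketch of the text-extraction step, the snippet below pulls the visible text out of a page so it can be passed to a word-level scorer. It deliberately uses the standard library's `html.parser` as a dependency-free stand-in for BeautifulSoup, and the page fragment is hypothetical; real transcript pages will be structured differently and this is not the scraper used in the post.

```python
from html.parser import HTMLParser


class TranscriptText(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def transcript_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TranscriptText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical page fragment, for illustration only.
SAMPLE_PAGE = """
<html><head><style>p { color: red; }</style></head>
<body><p>What a wonderful day.</p><script>var x = 1;</script>
<p>This is a terrible mistake.</p></body></html>
"""

print(transcript_text(SAMPLE_PAGE))
```

The resulting string can then be tokenised and scored against a word list such as AFINN, one episode at a time.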

Of the TV series featured here, we can see that Game of Thrones is by far the most negative of them all, which is not surprising given the plotting, killing and general all-out warring that goes on in this show. On the flip side, Glee was the most positive TV series, which also makes a lot of sense, given how painfully corny it can be. Of the shows that have already ended (Friends, West Wing and Grey’s Anatomy), it is interesting to observe a progressive rise of positiveness as we get closer to the final episode, presumably because the writers try to end the series on a high note. I have included more detailed graphs of the rating and sentiments for each TV series at the bottom of this post.

Looking at the plot above, we can wonder whether user ratings are somehow dependent on the sentiments of a given episode. We can investigate this further by fitting a simple model in which the response is the IMDB user ratings, and predictor variables are sentiment, number of submitted votes, and TV series.

sentiment  rating  VoteCount  series
148        8.4     2352       BBT
61         8.4     1691       Breaking Bad
115        7.9     1418       BBT
109        8.2     1458       Game of Thrones
194        8.1     1356       Simpsons
131        8.5     1406       Simpsons
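
To make the modelling step concrete, here is a minimal least-squares sketch in plain Python, fitted only to the six sample rows shown above. It is an illustration under stated assumptions, not the post's actual fit: the series predictor is left out because six rows cannot support one coefficient per show, the rescaling of sentiment and vote count is my own choice to keep the numbers well conditioned, and the helper names `solve` and `fit_ols` are hypothetical.

```python
# Six sample rows from the post: (sentiment, rating, vote_count, series).
ROWS = [
    (148, 8.4, 2352, "BBT"),
    (61, 8.4, 1691, "Breaking Bad"),
    (115, 7.9, 1418, "BBT"),
    (109, 8.2, 1458, "Game of Thrones"),
    (194, 8.1, 1356, "Simpsons"),
    (131, 8.5, 1406, "Simpsons"),
]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows):
    """Fit rating ~ intercept + sentiment/100 + votes/1000 via normal equations."""
    X = [[1.0, s / 100.0, v / 1000.0] for s, _, v, _ in rows]
    y = [rating for _, rating, _, _ in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * xi for b, xi in zip(beta, x)) for x in X]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return beta, residuals


beta, residuals = fit_ols(ROWS)
print("intercept, sentiment (per 100), votes (per 1000):", beta)
```

With so few rows the coefficients mean very little; the point is only the shape of the model, with ratings as the response and sentiment and vote count as predictors. A full fit would use every episode and add the series as a categorical predictor.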

More here: http://www.r-bloggers.com/using-sentiment-analysis-to-predict-ratings-of-popular-tv-series/

Mo Data
8:08 PM May 28 2014

Stashed in: Are You Not Entertained?, tv

I'm not sure what your conclusion is. Ratings correlate with sentiment?

Adam Rifkin
12:00 AM May 29 2014

Ha, the post left me a little confused too, but it was late and I stuck it here anyway, hoping that it might prompt some discussion that would help me work this out.

From what I can make out the conclusion is as follows:

The reviews on any episode can have a positive or negative tone. This seems to depend on the actual content of the episode. There is another factor where the screenwriter tries to end the series on a high note, so the tone might veer to the positive (but this does not seem to be borne out in the time series charts).

The author does say this: "In conclusion, while this is a relatively unrigorous study, it appears that we can predict with reasonable accuracy the average IMDB user ratings that will be assigned to an episode, so long as we know its overall sentiment score and the number of submitted votes." but the conclusion seems to come out of the blue.

(Remember that I use PandaWhale as a place to direct clients to, and we have discussions about these items.) This one is looking like a conversation on how a piece of analysis is somewhat inconclusive and not presented well to the people who will ultimately make decisions on programming or advertising.

Mo Data
8:36 AM May 29 2014

Thank you for the reminder.

It does seem like the conclusion comes out of the blue, but nonetheless it's entertaining to think of the ways big data could be useful when it comes to entertainment (recommendations, for example).

Adam Rifkin
9:45 AM May 29 2014
