Realigning the Stars: Enhancing the IMDb Rating System

People who enjoy watching movies often rely on rankings to help them decide what to watch. While doing this myself recently, I realized that a large number of highly ranked films fell into the drama genre. This observation led me to believe that the ranking system might be biased towards certain genres.

I was browsing IMDb, one of the most popular websites for movie enthusiasts, which features movies from different countries and time periods. Its renowned ranking system is based on an extensive collection of user reviews. To investigate this further, I decided to download the complete dataset from IMDb and analyze it to see if I could develop a refined ranking system that took into account a wider range of factors.

The IMDb Rating System: Examining the Data

I was able to obtain data on 242,528 movies released between 1970 and 2019. For each movie, IMDb provided the following information: Rank, Title, ID, Year, Certificate, Rating, Votes, Metascore, Synopsis, Runtime, Genre, Gross, and SearchYear.

To ensure I had sufficient data for analysis, I set a minimum threshold of 500 votes per movie. This filtering step reduced the dataset to 33,296 movies (a sketch of the step follows the table). The table below presents a summary of the fields in this reduced dataset:

Field | Type | Null Count | Mean | Median
Rank | Factor | 0 | - | -
Title | Factor | 0 | - | -
ID | Factor | 0 | - | -
Year | Int | 0 | 2003 | 2006
Certificate | Factor | 17,587 | - | -
Rating | Int | 0 | 6.1 | 6.3
Votes | Int | 0 | 21,040 | 2,017
Metascore | Int | 22,350 | 55.3 | 56
Synopsis | Factor | 0 | - | -
Runtime | Int | 132 | 104.9 | 100
Genre | Factor | 0 | - | -
Gross | Factor | 21,415 | - | -
SearchYear | Int | 0 | 2003 | 2006
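
For anyone following along, the loading and filtering step looks roughly like this in R. This is a minimal sketch; the file name imdb_movies.csv is a placeholder for however you export the scraped data:

```r
# Load the scraped data; stringsAsFactors = TRUE mirrors the Factor
# columns in the summary table above.
movies <- read.csv("imdb_movies.csv", stringsAsFactors = TRUE)

# Keep only movies with at least 500 votes.
movies <- movies[movies$Votes >= 500, ]
nrow(movies)  # 33,296
```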

Note: In R, Factor is the type used for categorical string data. Rank and Gross appear as factors in the original IMDb dataset because their values contain thousands separators, which prevents them from being parsed as numbers.
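
The usual fix is to strip the separators before converting, along these lines (assuming the separators are plain commas; Gross may also carry a currency symbol depending on how it was scraped):

```r
# Convert a factor with thousands separators ("1,234") to numeric.
movies$Rank <- as.numeric(gsub(",", "", as.character(movies$Rank)))
```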

Before I could begin refining the movie scores, I needed to further examine the dataset. The fields Certificate, Metascore, and Gross had a significant number of missing values (over 50%), making them unsuitable for analysis. Rank is directly dependent on Rating (the variable I aimed to refine), and therefore provided no additional insights. Similarly, ID served as a unique identifier for each film and was not relevant for this analysis.

The fields Title and Synopsis contained short textual descriptions. While it would be possible to analyze them using natural language processing (NLP) techniques, the limited amount of text led me to exclude them from this particular analysis.

After this initial filtering step, I was left with the following fields: Genre, Rating, Year, Votes, SearchYear, and Runtime. The Genre field often included multiple genres per movie, separated by commas. To account for the combined effect of multiple genres, I transformed the Genre field using one-hot encoding. This transformation generated 22 new boolean fields, one for each genre, with a value of 1 indicating the presence of that genre in a movie and 0 otherwise.
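
One way to do the encoding in base R, as a sketch (this assumes the genre spellings are consistent across the Genre field):

```r
# Split the comma-separated Genre field and build one 0/1 column per genre.
genre_lists <- strsplit(as.character(movies$Genre), ",")
genre_lists <- lapply(genre_lists, trimws)
all_genres  <- sort(unique(unlist(genre_lists)))  # 22 genres in this dataset

for (g in all_genres) {
  movies[[g]] <- as.integer(vapply(genre_lists, function(x) g %in% x, logical(1)))
}
```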

Analyzing the IMDb Data

To understand the relationships between the variables, I calculated the correlation matrix.
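
I'm sketching the plot here with the corrplot package, assuming the dropped fields (Rank, ID, Certificate, Metascore, Gross, Title, Synopsis) have already been removed; any correlation heatmap would do:

```r
library(corrplot)

# Correlate the remaining numeric fields and the one-hot genre columns.
num_cols <- vapply(movies, is.numeric, logical(1))
M <- cor(movies[, num_cols], use = "pairwise.complete.obs")
corrplot(M, method = "circle", tl.cex = 0.6)
```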

A correlation matrix among all the remaining original columns and the new genre columns. Numbers close to zero result in blank spaces in the grid. Negative correlations result in red dots and positive correlations in blue dots. The dots are larger and darker the stronger the correlation is. (Visual highlights are described in the main article text.)

In this correlation matrix, a value close to 1 represents a strong positive correlation, while a value close to -1 indicates a strong negative correlation. The matrix revealed several interesting observations:

  • Year and SearchYear exhibited perfect correlation, suggesting they essentially carry the same information. Therefore, I decided to retain only the Year field.
  • Some fields displayed expected positive correlations, such as:
    • Music with Musical
    • Action with Adventure
    • Animation with Adventure
  • Similarly, some fields showed expected negative correlations:
    • Drama vs. Horror
    • Comedy vs. Horror
    • Horror vs. Romance
  • Examining the correlations with the target variable (Rating), I observed:
    • Significant positive correlations with Runtime and Drama.
    • Weaker correlations with Votes, Biography, and History.
    • A substantial negative correlation with Horror and weaker negative correlations with Thriller, Action, Sci-Fi, and Year.
    • No other noteworthy correlations.

These observations indicated that lengthy dramas tended to receive higher ratings, while shorter horror movies received lower ratings. Although I lacked the data to confirm this, I suspected that these correlations did not necessarily reflect the types of movies that typically generate the highest profits, such as those produced by Marvel or Pixar.

It’s possible that the users who vote on IMDb are not representative of the general movie-going audience. People who take the time to rate movies on the website are likely more passionate about film and may have more refined cinematic preferences. My goal, however, was to minimize the influence of these common movie characteristics and develop a more balanced rating system.

Genre Distribution in the IMDb Rating System

Next, I wanted to investigate the distribution of each genre across different rating levels. To do this, I created a new field called Principal_Genre from the first genre listed in the original Genre field, and visualized the relationship using a violin plot.
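
A sketch of both steps with ggplot2, reusing the genre_lists from the encoding step above:

```r
library(ggplot2)

# The first listed genre becomes the principal one.
movies$Principal_Genre <- vapply(genre_lists, function(x) x[1], character(1))

ggplot(movies, aes(x = Principal_Genre, y = Rating)) +
  geom_violin() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```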

A violin plot showing the rating distribution for each genre.

This visualization reinforced the previous finding that Drama tends to be associated with higher ratings and Horror with lower ones. Interestingly, it also highlighted two other genres with generally positive ratings: Biography and Animation. These genres may not have shown strong correlations in the matrix simply because relatively few movies belong to them. To confirm this, I generated a frequency bar plot of the different genres.
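
With Principal_Genre already in place, this is a one-liner:

```r
# Count of movies per principal genre.
ggplot(movies, aes(x = Principal_Genre)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```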

A bar graph showing how many movies of each genre were in the database. Comedy, Drama, and Action had frequencies around 6,000 or above; Crime and Horror were above 2,000; the rest were under 1,000.

The bar plot confirmed that Biography and Animation had fewer movies in the dataset, as did Sport and Adult. This smaller representation likely explains their weaker correlations with the Rating variable.

Other Variables in the IMDb Rating System

Moving on to the continuous variables, I began by examining the scatter plot of Rating against Year.

A scatter plot of rating and years.

As already seen in the correlation matrix, Year is negatively correlated with Rating. The scatter plot adds a further detail: the spread of ratings widens over time, with more recent movies receiving a broader range of ratings, including far more low ones.

Next, I generated a scatter plot of Rating against Votes.

A scatter plot of ratings and votes.

This plot revealed a clearer positive correlation: movies with a higher number of votes generally received higher ratings. However, the majority of movies in the dataset had a relatively low number of votes, and for those movies, there was greater variability in Rating.

Finally, I examined the relationship between Rating and Runtime.

A scatter plot between rating and runtime.

This scatter plot followed a similar pattern to the previous one but with a stronger correlation: longer movies tended to receive higher ratings. It is important to note, however, that there were very few instances of movies with extremely long runtimes.

Refining the IMDb Rating System

Armed with a better understanding of the data, I decided to experiment with different models to predict movie ratings based on the selected fields. My hypothesis was that the difference between the predictions of my best model and the actual Rating would highlight the influence of factors not captured in the dataset, such as acting, screenplay, and cinematography, and potentially reveal what makes certain movies stand out.

I began with the simplest model, linear regression. To evaluate each model, I used two common metrics: root-mean-square error (RMSE) and mean absolute error (MAE). Both are widely used for regression tasks and are easy to interpret because they are on the same scale as the target variable.

The linear regression model yielded an RMSE of 1.03 and an MAE of 0.78. However, linear models make certain assumptions about the errors: independence, a mean of zero, and constant variance. If these assumptions hold true, a plot of “residuals vs. predicted values” should resemble a random cloud with no discernible pattern. To verify this, I created the residual plot.
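
As a sketch, the fit and both diagnostics look like this; model_data is built here from the fields kept above, with na.omit dropping the handful of rows missing Runtime:

```r
# Complete cases of the retained fields (Runtime has a few NAs).
model_data <- na.omit(movies[, c("Rating", "Year", "Votes", "Runtime", all_genres)])

# Fit the baseline linear model and score it in-sample.
fit  <- lm(Rating ~ ., data = model_data)
pred <- predict(fit)

sqrt(mean((model_data$Rating - pred)^2))  # RMSE, ~1.03
mean(abs(model_data$Rating - pred))       # MAE,  ~0.78

# Residuals vs. predicted values.
plot(pred, model_data$Rating - pred,
     xlab = "Predicted rating", ylab = "Residual")
abline(h = 0, lty = 2)
```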

Residual vs. predicted values scatterplot.

The plot revealed that while the residuals appeared randomly scattered for predicted values up to 7, a clear downward linear trend emerged for higher predicted values. This indicated that the assumptions of the linear model were violated. Additionally, the model produced predicted values exceeding the maximum possible rating of 10, which a linear model can do because its output is unbounded.

Based on the earlier analysis, I suspected that the Votes field might be contributing to this issue. While a higher number of votes generally correlated with higher ratings, this effect was primarily observed in a few cases with an exceptionally high number of votes. To test this, I retrained the linear model without the Votes field.

Residual vs. predicted values scatterplot when the Votes field is removed.

The residual plot improved significantly without the Votes field: the residuals were more randomly distributed, and no prediction exceeded the 10-point ceiling. Furthermore, since Votes reflects reviewer activity rather than an inherent characteristic of a movie, I decided to exclude this field from further analysis. Removing Votes slightly increased the errors (RMSE: 1.06, MAE: 0.81), but I prioritized a model that met its assumptions and used better-chosen features over a marginal improvement on the training data.

Evaluating the Performance of Other Models

Next, I explored various other models to determine which performed best. For each model, I employed random search to find good hyperparameter values and 5-fold cross-validation so the error estimates would not depend on a single train/test split. The table below summarizes the estimated errors for each model:

Model | RMSE | MAE
Neural Network | 1.044596 | 0.795699
Boosting | 1.046639 | 0.797192
Inference Tree | 1.057040 | 0.805478
GAM | 1.061511 | 0.811956
Linear Model | 1.066539 | 0.815252
Penalized Linear Reg | 1.066607 | 0.815333
KNN | 1.066714 | 0.812337
Bayesian Ridge | 1.068995 | 0.814869
SVM | 1.073491 | 0.809273
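
Every row in the table above came out of the same tuning loop. Here is a sketch of the pattern with the caret package, using the boosting model as an example (the gbm method, seed, and tuneLength are illustrative choices, not necessarily my exact setup):

```r
library(caret)

# Votes was excluded earlier in the analysis.
model_data$Votes <- NULL

# 5-fold cross-validation with random hyperparameter search;
# the same control object is reused for every model family.
ctrl <- trainControl(method = "cv", number = 5, search = "random")

set.seed(42)
boost_fit <- train(Rating ~ ., data = model_data,
                   method = "gbm", metric = "RMSE",
                   trControl = ctrl, tuneLength = 10, verbose = FALSE)
boost_fit$results  # cross-validated RMSE and MAE per candidate
```

Swapping the method argument (e.g. "nnet", "knn", "svmRadial") is essentially all it takes to reproduce the other rows.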

All models demonstrated relatively similar performance. This finding prompted me to delve deeper into the data using some of these models. I was particularly interested in understanding the influence of each field on movie ratings. One way to assess feature importance is by examining the coefficients of the linear model. However, to avoid any distortions caused by differing scales, I first standardized the data and retrained the linear model. The resulting weights are visualized in the following graph.
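
A sketch of the standardization step (by this point every column in model_data is numeric, so scaling everything is safe):

```r
# Standardize all columns, refit, and read the weights off the coefficients.
std_data   <- model_data
std_data[] <- scale(model_data)

std_fit <- lm(Rating ~ ., data = std_data)
weights <- sort(coef(std_fit)[-1])  # intercept dropped
barplot(weights, horiz = TRUE, las = 1, cex.names = 0.6)
```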

A bar graph of linear model weights ranging from nearly -0.25 for Horror to nearly 0.25 for Drama.

The graph clearly shows that Horror and Drama were the two most influential genres, with Horror having a negative impact on ratings and Drama having a positive impact. Other genres, such as Animation and Biography, also showed positive contributions, while Action, Sci-Fi, and Year had negative contributions. Notably, Principal_Genre did not have a substantial impact, suggesting that the specific genres a movie belongs to matter more than which genre is considered primary.

The generalized additive model (GAM) allowed for a more nuanced understanding of the impact of the continuous variable, Year.
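
A minimal version with the mgcv package, showing only a few of the genre terms for readability:

```r
library(mgcv)

# Smooth term on Year; the genre indicators enter linearly.
gam_fit <- gam(Rating ~ s(Year) + Runtime + Drama + Horror + Comedy,
               data = model_data)
plot(gam_fit, select = 1)  # the fitted s(Year) curve
```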

A graph of Year vs. s(Year) using the generalized additive model. The s(Year) value follows a curve starting up near 0.6 for 1970, bottoming out below 0 at 2010, and increasing to near 0 again by 2019.

The GAM plot revealed a more intriguing relationship. While recent movies generally received lower ratings, the effect was not constant over time. Ratings reached their lowest point around 2010 but appeared to recover slightly in subsequent years. Further investigation would be needed to understand what factors might have contributed to this trend.

The neural network achieved the lowest RMSE and MAE, making it the best performer among the models tested, although none of the models came close to perfect prediction accuracy. That was neither unexpected nor undesirable given my objective. The available data allowed for a reasonable estimation of movie ratings, but other factors not captured in the dataset clearly play a significant role in movie preferences. These factors likely include aspects like acting, screenplay quality, cinematography, and other subjective elements.

From my perspective, these uncaptured characteristics are precisely what make a movie truly enjoyable and worth watching. It’s not simply about whether a movie falls into a specific genre like drama, action, or science fiction. I seek out movies that offer something unique, thought-provoking, entertaining, or emotionally resonant.

With this in mind, I created a refined rating system by subtracting the predicted rating from the best model (neural network) from the actual IMDb rating. This adjustment aimed to isolate the impact of the uncaptured factors by removing the influence of Genre, Runtime, and Year.
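
The adjustment itself is short; here nn_fit stands for the tuned neural network from the cross-validation loop above (e.g. caret method "nnet"), and the row names carried through na.omit let me map predictions back to the full table:

```r
# Refined rating: the part of the score the model cannot explain
# from genre, runtime, and year.
movies$Refined <- NA
movies[rownames(model_data), "Refined"] <-
  model_data$Rating - predict(nn_fit, newdata = model_data)

# Top 10 under the refined system.
head(movies[order(-movies$Refined), c("Title", "Genre", "Rating", "Refined")], 10)
```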

IMDb Rating System Alternative: The Final Results

Let’s compare the top 10 movies based on my refined rating system versus the original IMDb ratings:

IMDb

Title | Genre | IMDb Rating | Refined Rating
Ko to tamo peva | Adventure,Comedy,Drama | 8.9 | 1.90
Dipu Number 2 | Adventure,Family | 8.9 | 3.14
El señor de los anillos: El retorno del rey | Adventure,Drama,Fantasy | 8.9 | 2.67
El señor de los anillos: La comunidad del anillo | Adventure,Drama,Fantasy | 8.8 | 2.55
Anbe Sivam | Adventure,Comedy,Drama | 8.8 | 2.38
Hababam Sinifi Tatilde | Adventure,Comedy,Drama | 8.7 | 1.66
El señor de los anillos: Las dos torres | Adventure,Drama,Fantasy | 8.7 | 2.46
Mudras Calling | Adventure,Drama,Romance | 8.7 | 2.34
Interestelar | Adventure,Drama,Sci-Fi | 8.6 | 2.83
Volver al futuro | Adventure,Comedy,Sci-Fi | 8.5 | 2.32

Mine

Title | Genre | IMDb Rating | Refined Rating
Dipu Number 2 | Adventure,Family | 8.9 | 3.14
Interestelar | Adventure,Drama,Sci-Fi | 8.6 | 2.83
El señor de los anillos: El retorno del rey | Adventure,Drama,Fantasy | 8.9 | 2.67
El señor de los anillos: La comunidad del anillo | Adventure,Drama,Fantasy | 8.8 | 2.55
Kolah ghermezi va pesar khale | Adventure,Comedy,Family | 8.1 | 2.49
El señor de los anillos: Las dos torres | Adventure,Drama,Fantasy | 8.7 | 2.46
Anbe Sivam | Adventure,Comedy,Drama | 8.8 | 2.38
Los caballeros de la mesa cuadrada | Adventure,Comedy,Fantasy | 8.2 | 2.35
Mudras Calling | Adventure,Drama,Romance | 8.7 | 2.34
Volver al futuro | Adventure,Comedy,Sci-Fi | 8.5 | 2.32

As expected, there aren’t drastic changes in the top rankings. This is because the model errors were relatively small, and the top-rated movies likely possess qualities that are captured by both the model and general viewer sentiment.

Now, let’s examine the bottom 10 movies according to both ranking systems:

IMDb

Title | Genre | IMDb Rating | Refined Rating
Holnap történt - A nagy bulvárfilm | Comedy,Mystery | 1.0 | -4.86
Cumali Ceber: Allah Seni Alsin | Comedy | 1.0 | -4.57
Badang | Comedy,Fantasy | 1.0 | -4.74
Yyyreek!!! Kosmiczna nominacja | Comedy | 1.1 | -4.52
Proud American | Drama | 1.1 | -5.49
Browncoats: Independence War | Action,Sci-Fi,War | 1.1 | -3.71
The Weekend It Lives | Comedy,Horror,Mystery | 1.2 | -4.53
Bolívar: el héroe | Animation,Biography | 1.2 | -5.34
Rise of the Black Bat | Action,Sci-Fi | 1.2 | -3.65
Hatsukoi | Drama | 1.2 | -5.38

Mine

Title | Genre | IMDb Rating | Refined Rating
Proud American | Drama | 1.1 | -5.49
Santa and the Ice Cream Bunny | Family,Fantasy | 1.3 | -5.42
Hatsukoi | Drama | 1.2 | -5.38
Reis | Biography,Drama | 1.5 | -5.35
Bolívar: el héroe | Animation,Biography | 1.2 | -5.34
Hanum & Rangga: Faith & The City | Drama,Romance | 1.2 | -5.28
After Last Season | Animation,Drama,Sci-Fi | 1.7 | -5.27
Barschel - Mord in Genf | Drama | 1.6 | -5.23
Rasshu raifu | Drama | 1.5 | -5.08
Kamifûsen | Drama | 1.5 | -5.08

Similarly, the bottom rankings remain relatively consistent, although my refined list includes a higher proportion of dramas compared to the original IMDb rankings. This suggests that some dramas might be overrated simply for belonging to that genre.

Perhaps the most interesting comparison is the list of 10 movies with the greatest discrepancies between the original IMDb rating and my refined rating. These movies stand out because their uncaptured qualities contribute significantly to their overall appeal (or lack thereof), surpassing what would be expected based on their genre, runtime, and year of release.

Title | IMDb Rating | Refined Rating | Difference
Kanashimi no beradonna | 7.4 | -0.71 | 8.11
Jesucristo Superstar | 7.4 | -0.69 | 8.09
Pink Floyd The Wall | 8.1 | 0.03 | 8.06
Tenshi no tamago | 7.6 | -0.42 | 8.02
Jibon Theke Neya | 9.4 | 1.52 | 7.87
El baile | 7.8 | 0.00 | 7.80
Santa and the Three Bears | 7.1 | -0.70 | 7.80
La alegre historia de Scrooge | 7.5 | -0.24 | 7.74
Piel de asno | 7.0 | -0.74 | 7.74
1776 | 7.6 | -0.11 | 7.71

If I were a movie director aiming to produce a film with a high IMDb rating based on this analysis, I might be inclined to create a lengthy animated biographical drama that’s a remake of a classic film, such as “Amadeus.” While such a movie might achieve a favorable IMDb ranking, its commercial success is less certain.

What are your thoughts on the movies highlighted by my refined rating system? Do you agree with their rankings? Share your opinions in the comments below!

Licensed under CC BY-NC-SA 4.0