This is Part 2 of our “Analyzing the Movielens data” series.
In Part 1, we did the following:
- reviewed the organization of the data
- outlined a set of questions we’d like to ask
- created Juxt workflows to integrate the data from three different data sources
Now, let’s look at a couple of the questions.
- What is the average rating for each movie broken down by gender?
- What are the top 10 movies that men rate higher than women?
Average Rating by Gender
We start by fetching from the user DB, where we have already integrated the Users, Movies and Ratings data. This is done with the Fetch from User DB module using the key “movielens-dataset”.
Average rating by gender can be computed using the built-in Pivot Table library module. Since we want the average rating, we set the Value property to “rating” and the Aggregation property to “mean”. We Group By “title” and split the values of the “gender” column into new columns, which gives one row per movie and one mean-rating column per gender (rating-mean_gender_M and rating-mean_gender_F).
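For readers who want to see what this pivot does outside of Juxt, here is a minimal sketch in plain Python. It is not how the Pivot Table module is implemented; the sample rows and the helper name `pivot_mean` are hypothetical, and only the output column names mirror the ones used in this post.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows standing in for the integrated dataset.
ratings = [
    {"title": "Movie A", "gender": "M", "rating": 5},
    {"title": "Movie A", "gender": "M", "rating": 4},
    {"title": "Movie A", "gender": "F", "rating": 3},
    {"title": "Movie B", "gender": "M", "rating": 2},
    {"title": "Movie B", "gender": "F", "rating": 4},
    {"title": "Movie B", "gender": "F", "rating": 5},
]


def pivot_mean(rows, group_by, column, value):
    """Group rows by `group_by`, split the values of `column` into new
    columns, and aggregate `value` with the mean (hypothetical helper)."""
    # cells[group][column value] collects every rating that falls in that cell.
    cells = defaultdict(lambda: defaultdict(list))
    for row in rows:
        cells[row[group_by]][row[column]].append(row[value])
    return {
        group: {
            f"{value}-mean_{column}_{col_value}": mean(values)
            for col_value, values in sorted(by_column.items())
        }
        for group, by_column in sorted(cells.items())
    }


pivot = pivot_mean(ratings, group_by="title", column="gender", value="rating")
for title, means in pivot.items():
    print(title, means)
```

Each movie ends up with one entry per gender, which is the shape the next steps rely on.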
Finally, we render the results as an HTML Data Table, which is as follows:
Top 10 Movies that Men rate higher than Women
Now that we have the average ratings by gender, we can do the following:
- Calculate the difference in ratings from men and women for each movie
- Sort the movies in descending order by the difference in ratings
- Take the top 10
The Calculate New Column module adds a new column to the dataset based on an expression we specify. The expression can be any mathematical expression that references existing columns. In this case, we simply subtract the women’s mean rating from the men’s mean rating to create a new column, “difference”:
rating-mean_gender_M - rating-mean_gender_F
The Top N module, with a Feature of “difference” and a Count of 10, gives us the 10 movies with the largest difference, that is, the movies that men rate higher than women by the widest margin.
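The same two steps can be sketched in plain Python. This is only an illustration under assumptions: the rows below are made-up averages, and `add_difference` and `top_n` are hypothetical helpers, not the Juxt modules themselves.

```python
# Hypothetical per-movie averages, shaped like the pivot output above.
averages = [
    {"title": "Movie A", "rating-mean_gender_M": 4.5, "rating-mean_gender_F": 3.0},
    {"title": "Movie B", "rating-mean_gender_M": 2.0, "rating-mean_gender_F": 4.5},
    {"title": "Movie C", "rating-mean_gender_M": 4.0, "rating-mean_gender_F": 3.75},
    {"title": "Movie D", "rating-mean_gender_M": 3.5, "rating-mean_gender_F": 3.5},
]


def add_difference(rows):
    """Return copies of the rows with a new "difference" column:
    men's mean rating minus women's mean rating."""
    return [
        dict(row, difference=row["rating-mean_gender_M"] - row["rating-mean_gender_F"])
        for row in rows
    ]


def top_n(rows, feature, count):
    """Sort rows by `feature` in descending order and keep the first `count`."""
    return sorted(rows, key=lambda row: row[feature], reverse=True)[:count]


with_difference = add_difference(averages)
top = top_n(with_difference, feature="difference", count=2)
for row in top:
    print(row["title"], row["difference"])
```

A positive difference means men rated the movie higher on average; sorting in ascending order instead would surface the movies that women rate higher than men.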
And the results are in:
Please check out our screencast of building these workflows in Juxt.io: