Analyzing MovieLens Data, Part 1

This might sound familiar: a perennial question in our home is, what movie should we watch tonight?

The MovieLens dataset holds a large number of movie ratings from actual users. Wouldn’t it be great to use all this data to help us pick the right movie every time?
We’ll use a MovieLens dataset that contains 1,000,209 anonymous ratings of 3,900 movies made by 6,040 MovieLens users (data explained here).

From this data, let’s say we want to answer the following questions:

  1. What is the average rating for each movie, broken down by gender?
  2. Which movies received at least 100 ratings?
  3. Of those, which are the good ones, that is, movies with an average rating of 4.3 or higher?
  4. What are the top 10 movies that, on average, men rate higher than women?
  5. What genres of movies do programmers like?
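To make questions 1 to 4 concrete, here is a minimal sketch in plain Python. It assumes the three sources have already been combined into one record per rating with Title, Gender and Rating fields, and that gender is coded as "M" or "F". The function names, the sample rows and the reading of "good" as an overall average of 4.3 or higher are my own assumptions for illustration, not part of the flows built in this series.

```python
from collections import defaultdict

# Made-up sample rows (not real MovieLens data): one record per rating.
SAMPLE_ROWS = [
    {"Title": "Movie A", "Gender": "M", "Rating": 5},
    {"Title": "Movie A", "Gender": "F", "Rating": 3},
    {"Title": "Movie A", "Gender": "M", "Rating": 4},
    {"Title": "Movie B", "Gender": "M", "Rating": 2},
    {"Title": "Movie B", "Gender": "F", "Rating": 5},
]


def stats_by_title(rows):
    """Question 1: per-title rating count, overall mean and mean by gender."""
    # title -> gender -> [sum of ratings, number of ratings]
    sums = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for row in rows:
        cell = sums[row["Title"]][row["Gender"]]
        cell[0] += row["Rating"]
        cell[1] += 1
    stats = {}
    for title, by_gender in sums.items():
        total = sum(s for s, _ in by_gender.values())
        n = sum(c for _, c in by_gender.values())
        stats[title] = {
            "n": n,
            "mean": total / n,
            "by_gender": {g: s / c for g, (s, c) in by_gender.items()},
        }
    return stats


def good_movies(stats, min_ratings=100, min_mean=4.3):
    """Questions 2 and 3: titles with enough ratings and a high enough mean."""
    return sorted(
        title for title, s in stats.items()
        if s["n"] >= min_ratings and s["mean"] >= min_mean
    )


def men_rate_higher(stats, min_ratings=100, top_n=10):
    """Question 4: titles where the male mean exceeds the female mean, largest gap first."""
    gaps = []
    for title, s in stats.items():
        means = s["by_gender"]
        if s["n"] >= min_ratings and "M" in means and "F" in means:
            gap = means["M"] - means["F"]
            if gap > 0:
                gaps.append((title, gap))
    gaps.sort(key=lambda item: item[1], reverse=True)
    return gaps[:top_n]


stats = stats_by_title(SAMPLE_ROWS)
```

The thresholds default to the values in the questions (100 ratings, 4.3, top 10) but are parameters, so the sketch can be tried on a handful of rows, for example `good_movies(stats, min_ratings=3, min_mean=4.0)`.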

Organization of Data

The data is distributed across three disparate data stores.

Movie data is in a comma-separated values (CSV) file in Amazon S3. It contains MovieID, Title and Genres.

User data is in a CSV file in Dropbox. It contains UserID, Gender, Age, Occupation and Zip-code.

Ratings data is in a relational database table in PostgreSQL. It contains UserID, MovieID, Rating and Timestamp.

Data Integration

Since the data is spread across multiple silos (Amazon S3, Dropbox, PostgreSQL) and multiple formats (CSV files, a relational table), we need to combine the relevant data into a form that is easier to work with.

In the figure below, functional modules are wired together to create the data integration. At a high level:

  • We load the data from the different sources
  • We combine them (join) on a common column to create a virtual data source: UserID links ratings to users, and MovieID links ratings to movies
  • We store the combined data in a user DB (in-memory cache) with the ID “movielens-dataset”, so that subsequent modules can fetch it for further analytics
Data Import Flow
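For readers who prefer code to wiring diagrams, the load, join and store steps can be sketched in plain Python as below. This is only an illustration under stated assumptions: the CSV text is inlined instead of fetched from Amazon S3 and Dropbox, an in-memory SQLite table stands in for the PostgreSQL ratings table, a dict stands in for the user DB, and all sample rows and helper names (`load_csv`, `load_ratings`, `integrate`, `USER_DB`) are made up.

```python
import csv
import io
import sqlite3

# Made-up sample rows standing in for the files in Amazon S3 and Dropbox.
MOVIES_CSV = """MovieID,Title,Genres
1,Movie A,Comedy
2,Movie B,Drama|Thriller
"""
USERS_CSV = """UserID,Gender,Age,Occupation,Zip-code
10,F,25,programmer,00000
11,M,35,artist,11111
"""


def load_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def load_ratings():
    """Read ratings from a relational table (SQLite stands in for PostgreSQL)."""
    cols = ("UserID", "MovieID", "Rating", "Timestamp")
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ratings "
        "(UserID INTEGER, MovieID INTEGER, Rating INTEGER, Timestamp INTEGER)"
    )
    conn.executemany(
        "INSERT INTO ratings VALUES (?, ?, ?, ?)",
        [(10, 1, 5, 1000), (11, 1, 3, 1001), (11, 2, 4, 1002)],
    )
    query = "SELECT UserID, MovieID, Rating, Timestamp FROM ratings ORDER BY Timestamp"
    rows = [dict(zip(cols, record)) for record in conn.execute(query)]
    conn.close()
    return rows


def integrate(movies, users, ratings):
    """Inner-join ratings to users on UserID and to movies on MovieID."""
    users_by_id = {int(u["UserID"]): u for u in users}
    movies_by_id = {int(m["MovieID"]): m for m in movies}
    combined = []
    for rating in ratings:
        user = users_by_id.get(rating["UserID"])
        movie = movies_by_id.get(rating["MovieID"])
        if user is None or movie is None:
            continue  # skip ratings with no matching user or movie
        row = dict(rating)
        row.update({k: v for k, v in user.items() if k != "UserID"})
        row.update({k: v for k, v in movie.items() if k != "MovieID"})
        combined.append(row)
    return combined


# A dict plays the role of the in-memory user DB, keyed by dataset ID.
USER_DB = {}
USER_DB["movielens-dataset"] = integrate(
    load_csv(MOVIES_CSV), load_csv(USERS_CSV), load_ratings()
)
```

Each combined row carries the rating together with the user and movie columns, so a later step can fetch `USER_DB["movielens-dataset"]` and work with a single list of records.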

Please check out the video version of the data integration

In the next post, we will get into building the flows needed to answer the questions we started with.
