Building a Book Recommendation System using Matrix Factorization and Singular Value Decomposition


Every Monday, I hit the “Discover Weekly” list to see what Spotify came up with in personalized offerings, specifically tailored to the music I fed it the week before. Sometimes I’d come across a hidden gem (after a few skips), and sometimes I’d be totally disappointed. But the truth is, I am usually partly at fault, since I listen to completely different music styles depending on my mood. The recommendation engine does not differentiate between moods, so I end up doing some manual work to reach the desired state of mind at any given time.

Yet, moods aside (a product feature I’d highly recommend the Spotify team consider and test), I always wondered how Spotify managed to figure out these titles even though there are no ratings in its system apart from the “save as favorite” button, which sends a classification signal rather than a magnitude one… until I recently realized that they use a combination of different architectures for their recommendation engine:

1. Memory-based collaborative filtering recommender: focuses on the relationships between users and items (ideal when the data contains ratings for the various items offered). Matrix factorization is a powerful mathematical tool here for discovering the latent interactions between users and items. Say, for example, that persons A and B both listen to song X, and person B also listens often to song Y; then A is very likely to like song Y as well.

2. Content-based recommender: focuses on the features of the items themselves. Instead of analyzing how users or customers interact with the items, the analysis happens mostly at the level of the items, examining and measuring the similarity of their characteristics. To stay in the music context, say you listen very often to songs X and Y, and both happen to be by an Italian musician who uses distinct piano tones and belong to the genre and era specified in the song tags.
This recommender uses various machine learning techniques (e.g. natural language processing, audio modeling) to determine a song Z with similar properties.


I wanted to get my hands dirty with the first type of recommender, as it seemed simpler than the second. Most importantly, I wanted to understand and provide a simple intuition for the math behind the algorithm, and perhaps also lay a foundation for how recommendation systems work in practice before moving on to more complicated models.

The 10k Books Dataset 
In this tutorial I picked the Goodbooks-10k dataset I found on Kaggle to get started. I had always feared being disappointed by a book right after finishing a fascinating one, so I thought this would solve a personal struggle, and it could in general just be a fun thing to run by friends who ask me for advice on what to read next. 
The zip file contains multiple datasets (book_tags, books, ratings, tags). We’ll use only the books and ratings datasets which contain columns that are relevant to our analysis.

First things first, let’s import the necessary libraries.

import pandas as pd
import numpy as np
import sklearn
from sklearn.decomposition import TruncatedSVD
import warnings

Let’s load the datasets. The “books” dataset contains 23 columns. We’ll slice the data and drop variables to keep only the columns of interest. We’ll keep the ratings dataset as is. 
Next, we’ll merge both datasets on “book_id”. book_id is more reliable than original_title, since some titles can vary in their formatting. Before we advance to creating a matrix, we’ll drop duplicates in pairwise combinations of user_id and book_id, as well as user_id and original_title.

books = pd.read_csv('books.csv', sep=',')
books = books.iloc[:, :16]
books = books.drop(columns=['title', 'best_book_id', 'work_id', 'books_count', 'isbn', 'isbn13', 'original_publication_year','language_code','work_ratings_count','work_text_reviews_count'])
books.head(5)
ratings = pd.read_csv('ratings.csv', sep=',')
ratings.head(5)
df = pd.merge(ratings, books, on="book_id")
df.head(5)
df1 = df.drop_duplicates(['user_id','original_title'])
df1 = df1.drop_duplicates(['user_id','book_id'])  # drop from df1, not df, so the first deduplication is kept
df1.head(10) #went down from 79701 to 79531
df1.shape #(79531, 8)

The Matrix Factorization Method & SVD — Intuition

We’re now about to create a matrix model using the matrix factorization method with Singular Value Decomposition (SVD). You can find a number of excellent technical resources online that describe the models in more depth, but I’m going to break it down for you in simple terms here.

What we’ll do next is call the pivot_table function to create a pivot table where users sit on the rows, books on the columns, and the respective ratings fill the cells, giving a matrix of shape (m × n).

######################################
####MATRIX FACTORIZATION
######################################
books_matrix = df1.pivot_table(index = 'user_id', columns = 'original_title', values = 'rating').fillna(0)
books_matrix.shape #(28554, 794)
books_matrix.head()

If you look at the graphical representation below, you’ll have an idea of what happens behind the scenes. The ratings matrix of shape (m × n) = (users × books) gets approximated as the product of two smaller matrices: an A matrix of shape (m × d) = (users × latent features) and a B matrix of shape (d × n) = (latent features × books).

Matrix graphical representation by Albertauyeung

The result is that the ratings matrix is approximated by the product (the matrix factorization) of the two matrices, which mathematically computes each predicted rating as $\hat{r}_{ui} = \sum_{k=1}^{d} A_{uk} B_{ki}$, i.e. the predicted rating of user u for book i is the dot product of user u’s row of A and book i’s column of B.
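To make the mechanics concrete, here is a minimal NumPy sketch. The numbers are made up for illustration, and the factors are random, so the predicted values are not meaningful until the factors are fitted (for example with the SGD step sketched further below):

import numpy as np

# Toy ratings matrix: 4 users x 5 books, 0 = no rating (made-up numbers)
R = np.array([[5, 3, 0, 1, 0],
              [4, 0, 0, 1, 0],
              [1, 1, 0, 5, 4],
              [0, 1, 5, 4, 0]], dtype=float)

d = 2  # number of latent features
rng = np.random.default_rng(0)
A = rng.random((R.shape[0], d))  # users x latent features
B = rng.random((d, R.shape[1]))  # latent features x books

# The factorization approximates R by the product A @ B;
# each predicted rating is the dot product of a user row of A and a book column of B.
R_hat = A @ B
print(R_hat.shape)  # (4, 5), same shape as R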

We’ll need to create a set of training data. What our training data consists of is essentially a smaller matrix that factors the ratings we want to predict. For this we’ll set another matrix X, which is the transpose of the matrix created above (“books_matrix”), so that books become the rows and users the columns.

X = books_matrix.values.T  # transpose: books become rows, users become columns
X.shape  # (794, 28554)
#Fitting the Model
SVD = TruncatedSVD(n_components=12, random_state=0)
matrix = SVD.fit_transform(X)
matrix.shape  # (794, 12): one 12-dimensional latent vector per book

You’ll notice here that the transposed matrix X is extremely sparse, and its number of columns (one per user) runs into five digits, which means a five-digit-dimensional space. That’s where the SVD method comes in.
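As a quick sanity check, a minimal sketch building on the books_matrix defined above shows how sparse the ratings actually are:

# Fraction of user/book cells that actually contain a rating (non-zero entries)
density = (books_matrix.values != 0).mean()
print(f"Matrix density: {density:.4%}")  # a small number here confirms the matrix is mostly zeros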

Just like we would use a PCA or Kernel PCA feature extraction method on other datasets, SVD is another method we can apply to matrices in recommendation applications. SVD can boil our dimensions down to a smaller number that still describes the variance in the data. What happens here is that SVD looks for latent features and extracts them from the data, going from, say, 10,000 features down to only 10, which saves us huge amounts of computational power and also helps avoid overfitting the data. For this exercise, I set the number of components of our SVD to 12. What’s left in order to apply the model is fitting and transforming the training data X, which is what fit_transform did above.
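To get a feel for how much information the 12 components retain, one option (a small sketch using scikit-learn’s TruncatedSVD attributes on the SVD object fitted above) is to inspect the explained variance ratio:

# Share of the original variance captured by each of the 12 latent components
print(SVD.explained_variance_ratio_)
print(SVD.explained_variance_ratio_.sum())  # total variance retained by the 12 components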

Note: In general, when fitting a matrix factorization, a common practice is to add a regularization term (the term on the right below) to avoid overfitting the data:

$\min_{A,B} \sum_{(u,i) \in K} \big( r_{ui} - \sum_{k=1}^{d} A_{uk} B_{ki} \big)^2 + \lambda \big( \lVert A_u \rVert^2 + \lVert B_i \rVert^2 \big)$, where K is the set of (user, book) pairs with a known rating.

The minimization term on the left measures the squared error on the ratings we do know, while the regularization term on the right keeps the latent factors small so that the model generalizes well to the ratings we have no information about (e.g. unknown current or future ratings). In practice, this objective is minimized with methods such as Stochastic Gradient Descent (SGD). 
For the purpose of simplicity in this tutorial, I have disregarded both of these refinements in practice, and just filled the missing ratings with a null value (.fillna(0)).
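For illustration only, here is a minimal sketch of what a single SGD update for this regularized objective could look like. It is not part of the pipeline used in this tutorial; the function name, learning rate, and regularization strength are made up:

import numpy as np

def sgd_step(A, B, u, i, r_ui, lr=0.01, lam=0.1):
    """One SGD update for a single known rating r_ui.
    A: users x d latent factors, B: d x items latent factors (assumed already initialized)."""
    err = r_ui - A[u, :] @ B[:, i]                    # prediction error on this rating
    a_u = A[u, :].copy()                              # keep the old user factors for the B update
    A[u, :] += lr * (err * B[:, i] - lam * A[u, :])   # gradient step with L2 regularization
    B[:, i] += lr * (err * a_u - lam * B[:, i])
    return A, B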

Creating a Correlation Coefficient
Next we create a correlation matrix for all book vectors in the matrix with the NumPy function np.corrcoef. We’ll call this “corr”.
Once we look up a book we really like in “corr”, we get its correlation coefficients with all the remaining books, and we can return the books that we are most likely to like as well.

warnings.filterwarnings("ignore", category=RuntimeWarning)  # warnings was imported above; silence warnings about dubious runtime behavior
corr = np.corrcoef(matrix)  # pairwise correlation between the latent vectors of every pair of books
corr.shape  # (number of books, number of books)

Correlation coefficients range from −1 to 1, with 0 meaning no correlation exists between the two items and 1 meaning a perfect positive correlation. In our example, the closer we get to 1, the more the suggested books share latent features with the book you entered as input, and therefore the more likely you are to like those books as a result.

Checking the Results
Let’s now check the results. I’ll create a list of the column titles and look up the index of a book I like. “Memoirs of a Geisha” is one of my favorite fiction books, so let’s go with this.

title = books_matrix.columns
title_list = list(title)
samia = title_list.index('Memoirs of a Geisha')  # column index of the input book
corr_samia = corr[samia]  # correlations between this book and all the others
list(title[(corr_samia >= 0.9)])  # books whose correlation with the input is at least 0.9
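If you would rather get a fixed number of suggestions than rely on the 0.9 threshold, a small variation (a sketch, not what produced the output below) is to sort the correlations and keep, say, the top ten:

# Indices of the most correlated books, highest first, excluding the input book itself
top_idx = np.argsort(corr_samia)[::-1]
top_titles = [title[i] for i in top_idx if i != samia][:10]
print(top_titles)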

After I run the entire code, here is a list of books the algorithm suggested:

Suggested Book list for Samia

This looks good! I have already read a good number of the books above and can testify that some of them were on my top list at some point (e.g. Persepolis, Into the Wild, The Great Gatsby, One Hundred Years of Solitude, The Heart of the Matter). I’m still curious, though, which latent features the algorithm used to also pick “Think & Grow Rich” here, as I would have classified it in another category (nonfiction, among other considerations). Then again, this unveils a limitation of this algorithm: the latent features are learned purely from the rating patterns we fed it, so they do not necessarily line up with the categories we would use ourselves.

Evaluating Results 
I evaluated this model just by having a look at the list of books the algorithm spit out: since I had already read (and really liked) some of the recommended titles, and had run the model by other people for fun to see if they would mostly agree, I thought I’d call it a day. Of course, this is not the right way to do it if you have to provide sharp performance metrics.

For more rigor, there are many ways to evaluate a recommender system, and the method will differ across types of recommenders (for example, content-based vs collaborative filtering). One way to do so is to apply cross-validation: divide your users into k folds and loop over them, taking (k − 1) folds as the training set, testing on the remaining fold, then averaging the results. A rough sketch of this idea follows below. Perhaps this will be the topic of a later article :)
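Here is a rough sketch of that idea under simplifying assumptions: instead of splitting users into folds, it hides a random fifth of the known ratings, refits the TruncatedSVD on the rest, and measures the reconstruction error (RMSE) on the hidden cells. It reuses books_matrix, numpy, and TruncatedSVD from above; the fold count and component count are arbitrary choices:

from sklearn.model_selection import KFold

ratings_matrix = books_matrix.values  # users x books, zeros where unrated
known = np.argwhere(ratings_matrix > 0)  # positions of the known ratings

kf = KFold(n_splits=5, shuffle=True, random_state=0)
rmses = []
for train_idx, test_idx in kf.split(known):
    test_cells = known[test_idx]
    train_matrix = ratings_matrix.copy()
    train_matrix[test_cells[:, 0], test_cells[:, 1]] = 0  # hide the held-out ratings

    svd = TruncatedSVD(n_components=12, random_state=0)
    reduced = svd.fit_transform(train_matrix)   # users x 12
    reconstructed = reduced @ svd.components_   # back to users x books

    preds = reconstructed[test_cells[:, 0], test_cells[:, 1]]
    actual = ratings_matrix[test_cells[:, 0], test_cells[:, 1]]
    rmses.append(np.sqrt(np.mean((preds - actual) ** 2)))  # error on held-out ratings only

print("Mean RMSE over the folds:", np.mean(rmses))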

Disclaimer & Departing thoughts:) 
Finally, this was my first hands-on data science/machine learning post, so I hope this was a useful tutorial with intuitive explanations of the mathematics behind the models. 
Have fun building your own recommendation engine and don’t forget to subscribe to my channel or ask questions below for clarification! 🙂

Follow me on Linkedin
Connect on Twitter
