Recommendation Systems with Matrix Factorization

September 30, 2017

Recommendation systems are an important application in Machine Learning used by an increasing number of websites given its immediate impact on people’s choices. For example, Amazon recommends its customers products they should buy, Netflix recommends its subscribers movies to watch.

The Netflix Prize Competition was one of the first and largest machine learning contests, with a grand prize of $1 million. The problem was that of predicting what rating a user would give to a movie that they have not watched yet. Recommendation systems via matrix factorization was one of the premier algorithms that came out of the extensive research that followed.

Problem Setting

There are n users and m movies. Some users have rated some movies, and we have a number of triplets (i, j, y), where y is the rating given to movie j by user i.

For the matrix factorization approach to this problem, we will put the above information in a rating matrix Y. Y(i, j) is the rating user i gave to movie j. However, Y is incomplete since every user has not rated every movie. Often, Y is extremely sparse, with 1 in 10,000 entries filled in.

The objective is to estimate the missing values in the matrix by finding a model f(i, j) = y based on the available ratings, and then to recommend to each user new movies which are expected to be most liked by the user, i.e. movies that the user is predicted to give high ratings to.

Matrix Factorization

In matrix factorization, we model the situation as follows,

$$ Y = U \times X^T $$

where U is a matrix of dimension n x k, and X is a matrix of dimension m x k. The i_th row of matrix U is a k-dimensional vector attempting to model the preferences of user i, and the j_th row of matrix X is a k-dimensional vector attempting to model the properties of movie j. We seek therefore concise vector representations of users and movies.

Example for e-commerce websites such as Amazon: rating matrix R factorized as a product of user matrix U and product matrix P, each latent vector has dimension k

k is a hyper-parameter and controls how many parameters the model has. It is important to understand that, although attributes U and X should capture the most important aspects of users and movies, they are usually hard to visualize or interpret.

Vector values will be optimized by minimizing the discrepancy between predicted and actual ratings in the training set. The optimization problem is:

$$ L = \min_{U, X}\ \sum_{i,j}(y_{i,j} - U_i X_j^T)^2 + \lambda (\sum_i ||U_i||^2 + \sum_j ||X_j||^2) $$

The first term measures the squared error between the actual and predicted ratings, and the second term is a regularization term.

Once we use an optimization algorithm to find the values of U and X, the prediction Y(i, j) is simply U_iX_j^T.

Optimization algorithm

Approach 1: Alternating Least Squares

Model parameters can be optimized by the algorithm of Alternating Least Squares (ALS). It first fixes U and optimizes X, then fixes X and optimizes U, and repeats until convergence.

The following equations are the analytical solutions to the least squares problem:

$$ \begin{aligned} U_i &= (\sum_{j} X_j X_j^T + \lambda I_k)^{-1} \sum_{j} y_{i,j} X_j \quad (1) \\ X_j &= (\sum_{i} U_i U_i^T + \lambda I_k)^{-1} \sum_{i} y_{i,j} U_i \quad (2) \end{aligned} $$

Approach 2: Stochastic Gradient Descent

SGD iterates through all training samples y in Y, with the idea to converge more rapidly than gradient descent (which uses all samples in each iteration). Given a step-size alpha, it updates both users and movies matrices until convergence as follows.

For each observation y:

$$ \begin{aligned} U_i \leftarrow U_i - \alpha \dfrac{\partial L(i,j)}{\partial U_i} \\ X_j \leftarrow X_j - \alpha \dfrac{\partial L(i,j)}{\partial X_j} \end{aligned} $$

Comparison to other methods

Content-based recommendation systems rely solely on information about the user such as gender, age, location and information about movies such as genre, cast, etc. It doesn’t benefit from users’ history ratings.

Collaborative filtering by nearest neighbors needs a fixed definition of similarity between users based on their ratings and does not take into account movies features.

In collaborative filtering by matrix factorization method, the algorithm makes use of ratings submitted by users, and automatically learns latent representations for users as well as movies. It is possible to ask questions such as which other users are most similar to this user by defining a distance metric d(U_a, U_b) between user vectors, and similarly for movies. A commonly used distance metric is cosine distance.

Comparison to Singular Value Decomposition (SVD)

Singular Value Decomposition (SVD) is a method for finding the low-rank approximation of a matrix, which is what matrix factorization is also doing. However, SVD doesn’t work on incomplete matrices. Decomposing a matrix with missing values is equivalent to setting the missing values to 0, which will lead to undesirable results.