![Notes for Week 9 Recommender Systems](/assets/img/machine_learning/recommender_systems.jpg)
Notes for Week 9 Recommender Systems
This post is my summary of the Coursera course Machine Learning, covering the Recommender Systems material from the second half of Week 9. The course is taught by the renowned machine learning expert Andrew Ng and is an excellent introduction to machine learning.
@(Study Notes)[Machine Learning]
Problem Formulation
A dataset of movie ratings:
Movie | Alice (1) | Bob (2) | Carol (3) | Dave (4) |
---|---|---|---|---|
Love at last | 5 | 5 | 0 | 0 |
Romance forever | 5 | ? | ? | 0 |
Cute puppies of love | ? | 4 | 0 | ? |
Nonstop car chases | 0 | 0 | 5 | 4 |
Swords vs. karate | 0 | 0 | 5 | ? |
- $ n_u $ = number of users
- $ n_m $ = number of movies
- $ r(i,j) $ = 1 if user $ j $ has rated movie $ i $
- $ y^{(i,j)} $ = rating given by user $ j $ to movie $ i $ (defined only if $ r(i,j) $ = 1)
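To make the notation concrete, here is a minimal sketch (assuming NumPy, with `np.nan` standing in for the unrated "?" entries) that builds the rating matrix $ Y $ and the indicator $ R $ for the table above:

```python
import numpy as np

# Rating matrix Y: rows are movies, columns are users; NaN marks "?" (no rating).
Y = np.array([
    [5,      5,      0,      0     ],  # Love at last
    [5,      np.nan, np.nan, 0     ],  # Romance forever
    [np.nan, 4,      0,      np.nan],  # Cute puppies of love
    [0,      0,      5,      4     ],  # Nonstop car chases
    [0,      0,      5,      np.nan],  # Swords vs. karate
])

R = ~np.isnan(Y)        # R[i, j] is True iff r(i,j) = 1 (user j rated movie i)
n_m, n_u = Y.shape      # n_m = 5 movies, n_u = 4 users
```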
Content-based recommendations
After adding the features $ x_1 $ (the degree to which a movie is a romance) and $ x_2 $ (the degree to which it is an action movie):
Movie | Alice (1) | Bob (2) | Carol (3) | Dave (4) | $ x_1 $ (romance) | $ x_2 $ (action) |
---|---|---|---|---|---|---|
Love at last | 5 | 5 | 0 | 0 | 0.9 | 0 |
Romance forever | 5 | ? | ? | 0 | 1.0 | 0.01 |
Cute puppies of love | ? | 4 | 0 | ? | 0.99 | 0 |
Nonstop car chases | 0 | 0 | 5 | 4 | 0.1 | 1.0 |
Swords vs. karate | 0 | 0 | 5 | ? | 0 | 0.9 |
Based on the table above, we train a parameter vector $ \theta^{(j)} \in \mathbb{R}^3 $ for each user $ j $ (two features plus the intercept term $ x_0 = 1 $). The predicted rating of user $ j $ for movie $ i $ is then $ (\theta^{(j)})^Tx^{(i)} $; in other words, we train a separate linear regression hypothesis for each user.
- $ \theta^{(j)} $ = parameter vector for user $ j $
- $ x^{(i)} $ = feature vector for movie $ i $
- $ j:r(i,j) = 1 $ = all users $ j $ who have rated movie $ i $
- For user $ j $, movie $ i $, predict rating: $ (\theta^{(j)})^Tx^{(i)} $
Optimization objective: given $ x^{(i)} $, learn $ \theta^{(j)} $ (the parameter vector for user $ j $):
[\min_{\theta^{(j)}}J(\theta^{(j)}) = \min_{\theta^{(j)}}\frac{1}{2}\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2}\sum_{k=1}^n(\theta_k^{(j)})^2]
Given $ x^{(1)},\dots,x^{(n_m)} $, to learn $ \theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)} $:
[\min_{\theta^{(1)},\dots,\theta^{(n_u)}}J(\theta^{(1)},\dots,\theta^{(n_u)}) = \min_{\theta^{(1)},\dots,\theta^{(n_u)}}\frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2]
Exercise: how would the gradient descent update be written?
[\theta_k^{(j)} := \theta_k^{(j)} - \alpha(\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})x_k^{(i)} + \lambda \theta_k^{(j)})]
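As a rough sketch of this update (a hypothetical `fit_user` helper, assuming the NumPy `Y`/`R` arrays above plus a movie-feature matrix `X` whose first column is the intercept feature $ x_0 = 1 $; following the course convention, $ \theta_0 $ is left out of the regularization):

```python
def fit_user(j, X, Y, R, lam=1.0, alpha=0.01, iters=1000):
    """Learn theta^(j) for user j by gradient descent (content-based)."""
    theta = np.zeros(X.shape[1])
    rated = R[:, j]                            # movies i with r(i,j) = 1
    for _ in range(iters):
        err = X[rated] @ theta - Y[rated, j]   # (theta^(j))^T x^(i) - y^(i,j)
        grad = X[rated].T @ err + lam * theta
        grad[0] -= lam * theta[0]              # do not regularize the intercept theta_0
        theta -= alpha * grad
    return theta

# Features [x0, x1 (romance), x2 (action)] taken from the table above.
X = np.array([[1, 0.9, 0], [1, 1.0, 0.01], [1, 0.99, 0], [1, 0.1, 1.0], [1, 0, 0.9]])
theta_alice = fit_user(0, X, Y, R)             # predict with X @ theta_alice
```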
Collaborative Filtering (Not Content-based)
In contrast to the content-based approach (given $ x^{(i)} $, learn $ \theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)} $), collaborative filtering goes the other way: given $ \theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)} $, learn $ x^{(i)} $:
[\min_{x^{(i)}}J(x^{(i)}) = \min_{x^{(i)}}\frac{1}{2}\sum_{j:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2}\sum_{k=1}^n(x^{(i)}_k)^2]
Given $ \theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)} $, to learn $ x^{(1)},\dots,x^{(n_m)} $:
[\min_{x^{(1)},\dots,x^{(n_m)}}J(x^{(1)},\dots,x^{(n_m)}) = \min_{x^{(1)},\dots,x^{(n_m)}}\frac{1}{2}\sum_{i=1}^{n_m}\sum_{j:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n(x^{(i)}_k)^2]
Exercise: how would the gradient descent update be written?
[x_k^{(i)} := x_k^{(i)} - \alpha(\sum_{j:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})\theta_k^{(j)} + \lambda x_k^{(i)})]
The usual procedure is:
- Guess an initial set of values for $ \theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)} $
- Use the guessed $ \theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)} $ to learn the features $ x^{(1)},\dots,x^{(n_m)} $
- Use the learned features $ x^{(1)},\dots,x^{(n_m)} $ to improve the values of $ \theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)} $
- Use the improved $ \theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)} $ to learn the features $ x^{(1)},\dots,x^{(n_m)} $ again
- Repeat steps 3 and 4 until the hypothesis performs well
```flow
st=>start: Start
e=>end
op1=>operation: Guess theta
op2=>operation: Use theta to learn feature x
op3=>operation: Use feature x to learn theta
cond=>condition: Is hypothesis optimized?
st->op1->op2->op3->cond
cond(yes)->e
cond(no)->op2
```
Collaborative Filtering Algorithm
We can merge the learning process shown in the flowchart above into a single objective that trains the values $ \theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)} $ and the features $ x^{(1)},\dots,x^{(n_m)} $ simultaneously:
[J(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)}) = \frac{1}{2}\sum_{(i,j):r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n(x_k^{(i)})^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2]
- Initialize $ x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)} $ to small random values.
- Minimize $ J(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)}) $ using gradient descent (or an advanced optimization algorithm). E.g. for every $ j = 1,\dots,n_u, i=1,\dots,n_m $:
- $ x_k^{(i)} := x_k^{(i)} - \alpha(\sum_{j:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})\theta_k^{(j)} + \lambda x_k^{(i)}) $
- $ \theta_k^{(j)} := \theta_k^{(j)} - \alpha(\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})x_k^{(i)} + \lambda \theta_k^{(j)}) $
- For a user with parameters $ \theta $ and a movie with (learned) features $ x $, predict a star rating of $ \theta^Tx $.
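As a concrete sketch of this algorithm (a hypothetical `cofi_cost_and_grads` helper in vectorized form, assuming the NumPy `Y`/`R` arrays from earlier):

```python
def cofi_cost_and_grads(X, Theta, Y, R, lam):
    """Collaborative filtering cost J and its gradients w.r.t. X and Theta.

    X: (n_m, n) movie features; Theta: (n_u, n) user parameters;
    Y: (n_m, n_u) ratings; R: (n_m, n_u) boolean indicator of rated entries.
    """
    err = (X @ Theta.T - Y) * R           # zero out the unrated entries
    J = 0.5 * np.sum(err ** 2) + (lam / 2) * (np.sum(X ** 2) + np.sum(Theta ** 2))
    X_grad = err @ Theta + lam * X        # gradient w.r.t. each x^(i)
    Theta_grad = err.T @ X + lam * Theta  # gradient w.r.t. each theta^(j)
    return J, X_grad, Theta_grad

# Simultaneous gradient descent on X and Theta.
rng = np.random.default_rng(0)
n = 2                                     # number of latent features
X = rng.normal(scale=0.1, size=(n_m, n))  # small random initialization
Theta = rng.normal(scale=0.1, size=(n_u, n))
Y0 = np.where(R, Y, 0.0)                  # replace the NaN "?" entries with 0
for _ in range(2000):
    J, X_grad, Theta_grad = cofi_cost_and_grads(X, Theta, Y0, R, lam=1.0)
    X -= 0.005 * X_grad
    Theta -= 0.005 * Theta_grad
```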
Collaborative Filtering Algorithm (Vectorized Implementation)
Movie | Alice (1) | Bob (2) | Carol (3) | Dave (4) |
---|---|---|---|---|
Love at last | 5 | 5 | 0 | 0 |
Romance forever | 5 | ? | ? | 0 |
Cute puppies of love | ? | 4 | 0 | ? |
Nonstop car chases | 0 | 0 | 5 | 4 |
Swords vs. karate | 0 | 0 | 5 | ? |
[Y = \begin{bmatrix}
5 & 5 & 0 & 0 \\
5 & ? & ? & 0 \\
? & 4 & 0 & ? \\
0 & 0 & 5 & 4 \\
0 & 0 & 5 & 0
\end{bmatrix} \in \mathbb{R}^{5\times 4}]
Low rank matrix factorization
[\begin{bmatrix}
(\theta^{(1)})^T(x^{(1)}) & (\theta^{(2)})^T(x^{(1)}) & \dots & (\theta^{(n_u)})^T(x^{(1)}) \\
(\theta^{(1)})^T(x^{(2)}) & (\theta^{(2)})^T(x^{(2)}) & \dots & (\theta^{(n_u)})^T(x^{(2)}) \\
\vdots & \vdots & \ddots & \vdots \\
(\theta^{(1)})^T(x^{(n_m)}) & (\theta^{(2)})^T(x^{(n_m)}) & \dots & (\theta^{(n_u)})^T(x^{(n_m)})
\end{bmatrix} = X\Theta^T]
[X = \begin{bmatrix}
- (x^{(1)})^T - \\
- (x^{(2)})^T - \\
\vdots \\
- (x^{(n_m)})^T -
\end{bmatrix}]
[\Theta = \begin{bmatrix}
- (\theta^{(1)})^T - \\
- (\theta^{(2)})^T - \\
\vdots \\
- (\theta^{(n_u)})^T -
\end{bmatrix}]
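In code, this low rank factorization turns all predictions into a single matrix product (continuing the NumPy sketch above, where `X` is $ n_m \times n $ and `Theta` is $ n_u \times n $):

```python
predictions = X @ Theta.T          # (n_m, n_u): entry (i, j) is (theta^(j))^T x^(i)
pred_for_bob = predictions[:, 1]   # predicted ratings of every movie for user 2
```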
Finding related movies
To find related movies, we just look for the movies with the most similar features, e.g. the movies $ j $ whose features are closest to those of movie $ i $:
[\min_{j} \| x^{(i)} - x^{(j)} \|]
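A small sketch of this lookup (a hypothetical `most_related` helper over the learned feature matrix `X` from above):

```python
def most_related(i, X, k=5):
    """Return the indices of the k movies whose features are closest to movie i."""
    dists = np.linalg.norm(X - X[i], axis=1)   # ||x^(i) - x^(j)|| for every j
    dists[i] = np.inf                          # exclude the movie itself
    return np.argsort(dists)[:k]
```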
Mean Normalization
Problem: users who have not rated any movies
Movie | Alice (1) | Bob (2) | Carol (3) | Dave (4) | Eve (5) |
---|---|---|---|---|---|
Love at last | 5 | 5 | 0 | 0 | ? |
Romance forever | 5 | ? | ? | 0 | ? |
Cute puppies of love | ? | 4 | 0 | ? | ? |
Nonstop car chases | 0 | 0 | 5 | 4 | ? |
Swords vs. karate | 0 | 0 | 5 | ? | ? |
If the user Eve has not rated any movies, then the term $ \frac{1}{2}\sum_{(i,j):r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 $ plays no role in determining her $ \theta $, since there is no pair $ (i,j) $ with $ r(i,j)=1 $ that involves her. Meanwhile, the term $ \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2 $ drives $ \theta $ toward its minimum, so for predicting Eve's ratings we can only obtain
[\theta^{(5)} = \begin{bmatrix}
0 \\
0
\end{bmatrix}]
and hence every predicted rating $ (\theta^{(5)})^Tx^{(i)} = 0 $. The details can be seen by analyzing the following objective:
[\min_{x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)}}J(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)}) =]
[\min_{x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)}}\frac{1}{2}\sum_{(i,j):r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n(x_k^{(i)})^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2]
Mean normalization
[Y = \begin{bmatrix}
5 & 5 & 0 & 0 & ? \\
5 & ? & ? & 0 & ? \\
? & 4 & 0 & ? & ? \\
0 & 0 & 5 & 4 & ? \\
0 & 0 & 5 & 0 & ?
\end{bmatrix}]
Average the ratings of each movie (over the users who rated it) to obtain the following mean vector $ \mu $:
[\mu = \begin{bmatrix}
2.5 \\
2.5 \\
2 \\
2.25 \\
1.25
\end{bmatrix}]
Then subtract $ \mu $ from each column of the original dataset $ Y $:
[Y_{new} := Y - \mu = \begin{bmatrix}
2.5 & 2.5 & -2.5 & -2.5 & ? \\
2.5 & ? & ? & -2.5 & ? \\
? & 2 & -2 & ? & ? \\
-2.25 & -2.25 & 2.75 & 1.75 & ? \\
-1.25 & -1.25 & 3.75 & -1.25 & ?
\end{bmatrix}]
Train new parameters $ \theta_{new}^{(j)} $ and features $ x_{new}^{(i)} $ on the newly computed dataset $ Y_{new} $; the predicted rating of user $ j $ for movie $ i $ then becomes:
[(\theta_{new}^{(j)})^T(x_{new}^{(i)}) + \mu_i]
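A minimal sketch of the whole trick (assuming the NumPy setup from the earlier code blocks; the idea is unchanged if Eve's all-NaN column is appended to `Y`). For a user with no ratings, the learned $ \theta_{new} $ stays near zero, so the prediction falls back to the movie's mean rating $ \mu_i $:

```python
mu = np.nanmean(Y, axis=1, keepdims=True)   # per-movie mean over rated entries only
Y_norm = np.where(R, Y - mu, 0.0)           # mean-normalize and zero out "?" entries

# ...train X and Theta on Y_norm with the same gradient descent loop as before...

predictions = X @ Theta.T + mu              # add the means back when predicting
```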