Using Co-clustering for Predicting Movie Rating in Netflix (CSDN Wenku resource)

Uploaded 2012-11-26 09:08:34; 108 KB PDF.

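### Using Co-clustering to Predict Netflix Movie Ratings

#### Introduction

In the era of big data, predicting missing values has found wide application in areas such as sales forecasting and recommendation systems. In information filtering in particular, the goal is to determine how relevant an item (such as a movie or a book) is to a specific user, based on the user's profile and the known preferences of other users. The input data can be viewed as a matrix in which each row holds one user's ratings and each column corresponds to a movie. The task is to fill in the missing entries of this matrix.

Recent work on Bregman co-clustering [1][3] proposed a method that partitions the rows and columns of the data matrix simultaneously and uses the clustering result to predict the missing entries. By combining row and column clustering information, co-clustering can produce higher-quality clusters even when only a single-sided clustering is needed. Co-clustering is also more scalable than traditional single-sided clustering [1].

This project explores combining co-clustering with other methods to predict missing values on a particular subset of the Netflix dataset. The original dataset is very large: it contains over 100 million ratings from nearly 500,000 Netflix users, covering about 17,000 movies. Each rating is on a scale from 1 to 5. Because of the size of the dataset, a subset was sampled for a pilot study. Based on this subset, the project tries to give initial answers to two questions. First, how well does Bregman co-clustering perform on the Netflix dataset? Second, can co-clustering serve as an intermediate step that partitions the data into blocks, on which more expensive but more accurate methods such as singular value decomposition (SVD) are then applied to obtain higher performance?

#### Methods for Predicting Movie Ratings

**2.1 Missing Value Prediction** Missing value prediction is one of the core problems in recommendation systems. For the Netflix dataset, the missing values are mainly the movies a user has not rated. They can be predicted in several ways, including collaborative filtering, matrix completion, and the co-clustering approach that is the focus here.

**2.2 Collaborative Filtering** Collaborative filtering is one of the most widely used recommendation techniques. It rests on the assumption that similar users have similar interests. Concretely, it groups users by their behavior and preferences and predicts which products they are likely to enjoy. It can be further divided into user-user and item-item collaborative filtering.

**2.3 Bregman Co-clustering** Bregman co-clustering clusters the rows and columns of a data matrix at the same time. The basic idea is to find a set of row clusters and column clusters such that the data within each cluster are as similar as possible and the differences between clusters are as large as possible. This can reveal the underlying relations between groups of users and groups of movies, and it can also improve prediction accuracy.

**2.4 Co-clustering as a Preprocessing Step** Co-clustering can be used as a preprocessing step that splits the dataset into groups of users and groups of movies, after which a more complex and more accurate algorithm such as SVD is applied only to the resulting blocks. This can substantially reduce the computational cost while keeping prediction accuracy high, because co-clustering captures the preference relations between groups of users and groups of movies.

### Conclusion

The analysis of the Netflix dataset suggests that co-clustering has considerable potential for predicting missing movie ratings. It can be applied directly to missing value prediction, and it can also serve as a preprocessing step that gives later, more complex prediction models a more meaningful partition of the data. Future work could explore how to tune the co-clustering parameters to suit a wider range of recommendation scenarios and to improve prediction performance.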


Using Co-clustering for Predicting Movie Rating in Netflix

Tuyen Huynh and Duy Vu

1. INTRODUCTION

Prediction of missing values has recently found many practical applications in sales forecasting and recommendation systems. It is a main task in information filtering, where the goal is to identify the relevance of an item such as a movie or a book to a given user based on the user profile and/or the known preferences of other users. We can view the input data as a matrix in which each row holds the ratings of one user and each column corresponds to a movie. The task is then to impute the missing entries of the matrix.

The recent study on Bregman co-clustering [1][3] proposed a method for simultaneously partitioning the rows and columns of such a data matrix and then using the clustering results to predict the missing entries. By incorporating both row and column clustering information, which acts as a kind of statistical regularization, co-clustering can yield better-quality clusters even when only single-sided clustering results are needed. In addition, co-clustering is more scalable than traditional single-sided clustering [1].

In this project, we explore the combination of co-clustering and other methods for missing value prediction on a particular subset of the original Netflix dataset. The original is a huge dataset that contains over 100 million ratings from 480 thousand Netflix users across 17 thousand movies. Each rating is on a scale from 1 to 5. Since this dataset is too large, a subset of it is sampled for the pilot study. Based on this subset, we would like to give initial answers to the following two questions. First, how well does Bregman co-clustering perform on the Netflix dataset? Second, can we use co-clustering as an intermediate step that partitions the data into blocks, and then apply expensive but more accurate methods, such as SVD, to the co-clusters to achieve higher performance? The intuition behind this idea is that co-clustering can capture the preference relations between groups of users and groups of movies. Therefore, this approach not only makes SVD feasible on the dataset, but may also help improve prediction performance.

2. METHODS FOR PREDICTING MOVIE RATING

2.1 Missing Value Prediction Using Co-clustering

In [2],[4], it has been shown that the missing value prediction problem can be formulated as a weighted matrix approximation problem in which the weights are 1 for known values and 0 for unknown ones; co-clustering can then be used to find the best matrix approximation according to some criterion. The assumption is that the original matrix has a low-parameter structure involving properties of the row and column clusters. By minimizing a desired loss function, co-clustering can find that low-parameter structure, which is then used to reconstruct the approximate matrix. Let $Z$ be the random variable that takes values in the matrix, $U$ and $V$ be discrete random variables whose values are the row and column indices respectively, and $\hat{U}$ and $\hat{V}$ be discrete random variables that take values in the row cluster and column cluster indices. Then [2] shows that, for a given co-clustering, there are only six distinct sets of summary statistics, or co-clustering bases, which can be used for approximating the matrix:

$$C_1 = \{\{\hat{U}\}, \{\hat{V}\}\} \qquad C_2 = \{\{\hat{U}, \hat{V}\}\}$$

$$C_3 = \{\{\hat{U}, \hat{V}\}, \{U\}\} \qquad C_4 = \{\{\hat{U}, \hat{V}\}, \{V\}\}$$

$$C_5 = \{\{\hat{U}, \hat{V}\}, \{U\}, \{V\}\} \qquad C_6 = \{\{U, \hat{V}\}, \{\hat{U}, V\}\}$$

We also call these co-clustering bases schemes. Moreover, using the Minimum Bregman Information (MBI) principle, we can find the best approximation matrix $\hat{Z}$ for a given co-clustering, a given co-clustering basis, and a given Bregman divergence. Tables 1 and 2 present the best approximation for each basis under squared Euclidean distance and I-divergence, where the expectations in those formulae are interpreted as follows:

• $E[Z]$: the average value of the entire matrix

• $E[Z \mid U]$ and $E[Z \mid V]$: row averages and column averages respectively

• $E[Z \mid \hat{U}]$ and $E[Z \mid \hat{V}]$: row cluster averages and column cluster averages respectively

• $E[Z \mid U, \hat{V}]$ and $E[Z \mid \hat{U}, V]$: the average of each row within a column cluster and the average of each column within a row cluster, respectively

• $E[Z \mid \hat{U}, \hat{V}]$: co-cluster averages
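As a rough illustration of how these expectations turn into predictions, the following Python sketch computes the summary statistics over the known entries only and combines them with the squared-Euclidean formula for the basis that uses row, column, row-cluster, column-cluster, and co-cluster averages. The function names, the 0/1 mask `W`, and the fallback to 0 for a row, column, or cluster with no known rating are assumptions made for this example. The cluster assignments are taken as given here; learning them is a separate step that the sketch does not cover.

```python
import numpy as np

def safe_div(num, den):
    """Element-wise num / den, returning 0 where den == 0 (nothing known)."""
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def approx_c5(Z, W, row_clusters, col_clusters, k, l):
    """Approximate Z from averages over known entries (W == 1).

    Z: m x n ratings, W: m x n mask (1 = known, 0 = unknown),
    row_clusters: length-m labels in [0, k), col_clusters: length-n labels in [0, l).
    Returns the m x n matrix
    E[Z|U] - E[Z|U^] + E[Z|V] - E[Z|V^] + E[Z|U^,V^].
    """
    row_clusters = np.asarray(row_clusters, dtype=int)
    col_clusters = np.asarray(col_clusters, dtype=int)
    Zw = np.where(W > 0, Z, 0.0)        # zero out unknown entries
    R = np.eye(k)[row_clusters]         # m x k one-hot row-cluster membership
    C = np.eye(l)[col_clusters]         # n x l one-hot column-cluster membership

    row_sum, row_cnt = Zw.sum(axis=1), W.sum(axis=1)
    col_sum, col_cnt = Zw.sum(axis=0), W.sum(axis=0)

    row_avg = safe_div(row_sum, row_cnt)                 # E[Z|U]
    col_avg = safe_div(col_sum, col_cnt)                 # E[Z|V]
    rc_avg = safe_div(R.T @ row_sum, R.T @ row_cnt)      # E[Z|U^]
    cc_avg = safe_div(C.T @ col_sum, C.T @ col_cnt)      # E[Z|V^]
    co_avg = safe_div(R.T @ Zw @ C, R.T @ W @ C)         # E[Z|U^,V^]

    return (row_avg[:, None] - rc_avg[row_clusters][:, None]
            + col_avg[None, :] - cc_avg[col_clusters][None, :]
            + co_avg[np.ix_(row_clusters, col_clusters)])

# Toy usage: two users, two movies, one unknown rating at position (1, 1).
Z = np.array([[1.0, 2.0], [3.0, 0.0]])
W = np.array([[1.0, 1.0], [1.0, 0.0]])
Z_hat = approx_c5(Z, W, row_clusters=[0, 0], col_clusters=[0, 0], k=1, l=1)
print(Z_hat[1, 1])  # prediction for the unknown entry
```

Working from sums and counts rather than means keeps every statistic restricted to the known entries, which is what the 0/1 weights in the weighted approximation problem express.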

2.2 Missing Value Prediction Based on Singular Value Decomposition (SVD)

SVD is a popular matrix factorization technique that is widely used in data mining, especially for dimensionality reduction. Given an m × n matrix R, SVD decomposes that matrix into:

$$R = U \cdot S \cdot V^{T}$$

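Table 1: Best matrix approximation for squared Euclidean distance

| Co-clustering basis $C$ | Approximation matrix $\hat{Z}$ |
|---|---|
| $C_1$ | $E[Z \mid \hat{U}] + E[Z \mid \hat{V}] - E[Z]$ |
| $C_2$ | $E[Z \mid \hat{U}, \hat{V}]$ |
| $C_3$ | $E[Z \mid U] - E[Z \mid \hat{U}] + E[Z \mid \hat{U}, \hat{V}]$ |
| $C_4$ | $E[Z \mid V] - E[Z \mid \hat{V}] + E[Z \mid \hat{U}, \hat{V}]$ |
| $C_5$ | $E[Z \mid U] - E[Z \mid \hat{U}] + E[Z \mid V] - E[Z \mid \hat{V}] + E[Z \mid \hat{U}, \hat{V}]$ |
| $C_6$ | $E[Z \mid U, \hat{V}] + E[Z \mid \hat{U}, V] - E[Z \mid \hat{U}, \hat{V}]$ |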

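Table 2: Best matrix approximation for I-divergence

| Co-clustering basis $C$ | Approximation matrix $\hat{Z}$ |
|---|---|
| $C_1$ | $\dfrac{E[Z \mid \hat{U}] \times E[Z \mid \hat{V}]}{E[Z]}$ |
| $C_2$ | $E[Z \mid \hat{U}, \hat{V}]$ |
| $C_3$ | $\dfrac{E[Z \mid U] \times E[Z \mid \hat{U}, \hat{V}]}{E[Z \mid \hat{U}]}$ |
| $C_4$ | $\dfrac{E[Z \mid V] \times E[Z \mid \hat{U}, \hat{V}]}{E[Z \mid \hat{V}]}$ |
| $C_5$ | $\dfrac{E[Z \mid U] \times E[Z \mid V] \times E[Z \mid \hat{U}, \hat{V}]}{E[Z \mid \hat{U}] \times E[Z \mid \hat{V}]}$ |
| $C_6$ | $\dfrac{E[Z \mid U, \hat{V}] \times E[Z \mid \hat{U}, V]}{E[Z \mid \hat{U}, \hat{V}]}$ |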

where $U$ and $V$ are two orthogonal matrices of size m × m and n × n respectively, and $S$ is a diagonal matrix of size m × n. All diagonal entries of $S$ are non-negative and are called the singular values of $R$; they are usually sorted in decreasing order. The most important property of SVD is that we can take the k largest singular values and construct a rank-k approximation of $R$ as follows:

$$R_k = U_k \cdot S_k \cdot V_k^{T}$$

where $U_k$, $S_k$, and $V_k$ are the reduced matrices of $U$, $S$, and $V$ respectively. It has been shown that $R_k$ is the best rank-k approximation of $R$ under squared error loss. In other words, among all rank-k matrices, $R_k$ is the one that minimizes the Frobenius norm $\|R - R_k\|_F = \sqrt{\sum_{i,j} [R(i, j) - R_k(i, j)]^2}$. So, considering $R$ as our rating matrix, we can use the values of $R_k$ to approximate the values of $R$. The assumption of this approach is that latent relationships between users and movies affect the rating a user gives to a movie. Specifically, we assume there is a set of k factors that decide how a user rates a movie, and that these factors can be captured by the rank-k SVD. In [6],[7], the authors have shown that the performance of this approach is comparable to other collaborative filtering methods. However, one problem with this approach is that it is computationally expensive and does not scale to large datasets. So instead of applying it to the original rating matrix, we first run the co-clustering algorithm to partition the original matrix into blocks and then apply it to each block.

Another problem with this approach is that the original rating matrix $R$ has many missing entries, which makes it difficult to compute the rank-k SVD of $R$. One way to solve this problem is to fill in the missing values with reasonable values. In [6],[7], the authors use the average ratings of movies to fill in the missing values. We tried that approach, but the performance was very poor. So we propose to use the values predicted by co-clustering as the initial guesses for the missing values in the original rating matrix $R$. Then we compute the rank-k approximation $R_k$ of $R$ based on SVD, and use it to predict the final values of the missing entries in $R$. In summary, the following steps are performed on each co-cluster of the original rating matrix:

• Get the block matrix $R_{ij}$ corresponding to the co-cluster (i, j)

• Select a scheme, and fill in the missing values of $R_{ij}$ based on that scheme

• Compute the rank-k SVD of $R_{ij}$ to obtain $U_k$, $S_k$, and $V_k$

• Compute the matrices $U_k S_k^{1/2}$ and $S_k^{1/2} V_k^{T}$

• Compute the predicted rating for each entry in the test set by calculating the dot product between the appropriate row of $U_k S_k^{1/2}$ and column of $S_k^{1/2} V_k^{T}$
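The steps above can be sketched for a single block with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `svd_factors` and `predict`, the boolean mask `known`, and the `fill_values` argument (standing in for the co-clustering predictions of the chosen scheme) are all introduced here for the example.

```python
import numpy as np

def svd_factors(block, known, fill_values, k):
    """Fill the unknown entries of one block, then factor it at rank k.

    block: p x q ratings of one co-cluster, known: p x q boolean mask,
    fill_values: p x q guesses used where known is False, k: target rank.
    Returns (U_k S_k^{1/2}, S_k^{1/2} V_k^T).
    """
    filled = np.where(known, block, fill_values)
    # Thin SVD; singular values come back sorted in decreasing order.
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    k = min(k, len(s))
    sqrt_s = np.sqrt(s[:k])
    left = U[:, :k] * sqrt_s             # p x k, columns scaled by sqrt(sigma)
    right = sqrt_s[:, None] * Vt[:k]     # k x q, rows scaled by sqrt(sigma)
    return left, right

def predict(left, right, i, j):
    """Predicted rating for row i, column j of the block."""
    return float(left[i] @ right[:, j])

# Toy usage: a 2 x 3 block with one unknown entry at (1, 2).
block = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 0.0]])
known = np.array([[True, True, True], [True, True, False]])
fill_values = np.full(block.shape, 6.0)  # placeholder guesses
left, right = svd_factors(block, known, fill_values, k=1)
print(predict(left, right, 1, 2))
```

Splitting the singular values evenly between the two factors means a prediction needs only one k-dimensional dot product, so the full rank-k matrix never has to be materialized for a block.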



3. DATA ANALYSIS

The first important step of a data mining process is to analyze the dataset and discover its statistical characteristics. This analysis can help us select an appropriate method that both reduces the computational cost and achieves acceptable performance. This section presents some characteristics of the Netflix subset and discusses some potential approaches based on this analysis.

The first two charts, Figures 1 and 2, are the histograms of the average ratings by movie and by user, respectively. Both charts show that the ratings are biased toward high values. The average value in both charts is around 3.6, and most ratings lie between 2 and 5.

Figure 1: The Average Rating Distribution By Movie (histogram; x-axis: Average Movie Rating, from 1 to 5; y-axis labeled "Users" in the source, ticks from 500 to 3500)
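A histogram of this kind can be computed directly from the rating matrix. The sketch below assumes the layout described in the introduction (rows are users, columns are movies) and a 0/1 mask of known ratings; the function name and the half-point bin edges from 1 to 5 are choices made for this example, not details taken from the paper.

```python
import numpy as np

def average_rating_histograms(Z, W, edges=None):
    """Histogram the average rating per movie (column) and per user (row).

    Z: users x movies ratings, W: same-shape mask (1 = known, 0 = unknown).
    Movies or users with no known rating are left out.
    Returns (movie_hist, user_hist, edges).
    """
    if edges is None:
        edges = np.linspace(1.0, 5.0, 9)    # 1, 1.5, ..., 5
    Zw = np.where(W > 0, Z, 0.0)
    movie_cnt = W.sum(axis=0)
    user_cnt = W.sum(axis=1)
    movie_avg = Zw.sum(axis=0)[movie_cnt > 0] / movie_cnt[movie_cnt > 0]
    user_avg = Zw.sum(axis=1)[user_cnt > 0] / user_cnt[user_cnt > 0]
    movie_hist, _ = np.histogram(movie_avg, bins=edges)
    user_hist, _ = np.histogram(user_avg, bins=edges)
    return movie_hist, user_hist, edges

# Toy usage: two users, two movies, one unknown rating.
Z = np.array([[5.0, 3.0], [4.0, 0.0]])
W = np.array([[1.0, 1.0], [1.0, 0.0]])
movie_hist, user_hist, edges = average_rating_histograms(Z, W)
print(movie_hist, user_hist)
```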

Figure 3 is a curve that is computed by dividing movies into bins based on their average ratings. For each bin, we compute the average number of ratings per movie. In other words, the chart presents the number of ratings a movie receives as a function of its average rating. From this graph, we can infer that the highly-rated movies are

[The preview ends here; the remaining 6 pages are not shown.]
