Recommender System Using Parallel Matrix Factorization (C++, R)

38 files in total

rd: 8

cpp: 6

r: 5


Uploaded 2023-04-13 23:41:29 · 138 KB ZIP · 143 views

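Recommender systems are an important part of modern big-data applications, especially in e-commerce, social media, and entertainment, where they predict what a user is likely to be interested in from that user's past behavior and preferences. This project, "Recommender System Using Parallel Matrix Factorization", combines C++ and R to build an efficient recommendation engine with matrix factorization on multicore processors.

Matrix factorization is one of the core techniques in recommender systems, notably singular value decomposition (SVD) and methods based on alternating least squares (ALS). Both approximate the user-item interaction matrix by a product of low-rank matrices that capture latent user interests and item attributes, which makes personalized recommendation possible.

1. **Singular value decomposition (SVD)**: SVD factors the original high-dimensional sparse matrix into a product of three matrices, U × Σ × V^T, where U and V^T hold the latent vectors of users and items and Σ is a diagonal matrix of singular values. From these vectors one can compute a predicted rating for items a user has not rated, and recommend accordingly.
2. **Alternating least squares (ALS)**: ALS is a widely used method for this kind of linear-algebra optimization problem and suits large data sets. In a recommender system, ALS alternately updates the user factor matrix and the item factor matrix to minimize the error between predicted and observed ratings.
3. **Parallel processing**: Matrix factorization is computationally heavy, especially on large data sets, so parallelization matters. Parallel libraries such as OpenMP (C++) or snow and foreach (R) distribute the work across the cores of a multicore processor and shorten the computation time.
4. **C++**: C++ is an efficient, low-level language suited to performance-sensitive work such as parallel computing. The Standard Template Library (STL), the threading facilities introduced in C++11, and the parallel algorithms added in C++17 provide the tools for parallelization.
5. **R**: R is a widely used language for statistical analysis and data science. It runs comparatively slowly, but calling C++ code through an interface such as Rcpp improves performance. In a recommender system, R can handle data preprocessing, model evaluation, and visualization of results.
6. **Project structure**: The directory name `recosystem-master` suggests a snapshot of the master branch of a Git repository, so the project is probably open source. The project directory may contain source code, data sets, configuration files, test scripts, and documentation.
7. **Application scenarios**: A recommender system of this kind can be applied to movie recommendation, product recommendation, news feeds, and similar areas, providing personalized service by learning from user behavior.
8. **Optimization and extension**: To improve the recommender further, one can bring in other machine learning techniques, such as neural network models, or hybrid methods that combine collaborative filtering with content-based filtering.

The key techniques in this project are parallel matrix factorization, C++ programming, data analysis in R, and the design and optimization of recommender systems. It is a good practice project for anyone who wants a deeper understanding of recommender systems and parallel computing.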
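
The alternating update described for ALS can be sketched in a few lines when the rank is 1, because each least-squares subproblem then has a scalar closed form. The following C++ is a minimal illustration under that simplifying assumption, not code from this project; the names `Rating`, `Rank1Model`, `rmse`, and `als_rank1` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One observed rating in (user, item, value) form; indices are 0-based.
struct Rating {
    std::size_t user;
    std::size_t item;
    double value;
};

// Rank-1 factors: one scalar per user and one scalar per item.
struct Rank1Model {
    std::vector<double> p;  // user factors
    std::vector<double> q;  // item factors
};

// Root-mean-square error of the model on the observed ratings.
double rmse(const Rank1Model& m, const std::vector<Rating>& data) {
    assert(!data.empty());
    double sse = 0.0;
    for (const Rating& r : data) {
        const double err = r.value - m.p[r.user] * m.q[r.item];
        sse += err * err;
    }
    return std::sqrt(sse / static_cast<double>(data.size()));
}

// Alternating least squares for rank 1 with an L2 penalty `lambda`.
// With q fixed, each p[u] has the closed-form ridge solution
//   p[u] = sum_i r_ui * q[i] / (lambda + sum_i q[i]^2),
// where the sums run over the items rated by u. The q update is symmetric.
Rank1Model als_rank1(const std::vector<Rating>& data, std::size_t n_users,
                     std::size_t n_items, double lambda, int n_iter) {
    assert(lambda > 0.0);  // keeps every denominator strictly positive
    Rank1Model m{std::vector<double>(n_users, 1.0),
                 std::vector<double>(n_items, 1.0)};
    for (int it = 0; it < n_iter; ++it) {
        // Update user factors with item factors held fixed.
        std::vector<double> num(n_users, 0.0), den(n_users, lambda);
        for (const Rating& r : data) {
            num[r.user] += r.value * m.q[r.item];
            den[r.user] += m.q[r.item] * m.q[r.item];
        }
        for (std::size_t u = 0; u < n_users; ++u) m.p[u] = num[u] / den[u];

        // Update item factors with user factors held fixed.
        num.assign(n_items, 0.0);
        den.assign(n_items, lambda);
        for (const Rating& r : data) {
            num[r.item] += r.value * m.p[r.user];
            den[r.item] += m.p[r.user] * m.p[r.user];
        }
        for (std::size_t i = 0; i < n_items; ++i) m.q[i] = num[i] / den[i];
    }
    return m;
}
```

Within one half-step the per-user (or per-item) updates are independent of each other, which is what makes this family of methods a natural fit for multicore parallelization. A general rank-k version would solve a small k × k linear system per user or item instead of a scalar division.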


Archive contents: 使用并行矩阵分解的推荐系统_C++_R_下载.zip (38 files)

recosystem-master

recosystem.Rproj 376B

man

output.Rd 3KB

predict.Rd 3KB

tune.Rd 5KB

Reco.Rd 1KB

train.Rd 5KB

data_source.Rd 3KB

output_format.Rd 1KB

src

register_routines.c 438B

Makevars 484B

reco-predict.cpp 6KB

reco-read-data.cpp 2KB

reco-read-data.h 3KB

reco-tune.cpp 4KB

mf.h 3KB

register_routines.h 413B

reco-train.cpp 5KB

reco-output.cpp 5KB

mf.cpp 115KB

reco-utils.h 3KB

Makevars.win 544B

LICENSE 76B

not_in_package

simulate.R 795B

.Rbuildignore 28B

DataSource.R 5KB

RecoSys.R 29KB

Output.R 1KB

RecoModel.R 969B

NAMESPACE 231B

.gitignore 56B

DESCRIPTION 1KB

vignettes

introduction.Rmd 10KB

README.md 11KB

inst

COPYRIGHTS 178B

AUTHORS 397B

NEWS.Rd 5KB

dat

smalltest.txt 76KB

smalltrain.txt 96KB

### IMPORTANT NOTES

> The API of this package has changed since version 0.4, due
> to the API change of LIBMF 2.01 and other design improvements.

- The `cost` option in `$train()` and `$tune()` has been expanded to and replaced by `costp_l1`, `costp_l2`, `costq_l1`, and `costq_l2`, to allow for more flexibility of the model.
- A new `loss` parameter in `$train()` and `$tune()` specifies the loss function.
- Data input and output are now managed in a unified way via functions `data_file()`, `data_memory()`, `out_file()`, `out_memory()`, and `out_nothing()`. See section **Data Input and Output** below.
- As a result, a number of arguments in functions `$tune()`, `$train()`, `$output()`, and `$predict()` now should be objects returned by these input/output functions.

## Recommender System with the recosystem Package

### About This Package

`recosystem` is an R wrapper of the `LIBMF` library developed by Yu-Chin Juan, Wei-Sheng Chin, Yong Zhuang, Bo-Wen Yuan, Meng-Yuan Yang, and Chih-Jen Lin (https://www.csie.ntu.edu.tw/~cjlin/libmf/), an open source library for recommender systems using parallel matrix factorization.

### Highlights of LIBMF and recosystem

`LIBMF` is a high-performance C++ library for large scale matrix factorization. `LIBMF` itself is a parallelized library, meaning that users can take advantage of multicore CPUs to speed up the computation. It also utilizes some advanced CPU features to further improve the performance.

`recosystem` is a wrapper of `LIBMF`, hence it inherits most of the features of `LIBMF`, and additionally provides a number of user-friendly R functions to simplify data processing and model building.
Also, unlike most other R packages for statistical modeling that store the whole dataset and model object in memory, `LIBMF` (and hence `recosystem`) can significantly reduce memory use. For instance, the constructed model that contains information for prediction can be stored on the hard disk, and output results can be written directly into a file rather than kept in memory.

### A Quick View of Recommender System

The main task of a recommender system is to predict unknown entries in the rating matrix based on observed values, as shown in the table below:

|        | item_1 | item_2 | item_3 | ... | item_n |
|--------|--------|--------|--------|-----|--------|
| user_1 | 2      | 3      | ??     | ... | 5      |
| user_2 | ??     | 4      | 3      | ... | ??     |
| user_3 | 3      | 2      | ??     | ... | 3      |
| ...    | ...    | ...    | ...    | ... |        |
| user_m | 1      | ??     | 5      | ... | 4      |

Each cell containing a number is the rating given by some user to a specific item, while cells marked with question marks are unknown ratings that need to be predicted. In the literature, this problem is also called collaborative filtering, matrix completion, matrix recovery, etc.

In `recosystem`, we provide convenient functions for model training, parameter tuning, model exporting, and model prediction.

### Data Input and Output

Each step in the recommender system involves data input and output, as the table below shows:

| Step             | Input             | Output                           |
|------------------|-------------------|----------------------------------|
| Model training   | Training data set | --                               |
| Parameter tuning | Training data set | --                               |
| Exporting model  | --                | User matrix `P`, item matrix `Q` |
| Prediction       | Testing data set  | Predicted values                 |

Data may have different formats and types of storage. For example, the input data set may be saved in a file or stored as R objects, and users may want the output results to be written directly into a file or returned as R objects for further processing.
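
To make the roles of the user matrix `P` and item matrix `Q` from the table above concrete: in a matrix factorization model, a predicted rating is the inner product of one user's latent vector and one item's latent vector. The C++ sketch below assumes that each row of `P` holds one user's vector and each row of `Q` one item's vector; `FactorMatrix` and `predict_rating` are hypothetical names for illustration, not part of `recosystem` or `LIBMF`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dense row-major factor matrix: `rows` latent vectors of length `dim`.
struct FactorMatrix {
    std::size_t rows;
    std::size_t dim;
    std::vector<double> values;  // expected size: rows * dim

    // Pointer to the first element of row r.
    const double* row(std::size_t r) const {
        assert(r < rows);
        assert(values.size() == rows * dim);
        return values.data() + r * dim;
    }
};

// Predicted rating of `user` on `item`: the inner product of the
// user's row of P and the item's row of Q.
double predict_rating(const FactorMatrix& P, const FactorMatrix& Q,
                      std::size_t user, std::size_t item) {
    assert(P.dim == Q.dim);
    const double* p = P.row(user);
    const double* q = Q.row(item);
    double score = 0.0;
    for (std::size_t k = 0; k < P.dim; ++k) score += p[k] * q[k];
    return score;
}
```

For example, with a user vector (1, 2) and an item vector (0.5, 1), the predicted rating is 1 × 0.5 + 2 × 1 = 2.5.
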
In `recosystem`, we use two classes, `DataSource` and `Output`, to handle data input and output in a unified way.

An object of class `DataSource` specifies the source of a data set (either training or testing), which can be created by the following functions:

- `data_file()`: Specifies a data set from a file on the hard disk
- `data_memory()`: Specifies a data set from R objects
- `data_matrix()`: Specifies a data set from a sparse matrix

And an object of class `Output` describes how the result should be output, typically returned by the functions below:

- `out_file()`: Result should be saved to a file
- `out_memory()`: Result should be returned as R objects
- `out_nothing()`: Nothing should be output

More data source formats and output options may be supported in the future along with the development of this package.

### Data Format

The data file for the training set needs to be arranged in sparse matrix triplet form, i.e., each line in the file contains three numbers

```
user_index item_index rating
```

User index and item index may start with either 0 or 1, and this can be specified by the `index1` parameter in `data_file()` and `data_memory()`. For example, with `index1 = FALSE`, the training data file for the rating matrix in the beginning of this article may look like

```
0 0 2
0 1 3
1 1 4
1 2 3
2 0 3
2 1 2
...
```

From version 0.4 `recosystem` supports two special types of matrix factorization: the binary matrix factorization (BMF), and the one-class matrix factorization (OCMF). BMF requires ratings to take values from `{-1, 1}`, and OCMF requires all the ratings to be positive.

The testing data file is similar to the training data, but since the ratings in testing data are usually unknown, the `rating` entry in the testing data file can be omitted, or can be replaced by any placeholder such as `0` or `?`.

The testing data file for the same rating matrix would be

```
0 2
1 0
2 2
...
```

Example data files are contained in the `<recosystem>/dat` (or `<recosystem>/inst/dat`, for source package) directory.

### Usage of recosystem

The usage of `recosystem` is quite simple, mainly consisting of the following steps:

1. Create a model object (a Reference Class object in R) by calling `Reco()`.
2. (Optionally) call the `$tune()` method to select the best tuning parameters from a set of candidate values.
3. Train the model by calling the `$train()` method. A number of parameters can be set inside the function, possibly coming from the result of `$tune()`.
4. (Optionally) export the model via `$output()`, i.e. write the factorization matrices `P` and `Q` into files or return them as R objects.
5. Use the `$predict()` method to compute predicted values.

Below is an example on some simulated data:

```r
library(recosystem)
set.seed(123) # This is a randomized algorithm

# Training and testing data shipped with the package
train_set = data_file(system.file("dat", "smalltrain.txt", package = "recosystem"))
test_set = data_file(system.file("dat", "smalltest.txt", package = "recosystem"))

# Create the model object and tune over the candidate values
r = Reco()
opts = r$tune(train_set, opts = list(dim = c(10, 20, 30), lrate = c(0.1, 0.2),
                                     costp_l1 = 0, costq_l1 = 0,
                                     nthread = 1, niter = 10))
opts
```

```
$min
$min$dim
[1] 20

$min$costp_l1
[1] 0

$min$costp_l2
[1] 0.1

$min$costq_l1
[1] 0

$min$costq_l2
[1] 0.01

$min$lrate
[1] 0.1

$min$loss_fun
[1] 0.9804937


$res
   dim costp_l1 costp_l2 costq_l1 costq_l2 lrate  loss_fun
1   10        0     0.01        0     0.01   0.1 0.9996368
2   20        0     0.01        0     0.01   0.1 1.0040111
3   30        0     0.01        0     0.01   0.1 0.9967101
4   10        0     0.10        0     0.01   0.1 0.9930384
5   20        0     0.10        0     0.01   0.1 0.9804937
6   30        0     0.10        0     0.01   0.1 0.9921565
7   10        0     0.01        0     0.10   0.1 0.9857116
8   20        0     0.01        0     0.10   0.1 1.0006225
9   30        0     0.01        0     0.10   0.1 0.9891277
10  10        0     0.10        0     0.10   0.1 0.9826748
11  20        0     0.10        0     0.10   0.1 0.9807865
12  30        0     0.10        0
```
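
As a rough illustration of the triplet format from the Data Format section, the C++ sketch below reads `user_index item_index rating` lines, converts indices to 0-based according to a flag that plays the role of `index1`, and tolerates a missing or placeholder rating as in testing data. It is not the package's reader: `Triplet`, `parse_triplets`, and the `missing_rating` default are hypothetical choices made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One parsed line; indices are always stored 0-based.
struct Triplet {
    std::size_t user;
    std::size_t item;
    double rating;
};

// Parses whitespace-separated "user item [rating]" lines from `text`.
// `index1` is true when indices in the text start at 1, false when they
// start at 0. A rating that is absent or not numeric (for example "?")
// is replaced by `missing_rating`. Lines without two valid integer
// indices (blank lines, a literal "...") are skipped.
std::vector<Triplet> parse_triplets(const std::string& text, bool index1,
                                    double missing_rating = 0.0) {
    std::vector<Triplet> out;
    std::istringstream lines(text);
    std::string line;
    const long long base = index1 ? 1 : 0;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        long long user = 0;
        long long item = 0;
        if (!(fields >> user >> item)) continue;    // malformed or blank line
        if (user < base || item < base) continue;   // index below the base
        double rating = missing_rating;
        double parsed = 0.0;
        if (fields >> parsed) rating = parsed;      // keep default on failure
        out.push_back({static_cast<std::size_t>(user - base),
                       static_cast<std::size_t>(item - base), rating});
    }
    return out;
}
```

Calling `parse_triplets("0 0 2\n0 1 3\n", false)` yields two entries for user 0, while `parse_triplets("1 3\n", true)` yields a single entry with user 0, item 2, and the default rating. Storing indices 0-based internally keeps later array lookups uniform regardless of how the file was written.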
