########################################################################
# Netflix Prize Tools
# Copyright (C) 2007-8 Ehud Ben-Reuven
# udi@benreuven.com
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation version 2.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301,USA.
########################################################################
With the set of tools available here I was able to reach an RMSE of 0.9046
on the qualifying data.
This was achieved by the following steps:
0) Arrange a development environment:
* Cygwin - download the entire development category and "lapack" package
from the "Math" category.
* Ubuntu 7.10 (I use it to run on a 64-bit CPU) - download the build
environment with
"aptitude install build-essential"
and the LAPACK package with
"apt-get install lapack3-dev"
1) Create a sub-directory called "input".
The "input" directory should be located in the working directory, the
directory where you run the code.
You can use a symbolic link to physically locate this directory on a different
drive ("ln -s /another-location ./input").
Download the data file from Netflix (http://www.netflixprize.com/download).
Run gunzip and "tar xvf" to extract the downloaded file.
Your input directory should end up with the files probe.txt and qualifying.txt
and a directory called training_set.
2) Create a directory called "data" in the working directory; it will be used
to store the results of the tools. Again, you can use a symbolic link to a
different drive.
3) Run "python countusers.py" to generate the file "data/users", which converts
a compressed user representation (19 bits) to the user values in the Netflix
files (21 bits, with gaps). This code also performs a sanity check on the data.
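The idea behind such a remapping can be sketched in a few lines of Python (a
hypothetical illustration of the dense-index idea, not the actual countusers.py):

```python
def build_user_map(raw_ids):
    """Map sparse raw user IDs (up to 21 bits, with gaps) to dense
    consecutive indices that fit in fewer bits (here, 19)."""
    dense = {}
    for uid in sorted(set(raw_ids)):
        dense[uid] = len(dense)
    return dense

# Example: three sparse Netflix-style IDs become dense indices 0..2.
m = build_user_map([2649429, 6, 1488844])
```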
4) Run "python qualify2bin.py" to convert the qualifying file
(input/qualifying.txt) to a binary format (data/qualify.bin), which will
later be used to generate the submission to Netflix.
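For reference, qualifying.txt groups entries under a "MOVIEID:" header line,
with one "USERID,DATE" line per entry; a minimal parser for that layout (a
sketch of the format only, not qualify2bin.py itself) looks like:

```python
def parse_qualifying(lines):
    """Parse Netflix qualifying.txt lines: a 'MOVIEID:' header,
    followed by 'USERID,YYYY-MM-DD' lines for that movie."""
    pairs = []
    movie = None
    for line in lines:
        line = line.strip()
        if line.endswith(':'):
            movie = int(line[:-1])       # new movie block starts
        elif line:
            user, date = line.split(',')
            pairs.append((movie, int(user), date))
    return pairs

rows = parse_qualifying(
    ["1:", "30878,2005-12-19", "2647871,2005-12-26",
     "10:", "1952305,2005-03-10"])
```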
5) Run "python netflix2moviebin.py". This will generate a binary representation
of the entire data set (training, probe and qualifying) in data/movie_entry.bin.
The file is organized by movie, and the start offset of each movie is stored
in data/movie_index.bin. The size of the movie_entry file was reduced by
discarding some of the date information for each entry; the exact dates are
kept in data/movie_date.bin.
6) Make (run "make moviebin2userbin") and run moviebin2userbin. This will read
the files generated in step 5 and generate a new file, data/user_entry.bin,
which keeps the same information ordered by user. The location
of each user in the file is kept in data/user_index.bin. Note that in this
format the full date information is kept.
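Because the per-user counts are known in advance, the user-ordered file can be
built with a single counting sort over the movie-ordered records; a small
Python sketch of that idea (moviebin2userbin.c works on the binary records
directly, but the index layout is the same):

```python
def resort_by_user(entries, n_users):
    """Re-bucket (movie, user, rating) triples, currently grouped by
    movie, into user order. Also returns the offset where each user's
    block starts, analogous to data/user_index.bin."""
    counts = [0] * n_users
    for _, u, _ in entries:
        counts[u] += 1
    # Prefix sums give each user's start offset in the output.
    index = [0] * (n_users + 1)
    for u in range(n_users):
        index[u + 1] = index[u] + counts[u]
    out = [None] * len(entries)
    pos = index[:]                 # next free slot per user
    for m, u, r in entries:
        out[pos[u]] = (m, u, r)
        pos[u] += 1
    return out, index

out, index = resort_by_user([(0, 1, 5), (0, 0, 3), (1, 1, 4)], 2)
```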
7) Make utest0b1 and use it to remove the baseline from the training data
using:
utest0b1 -se data/b.bin
NOTE: All the commands that start with "uXXX" have several flags, see utest.c.
NOTE: These commands generate a log file at data/log.txt
NOTE: The "-se data/b.bin" indicates where to store the resulting errors.
The above utest0b1 command generates the baseline errors as described in
"Improved Neighborhood-based Collaborative Filtering" by the BellKor team.
Their baseline was improved by adding the following steps:
* For each user entry, remove the correlation with the number of other entries
that the same user made on the same day.
* Perform a final pass in which the global average error is removed.
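The flavor of such a multi-pass baseline can be illustrated as follows (a
simplified sketch of the idea only; the real estimator in ubaseline1.c is
regularized and also includes the per-day entry-count step described above):

```python
from collections import defaultdict

def remove_baseline(ratings):
    """Subtract a simple baseline from (movie, user, rating) triples:
    the per-movie mean, then the per-user mean of the residuals, then
    the global residual mean (the 'final pass' mentioned above)."""
    res = {(m, u): r for m, u, r in ratings}
    for key in ('movie', 'user'):
        sums, cnts = defaultdict(float), defaultdict(int)
        for (m, u), r in res.items():
            k = m if key == 'movie' else u
            sums[k] += r
            cnts[k] += 1
        for (m, u) in res:
            k = m if key == 'movie' else u
            res[(m, u)] -= sums[k] / cnts[k]
    g = sum(res.values()) / len(res)   # final global-average pass
    return {k: v - g for k, v in res.items()}

res = remove_baseline([(0, 0, 4.0), (0, 1, 2.0), (1, 0, 5.0)])
```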
8) Repeat the run to generate a baseline for both training and probe data:
utest0b1 -a -se data/xb.bin
NOTE: The "-a" flag controls if the run is on the training or on all
(training+probe) data.
9) Make and run SVD with weights on both training and all data:
usvdbkw1 -le data/b.bin -l 600 -se data/usvdbkw1-b-l600.bin
usvdbkw1 -le data/xb.bin -a -l 600 -se data/usvdbkw1-xb-l600.bin
NOTE: The "-l 600" indicates that 600 features are computed.
NOTE: The "-le data/..." indicates from where to read the initial errors.
The program usvdbkw1 performs SVD factorization as described in section 4.3 of
"Modeling Relationships at Multiple Scales to Improve Accuracy of Large
Recommender Systems", again by the BellKor team. Note that their method is an
order of magnitude faster than Simon Funk's method.
Their method was augmented here by adding linear-time weights to the computation.
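For contrast with the BellKor approach, the classic Funk-style factorization
updates both factor vectors with a stochastic gradient step per rating; a
minimal sketch with hypothetical parameters (not usvdbkw1.c, which follows the
faster per-vector solve of the paper above):

```python
import random

def sgd_factorize(ratings, n_users, n_movies, k=2,
                  lr=0.02, reg=0.02, epochs=500):
    """Funk-style SGD matrix factorization on (user, movie, rating)
    triples: predict r ~ p[u] . q[m], descend on the squared error
    with L2 regularization."""
    random.seed(0)
    p = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    q = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_movies)]
    for _ in range(epochs):
        for u, m, r in ratings:
            err = r - sum(p[u][f] * q[m][f] for f in range(k))
            for f in range(k):
                pu, qm = p[u][f], q[m][f]
                p[u][f] += lr * (err * qm - reg * pu)
                q[m][f] += lr * (err * pu - reg * qm)
    return p, q

ratings = [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 2.0), (1, 1, 1.0)]
p, q = sgd_factorize(ratings, 2, 2)
```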
10) I found that running the first step of the baseline (removing the movie
average) helped at this stage:
utest0b1 -l 1 -le data/usvdbkw1-b-l600.bin -se data/7.bin
utest0b1 -l 1 -a -le data/usvdbkw1-xb-l600.bin -se data/x7.bin
NOTE: The "-l 1" flag controls which baseline steps are performed.
11) The x7.bin run gave a qualifying RMSE of 0.9064. This can be reproduced as
follows:
utest0b1 -l 0 -le data/x7.bin -sq data/t.txt
The file "data/t.txt" is the desired result; you can check its validity with
the Netflix script by running "check_format data/t.txt".
You should gzip the result ("gzip data/t.txt"), compute its MD5 hash
("md5sum data/t.txt.gz"), and you are then ready to post the result to Netflix.
12) In order to improve this result, make and run the NSVD1 method as described
in section 3.8 of "Improving regularized singular value decomposition for
collaborative filtering" by Arkadiusz Paterek:
utest10 -le data/b.bin -l 30 -se data/u10-a-le-b-l-30.bin
utest10 -le data/xb.bin -a -l 30 -se data/u10-a-le-xb-l-30.bin
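The core NSVD1 idea is that a user's factor vector is not a free parameter but
is assembled from per-movie vectors over the movies that user rated, so the
model size no longer grows with the number of users. A small sketch of the
prediction rule with hypothetical toy inputs (not utest10):

```python
import math

def nsvd1_predict(rated_movies, y, q, m):
    """NSVD1 prediction (Paterek sec. 3.8 idea): the user factor is the
    sum of per-movie vectors y[j] over the user's rated movies,
    normalized by sqrt of their count, dotted with movie factor q[m]."""
    k = len(q[m])
    p = [sum(y[j][f] for j in rated_movies) / math.sqrt(len(rated_movies))
         for f in range(k)]
    return sum(p[f] * q[m][f] for f in range(k))

# User rated movies 0 and 1; predict for movie 0.
pred = nsvd1_predict([0, 1], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 2.0]], 0)
```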
13) Again I found an improvement from removing the overall average:
utest0b1 -l 1 -le data/u10-a-le-b-l-30.bin -bl 1 -se data/10
utest0b1 -l 1 -a -le data/u10-a-le-xb-l-30.bin -bl 1 -se data/x10
14) At this point you are ready to find the best mixture weights for
combining the results of steps 10 and 13 on the training data:
utest0b1 -l 0 -le data/7.bin -le data/10
The weights that are displayed should then be used on the entire data,
for example:
utest0b1 -l 0 -lew 0.765888 -lew 0.231938 -lew -0.001085 -le data/x7.bin -le data/x10 -sq data/t.txt
You can now process t.txt as described in step 11.
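Finding mixture weights like the -lew values above amounts to an ordinary
least-squares fit of the predictors against the known ratings; a self-contained
sketch of that fit (an illustration of the idea only; the actual weights here
come from the utest0b1 run):

```python
def blend_weights(preds, target):
    """Least-squares mixture weights for combining predictors.
    `preds` is a list of predictor value lists; returns one weight per
    predictor, found by solving the normal equations (P P^T) w = P t
    with Gaussian elimination and partial pivoting."""
    n = len(preds)
    a = [[sum(x * y for x, y in zip(pi, pj)) for pj in preds] for pi in preds]
    b = [sum(x * t for x, t in zip(pi, target)) for pi in preds]
    for i in range(n):
        piv = max(range(i, n), key=lambda r: abs(a[r][i]))
        a[i], a[piv] = a[piv], a[i]
        b[i], b[piv] = b[piv], b[i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
            b[r] -= f * b[i]
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        w[i] = (b[i] - sum(a[i][c] * w[c] for c in range(i + 1, n))) / a[i][i]
    return w

p1 = [1.0, 0.0, 1.0, 2.0]
p2 = [0.0, 1.0, 1.0, 0.5]
target = [0.3 * a + 0.7 * b for a, b in zip(p1, p2)]
w = blend_weights([p1, p2], target)
```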