xgboost-2015 resource - CSDN Library

391 files in total

py: 56

md: 50

r: 47

Uploaded 2016-03-29 21:57:21 · 1.13 MB ZIP · 122 views

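XGBoost: An Analysis of the 2015 Version and Its Applications

XGBoost, short for "Extreme Gradient Boosting", is an efficient, flexible, and scalable gradient boosting framework that is especially popular in machine learning competitions. This article focuses on the 2015 version of XGBoost and discusses its features, its usage, and how it differs from later versions.

The core of XGBoost is its optimized algorithm: it improves on traditional gradient boosted decision trees (GBDT) to achieve faster training and better predictive performance. The 2015 version does not natively support Windows, but it runs stably on Linux and Mac OS and gives data scientists a powerful toolbox.

First, the main functionality. XGBoost lets users define custom loss functions, so it can adapt to many kinds of problems, such as classification, regression, and even ranking. The 2015 version already supports parallel computation and, through a distributed computing framework, processes large datasets far more efficiently. It also adds regularization terms that guard against overfitting and improve the model's ability to generalize.

In practice, the 2015 version provides APIs for several programming languages, including Python, R, and Java. Users adjust the model's behavior through parameters such as the number of trees, the tree depth, and the learning rate. Because Windows is not supported natively, Windows users can run XGBoost through a virtual machine or a tool such as Docker.

Compared with versions released after March 2016, the 2015 version may lack some newer features, such as native Windows support, additional built-in evaluation metrics, and further optimization algorithms. The older version still has value, though: for projects that need only the basic functionality and not the latest features, it offers a stable and efficient solution.

The 2015 version can be applied in many scenarios. In risk control it can help identify fraudulent transactions; in recommender systems it can predict content a user may be interested in; in medical diagnosis it can assist in identifying diseases. In both academic research and industrial practice, the 2015 version of XGBoost is a reliable tool.

In summary, although the 2015 version of XGBoost is limited in some respects, its core strengths (efficiency, stability, and flexibility) keep it widely applicable in machine learning. XGBoost continues to evolve, but each historical version records a step in its development and is worth studying. For newcomers, or for users looking for a specific feature, understanding the 2015 version can offer new ideas and approaches to a problem.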
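To make the custom loss idea above concrete: a gradient boosting framework such as XGBoost works from the first and second derivatives of the loss with respect to the raw prediction. The following is a minimal sketch in plain Python for the logistic loss, not code from this archive; the function names `sigmoid` and `logistic_grad_hess` are illustrative assumptions.

```python
import math


def sigmoid(margin):
    """Map a raw score (margin) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


def logistic_grad_hess(margins, labels):
    """Per-instance gradient and hessian of the logistic loss.

    For label y in {0, 1} and probability p = sigmoid(margin):
      gradient = p - y
      hessian  = p * (1 - p)
    """
    grads, hessians = [], []
    for margin, label in zip(margins, labels):
        p = sigmoid(margin)
        grads.append(p - label)
        hessians.append(p * (1.0 - p))
    return grads, hessians


if __name__ == "__main__":
    # Hypothetical raw scores and labels, for illustration only.
    print(logistic_grad_hess([0.0, 2.0], [1, 0]))
```

The hessian computed here is the quantity that the `min_child_weight` parameter sums over the instances in a child node.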


xgboost-2015 (391 files)

00Index 682B

create_wrap.bat 442B

xgboost.bib 912B

xgboost_assert.c 756B

allreduce_robust.cc 47KB

allreduce_base.cc 32KB

rabit_wrapper.cc 7KB

engine_mpi.cc 6KB

local_recover.cc 4KB

model_recover.cc 4KB

lazy_recover.cc 4KB

engine_empty.cc 4KB

speed_test.cc 3KB

engine.cc 2KB

basic.cc 1012B

lazy_allreduce.cc 1007B

broadcast.cc 475B

engine_mock.cc 474B

engine_base.cc 444B

setup.cfg 41B

mushroom-col.conf 1KB

yearpredMSD.conf 998B

machine.conf 967B

mushroom.conf 947B

mq2008.conf 754B

xgboost4j_wrapper.cpp 19KB

xgboost_wrapper.cpp 18KB

xgboost_R.cpp 11KB

xgboost_main.cpp 11KB

dmlc_simple.cpp 6KB

io.cpp 3KB

updater.cpp 1KB

gbm.cpp 539B

vignette.css 4KB

agaricus-lepiota.data 365KB

machine.data 9KB

DESCRIPTION 1KB

Doxyfile 10KB

agaricus-lepiota.fmap 2KB

.gitignore 735B

.gitignore 324B

.gitignore 57B

.gitignore 43B

.gitignore 30B

.gitignore 27B

.gitignore 18B

.gitignore 17B

.gitignore 15B

quantile.h 26KB

allreduce_robust.h 24KB

allreduce_base.h 20KB

model.h 18KB

param.h 15KB

socket.h 15KB

rabit.h 14KB

xgboost_wrapper.h 13KB

io.h 12KB

rabit-inl.h 11KB

engine.h 11KB

sparse_batch_page.h 9KB

base64-inl.h 8KB

xgboost4j_wrapper.h 7KB

thread_buffer.h 7KB

allreduce_robust-inl.h 6KB

thread.h 6KB

allreduce_mock.h 6KB

data.h 6KB

libsvm_parser.h 6KB

dmatrix.h 6KB

gbm.h 5KB

rabit_wrapper.h 5KB

xgboost_R.h 5KB

utils.h 5KB

config.h 5KB

group_data.h 4KB

objective.h 3KB

evaluation.h 3KB

io.h 3KB

random.h 3KB

fmap.h 2KB

helper_utils.h 2KB

updater.h 2KB

bitmap.h 2KB

io.h 2KB

io.h 1KB

timer.h 1KB

iterator.h 1013B

omp.h 957B

math.h 923B

rabit_serializable.h 637B

sync.h 362B

updater_histmaker-inl.hpp 30KB

updater_colmaker-inl.hpp 30KB

objective-inl.hpp 23KB

evaluation-inl.hpp 19KB

learner-inl.hpp 19KB

gbtree-inl.hpp 19KB

updater_skmaker-inl.hpp 15KB


Binary Classification
====
This is the quick start tutorial for the xgboost CLI version. You can also check out [../../doc/README.md](../../doc/README.md) for links to tutorials in Python or R.

Here we demonstrate how to use XGBoost for a binary classification task. Before getting started, make sure you compile xgboost in the root directory of the project by typing ```make```.
The script runexp.sh can be used to run the demo. Here we use the [mushroom dataset](https://archive.ics.uci.edu/ml/datasets/Mushroom) from the UCI machine learning repository.

### Tutorial
#### Generate Input Data
XGBoost takes LibSVM format. An example of fake input data is below:
```
1 101:1.2 102:0.03
0 1:2.1 10001:300 10002:400
...
```
Each line represents a single instance. In the first line, '1' is the instance label, '101' and '102' are feature indices, and '1.2' and '0.03' are feature values. In the binary classification case, '1' indicates positive samples and '0' indicates negative samples. We also support probability values in [0,1] as labels, to indicate the probability of the instance being positive.

First we will transform the dataset into classic LibSVM format and split the data into a training set and a test set by running:
```
python mapfeat.py
python mknfold.py agaricus.txt 1
```
The two files, 'agaricus.txt.train' and 'agaricus.txt.test', will be used as the training set and the test set.

#### Training
Then we can run the training process:
```
../../xgboost mushroom.conf
```

mushroom.conf is the configuration for both training and testing.
Each line contains an [attribute]=[value] setting:

```conf
# General Parameters, see comment for each definition
# can be gbtree or gblinear
booster = gbtree
# choose logistic regression loss function for binary classification
objective = binary:logistic

# Tree Booster Parameters
# step size shrinkage
eta = 1.0
# minimum loss reduction required to make a further partition
gamma = 1.0
# minimum sum of instance weight(hessian) needed in a child
min_child_weight = 1
# maximum depth of a tree
max_depth = 3

# Task Parameters
# the number of round to do boosting
num_round = 2
# 0 means do not save any model except the final round model
save_period = 0
# The path of training data
data = "agaricus.txt.train"
# The path of validation data, used to monitor training process, here [test] sets name of the validation set
eval[test] = "agaricus.txt.test"
# The path of test data
test:data = "agaricus.txt.test"
```
We use the tree booster and the logistic regression objective in our setting. This means we accomplish the task with classic gradient boosting regression trees (GBRT), a promising method for binary classification.

The parameters shown in the example are the most common ones needed to use xgboost.
If you are interested in more parameter settings, the complete parameter settings and detailed descriptions are [here](../../doc/parameter.md). Besides putting the parameters in the configuration file, we can set them by passing them as arguments:

```
../../xgboost mushroom.conf max_depth=6
```

This means that the parameter max_depth will be set to 6 rather than the 3 in the conf file. On the command line, make sure max_depth=6 is passed as a single argument, i.e. the argument must not contain spaces. When a parameter is set both on the command line and in the config file, the command line setting overrides the config file.

In this example, we use the tree booster for gradient boosting.
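The override rule described above (a `key=value` argument on the command line wins over the same key in the config file) can be sketched in a few lines of Python. This is an illustration of the rule only, not xgboost's actual config parser; `parse_conf` and `apply_overrides` are hypothetical helper names, and the sketch assumes that `#` never appears inside a value.

```python
def parse_conf(text):
    """Parse '[attribute]=[value]' lines, skipping blanks and '#' comments."""
    params = {}
    for raw in text.splitlines():
        # Assumption: '#' always starts a comment and never occurs in a value.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip().strip('"')
    return params


def apply_overrides(params, args):
    """Return a copy of params where each 'key=value' argument wins."""
    merged = dict(params)
    for arg in args:
        key, value = arg.split("=", 1)
        merged[key] = value
    return merged


if __name__ == "__main__":
    conf = 'booster = gbtree\n# maximum depth of a tree\nmax_depth = 3\n'
    print(apply_overrides(parse_conf(conf), ["max_depth=6"]))
```

All values stay strings in this sketch; converting them to numbers is left to whatever consumes the merged parameters.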
If you would like to use the linear booster for regression, you can keep all the parameters except booster and the tree booster parameters, as below:

```conf
# General Parameters
# choose the linear booster
booster = gblinear
...

# Change Tree Booster Parameters into Linear Booster Parameters
# L2 regularization term on weights, default 0
lambda = 0.01
# L1 regularization term on weights, default 0
alpha = 0.01
# L2 regularization term on bias, default 0
lambda_bias = 0.01

# Regression Parameters
...
```

 - […] if ```agaricus.txt.test.buffer``` exists, and automatically loads from the binary buffer if possible. This can speed up training when you train many times. You can disable it by setting ```use_buffer=0```.
 - A buffer file can also be used as standalone input, i.e. if the buffer file exists but the original agaricus.txt.test was removed, xgboost will still run.
* Deviation from LibSVM input format: xgboost is compatible with LibSVM format, with the following minor differences:
 - xgboost allows feature indices to start from 0
 - for binary classification, the label is 1 for positive and 0 for negative, instead of +1,-1
 - the feature indices in each line *do not* need to be sorted

#### Get Predictions
After training, we can use the output model to get the predictions for the test data:
```
../../xgboost mushroom.conf task=pred model_in=0003.model
```
For binary classification, the output predictions are probability confidence scores in [0,1], corresponding to the probability of the label being positive.

#### Dump Model
This is a preliminary feature; so far only tree models support text dump.
XGBoost can write the tree models to text files so that we can inspect the model easily:
```
../../xgboost mushroom.conf task=dump model_in=0003.model name_dump=dump.raw.txt
../../xgboost mushroom.conf task=dump model_in=0003.model fmap=featmap.txt name_dump=dump.nice.txt
```

In this demo, the tree boosters obtained will be printed in dump.raw.txt and dump.nice.txt; the latter is easier to understand because it uses the feature mapping featmap.txt.

Format of ```featmap.txt: <featureid> <featurename> <q or i or int>\n ```:
 - Feature id must be from 0 to number of features, in sorted order.
 - i means this feature is a binary indicator feature
 - q means this feature is a quantitative value, such as age or time, and can be missing
 - int means this feature is an integer value (when int is hinted, the decision boundary will be integer)

#### Monitoring Progress
When you run training, messages are displayed on screen:
```
tree train end, 1 roots, 12 extra nodes, 0 pruned nodes ,max_depth=3
[0] test-error:0.016139
boosting round 1, 0 sec elapsed

tree train end, 1 roots, 10 extra nodes, 0 pruned nodes ,max_depth=3
[1] test-error:0.000000
```
The evaluation messages are printed to stderr, so if you only want to log the evaluation progress, simply type
```
../../xgboost mushroom.conf 2>log.txt
```
Then you can find the following content in log.txt
```
[0] test-error:0.016139
[1] test-error:0.000000
```
We can also monitor both training and test statistics by adding the following lines to the configuration:
```conf
eval[test] = "agaricus.txt.test"
eval[trainname] = "agaricus.txt.train"
```
Run the command again, and the log file becomes
```
[0] test-error:0.016139 trainname-error:0.014433
[1] test-error:0.000000 trainname-error:0.001228
```
The rule is eval[name-printed-in-log] = filename; the file is then added to the monitoring process and evaluated each round.
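Since each evaluation line in log.txt follows the pattern `[round] name-metric:value ...`, the log is easy to post-process. Below is a minimal sketch in plain Python; `parse_eval_line` and `parse_eval_log` are hypothetical helper names, and the sketch assumes that the name chosen in `eval[name]` contains no `-`.

```python
def parse_eval_line(line):
    """Parse one evaluation line such as '[0] test-error:0.016139'.

    Returns (round_number, {(eval_name, metric): value}).
    """
    head, *pairs = line.split()
    if not (head.startswith("[") and head.endswith("]")):
        raise ValueError(f"not an evaluation line: {line!r}")
    round_number = int(head[1:-1])
    scores = {}
    for pair in pairs:
        key, value = pair.rsplit(":", 1)
        # Assumption: the eval name itself contains no '-'.
        name, metric = key.split("-", 1)
        scores[(name, metric)] = float(value)
    return round_number, scores


def parse_eval_log(text):
    """Parse every line that starts with '[' and skip all other lines."""
    return [
        parse_eval_line(line)
        for line in text.splitlines()
        if line.lstrip().startswith("[")
    ]


if __name__ == "__main__":
    log = (
        "[0] test-error:0.016139 trainname-error:0.014433\n"
        "[1] test-error:0.000000 trainname-error:0.001228\n"
    )
    for round_number, scores in parse_eval_log(log):
        print(round_number, scores[("test", "error")])
```

Keying the scores by `(eval_name, metric)` keeps the parser unchanged when more than one metric is reported per evaluation set.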
xgboost also supports monitoring multiple metrics. Suppose we also want to monitor the average log-likelihood of each prediction during training: simply add ```eval_metric=logloss``` to the configuration. Run again, and the log file becomes
```
[0] test-error:0.016139 test-negllik:0.029795 trainname-error:0.014433 trainname-negllik:0.027023
[1] test-error:0.000000 test-negllik:0.000000 trainname-error:0.001228 trainname-negllik:0.002457
```
#### Saving Progress Models
If you want to save a model every two rounds, simply set save_period=2. You will find 0002.model in the current folder. If you want to change the output folder of models, add model_dir=foldername. By default xgboost saves the model of the last round.

#### Continue from Existing Model
If you want to continue boosting from an existing model, say 0002.model, use
```
../../xgboost mushroom.conf model_i
```
