Binary Classification
====
This is the quick start tutorial for the xgboost CLI version. You can also check out [../../doc/README.md](../../doc/README.md) for links to tutorials in Python and R.
Here we demonstrate how to use XGBoost for a binary classification task. Before getting started, make sure you have compiled xgboost by typing ```make``` in the root directory of the project.
The script runexp.sh can be used to run the demo. Here we use the [mushroom dataset](https://archive.ics.uci.edu/ml/datasets/Mushroom) from the UCI Machine Learning Repository.
### Tutorial
#### Generate Input Data
XGBoost takes input in the LibSVM format. An example of fake input data is below:
```
1 101:1.2 102:0.03
0 1:2.1 10001:300 10002:400
...
```
Each line represents a single instance. In the first line, '1' is the instance label, '101' and '102' are feature indices, and '1.2' and '0.03' are the corresponding feature values. In the binary classification case, '1' indicates positive samples and '0' indicates negative samples. XGBoost also supports probability values in [0,1] as labels, indicating the probability of the instance being positive.
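Lines in this sparse format are easy to produce with a few lines of Python. The sketch below is purely illustrative; the feature indices and values are made up, not taken from the mushroom data:

```python
# Serialize (label, sparse features) pairs into LibSVM-format lines.
def to_libsvm_line(label, features):
    """features: dict mapping feature index -> feature value."""
    parts = [str(label)]
    for idx in sorted(features):  # indices in ascending order
        parts.append(f"{idx}:{features[idx]}")
    return " ".join(parts)

rows = [
    (1, {101: 1.2, 102: 0.03}),
    (0, {1: 2.1, 10001: 300, 10002: 400}),
]
print("\n".join(to_libsvm_line(label, feats) for label, feats in rows))
```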
First we will transform the dataset into the classic LibSVM format and split the data into a training set and a test set by running:
```
python mapfeat.py
python mknfold.py agaricus.txt 1
```
The two resulting files, 'agaricus.txt.train' and 'agaricus.txt.test', will be used as the training set and test set.
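The splitting step can be sketched roughly as follows. This is a simplified illustration of a random train/test split, not the actual logic of mknfold.py (which may partition the data differently):

```python
import random

# Randomly assign each LibSVM line to a train or test partition.
# A fixed seed makes the split reproducible across runs.
def split_lines(lines, test_fraction=0.2, seed=10):
    rng = random.Random(seed)
    train, test = [], []
    for line in lines:
        (test if rng.random() < test_fraction else train).append(line)
    return train, test

lines = [f"{i % 2} {i}:1" for i in range(10)]  # tiny fake dataset
train, test = split_lines(lines)
print(len(train), len(test))
```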
#### Training
Then we can run the training process:
```
../../xgboost mushroom.conf
```
mushroom.conf is the configuration for both training and testing. Each line contains an [attribute]=[value] setting:
```conf
# General Parameters, see comment for each definition
# can be gbtree or gblinear
booster = gbtree
# choose logistic regression loss function for binary classification
objective = binary:logistic
# Tree Booster Parameters
# step size shrinkage
eta = 1.0
# minimum loss reduction required to make a further partition
gamma = 1.0
# minimum sum of instance weight(hessian) needed in a child
min_child_weight = 1
# maximum depth of a tree
max_depth = 3
# Task Parameters
# the number of rounds for boosting
num_round = 2
# 0 means do not save any model except the final round model
save_period = 0
# The path of training data
data = "agaricus.txt.train"
# The path of validation data, used to monitor training process, here [test] sets name of the validation set
eval[test] = "agaricus.txt.test"
# The path of test data
test:data = "agaricus.txt.test"
```
We use the tree booster and the logistic regression objective in our setting. This means we accomplish the task using classic gradient boosted regression trees (GBRT), a well-established method for binary classification.
The parameters shown in the example are the most common ones needed to use xgboost.
If you are interested in more parameter settings, the complete parameter settings and detailed descriptions are [here](../../doc/parameter.md). Besides putting the parameters in the configuration file, we can set them by passing them as arguments as below:
```
../../xgboost mushroom.conf max_depth=6
```
This means that the parameter max_depth will be set to 6 rather than the 3 given in the conf file. When using the command line, make sure max_depth=6 is passed as a single argument, i.e. do not include spaces in the argument. When a parameter is set both on the command line and in the config file, the command-line setting overrides the one in the config file.
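To illustrate how such [attribute]=[value] settings combine, here is a hypothetical sketch of the merge logic, with command-line values winning. This is not xgboost's actual parser, just the precedence rule it describes:

```python
# Parse "name = value" pairs into a dict, tolerating surrounding spaces.
def parse_settings(pairs):
    settings = {}
    for pair in pairs:
        name, _, value = pair.partition("=")
        settings[name.strip()] = value.strip()
    return settings

config = parse_settings(["max_depth = 3", "eta = 1.0"])
overrides = parse_settings(["max_depth=6"])  # from the command line
config.update(overrides)  # command-line settings override the config file
print(config["max_depth"])  # prints "6"
```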
In this example, we use the tree booster for gradient boosting. If you would like to use the linear booster instead, keep all the parameters except booster and the tree booster parameters, as below:
```conf
# General Parameters
# choose the linear booster
booster = gblinear
...
# Change Tree Booster Parameters into Linear Booster Parameters
# L2 regularization term on weights, default 0
lambda = 0.01
# L1 regularization term on weights, default 0
alpha = 0.01
# L2 regularization term on bias, default 0
lambda_bias = 0.01
# Regression Parameters
...
```
A few notes on how xgboost handles input data:
* Buffer files: xgboost detects whether ```agaricus.txt.test.buffer``` exists and automatically loads from the binary buffer when possible; this can speed up training when you train many times on the same data. You can disable this behavior by setting ```use_buffer=0```.
  - A buffer file can also be used as standalone input, i.e. if the buffer file exists but the original agaricus.txt.test has been removed, xgboost will still run.
* Deviation from the LibSVM input format: xgboost is compatible with LibSVM format, with the following minor differences:
  - xgboost allows feature indices to start from 0
  - for binary classification, the label is 1 for positive and 0 for negative, instead of +1/-1
  - the feature indices in each line *do not* need to be sorted
#### Get Predictions
After training, we can use the output model to get predictions for the test data:
```
../../xgboost mushroom.conf task=pred model_in=0003.model
```
For binary classification, the output predictions are probability confidence scores in [0,1], corresponding to the probability of the label being positive.
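The prediction output contains one score per line. Turning these probabilities into hard 0/1 labels is straightforward; a minimal sketch, assuming the conventional 0.5 cutoff and that the predictions file holds one float per line:

```python
# Convert probability scores (one per line) into hard 0/1 labels.
def threshold_predictions(lines, cutoff=0.5):
    return [1 if float(line) >= cutoff else 0 for line in lines]

# Example with in-memory scores instead of reading the prediction file:
scores = ["0.982", "0.013", "0.671"]
print(threshold_predictions(scores))  # prints [1, 0, 1]
```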
#### Dump Model
This is a preliminary feature; so far only tree models support text dump. XGBoost can dump the tree models into text files so we can inspect the model easily:
```
../../xgboost mushroom.conf task=dump model_in=0003.model name_dump=dump.raw.txt
../../xgboost mushroom.conf task=dump model_in=0003.model fmap=featmap.txt name_dump=dump.nice.txt
```
In this demo, the tree boosters obtained will be dumped into dump.raw.txt and dump.nice.txt; the latter is easier to understand because it uses the feature mapping in featmap.txt.
Format of ```featmap.txt```: ```<featureid> <featurename> <q or i or int>\n```:
  - Feature ids must run from 0 up to the number of features, in sorted order.
  - i means the feature is a binary indicator feature
  - q means the feature is a quantitative value, such as age or time, and can be missing
  - int means the feature is integer-valued (when int is hinted, the decision boundary will be an integer)
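A featmap.txt in this format can be generated programmatically. A minimal sketch, using hypothetical feature names rather than the real mushroom feature map:

```python
# Build featmap.txt content in the "<featureid> <featurename> <type>" format.
# Feature ids are assigned by position, satisfying the sorted-order rule.
features = [
    ("cap-shape=bell", "i"),  # binary indicator (hypothetical name)
    ("odor=almond", "i"),
    ("ring-number", "int"),   # integer-valued
]

def make_featmap(feats):
    return "".join(f"{i} {name} {ftype}\n" for i, (name, ftype) in enumerate(feats))

print(make_featmap(features), end="")
```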
#### Monitoring Progress
When you run training, you will see messages like the following displayed on screen:
```
tree train end, 1 roots, 12 extra nodes, 0 pruned nodes ,max_depth=3
[0] test-error:0.016139
boosting round 1, 0 sec elapsed
tree train end, 1 roots, 10 extra nodes, 0 pruned nodes ,max_depth=3
[1] test-error:0.000000
```
The evaluation messages are printed to stderr, so if you want to log only the evaluation progress, simply redirect stderr:
```
../../xgboost mushroom.conf 2>log.txt
```
Then you can find the following content in log.txt
```
[0] test-error:0.016139
[1] test-error:0.000000
```
We can also monitor both training and test statistics by adding the following lines to the configuration file:
```conf
eval[test] = "agaricus.txt.test"
eval[trainname] = "agaricus.txt.train"
```
Run the command again, and the log file becomes:
```
[0] test-error:0.016139 trainname-error:0.014433
[1] test-error:0.000000 trainname-error:0.001228
```
The rule is eval[name-printed-in-log] = filename; the file will then be added to the monitoring process and evaluated each round.
xgboost also supports monitoring multiple metrics. Suppose we also want to monitor the average log-likelihood of each prediction during training; simply add ```eval_metric=logloss``` to the configuration file. Run again, and the log file becomes:
```
[0] test-error:0.016139 test-negllik:0.029795 trainname-error:0.014433 trainname-negllik:0.027023
[1] test-error:0.000000 test-negllik:0.000000 trainname-error:0.001228 trainname-negllik:0.002457
```
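For reference, the two monitored quantities can be recomputed from labels and predicted probabilities. A sketch of both metrics; the clipping epsilon below is our own assumption for numerical safety, not necessarily what xgboost uses internally:

```python
import math

# Classification error: fraction of instances where thresholding the
# predicted probability at 0.5 disagrees with the true label.
def error_rate(labels, preds, cutoff=0.5):
    wrong = sum((p >= cutoff) != bool(y) for y, p in zip(labels, preds))
    return wrong / len(labels)

# Average negative log-likelihood, i.e. the "logloss" metric.
def neg_log_likelihood(labels, preds, eps=1e-15):
    total = 0.0
    for y, p in zip(labels, preds):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

labels = [1, 0, 1, 0]
preds = [0.9, 0.6, 0.8, 0.4]
print(error_rate(labels, preds))                     # 0.25
print(round(neg_log_likelihood(labels, preds), 4))   # 0.4389
```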
#### Saving Progress Models
If you want to save a model every two rounds, simply set save_period=2. You will find 0002.model in the current folder. If you want to change the output folder for models, add model_dir=foldername. By default, xgboost saves only the model from the last round.
#### Continue from Existing Model
If you want to continue boosting from an existing model, say 0002.model, use:
```
../../xgboost mushroom.conf model_in=0002.model
```