Indiana University, Bloomington, USA
School of Informatics and Computing
Red-RF: Reduced Random Forest for big data using priority voting & dynamic data reduction
README file
May 2015
=============================================================================================
CITATION:
=========
Please cite the following paper(s) when using Red-RF:
H. Mohsen, H. Kurban, K. Zimmer, M. Jenne and M. Dalkilic. Red-RF: Reduced Random Forests using priority voting & dynamic data reduction. Proceedings of the 4th IEEE International Congress on Big Data (IEEE BigData Congress'2015), 118-125, New York, NY, June-July 2015.
H. Mohsen, H. Kurban, M. Jenne and M. Dalkilic (2014). A New Set of Random Forests with Varying Dynamic Data Reduction and Voting Techniques. Proceedings of the 2014 IEEE International Conference on Data Science and Advanced Analytics (IEEE DSAA’2014), 309-405, Shanghai, China, October-November 2014.
CODE EXECUTION:
===============
The code runs as-is on the included cancer data set; the input file must be in the directory the program is run from.
To run the code on a new data set, adjust the following global variables in the code as needed:
// Names of the attributes, excluding the label. It is CRUCIAL that this array has exactly n items if the data has n attributes. The names themselves do not matter, so you may simply call them "att1", "att2", etc.
attributeNames={"CT","UCSi","UCSh","MA","SICZ","BN","BC","NN","M"};
// Change to the prefix of the new input file.
public static String dataPrefix="cancer";
// The impurity measure used: 2 for Gini index, 1 for entropy, 0 for error rate. The default is 2 (Gini index).
public static int typeOfImpurity=2;
// The maximum branching factor in forest trees.
public static int numberOfBranches=5;
// N', the size of the sample used to build each tree in the random forest.
public static int NPrime = 15;
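// m, the number of attributes randomly chosen at each split while building the forest's trees. The default below is the square root of the number of attributes, rounded up (3 for the 9 cancer attributes).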
public static int m = (int) Math.ceil(Math.sqrt(attributeNames.length));
// Number of trees in the original whole forest
public static int forestSize =150;
// Size of the dataset (number of rows/records)
public static int dataSize=683;
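For reference, the three impurity measures selectable through typeOfImpurity can be sketched as below. This is a minimal sketch based on the textbook definitions, not the Red-RF source: the class name ImpuritySketch, the method impurity and the example class counts are hypothetical, and only the codes 2, 1 and 0 are taken from the comment above.

```java
final class ImpuritySketch {

    // Hypothetical helper (not part of Red-RF): impurity of a tree node,
    // given the number of records of each class that reach the node.
    // The type codes follow the README: 2 = Gini index, 1 = entropy, 0 = error rate.
    static double impurity(int[] classCounts, int type) {
        int total = 0;
        for (int count : classCounts) {
            total += count;
        }
        if (total == 0) {
            return 0.0; // an empty node is treated as pure
        }
        double sumOfSquares = 0.0;
        double entropy = 0.0;
        double largestShare = 0.0;
        for (int count : classCounts) {
            double p = (double) count / total;
            sumOfSquares += p * p;
            if (p > 0) {
                entropy -= p * (Math.log(p) / Math.log(2)); // log base 2
            }
            largestShare = Math.max(largestShare, p);
        }
        switch (type) {
            case 2:
                return 1.0 - sumOfSquares;   // Gini index
            case 1:
                return entropy;              // entropy in bits
            case 0:
                return 1.0 - largestShare;   // misclassification error rate
            default:
                throw new IllegalArgumentException("type must be 0, 1 or 2");
        }
    }

    public static void main(String[] args) {
        int[] counts = {6, 4}; // illustrative counts for labels 0 and 1
        System.out.println("Gini index: " + impurity(counts, 2));
        System.out.println("Entropy:    " + impurity(counts, 1));
        System.out.println("Error rate: " + impurity(counts, 0));
    }
}
```

All three measures are 0 for a node that holds a single class and reach their maximum when the classes are evenly mixed, so they usually rank candidate splits in a similar way.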
The code performs 10-fold cross-validation. When execution finishes, the average accuracy and execution times are printed to the console. An example is attached (screenshot.png).
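This README does not describe how the folds are formed. One common way to set up k-fold cross-validation is sketched below, as an illustration only: the class name CrossValidationSketch, the method makeFolds and the fixed seed are assumptions, not taken from the Red-RF code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class CrossValidationSketch {

    // Hypothetical helper (not part of Red-RF): shuffle the row indices
    // 0..dataSize-1 and deal them round-robin into k folds.
    static List<List<Integer>> makeFolds(int dataSize, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < dataSize; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < dataSize; i++) {
            folds.get(i % k).add(indices.get(i));
        }
        return folds;
    }

    public static void main(String[] args) {
        int dataSize = 683; // same value as the dataSize variable above
        List<List<Integer>> folds = makeFolds(dataSize, 10, 42L);
        for (int f = 0; f < folds.size(); f++) {
            // Fold f is held out for testing; the other nine folds form the training set.
            int testSize = folds.get(f).size();
            System.out.println("Fold " + f + ": test rows = " + testSize
                    + ", training rows = " + (dataSize - testSize));
        }
    }
}
```

In each of the 10 rounds one fold is held out for testing and a forest is built on the remaining rows; the reported accuracy is the average over the rounds.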
INPUT FILE:
===========
In the input CSV file:
- All attribute values must be numerical. Preprocess categorical values into numbers first (for example, an attribute with 2 categories could be converted to the values 1 and 4).
- Labels must be 0 and 1 without quotation marks (not "0" and "1"), and they must be in the last column of the input CSV file.
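The two rules above can be checked before running Red-RF. The sketch below is one way to do so and rests on assumptions not stated in this README: fields are separated by commas, there is no header row, and the names InputCheckSketch and checkRow are hypothetical.

```java
final class InputCheckSketch {

    // Hypothetical helper (not part of Red-RF): returns null if the row follows
    // the input rules, otherwise a short description of the first problem found.
    static String checkRow(String line, int numAttributes) {
        String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
        if (fields.length != numAttributes + 1) {
            return "expected " + (numAttributes + 1) + " columns, found " + fields.length;
        }
        for (int i = 0; i < numAttributes; i++) {
            try {
                // Note: parseDouble is lenient and also accepts values such as "NaN".
                Double.parseDouble(fields[i].trim());
            } catch (NumberFormatException e) {
                return "non-numerical attribute value in column " + (i + 1);
            }
        }
        String label = fields[numAttributes].trim();
        if (!label.equals("0") && !label.equals("1")) {
            return "label in the last column must be 0 or 1 without quotation marks";
        }
        return null;
    }

    public static void main(String[] args) {
        int numAttributes = 9; // attributeNames.length for the cancer data
        String[] sampleRows = {
            "5,1,1,1,2,1,3,1,1,0",     // valid row
            "5,1,1,1,2,?,3,1,1,0",     // non-numerical attribute value
            "5,1,1,1,2,1,3,1,1,\"1\""  // quoted label
        };
        for (String row : sampleRows) {
            String problem = checkRow(row, numAttributes);
            System.out.println(row + " -> " + (problem == null ? "OK" : problem));
        }
    }
}
```

To check a whole file, the same method can be applied to every line returned by java.nio.file.Files.readAllLines; the number of valid rows is then the value to use for dataSize.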
HEAP & ROC FILES:
=================
Heap and ROC files are generated in the running directory.
To generate the heap distribution or ROC plots for a new data set:
For the heap distribution:
- When execution finishes, the generated file is called prefix_heap.txt. Run the Heap commands in the attached R-commands.txt (histogram generation).
For the ROC plot:
- When execution finishes, the generated file is called prefix_ROC.txt. Run the ROC commands in the attached R-commands.txt. These commands are based on the pROC R library.
CONTACT:
========
For inquiries, please contact us at [email protected] (or @indiana.edu).