School of Data and Computer Science, Sun Yat-sen University
Random Forest
18308045 谷正阳
June 13, 2021
Contents
1 Importance function
1.1 Purity function
1.2 Information gain
1.3 Information gain ratio
1.4 Negative Gini gain
2 Discretization
2.1 The first step: to calculate all split points
2.2 The second step: to choose the best split point
2.3 Comparison between NumPy version and PyTorch version
3 Experiment
3.1 Different importance functions
3.2 Different types of features
3.3 Different number of features
3.4 Different number of trees in a forest
4 Conclusions
1 Importance function
This function is used to evaluate how good a split of the dataset is. The basic idea is that a split is good if, in each subset it produces, one label dominates that subset.
1.1 Purity function
To evaluate whether a label dominates a subset, we need to build a special function. Since we do binary classification here, we just feed the probability of one label within the subset into the function. The return value should be high if the probability is close to 0 or 1; otherwise, the value should be low. There are two common functions for evaluating the purity of a set: one is entropy, the other is the negative Gini index. Entropy has the form of

B(q) = −(q log(q) + (1 − q) log(1 − q)), (1)

while the negative Gini index has the form of

Neg_Gini(p) = −(1 − (p^2 + (1 − p)^2)). (2)

Note that entropy is largest when the set is most mixed, so the code below works with its negative, −B(q), which is highest for pure subsets.
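A quick numeric check (an illustration added here, not code from the report) makes the two shapes concrete; both measures peak at 0 for pure subsets and bottom out at p = 0.5:

import numpy as np
from scipy.special import xlogy  # xlogy(0, 0) == 0, so p = 0 and p = 1 are safe

def neg_entropy(q):
    # -B(q) from equation (1)
    return xlogy(q, q) + xlogy(1 - q, 1 - q)

def neg_gini(p):
    # Neg_Gini(p) from equation (2)
    return -(1 - (p**2 + (1 - p)**2))

for p in [0.0, 0.1, 0.5, 0.9, 1.0]:
    print(f"p={p:.1f}  neg_entropy={neg_entropy(p):+.3f}  neg_gini={neg_gini(p):+.3f}")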
Based on these two purity functions, we build three importance functions.
1.2 Information gain
Information gain measures the reduction of entropy achieved by the split. It has the form of

Gain(A) = B(p/(p + n)) − Σ_{v=1}^{V} ((p_v + n_v)/(p + n)) · B(p_v/(p_v + n_v)). (3)

Because after the discretization every attribute's domain is binary, and p and p + n are the same for every candidate split being compared, the quantity actually used here is

Neg_Remainder_Mul_N(A) = (p_0 + n_0) · (−B(p_0/(p_0 + n_0))) + (p_1 + n_1) · (−B(p_1/(p_1 + n_1))). (4)
The function NEG_REMAINDER_MUL_N below takes all the features as input and returns an array containing the Neg_Remainder_Mul_N value of each one.
import numpy as np
from scipy.special import xlogy  # xlogy(0, 0) == 0, so one-label branches give 0, not NaN

def NEG_B(q):
    I_q = 1 - q
    return xlogy(q, q) + xlogy(I_q, I_q)

def NEG_REMAINDER_MUL_N(S, F):
    label = S[:, -1:]   # label column, shape (m, 1)
    S1 = S[:, F]        # selected feature columns, shape (m, len(F))

    p1 = np.sum(S1 & label, axis=0)  # positives in each feature's "1" branch
    N1 = np.sum(S1, axis=0)          # size of each "1" branch

    p0 = np.sum(label) - p1          # positives in each "0" branch
    N0 = S.shape[0] - N1             # size of each "0" branch

    # Summing along axis=0 keeps one score per feature; both branches are
    # assumed non-empty, otherwise p/N divides by zero.
    return N1 * NEG_B(p1 / N1) + N0 * NEG_B(p0 / N0)
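For instance, on a small boolean dataset (a toy example made up here, not data from the experiments), the feature whose value matches the label perfectly gets the highest score:

# Columns: three boolean features, then the label.
S = np.array([
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
])
scores = NEG_REMAINDER_MUL_N(S, np.array([0, 1, 2]))
print(scores)             # feature 0 scores 0 (perfect split); the others are negative
print(np.argmax(scores))  # -> 0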
1.3 Information gain ratio
The information gain (3) has a weakness: if the domain of an attribute is too big, the second term Σ_{v=1}^{V} ((p_v + n_v)/(p + n)) · B(p_v/(p_v + n_v)) can become really small. Therefore, the information gain prefers attributes with larger domains. The information gain ratio has the form of

Gain_ratio(A) = Gain(A) / (−Σ_{v=1}^{V} ((p_v + n_v)/(p + n)) · log((p_v + n_v)/(p + n))), (5)

which in effect adds a penalty to the information gain (3). If the domain of an attribute is large, the denominator −Σ_{v=1}^{V} ((p_v + n_v)/(p + n)) · log((p_v + n_v)/(p + n)) will be large, so its gain ratio is small.
However, since all the attributes here have binary domains, the information gain ratio is the same as the information gain.
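A minimal sketch of equation (5) built on the helpers above (GAIN_RATIO is a name introduced for illustration, not a function from the report; it assumes both branches of every split are non-empty):

def GAIN_RATIO(S, F):
    m = S.shape[0]
    # Denominator of (5): the split information of each binary feature.
    w1 = np.sum(S[:, F], axis=0) / m
    w0 = 1 - w1
    split_info = -(xlogy(w1, w1) + xlogy(w0, w0))

    # Recover Gain(A) from Neg_Remainder_Mul_N: the remainder equals
    # -NEG_REMAINDER_MUL_N / m, and B(p/(p+n)) is constant across features.
    q = np.mean(S[:, -1])
    gain = -(xlogy(q, q) + xlogy(1 - q, 1 - q)) + NEG_REMAINDER_MUL_N(S, F) / m
    return gain / split_info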
1.4 Negative Gini gain
This has a form similar to the information gain (3), since it effectively replaces log(p) with its first-order approximation p − 1. It can be computed faster than the information gain because p − 1 is cheaper to evaluate than log(p).
def NEG_GINI_MINUS_1(p):
    return p**2 + (1 - p)**2

def NEG_GINI_INDEX_MINUS_1_MUL_N(S, F):
    label = S[:, -1:]
    S1 = S[:, F]

    p1 = np.sum(S1 & label, axis=0)
    N1 = np.sum(S1, axis=0)

    p0 = np.sum(label) - p1
    N0 = S.shape[0] - N1

    # The listing is truncated in the preview; the return below is
    # reconstructed by analogy with NEG_REMAINDER_MUL_N.
    return N1 * NEG_GINI_MINUS_1(p1 / N1) + N0 * NEG_GINI_MINUS_1(p0 / N0)
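Used the same way as NEG_REMAINDER_MUL_N; on the toy dataset above, both criteria agree on the best feature (a quick consistency check, not a result from the report):

gini_scores = NEG_GINI_INDEX_MINUS_1_MUL_N(S, np.array([0, 1, 2]))
entropy_scores = NEG_REMAINDER_MUL_N(S, np.array([0, 1, 2]))
assert np.argmax(gini_scores) == np.argmax(entropy_scores)  # both pick feature 0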