School of Data and Computer Science, Sun Yat-sen University
Random Forest
18308045 谷正阳
June 13, 2021
Contents
1 Importance function
1.1 Purity function
1.2 Information gain
1.3 Information gain ratio
1.4 Negative Gini gain
2 Discretization
2.1 The first step: to calculate all split points
2.2 The second step: to choose the best split point
2.3 Comparison between NumPy version and PyTorch version
3 Experiment
3.1 Different importance functions
3.2 Different types of features
3.3 Different number of features
3.4 Different number of trees in a forest
4 Conclusions
1 Importance function
This function is used to evaluate how good a split of the dataset is. The basic idea is that a split is good if, in each subset it produces, one label dominates that subset.
1.1 Purity function
To evaluate whether a label dominates a subset, we need to build a special function. Since we do binary classification here, we just feed the probability of one label within the subset into the function. The return value should be high if the probability is close to 0 or 1; otherwise, the value should be low. There are two common functions for evaluating the purity of a set: one is entropy, the other is the negative Gini index. Entropy has the form of

B(q) = −(q log(q) + (1 − q) log(1 − q)), (1)

while the negative Gini index has the form of

Neg_Gini(p) = −(1 − (p^2 + (1 − p)^2)). (2)

Note that entropy is largest when the set is most mixed, so the code below works with its negative, −B(q), which is highest for pure subsets.
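A quick numeric check (an illustration added here, not code from the report) makes the two shapes concrete; both measures peak at 0 for pure subsets and bottom out at p = 0.5:

import numpy as np
from scipy.special import xlogy  # xlogy(0, 0) == 0, so p = 0 and p = 1 are safe

def neg_entropy(q):
    # -B(q) from equation (1)
    return xlogy(q, q) + xlogy(1 - q, 1 - q)

def neg_gini(p):
    # Neg_Gini(p) from equation (2)
    return -(1 - (p**2 + (1 - p)**2))

for p in [0.0, 0.1, 0.5, 0.9, 1.0]:
    print(f"p={p:.1f}  neg_entropy={neg_entropy(p):+.3f}  neg_gini={neg_gini(p):+.3f}")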
Based on these two purity functions, we build three importance functions.
1.2 Information gain
Information gain measures the reduction of entropy achieved by the split. It has the form of

Gain(A) = B(p/(p + n)) − Σ_{v=1}^{V} ((p_v + n_v)/(p + n)) · B(p_v/(p_v + n_v)). (3)

Because after the discretization every attribute's domain is binary, and p and p + n are the same for every candidate split being compared, the quantity actually used here is

Neg_Remainder_Mul_N(A) = (p_0 + n_0) · (−B(p_0/(p_0 + n_0))) + (p_1 + n_1) · (−B(p_1/(p_1 + n_1))). (4)
The function NEG_REMAINDER_MUL_N below takes all the features as input and returns an array containing the Neg_Remainder_Mul_N value of each one.
import numpy as np
from scipy.special import xlogy  # xlogy(0, 0) == 0, so one-label branches give 0, not NaN

def NEG_B(q):
    I_q = 1 - q
    return xlogy(q, q) + xlogy(I_q, I_q)

def NEG_REMAINDER_MUL_N(S, F):
    label = S[:, -1:]   # label column, shape (m, 1)
    S1 = S[:, F]        # selected feature columns, shape (m, len(F))

    p1 = np.sum(S1 & label, axis=0)  # positives in each feature's "1" branch
    N1 = np.sum(S1, axis=0)          # size of each "1" branch

    p0 = np.sum(label) - p1          # positives in each "0" branch
    N0 = S.shape[0] - N1             # size of each "0" branch

    # Summing along axis=0 keeps one score per feature; both branches are
    # assumed non-empty, otherwise p/N divides by zero.
    return N1 * NEG_B(p1 / N1) + N0 * NEG_B(p0 / N0)
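For instance, on a small boolean dataset (a toy example made up here, not data from the experiments), the feature whose value matches the label perfectly gets the highest score:

# Columns: three boolean features, then the label.
S = np.array([
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
])
scores = NEG_REMAINDER_MUL_N(S, np.array([0, 1, 2]))
print(scores)             # feature 0 scores 0 (perfect split); the others are negative
print(np.argmax(scores))  # -> 0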
1.3 Information gain ratio
The information gain (3) has a weakness: if the domain of an attribute is too big, the second term Σ_{v=1}^{V} ((p_v + n_v)/(p + n)) · B(p_v/(p_v + n_v)) can become really small. Therefore, the information gain prefers attributes with larger domains. The information gain ratio has the form of

Gain_ratio(A) = Gain(A) / (−Σ_{v=1}^{V} ((p_v + n_v)/(p + n)) · log((p_v + n_v)/(p + n))), (5)

which in effect adds a penalty to the information gain (3). If the domain of an attribute is large, the denominator −Σ_{v=1}^{V} ((p_v + n_v)/(p + n)) · log((p_v + n_v)/(p + n)) will be large, so its gain ratio is small.
However, since all the attributes here have binary domains, the information gain ratio is the same as the information gain.
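A minimal sketch of equation (5) built on the helpers above (GAIN_RATIO is a name introduced for illustration, not a function from the report; it assumes both branches of every split are non-empty):

def GAIN_RATIO(S, F):
    m = S.shape[0]
    # Denominator of (5): the split information of each binary feature.
    w1 = np.sum(S[:, F], axis=0) / m
    w0 = 1 - w1
    split_info = -(xlogy(w1, w1) + xlogy(w0, w0))

    # Recover Gain(A) from Neg_Remainder_Mul_N: the remainder equals
    # -NEG_REMAINDER_MUL_N / m, and B(p/(p+n)) is constant across features.
    q = np.mean(S[:, -1])
    gain = -(xlogy(q, q) + xlogy(1 - q, 1 - q)) + NEG_REMAINDER_MUL_N(S, F) / m
    return gain / split_info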
1.4 Negative Gini gain
This has a form similar to the information gain (3), since it effectively replaces log(p) with its first-order approximation p − 1. It can be computed faster than the information gain because p − 1 is cheaper to evaluate than log(p).
def NEG_GINI_MINUS_1(p):
    return p**2 + (1 - p)**2

def NEG_GINI_INDEX_MINUS_1_MUL_N(S, F):
    label = S[:, -1:]
    S1 = S[:, F]

    p1 = np.sum(S1 & label, axis=0)
    N1 = np.sum(S1, axis=0)

    p0 = np.sum(label) - p1
    N0 = S.shape[0] - N1

    # The listing is truncated in the preview; the return below is
    # reconstructed by analogy with NEG_REMAINDER_MUL_N.
    return N1 * NEG_GINI_MINUS_1(p1 / N1) + N0 * NEG_GINI_MINUS_1(p0 / N0)
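Used the same way as NEG_REMAINDER_MUL_N; on the toy dataset above, both criteria agree on the best feature (a quick consistency check, not a result from the report):

gini_scores = NEG_GINI_INDEX_MINUS_1_MUL_N(S, np.array([0, 1, 2]))
entropy_scores = NEG_REMAINDER_MUL_N(S, np.array([0, 1, 2]))
assert np.argmax(gini_scores) == np.argmax(entropy_scores)  # both pick feature 0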