Chapter 1
C4.5
Naren Ramakrishnan

Contents
1.1 Introduction
1.2 Algorithm Description
1.3 C4.5 Features
    1.3.1 Tree Pruning
    1.3.2 Improved Use of Continuous Attributes
    1.3.3 Handling Missing Values
    1.3.4 Inducing Rulesets
1.4 Discussion on Available Software Implementations
1.5 Two Illustrative Examples
    1.5.1 Golf Dataset
    1.5.2 Soybean Dataset
1.6 Advanced Topics
    1.6.1 Mining from Secondary Storage
    1.6.2 Oblique Decision Trees
    1.6.3 Feature Selection
    1.6.4 Ensemble Methods
    1.6.5 Classification Rules
    1.6.6 Redescriptions
1.7 Exercises
References
1.1 Introduction
C4.5 [30] is a suite of algorithms for classification problems in machine learning and
data mining. It is targeted at supervised learning: Given an attribute-valued dataset
where instances are described by collections of attributes and belong to one of a set
of mutually exclusive classes, C4.5 learns a mapping from attribute values to classes
that can be applied to classify new, unseen instances. For instance, see Figure 1.1
where rows denote specific days, attributes denote weather conditions on the given
day, and the class denotes whether the conditions are conducive to playing golf.
Thus, each row denotes an instance, described by values for attributes such as Outlook (a ternary-valued random variable), Temperature (continuous-valued), Humidity (also continuous-valued), and Windy (binary), and the class is the Boolean PlayGolf? class variable.
Day Outlook Temperature Humidity Windy Play Golf?
1 Sunny 85 85 False No
2 Sunny 80 90 True No
3 Overcast 83 78 False Yes
4 Rainy 70 96 False Yes
5 Rainy 68 80 False Yes
6 Rainy 65 70 True No
7 Overcast 64 65 True Yes
8 Sunny 72 95 False No
9 Sunny 69 70 False Yes
10 Rainy 75 80 False Yes
11 Sunny 75 70 True Yes
12 Overcast 72 90 True Yes
13 Overcast 81 75 False Yes
14 Rainy 71 80 True No
Figure 1.1 Example dataset input to C4.5.
All of the data in Figure 1.1 constitutes “training data”: the intent is to learn a mapping from this dataset and then apply it to other, new instances that present values only for the attributes, in order to predict the value of the class random variable.
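To keep later examples concrete, the code sketches in this chapter assume the dataset of Figure 1.1 is held in a small in-memory structure such as the one below. The names GOLF_DATA and TARGET are ours, purely for illustration; C4.5 itself reads its input from separate names and data files.

# The fourteen instances of Figure 1.1 as plain Python records.
# GOLF_DATA and TARGET are illustrative names, not part of C4.5.
GOLF_DATA = [
    {"Outlook": "Sunny",    "Temperature": 85, "Humidity": 85, "Windy": False, "PlayGolf": "No"},
    {"Outlook": "Sunny",    "Temperature": 80, "Humidity": 90, "Windy": True,  "PlayGolf": "No"},
    {"Outlook": "Overcast", "Temperature": 83, "Humidity": 78, "Windy": False, "PlayGolf": "Yes"},
    {"Outlook": "Rainy",    "Temperature": 70, "Humidity": 96, "Windy": False, "PlayGolf": "Yes"},
    {"Outlook": "Rainy",    "Temperature": 68, "Humidity": 80, "Windy": False, "PlayGolf": "Yes"},
    {"Outlook": "Rainy",    "Temperature": 65, "Humidity": 70, "Windy": True,  "PlayGolf": "No"},
    {"Outlook": "Overcast", "Temperature": 64, "Humidity": 65, "Windy": True,  "PlayGolf": "Yes"},
    {"Outlook": "Sunny",    "Temperature": 72, "Humidity": 95, "Windy": False, "PlayGolf": "No"},
    {"Outlook": "Sunny",    "Temperature": 69, "Humidity": 70, "Windy": False, "PlayGolf": "Yes"},
    {"Outlook": "Rainy",    "Temperature": 75, "Humidity": 80, "Windy": False, "PlayGolf": "Yes"},
    {"Outlook": "Sunny",    "Temperature": 75, "Humidity": 70, "Windy": True,  "PlayGolf": "Yes"},
    {"Outlook": "Overcast", "Temperature": 72, "Humidity": 90, "Windy": True,  "PlayGolf": "Yes"},
    {"Outlook": "Overcast", "Temperature": 81, "Humidity": 75, "Windy": False, "PlayGolf": "Yes"},
    {"Outlook": "Rainy",    "Temperature": 71, "Humidity": 80, "Windy": True,  "PlayGolf": "No"},
]
TARGET = "PlayGolf"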
C4.5, designed by J. Ross Quinlan, is so named because it is a descendant of the
ID3 approach to inducing decision trees [25], which in turn is the third incarnation in
a series of “iterative dichotomizers.” A decision tree is a series of questions systemat-
ically arranged so that each question queries an attribute (e.g., Outlook) and branches
based on the value of the attribute. At the leaves of the tree are placed predictions of
the class variable (here, PlayGolf?). A decision tree is hence not unlike the series of
troubleshooting questions you might find in your car’s manual to help determine what
could be wrong with the vehicle. In addition to inducing trees, C4.5 can also restate its
trees in comprehensible rule form. Further, the rule postpruning operations supported
by C4.5 typically result in classifiers that cannot quite be restated as a decision tree.
The historical lineage of C4.5 offers an interesting study into how different sub-
communities converged on more or less like-minded solutions to classification. ID3
was developed independently of the original tree induction algorithm developed by
Friedman [13], which later evolved into CART [4] with the participation of Breiman,
Olshen, and Stone. But, from the numerous references to CART in [30], the design
decisions underlying C4.5 appear to have been influenced by (to improve upon) how
CART resolved similar issues, such as procedures for handling special types of at-
tributes. (For this reason, due to the overlap in scope, we will aim to minimize duplication with the material covered in the CART chapter, Chapter 10, and point out key differences
at appropriate junctures.) In [25] and [36], Quinlan also acknowledged the influence
of the CLS (Concept Learning System [16]) framework in the historical development
of ID3 and C4.5. Today, C4.5 has been superseded by the See5/C5.0 system, a commercial product offered by RuleQuest Research.
The fact that two of the top 10 algorithms are tree-based algorithms attests to
the widespread popularity of such methods in data mining. Original applications of
decision trees were in domains with nominal valued or categorical data but today
they span a multitude of domains with numeric, symbolic, and mixed-type attributes.
Examples include clinical decision making, manufacturing, document analysis, bio-
informatics, spatial data modeling (geographic information systems), and practically
any domain where decision boundaries between classes can be captured in terms of
tree-like decompositions or regions identified by rules.
1.2 Algorithm Description
C4.5 is not one algorithm but rather a suite of algorithms—C4.5, C4.5-no-pruning,
and C4.5-rules—with many features. We present the basic C4.5 algorithm first and
the special features later.
The generic description of how C4.5 works is shown in Algorithm 1.1. All tree
induction methods begin with a root node that represents the entire, given dataset and
recursively split the data into smaller subsets by testing for a given attribute at each
node. The subtrees denote the partitions of the original dataset that satisfy specified
attribute value tests. This process typically continues until the subsets are “pure,” that
is, all instances in the subset fall in the same class, at which time the tree growing is
terminated.
Algorithm 1.1 C4.5(D)
Input: an attribute-valued dataset D
 1: Tree = {}
 2: if D is “pure” OR other stopping criteria met then
 3:     terminate
 4: end if
 5: for all attribute a ∈ D do
 6:     Compute information-theoretic criteria if we split on a
 7: end for
 8: a_best = Best attribute according to above computed criteria
 9: Tree = Create a decision node that tests a_best in the root
10: D_v = Induced sub-datasets from D based on a_best
11: for all D_v do
12:     Tree_v = C4.5(D_v)
13:     Attach Tree_v to the corresponding branch of Tree
14: end for
15: return Tree
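As a companion to the pseudocode, here is a minimal Python sketch of Algorithm 1.1 under two simplifying assumptions: attributes are categorical only, and plain information gain is used instead of C4.5's default gain ratio. The helper names (entropy, split, gain, induce_tree) are ours, not C4.5's.

import math
from collections import Counter

def entropy(rows, target):
    # Entropy of the class distribution among the given instances.
    counts = Counter(r[target] for r in rows)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def split(rows, attr):
    # Partition instances by their value for the given attribute.
    parts = {}
    for r in rows:
        parts.setdefault(r[attr], []).append(r)
    return parts

def gain(rows, attr, target):
    # Reduction in class entropy from testing attr (lines 5-7 of Algorithm 1.1).
    n = len(rows)
    remainder = sum(len(p) / n * entropy(p, target)
                    for p in split(rows, attr).values())
    return entropy(rows, target) - remainder

def induce_tree(rows, attrs, target):
    # Lines 2-4: stop when the subset is pure or no tests remain;
    # predict the majority class at the leaf.
    if len({r[target] for r in rows}) == 1 or not attrs:
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    # Lines 8-9: greedily pick the best attribute as the decision test.
    best = max(attrs, key=lambda a: gain(rows, a, target))
    # Lines 10-14: recurse on each induced sub-dataset.
    return {(best, value): induce_tree(subset,
                                       [a for a in attrs if a != best], target)
            for value, subset in split(rows, best).items()}

For instance, induce_tree(GOLF_DATA, ["Outlook", "Windy"], TARGET) recurses over the categorical attributes of Figure 1.1 exactly as lines 5-14 of Algorithm 1.1 prescribe; the continuous Temperature and Humidity attributes additionally need the threshold search discussed below.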
[Figure: decision tree with Outlook at the root. The Sunny branch tests Humidity (≤75: Yes, >75: No); the Overcast branch predicts Yes; the Rainy branch tests Windy (True: No, False: Yes).]
Figure 1.2 Decision tree induced by C4.5 for the dataset of Figure 1.1.
Figure 1.1 presents the classical “golf” dataset, which is bundled with the C4.5
installation. As stated earlier, the goal is to predict whether the weather conditions
on a particular day are conducive to playing golf. Recall that some of the features are
continuous-valued while others are categorical.
Figure 1.2 illustrates the tree induced by C4.5 using Figure 1.1 as training data
(and the default options). Let us look at the various choices involved in inducing such
trees from the data.
• What types of tests are possible? As Figure 1.2 shows, C4.5 is not restricted to considering binary tests, and allows tests with two or more outcomes. If the attribute is Boolean, the test induces two branches. If the attribute is categorical, the test is multivalued, but different values can be grouped into a smaller set of options with one class predicted for each option. If the attribute is numerical, then the tests are again binary-valued, of the form {≤ θ, > θ}, where θ is a suitably determined threshold for that attribute.
• How are tests chosen? C4.5 uses information-theoretic criteria such as gain (reduction in entropy of the class distribution due to applying a test) and gain ratio (a way to correct for the tendency of gain to favor tests with many outcomes). The default criterion is gain ratio. At each point in the tree-growing, the test with the best criterion value is greedily chosen.
• How are test thresholds chosen? As stated earlier, for Boolean and categorical attributes, the test values are simply the different possible instantiations of that attribute. For numerical attributes, the threshold is obtained by sorting on that attribute and choosing the split between successive values that maximizes the criterion above. Fayyad and Irani [10] showed that not all successive values need to be considered: for two successive values $v_i$ and $v_{i+1}$ of a continuous-valued attribute, if all instances involving $v_i$ and all instances involving $v_{i+1}$ belong to the same class, then splitting between them cannot possibly improve information gain (or gain ratio). (A sketch of this threshold search appears after this list.)
• How is tree-growing terminated? A branch from a node is declared to lead to a leaf if all instances covered by that branch belong to the same class, that is, the subset is pure. Another way in which tree-growing is terminated is if the number of instances falls below a specified threshold.
• How are class labels assigned to the leaves? The majority class of the instances assigned to the leaf is taken to be the class prediction of that subbranch of the tree.
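The threshold search for a continuous attribute can be sketched as follows, reusing the entropy helper defined earlier. The function names candidate_thresholds and best_threshold are ours, the midpoint convention is one common choice rather than C4.5's exact rule, and the adjacency test is a simplified per-pair version of the Fayyad-Irani condition stated above.

def candidate_thresholds(rows, attr, target):
    # Sort on the continuous attribute and keep midpoints only between
    # neighbors that differ in class (simplified Fayyad-Irani pruning:
    # same-class boundaries cannot improve the splitting criterion).
    ordered = sorted(rows, key=lambda r: r[attr])
    return [(a[attr] + b[attr]) / 2
            for a, b in zip(ordered, ordered[1:])
            if a[attr] != b[attr] and a[target] != b[target]]

def best_threshold(rows, attr, target):
    # Score the binary test {<= theta, > theta} at each candidate and
    # return the threshold with the highest information gain.
    def gain_at(theta):
        left = [r for r in rows if r[attr] <= theta]
        right = [r for r in rows if r[attr] > theta]
        n = len(rows)
        remainder = (len(left) / n * entropy(left, target)
                     + len(right) / n * entropy(right, target))
        return entropy(rows, target) - remainder
    return max(candidate_thresholds(rows, attr, target), key=gain_at)

For example, best_threshold(GOLF_DATA, "Humidity", TARGET) searches only the class-boundary midpoints of the Humidity column rather than all thirteen gaps between the fourteen sorted values.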
The above questions are faced by any classification approach modeled after trees, and most tree induction algorithms make similar, or otherwise reasonable, decisions.
The practical utility of C4.5, however, comes from the next set of features that build
upon the basic tree induction algorithm above. But before we present these features,
it is instructive to instantiate Algorithm 1.1 for a simple dataset such as shown in
Figure 1.1.
We will work out in some detail how the tree of Figure 1.2 is induced from
Figure 1.1. Observe how the first attribute chosen for a decision test is the Outlook
attribute. To see why, let us first estimate the entropy of the class random variable
(PlayGolf?). This variable takes two values with probability 9/14 (for “Yes”) and
5/14 (for “No”). The entropy of a class random variable that takes on c values with probabilities $p_1, p_2, \ldots, p_c$ is given by:

$$\sum_{i=1}^{c} -p_i \log_2 p_i$$

The entropy of PlayGolf? is thus

$$-(9/14)\log_2(9/14) - (5/14)\log_2(5/14)$$
or 0.940. This means that on average 0.940 bits must be transmitted to communicate
information about the PlayGolf? random variable. The goal of C4.5 tree induction is
to ask the right questions so that this entropy is reduced. We consider each attribute in
turn to assess the improvement in entropy that it affords. For a given random variable,
say Outlook, the improvement in entropy, represented as Gain(Outlook), is calculated
as:
$$\text{Entropy}(\text{PlayGolf? in } D) \;-\; \sum_{v} \frac{|D_v|}{|D|}\,\text{Entropy}(\text{PlayGolf? in } D_v)$$

where v ranges over the set of possible values (in this case, three values for Outlook), D denotes the entire dataset, $D_v$ is the subset of the dataset for which attribute Outlook has value v, and the notation $|\cdot|$ denotes the size of a dataset (in number of instances).
This calculation will show that Gain(Outlook) is 0.940 − 0.694 = 0.246. Similarly, we can calculate that Gain(Windy) is 0.940 − 0.892 = 0.048. Working out the above calculations for the other attributes systematically will reveal that Outlook is indeed the best attribute to test at the root.
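These values are easy to verify with the helpers from the sketches above; the expected results in the comments are the figures derived in the text, not separately generated output.

# Recompute the worked example on the GOLF_DATA records defined earlier.
base = entropy(GOLF_DATA, TARGET)                    # expected: about 0.940
gain_outlook = gain(GOLF_DATA, "Outlook", TARGET)    # expected: about 0.246
gain_windy = gain(GOLF_DATA, "Windy", TARGET)        # expected: about 0.048
print(f"Entropy(PlayGolf?) = {base:.3f}")
print(f"Gain(Outlook)      = {gain_outlook:.3f}")
print(f"Gain(Windy)        = {gain_windy:.3f}")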