An Improved C4.5 Algorithm and Application
Abstract
With the continuous development of science and technology, there is an urgent
need to extract useful information from vast amounts of data, and data mining
has become one of the most popular information technologies. The C4.5 algorithm
is the most classical of the ten classical data mining algorithms. Data mining
technology plays a very important role and is very widely used. The C4.5
algorithm is a decision tree algorithm that presents classification rules in
the form of a tree. It improves on the ID3 algorithm by using the information
gain ratio, instead of information gain, as the criterion for selecting the
root attribute. This overcomes the bias toward attributes with many values that
arises when attributes are selected by information gain. C4.5 can also
discretize continuous attributes.
The most important features of the C4.5 algorithm are that the rules it
produces are easy to understand, that the people who build the tree do not need
expert knowledge of the field being mined, and that the resulting classifier is
fast and highly accurate. The C4.5 algorithm has been widely applied in fields
such as economics, industry, medicine, and agriculture, so research on the C4.5
algorithm is of considerable importance.
The C4.5 algorithm nevertheless has a number of shortcomings. In particular,
redundant data can make the complexity of the algorithm too high. In this paper
the C4.5 algorithm is improved in this respect, and the improved version is
named the R-C4.5 algorithm.
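The gain ratio criterion described above can be illustrated with a minimal
Python sketch. It assumes the standard definition (information gain divided by
split information) and categorical attributes only; the function names and the
toy data are illustrative and are not taken from the paper.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete items."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain ratio of one categorical attribute.

    values: the attribute value of each sample
    labels: the class label of each sample
    """
    total = len(labels)
    # Group the class labels by attribute value.
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy after splitting on the attribute.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information: entropy of the attribute's own value distribution.
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


labels = ["yes", "yes", "no", "no"]
two_valued = ["a", "a", "b", "b"]    # separates the classes with 2 values
id_like = ["1", "2", "3", "4"]       # one distinct value per sample

print(gain_ratio(two_valued, labels))
print(gain_ratio(id_like, labels))
```

Both attributes have the same information gain on this toy data, but the
attribute with one value per sample has larger split information and therefore
a lower gain ratio, which is the bias correction the text refers to.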
The specific improvement is as follows: calculate the information entropy of
each element (value) of every attribute, and compare the entropies of the
elements that belong to the same attribute. If two entropy values are close,
calculate the similarity of the corresponding sets of elements; if the
similarity coefficient is high, the two elements are the same or similar in
nature, and they are merged to form a new element. The similarity is calculated
with an improved Jaccard coefficient. The aim of this change is to compare not
simply the number of elements that two sets have in common, but the proportions
in which similar elements occur in the sets.
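The merging test described above can be sketched in Python as follows. The
abstract does not give the formula of the improved Jaccard coefficient, so this
sketch assumes a proportion-based (weighted) Jaccard similarity over class
distributions as a stand-in; the function names and both thresholds are
hypothetical and are not the paper's values.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Proportion of each class among the samples sharing one attribute value."""
    total = len(labels)
    return {c: n / total for c, n in Counter(labels).items()}


def entropy(labels):
    """Shannon entropy (in bits) of the class labels."""
    return -sum(p * log2(p) for p in class_proportions(labels).values())


def proportional_jaccard(p, q):
    """Weighted Jaccard similarity of two proportion maps (assumed stand-in).

    Compares proportions rather than raw counts, so two sets of different
    size with the same class mix score 1.0.
    """
    keys = set(p) | set(q)
    num = sum(min(p.get(k, 0.0), q.get(k, 0.0)) for k in keys)
    den = sum(max(p.get(k, 0.0), q.get(k, 0.0)) for k in keys)
    return num / den if den else 0.0


def should_merge(labels_a, labels_b, entropy_tol=0.05, sim_threshold=0.9):
    """Merge two attribute values only if their entropies are close AND
    their class proportions are similar. Both thresholds are hypothetical."""
    if abs(entropy(labels_a) - entropy(labels_b)) > entropy_tol:
        return False
    similarity = proportional_jaccard(
        class_proportions(labels_a), class_proportions(labels_b)
    )
    return similarity >= sim_threshold


value_a = ["yes"] * 8 + ["no"] * 2     # 80% yes
value_b = ["yes"] * 16 + ["no"] * 4    # same mix, twice as many samples
value_c = ["no"] * 8 + ["yes"] * 2     # same entropy as value_a, opposite mix

print(should_merge(value_a, value_b))
print(should_merge(value_a, value_c))
```

The third value shows why the entropy comparison alone is not sufficient: it
has the same entropy as the first value but the opposite class mix, so the
similarity check rejects the merge.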
The improved C4.5 algorithm has a stronger processing mechanism.
With the attribute of information entropy reduction, this removed redundant attributes