Introduction to Data Mining

Required points/C-coins: 9 | Uploaded 2016-02-20 13:53:14 | 21.62MB | PDF

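A very practical introductory book for learning data mining, in the original English, recommended by my teacher. A very good book for beginners.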
Chapter 3 Exploring Data, Section 3.3 Visualization (excerpt starting at page 112)

[Figure 3.4. Sepal length data from the Iris data set.]

[Figure 3.5. Stem and leaf plot for the sepal length from the Iris data set.]

Figure 3.6. Stem and leaf plot for the sepal length from the Iris data set when buckets corresponding to digits are split:
  4 : 566667788888999999
  5 : 000000000011111111122223444444
  5 : 5555555666666777777778888888999
  6 : 00000011111122223333333334444444
  6 : 5555566777777778889999
  7 : 0122234
  7 : 677779

Once the counts are available for each bin, a bar plot is constructed such that each bin is represented by one bar and the area of each bar is proportional to the number of values (objects) that fall into the corresponding range. If all intervals are of equal width, then all bars are the same width and the height of a bar is proportional to the number of values in the corresponding bin.

Example 3.8. Figure 3.7 shows histograms (with 10 bins) for sepal length, sepal width, petal length, and petal width. Since the shape of a histogram can depend on the number of bins, histograms for the same data, but with 20 bins, are shown in Figure 3.8.

There are variations of the histogram plot. A relative (frequency) histogram replaces the count by the relative frequency. However, this is just a change in scale of the y axis, and the shape of the histogram does not change. Another common variation, especially for unordered categorical data, is the Pareto histogram, which is the same as a normal histogram except that the categories are sorted by count so that the count is decreasing from left to right.

[Figure 3.7. Histograms of four Iris attributes (10 bins). Panels: (a) Sepal length, (b) Sepal width, (c) Petal length, (d) Petal width.]

[Figure 3.8. Histograms of four Iris attributes (20 bins). Panels: (a) Sepal length, (b) Sepal width, (c) Petal length, (d) Petal width.]

Two-Dimensional Histograms. Two-dimensional histograms are also possible. Each attribute is divided into intervals and the two sets of intervals define two-dimensional rectangles of values.

Example 3.9. Figure 3.9 shows a two-dimensional histogram of petal length and petal width. Because each attribute is split into three bins, there are nine rectangular two-dimensional bins. The height of each rectangular bar indicates the number of objects (flowers in this case) that fall into each bin. Most of the flowers fall into only three of the bins, those along the diagonal. It is not possible to see this by looking at the one-dimensional distributions.

[Figure 3.9. Two-dimensional histogram of petal length and width in the Iris data set.]

While two-dimensional histograms can be used to discover interesting facts about how the values of two attributes co-occur, they are visually more complicated. For instance, it is easy to imagine a situation in which some of the columns are hidden by others.

Box Plots. Box plots are another method for showing the distribution of the values of a single numerical attribute. Figure 3.10 shows a labeled box plot for sepal length. The lower and upper ends of the box indicate the 25th and 75th percentiles, respectively, while the line inside the box indicates the value of the 50th percentile. The top and bottom lines of the tails indicate the 10th and 90th percentiles. Outliers are shown by "+" marks. Box plots are relatively compact, and thus, many of them can be shown on the same plot.
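The equal-width binning behind a histogram and the percentiles that define a box plot can be sketched in a few lines of Python. This is only an illustration, not the book's code: the function names are hypothetical, the percentile uses linear interpolation between sorted values (one of several common conventions), and treating every value outside the 10th to 90th percentile range as an outlier is an assumption based on the description above.

```python
def equal_width_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    values = list(values)
    if bins < 1 or not values:
        raise ValueError("need at least one bin and one value")
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # If all values are equal, everything goes into the first bin.
        # Otherwise the maximum value is placed in the last bin.
        idx = 0 if width == 0 else min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


def relative_frequencies(counts):
    """Rescale bin counts to fractions; the histogram shape is unchanged."""
    total = sum(counts)
    return [c / total for c in counts]


def percentile(sorted_values, p):
    """p-th percentile (0-100) by linear interpolation between neighbours."""
    if not sorted_values:
        raise ValueError("need at least one value")
    rank = (len(sorted_values) - 1) * p / 100
    lo = int(rank)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = rank - lo
    return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])


def box_plot_summary(values):
    """Box ends (25th, 75th), median (50th), tails (10th, 90th), outliers."""
    ordered = sorted(values)
    summary = {f"p{p}": percentile(ordered, p) for p in (10, 25, 50, 75, 90)}
    # Assumption: points beyond the tails are the ones drawn as "+" marks.
    summary["outliers"] = [
        v for v in ordered if v < summary["p10"] or v > summary["p90"]
    ]
    return summary
```

Calling `box_plot_summary` once per attribute, or once per attribute within each class, yields the numbers needed to draw several box plots side by side.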
Simplified versions of the box plot, which take less space, can also be used.

Example 3.10. The box plots for the first four attributes of the Iris data set are shown in Figure 3.11. Box plots can also be used to compare how attributes vary between different classes of objects, as shown in Figure 3.12.

Pie Chart. A pie chart is similar to a histogram, but is typically used with categorical attributes that have a relatively small number of values. Instead of showing the relative frequency of different values with the area or height of a bar, as in a histogram, a pie chart uses the relative area of a circle to indicate relative frequency. Although pie charts are common in popular articles, they are used less frequently in technical publications because the size of relative areas can be hard to judge. Histograms are preferred for technical work.

[Figure 3.10. Description of box plot for sepal length.]

[Figure 3.11. Box plot for Iris attributes.]

[Figure 3.12. Box plots of attributes by Iris species. Panels: (a) Setosa, (b) Versicolour, (c) Virginica.]

Example 3.11. Figure 3.13 displays a pie chart that shows the distribution of Iris species in the Iris data set. In this case, all three flower types have the same frequency.

[Figure 3.13. Distribution of the types of Iris flowers.]

Percentile Plots and Empirical Cumulative Distribution Functions. A type of diagram that shows the distribution of the data more quantitatively is the plot of an empirical cumulative distribution function. While this type of plot may sound complicated, the concept is straightforward. For each value of a statistical distribution, a cumulative distribution function (CDF) shows the probability that a point is less than that value. For each observed value, an empirical cumulative distribution function (ECDF) shows the fraction of points that are less than this value. Since the number of points is finite, the empirical cumulative distribution function is a step function.

Example 3.12. Figure 3.14 shows the ECDFs of the Iris attributes. The percentiles of an attribute provide similar information. Figure 3.15 shows the percentile plots of the four continuous attributes of the Iris data set from Table 3.2. The reader should compare these figures with the histograms given in Figures 3.7 and 3.8.

Scatter Plots. Most people are familiar with scatter plots to some extent, and they were used in Section 2.4.5 to illustrate linear correlation. Each data object is plotted as a point in the plane using the values of the two attributes as x and y coordinates. It is assumed that the attributes are either integer- or real-valued.

Example 3.13. Figure 3.16 shows a scatter plot for each pair of attributes of the Iris data set. The different species of Iris are indicated by different markers. The arrangement of the scatter plots of pairs of attributes in this type of tabular format, which is known as a scatter plot matrix, provides an organized way to examine a number of scatter plots simultaneously.

[Figure 3.14. Empirical CDFs of four Iris attributes. Panels: (a) Sepal length, (b) Sepal width, (c) Petal length, (d) Petal width.]

[Figure 3.15. Percentile plots for sepal length, sepal width, petal length, and petal width.]
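The empirical cumulative distribution function described in this excerpt is short enough to sketch directly. The following is a minimal illustration, not the book's code: `ecdf` and `fraction_below` are hypothetical names, and the `inclusive` flag is an added assumption that switches from the excerpt's "less than" definition to the "less than or equal to" convention that many libraries use.

```python
from bisect import bisect_left, bisect_right


def ecdf(values):
    """Return a function giving the fraction of points below a given value."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one value")

    def fraction_below(x, inclusive=False):
        # bisect_left counts points strictly less than x;
        # bisect_right also counts points equal to x.
        if inclusive:
            count = bisect_right(ordered, x)
        else:
            count = bisect_left(ordered, x)
        return count / n

    return fraction_below
```

Because the result only changes at observed values, evaluating the returned function over a fine grid traces out the step function mentioned above.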


Comments (4)

qq_19566385 A very good book, well suited to beginners.
2017-11-13
dracozq The book is seriously incomplete.
2017-02-15
dantinwach The book is not complete: Chapters 1 and 2, and Chapters 5 and 7, are all missing. I do not recommend downloading it.
2017-02-02
konigsberg A very good book, a must-read for data mining!
2016-09-08