Data Preparation for Data Mining (resource page, CSDN Library)

Uploaded 2017-12-28 10:12:14. PDF, 4.01 MB.

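This is an excerpt from the introductory data mining book Data Preparation for Data Mining, written by Dorian Pyle and published in 1999. The book explains why data preparation and exploration matter, and covers in detail the work that comes before mining: acquiring data, preprocessing it, handling variables, and preparing the data set. It treats data preparation as the most basic and most important part of developing a machine learning model, and gives readers a complete methodology and set of techniques for preprocessing data.

The book first stresses the importance of data exploration, pointing out that exploration is a process rather than a one-off activity. The author explains how the nature of the world affects data preparation. Data preparation is itself a process as well, with steps that include acquiring data, sampling, handling variables, and handling missing values. Handling nonnumerical variables, normalizing and redistributing variables, replacing missing and empty values, handling series variables, and preparing the data set are each discussed in detail. The book also comes with example code on a CD-ROM, and Appendix B lists further reading for readers who want to go deeper.

Data preparation is key to successful data mining because it determines the quality and accuracy of the data fed to the mining algorithm. Learning how to acquire and organize data is a prerequisite for effective mining. Data exploration involves a first analysis of the data set to understand its structure and content, which helps the miner identify patterns and outliers.

On variable handling, the book discusses the importance of normalizing and redistributing variables. Normalization puts values on a common scale. Two common methods are min-max normalization, which maps values onto a fixed range such as [0, 1], and z-score normalization, which rescales values to zero mean and unit standard deviation. Redistribution reshapes a variable so that it better suits the analysis; a continuous variable, for example, may need a transformation chosen according to its distribution.

Handling nonnumerical variables, such as categorical data, is one of the harder problems in preprocessing. Nonnumerical values have to be encoded as numbers before an algorithm can process them, using techniques such as one-hot encoding and label encoding. For missing values, the book recommends replacement strategies, such as filling with the mean, median, or mode, or with a model-based estimate. Series variables require an understanding of basic time series concepts in order to handle time-dependent data.

The goal of preparing the data set is a high-quality data set that is fit for mining. The process may include selecting features, creating new features, transforming features, and a final round of cleaning to make the data set tidy and consistent. Once prepared, the data set can be used to train and validate machine learning models, and ultimately for prediction or classification.

For beginners, the book provides both the foundations of data mining and practical techniques for data preprocessing, which makes it a valuable introductory text. For readers who want to go further in data science, its practical techniques and example code offer useful guidance on how complex data preparation is and how to handle data effectively.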
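The replacement, normalization, and encoding steps named in the summary above can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the book's own code (which the preface says is written in C and shipped on the CD-ROM). The function names impute_mean, min_max_scale, z_score, and one_hot are hypothetical, and mean replacement is used simply because the summary mentions it.

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Replace None entries with the mean of the observed values (hypothetical helper)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Map values linearly onto [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(labels):
    """Encode each label as a 0/1 vector, one position per distinct category."""
    categories = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, rows


if __name__ == "__main__":
    ages = [20, None, 40, 60]          # illustrative data, not from the book
    filled = impute_mean(ages)
    print(filled)
    print(min_max_scale(filled))
    print(z_score(filled))
    print(one_hot(["red", "blue", "red"]))
```

Missing values are filled before scaling here so that the minimum, maximum, mean, and standard deviation are computed over a complete column. Whether that ordering suits a given data set is a judgment call that the sketch does not settle.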

Data Preparation for Data Mining

Dorian Pyle

Senior Editor: Diane D. Cerra

Director of Production & Manufacturing: Yonie Overton

Production Editor: Edward Wade

Editorial Assistant: Belinda Breyer

Cover Design: Wall-To-Wall Studios

Text Design & Composition: Rebecca Evans & Associates

Technical Illustration: Dartmouth Publishing, Inc.

Copyeditor: Gary Morris

Proofreader: Ken DellaPenta

Indexer: Steve Rath

Printer: Courier Corp.

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances where Morgan Kaufmann Publishers, Inc. is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

Morgan Kaufmann Publishers, Inc.

Editorial and Sales Office

340 Pine Street, Sixth Floor

San Francisco, CA 94104-3205

USA

Telephone 415-392-2665

Facsimile 415-982-2665

Email mkp@mkp.com

WWW http://www.mkp.com

Order toll free 800-745-7323

Preface

What This Book Is About

This book is about what to do with data to get the most out of it. There is a lot more to that statement than first meets the eye.

Much information is available today about data warehouses, data mining, KDD, OLTP, OLAP, and a whole alphabet soup of other acronyms that describe techniques and methods of storing, accessing, visualizing, and using data. There are books and magazines about building models for making predictions of all types: fraud, marketing, new customers, consumer demand, economic statistics, stock movement, option prices, weather, sociological behavior, traffic demand, resource needs, and many more.

In order to use the techniques, or make the predictions, industry professionals almost universally agree that one of the most important parts of any such project, and one of the most time-consuming and difficult, is data preparation. Unfortunately, data preparation has been much like the weather: as the old aphorism has it, “Everyone talks about it, but no one does anything about it.” This book takes a detailed look at the problems in preparing data, the solutions, and how to use the solutions to get the most out of the data, whatever you want to use it for. This book tells you what can be done about it, exactly how it can be done, and what it achieves, and puts a powerful kit of tools directly in your hands that allows you to do it.

How important is adequate data preparation? After finding the right problem to solve, data preparation is often the key to solving the problem. It can easily be the difference between success and failure, between useable insights and incomprehensible murk, between worthwhile predictions and useless guesses.

For instance, in one case data carefully prepared for warehousing proved useless for modeling. The preparation for warehousing had destroyed the useable information content for the needed mining project. Preparing the data for mining, rather than warehousing, produced a 550% improvement in model accuracy. In another case, a commercial baker achieved a bottom-line improvement approaching $1 million by using data prepared with the techniques described in this book instead of previous approaches.

Who This Book Is For

This book is written primarily for the computer-savvy analyst or modeler who works with data on a daily basis and who wants to use data mining to get the most out of data. The type of data the analyst works with is not important. It may be financial, marketing, business, stock trading, telecommunications, healthcare, medical, epidemiological, genomic, chemical, process, meteorological, marine, aviation, physical, credit, insurance, retail, or any type of data requiring analysis. What is important is that the analyst needs to get the most information out of the data.

At a second level, this book is also intended for anyone who needs to understand the issues in data preparation, even if they are not directly involved in preparing or working with data. Reading this book will give anyone who uses analyses provided from an analyst’s work a much better understanding of the results and limitations that the analyst works with, and a far deeper insight into what the analyses mean, where they can be used, and what can be reasonably expected from any analysis.

Why I Wrote It

There are many good books available today that discuss how to collect data, particularly in government and business. Simply look for titles about databases and data warehousing. There are many equally good books about data mining that discuss tools and algorithms. But few books, if any, address what to do with the “dirty data” after it is collected and before exploring it with a data mining tool. Yet this part of the process is critical.

I wrote this book to address that gap in the process between identifying data and building models. It will take you from the point where data has been identified in some form or other, if not assembled. It will walk you through the process of identifying an appropriate problem, relating the data back to the world from which it was collected, assembling the data into mineable form, discovering problems with the data, fixing the problems, and discovering what is in the data, that is, whether continuing with mining will deliver what you need. It walks you through the whole process, starting with data discovery, and deposits you on the very doorstep of building a data-mined model.

This is not an easy journey, but it is one that I have trodden many times in many projects. There is a “beaten path,” and my express purpose in writing this book is to show exactly where the path leads, why it goes where it does, and to provide tools and a map so that you can tread it again on your own when you need to.

Special Features

A CD-ROM accompanies the book. Preparing data requires manipulating it and looking at it in various ways. All of the actual data manipulation techniques that are conceptually described in the book, mainly in Chapters 5 through 8 and 10, are illustrated by C programs. For ease of understanding, each technique is illustrated, so far as possible, in a separate, well-commented C source file. If compiled as an integrated whole, these provide an automated data preparation tool.

The CD-ROM also includes demonstration versions of other tools mentioned, and useful
