genomic, chemical, process, meteorological, marine, aviation, physical, credit, insurance,
retail, or any type of data requiring analysis. What is important is that the analyst needs to
get the most information out of the data.
At a second level, this book is also intended for anyone who needs to understand the issues
in data preparation, even if they are not directly involved in preparing or working with data.
Reading this book will give anyone who uses analyses provided from an analyst’s work a
much better understanding of the results and limitations that the analyst works with, and a far
deeper insight into what the analyses mean, where they can be used, and what can be
reasonably expected from any analysis.
Why I Wrote It
There are many good books available today that discuss how to collect data, particularly
in government and business. Simply look for titles about databases and data
warehousing. There are many equally good books about data mining that discuss tools
and algorithms. But few books, if any, address what to do with the “dirty data” after it is
collected and before exploring it with a data mining tool. Yet this part of the process is
critical.
I wrote this book to address that gap in the process between identifying data and building
models. It will take you from the point where data has been identified in some form or
other, if not assembled. It will walk you through the process of identifying an appropriate
problem, relating the data back to the world from which it was collected, assembling the
data into mineable form, discovering problems with the data, fixing the problems, and
discovering what is in the data—that is, whether continuing with mining will deliver what
you need. It walks you through the whole process, starting with data discovery, and
deposits you on the very doorstep of building a data-mined model.
This is not an easy journey, but it is one that I have trodden many times in many projects.
There is a “beaten path,” and my express purpose in writing this book is to show exactly
where the path leads, why it goes where it does, and to provide tools and a map so that you
can tread it again on your own when you need to.
Special Features
A CD-ROM accompanies the book. Preparing data requires manipulating it and looking at
it in various ways. All of the actual data manipulation techniques that are conceptually
described in the book, mainly in Chapters 5 through 8 and 10, are illustrated by C programs.
For ease of understanding, each technique is illustrated, so far as possible, in a
separate, well-commented C source file. If compiled as an integrated whole, these
provide an automated data preparation tool.
The CD-ROM also includes demonstration versions of other tools mentioned, and useful