Practical Guide To
Cluster Analysis in R
Unsupervised Machine Learning
Alboukadel Kassambara
Multivariate Analysis I
Edition 1, sthda.com
© A. Kassambara 2015
Copyright ©2017 by Alboukadel Kassambara. All rights reserved.
Published by STHDA (http://www.sthda.com), Alboukadel Kassambara
Contact: Alboukadel Kassambara <alboukadel.kassambara@gmail.com>
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form
or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, without the prior
written permission of the Publisher. Requests to the Publisher for permission should
be addressed to STHDA (http://www.sthda.com).
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in
preparing this book, they make no representations or warranties with respect to the accuracy or
completeness of the contents of this book and specifically disclaim any implied warranties of
merchantability or fitness for a particular purpose. No warranty may be created or extended by sales
representatives or written sales materials.
Neither the Publisher nor the authors, contributors, or editors,
assume any liability for any injury and/or damage
to persons or property as a matter of products liability,
negligence or otherwise, or from any use or operation of any
methods, products, instructions, or ideas contained in the material herein.
For general information contact Alboukadel Kassambara <alboukadel.kassambara@gmail.com>.
0.1 Preface
Large amounts of data are collected every day from satellite imaging, biomedical,
security, marketing, web search, geospatial and other automated systems. Mining
knowledge from such big data far exceeds human abilities.
Clustering is one of the important data mining methods for discovering knowledge
in multidimensional data. The goal of clustering is to identify patterns or groups of
similar objects within a data set of interest.
In the literature, it is referred to as “pattern recognition” or “unsupervised machine
learning”: “unsupervised” because we are not guided by a priori ideas of which
variables or samples belong in which clusters, and “learning” because the machine
algorithm “learns” how to cluster.
Cluster analysis is popular in many fields, including:
• In cancer research, for classifying patients into subgroups according to their gene
expression profile. This can be useful for identifying the molecular profile of
patients with a good or bad prognosis, as well as for understanding the disease.
• In marketing, for market segmentation by identifying subgroups of customers with
similar profiles who might be receptive to a particular form of advertising.
• In city planning, for identifying groups of houses according to their type, value
and location.
This book provides a practical guide to unsupervised machine learning, or cluster
analysis, using R software. Additionally, we developed an R package named factoextra
to easily create elegant ggplot2-based plots of cluster analysis results. The official
factoextra online documentation is available at: http://www.sthda.com/english/rpkgs/factoextra
0.2 About the author
Alboukadel Kassambara holds a PhD in Bioinformatics and Cancer Biology. He has worked
for many years on genomic data analysis and visualization. He created a bioinformatics
tool named GenomicScape (www.genomicscape.com), an easy-to-use web tool
for gene expression data analysis and visualization.
He also developed a website called STHDA (Statistical Tools for High-throughput Data
Analysis, www.sthda.com/english), which contains many tutorials on data analysis
and visualization using R software and packages.
He is the author of the R packages survminer (for analyzing and drawing survival
curves), ggcorrplot (for drawing correlation matrices using ggplot2) and factoextra
(to easily extract and visualize the results of multivariate analyses such as PCA, CA,
MCA and clustering). You can learn more about these packages at:
http://www.sthda.com/english/wiki/r-packages
Recently, he published two books on data visualization:
1. Guide to Create Beautiful Graphics in R (at: https://goo.gl/vJ0OYb).
2. Complete Guide to 3D Plots in R (at: https://goo.gl/v5gwl0).
Contents
0.1 Preface
0.2 About the author
0.3 Key features of this book
0.4 How this book is organized?
0.5 Book website
0.6 Executing the R codes from the PDF
I Basics
1 Introduction to R
1.1 Install R and RStudio
1.2 Installing and loading R packages
1.3 Getting help with functions in R
1.4 Importing your data into R
1.5 Demo data sets
1.6 Close your R/RStudio session
2 Data Preparation and R Packages
2.1 Data preparation
2.2 Required R Packages
3 Clustering Distance Measures
3.1 Methods for measuring distances
3.2 What type of distance measures should we choose?
3.3 Data standardization
3.4 Distance matrix computation
3.5 Visualizing distance matrices
3.6 Summary