UNIVERSITÀ DEGLI STUDI DI MILANO
DIPARTIMENTO DI INFORMATICA
SEDE DI CREMA
Wine quality prediction
A project report for the Information Management course
Crema – 2015
Introduction
The article “Modeling wine preferences by data mining from physicochemical properties” by Paulo
Cortez, António Cerdeira, Fernando Almeida, Telmo Matos and José Reis, published on
sciencedirect.com in 2009, reviews and proposes data mining approaches to predict wine taste
quality evaluations. The analysis is based on a dataset that is large compared to others in this
domain.
The article reviews three techniques used for the predictions: support vector machines, multiple
regression and neural networks.
The support vector machine (SVM) will be replicated in this work, as the method that outperformed
the others in accuracy for this prediction. The naive Bayes classifier will be applied as an
additional, alternative classification method alongside the SVM.
Dataset
The dataset is split into two separate CSV files: one contains samples of white wines and the
other samples of red wines. The dataset contains 12 numeric attributes:
1. Fixed acidity
2. Volatile acidity
3. Citric acid
4. Residual sugar
5. Chlorides
6. Free sulfur dioxide
7. Total sulfur dioxide
8. Density
9. pH
10. Sulphates
11. Alcohol
12. Quality
The first 11 attributes are inputs obtained from objective physicochemical tests (e.g. pH values),
and the 12th is the output, based on sensory data (the median of at least 3 evaluations made by
wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (excellent).
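The split between the 11 input attributes and the quality output can be sketched in Python as
follows. This is only an illustration and not the code used in this project: the function name
load_wine_rows is hypothetical, the semicolon delimiter is an assumption about how the CSV files
are formatted, and the two sample rows are made-up values rather than records from the dataset.

```python
import csv
import io


def load_wine_rows(stream, delimiter=";"):
    """Read wine samples from a CSV stream.

    Returns the input attribute names, the output attribute name and a list
    of (inputs, quality) pairs. The last column is assumed to be the quality.
    """
    reader = csv.reader(stream, delimiter=delimiter)
    header = next(reader)
    input_names, output_name = header[:-1], header[-1]
    samples = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        values = [float(v) for v in row]
        # First 11 values are physicochemical inputs, the last is the grade.
        samples.append((values[:-1], int(values[-1])))
    return input_names, output_name, samples


# Illustrative rows only (NOT taken from the dataset).
sample_csv = (
    "fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;"
    "free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality\n"
    "7.0;0.30;0.30;5.0;0.045;30;130;0.9940;3.20;0.50;10.5;6\n"
    "8.0;0.50;0.25;2.0;0.080;15;45;0.9970;3.30;0.65;10.0;5\n"
)

input_names, output_name, samples = load_wine_rows(io.StringIO(sample_csv))
print(len(input_names), output_name, len(samples))
```

With the real files, the same function could be called on an opened file object instead of the
in-memory sample, once for the white wine file and once for the red wine file.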
Dataset investigation
To start investigating the data, several metrics were calculated for each attribute: mean, median,
minimum, maximum, range (max - min) and standard deviation.
Attribute              Mean      Median    Min       Max       Max - min   Standard deviation
Fixed acidity          6.855     6.8       3.8       14.2      10.4        0.844
Volatile acidity       0.278     0.26      0.08      1.1       1.02        0.101
Citric acid            0.334     0.32      0         1.66      1.66        0.121
Residual sugar         6.391     5.2       0.6       65.8      65.2        5.072
Chlorides              0.046     0.043     0.009     0.346     0.337       0.022
Free sulfur dioxide    35.308    34        2         289       287         17.007
Total sulfur dioxide   138.361   134       9         440       431         42.498
Density                0.994     0.99374   0.98711   1.03898   0.05187     0.003
pH                     3.188     3.18      2.72      3.82      1.1         0.151
Sulphates              0.49      0.47      0.22      1.08      0.86        0.114
Alcohol                10.514    10.4      8         14.2      6.2         1.231
Quality                5.878     6         3         9         6           0.886
Table 1: white wine metrics
Attribute              Mean      Median    Min       Max       Max - min   Standard deviation
Fixed acidity          8.32      7.9       4.6       15.9      11.3        1.741
Volatile acidity       0.528     0.52      0.12      1.58      1.46        0.179
Citric acid            0.271     0.26      0         1         1           0.195
Residual sugar         2.539     2.2       0.9       15.5      14.6        1.41
Chlorides              0.087     0.079     0.012     0.611     0.599       0.047
Free sulfur dioxide    15.875    14        1         72        71          10.46
Total sulfur dioxide   46.468    38        6         289       283         32.895
Density                0.997     0.99675   0.99007   1.00369   0.01362     0.002
pH                     3.311     3.31      2.74      4.01      1.27        0.154
Sulphates              0.658     0.62      0.33      2         1.67        0.17
Alcohol                10.423    10.2      8.4       14.9      6.5         1.066
Quality                5.636     6         3         8         5           0.808
Table 2: red wine metrics
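The metrics reported in Tables 1 and 2 could be computed for one attribute along the following
lines. This is a minimal sketch using only the Python standard library, not the code used in this
project: the function name describe is hypothetical, the input values are made up, and the use of
the sample standard deviation (statistics.stdev) is an assumption, since the report does not state
whether the sample or the population formula was used.

```python
import statistics


def describe(values):
    """Return the six summary metrics for one attribute column."""
    lowest, highest = min(values), max(values)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": lowest,
        "max": highest,
        "range": highest - lowest,  # the "Max - min" column
        # Assumption: sample standard deviation (n - 1 in the denominator).
        "std": statistics.stdev(values),
    }


# Made-up alcohol values for illustration (NOT taken from the dataset).
alcohol = [9.0, 10.0, 11.0, 12.0, 13.0]
summary = describe(alcohol)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Applying describe to every column of the white wine and red wine files separately would yield
one row per attribute, in the layout of the tables above.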