Computational Statistics with Matlab
Mark Steyvers
May 13, 2011
Contents
1 Sampling from Random Variables 4
1.1 Standard distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Sampling from non-standard distributions . . . . . . . . . . . . . . . . . . . 7
1.2.1 Inverse transform sampling with discrete variables . . . . . . . . . . . 8
1.2.2 Inverse transform sampling with continuous variables . . . . . . . . . 11
1.2.3 Rejection sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2 Markov Chain Monte Carlo 15
2.1 Monte Carlo integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Putting it together: Markov chain Monte Carlo . . . . . . . . . . . . . . . . 18
2.4 Metropolis Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 Metropolis-Hastings Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6 Metropolis-Hastings for Multivariate Distributions . . . . . . . . . . . . . . . 26
2.6.1 Blockwise updating . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6.2 Componentwise updating . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.7 Gibbs Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3 Basic concepts in Bayesian Data Analysis 39
3.1 Parameter Estimation Approaches . . . . . . . . . . . . . . . . . . . . . . . . 40
3.1.1 Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.1.2 Maximum a posteriori . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.1.3 Posterior Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 Example: Estimating a Weibull distribution . . . . . . . . . . . . . . . . . . 42
3.3 Example: Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4 Example: Mallows Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4 Directed Graphical Models 46
4.1 A Short Review of Probability Theory . . . . . . . . . . . . . . . . . . . . . 46
4.2 The Burglar Alarm Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.1 Conditional probability tables . . . . . . . . . . . . . . . . . . . . . . 49
4.2.2 Explaining away . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.3 Joint distributions and independence relationships . . . . . . . . . . . 52
4.3 Graphical Model Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.1 Example: Consensus Modeling with Gaussian variables . . . . . . . . 54
5 Approximate Inference in Graphical Models 57
5.1 Prior predictive distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2 Posterior distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2.1 Rejection Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2.2 MCMC Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2.3 Example: Posterior Inference for the consensus model with normally distributed variables . . . . . . . . . . . . . 63
5.2.4 Example: Posterior Inference for the consensus model with contaminants 66
6 Sequential Monte Carlo 69
6.1 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.1.1 Example HMM with discrete outcomes and states . . . . . . . . . . . 71
6.1.2 Viterbi Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.2 Bayesian Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.3 Particle Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.3.1 Sampling Importance Resampling (SIR) . . . . . . . . . . . . . . . . 74
6.3.2 Direct Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Note to Students
Exercises
This course book contains a number of exercises in which you are asked to run Matlab code, write new code, and produce graphical illustrations and answers to questions.
The exercises marked with ** are optional exercises that can be skipped when time is limited.
Organizing answers to exercises
It is helpful to maintain a document that organizes all the material related to the exercises.
Matlab can facilitate part of this organization using the “publish” option. For example, if
you have a Matlab script that produces a figure, you can publish the code as well as the figure
produced by the code to a single external document. You can find the publishing option in
the Matlab editor under the file menu. You can also use the publish function directly in the
command window. You can change the publish configuration (look under the file menu of
the editor window) to produce PDFs, Word documents, and a number of other file formats.
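As a minimal sketch of calling publish from the command window (the script name myscript.m is just a placeholder for one of your own scripts):

```matlab
% Publish a script, together with any figures it produces,
% to a single external document.
publish('myscript.m');            % default: HTML output in an 'html' subfolder
publish('myscript.m', 'pdf');     % same script, published as a PDF instead
```

The first argument is the script file to run and capture; the second (optional) argument selects the output format.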
For this course, however, you can hand in your answers in any way you see fit, and
sometimes it might be useful to just hand in handwritten answers.
Matlab documentation
It will probably happen many times that you will need to find the name of a Matlab function
or a description of the input and output variables for a given Matlab function. It is strongly
recommended to have the Matlab documentation running in a separate window for quick
consultation. You can access the Matlab documentation by typing doc in the command
window. For specific help on a given Matlab function, such as the function fprintf,
you can type doc fprintf to get a help screen in the Matlab documentation window or
help fprintf to get a description in the Matlab command window.
¹ Note: in this version, a new chapter 3 was inserted which is not fully completed yet.
Chapter 1
Sampling from Random Variables
Probabilistic models proposed by researchers are often too complicated for analytic
approaches. Increasingly, researchers rely on computational, numerical methods when
dealing with complex probabilistic models. By using a computational approach, the
researcher is freed from making unrealistic assumptions required by some analytic techniques
(e.g., normality and independence).
The key to most approximation approaches is the ability to sample from distributions.
Sampling is needed to predict how a particular model will behave under some set of cir-
cumstances, and to find appropriate values for the latent variables (“parameters”) when
applying models to experimental data. Most computational sampling approaches turn the
problem of sampling from complex distributions into subproblems involving simpler sampling
distributions. In this chapter, we will illustrate two sampling approaches: the inverse
transformation method and rejection sampling. These approaches are appropriate mostly for the
univariate case where we are dealing with single-valued outcomes. In the next chapter, we
discuss Markov chain Monte Carlo approaches that can operate efficiently with multivariate
distributions.
1.1 Standard distributions
Some distributions are used so often that they have become part of a standard set of distributions
supported by Matlab. The Matlab Statistics Toolbox supports a large number of probability
distributions. Using Matlab, it becomes quite easy to calculate the probability density and
cumulative density of these distributions, and to sample random values from these distributions.
Table 1.1 lists some of the standard distributions supported by Matlab. The Matlab
documentation lists many more distributions that can be simulated with Matlab. Using
online resources, it is often easy to find support for a number of other common distributions.
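As a minimal sketch of these standard functions, the following lines evaluate and sample a normal distribution using the Statistics Toolbox functions normpdf, normcdf, and normrnd (the particular values of µ = 100 and σ = 15 are chosen to match the IQ example below):

```matlab
mu    = 100;                          % mean of the normal distribution
sigma = 15;                           % standard deviation

p  = normpdf( 115, mu, sigma );       % probability density at x = 115
c  = normcdf( 115, mu, sigma );       % cumulative density at x = 115
xs = normrnd( mu, sigma, 1, 5 );      % draw 5 random values from the distribution
```

Each distribution in Table 1.1 follows the same naming pattern: a pdf function for the density, a cdf function for the cumulative density, and an rnd function for drawing random samples.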
To illustrate how we can use some of these functions, Listing 1.1 shows Matlab code that
visualizes the Normal(µ, σ) distribution where µ = 100 and σ = 15. To make things concrete,
imagine that this distribution represents the observed variability of IQ coefficients in some
population. The code shows how to display the probability density and the cumulative