Probabilistic Graphical Models

Statistical learning refers to a set of tools for modeling and understanding complex datasets. It is a recently developed area in statistics and blends with parallel developments in computer science and, in particular, machine learning. The field encompasses many methods such as the lasso and sparse regression, classification and regression trees, and boosting and support vector machines. With the explosion of "Big Data" problems, statistical learning has become a very hot field in many scientific areas as well as marketing, finance, and other business disciplines. People with statistical learning skills are in high demand. One of the first books in this area, The Elements of Statistical Learning (ESL) (Hastie, Tibshirani, and Friedman), was published in 2001, with a second edition in 2009. ESL has become a popular text not only in statistics but also in related fields. One of the reasons for ESL's popularity is its relatively accessible style. But ESL is intended for individuals with advanced training in the mathematical sciences. An Introduction to Statistical Learning (ISL) arose from the perceived need for a broader and less technical treatment of these topics. In this new book, we cover many of the same topics as ESL, but we concentrate more on the applications of the methods and less on the mathematical details. We have created labs illustrating how to implement each of the statistical learning methods using the popular statistical software package R. These labs provide the reader with valuable hands-on experience.
…of the material in the earlier sections. The book begins with the topic of probabilistic inference. Inference refers to the problem of calculating the conditional probability distribution of a subset of the nodes in a graph given another subset of the nodes. Much effort has gone into the design of efficient and accurate inference algorithms. The book covers three categories of inference algorithms: exact algorithms, variational algorithms, and Monte Carlo algorithms. The first chapter, by Cowell, is a tutorial chapter that covers the basics of exact inference, with particular focus on the popular junction tree algorithm. This material should be viewed as basic for the understanding of graphical models. A second chapter by Cowell picks up where the former leaves off and covers advanced issues arising in exact inference. Kjærulff presents a method for increasing the efficiency of the junction tree algorithm. The basic idea is to take advantage of additional independencies which arise due to the particular messages arriving at a clique; this leads to a data structure known as a "nested junction tree." Dechter presents an alternative perspective on exact inference, based on the notion of "bucket elimination." This is a unifying perspective that provides insight into the relationship between junction tree and conditioning algorithms, and insight into space/time tradeoffs.

Variational methods provide a framework for the design of approximate inference algorithms. Variational algorithms are deterministic algorithms that provide bounds on probabilities of interest. The chapter by Jordan, Ghahramani, Jaakkola, and Saul is a tutorial chapter that provides a general overview of the variational approach, emphasizing the important role of convexity. The ensuing article by Jaakkola and Jordan proposes a new method for improving the mean field approximation (a particular form of variational approximation). In particular, the authors propose to use mixture distributions as approximating distributions within the mean field formalism.

The inference section closes with two chapters on Monte Carlo methods. Monte Carlo provides a general approach to the design of approximate algorithms based on stochastic sampling. MacKay's chapter is a tutorial presentation of Monte Carlo algorithms, covering simple methods such as rejection sampling and importance sampling, as well as more sophisticated methods based on Markov chain sampling. A key problem that arises with the Markov chain Monte Carlo approach is the tendency of the algorithms to exhibit random-walk behavior, which slows the convergence of the algorithms. Neal presents a new approach to this problem, showing how a sophisticated form of overrelaxation can cause the chain to move more systematically along surfaces of high probability.

The second section of the book addresses the issue of independence. Much of the aesthetic appeal of the graphical model formalism comes from the "Markov properties" that graphical models embody. A Markov property is a relationship between the separation properties of nodes in a graph (e.g., the notion that a subset of nodes is separated from another subset of nodes, given a third subset of nodes) and conditional independencies in the family of probability distributions associated with the graph (e.g., A is independent of B given C, where A, B, and C are subsets of random variables). In the case of directed graphs and undirected graphs the relationships are well understood (cf. Lauritzen, 1997).
Chain graphs, however, which are mixed graphs containing both directed and undirected edges, are less well understood. The chapter by Richardson explores two of the Markov properties that have been proposed for chain graphs and identifies natural "spatial" conditions on Markov properties that distinguish between these Markov properties and those for both directed and undirected graphs. Chain graphs appear to have a richer conditional independence semantics than directed and undirected graphs.

The chapter by Studeny and Vejnarova addresses the problem of characterizing stochastic dependence. Studeny and Vejnarova discuss the properties of the multiinformation function, a general information-theoretic function from which many useful quantities can be computed, including the conditional mutual information for all disjoint subsets of nodes in a graph.

The book then turns to the topic of learning. The section on Foundations for Learning contains two articles that cover fundamental concepts that are used in many of the following articles. The chapter by Heckerman is a tutorial article that covers many of the basic ideas associated with learning in graphical models. The focus is on Bayesian methods, both for parameter learning and for structure learning. Neal and Hinton discuss the expectation-maximization (EM) algorithm. EM plays an important role in the graphical model literature, tying together inference and learning problems. In particular, EM is a method for finding maximum likelihood (or maximum a posteriori) parameter values, by making explicit use of probabilistic inference (the "E step"). Thus EM-based approaches to learning generally make use of inference algorithms as subroutines. Neal and Hinton describe the EM algorithm as coordinate ascent in an appropriately defined cost function. This point of view allows them to consider algorithms that take partial E steps, and provides an important justification for the use of approximate inference algorithms in learning.

The section on Learning from Data contains a variety of papers concerned with the learning of parameters and structure in graphical models. Bishop provides an overview of latent variable models, focusing on probabilistic principal component analysis, mixture models, topographic maps, and time series analysis. EM algorithms are developed for each case. The article by Buhmann complements the Bishop article, describing methods for dimensionality reduction, clustering, and data visualization, again with the EM algorithm providing the conceptual framework for the design of the algorithms. Buhmann also presents learning algorithms based on approximate inference and deterministic annealing.

Friedman and Goldszmidt focus on the problem of representing and learning the local conditional probabilities for graphical models. In particular, they are concerned with representations for these probabilities that make explicit the notion of "context-specific independence," where, for example, A is independent of B for some values of C but not for others. This representation can lead to significantly more parsimonious models than standard techniques. Geiger, Heckerman, and Meek are concerned with the problem of model selection for graphical models with hidden (unobserved) nodes.
They develop asymptotic methods for approximating the marginal likelihood and demonstrate how to carry out the calculations for several cases of practical interest.

The paper by Hinton, Sallans, and Ghahramani describes a graphical model called the "hierarchical community of experts," in which a collection of local linear models are used to fit data. As opposed to mixture models, in which each data point is assumed to be generated from a single local model, their model allows a data point to be generated from an arbitrary subset of the available local models. Kearns, Mansour, and Ng provide a careful analysis of the relationships between EM and the K-means algorithm. They discuss an "information-modeling tradeoff," which characterizes the ability of an algorithm to both find balanced assignments of data to model components, and to find a good overall fit to the data.

Monti and Cooper discuss the problem of structural learning in networks with both discrete and continuous nodes. They are particularly concerned with the issue of the discretization of continuous data, and how this impacts the performance of a learning algorithm. Saul and Jordan present a method for unsupervised learning in layered neural networks based on mean field theory. They discuss a mean field approximation that is tailored to the case of large networks in which each node has a large number of parents. Smith and Whittaker discuss tests for conditional independence in graphical Gaussian models. They show that several of the appropriate statistics turn out to be functions of the sample partial correlation coefficient. They also develop asymptotic expansions for the distributions of the test statistics and compare their accuracy as a function of the dimensionality of the model. Spiegelhalter, Best, Gilks, and Inskip describe an application of graphical models to the real-life problem of assessing the effectiveness of an immunization program. They demonstrate the use of the graphical model formalism to represent statistical hypotheses of interest and show how Monte Carlo methods can be used for inference. Finally, Williams provides an overview of Gaussian processes, deriving the Gaussian process approach from a Bayesian point of view, and showing how it can be applied to problems in nonlinear regression, classification, and hierarchical modeling.

This volume arose from the proceedings of the International School on Neural Nets "E. R. Caianiello", held at the Ettore Maiorana Centre for Scientific Culture in Erice, Italy, in September 1996. Lecturers from the school contributed chapters to the volume, and additional authors were asked to contribute chapters to provide a more complete and authoritative coverage of the field. All of the chapters have been carefully edited, following a review process in which each chapter was scrutinized by two anonymous reviewers and returned to the authors for improvement.

There are a number of people to thank for their role in organizing the Erice meeting. First I would like to thank Maria Marinaro, who initiated the ongoing series of Schools to honor the memory of E. R. Caianiello, and who co-organized the first meeting. David Heckerman was also a co-organizer of the school, providing helpful advice and encouragement throughout. Anna Esposito at the University of Salerno also deserves sincere thanks for her help in organizing the meeting. The staff at the Ettore Maiorana Centre were exceedingly professional and helpful, initiating the attendees of the school into the wonders of Erice.
Funding for the School was provided by the NATO Advanced Study Institute program; this program provided generous support that allowed nearly 80 students to attend the meeting.

I would also like to thank Jon Heiner, Thomas Hofmann, Nuria Oliver, Barbara Rosario, and Jon Yi for their help with preparing the final document. Finally, I would like to thank Barbara Rosario, whose fortuitous attendance as a participant at the Erice meeting rendered the future conditionally independent of the past.

Michael I. Jordan

INTRODUCTION TO INFERENCE FOR BAYESIAN NETWORKS

ROBERT COWELL
City University, London
The School of Mathematics, Actuarial Science and Statistics, City University, Northampton Square, London EC1E 0HT

1. Introduction

The field of Bayesian networks, and graphical models in general, has grown enormously over the last few years, with theoretical and computational developments in many areas. As a consequence there is now a fairly large set of theoretical concepts and results for newcomers to the field to learn. This tutorial aims to give an overview of some of these topics, which hopefully will provide such newcomers a conceptual framework for following the more detailed and advanced work. It begins with a revision of some of the basic axioms of probability theory.

2. Basic axioms of probability

Probability theory, also known as inductive logic, is a system of reasoning under uncertainty, that is, under the absence of certainty. Within the Bayesian framework, probability is interpreted as a numerical measure of the degree of consistent belief in a proposition, consistency being with the data at hand.

Early expert systems used deductive, or Boolean, logic, encapsulated by sets of production rules. Attempts were made to cope with uncertainty using probability theory, but the calculations became prohibitive, and the use of probability theory for inference in expert systems was abandoned. It is with the recent development of efficient computational algorithms that probability theory has had a revival within the AI community.

Let us begin with some basic axioms of probability theory. The probability of an event A, denoted by P(A), is a number in the interval [0, 1] which obeys the following axioms:

1. P(A) = 1 if and only if A is certain.
2. If A and B are mutually exclusive, then P(A or B) = P(A) + P(B).

We will be dealing exclusively with discrete random variables and their probability distributions. Capital letters will denote a variable, or perhaps a set of variables; lower case letters will denote values of variables. Thus suppose A is a random variable having a finite number of mutually exclusive states (a_1, ..., a_n). Then P(A) will be represented by a vector of non-negative real numbers P(A) = (x_1, ..., x_n), where P(A = a_i) = x_i is a scalar and Σ_i x_i = 1.

A basic concept is that of conditional probability, a statement of which takes the form: given the event B = b, the probability of the event A = a is x, written P(A = a | B = b) = x. It is important to understand that this is not saying: "If B = b is true, then the probability of A = a is x." Instead it says: "If B = b is true, and any other information to hand is irrelevant to A, then P(A = a) = x." To see this, consider what the probabilities would be if the state of A was part of the extra information.
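To make this representation concrete, here is a minimal Python sketch of how such discrete distributions can be stored. The variable names, state counts, and numbers are illustrative assumptions, not taken from the text: a distribution P(A) is just a vector of non-negative numbers summing to one, and a conditional distribution P(B | A) is a table with one such vector per state of A.

```python
import numpy as np

# P(A) over three mutually exclusive states (a1, a2, a3):
# a vector of non-negative numbers x_i with P(A = a_i) = x_i and sum 1.
p_a = np.array([0.5, 0.3, 0.2])
assert np.isclose(p_a.sum(), 1.0)

# P(B | A) for a binary variable B: one probability vector over B
# per state of A, so each row sums to 1.
p_b_given_a = np.array([
    [0.9, 0.1],  # P(B | A = a1)
    [0.4, 0.6],  # P(B | A = a2)
    [0.2, 0.8],  # P(B | A = a3)
])
assert np.allclose(p_b_given_a.sum(axis=1), 1.0)
```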
Conditional probabilities are important for building Bayesian networks, as we shall see. But Bayesian networks are also built to facilitate the calculation of conditional probabilities, namely the conditional probabilities for variables of interest given the data (also called evidence) at hand.

The fundamental rule for probability calculus is the product rule

P(A and B) = P(A | B) P(B),

or, more generally, P(A and B | C) = P(A | B, C) P(B | C). This equation tells us how to combine conditional probabilities for individual variables to define joint probabilities for sets of variables.

3. Bayes' theorem

The simplest form of Bayes' theorem relates the joint probability P(A and B), written as P(A, B), of two events or hypotheses A and B in terms of marginal and conditional probabilities:

P(A, B) = P(A | B) P(B) = P(B | A) P(A).

By rearrangement we easily obtain

P(A | B) = P(B | A) P(A) / P(B),

which is Bayes' theorem.

This can be interpreted as follows. We are interested in A, and we begin with a prior probability P(A) for our belief about A, and then we observe B. Then Bayes' theorem tells us that our revised belief for A, the posterior probability P(A | B), is obtained by multiplying the prior P(A) by the ratio P(B | A) / P(B). The quantity P(B | A), as a function of varying A for fixed B, is called the likelihood of A. We can express this relationship in the form:

posterior ∝ prior × likelihood,
P(A | B) ∝ P(A) P(B | A).

Figure 1 illustrates this prior-to-posterior inference process. [Figure 1. Bayesian inference as reversing the arrows: three two-node diagrams over A and B, factored as P(A) P(B | A), as the joint P(A, B), and as P(B) P(A | B).] Each diagram represents in different ways the joint distribution P(A, B); the first represents the prior beliefs while the third represents the posterior beliefs. Often we will think of A as a possible "cause" of the "effect" B; the downward arrow represents such a causal interpretation. The "inferential" upward arrow then represents an argument against the causal flow, from the observed effect to the inferred cause. (We will not go into a definition of "causality" here.)

Bayesian networks are generally more complicated than the ones in Figure 1, but the general principles are the same in the following sense. A Bayesian network provides a model representation for the joint distribution of a set of variables in terms of conditional and prior probabilities, in which the orientations of the arrows represent influence, usually though not always of a causal nature, such that the conditional probabilities for these particular orientations are relatively straightforward to specify (from data or by eliciting from an expert). When data are observed, an inference procedure is typically required. This involves calculating marginal probabilities conditional on the observed data using Bayes' theorem, which is diagrammatically equivalent to reversing one or more of the Bayesian network arrows. The algorithms which have been developed in recent years allow these calculations to be performed in an efficient and straightforward manner.
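As a concrete illustration of this prior-to-posterior update, the following is a minimal Python sketch; the particular numbers, the two-state variables, and the variable names are invented for illustration and are not taken from the text. It computes P(A | B) from a prior P(A) and a likelihood P(B | A) exactly as in the proportionality above, obtaining P(B) by normalization.

```python
import numpy as np

# Prior P(A) over two states of A.
prior = np.array([0.7, 0.3])          # P(A = a1), P(A = a2)

# Likelihood P(B = b | A) for the single observed event B = b.
likelihood = np.array([0.1, 0.8])     # P(B = b | A = a1), P(B = b | A = a2)

# posterior ∝ prior × likelihood; dividing by P(B = b) normalizes.
unnormalized = prior * likelihood
p_b = unnormalized.sum()              # P(B = b) = Σ_A P(B = b | A) P(A)
posterior = unnormalized / p_b        # P(A | B = b)

print(p_b)        # 0.31
print(posterior)  # [0.2258... 0.7741...]
```

Observing B = b shifts the belief in a2 from 0.3 to roughly 0.77, because a2 makes the observation eight times more likely than a1 does.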
4. Simple inference problems

Let us now consider some simple examples of inference. The first is simply Bayes' theorem with evidence included on a simple two-node network; the remaining examples treat a simple three-node problem.

4.1. PROBLEM I

Suppose we have the simple model X → Y, and are given P(X), P(Y | X), and the observation Y = y. The problem is to calculate P(X | Y = y). Now from P(X) and P(Y | X) we can calculate the marginal distribution P(Y), and hence P(Y = y). Applying Bayes' theorem we obtain

P(X | Y = y) = P(Y = y | X) P(X) / P(Y = y).

4.2. PROBLEM II

Suppose now we have a more complicated model in which X is a parent of both Y and Z: Z ← X → Y, with specified probabilities P(X), P(Y | X), and P(Z | X), and we observe Y = y. The problem is to calculate P(Z | Y = y). Note that the joint distribution is given by P(X, Y, Z) = P(Y | X) P(Z | X) P(X). A "brute force" method is to calculate:

1. The joint distribution P(X, Y, Z).
2. The marginal distribution P(Y), and thence P(Y = y).
3. The marginal distribution P(Z, Y), and thence P(Z, Y = y).
4. P(Z | Y = y) = P(Z, Y = y) / P(Y = y).

An alternative method is to exploit the given factorization:

1. Calculate P(X | Y = y) = P(Y = y | X) P(X) / P(Y = y) using Bayes' theorem, where P(Y = y) = Σ_X P(Y = y | X) P(X).
2. Find P(Z | Y = y) = Σ_X P(Z | X) P(X | Y = y).

Note that the first step essentially reverses the arrow between X and Y. Although the two methods give the same answer, the second is generally more efficient. For example, suppose that all three variables have 10 states. Then the first method, in explicitly calculating P(X, Y, Z), requires a table of 1000 states. In contrast, the largest table required for the second method has size 100. This gain in computational efficiency by exploiting the given factorization is the basis of the arc-reversal method for solving influence diagrams.
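The following Python sketch carries out both procedures for the Z ← X → Y model and confirms they agree. The 10 states per variable come from the example above; the randomly generated tables, the seed, and the observed state are illustrative assumptions. The brute-force route materializes the full 10 × 10 × 10 joint table, while the factorized route never needs more than a 10 × 10 table.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10  # states per variable, as in the example above

def random_dist(shape):
    """Random non-negative table, normalized over its last axis."""
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

p_x = random_dist((n,))            # P(X)
p_y_given_x = random_dist((n, n))  # P(Y | X), rows indexed by X
p_z_given_x = random_dist((n, n))  # P(Z | X), rows indexed by X
y = 3                              # observed state Y = y (illustrative)

# Brute force: build the full joint P(X, Y, Z), a 1000-entry table.
joint = p_x[:, None, None] * p_y_given_x[:, :, None] * p_z_given_x[:, None, :]
p_y_obs = joint[:, y, :].sum()          # P(Y = y)
p_z_and_y = joint[:, y, :].sum(axis=0)  # P(Z, Y = y)
p_z_given_y_brute = p_z_and_y / p_y_obs

# Factorized: reverse the X -> Y arrow first, then sum out X.
p_x_given_y = p_y_given_x[:, y] * p_x       # ∝ P(Y = y | X) P(X)
p_x_given_y /= p_x_given_y.sum()            # P(X | Y = y)
p_z_given_y_fact = p_z_given_x.T @ p_x_given_y  # Σ_X P(Z | X) P(X | Y = y)

assert np.allclose(p_z_given_y_brute, p_z_given_y_fact)
```

Both routes return the same distribution P(Z | Y = y); the difference is only in the size of the intermediate tables, which is exactly the space/time saving that arc reversal and, more generally, the junction tree and bucket elimination algorithms exploit.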
