Abstract
Optimal Transport (OT) is a mathematical gem at the interface between probability,
analysis and optimization. The goal of that theory is to define geometric tools that
are useful to compare probability distributions. Let us briefly sketch some key ideas
using a vocabulary that was first introduced by Monge two centuries ago: a probability
distribution can be thought of as a pile of sand. Peaks indicate where likely observations
are to appear. Given a pair of probability distributions—two different piles of sand—
there are, in general, multiple ways to morph, transport or reshape the first pile so that
it matches the second. To every such transport we associate an a “global” cost, using
the “local” consideration of how much it costs to move a single grain of sand from one
location to another. The goal of optimal transport is to find the least costly transport,
and use it to derive an entire geometric toolbox for probability distributions.
Despite this relatively abstract description, optimal transport theory answers many
basic questions related to the way our economy works: In the “mines and factories”
problem, the sand is distributed across an entire country, each grain of sand represents
a unit of a useful raw resource; the target pile indicates where those resources are
needed, typically in factories, where they are meant to be processed. In that scenario,
one seeks the least costly way to move all these resources, knowing the entire logistic
cost matrix needed to ship resources from any storage point to any factory.
Transporting optimally two abstract distributions is also extremely relevant for
mathematicians, in the sense that it defines a rich geometric structure on the space of
probability distributions. That structure is canonical in the sense that it borrows, in
arguably the most natural way, key geometric properties of the underlying “ground”
space on which these distributions are defined. For instance, when the underlying space
is Euclidean, key concepts such as interpolation, barycenters, convexity or gradients of
functions extend very naturally to distributions when endowed with an optimal trans-
port geometry. OT has a rich and varied history. Earlier contributions originated from
Monge’s work in the 18th century, to be later rediscovered under a different formalism
by Tolstoi in the 1920’s, Kantorovich, Hitchcock and Koopmans in the 1940’s. The
problem was solved numerically by Dantzig in 1949 and others in the 1950’s within the
framework of linear programming, paving the way for major industrial applications in
the second half of the 20th century. OT was later rediscovered under a different light
by analysts in the 90’s, following important work by Brenier and others, as well as in
the computer vision/graphics fields under the name of earth mover’s distances. Recent
years have witnessed yet another revolution in the spread of OT, thanks to the emer-
gence of approximate solvers that can scale to sizes and dimensions that are relevant
to data sciences. Thanks to this newfound scalability, OT is being increasingly used to
unlock various problems in imaging sciences (such as color or texture processing), com-
puter vision and graphics (for shape manipulation) or machine learning (for regression,