This page represents a patchwork of sources I have consulted to deepen my understanding of **Optimal Transport** and its associated applications to Machine Learning & Deep Learning.
From: Peyré, Gabriel, and Marco Cuturi. “Computational Optimal Transport.” arXiv, March 18, 2020. [http://arxiv.org/abs/1803.00567](http://arxiv.org/abs/1803.00567).
## Introduction
Optimal transport (OT) theory can be informally described using the words of the French mathematician Gaspard Monge (1746–1818): A worker with a shovel in hand has to move a large pile of sand lying on a construction site. The goal of the worker is to erect with all that sand a target pile with a prescribed shape (for example, that of a giant sand castle). Naturally, the worker wishes to minimize her total effort, quantified for instance as the total distance or time spent carrying shovelfuls of sand.

Mathematicians interested in OT cast that problem as that of comparing two probability distributions — two different piles of sand of the same volume. They consider all of the many possible ways to morph, transport or reshape the first pile into the second, and associate a “global” cost to every such transport, using the “local” consideration of how much it costs to move a grain of sand from one place to another. Mathematicians are interested in the properties of that least costly transport, as well as in its efficient computation.

That smallest cost not only defines a distance between distributions, but it also entails a rich geometric structure on the space of probability distributions. That structure is canonical in the sense that it borrows key geometric properties of the underlying “ground” space on which these distributions are defined. For instance, when the underlying space is Euclidean, key concepts such as interpolation, barycenters, convexity or gradients of functions extend naturally to the space of distributions endowed with an OT geometry.
OT has been (re)discovered in many settings and under different forms, giving it a rich history. While Monge’s seminal work was motivated by an engineering problem, Tolstoi in the 1920s and Hitchcock, Kantorovich and Koopmans in the 1940s established its significance to logistics and economics. Dantzig solved it numerically in 1949 within the framework of linear programming, giving OT a firm footing in optimization. OT was later revisited by analysts in the 1990s, notably Brenier, while also gaining fame in computer vision under the name of earth mover’s distances. Recent years have witnessed yet another revolution in the spread of OT, thanks to the emergence of approximate solvers that can scale to large problem dimensions. As a consequence, OT is being increasingly used to unlock various problems in imaging sciences (such as color or texture processing), graphics (for shape manipulation) or machine learning (for regression, classification and generative modeling).
The shortest path principle guides most decisions in life and sciences: When a commodity, a person or a single bit of information is available at a given point and needs to be sent to a target point, one should favor using the least possible effort. This is typically reached by moving an item along a straight line when in the plane or along geodesic curves in more involved metric spaces. The theory of optimal transport generalizes that intuition in the case where, instead of moving only one item at a time, one is concerned with the problem of moving simultaneously several items (or a continuous distribution thereof) from one configuration onto another.
At the turn of the millennium, researchers in computer, imaging and more generally data sciences understood that optimal transport theory provided very powerful tools to study distributions in a different and more abstract context, that of comparing distributions readily available to them in the form of bags-of-features or descriptors.
## Assignment and Monge Problem
Given a cost matrix $(C_{i,j})_{i \in [n],j \in [m]}$, assuming $n = m$, the optimal assignment problem seeks a bijection $\sigma$ in the set $\text{Perm}(n)$ of permutations of $n$ elements solving
$\min_{\sigma \in \text{Perm}(n)} \frac{1}{n} \sum_{i=1}^{n} C_{i,\sigma(i)}.$
One could naively evaluate the cost function above using all permutations in the set $\text{Perm}(n)$. However, that set has size $n!$, which is gigantic even for small $n$. Consider, for instance, that such a set has more than $10^{100}$ elements [Dantzig, 1983] when $n$ is as small as 70. That problem can therefore be solved only if there exist efficient algorithms to optimize that cost function over the set of permutations.
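The contrast between the naive enumeration and an efficient algorithm can be made concrete with a small sketch. The brute-force search below enumerates all $n!$ permutations, while SciPy's `linear_sum_assignment` (an implementation of a Hungarian-style $O(n^3)$ algorithm) solves the same problem directly; the random cost matrix and the choice of $n = 6$ are illustrative assumptions, and SciPy is assumed to be installed.

```python
import itertools
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n = 6  # small enough that enumerating n! = 720 permutations is feasible
C = rng.random((n, n))  # illustrative random cost matrix C[i, j]

# Naive approach: evaluate the cost of every permutation sigma.
brute_cost = min(
    sum(C[i, sigma[i]] for i in range(n))
    for sigma in itertools.permutations(range(n))
)

# Efficient approach: Hungarian-style solver in O(n^3).
rows, cols = linear_sum_assignment(C)
solver_cost = C[rows, cols].sum()

# Both methods find the same optimal assignment cost.
assert np.isclose(brute_cost, solver_cost)
print(brute_cost, solver_cost)
```

For $n = 70$ the enumeration is hopeless (more than $10^{100}$ permutations), while the polynomial-time solver remains instantaneous.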
### Uniqueness
Note that the optimal assignment problem may have several optimal solutions. Suppose, for instance, that $n = m = 2$ and that the matrix $C$ is the pairwise distance matrix between the four corners of a 2-D square of side length 1, as represented in the left plot of Figure 2.2. In that case only two assignments exist, and they are both optimal.
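This non-uniqueness can be verified numerically. The sketch below places two source points and two target points on the corners of a unit square (the exact corner layout is an assumption consistent with the description of Figure 2.2) and checks that both possible assignments have identical cost.

```python
import itertools
import numpy as np

# Two source and two target points on the corners of a unit square
# (illustrative layout: sources on one diagonal, targets on the other).
sources = np.array([[0.0, 0.0], [1.0, 1.0]])
targets = np.array([[1.0, 0.0], [0.0, 1.0]])

# Pairwise Euclidean cost matrix C[i, j] = ||x_i - y_j||.
C = np.linalg.norm(sources[:, None, :] - targets[None, :, :], axis=-1)

# Evaluate both permutations of {0, 1}: each assignment moves two
# grains a unit distance, so both are optimal.
costs = [sum(C[i, sigma[i]] for i in range(2))
         for sigma in itertools.permutations(range(2))]
print(costs)  # the two assignment costs are equal
```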
### Monge Problem between Discrete Measures
![[Pasted image 20240121230012.png]]
For discrete measures
$\alpha = \sum_{i=1}^{n} a_i \delta_{x_i}, \quad \text{and} \quad \beta = \sum_{j=1}^{m} b_j \delta_{y_j},$
the Monge problem [1781] seeks a map that associates to each point $x_i$ a single point $y_j$ and which must push the mass of $\alpha$ toward the mass of $\beta$, namely, such a map $T: \{x_1, \ldots, x_n\} \rightarrow \{y_1, \ldots, y_m\}$ must verify that
$\forall j \in [m], \quad b_j = \sum_{i:T(x_i)=y_j} a_i,$
which we write in compact form as $T_\# \alpha = \beta$. Because all the elements of $b$ are positive, that map is necessarily surjective. This map should minimize some transportation cost, which is parameterized by a function $c(x, y)$ defined for points $(x, y) \in X \times Y$,
$\min_{T} \left\{ \sum_{i} c(x_i, T(x_i)) : T_\# \alpha = \beta \right\}.$
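The mass-conservation constraint $T_\# \alpha = \beta$ is easy to check for discrete measures: for each target point $y_j$, sum the weights $a_i$ of every source point mapped onto it and compare with $b_j$. A minimal sketch, with illustrative weights and a candidate map chosen here for demonstration:

```python
from collections import defaultdict

import numpy as np

# Illustrative discrete measures: weights a over points x, weights b over y.
a = np.array([0.25, 0.25, 0.5])  # source weights (sum to 1)
b = np.array([0.5, 0.5])         # target weights (sum to 1)

# A candidate Monge map given by indices: T[i] = j means T(x_i) = y_j.
T = np.array([0, 0, 1])          # x_0, x_1 -> y_0 ; x_2 -> y_1

def pushes_forward(T, a, b):
    """Check T#alpha = beta: for every j, b_j must equal the sum of
    a_i over all i with T(x_i) = y_j."""
    mass = defaultdict(float)
    for i, j in enumerate(T):
        mass[j] += a[i]
    return all(np.isclose(mass[j], b[j]) for j in range(len(b)))

print(pushes_forward(T, a, b))  # True: this map is admissible
```

Only maps satisfying this constraint are admissible in the minimization above; the Monge problem then selects, among them, one minimizing $\sum_i c(x_i, T(x_i))$.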
Such a map between discrete points can of course be encoded, assuming all $x