This page represents a patchwork of sources I have consulted to deepen my understanding of **Optimal Transport** and its associated applications to Machine Learning & Deep Learning.

From: Peyré, Gabriel, and Marco Cuturi. "Computational Optimal Transport." arXiv, March 18, 2020. [http://arxiv.org/abs/1803.00567](http://arxiv.org/abs/1803.00567).

## Introduction

Optimal transport (OT) theory can be informally described using the words of the French mathematician Gaspard Monge (1746–1818): A worker with a shovel in hand has to move a large pile of sand lying on a construction site. The goal of the worker is to erect with all that sand a target pile with a prescribed shape (for example, that of a giant sand castle). Naturally, the worker wishes to minimize her total effort, quantified for instance as the total distance or time spent carrying shovelfuls of sand.

Mathematicians interested in OT cast that problem as that of comparing two probability distributions: two different piles of sand of the same volume. They consider all of the many possible ways to morph, transport or reshape the first pile into the second, and associate a "global" cost to every such transport, using the "local" consideration of how much it costs to move a grain of sand from one place to another. Mathematicians are interested in the properties of that least costly transport, as well as in its efficient computation. That smallest cost not only defines a distance between distributions, but it also entails a rich geometric structure on the space of probability distributions. That structure is canonical in the sense that it borrows key geometric properties of the underlying "ground" space on which these distributions are defined. For instance, when the underlying space is Euclidean, key concepts such as interpolation, barycenters, convexity or gradients of functions extend naturally to the space of distributions endowed with an OT geometry.

OT has been (re)discovered in many settings and under different forms, giving it a rich history. While Monge's seminal work was motivated by an engineering problem, Tolstoi in the 1920s and Hitchcock, Kantorovich and Koopmans in the 1940s established its significance to logistics and economics. Dantzig solved it numerically in 1949 within the framework of linear programming, giving OT a firm footing in optimization. OT was later revisited by analysts in the 1990s, notably Brenier, while also gaining fame in computer vision under the name of earth mover's distances. Recent years have witnessed yet another revolution in the spread of OT, thanks to the emergence of approximate solvers that can scale to large problem dimensions. As a consequence, OT is being increasingly used to unlock various problems in imaging sciences (such as color or texture processing), graphics (for shape manipulation) or machine learning (for regression, classification and generative modeling).

The shortest path principle guides most decisions in life and sciences: when a commodity, a person or a single bit of information is available at a given point and needs to be sent to a target point, one should favor using the least possible effort. This is typically achieved by moving an item along a straight line when in the plane, or along geodesic curves in more involved metric spaces.
The theory of optimal transport generalizes that intuition in the case where, instead of moving only one item at a time, one is concerned with the problem of moving simultaneously several items (or a continuous distribution thereof) from one configuration onto another. At the turn of the millennium, researchers in computer, imaging and more generally data sciences understood that optimal transport theory provided very powerful tools to study distributions in a different and more abstract context, that of comparing distributions readily available to them under the form of bags-of-features or descriptors.

## Assignment and Monge Problem

Given a cost matrix $(C_{i,j})_{i \in [n], j \in [m]}$, assuming $n = m$, the optimal assignment problem seeks a bijection $\sigma$ in the set $\text{Perm}(n)$ of permutations of $n$ elements solving $\min_{\sigma \in \text{Perm}(n)} \frac{1}{n} \sum_{i=1}^{n} C_{i,\sigma(i)}.$ One could naively evaluate the cost function above using all permutations in the set $\text{Perm}(n)$. However, that set has size $n!$, which is gigantic even for small $n$. Consider, for instance, that such a set has more than $10^{100}$ elements [Dantzig, 1983] when $n$ is as small as 70. That problem can therefore be solved only if there exist efficient algorithms to optimize that cost function over the set of permutations.

### Uniqueness

Note that the optimal assignment problem may have several optimal solutions. Suppose, for instance, that $n = m = 2$ and that the matrix $C$ is the pairwise distance matrix between the four corners of a 2-D square of side length 1, as represented in the left plot of Figure 2.2. In that case only two assignments exist, and they are both optimal.

### Monge Problem between Discrete Measures

![[Pasted image 20240121230012.png]]

For discrete measures $\alpha = \sum_{i=1}^{n} a_i \delta_{x_i}, \quad \text{and} \quad \beta = \sum_{j=1}^{m} b_j \delta_{y_j},$ the Monge problem [1781] seeks a map that associates to each point $x_i$ a single point $y_j$ and which must push the mass of $\alpha$ toward the mass of $\beta$, namely, such a map $T: \{x_1, \ldots, x_n\} \rightarrow \{y_1, \ldots, y_m\}$ must verify that $\forall j \in [m], \quad b_j = \sum_{i:T(x_i)=y_j} a_i,$ which we write in compact form as $T_\# \alpha = \beta$. Because all the elements of $b$ are positive, that map is necessarily surjective. This map should minimize some transportation cost, which is parameterized by a function $c(x, y)$ defined for points $(x, y) \in X \times Y$, $\min_{T} \left\{ \sum_{i} c(x_i, T(x_i)) : T_\# \alpha = \beta \right\}.$ Such a map between discrete points can of course be encoded, assuming all $x_i$'s and $y_j$'s are distinct, using indices $\sigma: [n] \rightarrow [m]$ so that $j = \sigma(i)$, and the mass conservation is written as $\sum_{i \in \sigma^{-1}(j)} a_i = b_j,$ where the inverse $\sigma^{-1}(j)$ is to be understood as the preimage set of $j$. In the special case when $n = m$ and all weights are uniform, that is, $a_i = b_j = \frac{1}{n}$, the mass conservation constraint implies that $T$ is a bijection, such that $T(x_i) = y_{\sigma(i)}$, and the Monge problem is equivalent to the optimal matching problem (2.2), where the cost matrix is $C_{i,j} \stackrel{\text{def}}{=} c(x_i, y_j).$
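In that uniform special case, the Monge problem can be solved directly with an assignment solver. Below is a minimal sketch using SciPy's `scipy.optimize.linear_sum_assignment`, a polynomial-time assignment solver, so no enumeration of the $n!$ permutations is needed; the random points and the choice of squared Euclidean cost are illustrative, not part of the text above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Illustrative data: n = m = 5 points in the plane, uniform weights 1/n.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))  # source points x_1, ..., x_n
y = rng.normal(size=(5, 2))  # target points y_1, ..., y_n

# Cost matrix C_{i,j} = c(x_i, y_j), here with c the squared Euclidean distance.
C = cdist(x, y, metric="sqeuclidean")

# Solve min_{sigma in Perm(n)} (1/n) sum_i C_{i, sigma(i)}.
rows, sigma = linear_sum_assignment(C)
print("optimal assignment:", sigma)
print("average cost:", C[rows, sigma].mean())  # the 1/n factor in the objective
```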
When $n \neq m$, note that, optimality aside, Monge maps may not even exist from one discrete measure to another. This happens when their weight vectors are not compatible, which is always the case when the target measure has more points than the source measure, $n < m$. For instance, the right plot in the figure above shows an (optimal) Monge map between $\alpha$ and $\beta$, but there is no Monge map from $\beta$ to $\alpha$.

### Push-Forward Operator

For a continuous map $T: \mathcal{X} \rightarrow \mathcal{Y}$, we define its corresponding push-forward operator $T_\#: \mathcal{M}(\mathcal{X}) \rightarrow \mathcal{M}(\mathcal{Y})$. For discrete measures (2.1), the push-forward operation consists simply in moving the positions of all the points in the support of the measure, $T_\#\alpha \stackrel{\text{def}}{=} \sum_i a_i\delta_{T(x_i)}.$ For more general measures, for instance, for those with a density, the notion of push-forward plays a fundamental role to describe the spatial modification (or transport) of a probability measure. The formal definition reads as follows: $T_\#\alpha$ is the measure satisfying $T_\#\alpha(B) = \alpha(T^{-1}(B))$ for every measurable set $B \subseteq \mathcal{Y}$.

## Kantorovich Relaxation

The assignment problem, and its generalization found in the Monge problem, is not always relevant to studying discrete measures, such as those found in practical problems. Indeed, because the assignment problem is formulated as a permutation problem, it can only be used to compare *uniform histograms of the same size*. A direct generalization to discrete measures with nonuniform weights can be carried out using Monge's formalism of push-forward maps, but that formulation may also be degenerate in the absence of feasible solutions satisfying the mass conservation constraint. Additionally, the assignment problem is combinatorial, and the feasible set for the Monge problem, despite being continuously parameterized as the set of all push-forward maps that satisfy the mass conservation constraint, is *nonconvex*. Both are therefore difficult to solve when approached in their original formulation.

The key idea of Kantorovich [1942] is to relax the deterministic nature of transportation, namely the fact that a source point $x_i$ can only be assigned to a single other point or location $y_{\sigma(i)}$ or $T(x_i)$. Kantorovich proposes instead that the mass at any point $x_i$ be potentially dispatched across several locations. Kantorovich moves away from the idea that mass transportation should be *deterministic* to consider instead a *probabilistic* transport, which allows what is commonly known as *mass splitting* from a source toward several targets. This flexibility is encoded using, in place of a permutation $\sigma$ or a map $T$, a coupling matrix $P \in \mathbb{R}_{+}^{n \times m}$, where $P_{i,j}$ describes the amount of mass flowing from bin $i$ toward bin $j$, or from the mass found at $x_i$ toward $y_j$ in the formalism of discrete measures. Admissible couplings admit a far simpler characterization than Monge maps, $U(a,b) \stackrel{\text{def}}{=} \left\{ P \in \mathbb{R}_{+}^{n \times m} : P1_m = a \text{ and } P^T1_n = b \right\},$ where we used the following matrix-vector notation: $P1_m = \left( \sum_j P_{i,j} \right)_i \in \mathbb{R}^n \text{ and } P^T1_n = \left( \sum_i P_{i,j} \right)_j \in \mathbb{R}^m.$ The set of matrices $U(a,b)$ is bounded and defined by $n + m$ equality constraints, and therefore is a convex polytope (the convex hull of a finite set of matrices) [Brualdi, 2006, §8.1]. Additionally, whereas the Monge formulation (as illustrated in the right plot of Figure 2.2) was intrinsically asymmetric, Kantorovich's relaxed formulation is always symmetric, in the sense that a coupling $P$ is in $U(a,b)$ if and only if $P^T$ is in $U(b,a)$.

Kantorovich's optimal transport problem now reads $L_c(a,b) \stackrel{\text{def}}{=} \min_{P \in U(a,b)} \langle C, P \rangle \stackrel{\text{def}}{=} \sum_{i,j} C_{i,j}P_{i,j}.$ This is a linear program, and as is usually the case with such programs, its optimal solutions are not necessarily unique.
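Because the problem is a linear program over the polytope $U(a,b)$, any generic LP solver can compute $L_c(a,b)$ directly. Here is a minimal sketch using `scipy.optimize.linprog`; the `kantorovich_lp` helper name is mine, and dedicated OT solvers (e.g. in the POT package) are much faster in practice, but the plain LP form makes the structure transparent.

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich_lp(a, b, C):
    """Minimize <C, P> over the transportation polytope U(a, b).

    a: (n,) source weights, b: (m,) target weights (both summing to one),
    C: (n, m) cost matrix. Returns the optimal coupling P and its cost.
    """
    n, m = C.shape
    # Row-marginal constraints P 1_m = a: row i sums the entries P_{i,1..m}.
    row_marg = np.kron(np.eye(n), np.ones((1, m)))
    # Column-marginal constraints P^T 1_n = b: row j sums the entries P_{1..n,j}.
    col_marg = np.kron(np.ones((1, n)), np.eye(m))
    res = linprog(
        C.ravel(),                            # objective <C, P> on vectorized P
        A_eq=np.vstack([row_marg, col_marg]),
        b_eq=np.concatenate([a, b]),
        bounds=(0, None),                     # P_{i,j} >= 0
        method="highs",
    )
    return res.x.reshape(n, m), res.fun
```

Mass splitting is visible directly in the output: a row of the optimal $P$ may have several nonzero entries, something no permutation or Monge map can express.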
### Mines and Factories

The Kantorovich problem finds a very natural illustration in the following resource allocation problem (see also Hitchcock [1941]). Suppose that an operator runs $n$ warehouses and $m$ factories. Each warehouse contains a valuable raw material that is needed by the factories to run properly. More precisely, each warehouse is indexed with an integer $i$ and contains $a_i$ units of the raw material. These raw materials must all be moved to the factories, with a prescribed quantity $b_j$ needed at factory $j$ to function properly. To transfer resources from a warehouse $i$ to a factory $j$, the operator can use a transportation company that will charge $C_{i,j}$ to move a single unit of the resource from location $i$ to location $j$. We assume that the transportation company has a monopoly on the transport of goods and applies the same linear pricing scheme to all actors of the economy: the cost of shipping a unit of the resource from $i$ to $j$ is equal to $C_{i,j}$. Faced with the problem described above, the operator chooses to solve the linear program to obtain a transportation plan $P^*$ that quantifies for each pair $i, j$ the amount of goods $P_{i,j}$ that must be transported from warehouse $i$ to factory $j$. The operator pays on aggregate a total of $\langle P^*, C \rangle$ to the transportation company to execute that plan.

![[Pasted image 20240121235359.png]]

### Kantorovich Problem between Discrete Measures

For discrete measures $\alpha, \beta$ of the form (2.3), we store in the matrix $C$ all pairwise costs between points in the supports of $\alpha, \beta$, namely $C_{i,j} \stackrel{\text{def}}{=} c(x_i, y_j)$, to define $L_c(\alpha, \beta) \stackrel{\text{def}}{=} L_c(a, b).$ Therefore, the Kantorovich formulation of optimal transport between discrete measures is the same as the problem between their associated probability weight vectors $a, b$ except that the cost matrix $C$ depends on the support of $\alpha$ and $\beta$. The notation $L_c(\alpha, \beta)$, however, is useful in some situations, because it makes explicit the dependency with respect to *both* probability weights and supporting points, the latter being exclusively considered through the cost function $c$.

![[Pasted image 20240121235938.png]]
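To make the definition concrete, the sketch below builds $C$ from the supports of two discrete measures and reuses the hypothetical `kantorovich_lp` helper from above; the points, weights and the choice $c(x,y) = \|x - y\|^2$ are all illustrative.

```python
import numpy as np

# Reuses kantorovich_lp from the sketch above; all data here is illustrative.
x = np.array([[0.0, 0.0], [1.0, 0.0]])              # support of alpha (n = 2)
y = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])  # support of beta (m = 3)
a = np.array([0.5, 0.5])                            # weights of alpha
b = np.array([0.4, 0.4, 0.2])                       # weights of beta

# The supports enter the problem only through C_{i,j} = c(x_i, y_j).
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)

P, cost = kantorovich_lp(a, b, C)   # cost = L_c(alpha, beta) = L_c(a, b)
print(P)
```

Note that with $n < m$ no Monge map from $\alpha$ to $\beta$ exists here, yet the relaxed problem is feasible: at least one row of $P$ must split its mass across several columns.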
### Using Optimal Assignments and Couplings (Applications)

The optimal transport plan itself (either as a coupling $P$ or a Monge map $T$ when it exists) has found many applications in data sciences, and in particular image processing. It has, for instance, been used for contrast equalization [Delon, 2004] and texture synthesis [Gutierrez et al., 2017]. A significant part of the applications of OT to imaging sciences concerns image matching [Zhu et al., 2007, Wang et al., 2013, Museyko et al., 2009, Li et al., 2013], image fusion [Courty et al., 2016], medical imaging [Wang et al., 2011], shape registration [Makihara and Yagi, 2010, Lai and Zhao, 2017, Su et al., 2015], and image watermarking [Mathon et al., 2014]. In astrophysics, OT has been used for reconstructing the early universe [Frisch et al., 2002]. OT has also been used for music transcription [Flamary et al., 2016], and finds numerous applications in economics to interpret matching data [Galichon, 2016]. Lastly, let us note that the computation of transportation maps computed using OT techniques (or inspired by them) is also useful to perform sampling [Reich, 2013, Oliver, 2014] and Bayesian inference [Kim et al., 2013, El Moselhy and Marzouk, 2012].

## Metric Properties of Optimal Transport

An important feature of OT is that it defines a distance between histograms and probability measures as soon as the cost matrix satisfies certain suitable properties. Indeed, OT can be understood as a canonical way to lift a ground distance between points to a distance between histograms or measures. We first consider the case where, using a term first introduced by Rubner et al. [2000], the "ground metric" matrix $C$ is fixed, representing substitution costs between bins, and shared across several histograms we would like to compare. The following proposition states that OT provides a valid distance between histograms supported on these bins.

We suppose $n = m$ and that for some $p \geq 1$, $C = D^p = (D_{i,j}^p)_{i,j} \in \mathbb{R}^{n \times n}$, where $D \in \mathbb{R}_{+}^{n \times n}$ is a distance on $[n]$, i.e.

1. $D \in \mathbb{R}_{+}^{n \times n}$ is symmetric;
2. $D_{i,j} = 0$ if and only if $i = j$;
3. $\forall (i, j, k) \in [n]^3, D_{i,k} \leq D_{i,j} + D_{j,k}$.

Then $W_p(a, b) \stackrel{\text{def}}{=} L_{D^p}(a, b)^{1/p}$ (note that $W_p$ depends on $D$) defines the $p$-Wasserstein distance on $\Sigma_n$, i.e. $W_p$ is symmetric, nonnegative, $W_p(a, b) = 0$ if and only if $a = b$, and it satisfies the triangle inequality $\forall a, b, c \in \Sigma_n, \ W_p(a, c) \leq W_p(a, b) + W_p(b, c).$

**(The cases $0 < p \leq 1$).** Note that if $0 < p \leq 1$, then $D^p$ is itself a distance. This implies that while for $p \geq 1$, $W_p(a, b)$ is a distance, in the case $p \leq 1$ it is actually $W_p(a, b)^p$ which defines a distance on the simplex.
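Here is a small sketch of the proposition, reusing the hypothetical `kantorovich_lp` helper from earlier: pick a ground distance $D$ on the bins, raise it entrywise to the power $p$, solve the OT problem, and take the $1/p$-th root of the optimal cost. The 1-D grid of bins and the two histograms are illustrative.

```python
import numpy as np

# Reuses kantorovich_lp from the earlier sketch; data is illustrative.
n, p = 6, 2
grid = np.arange(n) / n                     # bins placed on a 1-D grid
D = np.abs(grid[:, None] - grid[None, :])   # ground distance on [n]

a = np.array([0.3, 0.2, 0.1, 0.1, 0.2, 0.1])
b = np.array([0.1, 0.1, 0.2, 0.3, 0.2, 0.1])

_, cost = kantorovich_lp(a, b, D ** p)      # cost = L_{D^p}(a, b)
W_p = cost ** (1 / p)                       # p-Wasserstein distance
print(W_p)
```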
### Applications of Wasserstein distances

The fact that the OT distance automatically "lifts" a ground metric between bins to a metric between histograms on such bins makes it a method of choice for applications in computer vision and machine learning to compare histograms. In these fields, a classical approach is to "pool" local features (for instance, image descriptors) and compute a histogram of the empirical distribution of features (a so-called bag of features) to perform retrieval, clustering or classification; see, for instance, [Oliva and Torralba, 2001]. Along a similar line of ideas, OT distances can be used over some lifted feature spaces to perform signal and image analysis [Thorpe et al., 2017]. Applications to retrieval and clustering were initiated by the landmark paper [Rubner et al., 2000], with renewed applications following faster algorithms for thresholded cost matrices $C$ that fit some applications, for example, in computer vision [Pele and Werman, 2008, 2009]. More recent applications stress the use of the earth mover's distance for bags-of-words, either to carry out dimensionality reduction [Rolet et al., 2016] and classify texts [Kusner et al., 2015, Huang et al., 2016], or to define an alternative loss to train multiclass classifiers that output bags-of-words [Frogner et al., 2015]. Kolouri et al. [2017] provide a recent overview of such applications to signal processing and machine learning.

### Wasserstein Distance between Measures

The Wasserstein distance definition can be generalized to deal with arbitrary measures that need not be discrete. We assume $\mathcal{X} = \mathcal{Y}$ and that for some $p \geq 1$, $c(x,y) = d(x,y)^p$, where $d$ is a distance on $\mathcal{X}$, i.e.

- $d(x,y) = d(y,x) \geq 0$;
- $d(x,y) = 0$ if and only if $x = y$;
- $\forall (x, y, z) \in \mathcal{X}^3, d(x,z) \leq d(x,y) + d(y,z)$.

Then the $p$-Wasserstein distance on $\mathcal{X}$, $W_p(\alpha, \beta) \stackrel{\text{def}}{=} L_{d^p}(\alpha, \beta)^{1/p}$ (note that $W_p$ depends on $d$), is indeed a distance, namely $W_p$ is symmetric, nonnegative, $W_p(\alpha, \beta) = 0$ if and only if $\alpha = \beta$, and it satisfies the triangle inequality $\forall (\alpha, \beta, \gamma) \in \mathcal{M}_1^+(\mathcal{X})^3, \ W_p(\alpha, \gamma) \leq W_p(\alpha, \beta) + W_p(\beta, \gamma).$

### Dual Problem

The Kantorovich problem is a constrained convex minimization problem, and as such, it can be naturally paired with a so-called dual problem, which is a constrained concave maximization problem. The following fundamental proposition explains the relationship between the primal and dual problems. The Kantorovich problem admits the dual $L_c(a, b) = \max_{(f,g) \in R(C)} \langle f, a \rangle + \langle g, b \rangle,$ where the set of admissible dual variables is $R(C) \stackrel{\text{def}}{=} \{(f,g) \in \mathbb{R}^n \times \mathbb{R}^m : \forall (i,j) \in [n] \times [m], f_i + g_j \leq C_{i,j}\}.$ Such dual variables are often referred to as "Kantorovich potentials."

Following the interpretation given to the Kantorovich problem above, we continue with an intuitive presentation of the dual. Recall that in that setup, an operator wishes to move at the least possible cost an overall amount of resources from warehouses to factories. The operator can do so by solving Kantorovich's formulation, following the instructions set out in $P^*$, and paying $\langle P^*, C \rangle$ to the transportation company.

**Outsourcing logistics**. Suppose that the operator does not have the computational means to solve the linear program (Kantorovich's formulation). He decides instead to outsource that task to a vendor. The vendor chooses a pricing scheme with the following structure: the vendor splits the logistic task into that of collecting and then delivering the goods, and will apply a collection price $f_i$ to collect a unit of resource at each warehouse $i$ (no matter where that unit is sent to) and a price $g_j$ to deliver a unit of resource to factory $j$ (no matter which warehouse that unit comes from). On aggregate, since there are exactly $a_i$ units at warehouse $i$ and $b_j$ needed at factory $j$, the vendor asks as a consequence of that pricing scheme a price of $\langle f, a \rangle + \langle g, b \rangle$ to solve the operator's logistic problem.

**Setting prices**. Note that the pricing system used by the vendor allows quite naturally for arbitrarily negative prices.
Indeed, if the vendor applies a price vector $f$ for warehouses and a price vector $g$ for factories, then the total bill will not be changed by simultaneously decreasing all entries in $f$ by an arbitrary number and increasing all entries of $g$ by that same number, since the total amount of resources in all warehouses is equal to that which has to be delivered to the factories. In other words, the vendor can give the illusion of offering an extremely good deal to the operator by paying him to collect some of his goods, but compensate that loss by simply charging him more for delivering them. Knowing this, the vendor, wishing to charge as much as she can for that service, sets vectors $f$ and $g$ to be as high as possible.

**Checking prices**. In the absence of another competing vendor, the operator must therefore think of a quick way to check that the vendor's prices are reasonable. A possible way to do so would be for the operator to compute the price $L_c(a, b)$ of the most efficient plan by solving Kantorovich's formulation and check that the vendor's offer is at the very least no larger than that amount. However, recall that the operator cannot afford such a lengthy computation in the first place. Luckily, there is a far more efficient way for the operator to check whether the vendor has a competitive offer. Recall that $f_i$ is the price charged by the vendor for picking up a unit at $i$ and $g_j$ for delivering one at $j$. Therefore, the vendor's pricing scheme implies that transferring one unit of resource from $i$ to $j$ costs exactly $f_i + g_j$. Yet, the operator also knows that the cost of shipping one unit from $i$ to $j$ as priced by the transportation company is $C_{i,j}$. Therefore, if for any pair $i, j$ the aggregate price $f_i + g_j$ is strictly larger than $C_{i,j}$, the vendor is charging more than the fair price charged by the transportation company for that task, and the operator should refuse the vendor's offer.

**Optimal prices as a dual problem**. It is therefore in the interest of the operator to check that for all pairs $i, j$ the prices offered by the vendor verify $f_i + g_j \leq C_{i,j}$. Suppose that the operator does check that the vendor has provided price vectors that comply with these $n \times m$ inequalities. Can he conclude that the vendor's proposal is attractive? Doing a quick back-of-the-envelope calculation, the operator does indeed conclude that it is in his interest to accept that offer. Indeed, since any of his transportation plans $P$ would have a cost $\langle P, C \rangle = \sum_{i,j} P_{i,j}C_{i,j}$, the operator can conclude by applying these $n \times m$ inequalities that for any transport plan $P$ (including the optimal one $P^*$), the marginal constraints imply $\sum_{i,j} P_{i,j}C_{i,j} \geq \sum_{i,j} P_{i,j} (f_i + g_j) = \sum_i f_i \left( \sum_j P_{i,j} \right) + \sum_j g_j \left( \sum_i P_{i,j} \right) = \langle f, a \rangle + \langle g, b \rangle,$ and therefore observe that *any* attempt at doing the job by himself would necessarily be more expensive than the vendor's price. Knowing this, the vendor must therefore find a set of prices $f, g$ that maximize $\langle f, a \rangle + \langle g, b \rangle$ but that satisfy, at the very least for all $i, j$, the basic inequality $f_i + g_j \leq C_{i,j}$ for her offer to be accepted. One can show that the best price obtained by the vendor is in fact exactly equal to the best possible cost the operator would obtain by computing $L_c(a, b)$.
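This equality is strong duality for the transportation linear program, and it can be checked numerically. Below is a minimal sketch solving the dual with `scipy.optimize.linprog` (the `dual_lp` helper name is mine); comparing its value with the primal `kantorovich_lp` sketched earlier should recover $L_c(a, b)$ up to solver tolerance.

```python
import numpy as np
from scipy.optimize import linprog

def dual_lp(a, b, C):
    """Maximize <f, a> + <g, b> subject to f_i + g_j <= C_{i,j}.

    linprog minimizes, so the objective sign is flipped; f and g are free
    variables, since prices may be arbitrarily negative (see above).
    """
    n, m = C.shape
    # One inequality per pair (i, j): the row picks out f_i and g_j.
    pick_f = np.kron(np.eye(n), np.ones((m, 1)))   # selects f_i
    pick_g = np.kron(np.ones((n, 1)), np.eye(m))   # selects g_j
    res = linprog(
        -np.concatenate([a, b]),        # maximize <f, a> + <g, b>
        A_ub=np.hstack([pick_f, pick_g]),
        b_ub=C.ravel(),
        bounds=(None, None),            # prices are unconstrained in sign
        method="highs",
    )
    f, g = res.x[:n], res.x[n:]
    return f, g, -res.fun

# Strong duality check (illustrative): dual_lp(a, b, C)[2] should match
# kantorovich_lp(a, b, C)[1] up to numerical tolerance.
```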
![[Pasted image 20240123222209.png]]

Figure 2.8 illustrates the primal and dual solutions resulting from the same transport problem. On the left, blue dots represent warehouses and red dots represent factories; the areas of the dots are proportional to the probability weights $a, b$, links between them represent an optimal transport plan, and the width of each link is proportional to the amount transferred. Optimal prices obtained by the vendor are shown on the right. Prices have been chosen so that their mean is equal to 0. The highest relative prices correspond to collecting goods at an isolated warehouse in the lower left of the figure, and delivering goods at the factory located in the upper right area.