Training Flow and Diffusion Models

These are some notes I took while watching MIT 6.S184: Lecture 03 and reading [1].

1. Overview

Flow Matching [1] is a method for learning a Continuous Normalizing Flow that transforms samples between any two distributions. It shows that Diffusion Models are actually just a specific type of flow matching in which the probability path is defined by a Gaussian noise schedule.

Flow Matching + straight conditional probability path (schedule) = Optimal Transport path (the most popular choice, due to the ease of obtaining training examples).
Flow Matching + curved conditional probability path (diffusion schedule, i.e. Gaussian noise) = Diffusion Model (trained with a better objective).

So to summarize, the two key features of flow matching are 1) arbitrary priors and 2) flexible paths.
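As a concrete illustration of the straight vs. curved distinction, here is a small 2D sketch of my own (the sin/cos schedule is just an illustrative curved Gaussian path, not the exact schedule of any particular diffusion paper). The CondOT interpolant lies exactly on the chord between its endpoints; the curved path does not:

```python
import numpy as np

def condot_path(z, eps, t):
    """Straight (optimal-transport) interpolant: x_t = t*z + (1-t)*eps."""
    return t * z + (1 - t) * eps

def curved_path(z, eps, t):
    """Illustrative curved Gaussian path: alpha_t = sin(pi*t/2), beta_t = cos(pi*t/2)."""
    return np.sin(np.pi * t / 2) * z + np.cos(np.pi * t / 2) * eps

def midpoint_deviation(path, z, eps):
    """Distance of the t=0.5 path point from the midpoint of the chord between endpoints."""
    a, b, m = path(z, eps, 0.0), path(z, eps, 1.0), path(z, eps, 0.5)
    return np.linalg.norm(m - (a + b) / 2)

z, eps = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(midpoint_deviation(condot_path, z, eps))  # 0.0 (straight)
print(midpoint_deviation(curved_path, z, eps))  # > 0 (curved)
```

The deviation is zero for the straight path by construction, since $t z + (1-t)\epsilon$ at $t = 0.5$ is exactly the chord midpoint.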

2. Notation

To make the math easier to follow, here are the symbols used in this post:

$x$: A sample point in the vector field (can be noise, data, or in-between).
$z$: A sample from the data distribution ($z \sim p_{\text{data}}$).
$\epsilon$: A sample from the prior (noise) distribution ($\epsilon \sim p_{\text{init}} = \mathcal{N}(0, I)$).
$t$: Time step $t \in [0, 1]$, where $t = 0$ is noise and $t = 1$ is data.
$u_t(x)$: The vector field (velocity) at location $x$ and time $t$.
$\alpha_t, \beta_t$: Coefficients determining the path schedule.
$\theta$: The learnable parameters of the neural network.

3. The big picture of arriving at a trainable objective in flow matching

  1. Our ideal goal is to learn the marginal vector field $u_t^{\text{target}}(x)$. So we can try to formulate the flow matching loss against it:

    $$\mathcal{L}_{\text{FM}}(\theta) = \mathbb{E}_{t,\, x \sim p_t} \left\| u_t^\theta(x) - u_t^{\text{target}}(x) \right\|^2$$

    Yet the issue is that $u_t^{\text{target}}(x)$ is intractable, because computing it requires an integral over the entire data distribution.

  2. To fix this intractability, we use the Conditional Flow Matching (CFM) loss, which regresses against the easy-to-calculate conditional vector field $u_t^{\text{target}}(x|z)$:

    $$\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t,\, z \sim p_{\text{data}},\, x \sim p_t(\cdot|z)} \left\| u_t^\theta(x) - u_t^{\text{target}}(x|z) \right\|^2$$

    It can be proved that the marginal flow matching loss equals the conditional flow matching loss up to a constant that does not depend on the neural network parameters: $\mathcal{L}_{\text{FM}}(\theta) = \mathcal{L}_{\text{CFM}}(\theta) + C$.
    Because their gradients are the same ($\nabla_\theta \mathcal{L}_{\text{FM}} = \nabla_\theta \mathcal{L}_{\text{CFM}}$), we can minimize the easy conditional loss to implicitly solve the hard marginal one.
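The gradient equivalence can be checked numerically in a toy setting. The sketch below is my own construction (not from the lecture): a 1D two-point data distribution, the path $x_t = t z + (1-t)\epsilon$, and a one-parameter model $u^\theta(x) = \theta x$. The marginal field is the posterior-weighted average of the conditional fields, and the two loss gradients are estimated with common random samples:

```python
import numpy as np

rng = np.random.default_rng(0)
zs = np.array([-1.0, 1.0])          # two-point data distribution (equal weight)
n = 400_000

# Sample (t, z, x) from the joint; cap t below 1 so the fields stay finite.
t = rng.uniform(0.0, 0.9, n)
z = rng.choice(zs, n)
eps = rng.standard_normal(n)
x = t * z + (1 - t) * eps           # x ~ p_t(.|z) with alpha_t = t, beta_t = 1 - t

u_cond = (z - x) / (1 - t)          # conditional target field u_t(x|z)

# Marginal field: posterior-weighted average of conditional fields over z_i,
# with p_t(x|z_i) = N(x; t*z_i, (1-t)^2).
logw = np.stack([-0.5 * ((x - t * zi) / (1 - t)) ** 2 for zi in zs])
w = np.exp(logw - logw.max(axis=0))
w /= w.sum(axis=0)
u_marg = sum(wi * (zi - x) / (1 - t) for wi, zi in zip(w, zs))

# Gradients of the two losses for a linear model u_theta(x) = theta * x.
theta = 0.3
grad_fm = np.mean(2 * (theta * x - u_marg) * x)
grad_cfm = np.mean(2 * (theta * x - u_cond) * x)
print(grad_fm, grad_cfm)            # nearly identical up to Monte Carlo noise
```

The two estimates agree because, conditioned on $(x, t)$, the expectation of the conditional field over $z$ is exactly the marginal field.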

  3. Now that we know we can use the conditional version, we pick a specific conditional probability path.

    • For instance, we can use a Gaussian conditional probability path: $p_t(x|z) = \mathcal{N}(\alpha_t z, \beta_t^2 I)$.
    • So what's the target? For this path, we can analytically calculate exactly what $u_t^{\text{target}}(x|z)$ should be using the time derivatives of $\alpha_t$ and $\beta_t$:

      $$u_t^{\text{target}}(x|z) = \left(\dot{\alpha}_t - \frac{\dot{\beta}_t}{\beta_t}\alpha_t\right) z + \frac{\dot{\beta}_t}{\beta_t}\, x$$

      Using the standard reparametrization trick, we express $x$ as a function of standard Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$: $x_t = \alpha_t z + \beta_t \epsilon$. Substituting this for every instance of $x$ in the equation simplifies the target to $u_t^{\text{target}}(x_t|z) = \dot{\alpha}_t z + \dot{\beta}_t \epsilon$.
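A quick sanity check of the substitution step: after plugging in $x_t = \alpha_t z + \beta_t \epsilon$, the conditional target is just the time derivative of the path, $\dot\alpha_t z + \dot\beta_t \epsilon$. The sketch below verifies this with finite differences for an arbitrary smooth schedule (the sin/cos choice is only for illustration):

```python
import numpy as np

# An illustrative smooth schedule and its time derivatives.
alpha  = lambda t: np.sin(np.pi * t / 2)
dalpha = lambda t: (np.pi / 2) * np.cos(np.pi * t / 2)
beta   = lambda t: np.cos(np.pi * t / 2)
dbeta  = lambda t: -(np.pi / 2) * np.sin(np.pi * t / 2)

z, eps, t, h = 1.3, -0.7, 0.4, 1e-6
x = alpha(t) * z + beta(t) * eps                       # reparametrized sample x_t

# Conditional target written as a function of x ...
u_of_x = (dalpha(t) - dbeta(t) / beta(t) * alpha(t)) * z + dbeta(t) / beta(t) * x
# ... and after substituting x_t = alpha_t z + beta_t eps:
u_of_eps = dalpha(t) * z + dbeta(t) * eps

# Both match the finite-difference time derivative of the path x_t.
path = lambda s: alpha(s) * z + beta(s) * eps
u_fd = (path(t + h) - path(t - h)) / (2 * h)
print(u_of_x, u_of_eps, u_fd)
```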

  4. Finally, we choose the simplest schedulers for $\alpha_t$ and $\beta_t$.
    The Setup: We set $\alpha_t = t$ and $\beta_t = 1 - t$. This specific linear interpolation gives us the Conditional Optimal Transport (CondOT) path [2].
    The Loss: This leads to the simple training objective:

    $$\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t,\, z \sim p_{\text{data}},\, \epsilon \sim \mathcal{N}(0, I)} \left\| u_t^\theta(x_t) - (z - \epsilon) \right\|^2$$

    where $x_t = t z + (1 - t)\epsilon$.
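A minimal sketch of this objective (my own toy setup, not from the lecture): draw $(t, z, \epsilon)$, form $x_t$ and the target $z - \epsilon$, and fit a model by regression. Here the "network" standing in for $u^\theta$ is just a linear model in hand-picked features, solved in closed form with least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Training tuples for the CondOT objective.
z = rng.choice([-1.0, 1.0], n)       # toy 1D two-point data distribution
eps = rng.standard_normal(n)         # prior noise
t = rng.uniform(0.0, 1.0, n)
x_t = t * z + (1 - t) * eps          # linear interpolant
target = z - eps                     # conditional target u_t(x_t|z) = z - eps

# Stand-in for the neural net: linear model in simple features, fit by least squares.
feats = np.stack([x_t, t, x_t * t, np.ones(n)], axis=1)
coef, *_ = np.linalg.lstsq(feats, target, rcond=None)

pred = feats @ coef
loss = np.mean((pred - target) ** 2)
baseline = np.mean(target ** 2)      # loss of the all-zero predictor
print(loss, baseline)                # the fit is never worse than the baseline
```

In practice $u^\theta$ is a neural network trained by stochastic gradient descent on exactly this Monte Carlo objective; the closed-form solve here just keeps the sketch dependency-free.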

Some open questions to elaborate on later:

  1. “first gen diffusion papers only did score matching”

    • The first diffusion models used only discrete time (no ODE/SDE formulation)
  2. how different is the diffusion objective from the flow matching loss?
  3. Is cosine schedule of diffusion equivalent to linear interpolation in FM?
  4. Is the Euler method used in practice, or do people prefer higher-order methods like Heun's method?
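On the last question, here is a small experiment of my own comparing Euler and Heun steppers on the conditional ODE of a curved Gaussian path, where the exact trajectory $x_t = \alpha_t z + \beta_t \epsilon$ is known in closed form. Heun's extra slope evaluation buys a higher order of accuracy per step:

```python
import numpy as np

# Curved Gaussian path (illustrative schedule): alpha_t = sin(pi t/2), beta_t = cos(pi t/2).
alpha  = lambda t: np.sin(np.pi * t / 2)
dalpha = lambda t: (np.pi / 2) * np.cos(np.pi * t / 2)
beta   = lambda t: np.cos(np.pi * t / 2)
dbeta  = lambda t: -(np.pi / 2) * np.sin(np.pi * t / 2)

z, eps = 1.0, -0.5

def u(x, t):
    """Conditional target field for the Gaussian path."""
    return (dalpha(t) - dbeta(t) / beta(t) * alpha(t)) * z + dbeta(t) / beta(t) * x

def integrate(method, n_steps, t_end=0.95):
    """Integrate dx/dt = u(x, t) from (t=0, x=eps); stop before t=1 where beta_t -> 0."""
    ts = np.linspace(0.0, t_end, n_steps + 1)
    x = eps
    for t0, t1 in zip(ts[:-1], ts[1:]):
        h = t1 - t0
        k1 = u(x, t0)
        if method == "euler":
            x = x + h * k1
        else:  # Heun: average the slopes at both ends of the step
            k2 = u(x + h * k1, t1)
            x = x + h * (k1 + k2) / 2
    return x

exact = alpha(0.95) * z + beta(0.95) * eps            # closed-form trajectory endpoint
err_euler = abs(integrate("euler", 200) - exact)
err_heun  = abs(integrate("heun", 200) - exact)
print(err_euler, err_heun)                            # Heun is markedly more accurate
```

Note that on the straight CondOT path this comparison is uninteresting: the trajectory is a line, so even plain Euler is essentially exact there, which is part of the appeal of straight paths for fast sampling.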

Bibliography

  • [1] Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow Matching for Generative Modeling,” in 11th International Conference on Learning Representations, 2023.
  • [2] X. Liu, C. Gong, and Q. Liu, “Flow Straight And Fast: Learning To Generate And Transfer Data With Rectified Flow,” in 11th International Conference on Learning Representations, 2023.
KUDOSDon’t
Move
Thanks!