Solving Inverse Problems with Deep Learning

Applications to Astrophysics


François Lanusse










slides at eiffl.github.io/talks/KMI2020

Linear inverse problems

$\boxed{y = \mathbf{A}x + n}$

$\mathbf{A}$ is known and encodes our physical understanding of the problem.
$\Longrightarrow$ When non-invertible or ill-conditioned, the inverse problem is ill-posed with no unique solution $x$
Examples: Deconvolution, Inpainting, Denoising

Can AI solve all of our problems?

Branched GAN model for deblending (Reiman & Göhre, 2018)
The issue with using deep learning as a black-box
  • No explicit control of noise, PSF, depth, number of sources.
    • Model would have to be retrained for all observing configurations
  • No guarantees on the network output (e.g. flux preservation, artifacts)
  • No proper uncertainty quantification.

Focus of this talk

  • Deep Learning for Data Processing

  • Bayesian Neural Networks

  • Physical Bayesian inference


This talk

Combine Deep Learning, Bayesian Statistics, and Physics for:

  • interpretability
  • uncertainty quantification

in inverse problems.

Can we understand how a Neural Network solves an Inverse Problem?

A Motivating Example: Image Deconvolution



$ y = P \ast x + n $

[Diagram: observed $y$ (Ground-Based Telescope) $\longrightarrow$ $f_\theta$, some deep Convolutional Neural Network $\longrightarrow$ unknown $x$ (Hubble Space Telescope)]


  • A standard approach would be to train a neural network $f_\theta$ to estimate $x$ given $y$.
  • Step I: Assemble from data or simulations a training set of images $$\mathcal{D} = \{(x_0, y_0), (x_1, y_1), \ldots, (x_N, y_N) \}$$ $\Longrightarrow$ the dataset contains hardcoded assumptions about the PSF $P$, the noise $n$, and the galaxy morphology $x$.
  • Step II: Train the neural network $f_\theta$ under a regression loss of the type: $$ \mathcal{L} = \sum_{i=0}^N \parallel x_i - f_\theta(y_i)\parallel^2 $$
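
For concreteness, Step II boils down to a training loop of roughly this shape (a minimal PyTorch sketch with a placeholder architecture and synthetic data, not the actual model used here):

```python
# Minimal sketch of Step II: regress x from y under an l2 loss.
# The tiny CNN and the synthetic (x, y) pairs are placeholder assumptions.
import torch
import torch.nn as nn

f_theta = nn.Sequential(                      # stand-in for "some deep CNN"
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1),
)
optimizer = torch.optim.Adam(f_theta.parameters(), lr=1e-3)

x = torch.randn(16, 1, 64, 64)                # "true" galaxy images x_i
y = x + 0.1 * torch.randn_like(x)             # degraded observations y_i

for step in range(100):
    loss = ((x - f_theta(y)) ** 2).mean()     # L = sum_i ||x_i - f_theta(y_i)||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```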

[Figure: observation $y$, network output $f_\theta(y)$, and true $x$]

$\Longrightarrow$Why is the network output different from the truth? If it's not the truth, then what is $f_\theta(y)$?

Let's try to understand the neural network output by looking at the loss function

$$ \mathcal{L} = \sum_{(x_i, y_i) \in \mathcal{D}} \parallel x_i - f_\theta(y_i)\parallel^2 \quad \simeq \quad \int \parallel x - f_\theta(y) \parallel^2 \ p(x,y) \ dx dy $$
$$\Longrightarrow \int \left[ \int \parallel x - f_\theta(y) \parallel^2 \ p(x|y) \ dx \right] p(y) dy $$
$\mathcal{L}$ is minimized when $f_{\theta^\star}(y) = \int x \ p(x|y) \ dx $, i.e. when $f_{\theta^\star}(y)$ predicts the mean of $p(x|y)$.
  • Using an $\ell_2$ loss learns the mean of $p(x | y)$

  • Using an $\ell_1$ loss learns the median of $p(x|y)$

  • In general, training a neural network for regression does not recover the mode of the distribution.

    Check this blogpost and this notebook to learn how to do that: Open In Colab
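
A quick toy sanity check of the mean/median statement (a skewed 1-D distribution stands in for $p(x|y)$, illustrative only):

```python
# The constant minimizing the l2 loss over samples is their mean; for the l1 loss it is
# their median. A lognormal "posterior" is used so that mean, median and mode differ.
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(size=20_000)

c = np.linspace(0.1, 5.0, 500)                       # candidate constant predictions
l2 = ((x[None, :] - c[:, None]) ** 2).mean(axis=1)
l1 = np.abs(x[None, :] - c[:, None]).mean(axis=1)

print(c[l2.argmin()], x.mean())        # l2 minimizer is close to the mean
print(c[l1.argmin()], np.median(x))    # l1 minimizer is close to the median
```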

A Bayesian understanding of a regression network

[Figure: data $y$, posterior samples, posterior mean, and true $x$]

  • The distribution $p(x|y)$ can be understood as a Bayesian posterior distribution: $$ p(x | y) \propto \underbrace{p(y | x)}_{\mbox{likelihood}} \quad \underbrace{p(x)}_{\mbox{prior}} $$

  • Both priors and likelihoods are learned implicitly by the neural network from the training set.
    $\Longrightarrow$ priors AND likelihoods are hardcoded in the training set.

Application to Weak Lensing Mass-Mapping


Work in collaboration with:
Niall Jeffrey, Ofer Lahav, Jean-Luc Starck




$\Longrightarrow$ Use Deep Learning to reconstruct the projected mass distribution in the Universe.

Gravitational lensing

Galaxy shapes as estimators for gravitational shear
$$ e = \gamma + e_i \qquad \mbox{ with } \qquad e_i \sim \mathcal{N}(0, I)$$
  • We are trying to measure the ellipticity $e$ of galaxies as an estimator of the gravitational shear $\gamma$

The Weak Lensing Mass-Mapping as an Inverse Problem

Shear $\gamma$
Convergence $\kappa$
$$\gamma_1 = \frac{1}{2} (\partial_1^2 - \partial_2^2) \ \Psi \quad;\quad \gamma_2 = \partial_1 \partial_2 \ \Psi \quad;\quad \kappa = \frac{1}{2} (\partial_1^2 + \partial_2^2) \ \Psi$$
$$\boxed{\gamma = \mathbf{P} \kappa}$$
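
For intuition, on the flat sky the operator $\mathbf{P}$ reduces to a multiplication in Fourier space (the Kaiser-Squires relation); a small numpy sketch of the forward model, purely illustrative:

```python
# Apply gamma = P kappa in Fourier space (flat-sky Kaiser-Squires relation),
# assuming kappa is a real square map; not the talk's implementation.
import numpy as np

def shear_from_convergence(kappa):
    """Returns gamma as a complex map, gamma1 + 1j * gamma2."""
    n = kappa.shape[0]
    l1, l2 = np.meshgrid(np.fft.fftfreq(n), np.fft.fftfreq(n), indexing="ij")
    l_sq = l1**2 + l2**2
    l_sq[0, 0] = 1.0                       # avoid division by zero at the DC mode
    kernel = ((l1**2 - l2**2) + 2j * l1 * l2) / l_sq
    kernel[0, 0] = 0.0                     # the mean of kappa is unconstrained
    return np.fft.ifft2(kernel * np.fft.fft2(kappa))

kappa = np.random.randn(128, 128)          # placeholder convergence map
gamma = shear_from_convergence(kappa)      # gamma.real = gamma1, gamma.imag = gamma2
```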

Illustration on the Dark Energy Survey (DES) simulations

Jeffrey, Abdalla, Lahav, Lanusse et al. (2018)

The DeepMass approach: Posterior mean estimation by Deep Learning

$$\boxed{e = \mathbf{P} \kappa + e_i}$$


  • Preprocess the galaxy shape data with a Wiener Filter $\tilde{\kappa}^{wf} = \mathbf{WF}(e)$
  • Build a training set from simulations $$\mathcal{D} = \{ (\tilde{\kappa}_0^{wf}, \kappa_0), \ldots, (\tilde{\kappa}_N^{wf}, \kappa_N)\} $$ $\Longrightarrow$ Incorporates our prior knowledge $p(\kappa)$

  • Train a UNet under an $\ell_2$ loss: $$\mathcal{L} = \sum_{i=0}^N \parallel \kappa_i - f_\theta(\tilde{\kappa}_i^{wf}) \parallel^2 $$ $\Longrightarrow$ CNN optimized to estimate the posterior mean: $$ f_{\theta^\star}(\tilde{\kappa}^{wf})= \int \kappa \ p(\kappa | \tilde{\kappa}^{wf}) \ d\kappa$$

Results

  • Simulations
  • Application to DES Science Verification data


Takeaway


  • Training a deep neural network for regression in an inverse problem is doing Bayesian Inference.
    $\Longrightarrow$ Using an $\ell_2$ loss returns the posterior mean.


  • Both likelihood and prior are learned implicitly from the training set.


  • Can be used to achieve superior results to conventional methods on inverse problems like mass-mapping.


Limitations of this black-box approach
  • No guarantees on the network output!
    $\Longrightarrow$ We don't know how far it is from the actual posterior mean.

  • The posterior mean is not the Maximum a Posteriori (MAP) solution.

  • The likelihood is hardcoded.
    • If the noise changes, the model needs to be retrained.
$\boxed{y = \mathbf{A}x + n}$

The Bayesian view of the problem:

$$ p(x | y) \propto p(y | x) \ p(x) $$
  • $p(y | x)$ is the data likelihood, which contains the physics

  • $p(x)$ is the prior knowledge on the solution.


With these concepts in hand, we can estimate for instance the Maximum A Posteriori solution:

$$\hat{x} = \arg\max\limits_x \ \log p(y \ | \ x) + \log p(x)$$
For instance, if $n$ is Gaussian, $\hat{x} = \arg\max\limits_x \ - \frac{1}{2} \parallel y - \mathbf{A} x \parallel_{\mathbf{\Sigma}}^2 + \log p(x)$
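
As a toy illustration of this MAP estimation (a random $\mathbf{A}$, Gaussian noise, and a unit Gaussian prior are placeholder assumptions), gradient ascent on the log-posterior recovers the closed-form solution:

```python
# MAP estimation for y = A x + n with Gaussian noise and a Gaussian prior N(0, I):
# gradient ascent on log p(y|x) + log p(x), compared with the closed-form solution.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 100))                 # ill-posed: 50 measurements, 100 unknowns
x_true = rng.normal(size=100)
sigma = 0.5
y = A @ x_true + sigma * rng.normal(size=50)

def grad_log_posterior(x):
    grad_loglike = A.T @ (y - A @ x) / sigma**2    # gradient of -1/2 ||y - A x||^2_Sigma
    grad_logprior = -x                             # gradient of log N(x; 0, I)
    return grad_loglike + grad_logprior

x_map = np.zeros(100)
for _ in range(20_000):                            # plain gradient ascent
    x_map += 1e-3 * grad_log_posterior(x_map)

x_closed = np.linalg.solve(A.T @ A / sigma**2 + np.eye(100), A.T @ y / sigma**2)
print(np.abs(x_map - x_closed).max())              # the two solutions agree to numerical tolerance
```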

How do you choose the prior ?

Classical examples of signal priors

Sparse
$$ \log p(x) = - \parallel \mathbf{W} x \parallel_1 $$
Gaussian $$ \log p(x) = - \frac{1}{2} x^t \mathbf{\Sigma^{-1}} x $$
Total Variation $$ \log p(x) = - \parallel \nabla x \parallel_1 $$
(all up to an additive normalization constant)

But what about this?

Can we open the black-box and use the neural network to only learn a prior from data?

Do you know this person?



















Probably not, this is a randomly generated person: thispersondoesnotexist.com

What is generative modeling?



  • The goal of generative modeling is to learn the distribution $\mathbb{P}$ from which the training set $X = \{x_0, x_1, \ldots, x_n \}$ is drawn.

  • Usually, this means building a parametric model $\mathbb{P}_\theta$ that tries to be close to $\mathbb{P}$.



True $\mathbb{P}$

Samples $x_i \sim \mathbb{P}$

Model $\mathbb{P}_\theta$

Why isn't it easy?


  • The curse of dimensionality puts all points far apart in high dimension

Distance between pairs of points drawn from a Gaussian distribution.

  • Classical methods for estimating probability densities, e.g. Kernel Density Estimation (KDE), start to fail in high dimension because of all the gaps

The evolution of generative models




  • Deep Belief Network
    (Hinton et al. 2006)

  • Variational AutoEncoder
    (Kingma & Welling 2014)

  • Generative Adversarial Network
    (Goodfellow et al. 2014)

  • Wasserstein GAN
    (Arjovsky et al. 2017)



A visual Turing test


Fake PixelCNN samples

Real galaxies from SDSS

Not all generative models are created equal

Grathwohl et al. 2018


  • GANs and VAEs are very common and successful, but do not fit our purposes.

  • We want a model which can provide an explicit $\log p(x)$.

PixelCNN: Likelihood-based Autoregressive generative model

Models the probability $p(x)$ of an image $x$ as: $$ p_{\theta}(x) = \prod_{i=0}^{n} p_{\theta}(x_i | x_{i-1} \ldots x_0) $$
  • Some of the best log-likelihoods on the market.
  • Extremely stable during training.
  • Slow to sample from.



Ramachandran et al. 2017
van den Oord et al. 2016

  • Provides an explicit value for $\log p_\theta(x)$
    $\Longrightarrow$ Can be used as a Bayesian prior for inverse problems.
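
To make the factorization concrete, here is a toy autoregressive model (Gaussian conditionals on a flattened image; a stand-in, not an actual PixelCNN) that exposes both the explicit $\log p(x)$ and the slow, pixel-by-pixel sampling:

```python
# Minimal sketch of an autoregressive factorization: each pixel is modelled conditionally
# on the previously generated ones, and log p(x) is the sum of the per-pixel conditional
# log-probabilities. The Gaussian conditional below is a placeholder.
import numpy as np
from scipy.stats import norm

def conditional_params(pixels_so_far):
    """Toy conditional: predict the next pixel as the mean of its predecessors."""
    if len(pixels_so_far) == 0:
        return 0.0, 1.0
    return np.mean(pixels_so_far), 1.0

def log_prob(x_flat):
    logp = 0.0
    for i, xi in enumerate(x_flat):
        mu, sigma = conditional_params(x_flat[:i])
        logp += norm.logpdf(xi, loc=mu, scale=sigma)   # log p(x_i | x_{i-1}, ..., x_0)
    return logp

def sample(n_pixels, rng):
    x = []
    for _ in range(n_pixels):
        mu, sigma = conditional_params(x)
        x.append(rng.normal(mu, sigma))                # ancestral sampling, pixel by pixel
    return np.array(x)

rng = np.random.default_rng(0)
x = sample(16 * 16, rng)          # slow, strictly sequential sampling
print(log_prob(x))                # but the exact log-likelihood is available
```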

A deep denoising example

$$ \boxed{{\color{Orchid} y} = {\color{SkyBlue} x} + n} $$
  • We learn the distribution of noiseless data $\log p_\theta(x)$ from samples using a deep generative model.

  • We measure a noisy ${\color{Orchid} y}$ and we want to estimate a denoised ${\color{SkyBlue} x}$

  • The solution should lie on the realistic data manifold, symbolized by the two-moons distribution.

    We want to solve for the Maximum A Posteriori solution:

    $$\arg \max_{\color{SkyBlue} x} \ - \frac{1}{2} \parallel {\color{Orchid} y} - {\color{SkyBlue} x} \parallel_2^2 + \log p_\theta({\color{SkyBlue} x})$$ This can be done by gradient descent as long as one has access to ${\color{orange} \frac{d \log p(x)}{dx}}$.
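
A minimal 2-D sketch of this optimization, with a Gaussian mixture standing in for the learned $\log p_\theta$ and automatic differentiation supplying its gradient:

```python
# MAP-by-gradient idea: all we need from the generative model is a differentiable
# log p_theta(x). A 2-D Gaussian mixture is a placeholder for the learned prior.
import torch

def log_prior(x):                              # stand-in for log p_theta(x)
    centers = torch.tensor([[-2.0, 0.0], [2.0, 0.0]])
    d2 = ((x[None, :] - centers) ** 2).sum(-1)
    return torch.logsumexp(-0.5 * d2, dim=0)

y = torch.tensor([0.5, 1.5])                   # noisy observation
x = y.clone().requires_grad_(True)             # start the optimization at y
opt = torch.optim.SGD([x], lr=0.1)

for _ in range(200):
    loss = 0.5 * ((y - x) ** 2).sum() - log_prior(x)   # negative log-posterior
    opt.zero_grad()
    loss.backward()                            # autograd gives d log p(x) / dx for free
    opt.step()

print(x.detach())                              # MAP estimate, pulled toward the nearest mode
```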

Application to Galaxy Deblending


Work in collaboration with:
Peter Melchior, Fred Moolekamp




$\Longrightarrow$ Use PixelCNNs as deep priors for deblending

The Scarlet algorithm: deblending as an optimization problem

Melchior et al. 2018
$$ \mathcal{L} = \frac{1}{2} \parallel \mathbf{\Sigma}^{-1/2} (\ Y - P \ast A S \ ) \parallel_2^2 - \sum_{i=1}^K \log p_{\theta}(S_i) + \sum_{i=1}^K g_i(A_i) + \sum_{i=1}^K f_i(S_i)$$
Where for a $K$ component blend:
  • $P$ is the convolution with the instrumental response

  • $A_i$ are channel-wise galaxy SEDs, $S_i$ are the morphology models

  • $\mathbf{\Sigma}$ is the noise covariance

  • $\log p_\theta$ is a PixelCNN prior

  • $f_i$ and $g_i$ are arbitrary additional non-smooth constraints, e.g. positivity, monotonicity...
$\Longrightarrow$ Explicit physical modeling of the observed sky
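
Schematically, evaluating this objective for a candidate $(\mathbf{A}, \mathbf{S})$ looks as follows (the shapes, a scalar noise level, and the prior callable are illustrative assumptions; the non-smooth terms $f_i, g_i$ are handled by proximal operators in scarlet itself):

```python
# Sketch of a scarlet-style objective for a K-component blend.
import numpy as np
from scipy.signal import fftconvolve

def blend_loss(Y, psf, A, S, sigma, log_p_theta):
    """Y: (bands, H, W) data; A: (bands, K) SEDs; S: (K, H, W) morphologies."""
    model = np.einsum("bk,khw->bhw", A, S)                  # A S: per-band scene
    model = np.stack([fftconvolve(m, psf, mode="same") for m in model])  # P * (A S)
    chi2 = 0.5 * np.sum((Y - model) ** 2) / sigma**2        # data fidelity (diagonal Sigma)
    prior = sum(log_p_theta(S_i) for S_i in S)              # deep morphology prior on each component
    return chi2 - prior                                     # non-smooth constraints handled separately
```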

Training the morphology prior

Postage stamps of isolated COSMOS galaxies used for training, at WFIRST resolution and fixed fiducial PSF
isolated galaxy $\log p_\theta(x) = 3293.7$
artificial blend $\log p_\theta(x) = 3100.5 $

Scarlet in action

  • Classic priors (monotonicity, symmetry).

  • Deep Morphology prior.

[Figure: input blend, solution, and residuals; true galaxy compared with the Deep Morphology Prior solution and the Monotonicity + Symmetry solution]

Extending to multi-band images

Lanusse, Melchior, Moolekamp (2019)

Takeaway



  • Deep generative models can be used to provide data driven priors.


  • Explicit likelihood, uses all of our physical knowledge.
    $\Longrightarrow$ The method can be applied for varying PSF, noise, or even different instruments!


  • Iterative optimization can include additional regularization terms.
    • Ensures fit to data (flux preservation).
    • Can impose physical constraints like positivity.




But what about uncertainty quantification?



Can we sample from the full Bayesian posterior $p(x | y)$ to estimate uncertainties?

Deep Posterior Sampling with Denoising Score Matching


Work in collaboration with:
Benjamin Remy, Zaccharie Ramzi




$\Longrightarrow$ Sample from posterior with gradient-based Monte-Carlo methods

The score is all you need!

  • Instead of learning ${\color{orange} \log p(x)}$, learn ${\color{orange} \frac{d \log p(x)}{dx}}$, a.k.a. the Score Function

    • MAP estimation only needs the score! The same is true for Hamiltonian Monte Carlo!

  • The score can be learned efficiently by training a deep Gaussian denoiser (Denoising Score Matching).
  • The score of the full posterior is simply: $$\nabla \log p(x |y) = \underbrace{\nabla \log p(y |x)}_{\mbox{known}} \quad + \quad \underbrace{\nabla \log p(x)}_{\mbox{learned}}$$

[Diagram: noisy sample $\boldsymbol{x}'$, clean $\boldsymbol{x}$, denoiser output $\boldsymbol{r}^\star(\boldsymbol{x}', \sigma)$, and residual $\boldsymbol{x}'- \boldsymbol{r}^\star(\boldsymbol{x}', \sigma)$]
$$\boxed{\boldsymbol{r}^\star(\boldsymbol{x}', \sigma) = \boldsymbol{x}' + \sigma^2 \nabla_{\boldsymbol{x}'} \log p_{\sigma^2}(\boldsymbol{x}')}$$
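
In practice, denoising score matching amounts to training a denoiser on Gaussian-corrupted samples and reading the score off the boxed identity; a tiny 2-D sketch (an MLP stands in for the image-scale U-Net used in the papers):

```python
# Train a noise-conditional denoiser r(x', sigma); the score of the smoothed density is
# then (r(x') - x') / sigma^2. Data and architecture below are placeholders.
import torch
import torch.nn as nn

denoiser = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 2))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

x = torch.randn(10_000, 2)                            # placeholder "clean" training samples

for _ in range(1_000):
    sigma = torch.rand(x.shape[0], 1) * 0.5 + 0.1     # range of noise levels for annealing
    x_noisy = x + sigma * torch.randn_like(x)
    r = denoiser(torch.cat([x_noisy, sigma], dim=1))  # r*(x', sigma)
    loss = (((r - x) / sigma) ** 2).mean()            # denoising score matching loss
    opt.zero_grad()
    loss.backward()
    opt.step()

def prior_score(x_noisy, sigma):
    r = denoiser(torch.cat([x_noisy, sigma], dim=1))
    return (r - x_noisy) / sigma**2                   # = grad log p_sigma(x')
```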

Uncertainty quantification in Magnetic Resonance Imaging (MRI)

Ramzi, Remy, Lanusse et al. 2020


$$\boxed{y = \mathbf{F}^{-1} \mathbf{M} \mathbf{F} x + n}$$

  • Learn the score $\nabla \log p(x)$ with a UNet trained by denoising score matching.
  • Sample the $320 \times 320$ pixel image posterior by annealed Hamiltonian Monte Carlo.

$\Longrightarrow$ We can see which parts of the image are well constrained by data, and which regions are uncertain.
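
For illustration, the simplest gradient-based sampler one could build from these two score terms is unadjusted Langevin dynamics (the paper uses annealed HMC, which mixes better; this is only a minimal sketch):

```python
# Posterior sampling from score functions: grad log p(x|y) = likelihood score + prior score.
import torch

def sample_posterior(y, likelihood_score, prior_score, n_steps=1_000, eps=1e-4):
    x = torch.randn_like(y)                               # start from noise
    for _ in range(n_steps):
        score = likelihood_score(x, y) + prior_score(x)   # grad log p(x|y)
        x = x + 0.5 * eps * score + eps**0.5 * torch.randn_like(x)   # Langevin update
    return x
```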

Probabilistic Mass-Mapping of the HST COSMOS field

Remy, Lanusse, Ramzi et al. 2020
$$\boxed{\gamma = \mathbf{P} \kappa + n}$$
  • Hybrid prior: theoretical Gaussian on large scales, data-driven on small scales using N-body simulations. $$\underbrace{\nabla_{\boldsymbol{\kappa}} \log p(\boldsymbol{\kappa})}_\text{full prior} = \underbrace{\nabla_{\boldsymbol{\kappa}} \log p_{th}(\boldsymbol{\kappa})}_\text{gaussian prior} + \underbrace{\boldsymbol{r}_\theta(\boldsymbol{\kappa}, \nabla_{\boldsymbol{\kappa}} \log p_{th}(\boldsymbol{\kappa}))}_\text{learned residuals}$$ with $p_{th}(\boldsymbol{\kappa}) = \frac{1}{ \sqrt{ \det 2 \pi \boldsymbol{S}}} \exp \left( -\frac{1}{2} \boldsymbol{\kappa}^\dagger \boldsymbol{S}^{-1} \boldsymbol{\kappa} \right)$, computed from theory.


Conclusion





How to use Deep Learning for Inverse Problems in a Robust and Interpretable way?
  • Even a simple regression network is doing approximate Bayesian inference.

  • Restricting Deep Learning to modeling the priors brings robustness and interpretability.
    • Deep Generative Models are perfect for learning data-driven priors.

  • Proper Uncertainty Quantification is possible by sampling the Bayesian posterior of the inverse problem.




Thank you!



Bonus: How to train a generative model from noisy/corrupted data?

Deep Generative Models for Galaxy Image Simulations


Open In Colab


Work in collaboration with
Rachel Mandelbaum, Siamak Ravanbakhsh, Chun-Liang Li, Barnabas Poczos, Peter Freeman



Lanusse et al. (2020)
Ravanbakhsh, Lanusse, et al. (2017)

The weak lensing shape measurement problem

Shape measurement biases
$$ \langle e \rangle = (1 + m) \, \gamma + c $$
  • Can be calibrated on image simulations
  • How complex do the simulations need to be?

Complications specific to astronomical images: spot the differences!


CelebA

HSC PDR-2 wide

  • There is noise
  • We have a Point Spread Function

A Physicist's approach: let's build a model



[Forward model diagram: $g_\theta \longrightarrow$ PSF $\longrightarrow$ Pixelation $\longrightarrow$ Noise]
Probabilistic model
$$ x \sim ? $$
$$ x \sim \mathcal{N}(z, \Sigma) \quad z \sim ? $$
latent $z$ is a denoised galaxy image
$$ x \sim \mathcal{N}( \mathbf{P} z, \Sigma) \quad z \sim ?$$
latent $z$ is a super-resolved and denoised galaxy image
$$ x \sim \mathcal{N}( \mathbf{P} (\Pi \ast z), \Sigma) \quad z \sim ? $$
latent $z$ is a deconvolved, super-resolved, and denoised galaxy image
$$ x \sim \mathcal{N}( \mathbf{P} (\Pi \ast g_\theta(z)), \Sigma) \quad z \sim \mathcal{N}(0, \mathbf{I}) $$
latent $z$ is a Gaussian sample
$\theta$ are parameters of the model



$\Longrightarrow$ Decouples the morphology model from the observing conditions.
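
A sketch of this forward model as code, where only $g_\theta$ would be learned and everything else is known physics (the toy generator, PSF, and downsampling factor below are placeholders):

```python
# x ~ N( P (Pi * g_theta(z)), Sigma ) with z ~ N(0, I): latent draw, generator,
# PSF convolution, pixelation, and Gaussian noise.
import numpy as np
from scipy.signal import fftconvolve

def simulate(g_theta, psf, noise_std, downsample=2, rng=np.random.default_rng()):
    z = rng.normal(size=64)                              # z ~ N(0, I)
    light = g_theta(z)                                   # deconvolved, super-resolved galaxy
    light = fftconvolve(light, psf, mode="same")         # Pi * g_theta(z)
    light = light.reshape(light.shape[0] // downsample, downsample,
                          light.shape[1] // downsample, downsample).mean(axis=(1, 3))  # pixelation P
    return light + noise_std * rng.normal(size=light.shape)   # add noise with covariance Sigma

# usage with toy stand-ins for the generator and the PSF
toy_generator = lambda z: np.outer(np.hanning(64), np.hanning(64)) * (1 + 0.1 * z[0])
psf = np.outer(np.hanning(7), np.hanning(7))
obs = simulate(toy_generator, psf, noise_std=0.01)       # one simulated noisy postage stamp
```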

How to train your dragon model

  • Training the generative model amounts to finding the $\theta^\star$ that maximizes the marginal likelihood of the model: $$p_\theta(x | \Sigma, \Pi) = \int \mathcal{N}(x \ ; \ \Pi \ast g_\theta(z), \Sigma) \ p(z) \ dz$$
    $\Longrightarrow$ This is generally intractable

  • Efficient training of the parameters $\theta$ is made possible by Amortized Variational Inference.
Auto-Encoding Variational Bayes (Kingma & Welling, 2014)
  • We introduce a parametric distribution $q_\phi(z | x, \Pi, \Sigma)$ which aims to model the posterior $p_{\theta}(z | x, \Pi, \Sigma)$.

  • Working out the KL divergence between these two distributions leads to: $$\log p_\theta(x | \Sigma, \Pi) \quad \geq \quad - \mathbb{D}_{KL}\left( q_\phi(z | x, \Sigma, \Pi) \parallel p(z) \right) \quad + \quad \mathbb{E}_{z \sim q_{\phi}(. | x, \Sigma, \Pi)} \left[ \log p_\theta(x | z, \Sigma, \Pi) \right]$$ $\Longrightarrow$ This is the Evidence Lower-Bound, which is differentiable with respect to $\theta$ and $\phi$.
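
For a diagonal Gaussian $q_\phi$ and a Gaussian likelihood, both terms of the bound are available in closed form or via a single reparameterized sample; a bare-bones sketch (the networks and the conditioning on $(\Sigma, \Pi)$ are omitted):

```python
# Evidence Lower-Bound for a diagonal Gaussian encoder and Gaussian decoder.
import torch

def elbo(x, mu_z, log_var_z, decode, sigma_x):
    """mu_z, log_var_z: encoder outputs; decode: callable mapping z to a reconstruction."""
    # KL( q_phi(z|x) || N(0, I) ), closed form for diagonal Gaussians
    kl = 0.5 * torch.sum(mu_z**2 + log_var_z.exp() - log_var_z - 1.0)
    # reparameterized sample z ~ q_phi(z|x), keeps the bound differentiable in phi
    z = mu_z + (0.5 * log_var_z).exp() * torch.randn_like(mu_z)
    # Gaussian log-likelihood log p_theta(x | z), up to an additive constant
    log_like = -0.5 * torch.sum((x - decode(z)) ** 2) / sigma_x**2
    return log_like - kl          # to be maximized with respect to theta and phi
```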

The famous Variational Auto-Encoder



$$\log p_\theta(x| \Sigma, \Pi ) \geq - \underbrace{\mathbb{D}_{KL}\left( q_\phi(z | x, \Sigma, \Pi) \parallel p(z) \right)}_{\mbox{code regularization}} + \underbrace{\mathbb{E}_{z \sim q_{\phi}(. | x, \Sigma, \Pi)} \left[ \log p_\theta(x | z, \Sigma, \Pi) \right]}_{\mbox{reconstruction error}} $$

Sampling from the model

Whoops... what's going on?

Tradeoff between code regularization and image quality


$$\log p_\theta(x| \Sigma, \Pi ) \geq - \underbrace{\mathbb{D}_{KL}\left( q_\phi(z | x, \Sigma, \Pi) \parallel p(z) \right)}_{\mbox{code regularization}} + \underbrace{\mathbb{E}_{z \sim q_{\phi}(. | x, \Sigma, \Pi)} \left[ \log p_\theta(x | z, \Sigma, \Pi) \right]}_{\mbox{reconstruction error}} $$

Latent space modeling with Normalizing Flows


$\Longrightarrow$ All we need to do is sample from the aggregate posterior of the data instead of sampling from the prior.


Dinh et al. 2016
Normalizing Flows
  • Assumes a bijective mapping between data space $x$ and latent space $z$ with prior $p(z)$: $$ z = f_{\theta} ( x ) \qquad \mbox{and} \qquad x = f^{-1}_{\theta}(z)$$
  • Admits an explicit marginal likelihood: $$ \log p_\theta(x) = \log p(z) + \log \left| \det \frac{\partial f_\theta}{\partial x} \right|(x) $$
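
As an example, the change-of-variables formula with a single affine coupling layer (RealNVP-style, in the spirit of Dinh et al.; the scale/shift network below is a small placeholder):

```python
# One affine coupling layer: bijective, with a triangular Jacobian whose log-determinant
# is just the sum of the predicted log-scales.
import math
import torch
import torch.nn as nn

d = 4
net = nn.Sequential(nn.Linear(d // 2, 32), nn.Tanh(), nn.Linear(32, d))  # outputs (s, t)

def forward(x):                       # z = f_theta(x), with tractable log |det Jacobian|
    x1, x2 = x[:, :d // 2], x[:, d // 2:]
    s, t = net(x1).chunk(2, dim=1)
    z = torch.cat([x1, x2 * torch.exp(s) + t], dim=1)
    log_det = s.sum(dim=1)
    return z, log_det

def log_prob(x):                      # log p_theta(x) = log p(z) + log |det df/dx|
    z, log_det = forward(x)
    log_pz = -0.5 * (z**2).sum(dim=1) - 0.5 * d * math.log(2 * math.pi)
    return log_pz + log_det

x = torch.randn(8, d)
print(log_prob(x))                    # explicit per-sample log-likelihoods
```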




Flow-VAE samples



Second order moments


Testing galaxy morphologies





Bayesian Inference a.k.a. Uncertainty Quantification

The Bayesian view of the problem: $$ p(z | x ) \propto p_\theta(x | z, \Sigma, \mathbf{\Pi}) p(z)$$ where:
  • $p( z | x )$ is the posterior
  • $p( x | z )$ is the data likelihood, contains the physics
  • $p( z )$ is the prior
[Figure: data $x_n$ and truth $x_0$; posterior samples $g_\theta(z)$ and their reconstructions $\mathbf{P} (\Pi \ast g_\theta(z))$; posterior median; data residuals $x_n - \mathbf{P} (\Pi \ast g_\theta(z))$; pixel-wise standard deviation]
$\Longrightarrow$ Uncertainties are fully captured by the posterior.

How to perform efficient posterior inference?


  • Posterior fitting by Variational Inference
    $$ \mathrm{ELBO} = - \mathbb{D}_{KL}\left( q_\phi(z) \parallel p(z) \right) \quad + \quad \mathbb{E}_{z \sim q_{\phi}} \left[ \log p_\theta(x_n | z, \Sigma_n, \Pi_n) \right]$$
  • Posterior fitting by $EL_{2}O$
    $$EL_2O = \arg \min_\theta \mathbb{E}_{z \sim p^{\prime}} \sum_i \alpha_i \parallel \nabla_{z}^i \ln q_\theta(z) - \nabla_{z}^{i} \ln p(z | x_n, \Sigma_n, \Pi_n) \parallel_2^2$$ See (Seljak & Yu, 2019) for more details.

  • Or your favorite method...