Solving Inverse Problems with Deep Learning

Applications to Astrophysics


François Lanusse










slides at eiffl.github.io/talks/KMI2020

Linear inverse problems

$\boxed{y = \mathbf{A}x + n}$

$\mathbf{A}$ is known and encodes our physical understanding of the problem.
$\Longrightarrow$ When non-invertible or ill-conditioned, the inverse problem is ill-posed with no unique solution $x$
Examples: Deconvolution, Inpainting, Denoising

Can AI solve all of our problems?

Branched GAN model for deblending (Reiman & Göhre, 2018)
The issue with using deep learning as a black-box
  • No explicit control of noise, PSF, depth, number of sources.
    • Model would have to be retrained for all observing configurations
  • No guarantees on the network output (e.g. flux preservation, artifacts)
  • No proper uncertainty quantification.

Focus of this talk

  • Deep Learning for Data Processing

  • Bayesian Neural Networks

  • Physical Bayesian inference


This talk

Combine Deep Learning, Bayesian Statistics, and Physics for:

  • interpretability
  • uncertainty quantification

in inverse problems.

Can we understand how a Neural Network solves an Inverse Problem?

A Motivating Example: Image Deconvolution



$ y = P \ast x + n $

[Diagram: observed $y$ (Ground-Based Telescope) $\longrightarrow$ $f_\theta$, some deep Convolutional Neural Network $\longrightarrow$ unknown $x$ (Hubble Space Telescope)]


  • A standard approach would be to train a neural network $f_\theta$ to estimate $x$ given $y$.
  • Step I: Assemble from data or simulations a training set of images $$\mathcal{D} = \{(x_0, y_0), (x_1, y_1), \ldots, (x_N, y_N) \}$$ $\Longrightarrow$ the dataset contains hardcoded assumptions about the PSF $P$, the noise $n$, and the galaxy morphology $x$.
  • Step II: Train the neural network $f_\theta$ under a regression loss of the type: $$ \mathcal{L} = \sum_{i=0}^N \parallel x_i - f_\theta(y_i)\parallel^2 $$
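
For concreteness, Step II boils down to a training loop of roughly this shape (a minimal PyTorch sketch with a placeholder architecture and synthetic data, not the actual model used here):

```python
# Minimal sketch of Step II: regress x from y under an l2 loss.
# The tiny CNN and the synthetic (x, y) pairs are placeholder assumptions.
import torch
import torch.nn as nn

f_theta = nn.Sequential(                      # stand-in for "some deep CNN"
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1),
)
optimizer = torch.optim.Adam(f_theta.parameters(), lr=1e-3)

x = torch.randn(16, 1, 64, 64)                # "true" galaxy images x_i
y = x + 0.1 * torch.randn_like(x)             # degraded observations y_i

for step in range(100):
    loss = ((x - f_theta(y)) ** 2).mean()     # L = sum_i ||x_i - f_theta(y_i)||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```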

[Figure: observation $y$, network output $f_\theta(y)$, and true $x$]

$\Longrightarrow$Why is the network output different from the truth? If it's not the truth, then what is $f_\theta(y)$?

Let's try to understand the neural network output by looking at the loss function

$$ \mathcal{L} = \sum_{(x_i, y_i) \in \mathcal{D}} \parallel x_i - f_\theta(y_i)\parallel^2 \quad \simeq \quad \int \parallel x - f_\theta(y) \parallel^2 \ p(x,y) \ dx dy $$
$$\Longrightarrow \int \left[ \int \parallel x - f_\theta(y) \parallel^2 \ p(x|y) \ dx \right] p(y) dy $$
$\mathcal{L}$ is minimized when $f_{\theta^\star}(y) = \int x \ p(x|y) \ dx $, i.e. when $f_{\theta^\star}(y)$ predicts the mean of $p(x|y)$.
  • Using an $\ell_2$ loss learns the mean of $p(x | y)$

  • Using an $\ell_1$ loss learns the median of $p(x|y)$

  • In general, training a neural network for regression does not recover the mode of the distribution.

    Check this blogpost and this notebook to learn how to do that: Open In Colab
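
A quick toy sanity check of the mean/median statement (a skewed 1-D distribution stands in for $p(x|y)$, illustrative only):

```python
# The constant minimizing the l2 loss over samples is their mean; for the l1 loss it is
# their median. A lognormal "posterior" is used so that mean, median and mode differ.
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(size=20_000)

c = np.linspace(0.1, 5.0, 500)                       # candidate constant predictions
l2 = ((x[None, :] - c[:, None]) ** 2).mean(axis=1)
l1 = np.abs(x[None, :] - c[:, None]).mean(axis=1)

print(c[l2.argmin()], x.mean())        # l2 minimizer is close to the mean
print(c[l1.argmin()], np.median(x))    # l1 minimizer is close to the median
```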

A Bayesian understanding of a regression network

[Figure: data $y$, posterior samples, posterior mean, and true $x$]

  • The distribution $p(x|y)$ can be understood as a Bayesian posterior distribution: $$ p(x | y) \propto \underbrace{p(y | x)}_{\mbox{likelihood}} \quad \underbrace{p(x)}_{\mbox{prior}} $$

  • Both priors and likelihoods are learned implicitly by the neural network from the training set.
    $\Longrightarrow$ priors AND likelihoods are hardcoded in the training set.

Application to Weak Lensing Mass-Mapping


Work in collaboration with:
Niall Jeffrey, Ofer Lahav, Jean-Luc Starck




$\Longrightarrow$ Use Deep Learning to reconstruct the projected mass distribution in the Universe.

Gravitational lensing

Galaxy shapes as estimators for gravitational shear
$$ e = \gamma + e_i \qquad \mbox{ with } \qquad e_i \sim \mathcal{N}(0, I)$$
  • We are trying to measure the ellipticity $e$ of galaxies as an estimator of the gravitational shear $\gamma$

The Weak Lensing Mass-Mapping as an Inverse Problem

Shear $\gamma$
Convergence $\kappa$
$$\gamma_1 = \frac{1}{2} (\partial_1^2 - \partial_2^2) \ \Psi \quad;\quad \gamma_2 = \partial_1 \partial_2 \ \Psi \quad;\quad \kappa = \frac{1}{2} (\partial_1^2 + \partial_2^2) \ \Psi$$
$$\boxed{\gamma = \mathbf{P} \kappa}$$
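
For intuition, on the flat sky the operator $\mathbf{P}$ reduces to a multiplication in Fourier space (the Kaiser-Squires relation); a small numpy sketch of the forward model, purely illustrative:

```python
# Apply gamma = P kappa in Fourier space (flat-sky Kaiser-Squires relation),
# assuming kappa is a real square map; not the talk's implementation.
import numpy as np

def shear_from_convergence(kappa):
    """Returns gamma as a complex map, gamma1 + 1j * gamma2."""
    n = kappa.shape[0]
    l1, l2 = np.meshgrid(np.fft.fftfreq(n), np.fft.fftfreq(n), indexing="ij")
    l_sq = l1**2 + l2**2
    l_sq[0, 0] = 1.0                       # avoid division by zero at the DC mode
    kernel = ((l1**2 - l2**2) + 2j * l1 * l2) / l_sq
    kernel[0, 0] = 0.0                     # the mean of kappa is unconstrained
    return np.fft.ifft2(kernel * np.fft.fft2(kappa))

kappa = np.random.randn(128, 128)          # placeholder convergence map
gamma = shear_from_convergence(kappa)      # gamma.real = gamma1, gamma.imag = gamma2
```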

Illustration on the Dark Energy Survey (DES) simulations

Jeffrey, Abdalla, Lahav, Lanusse et al. (2018)

The DeepMass approach: Posterior mean estimation by Deep Learning

$$\boxed{e = \mathbf{P} \kappa + e_i}$$


  • Preprocess the galaxy shape data with a Wiener Filter $\tilde{\kappa}^{wf} = \mathbf{WF}(e)$
  • Build a training set from simulations $$\mathcal{D} = \{ (\tilde{\kappa}_0^{wf}, \kappa_0), \ldots, (\tilde{\kappa}_N^{wf}, \kappa_N)\} $$ $\Longrightarrow$ Incorporates our prior knowledge $p(\kappa)$

  • Train a UNet under an $\ell_2$ loss: $$\mathcal{L} = \sum_{i=0}^N \parallel \kappa_i - f_\theta(\tilde{\kappa}_i^{wf}) \parallel^2 $$ $\Longrightarrow$ CNN optimized to estimate the posterior mean: $$ f_{\theta^\star}(\tilde{\kappa}^{wf})= \int \kappa \ p(\kappa | \tilde{\kappa}^{wf}) \ d\kappa$$

Results

  • Simulations
  • Application to DES Science Verification data


Takeaway


  • Training a deep neural network for regression in an inverse problem is doing Bayesian Inference.
    $\Longrightarrow$ Using an $\ell_2$ loss returns the posterior mean.


  • Both likelihood and prior are learned implicitly from the training set.


  • Can be used to achieve superior results to conventional methods on inverse problems like mass-mapping.


Limitations of this black-box approach
  • No guarantees on the network output!
    $\Longrightarrow$ We don't know how far it is from the actual posterior mean.

  • The posterior mean is not the Maximum a Posteriori (MAP) solution.

  • The likelihood is hardcoded.
    • If the noise changes, the model needs to be retrained.
$\boxed{y = \mathbf{A}x + n}$

The Bayesian view of the problem:

$$ p(x | y) \propto p(y | x) \ p(x) $$
  • $p(y | x)$ is the data likelihood, which contains the physics

  • $p(x)$ is the prior knowledge on the solution.


With these concepts in hand, we can estimate for instance the Maximum A Posteriori solution:

$$\hat{x} = \arg\max\limits_x \ \log p(y \ | \ x) + \log p(x)$$
For instance, if $n$ is Gaussian, $\hat{x} = \arg\max\limits_x \ - \frac{1}{2} \parallel y - \mathbf{A} x \parallel_{\mathbf{\Sigma}}^2 + \log p(x)$
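
As a toy illustration of this MAP estimation (a random $\mathbf{A}$, Gaussian noise, and a unit Gaussian prior are placeholder assumptions), gradient ascent on the log-posterior recovers the closed-form solution:

```python
# MAP estimation for y = A x + n with Gaussian noise and a Gaussian prior N(0, I):
# gradient ascent on log p(y|x) + log p(x), compared with the closed-form solution.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 100))                 # ill-posed: 50 measurements, 100 unknowns
x_true = rng.normal(size=100)
sigma = 0.5
y = A @ x_true + sigma * rng.normal(size=50)

def grad_log_posterior(x):
    grad_loglike = A.T @ (y - A @ x) / sigma**2    # gradient of -1/2 ||y - A x||^2_Sigma
    grad_logprior = -x                             # gradient of log N(x; 0, I)
    return grad_loglike + grad_logprior

x_map = np.zeros(100)
for _ in range(20_000):                            # plain gradient ascent
    x_map += 1e-3 * grad_log_posterior(x_map)

x_closed = np.linalg.solve(A.T @ A / sigma**2 + np.eye(100), A.T @ y / sigma**2)
print(np.abs(x_map - x_closed).max())              # the two solutions agree to numerical tolerance
```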

How do you choose the prior ?

Classical examples of signal priors

Sparse
$$ \log p(x) = - \parallel \mathbf{W} x \parallel_1 $$
Gaussian $$ \log p(x) = - \frac{1}{2} x^t \mathbf{\Sigma^{-1}} x $$
Total Variation $$ \log p(x) = - \parallel \nabla x \parallel_1 $$
(all up to an additive normalization constant)

But what about this?

Can we open the black-box and use the neural network to only learn a prior from data?

Do you know this person?



















Probably not, this is a randomly generated person: thispersondoesnotexist.com

What is generative modeling?



  • The goal of generative modeling is to learn the distribution $\mathbb{P}$ from which the training set $X = \{x_0, x_1, \ldots, x_n \}$ is drawn.

  • Usually, this means building a parametric model $\mathbb{P}_\theta$ that tries to be close to $\mathbb{P}$.



True $\mathbb{P}$

Samples $x_i \sim \mathbb{P}$

Model $\mathbb{P}_\theta$

Why isn't it easy?


  • The curse of dimensionality puts all points far apart in high dimension

Distance between pairs of points drawn from a Gaussian distribution.

  • Classical methods for estimating probability densities, e.g. Kernel Density Estimation (KDE), start to fail in high dimension because of all the gaps

The evolution of generative models




  • Deep Belief Network
    (Hinton et al. 2006)

  • Variational AutoEncoder
    (Kingma & Welling 2014)

  • Generative Adversarial Network
    (Goodfellow et al. 2014)

  • Wasserstein GAN
    (Arjovsky et al. 2017)



A visual Turing test


Fake PixelCNN samples

Real galaxies from SDSS

Not all generative models are created equal

Grathwohl et al. 2018


  • GANs and VAEs are very common and successful, but do not fit our purposes.

  • We want a model which can provide an explicit $\log p(x)$.

PixelCNN: Likelihood-based Autoregressive generative model

Models the probability $p(x)$ of an image $x$ as: $$ p_{\theta}(x) = \prod_{i=0}^{n} p_{\theta}(x_i | x_{i-1} \ldots x_0) $$
  • Some of the best log-likelihoods on the market.
  • Extremely stable during training.
  • Slow to sample from.



Ramachandran et al. 2017
van den Oord et al. 2016

  • Provides an explicit value for $\log p_\theta(x)$
    $\Longrightarrow$ Can be used as a Bayesian prior for inverse problems.
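
To make the factorization concrete, here is a toy autoregressive model (Gaussian conditionals on a flattened image; a stand-in, not an actual PixelCNN) that exposes both the explicit $\log p(x)$ and the slow, pixel-by-pixel sampling:

```python
# Minimal sketch of an autoregressive factorization: each pixel is modelled conditionally
# on the previously generated ones, and log p(x) is the sum of the per-pixel conditional
# log-probabilities. The Gaussian conditional below is a placeholder.
import numpy as np
from scipy.stats import norm

def conditional_params(pixels_so_far):
    """Toy conditional: predict the next pixel as the mean of its predecessors."""
    if len(pixels_so_far) == 0:
        return 0.0, 1.0
    return np.mean(pixels_so_far), 1.0

def log_prob(x_flat):
    logp = 0.0
    for i, xi in enumerate(x_flat):
        mu, sigma = conditional_params(x_flat[:i])
        logp += norm.logpdf(xi, loc=mu, scale=sigma)   # log p(x_i | x_{i-1}, ..., x_0)
    return logp

def sample(n_pixels, rng):
    x = []
    for _ in range(n_pixels):
        mu, sigma = conditional_params(x)
        x.append(rng.normal(mu, sigma))                # ancestral sampling, pixel by pixel
    return np.array(x)

rng = np.random.default_rng(0)
x = sample(16 * 16, rng)          # slow, strictly sequential sampling
print(log_prob(x))                # but the exact log-likelihood is available
```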

A deep denoising example

$$ \boxed{{\color{Orchid} y} = {\color{SkyBlue} x} + n} $$
  • We learn the distribution of noiseless data $\log p_\theta(x)$ from samples using a deep generative model.

  • We measure a noisy ${\color{Orchid} y}$ and we want to estimate a denoised ${\color{SkyBlue} x}$

  • The solution should lie on the realistic data manifold, symbolized by the two-moons distribution.

    We want to solve for the Maximum A Posteriori solution:

    $$\arg \max_{\color{SkyBlue} x} \ - \frac{1}{2} \parallel {\color{Orchid} y} - {\color{SkyBlue} x} \parallel_2^2 + \log p_\theta({\color{SkyBlue} x})$$ This can be done by gradient descent as long as one has access to ${\color{orange} \frac{d \log p(x)}{dx}}$.
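
A minimal 2-D sketch of this optimization, with a Gaussian mixture standing in for the learned $\log p_\theta$ and automatic differentiation supplying its gradient:

```python
# MAP-by-gradient idea: all we need from the generative model is a differentiable
# log p_theta(x). A 2-D Gaussian mixture is a placeholder for the learned prior.
import torch

def log_prior(x):                              # stand-in for log p_theta(x)
    centers = torch.tensor([[-2.0, 0.0], [2.0, 0.0]])
    d2 = ((x[None, :] - centers) ** 2).sum(-1)
    return torch.logsumexp(-0.5 * d2, dim=0)

y = torch.tensor([0.5, 1.5])                   # noisy observation
x = y.clone().requires_grad_(True)             # start the optimization at y
opt = torch.optim.SGD([x], lr=0.1)

for _ in range(200):
    loss = 0.5 * ((y - x) ** 2).sum() - log_prior(x)   # negative log-posterior
    opt.zero_grad()
    loss.backward()                            # autograd gives d log p(x) / dx for free
    opt.step()

print(x.detach())                              # MAP estimate, pulled toward the nearest mode
```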

Application to Galaxy Deblending


Work in collaboration with:
Peter Melchior, Fred Moolekamp




$\Longrightarrow$ Use PixelCNNs as deep priors for deblending

The Scarlet algorithm: deblending as an optimization problem

Melchior et al. 2018
$$ \mathcal{L} = \frac{1}{2} \parallel \mathbf{\Sigma}^{-1/2} (\ Y - P \ast A S \ ) \parallel_2^2 - \sum_{i=1}^K \log p_{\theta}(S_i) + \sum_{i=1}^K g_i(A_i) + \sum_{i=1}^K f_i(S_i)$$
Where for a $K$ component blend:
  • $P$ is the convolution with the instrumental response

  • $A_i$ are channel-wise galaxy SEDs, $S_i$ are the morphology models

  • $\mathbf{\Sigma}$ is the noise covariance

  • $\log p_\theta$ is a PixelCNN prior

  • $f_i$ and $g_i$ are arbitrary additional non-smooth constraints, e.g. positivity, monotonicity...
$\Longrightarrow$ Explicit physical modeling of the observed sky
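
Schematically, evaluating this objective for a candidate $(\mathbf{A}, \mathbf{S})$ looks as follows (the shapes, a scalar noise level, and the prior callable are illustrative assumptions; the non-smooth terms $f_i, g_i$ are handled by proximal operators in scarlet itself):

```python
# Sketch of a scarlet-style objective for a K-component blend.
import numpy as np
from scipy.signal import fftconvolve

def blend_loss(Y, psf, A, S, sigma, log_p_theta):
    """Y: (bands, H, W) data; A: (bands, K) SEDs; S: (K, H, W) morphologies."""
    model = np.einsum("bk,khw->bhw", A, S)                  # A S: per-band scene
    model = np.stack([fftconvolve(m, psf, mode="same") for m in model])  # P * (A S)
    chi2 = 0.5 * np.sum((Y - model) ** 2) / sigma**2        # data fidelity (diagonal Sigma)
    prior = sum(log_p_theta(S_i) for S_i in S)              # deep morphology prior on each component
    return chi2 - prior                                     # non-smooth constraints handled separately
```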

Training the morphology prior

Postage stamps of isolated COSMOS galaxies used for training, at WFIRST resolution and fixed fiducial PSF
isolated galaxy $\log p_\theta(x) = 3293.7$
artificial blend $\log p_\theta(x) = 3100.5 $

Scarlet in action

  • Classic priors (monotonicity, symmetry).

  • Deep Morphology prior.

[Figure: input blend, solution, and residuals; true galaxy compared with the Deep Morphology Prior solution and the Monotonicity + Symmetry solution]

Extending to multi-band images

Lanusse, Melchior, Moolekamp (2019)

Takeaway



  • Deep generative models can be used to provide data driven priors.


  • Explicit likelihood, uses all of our physical knowledge.
    $\Longrightarrow$ The method can be applied for varying PSF, noise, or even different instruments!


  • Iterative optimization can include additional regularization terms.
    • Ensures fit to data (flux preservation).
    • Can impose physical constraints like positivity.




But what about uncertainty quantification?



Can we sample from the full Bayesian posterior $p(x | y)$ to estimate uncertainties?

Deep Posterior Sampling with Denoising Score Matching


Work in collaboration with:
Benjamin Remy, Zaccharie Ramzi




$\Longrightarrow$ Sample from posterior with gradient-based Monte-Carlo methods

The score is all you need!

  • Instead of learning ${\color{orange} \log p(x)}$, learn ${\color{orange} \frac{d \log p(x)}{dx}}$, a.k.a. the Score Function

    • MAP estimation only needs the score! The same is true for Hamiltonian Monte Carlo!

  • The score can be learned efficiently by training a deep Gaussian denoiser (Denoising Score Matching).
  • The score of the full posterior is simply: $$\nabla \log p(x |y) = \underbrace{\nabla \log p(y |x)}_{\mbox{known}} \quad + \quad \underbrace{\nabla \log p(x)}_{\mbox{learned}}$$

[Diagram: noisy sample $\boldsymbol{x}'$, clean $\boldsymbol{x}$, denoiser output $\boldsymbol{r}^\star(\boldsymbol{x}', \sigma)$, and residual $\boldsymbol{x}'- \boldsymbol{r}^\star(\boldsymbol{x}', \sigma)$]
$$\boxed{\boldsymbol{r}^\star(\boldsymbol{x}', \sigma) = \boldsymbol{x}' + \sigma^2 \nabla_{\boldsymbol{x}'} \log p_{\sigma^2}(\boldsymbol{x}')}$$
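
In practice, denoising score matching amounts to training a denoiser on Gaussian-corrupted samples and reading the score off the boxed identity; a tiny 2-D sketch (an MLP stands in for the image-scale U-Net used in the papers):

```python
# Train a noise-conditional denoiser r(x', sigma); the score of the smoothed density is
# then (r(x') - x') / sigma^2. Data and architecture below are placeholders.
import torch
import torch.nn as nn

denoiser = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 2))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

x = torch.randn(10_000, 2)                            # placeholder "clean" training samples

for _ in range(1_000):
    sigma = torch.rand(x.shape[0], 1) * 0.5 + 0.1     # range of noise levels for annealing
    x_noisy = x + sigma * torch.randn_like(x)
    r = denoiser(torch.cat([x_noisy, sigma], dim=1))  # r*(x', sigma)
    loss = (((r - x) / sigma) ** 2).mean()            # denoising score matching loss
    opt.zero_grad()
    loss.backward()
    opt.step()

def prior_score(x_noisy, sigma):
    r = denoiser(torch.cat([x_noisy, sigma], dim=1))
    return (r - x_noisy) / sigma**2                   # = grad log p_sigma(x')
```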

Uncertainty quantification in Magnetic Resonance Imaging (MRI)

Ramzi, Remy, Lanusse et al. 2020


$$\boxed{y = \mathbf{F}^{-1} \mathbf{M} \mathbf{F} x + n}$$

  • Learn the score $\nabla \log p(x)$ with a UNet trained by denoising score matching.
  • Sample the $320 \times 320$ pixel image posterior by annealed Hamiltonian Monte Carlo.

$\Longrightarrow$ We can see which parts of the image are well constrained by data, and which regions are uncertain.
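
For illustration, the simplest gradient-based sampler one could build from these two score terms is unadjusted Langevin dynamics (the paper uses annealed HMC, which mixes better; this is only a minimal sketch):

```python
# Posterior sampling from score functions: grad log p(x|y) = likelihood score + prior score.
import torch

def sample_posterior(y, likelihood_score, prior_score, n_steps=1_000, eps=1e-4):
    x = torch.randn_like(y)                               # start from noise
    for _ in range(n_steps):
        score = likelihood_score(x, y) + prior_score(x)   # grad log p(x|y)
        x = x + 0.5 * eps * score + eps**0.5 * torch.randn_like(x)   # Langevin update
    return x
```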

Probabilistic Mass-Mapping of the HST COSMOS field

Remy, Lanusse, Ramzi et al. 2020
$$\boxed{\gamma = \mathbf{P} \kappa + n}$$
  • Hybrid prior: theoretical Gaussian on large scales, data-driven on small scales using N-body simulations. $$\underbrace{\nabla_{\boldsymbol{\kappa}} \log p(\boldsymbol{\kappa})}_\text{full prior} = \underbrace{\nabla_{\boldsymbol{\kappa}} \log p_{th}(\boldsymbol{\kappa})}_\text{gaussian prior} + \underbrace{\boldsymbol{r}_\theta(\boldsymbol{\kappa}, \nabla_{\boldsymbol{\kappa}} \log p_{th}(\boldsymbol{\kappa}))}_\text{learned residuals}$$ with $p_{th}(\boldsymbol{\kappa}) = \frac{1}{ \sqrt{ \det 2 \pi \boldsymbol{S}}} \exp \left( -\frac{1}{2} \boldsymbol{\kappa}^\dagger \boldsymbol{S}^{-1} \boldsymbol{\kappa} \right)$, computed from theory.


Conclusion





How to use Deep Learning for Inverse Problems in a Robust and Interpretable way?
  • Even a simple regression network is doing approximate Bayesian inference.

  • Restricting Deep Learning to modeling the priors brings robustness and interpretability.
    • Deep Generative Models are perfect for learning data-driven priors.

  • Proper Uncertainty Quantification is possible by sampling the Bayesian posterior of the inverse problem.




Thank you!



Bonus: How to train a generative model from noisy/corrupted data?

Deep Generative Models for Galaxy Image Simulations


Open In Colab


Work in collaboration with
Rachel Mandelbaum, Siamak Ravanbakhsh, Chun-Liang Li, Barnabas Poczos, Peter Freeman



Lanusse et al. (2020)
Ravanbakhsh, Lanusse, et al. (2017)

The weak lensing shape measurement problem

Shape measurement biases
$$ \langle e \rangle = (1 + m) \, \gamma + c $$
  • Can be calibrated on image simulations
  • How complex do the simulations need to be?

Complications specific to astronomical images: spot the differences!


CelebA

HSC PDR-2 wide

  • There is noise
  • We have a Point Spread Function

A Physicist's approach: let's build a model



[Forward model diagram: $g_\theta \longrightarrow$ PSF $\longrightarrow$ Pixelation $\longrightarrow$ Noise]
Probabilistic model
$$ x \sim ? $$
$$ x \sim \mathcal{N}(z, \Sigma) \quad z \sim ? $$
latent $z$ is a denoised galaxy image
$$ x \sim \mathcal{N}( \mathbf{P} z, \Sigma) \quad z \sim ?$$
latent $z$ is a super-resolved and denoised galaxy image
$$ x \sim \mathcal{N}( \mathbf{P} (\Pi \ast z), \Sigma) \quad z \sim ? $$
latent $z$ is a deconvolved, super-resolved, and denoised galaxy image
$$ x \sim \mathcal{N}( \mathbf{P} (\Pi \ast g_\theta(z)), \Sigma) \quad z \sim \mathcal{N}(0, \mathbf{I}) $$
latent $z$ is a Gaussian sample
$\theta$ are parameters of the model



$\Longrightarrow$ Decouples the morphology model from the observing conditions.
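
A sketch of this forward model as code, where only $g_\theta$ would be learned and everything else is known physics (the toy generator, PSF, and downsampling factor below are placeholders):

```python
# x ~ N( P (Pi * g_theta(z)), Sigma ) with z ~ N(0, I): latent draw, generator,
# PSF convolution, pixelation, and Gaussian noise.
import numpy as np
from scipy.signal import fftconvolve

def simulate(g_theta, psf, noise_std, downsample=2, rng=np.random.default_rng()):
    z = rng.normal(size=64)                              # z ~ N(0, I)
    light = g_theta(z)                                   # deconvolved, super-resolved galaxy
    light = fftconvolve(light, psf, mode="same")         # Pi * g_theta(z)
    light = light.reshape(light.shape[0] // downsample, downsample,
                          light.shape[1] // downsample, downsample).mean(axis=(1, 3))  # pixelation P
    return light + noise_std * rng.normal(size=light.shape)   # add noise with covariance Sigma

# usage with toy stand-ins for the generator and the PSF
toy_generator = lambda z: np.outer(np.hanning(64), np.hanning(64)) * (1 + 0.1 * z[0])
psf = np.outer(np.hanning(7), np.hanning(7))
obs = simulate(toy_generator, psf, noise_std=0.01)       # one simulated noisy postage stamp
```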

How to train your dragon model

  • Training the generative model amounts to finding the $\theta^\star$ that maximizes the marginal likelihood of the model: $$p_\theta(x | \Sigma, \Pi) = \int \mathcal{N}(x \ ; \ \Pi \ast g_\theta(z), \Sigma) \ p(z) \ dz$$
    $\Longrightarrow$ This is generally intractable

  • Efficient training of the parameters $\theta$ is made possible by Amortized Variational Inference.
Auto-Encoding Variational Bayes (Kingma & Welling, 2014)
  • We introduce a parametric distribution $q_\phi(z | x, \Pi, \Sigma)$ which aims to model the posterior $p_{\theta}(z | x, \Pi, \Sigma)$.

  • Working out the KL divergence between these two distributions leads to: $$\log p_\theta(x | \Sigma, \Pi) \quad \geq \quad - \mathbb{D}_{KL}\left( q_\phi(z | x, \Sigma, \Pi) \parallel p(z) \right) \quad + \quad \mathbb{E}_{z \sim q_{\phi}(. | x, \Sigma, \Pi)} \left[ \log p_\theta(x | z, \Sigma, \Pi) \right]$$ $\Longrightarrow$ This is the Evidence Lower-Bound, which is differentiable with respect to $\theta$ and $\phi$.
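
For a diagonal Gaussian $q_\phi$ and a Gaussian likelihood, both terms of the bound are available in closed form or via a single reparameterized sample; a bare-bones sketch (the networks and the conditioning on $(\Sigma, \Pi)$ are omitted):

```python
# Evidence Lower-Bound for a diagonal Gaussian encoder and Gaussian decoder.
import torch

def elbo(x, mu_z, log_var_z, decode, sigma_x):
    """mu_z, log_var_z: encoder outputs; decode: callable mapping z to a reconstruction."""
    # KL( q_phi(z|x) || N(0, I) ), closed form for diagonal Gaussians
    kl = 0.5 * torch.sum(mu_z**2 + log_var_z.exp() - log_var_z - 1.0)
    # reparameterized sample z ~ q_phi(z|x), keeps the bound differentiable in phi
    z = mu_z + (0.5 * log_var_z).exp() * torch.randn_like(mu_z)
    # Gaussian log-likelihood log p_theta(x | z), up to an additive constant
    log_like = -0.5 * torch.sum((x - decode(z)) ** 2) / sigma_x**2
    return log_like - kl          # to be maximized with respect to theta and phi
```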

The famous Variational Auto-Encoder



$$\log p_\theta(x| \Sigma, \Pi ) \geq - \underbrace{\mathbb{D}_{KL}\left( q_\phi(z | x, \Sigma, \Pi) \parallel p(z) \right)}_{\mbox{code regularization}} + \underbrace{\mathbb{E}_{z \sim q_{\phi}(. | x, \Sigma, \Pi)} \left[ \log p_\theta(x | z, \Sigma, \Pi) \right]}_{\mbox{reconstruction error}} $$

Sampling from the model

Whoops... what's going on?

Tradeoff between code regularization and image quality


$$\log p_\theta(x| \Sigma, \Pi ) \geq - \underbrace{\mathbb{D}_{KL}\left( q_\phi(z | x, \Sigma, \Pi) \parallel p(z) \right)}_{\mbox{code regularization}} + \underbrace{\mathbb{E}_{z \sim q_{\phi}(. | x, \Sigma, \Pi)} \left[ \log p_\theta(x | z, \Sigma, \Pi) \right]}_{\mbox{reconstruction error}} $$

Latent space modeling with Normalizing Flows


$\Longrightarrow$ All we need to do is sample from the aggregate posterior of the data instead of sampling from the prior.


Dinh et al. 2016
Normalizing Flows
  • Assumes a bijective mapping between data space $x$ and latent space $z$ with prior $p(z)$: $$ z = f_{\theta} ( x ) \qquad \mbox{and} \qquad x = f^{-1}_{\theta}(z)$$
  • Admits an explicit marginal likelihood: $$ \log p_\theta(x) = \log p(z) + \log \left| \det \frac{\partial f_\theta}{\partial x} \right|(x) $$
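
As an example, the change-of-variables formula with a single affine coupling layer (RealNVP-style, in the spirit of Dinh et al.; the scale/shift network below is a small placeholder):

```python
# One affine coupling layer: bijective, with a triangular Jacobian whose log-determinant
# is just the sum of the predicted log-scales.
import math
import torch
import torch.nn as nn

d = 4
net = nn.Sequential(nn.Linear(d // 2, 32), nn.Tanh(), nn.Linear(32, d))  # outputs (s, t)

def forward(x):                       # z = f_theta(x), with tractable log |det Jacobian|
    x1, x2 = x[:, :d // 2], x[:, d // 2:]
    s, t = net(x1).chunk(2, dim=1)
    z = torch.cat([x1, x2 * torch.exp(s) + t], dim=1)
    log_det = s.sum(dim=1)
    return z, log_det

def log_prob(x):                      # log p_theta(x) = log p(z) + log |det df/dx|
    z, log_det = forward(x)
    log_pz = -0.5 * (z**2).sum(dim=1) - 0.5 * d * math.log(2 * math.pi)
    return log_pz + log_det

x = torch.randn(8, d)
print(log_prob(x))                    # explicit per-sample log-likelihoods
```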




Flow-VAE samples



Second order moments


Testing galaxy morphologies





Bayesian Inference a.k.a. Uncertainty Quantification

The Bayesian view of the problem: $$ p(z | x ) \propto p_\theta(x | z, \Sigma, \mathbf{\Pi}) p(z)$$ where:
  • $p( z | x )$ is the posterior
  • $p( x | z )$ is the data likelihood, contains the physics
  • $p( z )$ is the prior
[Figure: data $x_n$ and truth $x_0$; posterior samples $g_\theta(z)$ and their reconstructions $\mathbf{P} (\Pi \ast g_\theta(z))$; posterior median; data residuals $x_n - \mathbf{P} (\Pi \ast g_\theta(z))$; pixel-wise standard deviation]
$\Longrightarrow$ Uncertainties are fully captured by the posterior.

How to perform efficient posterior inference?


  • Posterior fitting by Variational Inference
    $$ \mathrm{ELBO} = - \mathbb{D}_{KL}\left( q_\phi(z) \parallel p(z) \right) \quad + \quad \mathbb{E}_{z \sim q_{\phi}} \left[ \log p_\theta(x_n | z, \Sigma_n, \Pi_n) \right]$$
  • Posterior fitting by $EL_{2}O$
    $$EL_2O = \arg \min_\theta \mathbb{E}_{z \sim p^{\prime}} \sum_i \alpha_i \parallel \nabla_{z}^i \ln q_\theta(z) - \nabla_{z}^{i} \ln p(z | x_n, \Sigma_n, \Pi_n) \parallel_2^2$$ See (Seljak & Yu, 2019) for more details.

  • Or your favorite method...