NeurIPS 2025
Stochastic differential equations (SDEs) are well suited to modelling noisy and/or irregularly-sampled time series, which are omnipresent in finance, physics, and machine learning applications. Traditional approaches require costly simulation of numerical solvers when sampling between arbitrary time points. We introduce Neural Stochastic Flows (NSFs) and their latent dynamic versions, which learn (latent) SDE transition laws directly using conditional normalising flows, with architectural constraints that preserve properties inherited from stochastic flows. This enables sampling between arbitrary states in a single step, providing up to two orders of magnitude speedup for distant time points. Experiments on synthetic SDE simulations and real-world tracking and video data demonstrate that NSF maintains distributional accuracy comparable to numerical approaches while dramatically reducing computation for arbitrary time-point sampling, enabling applications where numerical solvers remain prohibitively expensive.
Stochastic differential equations (SDEs) capture the noisy dynamics of robotics, finance, climate systems, and modern generative models. Real deployments demand forecasts across arbitrary gaps, and evaluations must remain fast enough for control or analysis loops.
Solver-based neural SDEs approximate these transitions by chaining hundreds of numerical solver steps (e.g., Euler–Maruyama), so the cost grows linearly with the horizon. This makes long-range or densely queried predictions difficult to deploy.
To address these computational challenges, we propose Neural Stochastic Flows (NSFs), which side-step numerical integration by learning, in a single pass, the transition density $p_\boldsymbol{\theta}(\boldsymbol{x}_t \mid \boldsymbol{x}_s; \Delta t)$, where $\boldsymbol{x}_s$ and $\boldsymbol{x}_t$ are states at times $s$ and $t$ with $s < t$, and $\Delta t := t - s$ is the time gap. The animation shows the solver baseline we replace and what NSF models.
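To make the computational contrast concrete, the minimal JAX sketch below compares a solver-style Euler–Maruyama rollout, whose cost scales with the number of steps, against the single-pass sampling interface that an NSF exposes. The `drift`, `diffusion`, and `nsf_like_sample` functions are toy stand-ins, not the learned networks from the paper.

```python
# Toy contrast (not the paper's code): solver rollout vs. single-pass sampling.
import jax
import jax.numpy as jnp

def drift(x, t):       # stand-in for a learned drift network
    return -x

def diffusion(x, t):   # stand-in for a learned diffusion network
    return 0.5 * jnp.ones_like(x)

def euler_maruyama(key, x_s, s, t, n_steps=200):
    """Solver baseline: cost grows with the number of steps between s and t."""
    dt = (t - s) / n_steps
    def step(carry, key_i):
        x, u = carry
        eps = jax.random.normal(key_i, x.shape)
        x_new = x + drift(x, u) * dt + diffusion(x, u) * jnp.sqrt(dt) * eps
        return (x_new, u + dt), None
    (x_t, _), _ = jax.lax.scan(step, (x_s, s), jax.random.split(key, n_steps))
    return x_t

def nsf_like_sample(key, x_s, s, dt):
    """Single-pass interface an NSF provides: draw x_t given (x_s, s, dt) with
    one base sample pushed through a conditional transform (trivial here)."""
    eps = jax.random.normal(key, x_s.shape)
    return x_s + drift(x_s, s) * dt + diffusion(x_s, s) * jnp.sqrt(dt) * eps

key = jax.random.PRNGKey(0)
x_s = jnp.ones(2)
print(euler_maruyama(key, x_s, 0.0, 1.0))     # 200 sequential steps
print(nsf_like_sample(key, x_s, 0.0, 1.0))    # one step for the whole gap
```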
As we describe in the paper, NSFs are designed to fulfil the following properties inherited from stochastic flows when viewed as weak solutions. Throughout this section we write the transition density as $p_\boldsymbol{\theta}(\boldsymbol{x}_t \mid \boldsymbol{x}_s; s, \Delta t)$, now conditioning also on the start time $s$.
For any $0 \le t_1 \le \dots \le t_n$, the conditionals $p_\boldsymbol{\theta}(\boldsymbol{x}_{t_{k+1}}\mid \boldsymbol{x}_{t_k}; t_k, t_{k+1}-t_k)$ are independent.
Realised by the architecture: each transition conditions only on $(\boldsymbol{x}_s, s, \Delta t)$ and draws independent Gaussian noise.
Evaluating at zero gap recovers the input state: $p_\boldsymbol{\theta}(\boldsymbol{x}_t \mid \boldsymbol{x}_s; s, 0) = \delta(\boldsymbol{x}_t - \boldsymbol{x}_s)$.
Realised by the architecture: we design the base distribution and coupling layers to collapse to the identity as $\Delta t\to 0$.
For any $0 \le t_i \le t_j \le t_k$, the transition density satisfies the Chapman–Kolmogorov equation:
$$p_\boldsymbol{\theta}(\boldsymbol{x}_{t_k} \mid \boldsymbol{x}_{t_i}; t_i, t_k - t_i) = \int p_\boldsymbol{\theta}(\boldsymbol{x}_{t_k} \mid \boldsymbol{x}_{t_j}; t_j, t_k - t_j)\, p_\boldsymbol{\theta}(\boldsymbol{x}_{t_j} \mid \boldsymbol{x}_{t_i}; t_i, t_j - t_i)\, \mathrm{d}\boldsymbol{x}_{t_j}.$$
Shaped by the bidirectional flow regulariser introduced in Flow Consistency Regularisation.
When the underlying SDE is time-homogeneous, the transition density depends only on the time gap: $p_\boldsymbol{\theta}(\boldsymbol{x}_{t_j}\mid \boldsymbol{x}_{t_i}; t_i, t_j - t_i) = p_\boldsymbol{\theta}(\boldsymbol{x}_{t_j + r}\mid \boldsymbol{x}_{t_i + r}; t_i + r, t_j - t_i)$ for any shift $r \ge 0$.
Realised by the architecture: for autonomous SDEs we drop the absolute time $t_i$ from the conditioner so the transition depends only on the gap.
Let $\boldsymbol{c} := (\boldsymbol{x}_{t_i}, \Delta t, t_i)$ denote the conditioning variables, where $\Delta t := t_j - t_i$ and $t_i$ is included only for non-autonomous systems and omitted otherwise. We instantiate the sampling procedure of the flow distribution as
$$\boldsymbol{z}_0 \sim \mathcal{N}\bigl(\boldsymbol{\mu}(\boldsymbol{c}),\ \operatorname{diag}(\boldsymbol{\sigma}(\boldsymbol{c}))^2\bigr), \qquad \boldsymbol{x}_{t_j} = (\boldsymbol{f}_L \circ \cdots \circ \boldsymbol{f}_1)(\boldsymbol{z}_0;\ \boldsymbol{c}, \boldsymbol{\theta}).$$
Our architecture combines a parametric Gaussian initialisation with a sequence of bijective transformations. The state-dependent Gaussian, centred at $\boldsymbol{\mu}(\boldsymbol{c})$ with noise scale $\boldsymbol{\sigma}(\boldsymbol{c})$, follows a form similar to the Euler–Maruyama discretisation, with the drift contribution scaled by $\Delta t$ and the diffusion contribution by $\sqrt{\Delta t}$. The subsequent bijective transformations $\boldsymbol{f}_1$ through $\boldsymbol{f}_L$ are implemented as conditional coupling flows whose parameters depend on $\boldsymbol{c}$. Each layer splits the state $\boldsymbol{z}$ into two partitions $(\boldsymbol{z}_\text{A},\boldsymbol{z}_\text{B})$ and applies a $\Delta t$-modulated affine update to one partition, conditioned on the other partition and on $\boldsymbol{c}$, with alternating partitions across layers. The explicit $\Delta t$ factor ensures $\boldsymbol{f}_i(\boldsymbol{z};\boldsymbol{c}, \boldsymbol{\theta}_i) = \boldsymbol{z}$ when $\Delta t=0$ and, combined with the form of the base Gaussian, preserves the identity at zero time gap. Stacking such layers yields an expressive diffeomorphism while keeping the Jacobian log-determinant tractable for loss computation.
Markov property: ensured by conditioning each transition only on its own initial state $\boldsymbol{x}_{t_i}$, with no noise shared between transitions.
Identity at zero time gap: the explicit $\Delta t$ factor in both the base Gaussian and the coupling layers ensures $p_\boldsymbol{\theta}(\boldsymbol{x}_t \mid \boldsymbol{x}_s; s, 0) = \delta(\boldsymbol{x}_t - \boldsymbol{x}_s)$.
Time-homogeneity: obtained by omitting $t_i$ from $\boldsymbol{c}$ when modelling autonomous SDEs.
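As a concrete illustration, the sketch below instantiates one possible NSF parameterisation in JAX: a state-dependent Gaussian base followed by $\Delta t$-gated affine coupling layers. The tiny MLPs and the $\exp(\Delta t \cdot s)$ scale / $\Delta t \cdot t$ shift gating are illustrative assumptions, not the paper's exact architecture, chosen so that the map collapses to the identity at $\Delta t = 0$.

```python
# One possible NSF parameterisation (illustrative, not the paper's exact one):
# a state-dependent Gaussian base plus Delta-t-gated affine coupling layers.
import jax
import jax.numpy as jnp

def init_mlp(key, d_in, d_hidden, d_out):
    k1, k2 = jax.random.split(key)
    return (0.1 * jax.random.normal(k1, (d_in, d_hidden)), jnp.zeros(d_hidden),
            0.1 * jax.random.normal(k2, (d_hidden, d_out)), jnp.zeros(d_out))

def mlp(p, inp):
    W1, b1, W2, b2 = p
    return jnp.tanh(inp @ W1 + b1) @ W2 + b2

def coupling(layer, z, cond, dt, flip):
    z_a, z_b = jnp.split(z, 2)
    if flip:
        z_a, z_b = z_b, z_a
    h = jnp.concatenate([z_a, cond])
    log_scale = dt * mlp(layer["scale"], h)          # vanishes when dt == 0
    shift = dt * mlp(layer["shift"], h)              # vanishes when dt == 0
    z_b = z_b * jnp.exp(log_scale) + shift           # identity at dt == 0
    return jnp.concatenate([z_b, z_a] if flip else [z_a, z_b])

def nsf_sample(key, params, x_s, dt):
    """Draw x_t | x_s across an arbitrary gap dt in a single pass."""
    cond = jnp.concatenate([x_s, jnp.array([dt])])
    mu = x_s + dt * mlp(params["mu"], cond)                     # Euler-like mean
    sigma = jnp.sqrt(dt) * jax.nn.softplus(mlp(params["sigma"], cond))
    z = mu + sigma * jax.random.normal(key, x_s.shape)          # base draw
    for i, layer in enumerate(params["layers"]):                # stacked couplings
        z = coupling(layer, z, cond, dt, flip=bool(i % 2))
    return z

d = 4
ks = jax.random.split(jax.random.PRNGKey(0), 9)
params = {
    "mu": init_mlp(ks[0], d + 1, 32, d),
    "sigma": init_mlp(ks[1], d + 1, 32, d),
    "layers": [{"scale": init_mlp(ks[2 + 2 * i], d // 2 + d + 1, 32, d // 2),
                "shift": init_mlp(ks[3 + 2 * i], d // 2 + d + 1, 32, d // 2)}
               for i in range(3)],
}
x_s = jnp.ones(d)
print(nsf_sample(ks[8], params, x_s, dt=0.5))   # one-pass draw over gap 0.5
print(nsf_sample(ks[8], params, x_s, dt=0.0))   # returns x_s exactly (identity)
```

Evaluating the sampler at $\Delta t = 0$ returns $\boldsymbol{x}_s$ exactly, matching the identity property in the recap above.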
To guarantee that recursive application of NSF matches its one-step predictions, we optimise a bidirectional KL flow loss that enforces the flow (Chapman–Kolmogorov) property:
$$\underbrace{p_\boldsymbol{\theta}(\boldsymbol{x}_{t_k} \mid \boldsymbol{x}_{t_i}; t_i, t_k - t_i)}_{\text{one-step}} \;=\; \underbrace{\int p_\boldsymbol{\theta}(\boldsymbol{x}_{t_k} \mid \boldsymbol{x}_{t_j}; t_j, t_k - t_j)\, p_\boldsymbol{\theta}(\boldsymbol{x}_{t_j} \mid \boldsymbol{x}_{t_i}; t_i, t_j - t_i)\, \mathrm{d}\boldsymbol{x}_{t_j}}_{\text{two-step}}.$$
To enforce this property, we minimise the discrepancy between the direct one-step transition (LHS) and the two-step transition through an intermediate state (RHS).
Several divergence candidates exist, such as optimal-transport/Wasserstein distances, Stein discrepancies, adversarial objectives, and kernel MMD. We choose KL divergences because tractable variational upper bounds on them can be minimised and, since the KL divergence is asymmetric, we use both directions.
Direct KLs remain intractable due to the marginalisation over the intermediate state $\boldsymbol{x}_{t_j}$ on the two-step side. Therefore, we introduce a bridge distribution $b_\boldsymbol{\xi}(\boldsymbol{x}_{t_j}\mid\boldsymbol{x}_{t_i},\boldsymbol{x}_{t_k})$ as an auxiliary variational distribution (Agakov and Barber, 2004; Ranganath et al., 2016; Salimans et al., 2015) and optimise variational upper bounds for both directions.
Specifically, for both the forward KL (one-step to two-step) and the reverse KL (two-step to one-step), we optimise variational upper bounds in which the bridge $b_\boldsymbol{\xi}$ proposes the intermediate state $\boldsymbol{x}_{t_j}$; the exact bounds are given in the paper. The total flow-consistency loss $\mathcal{L}_{\text{flow}}$, incorporated as a weighted regulariser alongside the main negative log-likelihood objective, is the sum of the two upper bounds.
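For intuition, the sketch below gives a generic Monte-Carlo estimate of a forward-direction consistency bound under this auxiliary-bridge construction: sample the one-step transition, let the bridge propose the intermediate state, and compare the joint log-densities. The `nsf_sample`, `nsf_log_prob`, `bridge_sample`, and `bridge_log_prob` helpers are simple Gaussian stand-ins so the snippet runs; the paper's exact bounds and estimators may differ.

```python
# Generic Monte-Carlo estimate of a forward consistency bound (illustrative).
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

# Toy Gaussian stand-ins; in practice these are the NSF transition density
# and a learned bridge network b_xi(x_j | x_i, x_k).
def nsf_sample(key, theta, x, dt):
    return x + jnp.sqrt(dt) * jax.random.normal(key, x.shape)

def nsf_log_prob(theta, x_next, x, dt):
    return jnp.sum(norm.logpdf(x_next, loc=x, scale=jnp.sqrt(dt)))

def bridge_sample(key, xi, x_i, x_k):
    return 0.5 * (x_i + x_k) + 0.1 * jax.random.normal(key, x_i.shape)

def bridge_log_prob(xi, x_j, x_i, x_k):
    return jnp.sum(norm.logpdf(x_j, loc=0.5 * (x_i + x_k), scale=0.1))

def forward_bound(key, theta, xi, x_i, t_i, t_j, t_k, n=64):
    """Average of log[p_one(x_k|x_i) b(x_j|x_i,x_k)] - log[p(x_k|x_j) p(x_j|x_i)]
    over samples from the one-step transition and the bridge: an upper bound on
    the forward KL between the one-step and two-step laws."""
    keys = jax.random.split(key, 2 * n).reshape(n, 2, 2)
    def one_term(ks):
        x_k = nsf_sample(ks[0], theta, x_i, t_k - t_i)       # one-step draw
        x_j = bridge_sample(ks[1], xi, x_i, x_k)             # bridge proposal
        log_num = (nsf_log_prob(theta, x_k, x_i, t_k - t_i)
                   + bridge_log_prob(xi, x_j, x_i, x_k))
        log_den = (nsf_log_prob(theta, x_k, x_j, t_k - t_j)
                   + nsf_log_prob(theta, x_j, x_i, t_j - t_i))
        return log_num - log_den
    return jnp.mean(jax.vmap(one_term)(keys))

print(forward_bound(jax.random.PRNGKey(0), None, None, jnp.zeros(2),
                    t_i=0.0, t_j=0.5, t_k=1.0))
```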
Irregularly-sampled real-world sequences seldom expose the full system state; instead, we observe noisy measurements $\boldsymbol{o}_{t_i}$ at times $0 = t_0 < t_1 < \dots < t_T$. To model the hidden continuous-time dynamics while remaining solver-free, we introduce Latent Neural Stochastic Flows (Latent NSFs): variational state-space models (VSSMs; Chung et al., 2015; Krishnan et al., 2017; Johnson et al., 2016; Karl et al., 2017) that extend the variational autoencoder framework (Kingma and Welling, 2014; Rezende et al., 2014) with NSF transition kernels.
Let $\boldsymbol{x}_{t_i} \in \mathbb{R}^d$ denote the latent state at time $t_i$, let $\boldsymbol{o}_{t_i}$ be the corresponding observation, and define $\Delta t_i := t_i - t_{i-1}$. The joint distribution factorises as
$$p_{\boldsymbol{\theta},\boldsymbol{\psi}}\bigl(\boldsymbol{x}_{t_{0:T}},\boldsymbol{o}_{t_{0:T}}\bigr) = p_{\boldsymbol{\theta}}(\boldsymbol{x}_{t_0})\, p_{\boldsymbol{\psi}}(\boldsymbol{o}_{t_0}\mid \boldsymbol{x}_{t_0}) \prod_{i=1}^{T} p_{\boldsymbol{\theta}}(\boldsymbol{x}_{t_i}\mid \boldsymbol{x}_{t_{i-1}}; t_{i-1},\Delta t_i)\, p_{\boldsymbol{\psi}}(\boldsymbol{o}_{t_i}\mid \boldsymbol{x}_{t_i}).$$
Here $p_{\boldsymbol{\theta}}(\boldsymbol{x}_{t_i}\mid \boldsymbol{x}_{t_{i-1}}; t_{i-1},\Delta t_i)$ is an NSF transition density that admits arbitrary $\Delta t_i$ without numerical integration, $p_{\boldsymbol{\psi}}(\boldsymbol{o}_{t_i}\mid \boldsymbol{x}_{t_i})$ is the observation decoder, and $p_{\boldsymbol{\theta}}(\boldsymbol{x}_{t_0})$ is an initial prior over the state.
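This factorisation can be sampled ancestrally without any solver; the sketch below walks the latent chain across irregular gaps and decodes an observation at each timestamp. `nsf_transition` and `decode` are toy Gaussian stand-ins for the learned NSF transition and decoder.

```python
# Ancestral sampling from the Latent NSF joint (illustrative stand-ins).
import jax
import jax.numpy as jnp

def nsf_transition(key, theta, x, dt):   # stand-in for p_theta(x_{t_i} | x_{t_{i-1}}; dt)
    return x + jnp.sqrt(dt) * jax.random.normal(key, x.shape)

def decode(key, psi, x):                 # stand-in for p_psi(o_{t_i} | x_{t_i})
    return x + 0.1 * jax.random.normal(key, x.shape)

def sample_joint(key, theta, psi, times, d_latent=3):
    """p(x_{t_0}) prod_i p(x_{t_i}|x_{t_{i-1}}; dt_i) p(o_{t_i}|x_{t_i})."""
    k0, key = jax.random.split(key)
    x = jax.random.normal(k0, (d_latent,))             # initial prior p(x_{t_0})
    xs, os = [], []
    prev_t = times[0]
    for t in times:
        kx, ko, key = jax.random.split(key, 3)
        x = nsf_transition(kx, theta, x, t - prev_t)   # one pass per gap (dt_0 = 0)
        xs.append(x)
        os.append(decode(ko, psi, x))
        prev_t = t
    return jnp.stack(xs), jnp.stack(os)

times = jnp.array([0.0, 0.13, 0.55, 0.60, 1.40])       # irregular timestamps
latents, observations = sample_joint(jax.random.PRNGKey(1), None, None, times)
```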
For the inference model, we employ a gated recurrent unit encoder (Cho et al., 2014) that processes the observation sequence:
$$\boldsymbol{h}_{t_i} = \mathrm{GRU}_{\boldsymbol{\phi}}\bigl([\boldsymbol{o}_{t_i}, \Delta t_i, t_i],\ \boldsymbol{h}_{t_{i-1}}\bigr), \qquad q_{\boldsymbol{\phi}}(\boldsymbol{x}_{t_i}\mid \boldsymbol{o}_{\le t_i}) = q_{\boldsymbol{\phi}}(\boldsymbol{x}_{t_i}\mid \boldsymbol{h}_{t_i}).$$
The hidden state is initialised as $\boldsymbol{h}_{t_{-1}} = \boldsymbol{0}$. Absolute time $t_i$ is included only for non-autonomous systems; for autonomous SDEs the encoder conditions on $[\boldsymbol{o}_{t_i}, \Delta t_i]$.
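A minimal sketch of such an encoder is shown below: a hand-rolled GRU cell consumes $[\boldsymbol{o}_{t_i}, \Delta t_i]$ (append $t_i$ for non-autonomous systems) and linear heads map the hidden state to Gaussian posterior parameters. The weight shapes and the Gaussian head are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative GRU encoder over irregularly-sampled observations.
import jax
import jax.numpy as jnp

def gru_cell(p, h, inp):
    xh = jnp.concatenate([inp, h])
    z = jax.nn.sigmoid(xh @ p["Wz"] + p["bz"])            # update gate
    r = jax.nn.sigmoid(xh @ p["Wr"] + p["br"])            # reset gate
    h_tilde = jnp.tanh(jnp.concatenate([inp, r * h]) @ p["Wh"] + p["bh"])
    return (1.0 - z) * h + z * h_tilde

def encode(params, obs, dts):
    """Scan the GRU over (o_{t_i}, dt_i); emit per-step posterior mean/scale."""
    def step(h, inp):
        o_i, dt_i = inp
        h = gru_cell(params["gru"], h, jnp.concatenate([o_i, dt_i[None]]))
        return h, (h @ params["Wmu"], jax.nn.softplus(h @ params["Wsig"]))
    h0 = jnp.zeros(params["Wmu"].shape[0])                # h_{t_{-1}} = 0
    _, (mus, sigmas) = jax.lax.scan(step, h0, (obs, dts))
    return mus, sigmas

d_obs, d_h, d_lat = 2, 8, 3
ks = jax.random.split(jax.random.PRNGKey(2), 6)
init = lambda k, shape: 0.1 * jax.random.normal(k, shape)
params = {"gru": {"Wz": init(ks[0], (d_obs + 1 + d_h, d_h)), "bz": jnp.zeros(d_h),
                  "Wr": init(ks[1], (d_obs + 1 + d_h, d_h)), "br": jnp.zeros(d_h),
                  "Wh": init(ks[2], (d_obs + 1 + d_h, d_h)), "bh": jnp.zeros(d_h)},
          "Wmu": init(ks[3], (d_h, d_lat)),
          "Wsig": init(ks[4], (d_h, d_lat))}
obs = jax.random.normal(ks[5], (5, d_obs))                 # 5 irregular observations
dts = jnp.array([0.0, 0.10, 0.42, 0.05, 0.71])             # time gaps Delta t_i
mus, sigmas = encode(params, obs, dts)
```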
Our goal is to train the generative model $p_{\boldsymbol{\theta},\boldsymbol{\psi}}\bigl(\boldsymbol{x}_{t_{0:T}},\boldsymbol{o}_{t_{0:T}}\bigr)$ and the inference model $q_{\boldsymbol{\phi}}(\boldsymbol{x}_{t_{0:T}}\mid \boldsymbol{o}_{\le t_T})$. A standard choice is the $\beta$-weighted negative ELBO ($\beta$-NELBO) loss (Higgins et al., 2017), which combines the expected reconstruction negative log-likelihood under the decoder with $\beta$-weighted KL terms between the approximate posterior and the NSF transition prior at each step.
The expectation is over $q_{\boldsymbol{\phi}}(\boldsymbol{x}_{t_i}\mid \boldsymbol{o}_{\le t_i})$ and $\mathrm{KL}(\cdot\|\cdot)$ denotes the Kullback–Leibler divergence.
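For a single time step, these terms might be estimated as in the sketch below, assuming a diagonal-Gaussian posterior, a reparameterised sample, and a single-sample Monte-Carlo KL against the NSF transition prior. `decoder_log_prob` and `nsf_log_prob` are hypothetical Gaussian stand-ins for the learned decoder and the NSF density.

```python
# Per-step beta-NELBO terms (illustrative single-sample estimates).
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

def decoder_log_prob(psi, o, x):                 # stand-in for log p_psi(o | x)
    return jnp.sum(norm.logpdf(o, loc=x, scale=0.1))

def nsf_log_prob(theta, x_next, x_prev, dt):     # stand-in for the NSF density
    return jnp.sum(norm.logpdf(x_next, loc=x_prev, scale=jnp.sqrt(dt)))

def step_nelbo(key, theta, psi, o_i, mu_q, sigma_q, x_prev, dt, beta=1.0):
    x_i = mu_q + sigma_q * jax.random.normal(key, mu_q.shape)   # reparameterised draw
    recon = -decoder_log_prob(psi, o_i, x_i)                    # reconstruction NLL
    log_q = jnp.sum(norm.logpdf(x_i, loc=mu_q, scale=sigma_q))
    kl_mc = log_q - nsf_log_prob(theta, x_i, x_prev, dt)        # MC estimate of the KL
    return recon + beta * kl_mc

val = step_nelbo(jax.random.PRNGKey(3), None, None,
                 o_i=jnp.zeros(3), mu_q=jnp.zeros(3), sigma_q=0.5 * jnp.ones(3),
                 x_prev=jnp.zeros(3), dt=0.2)
```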
Focusing only on adjacent time steps can accumulate error over long horizons, so we introduce a skip-ahead KL divergence loss (rightmost panel above), akin to latent overshooting (Hafner et al., 2019): for each step $i$, a future index $k \sim \mathcal{U}\{i+2,\,i+\tau\}$ is sampled and the posterior at $t_k$ is matched, via a KL divergence, to the NSF prior transition taken directly from $\boldsymbol{x}_{t_i}$ across the gap $t_k - t_i$.
Here $\mathcal{U}\{i+2,\,i+\tau\}$ denotes the discrete uniform distribution over indices $\{i+2,\dots,i+\tau\}$ with $\tau \ge 2$. Latent NSF transitions permit direct sampling across arbitrary time gaps, making this objective efficient to evaluate.
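The skip-ahead term then needs only one extra NSF evaluation per sampled index, as in the hedged sketch below; the divergence direction and the single-sample estimator are illustrative choices, and `nsf_log_prob` is the same kind of Gaussian stand-in as above.

```python
# Illustrative skip-ahead term: match the posterior at a randomly chosen
# future index k with a direct NSF transition across the whole gap.
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

def nsf_log_prob(theta, x_next, x_prev, dt):     # toy Gaussian stand-in
    return jnp.sum(norm.logpdf(x_next, loc=x_prev, scale=jnp.sqrt(dt)))

def skip_ahead_term(key, theta, times, mu_q, sigma_q, i, tau=4):
    k_idx, k_i, k_k = jax.random.split(key, 3)
    k = jax.random.randint(k_idx, (), i + 2, i + tau + 1)       # k ~ U{i+2,...,i+tau}
    dt = times[k] - times[i]                                    # direct gap, no solver
    x_i = mu_q[i] + sigma_q[i] * jax.random.normal(k_i, mu_q[i].shape)
    x_k = mu_q[k] + sigma_q[k] * jax.random.normal(k_k, mu_q[i].shape)
    log_q = jnp.sum(norm.logpdf(x_k, loc=mu_q[k], scale=sigma_q[k]))
    return log_q - nsf_log_prob(theta, x_k, x_i, dt)            # single-sample KL estimate

times = jnp.linspace(0.0, 2.0, 8)
mu_q, sigma_q = jnp.zeros((8, 3)), 0.5 * jnp.ones((8, 3))
print(skip_ahead_term(jax.random.PRNGKey(4), None, times, mu_q, sigma_q, i=1))
```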
Combining the $\beta$-NELBO with the flow-consistency loss $\mathcal{L}_{\text{flow}}$ and the skip-ahead loss $\mathcal{L}_{\text{skip}}$ yields the total loss:
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\beta\text{-NELBO}} + \lambda\,\mathcal{L}_{\text{flow}} + \beta_{\text{skip}}\,\mathcal{L}_{\text{skip}}.$$
The hyperparameters $\lambda$ and $\beta_{\text{skip}}$ control the strengths of the flow-consistency and skip-ahead KL divergence losses, respectively; $\beta$ weights the latent KL in the $\beta$-NELBO.
Detailed experimental configurations are documented in the paper.
NSF matches the ground-truth attractor while remaining solver-free. Independent one-step samples stay on the correct manifold, unlike solver-based neural SDEs that either diffuse or collapse when rolled forward.
| Method | KL ↓ (t=0.25) | kFLOPs ↓ (t=0.25) | KL ↓ (t=0.5) | kFLOPs ↓ (t=0.5) | KL ↓ (t=0.75) | kFLOPs ↓ (t=0.75) | KL ↓ (t=1.0) | kFLOPs ↓ (t=1.0) |
|---|---|---|---|---|---|---|---|---|
| Latent SDE (Li et al., 2020) | 2.1 ± 0.9 | 959 | 1.8 ± 0.1 | 1,917 | 0.9 ± 0.3 | 2,839 | 1.5 ± 0.5 | 3,760 |
| Neural LSDE (Oh et al., 2024) | 1.3 ± 0.4 | 1,712 | 7.2 ± 1.4 | 3,416 | 74.5 ± 24.6 | 5,057 | 53.1 ± 29.3 | 6,699 |
| Neural GSDE (Oh et al., 2024) | 1.2 ± 0.4 | 1,925 | 3.9 ± 0.3 | 3,848 | 20.2 ± 7.6 | 5,698 | 14.1 ± 8.4 | 7,548 |
| Neural LNSDE (Oh et al., 2024) | 1.7 ± 0.4 | 1,925 | 4.6 ± 0.7 | 3,848 | 57.3 ± 16.4 | 5,698 | 44.6 ± 23.4 | 7,548 |
| SDE matching (Bartosh et al., 2024) | | | | | | | | |
| $\Delta t = 0.0001$ | 4.3 ± 0.7 | 184,394 | 5.3 ± 1.0 | 368,787 | 3.4 ± 0.8 | 553,034 | 3.8 ± 1.0 | 737,354 |
| $\Delta t = 0.01$ | 6.3 ± 0.4 | 1,917 | 11.7 ± 0.5 | 3,834 | 7.9 ± 0.3 | 5,677 | 6.0 ± 0.3 | 7,520 |
| NSF (ours) | | | | | | | | |
| $ \mathcal{H}_{\mathrm{pred}} = 1.0$ | 0.8 ± 0.7 | 53 | 1.3 ± 0.1 | 53 | 0.6 ± 0.3 | 53 | 0.2 ± 0.6 | 53 |
| $ \mathcal{H}_{\mathrm{pred}} = 0.5$ | 2.4 ± 1.9 | 53 | 1.3 ± 0.1 | 53 | 1.0 ± 0.4 | 105 | 1.7 ± 1.1 | 105 |
| $ \mathcal{H}_{\mathrm{pred}} = 0.25$ | 1.2 ± 0.7 | 53 | 1.2 ± 0.1 | 105 | 0.8 ± 0.4 | 156 | 1.3 ± 0.8 | 208 |
Average runtime per 100 samples (JAX): latent SDE 124–148 ms; NSF 0.3 ms.
Latent NSF delivers the strongest extrapolation accuracy while remaining orders of magnitude faster than solver-based latent SDEs. Setup 1 evaluates within-horizon forecasting, and Setup 2 withholds the final third of each sequence during training.
| Method | Setup 1 (test MSE ↓) | Setup 2 (test MSE ↓) |
|---|---|---|
| npODE (Heinonen et al., 2018) | 22.96† | — |
| Neural ODE (Chen et al., 2018) | 22.49 ± 0.88† | — |
| ODE2VAE-KL (Yildiz et al., 2019) | 8.09 ± 1.95† | — |
| Latent ODE (Rubanova et al., 2019) | 5.98 ± 0.28* | 31.62 ± 0.05§ |
| Latent SDE (Li et al., 2020) | 12.91 ± 2.90♠ | 9.52 ± 0.21§ |
| Latent Approx SDE (Solin et al., 2021) | 7.55 ± 0.05§ | 10.50 ± 0.86§ |
| ARCTA (Course et al., 2023) | 7.62 ± 0.93‡ | 9.92 ± 1.82 |
| NCDSSM (Ansari et al., 2023) | 5.69 ± 0.01§ | 4.74 ± 0.01§ |
| SDE matching (Bartosh et al., 2024) | 5.20 ± 0.43♣ | 4.26 ± 0.35 |
| Latent NSF (ours) | 8.62 ± 0.32 | 3.41 ± 0.27 |
† reported by Yildiz et al. (2019); * reported by Rubanova et al. (2019); ‡ reported by Course et al. (2023); § reported by Ansari et al. (2023); ♠ our reproduction of Li et al. (2020) code: latent-sde-mocap; ♣ our reproduction of Bartosh et al. (2024) code: sde-matching-mocap.
Runtime per 100 samples (JAX): latent SDE 75 ms vs. Latent NSF 3.5 ms.
On video forecasting, Latent NSF preserves appearance quality while modelling stochastic futures. One-step sampling excels on static structure, and recursive rollout remains competitive on dynamics.
| Method | Static FD ↓ | Dynamics FD ↓ | Frame-wise FD ↓ |
|---|---|---|---|
| Latent SDE | 2.66 ± 0.87 | 5.39 ± 3.10 | 0.58 ± 0.21 |
| Latent NSF (recursive) | 2.36 ± 0.60 | 7.76 ± 2.56 | 0.63 ± 0.17 |
| Latent NSF (one-step) | 1.67 ± 0.47 | — | 0.63 ± 0.15 |
Dynamics FD is undefined for one-step NSF because it predicts the marginal distribution rather than multi-step trajectories.
@inproceedings{kiyohara2025neural,
title = {Neural Stochastic Flows: Solver-Free Modelling and Inference for {SDE} Solutions},
author = {Naoki Kiyohara and Edward Johns and Yingzhen Li},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025}
}