NeurIPS 2025
Stochastic differential equations (SDEs) are well suited to modelling noisy and/or irregularly-sampled time series, which are omnipresent in finance, physics, and machine learning applications. Traditional approaches require costly simulation of numerical solvers when sampling between arbitrary time points. We introduce Neural Stochastic Flows (NSFs) and their latent dynamic versions, which learn (latent) SDE transition laws directly using conditional normalising flows, with architectural constraints that preserve properties inherited from stochastic flows. This enables sampling between arbitrary states in a single step, providing up to two orders of magnitude speedup for distant time points. Experiments on synthetic SDE simulations and real-world tracking and video data demonstrate that NSF maintains distributional accuracy comparable to numerical approaches while dramatically reducing computation for arbitrary time-point sampling, enabling applications where numerical solvers remain prohibitively expensive.
Stochastic differential equations (SDEs) capture the noisy dynamics of robotics, finance, climate systems, and modern generative models. Real deployments demand forecasts across arbitrary gaps, and evaluations must remain fast enough for control or analysis loops.
Solver-based neural SDEs approximate these transitions by chaining hundreds of numerical solver steps (e.g., Euler–Maruyama), so the cost grows linearly with the horizon. This makes long-range or densely queried predictions difficult to deploy.
To address these computational challenges, we propose Neural Stochastic Flows (NSFs), which side-step numerical integration by learning, in a single pass, the transition density $p_\boldsymbol{\theta}(\boldsymbol{x}_t \mid \boldsymbol{x}_s; \Delta t)$, where $\boldsymbol{x}_s$ and $\boldsymbol{x}_t$ are states at times $s$ and $t$ with $s < t$, and $\Delta t := t - s$ is the time gap. The animation shows the solver baseline we replace and what NSF models.
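To make the computational contrast concrete, the minimal JAX sketch below compares a solver-style Euler–Maruyama rollout, whose cost scales with the number of steps, against the single-pass sampling interface that an NSF exposes. The `drift`, `diffusion`, and `nsf_like_sample` functions are toy stand-ins, not the learned networks from the paper.

```python
# Toy contrast (not the paper's code): solver rollout vs. single-pass sampling.
import jax
import jax.numpy as jnp

def drift(x, t):       # stand-in for a learned drift network
    return -x

def diffusion(x, t):   # stand-in for a learned diffusion network
    return 0.5 * jnp.ones_like(x)

def euler_maruyama(key, x_s, s, t, n_steps=200):
    """Solver baseline: cost grows with the number of steps between s and t."""
    dt = (t - s) / n_steps
    def step(carry, key_i):
        x, u = carry
        eps = jax.random.normal(key_i, x.shape)
        x_new = x + drift(x, u) * dt + diffusion(x, u) * jnp.sqrt(dt) * eps
        return (x_new, u + dt), None
    (x_t, _), _ = jax.lax.scan(step, (x_s, s), jax.random.split(key, n_steps))
    return x_t

def nsf_like_sample(key, x_s, s, dt):
    """Single-pass interface an NSF provides: draw x_t given (x_s, s, dt) with
    one base sample pushed through a conditional transform (trivial here)."""
    eps = jax.random.normal(key, x_s.shape)
    return x_s + drift(x_s, s) * dt + diffusion(x_s, s) * jnp.sqrt(dt) * eps

key = jax.random.PRNGKey(0)
x_s = jnp.ones(2)
print(euler_maruyama(key, x_s, 0.0, 1.0))     # 200 sequential steps
print(nsf_like_sample(key, x_s, 0.0, 1.0))    # one step for the whole gap
```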
As we describe in the paper, NSFs are designed to fulfil the following properties inherited from stochastic flows when viewed as weak solutions. Throughout this section we write the transition density as $p_\boldsymbol{\theta}(\boldsymbol{x}_t \mid \boldsymbol{x}_s; s, \Delta t)$, now conditioning also on the start time $s$.
For any $0 \le t_1 \le \dots \le t_n$, the conditionals $p_\boldsymbol{\theta}(\boldsymbol{x}_{t_{k+1}}\mid \boldsymbol{x}_{t_k}; t_k, t_{k+1}-t_k)$ are independent.
Realised by the architecture: each transition conditions only on $(\boldsymbol{x}_s, s, \Delta t)$ and draws independent Gaussian noise.
Evaluating at zero gap recovers the input state: $p_\boldsymbol{\theta}(\boldsymbol{x}_t \mid \boldsymbol{x}_s; s, 0) = \delta(\boldsymbol{x}_t - \boldsymbol{x}_s)$.
Realised by the architecture: we design the base distribution and coupling layers to collapse to the identity as $\Delta t\to 0$.
For any $0 \le t_i \le t_j \le t_k$, the transition density satisfies the Chapman–Kolmogorov equation:
$$p_\boldsymbol{\theta}(\boldsymbol{x}_{t_k} \mid \boldsymbol{x}_{t_i}; t_i, t_k - t_i) = \int p_\boldsymbol{\theta}(\boldsymbol{x}_{t_k} \mid \boldsymbol{x}_{t_j}; t_j, t_k - t_j)\, p_\boldsymbol{\theta}(\boldsymbol{x}_{t_j} \mid \boldsymbol{x}_{t_i}; t_i, t_j - t_i)\, \mathrm{d}\boldsymbol{x}_{t_j}.$$
Shaped by the bidirectional flow regulariser introduced in Flow Consistency Regularisation.
When the underlying SDE is time-homogeneous, the transition density depends only on the time gap: $p_\boldsymbol{\theta}(\boldsymbol{x}_{t_j}\mid \boldsymbol{x}_{t_i}; t_i, t_j - t_i) = p_\boldsymbol{\theta}(\boldsymbol{x}_{t_j + r}\mid \boldsymbol{x}_{t_i + r}; t_i + r, t_j - t_i)$ for any shift $r \ge 0$.
Realised by the architecture: for autonomous SDEs we drop the absolute time $t_i$ from the conditioner so the transition depends only on the gap.
Let $\boldsymbol{c} := (\boldsymbol{x}_{t_i}, \Delta t, t_i)$ denote the conditioning variables, where $\Delta t := t_j - t_i$ and $t_i$ is included only for non-autonomous systems and omitted otherwise. We instantiate the sampling procedure of the flow distribution as
$$\boldsymbol{z}_0 \sim \mathcal{N}\bigl(\boldsymbol{\mu}(\boldsymbol{c}),\ \operatorname{diag}(\boldsymbol{\sigma}(\boldsymbol{c}))^2\bigr), \qquad \boldsymbol{x}_{t_j} = (\boldsymbol{f}_L \circ \cdots \circ \boldsymbol{f}_1)(\boldsymbol{z}_0;\ \boldsymbol{c}, \boldsymbol{\theta}).$$
Our architecture combines a parametric Gaussian initialisation with a sequence of bijective transformations. The state-dependent Gaussian, centred at $\boldsymbol{\mu}(\boldsymbol{c})$ with noise scale $\boldsymbol{\sigma}(\boldsymbol{c})$, follows a form similar to the Euler–Maruyama discretisation, with the drift contribution scaled by $\Delta t$ and the diffusion contribution by $\sqrt{\Delta t}$. The subsequent bijective transformations $\boldsymbol{f}_1$ through $\boldsymbol{f}_L$ are implemented as conditional coupling flows whose parameters depend on $\boldsymbol{c}$. Each layer splits the state $\boldsymbol{z}$ into two partitions $(\boldsymbol{z}_\text{A},\boldsymbol{z}_\text{B})$ and applies a $\Delta t$-modulated affine update to one partition, conditioned on the other partition and on $\boldsymbol{c}$, with alternating partitions across layers. The explicit $\Delta t$ factor ensures $\boldsymbol{f}_i(\boldsymbol{z};\boldsymbol{c}, \boldsymbol{\theta}_i) = \boldsymbol{z}$ when $\Delta t=0$ and, combined with the form of the base Gaussian, preserves the identity at zero time gap. Stacking such layers yields an expressive diffeomorphism while keeping the Jacobian log-determinant tractable for loss computation.
Markov property: ensured by conditioning each transition only on its own initial state $\boldsymbol{x}_{t_i}$, with no noise shared between transitions.
Identity at zero time gap: the explicit $\Delta t$ factor in both the base Gaussian and the coupling layers ensures $p_\boldsymbol{\theta}(\boldsymbol{x}_t \mid \boldsymbol{x}_s; s, 0) = \delta(\boldsymbol{x}_t - \boldsymbol{x}_s)$.
Time-homogeneity: obtained by omitting $t_i$ from $\boldsymbol{c}$ when modelling autonomous SDEs.
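As a concrete illustration, the sketch below instantiates one possible NSF parameterisation in JAX: a state-dependent Gaussian base followed by $\Delta t$-gated affine coupling layers. The tiny MLPs and the $\exp(\Delta t \cdot s)$ scale / $\Delta t \cdot t$ shift gating are illustrative assumptions, not the paper's exact architecture, chosen so that the map collapses to the identity at $\Delta t = 0$.

```python
# One possible NSF parameterisation (illustrative, not the paper's exact one):
# a state-dependent Gaussian base plus Delta-t-gated affine coupling layers.
import jax
import jax.numpy as jnp

def init_mlp(key, d_in, d_hidden, d_out):
    k1, k2 = jax.random.split(key)
    return (0.1 * jax.random.normal(k1, (d_in, d_hidden)), jnp.zeros(d_hidden),
            0.1 * jax.random.normal(k2, (d_hidden, d_out)), jnp.zeros(d_out))

def mlp(p, inp):
    W1, b1, W2, b2 = p
    return jnp.tanh(inp @ W1 + b1) @ W2 + b2

def coupling(layer, z, cond, dt, flip):
    z_a, z_b = jnp.split(z, 2)
    if flip:
        z_a, z_b = z_b, z_a
    h = jnp.concatenate([z_a, cond])
    log_scale = dt * mlp(layer["scale"], h)          # vanishes when dt == 0
    shift = dt * mlp(layer["shift"], h)              # vanishes when dt == 0
    z_b = z_b * jnp.exp(log_scale) + shift           # identity at dt == 0
    return jnp.concatenate([z_b, z_a] if flip else [z_a, z_b])

def nsf_sample(key, params, x_s, dt):
    """Draw x_t | x_s across an arbitrary gap dt in a single pass."""
    cond = jnp.concatenate([x_s, jnp.array([dt])])
    mu = x_s + dt * mlp(params["mu"], cond)                     # Euler-like mean
    sigma = jnp.sqrt(dt) * jax.nn.softplus(mlp(params["sigma"], cond))
    z = mu + sigma * jax.random.normal(key, x_s.shape)          # base draw
    for i, layer in enumerate(params["layers"]):                # stacked couplings
        z = coupling(layer, z, cond, dt, flip=bool(i % 2))
    return z

d = 4
ks = jax.random.split(jax.random.PRNGKey(0), 9)
params = {
    "mu": init_mlp(ks[0], d + 1, 32, d),
    "sigma": init_mlp(ks[1], d + 1, 32, d),
    "layers": [{"scale": init_mlp(ks[2 + 2 * i], d // 2 + d + 1, 32, d // 2),
                "shift": init_mlp(ks[3 + 2 * i], d // 2 + d + 1, 32, d // 2)}
               for i in range(3)],
}
x_s = jnp.ones(d)
print(nsf_sample(ks[8], params, x_s, dt=0.5))   # one-pass draw over gap 0.5
print(nsf_sample(ks[8], params, x_s, dt=0.0))   # returns x_s exactly (identity)
```

Evaluating the sampler at $\Delta t = 0$ returns $\boldsymbol{x}_s$ exactly, matching the identity property in the recap above.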
To guarantee that recursive application of NSF matches its one-step predictions, we optimise a bidirectional KL flow loss that enforces the flow (Chapman–Kolmogorov) property:
$$\underbrace{p_\boldsymbol{\theta}(\boldsymbol{x}_{t_k} \mid \boldsymbol{x}_{t_i}; t_i, t_k - t_i)}_{\text{one-step}} \;=\; \underbrace{\int p_\boldsymbol{\theta}(\boldsymbol{x}_{t_k} \mid \boldsymbol{x}_{t_j}; t_j, t_k - t_j)\, p_\boldsymbol{\theta}(\boldsymbol{x}_{t_j} \mid \boldsymbol{x}_{t_i}; t_i, t_j - t_i)\, \mathrm{d}\boldsymbol{x}_{t_j}}_{\text{two-step}}.$$
To enforce this property, we minimise the discrepancy between the direct one-step transition (LHS) and the two-step transition through an intermediate state (RHS).
Several divergence candidates exist, such as optimal-transport/Wasserstein distances, Stein discrepancies, adversarial objectives, and kernel MMD. We choose KL divergences because tractable variational upper bounds on them can be minimised and, since the KL divergence is asymmetric, we use both directions.
Direct KLs remain intractable due to the marginalisation over the intermediate state $\boldsymbol{x}_{t_j}$ on the two-step side. Therefore, we introduce a bridge distribution $b_\boldsymbol{\xi}(\boldsymbol{x}_{t_j}\mid\boldsymbol{x}_{t_i},\boldsymbol{x}_{t_k})$ as an auxiliary variational distribution (Agakov and Barber, 2004; Ranganath et al., 2016; Salimans et al., 2015) and optimise variational upper bounds for both directions.
Specifically, for both the forward KL (one-step to two-step) and the reverse KL (two-step to one-step), we optimise variational upper bounds in which the bridge $b_\boldsymbol{\xi}$ proposes the intermediate state $\boldsymbol{x}_{t_j}$; the exact bounds are given in the paper. The total flow-consistency loss $\mathcal{L}_{\text{flow}}$, incorporated as a weighted regulariser alongside the main negative log-likelihood objective, is the sum of the two upper bounds.
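For intuition, the sketch below gives a generic Monte-Carlo estimate of a forward-direction consistency bound under this auxiliary-bridge construction: sample the one-step transition, let the bridge propose the intermediate state, and compare the joint log-densities. The `nsf_sample`, `nsf_log_prob`, `bridge_sample`, and `bridge_log_prob` helpers are simple Gaussian stand-ins so the snippet runs; the paper's exact bounds and estimators may differ.

```python
# Generic Monte-Carlo estimate of a forward consistency bound (illustrative).
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

# Toy Gaussian stand-ins; in practice these are the NSF transition density
# and a learned bridge network b_xi(x_j | x_i, x_k).
def nsf_sample(key, theta, x, dt):
    return x + jnp.sqrt(dt) * jax.random.normal(key, x.shape)

def nsf_log_prob(theta, x_next, x, dt):
    return jnp.sum(norm.logpdf(x_next, loc=x, scale=jnp.sqrt(dt)))

def bridge_sample(key, xi, x_i, x_k):
    return 0.5 * (x_i + x_k) + 0.1 * jax.random.normal(key, x_i.shape)

def bridge_log_prob(xi, x_j, x_i, x_k):
    return jnp.sum(norm.logpdf(x_j, loc=0.5 * (x_i + x_k), scale=0.1))

def forward_bound(key, theta, xi, x_i, t_i, t_j, t_k, n=64):
    """Average of log[p_one(x_k|x_i) b(x_j|x_i,x_k)] - log[p(x_k|x_j) p(x_j|x_i)]
    over samples from the one-step transition and the bridge: an upper bound on
    the forward KL between the one-step and two-step laws."""
    keys = jax.random.split(key, 2 * n).reshape(n, 2, 2)
    def one_term(ks):
        x_k = nsf_sample(ks[0], theta, x_i, t_k - t_i)       # one-step draw
        x_j = bridge_sample(ks[1], xi, x_i, x_k)             # bridge proposal
        log_num = (nsf_log_prob(theta, x_k, x_i, t_k - t_i)
                   + bridge_log_prob(xi, x_j, x_i, x_k))
        log_den = (nsf_log_prob(theta, x_k, x_j, t_k - t_j)
                   + nsf_log_prob(theta, x_j, x_i, t_j - t_i))
        return log_num - log_den
    return jnp.mean(jax.vmap(one_term)(keys))

print(forward_bound(jax.random.PRNGKey(0), None, None, jnp.zeros(2),
                    t_i=0.0, t_j=0.5, t_k=1.0))
```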
Irregularly-sampled real-world sequences seldom expose the full system state; instead, we observe noisy measurements $\boldsymbol{o}_{t_i}$ at times $0 = t_0 < t_1 < \dots < t_T$. To model the hidden continuous-time dynamics while remaining solver-free, we introduce Latent Neural Stochastic Flows (Latent NSFs): variational state-space models (VSSMs; Chung et al., 2015; Krishnan et al., 2017; Johnson et al., 2016; Karl et al., 2017) that extend the variational autoencoder framework (Kingma and Welling, 2014; Rezende et al., 2014) with NSF transition kernels.
Let $\boldsymbol{x}_{t_i} \in \mathbb{R}^d$ denote the latent state at time $t_i$, let $\boldsymbol{o}_{t_i}$ be the corresponding observation, and define $\Delta t_i := t_i - t_{i-1}$. The joint distribution factorises as
$$p_{\boldsymbol{\theta},\boldsymbol{\psi}}\bigl(\boldsymbol{x}_{t_{0:T}},\boldsymbol{o}_{t_{0:T}}\bigr) = p_{\boldsymbol{\theta}}(\boldsymbol{x}_{t_0})\, p_{\boldsymbol{\psi}}(\boldsymbol{o}_{t_0}\mid \boldsymbol{x}_{t_0}) \prod_{i=1}^{T} p_{\boldsymbol{\theta}}(\boldsymbol{x}_{t_i}\mid \boldsymbol{x}_{t_{i-1}}; t_{i-1},\Delta t_i)\, p_{\boldsymbol{\psi}}(\boldsymbol{o}_{t_i}\mid \boldsymbol{x}_{t_i}).$$
Here $p_{\boldsymbol{\theta}}(\boldsymbol{x}_{t_i}\mid \boldsymbol{x}_{t_{i-1}}; t_{i-1},\Delta t_i)$ is an NSF transition density that admits arbitrary $\Delta t_i$ without numerical integration, $p_{\boldsymbol{\psi}}(\boldsymbol{o}_{t_i}\mid \boldsymbol{x}_{t_i})$ is the observation decoder, and $p_{\boldsymbol{\theta}}(\boldsymbol{x}_{t_0})$ is an initial prior over the state.
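This factorisation can be sampled ancestrally without any solver; the sketch below walks the latent chain across irregular gaps and decodes an observation at each timestamp. `nsf_transition` and `decode` are toy Gaussian stand-ins for the learned NSF transition and decoder.

```python
# Ancestral sampling from the Latent NSF joint (illustrative stand-ins).
import jax
import jax.numpy as jnp

def nsf_transition(key, theta, x, dt):   # stand-in for p_theta(x_{t_i} | x_{t_{i-1}}; dt)
    return x + jnp.sqrt(dt) * jax.random.normal(key, x.shape)

def decode(key, psi, x):                 # stand-in for p_psi(o_{t_i} | x_{t_i})
    return x + 0.1 * jax.random.normal(key, x.shape)

def sample_joint(key, theta, psi, times, d_latent=3):
    """p(x_{t_0}) prod_i p(x_{t_i}|x_{t_{i-1}}; dt_i) p(o_{t_i}|x_{t_i})."""
    k0, key = jax.random.split(key)
    x = jax.random.normal(k0, (d_latent,))             # initial prior p(x_{t_0})
    xs, os = [], []
    prev_t = times[0]
    for t in times:
        kx, ko, key = jax.random.split(key, 3)
        x = nsf_transition(kx, theta, x, t - prev_t)   # one pass per gap (dt_0 = 0)
        xs.append(x)
        os.append(decode(ko, psi, x))
        prev_t = t
    return jnp.stack(xs), jnp.stack(os)

times = jnp.array([0.0, 0.13, 0.55, 0.60, 1.40])       # irregular timestamps
latents, observations = sample_joint(jax.random.PRNGKey(1), None, None, times)
```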
For the inference model, we employ a gated recurrent unit encoder (Cho et al., 2014) that processes the observation sequence:
$$\boldsymbol{h}_{t_i} = \mathrm{GRU}_{\boldsymbol{\phi}}\bigl([\boldsymbol{o}_{t_i}, \Delta t_i, t_i],\ \boldsymbol{h}_{t_{i-1}}\bigr), \qquad q_{\boldsymbol{\phi}}(\boldsymbol{x}_{t_i}\mid \boldsymbol{o}_{\le t_i}) = q_{\boldsymbol{\phi}}(\boldsymbol{x}_{t_i}\mid \boldsymbol{h}_{t_i}).$$
The hidden state is initialised as $\boldsymbol{h}_{t_{-1}} = \boldsymbol{0}$. Absolute time $t_i$ is included only for non-autonomous systems; for autonomous SDEs the encoder conditions on $[\boldsymbol{o}_{t_i}, \Delta t_i]$.
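A minimal sketch of such an encoder is shown below: a hand-rolled GRU cell consumes $[\boldsymbol{o}_{t_i}, \Delta t_i]$ (append $t_i$ for non-autonomous systems) and linear heads map the hidden state to Gaussian posterior parameters. The weight shapes and the Gaussian head are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative GRU encoder over irregularly-sampled observations.
import jax
import jax.numpy as jnp

def gru_cell(p, h, inp):
    xh = jnp.concatenate([inp, h])
    z = jax.nn.sigmoid(xh @ p["Wz"] + p["bz"])            # update gate
    r = jax.nn.sigmoid(xh @ p["Wr"] + p["br"])            # reset gate
    h_tilde = jnp.tanh(jnp.concatenate([inp, r * h]) @ p["Wh"] + p["bh"])
    return (1.0 - z) * h + z * h_tilde

def encode(params, obs, dts):
    """Scan the GRU over (o_{t_i}, dt_i); emit per-step posterior mean/scale."""
    def step(h, inp):
        o_i, dt_i = inp
        h = gru_cell(params["gru"], h, jnp.concatenate([o_i, dt_i[None]]))
        return h, (h @ params["Wmu"], jax.nn.softplus(h @ params["Wsig"]))
    h0 = jnp.zeros(params["Wmu"].shape[0])                # h_{t_{-1}} = 0
    _, (mus, sigmas) = jax.lax.scan(step, h0, (obs, dts))
    return mus, sigmas

d_obs, d_h, d_lat = 2, 8, 3
ks = jax.random.split(jax.random.PRNGKey(2), 6)
init = lambda k, shape: 0.1 * jax.random.normal(k, shape)
params = {"gru": {"Wz": init(ks[0], (d_obs + 1 + d_h, d_h)), "bz": jnp.zeros(d_h),
                  "Wr": init(ks[1], (d_obs + 1 + d_h, d_h)), "br": jnp.zeros(d_h),
                  "Wh": init(ks[2], (d_obs + 1 + d_h, d_h)), "bh": jnp.zeros(d_h)},
          "Wmu": init(ks[3], (d_h, d_lat)),
          "Wsig": init(ks[4], (d_h, d_lat))}
obs = jax.random.normal(ks[5], (5, d_obs))                 # 5 irregular observations
dts = jnp.array([0.0, 0.10, 0.42, 0.05, 0.71])             # time gaps Delta t_i
mus, sigmas = encode(params, obs, dts)
```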
Our goal is to train the generative model $p_{\boldsymbol{\theta},\boldsymbol{\psi}}\bigl(\boldsymbol{x}_{t_{0:T}},\boldsymbol{o}_{t_{0:T}}\bigr)$ and the inference model $q_{\boldsymbol{\phi}}(\boldsymbol{x}_{t_{0:T}}\mid \boldsymbol{o}_{\le t_T})$. A standard choice is the $\beta$-weighted negative ELBO ($\beta$-NELBO) loss (Higgins et al., 2017), which combines the expected reconstruction negative log-likelihood under the decoder with $\beta$-weighted KL terms between the approximate posterior and the NSF transition prior at each step.
The expectation is over $q_{\boldsymbol{\phi}}(\boldsymbol{x}_{t_i}\mid \boldsymbol{o}_{\le t_i})$ and $\mathrm{KL}(\cdot\|\cdot)$ denotes the Kullback–Leibler divergence.
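For a single time step, these terms might be estimated as in the sketch below, assuming a diagonal-Gaussian posterior, a reparameterised sample, and a single-sample Monte-Carlo KL against the NSF transition prior. `decoder_log_prob` and `nsf_log_prob` are hypothetical Gaussian stand-ins for the learned decoder and the NSF density.

```python
# Per-step beta-NELBO terms (illustrative single-sample estimates).
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

def decoder_log_prob(psi, o, x):                 # stand-in for log p_psi(o | x)
    return jnp.sum(norm.logpdf(o, loc=x, scale=0.1))

def nsf_log_prob(theta, x_next, x_prev, dt):     # stand-in for the NSF density
    return jnp.sum(norm.logpdf(x_next, loc=x_prev, scale=jnp.sqrt(dt)))

def step_nelbo(key, theta, psi, o_i, mu_q, sigma_q, x_prev, dt, beta=1.0):
    x_i = mu_q + sigma_q * jax.random.normal(key, mu_q.shape)   # reparameterised draw
    recon = -decoder_log_prob(psi, o_i, x_i)                    # reconstruction NLL
    log_q = jnp.sum(norm.logpdf(x_i, loc=mu_q, scale=sigma_q))
    kl_mc = log_q - nsf_log_prob(theta, x_i, x_prev, dt)        # MC estimate of the KL
    return recon + beta * kl_mc

val = step_nelbo(jax.random.PRNGKey(3), None, None,
                 o_i=jnp.zeros(3), mu_q=jnp.zeros(3), sigma_q=0.5 * jnp.ones(3),
                 x_prev=jnp.zeros(3), dt=0.2)
```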
Focusing only on adjacent time steps can accumulate error over long horizons, so we introduce a skip-ahead KL divergence loss (rightmost panel above), akin to latent overshooting (Hafner et al., 2019): for each step $i$, a future index $k \sim \mathcal{U}\{i+2,\,i+\tau\}$ is sampled and the posterior at $t_k$ is matched, via a KL divergence, to the NSF prior transition taken directly from $\boldsymbol{x}_{t_i}$ across the gap $t_k - t_i$.
Here $\mathcal{U}\{i+2,\,i+\tau\}$ denotes the discrete uniform distribution over indices $\{i+2,\dots,i+\tau\}$ with $\tau \ge 2$. Latent NSF transitions permit direct sampling across arbitrary time gaps, making this objective efficient to evaluate.
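The skip-ahead term then needs only one extra NSF evaluation per sampled index, as in the hedged sketch below; the divergence direction and the single-sample estimator are illustrative choices, and `nsf_log_prob` is the same kind of Gaussian stand-in as above.

```python
# Illustrative skip-ahead term: match the posterior at a randomly chosen
# future index k with a direct NSF transition across the whole gap.
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

def nsf_log_prob(theta, x_next, x_prev, dt):     # toy Gaussian stand-in
    return jnp.sum(norm.logpdf(x_next, loc=x_prev, scale=jnp.sqrt(dt)))

def skip_ahead_term(key, theta, times, mu_q, sigma_q, i, tau=4):
    k_idx, k_i, k_k = jax.random.split(key, 3)
    k = jax.random.randint(k_idx, (), i + 2, i + tau + 1)       # k ~ U{i+2,...,i+tau}
    dt = times[k] - times[i]                                    # direct gap, no solver
    x_i = mu_q[i] + sigma_q[i] * jax.random.normal(k_i, mu_q[i].shape)
    x_k = mu_q[k] + sigma_q[k] * jax.random.normal(k_k, mu_q[i].shape)
    log_q = jnp.sum(norm.logpdf(x_k, loc=mu_q[k], scale=sigma_q[k]))
    return log_q - nsf_log_prob(theta, x_k, x_i, dt)            # single-sample KL estimate

times = jnp.linspace(0.0, 2.0, 8)
mu_q, sigma_q = jnp.zeros((8, 3)), 0.5 * jnp.ones((8, 3))
print(skip_ahead_term(jax.random.PRNGKey(4), None, times, mu_q, sigma_q, i=1))
```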
Combining the $\beta$-NELBO with the flow-consistency loss $\mathcal{L}_{\text{flow}}$ and the skip-ahead loss $\mathcal{L}_{\text{skip}}$ yields the total loss:
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\beta\text{-NELBO}} + \lambda\,\mathcal{L}_{\text{flow}} + \beta_{\text{skip}}\,\mathcal{L}_{\text{skip}}.$$
The hyperparameters $\lambda$ and $\beta_{\text{skip}}$ control the strengths of the flow-consistency and skip-ahead KL divergence losses, respectively; $\beta$ weights the latent KL in the $\beta$-NELBO.
Detailed experimental configurations are documented in the paper.
NSF matches the ground-truth attractor while remaining solver-free. Independent one-step samples stay on the correct manifold, unlike solver-based neural SDEs that either diffuse or collapse when rolled forward.
| Method | KL ↓ (t=0.25) | kFLOPs ↓ (t=0.25) | KL ↓ (t=0.5) | kFLOPs ↓ (t=0.5) | KL ↓ (t=0.75) | kFLOPs ↓ (t=0.75) | KL ↓ (t=1.0) | kFLOPs ↓ (t=1.0) |
|---|---|---|---|---|---|---|---|---|
| Latent SDE (Li et al., 2020) | 2.1 ± 0.9 | 959 | 1.8 ± 0.1 | 1,917 | 0.9 ± 0.3 | 2,839 | 1.5 ± 0.5 | 3,760 |
| Neural LSDE (Oh et al., 2024) | 1.3 ± 0.4 | 1,712 | 7.2 ± 1.4 | 3,416 | 74.5 ± 24.6 | 5,057 | 53.1 ± 29.3 | 6,699 |
| Neural GSDE (Oh et al., 2024) | 1.2 ± 0.4 | 1,925 | 3.9 ± 0.3 | 3,848 | 20.2 ± 7.6 | 5,698 | 14.1 ± 8.4 | 7,548 |
| Neural LNSDE (Oh et al., 2024) | 1.7 ± 0.4 | 1,925 | 4.6 ± 0.7 | 3,848 | 57.3 ± 16.4 | 5,698 | 44.6 ± 23.4 | 7,548 |
| SDE matching (Bartosh et al., 2024) | | | | | | | | |
| $\Delta t = 0.0001$ | 4.3 ± 0.7 | 184,394 | 5.3 ± 1.0 | 368,787 | 3.4 ± 0.8 | 553,034 | 3.8 ± 1.0 | 737,354 |
| $\Delta t = 0.01$ | 6.3 ± 0.4 | 1,917 | 11.7 ± 0.5 | 3,834 | 7.9 ± 0.3 | 5,677 | 6.0 ± 0.3 | 7,520 |
| NSF (ours) | | | | | | | | |
| $ \mathcal{H}_{\mathrm{pred}} = 1.0$ | 0.8 ± 0.7 | 53 | 1.3 ± 0.1 | 53 | 0.6 ± 0.3 | 53 | 0.2 ± 0.6 | 53 |
| $ \mathcal{H}_{\mathrm{pred}} = 0.5$ | 2.4 ± 1.9 | 53 | 1.3 ± 0.1 | 53 | 1.0 ± 0.4 | 105 | 1.7 ± 1.1 | 105 |
| $ \mathcal{H}_{\mathrm{pred}} = 0.25$ | 1.2 ± 0.7 | 53 | 1.2 ± 0.1 | 105 | 0.8 ± 0.4 | 156 | 1.3 ± 0.8 | 208 |
Average runtime per 100 samples (JAX): latent SDE 124–148 ms; NSF 0.3 ms.
Latent NSF delivers the strongest extrapolation accuracy while remaining orders of magnitude faster than solver-based latent SDEs. Setup 1 evaluates within-horizon forecasting, and Setup 2 withholds the final third of each sequence during training.
| Method | Setup 1 (test MSE ↓) | Setup 2 (test MSE ↓) |
|---|---|---|
| npODE (Heinonen et al., 2018) | 22.96† | — |
| Neural ODE (Chen et al., 2018) | 22.49 ± 0.88† | — |
| ODE2VAE-KL (Yildiz et al., 2019) | 8.09 ± 1.95† | — |
| Latent ODE (Rubanova et al., 2019) | 5.98 ± 0.28* | 31.62 ± 0.05§ |
| Latent SDE (Li et al., 2020) | 12.91 ± 2.90♠ | 9.52 ± 0.21§ |
| Latent Approx SDE (Solin et al., 2021) | 7.55 ± 0.05§ | 10.50 ± 0.86§ |
| ARCTA (Course et al., 2023) | 7.62 ± 0.93‡ | 9.92 ± 1.82 |
| NCDSSM (Ansari et al., 2023) | 5.69 ± 0.01§ | 4.74 ± 0.01§ |
| SDE matching (Bartosh et al., 2024) | 5.20 ± 0.43♣ | 4.26 ± 0.35 |
| Latent NSF (ours) | 8.62 ± 0.32 | 3.41 ± 0.27 |
† reported by Yildiz et al. (2019); * reported by Rubanova et al. (2019); ‡ reported by Course et al. (2023); § reported by Ansari et al. (2023); ♠ our reproduction of Li et al. (2020) code: latent-sde-mocap; ♣ our reproduction of Bartosh et al. (2024) code: sde-matching-mocap.
Runtime per 100 samples (JAX): latent SDE 75 ms vs. Latent NSF 3.5 ms.
On video forecasting, Latent NSF preserves appearance quality while modelling stochastic futures. One-step sampling excels on static structure, and recursive rollout remains competitive on dynamics.
| Method | Static FD ↓ | Dynamics FD ↓ | Frame-wise FD ↓ |
|---|---|---|---|
| Latent SDE | 2.66 ± 0.87 | 5.39 ± 3.10 | 0.58 ± 0.21 |
| Latent NSF (recursive) | 2.36 ± 0.60 | 7.76 ± 2.56 | 0.63 ± 0.17 |
| Latent NSF (one-step) | 1.67 ± 0.47 | — | 0.63 ± 0.15 |
Dynamics FD is undefined for one-step NSF because it predicts the marginal distribution rather than multi-step trajectories.
@inproceedings{kiyohara2025neural,
title = {Neural Stochastic Flows: Solver-Free Modelling and Inference for {SDE} Solutions},
author = {Naoki Kiyohara and Edward Johns and Yingzhen Li},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025}
}