NeurIPS 2025

Recurrent Memory for Online Interdomain Gaussian Processes

Online HiPPO Sparse Variational Gaussian Processes (OHSVGP) bring recurrent memory to streaming GP inference via interdomain inducing features.

1 Imperial College London  •  2 Canon Inc.  •  3 University of Copenhagen

* Equal contribution.

Abstract

We propose a novel online Gaussian process (GP) model that is capable of capturing long-term memory in sequential data. Our model, Online HiPPO Sparse Variational Gaussian Process (OHSVGP), leverages the HiPPO framework by interpreting its time-varying orthogonal projections as inducing variables with time-dependent orthogonal polynomial basis functions, which allows the SVGP inducing variables to memorize the process history. We show that the HiPPO framework fits naturally into the interdomain GP framework and that the kernel matrices can be updated online in a recurrence form based on the ODE evolution of HiPPO. OHSVGP is evaluated on online prediction for 1D time series, continual learning with multidimensional inputs, and deep generative modeling with sparse GP variational autoencoders, outperforming existing online GP methods in predictive performance, long-term memory preservation, and computational efficiency.

Motivation

Gaussian processes remain a staple for modeling sequential data because they combine flexibility with calibrated uncertainty (Roberts et al., 2013) and plug naturally into deep generative models (Fortuin et al., 2020). But the cubic training and quadratic memory cost of exact GPs quickly becomes prohibitive in streaming settings such as sensor feeds or epidemiological surveillance.

Our paper therefore focuses on retaining those statistical benefits while learning online, without discarding the memory of previous observations.

Accordingly, we focus on two practical desiderata that any online GP surrogate must satisfy if it is to preserve earlier information without ballooning costs:

  • Incremental updates

    Carry forward the previously fitted posterior and summarize all past batches with a fixed-size memory state, instead of solving a fresh linear system after every new observation.

  • Long-term memory

    Preserve calibrated predictive uncertainty even when the stream drifts, avoiding the catastrophic forgetting observed in earlier online SVGP methods.

Background & challenge

Let $\mathcal{X}$ be the input space; for time series we take $\mathcal{X}=[0,\infty)$. For any finite set of inputs $\mathbf{X} = [x_1,\ldots,x_n]^\intercal$, the random vector $\mathbf{f} \equiv f(\mathbf{X}) = [f(x_1),\ldots,f(x_n)]^\intercal$ follows $\mathcal{N}(\mathbf{0},\mathbf{K}_{\mathbf{ff}})$ with $[\mathbf{K}_{\mathbf{ff}}]_{ij}=k(x_i,x_j)$. Obtaining posterior predictions on $\mathbf{X}_1$ costs $\mathcal{O}(n_1^3)$ time and $\mathcal{O}(n_1^2)$ memory, where $n_1 = |\mathbf{X}_1|$. Given responses $\mathbf{y}$ and inputs $\mathbf{X}$, we model $y_i \sim p(y_i \mid f(x_i))$ with GP prior $f \sim \mathcal{GP}(0,k)$, turning to approximate inference when the likelihood is non-conjugate.

Variational and interdomain sparse GPs

Sparse variational GPs introduce $M$ inducing points $\mathbf{Z}\in\mathcal{X}^M$ with inducing variables $\mathbf{u} = [f(\mathbf{z}_1),\ldots,f(\mathbf{z}_M)]^\intercal$ and define $q(\mathbf{f},\mathbf{u}) := p(\mathbf{f} \mid \mathbf{u})\,q_\theta(\mathbf{u})$, leading to the evidence lower bound

$$\log p(\mathbf{y}) \geq \sum_{i=1}^n \mathbb{E}_{q(f_i)}[\log p(y_i \mid f_i)] - \mathrm{KL}\!\left[q_\theta(\mathbf{u}) \,\middle\|\, p(\mathbf{u})\right] =: \mathcal{L}_\theta.$$

With $q_\theta(\mathbf{u}) = \mathcal{N}(\mathbf{u}; \mathbf{m}_{\mathbf{u}}, \mathbf{S}_{\mathbf{u}})$, the posterior marginals become

$$q(f_i) = \mathcal{N}\!\left(f_i;\; \mathbf{K}_{f_i\mathbf{u}}\mathbf{K}_{\mathbf{uu}}^{-1}\mathbf{m}_{\mathbf{u}},\; \mathbf{K}_{f_i f_i} - \mathbf{K}_{f_i\mathbf{u}}\mathbf{K}_{\mathbf{uu}}^{-1}[\mathbf{K}_{\mathbf{uu}} - \mathbf{S}_{\mathbf{u}}]\mathbf{K}_{\mathbf{uu}}^{-1}\mathbf{K}_{\mathbf{u} f_i}\right).$$

Interdomain GPs generalize the inducing variables to $u_m := \int f(x)\phi_m(x)\,\mathrm{d}x$, so the basis functions $\phi_m$ determine the structure of $\mathbf{K}_{\mathbf{fu}}$ and $\mathbf{K}_{\mathbf{uu}}$.
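To make these quantities concrete, here is a minimal NumPy sketch (our own illustration, not the paper's code) that evaluates the marginals $q(f_i)$ from precomputed $\mathbf{K}_{\mathbf{fu}}$, $\mathbf{K}_{\mathbf{uu}}$, $\mathbf{m}_{\mathbf{u}}$, and $\mathbf{S}_{\mathbf{u}}$; for interdomain features the same formula applies with the kernel matrices replaced by the covariances induced by the $\phi_m$.

```python
import numpy as np

def svgp_marginals(K_fu, K_uu, K_ff_diag, m_u, S_u, jitter=1e-6):
    """Mean and variance of q(f_i) = N(A m_u, k(x_i,x_i) - A (K_uu - S_u) A^T)
    with A = K_fu K_uu^{-1}, evaluated for all n inputs at once."""
    M = K_uu.shape[0]
    L = np.linalg.cholesky(K_uu + jitter * np.eye(M))
    # A = K_fu K_uu^{-1} via two solves against the Cholesky factor
    A = np.linalg.solve(L.T, np.linalg.solve(L, K_fu.T)).T          # (n, M)
    mean = A @ m_u                                                   # (n,)
    var = K_ff_diag - np.einsum("nm,mk,nk->n", A, K_uu - S_u, A)     # (n,)
    return mean, var
```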

Online SVGP recap

When batches $(\mathbf{X}_{t_1},\mathbf{y}_{t_1}),(\mathbf{X}_{t_2},\mathbf{y}_{t_2}),\ldots$ arrive sequentially, online SVGP (Bui et al., 2017) maximizes

$$\sum_{i=1}^{n_{t_2}} \mathop{\mathbb{E}}_{q_{t_2}(f_i)} \left[ \log p_{t_2}(y_i \mid f_i) \right] + \mathrm{KL}\!\left( \tilde{q}_{t_2}(\mathbf{u}_{t_1}) \,\middle\|\, p_{t_1}(\mathbf{u}_{t_1}) \right) - \mathrm{KL}\!\left( \tilde{q}_{t_2}(\mathbf{u}_{t_1}) \,\middle\|\, q_{t_1}(\mathbf{u}_{t_1}) \right) - \mathrm{KL}\!\left( q_{t_2}(\mathbf{u}_{t_2}) \,\middle\|\, p_{t_2}(\mathbf{u}_{t_2}) \right).$$

Here $y_i\in\mathbf{y}_{t_2}$ for $i=1,\ldots,n_{t_2}$ and $\tilde{q}_{t_2}(\mathbf{u}_{t_1}) := \int p_{t_2}(\mathbf{u}_{t_1}\mid\mathbf{u}_{t_2})\,q_{t_2}(\mathbf{u}_{t_2})\,\mathrm{d}\mathbf{u}_{t_2}$. The expectations $\mathbb{E}_{q_{t_2}(f_i)}[\log p_{t_2}(y_i \mid f_i)]$ use the variational marginals $q_{t_2}(f_i)=\int p(f_i\mid\mathbf{u}_{t_2})\,q_{t_2}(\mathbf{u}_{t_2})\,\mathrm{d}\mathbf{u}_{t_2}$ of the new latent function values.

This objective therefore keeps the approximate posterior consistent between time steps while requiring access only to the current minibatch.

However, optimized inducing points drift toward the newest data, so the model gradually forgets earlier regions unless the inducing set keeps growing—a limitation documented empirically by Bui et al. (2017) and reproduced in our experiments.

SVGP-only rollout: as inducing points migrate, the predictive variance in older time spans collapses to the prior.

Our approach: HiPPO memory + interdomain GP

We interpret HiPPO’s time-varying orthogonal projections, introduced by Gu et al. (2020), as interdomain inducing variables, i.e. $u_m^{(t)}=\int f(x)\phi_m^{(t)}(x)\,\mathrm{d}x$ with $\phi_m^{(t)}(x)=g_m^{(t)}(x)\,\omega^{(t)}(x)$. This turns the HiPPO state into an adaptive inducing set that compresses all past observations.

HiPPO equation animation: time-varying basis functions evolve as the inducing memory.

Recurrent inducing memory

Let $\mathbf{u}^{(t)}=\int f(x)\,\boldsymbol{\phi}^{(t)}(x)\,\mathrm{d}x$ denote the vector of HiPPO projection coefficients, with $\boldsymbol{\phi}^{(t)}=[\phi_1^{(t)},\ldots,\phi_M^{(t)}]^\intercal$. Differentiating the projection gives $\dot{\mathbf{u}}^{(t)}=\mathbf{A}(t)\mathbf{u}^{(t)}+\mathbf{B}(t)f(t)$, where $\mathbf{A}(t)$ and $\mathbf{B}(t)$ encode the HiPPO ODE (e.g., HiPPO-LegS). Using this ODE, we can track how the prior cross-covariance $\mathbf{K}_{\mathbf{fu}}^{(t)}$, with $[\mathbf{K}_{\mathbf{fu}}^{(t)}]_{n,:}=\operatorname{cov}(f(x_n),\mathbf{u}^{(t)})$, evolves over time; its $n$-th row satisfies

$$\frac{\mathrm{d}}{\mathrm{d}t}\left[\mathbf{K}_{\mathbf{fu}}^{(t)}\right]_{n,:} = \mathbf{A}(t)\left[\mathbf{K}_{\mathbf{fu}}^{(t)}\right]_{n,:} + \mathbf{B}(t)k(x_n,t).$$

Integrating this recurrence yields closed-form updates for each row of $\mathbf{K}_{\mathbf{fu}}^{(t)}$ without re-evaluating kernel integrals from scratch.
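As an illustration of this recurrence, the sketch below forward-Euler-integrates the row ODE using HiPPO-LegS generators (written from the convention of Gu et al. (2020), i.e. $\mathbf{A}(t)=-\mathbf{A}/t$ and $\mathbf{B}(t)=\mathbf{B}/t$); the step size, the guard at $t=0$, and the kernel callable are illustrative choices, not the paper's.

```python
import numpy as np

def legs_generators(M):
    """Time-invariant HiPPO-LegS generators; the time-varying ODE then uses
    A(t) = -A / t and B(t) = B / t (convention of Gu et al., 2020)."""
    q = np.sqrt(2.0 * np.arange(M) + 1.0)
    A = np.where(np.arange(M)[:, None] > np.arange(M)[None, :], np.outer(q, q), 0.0)
    A += np.diag(np.arange(M) + 1.0)
    return A, q

def evolve_Kfu_row(k_xn, t_grid, M):
    """Forward Euler for d/dt [K_fu^(t)]_{n,:} = A(t) [K_fu^(t)]_{n,:} + B(t) k(x_n, t),
    where k_xn(t) evaluates the kernel k(x_n, t) along the time axis."""
    A, B = legs_generators(M)
    row = np.zeros(M)
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        dt = t1 - t0
        s = max(t0, 1e-8)                 # guard against the 1/t singularity at t = 0
        row = row + dt * ((-A / s) @ row + (B / s) * k_xn(t0))
    return row

# Example usage with an RBF kernel centred at x_n = 0.3:
rbf = lambda t: np.exp(-0.5 * (0.3 - t) ** 2)
row_t1 = evolve_Kfu_row(rbf, np.linspace(0.0, 1.0, 200), M=16)
```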

For the inducing covariance $\mathbf{K}_{\mathbf{uu}}^{(t)}$ the situation is more delicate: its $\ell m$-th element $[\mathbf{K}_{\mathbf{uu}}^{(t)}]_{\ell m} =\iint k(x,x^\prime)\phi_{\ell}^{(t)}(x)\phi_{m}^{(t)}(x^\prime)\,\mathrm{d}x\,\mathrm{d}x^\prime$ involves a double integral and therefore does not directly follow the single-integral HiPPO dynamics above. In our implementation we handle this by using random Fourier features (Rahimi & Recht, 2007) to factorize the kernel:

$$k(x,x^\prime) \approx \frac{1}{N}\sum_{n=1}^N\left[ \cos\left(w_nx\right)\cos\left(w_nx^\prime\right) + \sin\left(w_nx\right)\sin\left(w_nx^\prime\right)\right],$$

with $w_n\sim p(w)$ drawn from the spectral density. Substituting this expansion into the double integral factorizes $[\mathbf{K}_{\mathbf{uu}}^{(t)}]_{\ell m}$ into a sum of products of one-dimensional integrals (e.g. $\int \cos(w_n x)\phi_{\ell}^{(t)}(x)\,\mathrm{d}x$ and $\int \sin(w_n x)\phi_{\ell}^{(t)}(x)\,\mathrm{d}x$). Each of these integrals is a standard HiPPO projection coefficient that satisfies the linear ODE in $t$. We evolve these features recurrently and average their outer products to obtain an RFF approximation of $\mathbf{K}_{\mathbf{uu}}^{(t)}$.
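Below is a sketch of the final assembly step, assuming the cosine/sine projection coefficients have already been evolved to time $t$ by the same linear ODE; the shapes, random seeds, and RBF spectral density are illustrative placeholders rather than the paper's configuration.

```python
import numpy as np

def rff_Kuu(alpha, beta):
    """RFF approximation of K_uu^{(t)}: alpha[n] and beta[n] hold the M HiPPO
    projections of cos(w_n x) and sin(w_n x), each evolved to time t."""
    N = alpha.shape[0]
    return (alpha.T @ alpha + beta.T @ beta) / N      # (M, M)

# Frequencies for an RBF kernel with lengthscale ell: its spectral density is
# a Gaussian with standard deviation 1 / ell.
ell, N, M = 1.0, 256, 16
w = np.random.default_rng(0).normal(scale=1.0 / ell, size=N)
# alpha, beta would come from running the K_fu-style recurrence on the signals
# cos(w_n x) and sin(w_n x); random placeholders keep this snippet runnable.
alpha = np.random.default_rng(1).normal(size=(N, M))
beta = np.random.default_rng(2).normal(size=(N, M))
K_uu_t = rff_Kuu(alpha, beta)
```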

Alternatively, one can differentiate $\mathbf{K}_{\mathbf{uu}}^{(t)}$ directly with respect to $t$, which yields the Lyapunov-type matrix ODE derived in the appendix:

$$\frac{\mathrm{d}}{\mathrm{d}t}\,\mathbf{K}_{\mathbf{uu}}^{(t)} = \mathbf{A}(t)\,\mathbf{K}_{\mathbf{uu}}^{(t)} + \mathbf{K}_{\mathbf{uu}}^{(t)}\mathbf{A}(t)^\top + \frac{1}{t}\bigl(\tilde{\mathbf{B}}(t)+\tilde{\mathbf{B}}(t)^\top\bigr).$$

Here $\mathbf{A}(t)$ is the HiPPO matrix and $\tilde{\mathbf{B}}(t)=\mathbf{c}(t)\mathbf{1}_M^\top$ with $c_{\ell}(t)=\int k(t,x)\,\phi_{\ell}^{(t)}(x)\,\mathrm{d}x$. This direct evolution provides deterministic dynamics for $\mathbf{K}_{\mathbf{uu}}^{(t)}$, but the resulting ODE is stiff in practice, so our experiments use the RFF-based factorization above.
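For completeness, a forward-Euler step of this matrix ODE looks as follows (purely illustrative, since, as noted, the ODE is stiff and our experiments use the RFF route; $\mathbf{A}(t)$ and $\tilde{\mathbf{B}}(t)$ are passed in as callables):

```python
import numpy as np

def evolve_Kuu_lyapunov(A_of_t, Btilde_of_t, t_grid, K0):
    """Forward Euler for dK/dt = A(t) K + K A(t)^T + (1/t)(B~(t) + B~(t)^T).
    Illustrative only: a stiff solver (or the RFF factorization) is preferable."""
    K = K0.copy()
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        dt = t1 - t0
        s = max(t0, 1e-8)                          # avoid the 1/t factor blowing up at t = 0
        A, Bt = A_of_t(t0), Btilde_of_t(t0)
        K = K + dt * (A @ K + K @ A.T + (Bt + Bt.T) / s)
    return K
```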

Predictive distribution & online updates

Substituting the recurrently updated $\mathbf{K}_{\mathbf{fu}}^{(t)}$ and $\mathbf{K}_{\mathbf{uu}}^{(t)}$ into the posterior expression for $q(f_i)$ keeps the predictive distribution synchronized with the HiPPO memory, so sequential ELBO optimization inherits long-term information without growing the inducing set.

The toy streaming example below visualizes this effect. Standard online SVGP stores only a set of localized inducing points, so its posterior quickly collapses to the prior away from the latest batch. OHSVGP instead evolves time-dependent HiPPO basis functions that act as a global memory state, so the posterior mean and variance remain tied to the full history and catastrophic forgetting is mitigated.

OHSVGP animation illustrating the online update process. Top-Left: Current phase. Top-Right: GP predictive posterior $q_t(f)$ updating with new data. Bottom-Left: Evolution of memory state $q(\mathbf{u}^{(t)})$ (mean and variance). Bottom-Right: Time-varying HiPPO basis functions $\phi_n^{(t)}(\tau)$ stretching over time.

Once the HiPPO recurrence is plugged into the variational updates, the inducing state summarizes the entire stream and preserves calibrated uncertainty on early tasks.

Online rollout comparison: HiPPO-SVGP keeps calibrated uncertainty on early data while the standard SVGP forgets.

These elements mirror the three contributions:

  1. Interdomain memory.

     HiPPO projections act as interdomain inducing variables so the memory state compresses history.

  2. ODE recurrences.

     We derive update rules, including the RFF construction for the inducing covariance, enabling fixed-cost updates.

  3. Benchmark coverage.

     The approach is demonstrated across online prediction, continual learning, and GPVAE benchmarks.

Algorithms

We provide the algorithmic details for the OHSVGP Evidence Lower Bound (ELBO) computation. Differences from standard SVGP lie in the kernel matrices evolved by the HiPPO recurrences, marked below with the superscripts $t_1$ and $t_2$.

Algorithm 1: The HiPPO-SVGP ELBO for a single task

Require:
  • $\mathbf{X}=\{x_1,\ldots,x_n=t_1\}$ (training inputs up to time $t_1$)
  • $\{y_1,\ldots,y_n\}$ (training targets)
  • $\mathbf{Z}\in\mathbb{R}^{M\times 1}$ (inducing points)
  • $\mathbf{A}(t)\in\mathbb{R}^{M\times M}, \mathbf{B}(t)\in\mathbb{R}^{M\times 1}$ (HiPPO matrices)
  • $\mathbf{m}_{\mathbf{u}}, \mathbf{S}_{\mathbf{u}}$ (variational parameters)
  1. In standard SVGP, $\mathbf{K}_{\mathbf{fu}}=k(\mathbf{X},\mathbf{Z})$ and $\mathbf{K}_{\mathbf{uu}}=k(\mathbf{Z},\mathbf{Z})$;
    OHSVGP instead obtains $\mathbf{K}_{\mathbf{fu}}^{t_1}, \mathbf{K}_{\mathbf{uu}}^{t_1}$ from the HiPPO ODEs evolved from 0 to $t_1$ with $\mathbf{A}(t), \mathbf{B}(t)$
  2. $\mu(x_i) = \mathbf{K}_{f_i\mathbf{u}}^{t_1} (\mathbf{K}_{\mathbf{uu}}^{t_1})^{-1} \mathbf{m}_{\mathbf{u}}$ // Variational Posterior Mean
  3. $\sigma^2(x_i) = \mathbf{K}_{f_if_i}^{t_1} - \mathbf{K}_{f_i\mathbf{u}}^{t_1}(\mathbf{K}_{\mathbf{uu}}^{t_1})^{-1}[\mathbf{K}_{\mathbf{uu}}^{t_1} - \mathbf{S}_{\mathbf{u}}](\mathbf{K}_{\mathbf{uu}}^{t_1})^{-1}\mathbf{K}_{\mathbf{u}f_i}^{t_1}$ // Variational Posterior Variance
  4. $\ell_{\text{varexp}} \gets \sum_{i=1}^n \mathbb{E}_{\mathcal{N}(\mu(x_i),\sigma^2(x_i))}[\log p(y_i \mid f_i)]$
  5. $\text{KL} \gets \text{KL}(\mathcal{N}(\mathbf{m}_{\mathbf{u}}, \mathbf{S}_{\mathbf{u}}) \,||\, \mathcal{N}(0, \mathbf{K}_{\mathbf{uu}}^{t_1}))$
  6. return $\ell_{\text{varexp}} - \text{KL}$
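A minimal NumPy sketch of Algorithm 1 for the Gaussian-likelihood case, assuming the evolved matrices $\mathbf{K}_{\mathbf{fu}}^{t_1}$ and $\mathbf{K}_{\mathbf{uu}}^{t_1}$ have already been produced by the recurrences above (variable names and the noise parameterization are ours):

```python
import numpy as np

def gauss_kl_zero_mean_prior(m_q, S_q, K_p, jitter=1e-6):
    """KL( N(m_q, S_q) || N(0, K_p) ) between M-dimensional Gaussians."""
    M = K_p.shape[0]
    Lp = np.linalg.cholesky(K_p + jitter * np.eye(M))
    alpha = np.linalg.solve(Lp, m_q)                                  # Lp^{-1} m_q
    iK_S = np.linalg.solve(Lp.T, np.linalg.solve(Lp, S_q))            # K_p^{-1} S_q
    logdet_p = 2.0 * np.sum(np.log(np.diag(Lp)))
    logdet_q = np.linalg.slogdet(S_q)[1]
    return 0.5 * (np.trace(iK_S) + alpha @ alpha - M + logdet_p - logdet_q)

def ohsvgp_elbo_task1(y, K_fu_t1, K_uu_t1, K_ff_diag, m_u, S_u, noise_var, jitter=1e-6):
    """Algorithm 1 with a Gaussian likelihood: closed-form variational
    expectations plus the KL against the evolved prior N(0, K_uu^{t1})."""
    M = K_uu_t1.shape[0]
    L = np.linalg.cholesky(K_uu_t1 + jitter * np.eye(M))
    A = np.linalg.solve(L.T, np.linalg.solve(L, K_fu_t1.T)).T         # K_fu K_uu^{-1}
    mean = A @ m_u
    var = K_ff_diag - np.einsum("nm,mk,nk->n", A, K_uu_t1 - S_u, A)
    varexp = np.sum(-0.5 * np.log(2.0 * np.pi * noise_var)
                    - 0.5 * ((y - mean) ** 2 + var) / noise_var)
    return varexp - gauss_kl_zero_mean_prior(m_u, S_u, K_uu_t1, jitter)
```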

Algorithm 2: The OHSVGP ELBO on the second task

Require:
  • $\mathbf{X}'=\{t_1 < x'_1,\ldots,x'_{n'}=t_2\}$ (inputs up to time $t_2$)
  • $\{y'_1,\ldots,y'_{n'}\}$ (targets)
  • $\mathbf{Z}_{t_1}$ (frozen inducing points from task 1), $\mathbf{u}_{t_1}=f(\mathbf{Z}_{t_1})$
  • $\mathbf{Z}_{t_2}$ (new inducing points for task 2), $\mathbf{u}_{t_2}=f(\mathbf{Z}_{t_2})$
  • $\mathbf{A}(t), \mathbf{B}(t)$ (HiPPO matrices)
  • $\mathbf{m}_{\mathbf{u}_{t_1}}, \mathbf{S}_{\mathbf{u}_{t_1}}$ (learned params from task 1)
  • $\mathbf{m}_{\mathbf{u}_{t_2}}, \mathbf{S}_{\mathbf{u}_{t_2}}$ (new params for task 2)
  • $\mathbf{K}_{\mathbf{u}_{t_1}\mathbf{u}_{t_1}}^{t_1}$ (evolved inducing covariance carried over from task 1)
  1. $\mathbf{K}_{\mathbf{f}'\mathbf{u}_{t_2}}=k(\mathbf{X}',\mathbf{Z}_{t_2}), \mathbf{K}_{\mathbf{u}_{t_2}\mathbf{u}_{t_2}}=k(\mathbf{Z}_{t_2},\mathbf{Z}_{t_2})$
    $\mathbf{K}_{\mathbf{f}'\mathbf{u}_{t_2}}^{t_2}$ evolved from 0 to $t_2$, $\mathbf{K}_{\mathbf{u}_{t_2}\mathbf{u}_{t_2}}^{t_2}$ evolved from $t_1$ to $t_2$
  2. $\mu_{t_2}(x'_i) = \mathbf{K}_{f'_i\mathbf{u}_{t_2}}^{t_2} (\mathbf{K}_{\mathbf{u}_{t_2}\mathbf{u}_{t_2}}^{t_2})^{-1} \mathbf{m}_{\mathbf{u}_{t_2}}$
  3. $\sigma^2_{t_2}(x'_i)$ updated similarly using evolved kernels
  4. $\ell_{\text{varexp}} \gets \sum_{i=1}^{n'} \mathbb{E}[\log p(y'_i \mid f'_i)]$
  5. Compute $\tilde{\mathbf{m}}_{t_1t_2}, \tilde{\mathbf{S}}_{t_1t_2}$ for $\tilde{q}_{t_2}(\mathbf{u}_{t_1})$
  6. $\text{KL} \gets \text{KL}(\mathcal{N}(\mathbf{m}_{\mathbf{u}_{t_2}}, \mathbf{S}_{\mathbf{u}_{t_2}}) \,||\, \mathcal{N}(0, \mathbf{K}_{\mathbf{u}_{t_2}\mathbf{u}_{t_2}}^{t_2}))$
  7. $\text{Correction} \gets \text{KL}(\tilde{q}_{t_2}(\mathbf{u}_{t_1}) || p(\mathbf{u}_{t_1})) - \text{KL}(\tilde{q}_{t_2}(\mathbf{u}_{t_1}) || q_{t_1}(\mathbf{u}_{t_1}))$
  8. return $\ell_{\text{varexp}} - \text{KL} + \text{Correction}$
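Steps 5 and 7 are standard Gaussian marginalization and KL computations; the sketch below spells them out under our own naming, with $\mathbf{K}_{11}, \mathbf{K}_{12}, \mathbf{K}_{22}$ the prior (co)variances among old and new inducing variables under the current, time-evolved model and $\mathbf{K}_{\text{old}} = \mathbf{K}_{\mathbf{u}_{t_1}\mathbf{u}_{t_1}}^{t_1}$ carried over from task 1.

```python
import numpy as np

def gauss_kl(m0, S0, m1, S1, jitter=1e-6):
    """KL( N(m0, S0) || N(m1, S1) )."""
    M = S1.shape[0]
    L1 = np.linalg.cholesky(S1 + jitter * np.eye(M))
    d = np.linalg.solve(L1, m0 - m1)
    iS1_S0 = np.linalg.solve(L1.T, np.linalg.solve(L1, S0))
    return 0.5 * (np.trace(iS1_S0) + d @ d - M
                  + 2.0 * np.sum(np.log(np.diag(L1))) - np.linalg.slogdet(S0)[1])

def online_correction(K_11, K_12, K_22, m_2, S_2, m_old, S_old, K_old, jitter=1e-6):
    """Build q~_{t2}(u_{t1}) by marginalising p(u_{t1} | u_{t2}) over q_{t2}(u_{t2}),
    then return KL(q~ || p(u_{t1})) - KL(q~ || q_{t1}(u_{t1})) as in step 7."""
    M = K_22.shape[0]
    L = np.linalg.cholesky(K_22 + jitter * np.eye(M))
    A = np.linalg.solve(L.T, np.linalg.solve(L, K_12.T)).T            # K_12 K_22^{-1}
    m_tilde = A @ m_2
    S_tilde = K_11 - A @ (K_22 - S_2) @ A.T
    kl_vs_prior = gauss_kl(m_tilde, S_tilde, np.zeros_like(m_tilde), K_old, jitter)
    kl_vs_old_posterior = gauss_kl(m_tilde, S_tilde, m_old, S_old, jitter)
    return kl_vs_prior - kl_vs_old_posterior
```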

Multidimensional extension

For multidimensional input data, we impose a pseudo-temporal order on the instances within each task and view them as a trajectory $\mathbf{x}(s)$ in pseudo-time. For a reference input $\mathbf{x}^{(1)}_n$, the HiPPO cross-covariance can then be interpreted as a path integral $\int_0^{N_1\Delta t} k(\mathbf{x}^{(1)}_n, \mathbf{x}(s))\,\phi_m^{(t)}(s)\,\mathrm{d}s$.

Approximating this path integral with a forward Euler scheme and timestamps $t_i = i\Delta t$ yields the recurrence

$$[\mathbf{K}_{\mathbf{fu}}^{((i+1)\Delta t)}]_{n,:} =[\mathbf{I}+\Delta t\,\mathbf{A}(i\Delta t)]\,[\mathbf{K}_{\mathbf{fu}}^{(i\Delta t)}]_{n,:} + \Delta t\,\mathbf{B}(i \Delta t)\,k\!\left(\mathbf{x}^{(1)}_n, \mathbf{x}^{(1)}_i \right).$$

When there is no inherent order, we construct the pseudo-time path using either an “oracle” sort that mirrors the data-generation process or a kernel-distance tour that traces a smooth trajectory through the inputs. Both keep the shared HiPPO memory stable across correlated channels.
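One simple way to realize such a kernel-distance tour (a heuristic sketch under our own construction; the paper's exact procedure may differ) is a greedy nearest-neighbour walk under the kernel-induced distance $d(\mathbf{x},\mathbf{x}')^2 = k(\mathbf{x},\mathbf{x}) + k(\mathbf{x}',\mathbf{x}') - 2k(\mathbf{x},\mathbf{x}')$:

```python
import numpy as np

def kernel_distance_tour(X, kernel, start=0):
    """Greedy nearest-neighbour ordering of the rows of X under the
    kernel-induced distance; returns the index order defining the pseudo-time path."""
    K = kernel(X, X)                                   # (n, n) Gram matrix
    diag = np.diag(K)
    D2 = diag[:, None] + diag[None, :] - 2.0 * K       # squared kernel distances
    n = X.shape[0]
    order, visited = [start], np.zeros(n, dtype=bool)
    visited[start] = True
    for _ in range(n - 1):
        d = np.where(visited, np.inf, D2[order[-1]])   # mask already-visited points
        nxt = int(np.argmin(d))
        order.append(nxt)
        visited[nxt] = True
    return np.array(order)

# Example with an RBF kernel on 2-D inputs:
rbf = lambda A, B: np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
X = np.random.default_rng(0).normal(size=(100, 2))
path = kernel_distance_tour(X, rbf)
```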

High-dimensional fusion with GPVAE

We also embed the HiPPO recurrence inside a GPVAE so that every latent GP component maintains its own memory trace. A shared encoder produces the inducing posterior parameters, eliminating the need to hand-tune inducing locations during continual training.

The hybrid retains the SVGPVAE predictive form of Fortuin et al. (2020) yet regularizes the inducing state through the HiPPO recurrence, controlling catastrophic forgetting even when the observation space is high-dimensional (e.g. ERA5 climate grids).

Experiments

We evaluate OHSVGP across streaming regression, continual learning, and GPVAE settings. Detailed configurations and quantitative numbers match the paper; below we summarize the key observations.

Streaming time-series regression Solar, audio, COVID-19

Across Solar irradiance, audio, and COVID-19 death-count streams, OHSGPR keeps Negative Log Predictive Density (NLPD) flat as tasks accumulate, while OSVGP drifts once inducing points migrate to the latest data.

Setup. Solar irradiance records (Lean, 2004), TIMIT-derived baseband audio (Bui & Turner, 2014), and Santa Catarina COVID-19 death counts (Hawryluk et al., 2021) are each split into sequential tasks (10 for Solar & Audio, 5 for COVID). All runs process one task at a time, keep kernel hyperparameters fixed after the first task, and evaluate NLPD on every previously seen task.

Baselines. We benchmark against the original Online SVGP (OSVGP) (Bui et al., 2017), Online Variational Conditioning (OVC) with pivoted-Cholesky inducing points (Maddox et al., 2021), and the variational Fourier feature baseline OVFF (Hensman et al., 2018). For Gaussian likelihood cases we also report the closed-form OHSGPR/OSGPR pair to isolate the effect of HiPPO recurrences.

Streaming regression (Solar irradiance & audio)

The plots below show test NLPD accumulated over the first 10 tasks as each additional task is learned.

(a) Solar Irradiance, M = 50. (b) Solar Irradiance, M = 150. (c) Audio (baseband), M = 100. (d) Audio (baseband), M = 200.

Findings. OHSVGP/OHSGPR maintain calibrated NLPD across all tasks while OSVGP rapidly drifts as inducing points migrate to the newest mini-batch. OVC narrows the gap on recent tasks yet still forgets the earliest segments, and OVFF’s global Fourier features lag because they require predefining the entire time interval.

To visualize the source of forgetting we freeze the sensors in an early Solar interval and inspect each model’s predictive mean and variance after it finishes Task 10.

Predictive mean after 10 Solar tasks

Predictive mean ±2σ after Task 10: (a) OSGPR baseline. (b) OVC (pivoted-Cholesky only). (c) OVFF interdomain baseline. (d) OHSGPR (ours).

Interpretation. Standard SVGP variants revert to the GP prior on the oldest regions because the inducing set is pulled downstream. The HiPPO recurrence instead stores projections of the entire history so the predictive mean and variance remain consistent even 10 tasks after first observing a window.

The COVID benchmark swaps the Gaussian likelihood for a Negative Binomial model, stressing whether the same conclusions hold under over-dispersed counts.

COVID-19 death counts (Negative Binomial)

COVID NLPD, M = 15 inducing points.
COVID NLPD, M = 30 inducing points.

Counts discussion. The HiPPO memory still dominates once the likelihood departs from Gaussian: OHSVGP attains lower NLPD while OSVGP and OVC oscillate as they re-tune inducing locations for each spike in the death series.

The table summarizes how recurrence-based updates translate into faster wall-clock times (numbers inherit the PyTorch implementations used in Bui et al. (2017) and Maddox et al. (2021)).

Wall-clock runtime to finish all tasks on a single RTX 3090 GPU (values in seconds).

Method            Solar (M=50)  Solar (M=150)  Audio (M=100)  Audio (M=200)  COVID (M=15)  COVID (M=30)
OSGPR / OSVGP     140           149            144            199            525           530
OVC               0.450         0.620          0.558          0.863          345           360
OVFF              0.327         0.354          0.295          0.356          —             —
OHSGPR / OHSVGP   0.297         0.394          0.392          0.655          370           380

Runtime. OHSGPR requires no gradient updates and finishes all Solar and Audio tasks within seconds, while OSGPR/OSVGP pays for per-task gradient-based optimization. Even in the Negative Binomial case, OHSVGP and OVC remain faster than OSVGP because they avoid re-optimizing inducing point locations for each task.

COVID-19 full results

Test-set NLPD per task after sequentially learning all 5 tasks on the COVID dataset.

Full NLPD, M = 15.
Full NLPD, M = 30.
Continual learning on tabular data Skillcraft & Powerplant

Setup. Following Stanton & Maddox (2021), Skillcraft and Powerplant are sorted either by their first input dimension or by L2 distance to the origin before partitioning into 10 tasks. Within each task the HiPPO recurrences need a point ordering, so we test both an oracle order that matches the way the tasks were generated (OHSVGP-o) and a data-driven kernel tour that hops through nearby points without oracle knowledge (OHSVGP-k).

Baselines. The comparison includes OSVGP (Bui et al., 2017) and OVC with pivoted-Cholesky inducing points (Maddox et al., 2021). All models use 256 inducing variables and 2k Adam steps per task.

Continual learning (Skillcraft & Powerplant)

NLPD is recorded after completing Task 1, 2, 4, and 8 as well as after all 10 tasks.

Skillcraft, sorted by first dimension.
Skillcraft, sorted by L2 distance.
Powerplant, sorted by first dimension.
Powerplant, sorted by L2 distance.

Observations. Even without the oracle ordering, OHSVGP-k keeps forgetting low, highlighting that the HiPPO summary tolerates imperfect task boundaries. When the oracle order is available, OHSVGP-o dominates all four scenarios, whereas OVC and especially OSVGP show sharp increases in NLPD once later tasks explore input regions that differ from Task 1.

Full results on tabular data

Test-set NLPD per task after sequentially learning all 10 tasks on the UCI datasets.

Skillcraft (1st dim).
Skillcraft (L2).
Powerplant (1st dim).
Powerplant (L2).
High-dimensional GPVAE ERA5 climate

Setup. Hourly ERA5 climate fields (Hersbach et al., 2023) provide 10 continual tasks, each spanning 186 steps and 17 meteorological channels over scattered UK locations. All GPVAE variants share the same encoder, decoder, latent dimensionality, and training schedule; we report results for $M\in\{50,100\}$ inducing variables.

Baselines. Online SVGPVAE (OSVGP) augments the encoder/decoder with Elastic Weight Consolidation (EWC) but keeps a single inducing bank, while OVC-SVGPVAE (OVC) and OVC-per-dim continually update inducing locations with pivoted-Cholesky (Maddox et al., 2021). OHSVGP replaces those inducing points with HiPPO projections, yielding one recurrent memory per latent GP.

ERA5 streaming SVGPVAE

Each scatter shows the NLPD evaluated right after finishing Task $i$ (horizontal axis) against the NLPD on the same task after training the final Task 10 (vertical axis); points far above the diagonal indicate catastrophic forgetting.

ERA5 NLPD comparison, M = 50.
ERA5 NLPD comparison, M = 100.

Discussion. OHSVGP traces stay near the diagonal for both values of $M$, indicating the latent memories retain the early tasks. OSVGP and both OVC variants slide far above the diagonal because their shared inducing locations cannot cover 10 sequential climates; the larger $M$ mitigates but does not remove the regression gap, whereas HiPPO updates scale without increasing $M$.

BibTeX

@inproceedings{chen2025hipposvgp,
  title     = {Recurrent Memory for Online Interdomain Gaussian Processes},
  author    = {Wenlong Chen and Naoki Kiyohara and Harrison Bo Hua Zhu and Jacob Curran-Sebastian and Samir Bhatt and Yingzhen Li},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025}
}