
Training Theory

In reservoir computing, the recurrent weight matrix $W$ and the input weight matrix $W_{\text{in}}$ are fixed at initialization. The only learned component is the readout weight matrix $W^{\text{out}}$, which maps the high-dimensional reservoir state to the desired output. Because the readout is linear, training reduces to solving a system of linear equations: a convex optimization problem with a unique global minimum and no gradient pathology.

This page derives the two training algorithms implemented in SPIRES: ridge regression (batch, offline) and the online delta rule (streaming, incremental).

Problem Setup

Suppose we drive the reservoir with a training input sequence $\{u(t)\}_{t=1}^{T}$ and collect the corresponding reservoir states $\{\mathbf{x}(t)\}_{t=1}^{T}$, discarding an initial transient of $T_0$ steps to allow the reservoir to wash out its initial conditions.

We arrange the states and targets into matrices:

$$
\Phi = \begin{bmatrix} \mathbf{x}(T_0+1)^\top \\ \mathbf{x}(T_0+2)^\top \\ \vdots \\ \mathbf{x}(T)^\top \end{bmatrix} \in \mathbb{R}^{(T-T_0) \times N},
\qquad
Y = \begin{bmatrix} \mathbf{y}_{\text{target}}(T_0+1)^\top \\ \mathbf{y}_{\text{target}}(T_0+2)^\top \\ \vdots \\ \mathbf{y}_{\text{target}}(T)^\top \end{bmatrix} \in \mathbb{R}^{(T-T_0) \times N_{\text{out}}}
$$

where $\Phi$ is the state matrix (or design matrix) and $Y$ is the target matrix. Each row of $\Phi$ is a snapshot of the reservoir state at one time step.
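Assembling $\Phi$ and $Y$ amounts to stacking the collected rows and dropping the washout. A minimal NumPy sketch (the array names and random stand-in data are illustrative, not part of the SPIRES API):

```python
import numpy as np

rng = np.random.default_rng(0)

T, T0, N, N_out = 500, 100, 50, 1    # total steps, washout, reservoir size, outputs

# Stand-ins for the collected reservoir states and targets; row t-1 holds x(t)^T.
states = rng.standard_normal((T, N))
targets = rng.standard_normal((T, N_out))

# Discard the washout transient: keep rows T0+1 .. T (0-based slice [T0:]).
Phi = states[T0:]      # shape (T - T0, N)
Y = targets[T0:]       # shape (T - T0, N_out)

assert Phi.shape == (T - T0, N)
assert Y.shape == (T - T0, N_out)
```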

Ridge Regression (Batch Training)

Ordinary Least Squares

The simplest approach is to find the $W^{\text{out}}$ that minimizes the sum of squared errors:

$$
W^{\text{out}} = \operatorname*{argmin}_{W} \; \|\Phi W - Y\|_F^2 \tag{1}
$$

where $\|\cdot\|_F$ is the Frobenius norm. Taking the gradient and setting it to zero:

$$
\frac{\partial}{\partial W}\|\Phi W - Y\|_F^2 = 2\Phi^\top(\Phi W - Y) = 0
$$

Solving:

$$
W^{\text{out}}_{\text{OLS}} = (\Phi^\top \Phi)^{-1} \Phi^\top Y \tag{2}
$$

This is the ordinary least squares (OLS) solution. It is the unique minimum of a convex quadratic, so there are no local minima or saddle points.
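Equation 2 can be checked numerically against a library least-squares solver. A quick sketch with synthetic data (nothing here is SPIRES-specific):

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.standard_normal((300, 40))   # state matrix, well-conditioned here
Y = rng.standard_normal((300, 2))      # targets

# OLS via the normal equations (Equation 2).
W_ols = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)

# Agrees with NumPy's least-squares solver on well-conditioned data.
W_lstsq, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
assert np.allclose(W_ols, W_lstsq)
```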

The Overfitting Problem

In reservoir computing, the state dimensionality $N$ is typically large (hundreds to thousands of neurons), while the number of training samples $T - T_0$ may be comparable in magnitude. When $N$ is large relative to the number of samples, OLS exhibits severe overfitting: the readout learns noise in the training data and generalizes poorly.

Mathematically, the problem is that $\Phi^\top \Phi$ becomes ill-conditioned (has near-zero eigenvalues), causing the inverse $(\Phi^\top \Phi)^{-1}$ to amplify noise. The resulting weight vector has large magnitude, indicating that the readout is fitting to tiny fluctuations in the reservoir state.

Tikhonov Regularization (Ridge Regression)

The solution is to add a penalty on the magnitude of the weights. Ridge regression adds an $L_2$ regularization term:

$$
W^{\text{out}} = \operatorname*{argmin}_{W} \; \|\Phi W - Y\|_F^2 + \lambda \|W\|_F^2 \tag{3}
$$

where $\lambda > 0$ is the regularization parameter. The penalty $\lambda \|W\|_F^2$ discourages large weights, preferring solutions that are smooth combinations of reservoir states.

Taking the gradient and setting it to zero:

$$
2\Phi^\top(\Phi W - Y) + 2\lambda W = 0
$$

Solving:

$$
W^{\text{out}}_{\text{ridge}} = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top Y \tag{4}
$$

This is the ridge regression solution, also known as Tikhonov regularization in the numerical analysis literature. The term $\lambda I$ shifts all eigenvalues of $\Phi^\top \Phi$ away from zero, ensuring the matrix is well-conditioned and the inverse is stable.
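In code, Equation 4 is a single symmetric linear solve. A hedged NumPy sketch (the function name and synthetic data are illustrative; SPIRES itself performs this solve through LAPACKE):

```python
import numpy as np

def train_ridge(Phi, Y, lam):
    """Ridge readout (Equation 4): solve (Phi^T Phi + lam*I) W = Phi^T Y."""
    N = Phi.shape[1]
    A = Phi.T @ Phi + lam * np.eye(N)   # symmetric positive definite for lam > 0
    return np.linalg.solve(A, Phi.T @ Y)

# Recover a known readout from lightly noisy targets.
rng = np.random.default_rng(1)
Phi = rng.standard_normal((400, 50))
W_true = rng.standard_normal((50, 1))
Y = Phi @ W_true + 0.01 * rng.standard_normal((400, 1))

W = train_ridge(Phi, Y, lam=1e-6)
assert np.allclose(W, W_true, atol=1e-2)
```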

Interpretation of $\lambda$

The regularization parameter $\lambda$ controls the bias-variance trade-off:

  • $\lambda \to 0$: Recovers OLS. Low bias, high variance. The readout fits the training data closely but may not generalize.
  • $\lambda \to \infty$: The readout approaches $W^{\text{out}} \to 0$. High bias (always predicting near zero), zero variance.
  • Optimal $\lambda$: Balances training error and weight magnitude. Typically found by cross-validation or the AGILE optimizer.

The effect on the eigenspectrum of $\Phi^\top \Phi$ is transparent. If $\Phi^\top \Phi$ has eigenvalues $\sigma_1^2 \geq \sigma_2^2 \geq \cdots \geq \sigma_N^2$, then the ridge solution shrinks each component by a factor of:

$$
\frac{\sigma_i^2}{\sigma_i^2 + \lambda}
$$

Large eigenvalues (strong signal directions) are barely affected. Small eigenvalues (noise directions) are suppressed toward zero. This is precisely the behavior needed in reservoir computing, where a few principal components of the reservoir state carry signal and the rest carry noise.
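The shrinkage factors are easy to compute directly. An illustrative example with three hand-picked eigenvalues and $\lambda = 1$:

```python
import numpy as np

lam = 1.0
eig = np.array([100.0, 10.0, 0.01])   # illustrative eigenvalues of Phi^T Phi
shrink = eig / (eig + lam)

# Signal directions pass almost unchanged; the near-zero noise direction
# is suppressed: shrink ~ [0.990, 0.909, 0.010].
assert shrink[0] > 0.98 and shrink[2] < 0.02
```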

Computational Cost

Computing $W^{\text{out}}_{\text{ridge}}$ requires:

  1. Forming $\Phi^\top \Phi \in \mathbb{R}^{N \times N}$: $O(TN^2)$ operations.
  2. Forming $\Phi^\top Y \in \mathbb{R}^{N \times N_{\text{out}}}$: $O(TN N_{\text{out}})$ operations.
  3. Solving $(\Phi^\top \Phi + \lambda I)\, W = \Phi^\top Y$: $O(N^3)$ operations (or $O(N^2)$ per output once a Cholesky factorization is available).

The dominant cost is $O(TN^2 + N^3)$. For typical reservoir sizes ($N \sim 100$–$1000$), this is fast: seconds on modern hardware. SPIRES uses LAPACKE for the linear solve, leveraging optimized BLAS routines.

SVD Alternative

An equivalent formulation uses the singular value decomposition (SVD) of $\Phi = U \Sigma V^\top$:

$$
W^{\text{out}}_{\text{ridge}} = V \, \operatorname{diag}\!\left(\frac{\sigma_i}{\sigma_i^2 + \lambda}\right) U^\top Y
$$

This form is more numerically stable when $\Phi^\top \Phi$ is extremely ill-conditioned, though the Cholesky approach is faster when applicable. SPIRES uses the normal equations (Equation 4) by default.
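The two forms can be verified to agree numerically. A quick NumPy check on synthetic data (`full_matrices=False` gives the thin SVD):

```python
import numpy as np

rng = np.random.default_rng(2)
Phi = rng.standard_normal((200, 30))
Y = rng.standard_normal((200, 2))
lam = 0.1

# Normal-equations form (Equation 4).
W_ne = np.linalg.solve(Phi.T @ Phi + lam * np.eye(30), Phi.T @ Y)

# SVD form: W = V diag(s_i / (s_i^2 + lam)) U^T Y.
U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
W_svd = Vt.T @ ((s / (s**2 + lam))[:, None] * (U.T @ Y))

assert np.allclose(W_ne, W_svd)
```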

Online Delta Rule (Streaming Training)

Motivation

Ridge regression requires collecting all training data before computing $W^{\text{out}}$. For applications where data arrives continuously or where the target distribution changes over time, an online (incremental) training algorithm is needed.

The Delta Rule

The online delta rule updates $W^{\text{out}}$ after each time step based on the prediction error:

$$
\Delta W^{\text{out}} = \eta \, \bigl(\mathbf{y}_{\text{target}}(t) - \mathbf{y}(t)\bigr) \, \mathbf{x}(t)^\top \tag{5}
$$

where:

  • $\eta > 0$ is the learning rate
  • $\mathbf{y}(t) = W^{\text{out}} \mathbf{x}(t)$ is the current prediction
  • $\mathbf{y}_{\text{target}}(t)$ is the target at time $t$
  • $\mathbf{x}(t)$ is the current reservoir state

This is a stochastic gradient descent step on the squared error loss $\|\mathbf{y}_{\text{target}} - W^{\text{out}}\mathbf{x}\|^2$. The update is:

$$
W^{\text{out}}(t+1) = W^{\text{out}}(t) + \eta \, \bigl(\mathbf{y}_{\text{target}}(t) - W^{\text{out}}(t)\,\mathbf{x}(t)\bigr) \, \mathbf{x}(t)^\top \tag{6}
$$
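Equation 6 is one outer-product update per sample. A minimal sketch that drives the rule with synthetic states until it recovers a known readout (the data, shapes, and function name are illustrative, not SPIRES API):

```python
import numpy as np

def delta_step(W_out, x, y_target, eta):
    """One delta-rule update (Equation 6): W <- W + eta * error * x^T."""
    error = y_target - W_out @ x          # prediction error, shape (N_out,)
    return W_out + eta * np.outer(error, x)

rng = np.random.default_rng(3)
N, N_out = 20, 1
W_true = rng.standard_normal((N_out, N))   # target readout to recover
W_out = np.zeros((N_out, N))

for _ in range(5000):
    x = rng.standard_normal(N)             # stand-in reservoir state
    # eta = 1e-2 is well below the stability bound 2 / ||x||^2 ~ 0.1 here.
    W_out = delta_step(W_out, x, W_true @ x, eta=1e-2)

assert np.allclose(W_out, W_true, atol=1e-2)
```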

Properties

Convergence. For a fixed reservoir driven by a stationary input, the delta rule converges to the least-squares solution as $t \to \infty$, provided $\eta$ satisfies:

$$
0 < \eta < \frac{2}{\|\mathbf{x}(t)\|^2_{\max}}
$$

In practice, $\eta$ is set to a small constant (e.g., $10^{-4}$ to $10^{-2}$).

Computational cost per step. Each update requires $O(N \cdot N_{\text{out}})$ operations, a single outer product. This is negligible compared to the $O(N^2)$ cost of the reservoir state update.

Tracking non-stationary targets. Because the delta rule continuously adapts, it can track slow changes in the target distribution. This makes it suitable for online learning and adaptive control applications.

No regularization. The basic delta rule does not include explicit regularization. Weight decay can be added:

$$
W^{\text{out}}(t+1) = (1 - \eta\lambda)\, W^{\text{out}}(t) + \eta \, \bigl(\mathbf{y}_{\text{target}}(t) - W^{\text{out}}(t)\,\mathbf{x}(t)\bigr) \, \mathbf{x}(t)^\top \tag{7}
$$

This is the online equivalent of ridge regression, with $\lambda$ controlling the strength of the weight decay.
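Equation 7 adds a multiplicative decay before the same outer-product update. A sketch of the decayed step (a hypothetical helper, not a SPIRES function):

```python
import numpy as np

def delta_step_decayed(W_out, x, y_target, eta, lam):
    """Delta rule with weight decay (Equation 7)."""
    error = y_target - W_out @ x
    return (1.0 - eta * lam) * W_out + eta * np.outer(error, x)

# With zero input and zero target, the update reduces to pure decay.
W = np.ones((1, 3))
W = delta_step_decayed(W, np.zeros(3), np.zeros(1), eta=0.1, lam=1.0)
assert np.allclose(W, 0.9 * np.ones((1, 3)))
```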

Batch vs. Streaming: Comparison

| Property | Ridge Regression | Online Delta Rule |
| --- | --- | --- |
| Training mode | Batch (offline) | Streaming (online) |
| Data requirement | All data available upfront | One sample at a time |
| Optimality | Global optimum (exact) | Asymptotically optimal |
| Regularization | $\lambda$ (Tikhonov) | $\eta\lambda$ (weight decay) |
| Computational cost | $O(TN^2 + N^3)$ total | $O(N \cdot N_{\text{out}})$ per step |
| Memory requirement | $O(N^2)$ for $\Phi^\top \Phi$ | $O(N \cdot N_{\text{out}})$ for $W^{\text{out}}$ |
| Non-stationary targets | Not supported | Naturally supported |
| Sensitivity to $\eta$ | None | High (requires tuning) |

When to Use Which

Use ridge regression when:

  • The training data is fixed and available in advance.
  • An exact global optimum is desired.
  • The dataset is small enough to store the state matrix $\Phi$.

Use the online delta rule when:

  • Data arrives as a stream and cannot be stored.
  • The target distribution changes over time (non-stationary).
  • Real-time adaptation is required (e.g., control tasks).
  • Memory is limited and $\Phi$ cannot be stored.

In practice, many SPIRES workflows use ridge regression for initial training and the delta rule for subsequent online adaptation.

Connection to Reservoir Computing Theory

The fact that only the readout layer is trained has profound theoretical implications:

Universality. A reservoir with the echo state property and a sufficiently rich state space can approximate any fading-memory filter to arbitrary accuracy. Because the reservoir supplies the nonlinear feature expansion, a purely linear readout suffices: the combination of a random reservoir and a linear readout is a universal approximator for temporal signals with fading memory.

No vanishing/exploding gradients. Because no gradients are propagated through the recurrent dynamics, the training is immune to the vanishing and exploding gradient problems that plague backpropagation-through-time in conventional RNNs.

Convexity. The ridge regression objective (Equation 3) is a strictly convex function of $W^{\text{out}}$. There is a unique global minimum, no local minima, and no saddle points. Training is deterministic and reproducible.

Separation of dynamics and learning. The reservoir dynamics (determined by $W$, $W_{\text{in}}$, the neuron model, and the topology) and the learning (determined by $W^{\text{out}}$ and $\lambda$) are completely decoupled. This allows them to be designed and optimized independently, which is the architectural advantage exploited by the AGILE optimizer.

References

  1. Jaeger, H. (2001). The “echo state” approach to analysing and training recurrent neural networks. GMD Report 148.
  2. Lukoševičius, M., & Jaeger, H. (2009). Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3), 127–149.
  3. Tikhonov, A. N., & Arsenin, V. Y. (1977). Solutions of Ill-Posed Problems. Winston & Sons.
  4. Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Chapter 3. Springer.
  5. Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits. IRE WESCON Convention Record, Part 4, 96–104.

