
Training Theory

In reservoir computing, the recurrent weight matrix $W$ and the input weight matrix $W_{\text{in}}$ are fixed at initialization. The only learned component is the readout weight matrix $W^{\text{out}}$, which maps the high-dimensional reservoir state to the desired output. Because the readout is linear, training reduces to solving a system of linear equations: a convex optimization problem with a unique global minimum and no gradient pathology.

This page derives the two training algorithms implemented in SPIRES: ridge regression (batch, offline) and the online delta rule (streaming, incremental).

Problem Setup

Suppose we drive the reservoir with a training input sequence $\{u(t)\}_{t=1}^{T}$ and collect the corresponding reservoir states $\{\mathbf{x}(t)\}_{t=1}^{T}$, discarding an initial transient of $T_0$ steps to allow the reservoir to wash out its initial conditions.

We arrange the states and targets into matrices:

$$
\Phi = \begin{bmatrix} \mathbf{x}(T_0+1)^\top \\ \mathbf{x}(T_0+2)^\top \\ \vdots \\ \mathbf{x}(T)^\top \end{bmatrix} \in \mathbb{R}^{(T-T_0) \times N},
\qquad
Y = \begin{bmatrix} \mathbf{y}_{\text{target}}(T_0+1)^\top \\ \mathbf{y}_{\text{target}}(T_0+2)^\top \\ \vdots \\ \mathbf{y}_{\text{target}}(T)^\top \end{bmatrix} \in \mathbb{R}^{(T-T_0) \times N_{\text{out}}}
$$

where $\Phi$ is the state matrix (or design matrix) and $Y$ is the target matrix. Each row of $\Phi$ is a snapshot of the reservoir state at one time step.
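Assembling $\Phi$ and $Y$ amounts to stacking the collected rows and dropping the washout. A minimal NumPy sketch (the array names and random stand-in data are illustrative, not part of the SPIRES API):

```python
import numpy as np

rng = np.random.default_rng(0)

T, T0, N, N_out = 500, 100, 50, 1    # total steps, washout, reservoir size, outputs

# Stand-ins for the collected reservoir states and targets; row t-1 holds x(t)^T.
states = rng.standard_normal((T, N))
targets = rng.standard_normal((T, N_out))

# Discard the washout transient: keep rows T0+1 .. T (0-based slice [T0:]).
Phi = states[T0:]      # shape (T - T0, N)
Y = targets[T0:]       # shape (T - T0, N_out)

assert Phi.shape == (T - T0, N)
assert Y.shape == (T - T0, N_out)
```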

Ridge Regression (Batch Training)

Ordinary Least Squares

The simplest approach is to find the $W^{\text{out}}$ that minimizes the sum of squared errors:

$$
W^{\text{out}} = \operatorname*{argmin}_{W} \; \|\Phi W - Y\|_F^2 \tag{1}
$$

where $\|\cdot\|_F$ is the Frobenius norm. Taking the gradient and setting it to zero:

$$
\frac{\partial}{\partial W}\|\Phi W - Y\|_F^2 = 2\Phi^\top(\Phi W - Y) = 0
$$

Solving:

$$
W^{\text{out}}_{\text{OLS}} = (\Phi^\top \Phi)^{-1} \Phi^\top Y \tag{2}
$$

This is the ordinary least squares (OLS) solution. It is the unique minimum of a convex quadratic, so there are no local minima or saddle points.
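Equation 2 can be checked numerically against a library least-squares solver. A quick sketch with synthetic data (nothing here is SPIRES-specific):

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.standard_normal((300, 40))   # state matrix, well-conditioned here
Y = rng.standard_normal((300, 2))      # targets

# OLS via the normal equations (Equation 2).
W_ols = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)

# Agrees with NumPy's least-squares solver on well-conditioned data.
W_lstsq, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
assert np.allclose(W_ols, W_lstsq)
```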

The Overfitting Problem

In reservoir computing, the state dimensionality $N$ is typically large (hundreds to thousands of neurons), while the number of training samples $T - T_0$ may be comparable in magnitude. When $N$ is large relative to the number of samples, OLS exhibits severe overfitting: the readout learns noise in the training data and generalizes poorly.

Mathematically, the problem is that $\Phi^\top \Phi$ becomes ill-conditioned (has near-zero eigenvalues), causing the inverse $(\Phi^\top \Phi)^{-1}$ to amplify noise. The resulting weight vector has large magnitude, indicating that the readout is fitting to tiny fluctuations in the reservoir state.

Tikhonov Regularization (Ridge Regression)

The solution is to add a penalty on the magnitude of the weights. Ridge regression adds an $L_2$ regularization term:

$$
W^{\text{out}} = \operatorname*{argmin}_{W} \; \|\Phi W - Y\|_F^2 + \lambda \|W\|_F^2 \tag{3}
$$

where $\lambda > 0$ is the regularization parameter. The penalty $\lambda \|W\|_F^2$ discourages large weights, preferring solutions that are smooth combinations of reservoir states.

Taking the gradient and setting it to zero:

$$
2\Phi^\top(\Phi W - Y) + 2\lambda W = 0
$$

Solving:

$$
W^{\text{out}}_{\text{ridge}} = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top Y \tag{4}
$$

This is the ridge regression solution, also known as Tikhonov regularization in the numerical analysis literature. The term $\lambda I$ shifts all eigenvalues of $\Phi^\top \Phi$ away from zero, ensuring the matrix is well-conditioned and the inverse is stable.
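In code, Equation 4 is a single symmetric linear solve. A hedged NumPy sketch (the function name and synthetic data are illustrative; SPIRES itself performs this solve through LAPACKE):

```python
import numpy as np

def train_ridge(Phi, Y, lam):
    """Ridge readout (Equation 4): solve (Phi^T Phi + lam*I) W = Phi^T Y."""
    N = Phi.shape[1]
    A = Phi.T @ Phi + lam * np.eye(N)   # symmetric positive definite for lam > 0
    return np.linalg.solve(A, Phi.T @ Y)

# Recover a known readout from lightly noisy targets.
rng = np.random.default_rng(1)
Phi = rng.standard_normal((400, 50))
W_true = rng.standard_normal((50, 1))
Y = Phi @ W_true + 0.01 * rng.standard_normal((400, 1))

W = train_ridge(Phi, Y, lam=1e-6)
assert np.allclose(W, W_true, atol=1e-2)
```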

Interpretation of $\lambda$

The regularization parameter $\lambda$ controls the bias-variance trade-off:

  • $\lambda \to 0$: Recovers OLS. Low bias, high variance. The readout fits the training data closely but may not generalize.
  • $\lambda \to \infty$: The readout approaches $W^{\text{out}} \to 0$. High bias (always predicting near zero), zero variance.
  • Optimal $\lambda$: Balances training error and weight magnitude. Typically found by cross-validation or the AGILE optimizer.

The effect on the eigenspectrum of $\Phi^\top \Phi$ is transparent. If $\Phi^\top \Phi$ has eigenvalues $\sigma_1^2 \geq \sigma_2^2 \geq \cdots \geq \sigma_N^2$, then the ridge solution shrinks each component by a factor of:

$$
\frac{\sigma_i^2}{\sigma_i^2 + \lambda}
$$

Large eigenvalues (strong signal directions) are barely affected. Small eigenvalues (noise directions) are suppressed toward zero. This is precisely the behavior needed in reservoir computing, where a few principal components of the reservoir state carry signal and the rest carry noise.
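The shrinkage factors are easy to compute directly. An illustrative example with three hand-picked eigenvalues and $\lambda = 1$:

```python
import numpy as np

lam = 1.0
eig = np.array([100.0, 10.0, 0.01])   # illustrative eigenvalues of Phi^T Phi
shrink = eig / (eig + lam)

# Signal directions pass almost unchanged; the near-zero noise direction
# is suppressed: shrink ~ [0.990, 0.909, 0.010].
assert shrink[0] > 0.98 and shrink[2] < 0.02
```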

Computational Cost

Computing $W^{\text{out}}_{\text{ridge}}$ requires:

  1. Forming $\Phi^\top \Phi \in \mathbb{R}^{N \times N}$: $O(TN^2)$ operations.
  2. Forming $\Phi^\top Y \in \mathbb{R}^{N \times N_{\text{out}}}$: $O(TN N_{\text{out}})$ operations.
  3. Solving $(\Phi^\top \Phi + \lambda I)\, W = \Phi^\top Y$: $O(N^3)$ operations (or $O(N^2)$ per output once a Cholesky factorization is available).

The dominant cost is $O(TN^2 + N^3)$. For typical reservoir sizes ($N \sim 100$–$1000$), this is fast: seconds on modern hardware. SPIRES uses LAPACKE for the linear solve, leveraging optimized BLAS routines.

SVD Alternative

An equivalent formulation uses the singular value decomposition (SVD) of $\Phi = U \Sigma V^\top$:

$$
W^{\text{out}}_{\text{ridge}} = V \, \operatorname{diag}\!\left(\frac{\sigma_i}{\sigma_i^2 + \lambda}\right) U^\top Y
$$

This form is more numerically stable when $\Phi^\top \Phi$ is extremely ill-conditioned, though the Cholesky approach is faster when applicable. SPIRES uses the normal equations (Equation 4) by default.
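The two forms can be verified to agree numerically. A quick NumPy check on synthetic data (`full_matrices=False` gives the thin SVD):

```python
import numpy as np

rng = np.random.default_rng(2)
Phi = rng.standard_normal((200, 30))
Y = rng.standard_normal((200, 2))
lam = 0.1

# Normal-equations form (Equation 4).
W_ne = np.linalg.solve(Phi.T @ Phi + lam * np.eye(30), Phi.T @ Y)

# SVD form: W = V diag(s_i / (s_i^2 + lam)) U^T Y.
U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
W_svd = Vt.T @ ((s / (s**2 + lam))[:, None] * (U.T @ Y))

assert np.allclose(W_ne, W_svd)
```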

Online Delta Rule (Streaming Training)

Motivation

Ridge regression requires collecting all training data before computing $W^{\text{out}}$. For applications where data arrives continuously or where the target distribution changes over time, an online (incremental) training algorithm is needed.

The Delta Rule

The online delta rule updates $W^{\text{out}}$ after each time step based on the prediction error:

$$
\Delta W^{\text{out}} = \eta \, \bigl(\mathbf{y}_{\text{target}}(t) - \mathbf{y}(t)\bigr) \, \mathbf{x}(t)^\top \tag{5}
$$

where:

  • $\eta > 0$ is the learning rate
  • $\mathbf{y}(t) = W^{\text{out}} \mathbf{x}(t)$ is the current prediction
  • $\mathbf{y}_{\text{target}}(t)$ is the target at time $t$
  • $\mathbf{x}(t)$ is the current reservoir state

This is a stochastic gradient descent step on the squared error loss $\|\mathbf{y}_{\text{target}} - W^{\text{out}}\mathbf{x}\|^2$. The update is:

$$
W^{\text{out}}(t+1) = W^{\text{out}}(t) + \eta \, \bigl(\mathbf{y}_{\text{target}}(t) - W^{\text{out}}(t)\,\mathbf{x}(t)\bigr) \, \mathbf{x}(t)^\top \tag{6}
$$
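Equation 6 is one outer-product update per sample. A minimal sketch that drives the rule with synthetic states until it recovers a known readout (the data, shapes, and function name are illustrative, not SPIRES API):

```python
import numpy as np

def delta_step(W_out, x, y_target, eta):
    """One delta-rule update (Equation 6): W <- W + eta * error * x^T."""
    error = y_target - W_out @ x          # prediction error, shape (N_out,)
    return W_out + eta * np.outer(error, x)

rng = np.random.default_rng(3)
N, N_out = 20, 1
W_true = rng.standard_normal((N_out, N))   # target readout to recover
W_out = np.zeros((N_out, N))

for _ in range(5000):
    x = rng.standard_normal(N)             # stand-in reservoir state
    # eta = 1e-2 is well below the stability bound 2 / ||x||^2 ~ 0.1 here.
    W_out = delta_step(W_out, x, W_true @ x, eta=1e-2)

assert np.allclose(W_out, W_true, atol=1e-2)
```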

Properties

Convergence. For a fixed reservoir driven by a stationary input, the delta rule converges to the least-squares solution as $t \to \infty$, provided $\eta$ satisfies:

$$
0 < \eta < \frac{2}{\|\mathbf{x}(t)\|^2_{\max}}
$$

In practice, $\eta$ is set to a small constant (e.g., $10^{-4}$ to $10^{-2}$).

Computational cost per step. Each update requires $O(N \cdot N_{\text{out}})$ operations, a single outer product. This is negligible compared to the $O(N^2)$ cost of the reservoir state update.

Tracking non-stationary targets. Because the delta rule continuously adapts, it can track slow changes in the target distribution. This makes it suitable for online learning and adaptive control applications.

No regularization. The basic delta rule does not include explicit regularization. Weight decay can be added:

$$
W^{\text{out}}(t+1) = (1 - \eta\lambda)\, W^{\text{out}}(t) + \eta \, \bigl(\mathbf{y}_{\text{target}}(t) - W^{\text{out}}(t)\,\mathbf{x}(t)\bigr) \, \mathbf{x}(t)^\top \tag{7}
$$

This is the online equivalent of ridge regression, with $\lambda$ controlling the strength of the weight decay.
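Equation 7 adds a multiplicative decay before the same outer-product update. A sketch of the decayed step (a hypothetical helper, not a SPIRES function):

```python
import numpy as np

def delta_step_decayed(W_out, x, y_target, eta, lam):
    """Delta rule with weight decay (Equation 7)."""
    error = y_target - W_out @ x
    return (1.0 - eta * lam) * W_out + eta * np.outer(error, x)

# With zero input and zero target, the update reduces to pure decay.
W = np.ones((1, 3))
W = delta_step_decayed(W, np.zeros(3), np.zeros(1), eta=0.1, lam=1.0)
assert np.allclose(W, 0.9 * np.ones((1, 3)))
```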

Batch vs. Streaming: Comparison

| Property | Ridge Regression | Online Delta Rule |
| --- | --- | --- |
| Training mode | Batch (offline) | Streaming (online) |
| Data requirement | All data available upfront | One sample at a time |
| Optimality | Global optimum (exact) | Asymptotically optimal |
| Regularization | $\lambda$ (Tikhonov) | $\eta\lambda$ (weight decay) |
| Computational cost | $O(TN^2 + N^3)$ total | $O(N \cdot N_{\text{out}})$ per step |
| Memory requirement | $O(N^2)$ for $\Phi^\top \Phi$ | $O(N \cdot N_{\text{out}})$ for $W^{\text{out}}$ |
| Non-stationary targets | Not supported | Naturally supported |
| Sensitivity to $\eta$ | None | High (requires tuning) |

When to Use Which

Use ridge regression when:

  • The training data is fixed and available in advance.
  • An exact global optimum is desired.
  • The dataset is small enough to store the state matrix $\Phi$.

Use the online delta rule when:

  • Data arrives as a stream and cannot be stored.
  • The target distribution changes over time (non-stationary).
  • Real-time adaptation is required (e.g., control tasks).
  • Memory is limited and $\Phi$ cannot be stored.

In practice, many SPIRES workflows use ridge regression for initial training and the delta rule for subsequent online adaptation.

Connection to Reservoir Computing Theory

The fact that only the readout layer is trained has profound theoretical implications:

Universality. A reservoir with the echo state property and a sufficiently rich state space can approximate any fading-memory filter to arbitrary accuracy. Because the reservoir supplies the nonlinear feature expansion, a purely linear readout suffices: the combination of a random reservoir and a linear readout is a universal approximator for temporal signals with fading memory.

No vanishing/exploding gradients. Because no gradients are propagated through the recurrent dynamics, the training is immune to the vanishing and exploding gradient problems that plague backpropagation-through-time in conventional RNNs.

Convexity. The ridge regression objective (Equation 3) is a strictly convex function of $W^{\text{out}}$. There is a unique global minimum, no local minima, and no saddle points. Training is deterministic and reproducible.

Separation of dynamics and learning. The reservoir dynamics (determined by $W$, $W_{\text{in}}$, the neuron model, and the topology) and the learning (determined by $W^{\text{out}}$ and $\lambda$) are completely decoupled. This allows them to be designed and optimized independently, which is the architectural advantage exploited by the AGILE optimizer.

References

  1. Jaeger, H. (2001). The “echo state” approach to analysing and training recurrent neural networks. GMD Report 148.
  2. Lukoševičius, M., & Jaeger, H. (2009). Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3), 127–149.
  3. Tikhonov, A. N., & Arsenin, V. Y. (1977). Solutions of Ill-Posed Problems. Winston & Sons.
  4. Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Chapter 3. Springer.
  5. Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits. IRE WESCON Convention Record, Part 4, 96–104.

