# Training Theory
In reservoir computing, the recurrent weight matrix $W$ and the input weight matrix $W_{\text{in}}$ are fixed at initialization. The only learned component is the readout weight matrix $W_{\text{out}}$, which maps the high-dimensional reservoir state to the desired output. Because the readout is linear, training reduces to solving a system of linear equations: a convex optimization problem with a unique global minimum and no gradient pathology.
This page derives the two training algorithms implemented in SPIRES: ridge regression (batch, offline) and the online delta rule (streaming, incremental).
## Problem Setup

Suppose we drive the reservoir with a training input sequence $u(1), \dots, u(T)$ and collect the corresponding reservoir states $x(t) \in \mathbb{R}^N$, discarding an initial transient of $T_0$ steps to allow the reservoir to wash out its initial conditions.

We arrange the retained states and targets into matrices:

$$
X = \begin{bmatrix} x(T_0+1)^\top \\ \vdots \\ x(T)^\top \end{bmatrix} \in \mathbb{R}^{(T-T_0) \times N},
\qquad
Y = \begin{bmatrix} y(T_0+1)^\top \\ \vdots \\ y(T)^\top \end{bmatrix} \in \mathbb{R}^{(T-T_0) \times M},
$$

where $X$ is the state matrix (or design matrix) and $Y$ is the target matrix. Each row of $X$ is a snapshot of the $N$-dimensional reservoir state at one time step. In the cost estimates below, $T$ denotes the number of training samples.
## Ridge Regression (Batch Training)

### Ordinary Least Squares

The simplest approach is to find the readout $W_{\text{out}}$ that minimizes the sum of squared errors:

$$
J_{\text{OLS}}(W_{\text{out}}) = \|X W_{\text{out}}^\top - Y\|_F^2, \tag{1}
$$

where $\|\cdot\|_F$ is the Frobenius norm. Taking the gradient with respect to $W_{\text{out}}$ and setting it to zero yields the normal equations:

$$
W_{\text{out}}\, X^\top X = Y^\top X.
$$

Solving:

$$
W_{\text{out}} = Y^\top X \,(X^\top X)^{-1}. \tag{2}
$$
This is the ordinary least squares (OLS) solution. It is the unique minimum of a convex quadratic, so there are no local minima or saddle points.
### The Overfitting Problem

In reservoir computing, the state dimensionality $N$ is typically large (hundreds to thousands of neurons), while the number of training samples $T$ may be comparable in magnitude. When $N$ is large relative to the number of samples, OLS exhibits severe overfitting: the readout learns noise in the training data and generalizes poorly.

Mathematically, the problem is that $X^\top X$ becomes ill-conditioned (has near-zero eigenvalues), causing the inverse to amplify noise. The resulting weights have large magnitude, indicating that the readout is fitting tiny fluctuations in the reservoir state.
### Tikhonov Regularization (Ridge Regression)

The solution is to add a penalty on the magnitude of the weights. Ridge regression adds an $\ell_2$ regularization term:

$$
J_{\text{ridge}}(W_{\text{out}}) = \|X W_{\text{out}}^\top - Y\|_F^2 + \lambda \|W_{\text{out}}\|_F^2, \tag{3}
$$

where $\lambda > 0$ is the regularization parameter. The penalty discourages large weights, preferring solutions that are smooth combinations of reservoir states.

Taking the gradient and setting it to zero:

$$
W_{\text{out}}\,(X^\top X + \lambda I) = Y^\top X.
$$

Solving:

$$
W_{\text{out}} = Y^\top X \,(X^\top X + \lambda I)^{-1}. \tag{4}
$$

This is the ridge regression solution, also known as Tikhonov regularization in the numerical analysis literature. The $\lambda I$ term shifts all eigenvalues of $X^\top X$ away from zero, ensuring the matrix is well-conditioned and the inverse is stable.
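As a minimal NumPy sketch (random stand-in data, purely illustrative; the variable names are not the SPIRES API), Equation (4) amounts to a single symmetric linear solve:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: T samples of an N-dimensional reservoir state and
# M-dimensional targets generated from a known linear map plus noise.
T, N, M = 200, 50, 2
X = rng.standard_normal((T, N))                       # state (design) matrix
W_true = rng.standard_normal((M, N))                  # ground-truth readout
Y = X @ W_true.T + 0.01 * rng.standard_normal((T, M))

lam = 1e-3  # ridge parameter lambda

# Equation (4): W_out = Y^T X (X^T X + lambda I)^{-1}.
# Solve the symmetric positive-definite system rather than forming an inverse.
A = X.T @ X + lam * np.eye(N)
W_out = np.linalg.solve(A, X.T @ Y).T                 # shape (M, N)

pred = X @ W_out.T
print("training MSE:", np.mean((pred - Y) ** 2))
```

With $\lambda$ small and mild noise, `W_out` closely recovers the generating map; raising `lam` shrinks the weights at the cost of a slightly higher training error.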
### Interpretation of $\lambda$

The regularization parameter $\lambda$ controls the bias-variance trade-off:

- $\lambda \to 0$: Recovers OLS. Low bias, high variance. The readout fits the training data closely but may not generalize.
- $\lambda \to \infty$: The readout approaches $W_{\text{out}} = 0$. High bias (always predicting near zero), zero variance.
- Optimal $\lambda$: Balances training error and weight magnitude. Typically found by cross-validation or the AGILE optimizer.

The effect on the eigenspectrum of $X^\top X$ is transparent. If $X^\top X$ has eigenvalues $d_1 \ge d_2 \ge \dots \ge d_N \ge 0$, then relative to OLS the ridge solution shrinks the component along the $i$-th eigendirection by a factor of

$$
\frac{d_i}{d_i + \lambda}.
$$

Large eigenvalues (strong signal directions) are barely affected. Small eigenvalues (noise directions) are suppressed toward zero. This is precisely the behavior needed in reservoir computing, where a few principal components of the reservoir state carry signal and the rest carry noise.
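The shrinkage factors $d_i/(d_i+\lambda)$ can be inspected directly. This short sketch (random data, illustrative only) prints one factor per eigendirection of $X^\top X$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))

d = np.linalg.eigvalsh(X.T @ X)   # eigenvalues of X^T X, ascending order
lam = 1.0
shrink = d / (d + lam)            # ridge shrinkage per eigendirection

for di, si in zip(d, shrink):
    print(f"eigenvalue {di:8.2f}  ->  shrink factor {si:.4f}")
```

Directions with large eigenvalues have factors near 1 (barely affected); near-zero eigenvalues are pushed toward 0.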
### Computational Cost

Computing $W_{\text{out}}$ via Equation (4) requires:

- Forming $X^\top X$: $O(TN^2)$ operations.
- Forming $Y^\top X$: $O(TNM)$ operations.
- Solving the $N \times N$ linear system: $O(N^3)$ operations for the factorization, plus $O(N^2)$ per output dimension once a Cholesky factorization is available.

The dominant cost is $O(TN^2)$. For typical reservoir sizes ($N \approx 10^2$ to $10^3$), this is fast: seconds on modern hardware. SPIRES uses LAPACKE for the linear solve, leveraging optimized BLAS routines.
### SVD Alternative

An equivalent formulation uses the singular value decomposition (SVD) of the state matrix. Writing $X = U \Sigma V^\top$ with singular values $\sigma_1 \ge \dots \ge \sigma_N$, Equation (4) becomes

$$
W_{\text{out}} = Y^\top U \,\mathrm{diag}\!\left(\frac{\sigma_i}{\sigma_i^2 + \lambda}\right) V^\top.
$$

This form is more numerically stable when $X^\top X$ is extremely ill-conditioned, though the Cholesky approach is faster when applicable. SPIRES uses the normal equations (Equation 4) by default.
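A quick numerical check (illustrative, random data) that the SVD form agrees with the normal-equations form; here `W` denotes the $N \times M$ transpose of the $W_{\text{out}}$ in Equation (4):

```python
import numpy as np

rng = np.random.default_rng(2)
T, N, M = 120, 30, 2
X = rng.standard_normal((T, N))
Y = rng.standard_normal((T, M))
lam = 1e-2

# Normal equations: W = (X^T X + lam I)^{-1} X^T Y  (transpose of Eq. 4).
W_ne = np.linalg.solve(X.T @ X + lam * np.eye(N), X.T @ Y)

# SVD route: X = U diag(s) V^T, so W = V diag(s / (s^2 + lam)) U^T Y.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
W_svd = Vt.T @ ((s / (s**2 + lam))[:, None] * (U.T @ Y))

print("max difference:", np.abs(W_ne - W_svd).max())
```

The two results agree to machine precision; the SVD route never forms $X^\top X$, which is why it tolerates worse conditioning.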
## Online Delta Rule (Streaming Training)

### Motivation

Ridge regression requires collecting all training data before computing $W_{\text{out}}$. For applications where data arrives continuously or where the target distribution changes over time, an online (incremental) training algorithm is needed.

### The Delta Rule

The online delta rule updates $W_{\text{out}}$ after each time step based on the prediction error:

$$
W_{\text{out}} \leftarrow W_{\text{out}} + \eta \,\big(y(t) - \hat{y}(t)\big)\, x(t)^\top,
$$

where:

- $\eta > 0$ is the learning rate,
- $\hat{y}(t) = W_{\text{out}}\, x(t)$ is the current prediction,
- $y(t)$ is the target at time $t$,
- $x(t)$ is the current reservoir state.

This is a stochastic gradient descent step on the instantaneous squared error loss $E(t) = \tfrac{1}{2}\,\|y(t) - \hat{y}(t)\|^2$. The update is:

$$
\Delta W_{\text{out}} = -\eta\, \frac{\partial E(t)}{\partial W_{\text{out}}} = \eta\,\big(y(t) - \hat{y}(t)\big)\, x(t)^\top.
$$
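A minimal sketch of the update (random stand-in states, not the SPIRES API); with a noise-free linear target the weights converge to the generating map:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 20, 1
W_true = rng.standard_normal((M, N))    # target linear map to be learned
W_out = np.zeros((M, N))
eta = 1e-2                              # learning rate

for t in range(20_000):
    x = rng.standard_normal(N)          # stand-in for reservoir state x(t)
    y = W_true @ x                      # target y(t)
    y_hat = W_out @ x                   # current prediction
    W_out += eta * np.outer(y - y_hat, x)   # delta-rule update

print("max weight error:", np.abs(W_out - W_true).max())
```

Each iteration is exactly one outer product, which is what makes the rule cheap enough to run alongside the reservoir update.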
### Properties

**Convergence.** For a fixed reservoir driven by a stationary input, the delta rule converges to the least-squares solution as $t \to \infty$, provided the learning-rate schedule $\eta_t$ satisfies:

$$
\sum_{t=1}^{\infty} \eta_t = \infty, \qquad \sum_{t=1}^{\infty} \eta_t^2 < \infty.
$$
In practice, $\eta$ is set to a small constant (e.g., $10^{-4}$ to $10^{-2}$).
**Computational cost per step.** Each update requires $O(NM)$ operations: a single outer product. This is negligible compared to the cost of the reservoir state update.

**Tracking non-stationary targets.** Because the delta rule continuously adapts, it can track slow changes in the target distribution. This makes it suitable for online learning and adaptive control applications.
**No regularization.** The basic delta rule does not include explicit regularization. Weight decay can be added:

$$
W_{\text{out}} \leftarrow (1 - \eta\gamma)\, W_{\text{out}} + \eta\,\big(y(t) - \hat{y}(t)\big)\, x(t)^\top.
$$

This is the online equivalent of ridge regression, with $\gamma$ controlling the weight decay.
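A sketch of the weight-decay variant (illustrative, whitened random inputs; not the SPIRES API). Under these assumptions the stationary point is the ridge-like shrinkage $w_{\text{true}}/(1+\gamma)$, which the run approaches up to stochastic fluctuation:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 10
w_true = rng.standard_normal(N)       # target linear map
w = np.zeros(N)
eta, gamma = 1e-2, 0.5                # learning rate and weight-decay strength

for t in range(50_000):
    x = rng.standard_normal(N)        # whitened input: E[x x^T] = I
    err = w_true @ x - w @ x          # scalar prediction error
    w += eta * (err * x - gamma * w)  # delta rule + weight decay

# For E[x x^T] = I the fixed point is w_true / (1 + gamma): shrunk toward 0,
# exactly the bias that ridge regression introduces in the batch setting.
print("distance to shrunk target:", np.linalg.norm(w - w_true / (1 + gamma)))
```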
## Batch vs. Streaming: Comparison
| Property | Ridge Regression | Online Delta Rule |
|---|---|---|
| Training mode | Batch (offline) | Streaming (online) |
| Data requirement | All data available upfront | One sample at a time |
| Optimality | Global optimum (exact) | Asymptotically optimal |
| Regularization | $\lambda$ (Tikhonov) | $\gamma$ (weight decay) |
| Computational cost | $O(TN^2 + N^3)$ total | $O(NM)$ per step |
| Memory requirement | $O(TN)$ for $X$ | $O(NM)$ for $W_{\text{out}}$ |
| Non-stationary targets | Not supported | Naturally supported |
| Sensitivity to $\eta$ | None | High (requires tuning) |
### When to Use Which
Use ridge regression when:
- The training data is fixed and available in advance.
- An exact global optimum is desired.
- The dataset is small enough to store the state matrix $X$.
Use the online delta rule when:
- Data arrives as a stream and cannot be stored.
- The target distribution changes over time (non-stationary).
- Real-time adaptation is required (e.g., control tasks).
- Memory is limited and the state matrix $X$ cannot be stored.
In practice, many SPIRES workflows use ridge regression for initial training and the delta rule for subsequent online adaptation.
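That hybrid workflow can be sketched end to end (random stand-in data; names and numbers are illustrative, not SPIRES defaults): ridge regression fits an initial readout, then the delta rule tracks a drifted target online:

```python
import numpy as np

rng = np.random.default_rng(5)
T, N = 500, 20

# Phase 1: batch ridge regression on an initial dataset.
X = rng.standard_normal((T, N))             # stand-in reservoir states
w_target = rng.standard_normal(N)           # initial target map
y = X @ w_target
w = np.linalg.solve(X.T @ X + 1e-6 * np.eye(N), X.T @ y)

# Phase 2: the target mapping drifts; the delta rule adapts w online.
w_drifted = w_target + 0.5 * rng.standard_normal(N)
eta = 1e-2
for t in range(20_000):
    x = rng.standard_normal(N)
    w += eta * (w_drifted @ x - w @ x) * x  # delta-rule update

print("distance to drifted target:", np.linalg.norm(w - w_drifted))
```

The batch phase gives a good starting point, so the online phase only has to correct the drift rather than learn from scratch.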
## Connection to Reservoir Computing Theory
The fact that only the readout layer is trained has profound theoretical implications:
**Universality.** A reservoir with the echo state property and a sufficiently rich state space can approximate any fading-memory filter to arbitrary accuracy, provided the readout is powerful enough. Since a linear readout suffices when the reservoir provides a sufficiently rich feature expansion, the combination of a random reservoir and a linear readout is a universal function approximator for temporal signals.

**No vanishing/exploding gradients.** Because no gradients are propagated through the recurrent dynamics, training is immune to the vanishing and exploding gradient problems that plague backpropagation-through-time in conventional RNNs.

**Convexity.** The ridge regression objective (Equation 3) is a strictly convex function of $W_{\text{out}}$. There is a unique global minimum, no local minima, and no saddle points. Training is deterministic and reproducible.

**Separation of dynamics and learning.** The reservoir dynamics (determined by $W$, $W_{\text{in}}$, the neuron model, and the topology) and the learning (determined by $\lambda$ and $\eta$) are completely decoupled. This allows them to be designed and optimized independently, which is the architectural advantage exploited by the AGILE optimizer.
## References
- Jaeger, H. (2001). The “echo state” approach to analysing and training recurrent neural networks. GMD Report 148.
- Lukoševičius, M., & Jaeger, H. (2009). Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3), 127–149.
- Tikhonov, A. N., & Arsenin, V. Y. (1977). Solutions of Ill-Posed Problems. Winston & Sons.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Chapter 3. Springer.
- Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits. IRE WESCON Convention Record, Part 4, 96–104.