Parallelism
SPIRES leverages two levels of parallelism to accelerate reservoir simulation and training: OpenMP for neuron-level updates and BLAS/LAPACK for linear algebra operations. Understanding how these interact — and how to configure them — is important for achieving good performance, especially on multi-core systems.
OpenMP: Neuron Updates
The most computationally intensive part of reservoir simulation is updating the membrane potential of every neuron at each time step. For fractional neurons with history length `L`, each neuron requires an `O(L)` dot product over its voltage history. With `N` neurons, the total per-step cost is `O(N * L)`.
SPIRES parallelizes the outer loop over neurons using OpenMP:
```c
/* Simplified pseudocode of the internal update loop */
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < num_neurons; i++) {
    /* Compute synaptic input for neuron i */
    /* Compute history sum (GL coefficients dot voltage history) */
    /* Update membrane potential */
    /* Check threshold and reset */
}
```

Each neuron's update is independent of the other neurons at the same time step (they depend only on the previous time step's states, which are fixed). This makes the loop embarrassingly parallel, with no synchronization overhead.
When OpenMP Helps
OpenMP parallelism is most beneficial when:
- Fractional neurons are used (the `O(L)` history sum dominates).
- The reservoir is large (per-step cost scales with `N`).
- Multiple cores are available (2+ physical cores).
For small reservoirs with integer-order LIF neurons, the per-step cost is only `O(N)`, and the OpenMP overhead (thread creation, synchronization) may exceed the savings. In this case, single-threaded execution can be faster.
BLAS/LAPACK: Linear Algebra
SPIRES uses BLAS (Basic Linear Algebra Subprograms) and LAPACKE (C interface to LAPACK) for matrix operations during training:
- `dgemm`: matrix-matrix multiplication for computing `XᵀX` and `XᵀY` in ridge regression (where `X` is the collected state matrix and `Y` the training targets).
- `dposv`: solving the symmetric positive definite system `(XᵀX + λI) W = XᵀY` for the readout weights `W`.
These routines are provided by OpenBLAS (the recommended BLAS implementation for SPIRES), which has its own internal threading.
OpenBLAS Threading
OpenBLAS uses pthreads to parallelize matrix operations. By default, it uses all available cores. This is controlled by the OPENBLAS_NUM_THREADS environment variable:
```sh
export OPENBLAS_NUM_THREADS=4
```

Configuring Thread Counts
Environment Variables
| Variable | Controls | Default |
|---|---|---|
| `OMP_NUM_THREADS` | OpenMP threads (neuron updates) | Number of CPU cores |
| `OPENBLAS_NUM_THREADS` | OpenBLAS threads (BLAS/LAPACK) | Number of CPU cores |
Setting Threads at Runtime
```sh
# Use 4 OpenMP threads and 4 BLAS threads
export OMP_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4
./my_spires_program
```

Or set them inline:

```sh
OMP_NUM_THREADS=4 OPENBLAS_NUM_THREADS=4 ./my_spires_program
```

Recommended Configurations
The optimal thread configuration depends on the workload:
| Scenario | OMP_NUM_THREADS | OPENBLAS_NUM_THREADS | Rationale |
|---|---|---|---|
| Large fractional reservoir, short training | `P` (all cores) | 1 | Neuron updates dominate; avoid BLAS contention |
| Small LIF reservoir, long training | 1 | `P` (all cores) | Training dominates; maximize BLAS throughput |
| Balanced workload | `P/2` | `P/2` | Split cores between simulation and training |
| Multiple reservoirs in parallel | 1 per reservoir | 1 per reservoir | Avoid oversubscription |

Here `P` is the number of physical CPU cores.
Avoiding Thread Oversubscription
A critical pitfall is oversubscription: requesting more total threads than available CPU cores. If `OMP_NUM_THREADS=8` and `OPENBLAS_NUM_THREADS=8` on an 8-core machine, a BLAS call from within an OpenMP region could spawn up to 8 × 8 = 64 threads, leading to severe performance degradation.
SPIRES avoids this internally by ensuring that BLAS calls during training are not nested inside OpenMP regions. However, if you make your own BLAS calls alongside SPIRES, be aware of the interaction.
A safe default for most workloads:
```sh
export OMP_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=1
```

This gives OpenMP full control during simulation and limits BLAS to single-threaded operation, which avoids contention. For the training phase (which does not use OpenMP), BLAS still runs single-threaded, but the solve is fast for typical reservoir sizes.
Thread Safety
Each Reservoir is Independent
SPIRES reservoirs are independent, self-contained objects. You can safely use different reservoirs from different threads:
```c
/* Safe: each thread operates on its own reservoir */
#pragma omp parallel for
for (int i = 0; i < num_reservoirs; i++) {
    spires_reservoir *r = NULL;
    spires_reservoir_create(&configs[i], &r);
    spires_train_ridge(r, input, target, len, lambda);
    double *pred = spires_run(r, test, test_len);
    results[i] = evaluate(pred, test_len);
    free(pred);
    spires_reservoir_destroy(r);
}
```

Do Not Share Reservoirs
A single reservoir must not be accessed from multiple threads simultaneously. The internal state (membrane potentials, history buffers, readout weights) is not protected by locks:
```c
/* UNSAFE: sharing a single reservoir across threads */
spires_reservoir *r = /* ... */;
#pragma omp parallel for
for (int i = 0; i < num_trials; i++) {
    /* BAD: concurrent writes to r's internal state */
    spires_step(r, &inputs[i]);
}
```

If you need to run the same configuration with different inputs in parallel, create separate reservoirs for each thread.
Thread-Safe Functions
The following functions are safe to call from any thread, as they only read shared (immutable) data or operate on caller-owned buffers:
- `spires_num_neurons(r)` — reads a constant
- `spires_num_inputs(r)` — reads a constant
- `spires_num_outputs(r)` — reads a constant
All other functions modify the reservoir’s internal state and must not be called concurrently on the same reservoir.
Performance Profiling
Measuring Parallelism Benefits
To measure the speedup from parallelism, compare single-threaded and multi-threaded execution:
```sh
# Single-threaded baseline
OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 time ./my_program

# Multi-threaded
OMP_NUM_THREADS=4 OPENBLAS_NUM_THREADS=1 time ./my_program
```

The speedup depends on the fraction of time spent in parallelizable code (Amdahl's law). For large fractional reservoirs, the parallelizable fraction is high (>90%), and near-linear speedup is achievable with up to 4-8 cores.
Identifying Bottlenecks
If parallelism does not provide the expected speedup:
- Reservoir is too small: The per-neuron work is insufficient to amortize thread overhead. Fractional models with larger reservoirs parallelize best.
- History length is too short: For integer-order LIF, the per-neuron cost is `O(1)`, and parallelism overhead dominates.
- Memory bandwidth: For very large reservoirs, the bottleneck shifts from computation to memory access speed. Ensure the reservoir state fits in the CPU cache.
- Thread oversubscription: Check that total thread count does not exceed physical core count.
Platform-Specific Notes
macOS
macOS provides libomp via Homebrew (`brew install libomp`). The default Apple Clang compiler requires the `-Xpreprocessor -fopenmp` flags and explicit linking:

```sh
gcc -Xpreprocessor -fopenmp -lomp -o my_program my_program.c -lspires ...
```

Alternatively, use GCC from Homebrew (`brew install gcc`), which has native OpenMP support.
Linux
Most Linux distributions ship GCC with OpenMP support by default. Add -fopenmp to the compiler flags:
```sh
gcc -fopenmp -o my_program my_program.c -lspires ...
```

Without OpenMP
SPIRES can be built without OpenMP. In this case, all neuron updates run sequentially. This is useful for embedded systems, single-core machines, or debugging. The library’s behavior is otherwise identical.