
Parallelism

SPIRES leverages two levels of parallelism to accelerate reservoir simulation and training: OpenMP for neuron-level updates and BLAS/LAPACK for linear algebra operations. Understanding how these interact — and how to configure them — is important for achieving good performance, especially on multi-core systems.

OpenMP: Neuron Updates

The most computationally intensive part of reservoir simulation is updating the membrane potential of every neuron at each time step. For fractional neurons with history length L, each neuron requires an O(L) dot product over its voltage history. With N neurons, the total per-step cost is O(N · L).

SPIRES parallelizes the outer loop over neurons using OpenMP:

```c
/* Simplified pseudocode of the internal update loop */
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < num_neurons; i++) {
    /* Compute synaptic input for neuron i */
    /* Compute history sum (GL coefficients dot voltage history) */
    /* Update membrane potential */
    /* Check threshold and reset */
}
```

Each neuron’s update is independent of the others within a time step: updates depend only on the previous step’s states, which are fixed. The loop is therefore embarrassingly parallel, with no synchronization overhead.

When OpenMP Helps

OpenMP parallelism is most beneficial when:

  • Fractional neurons are used (the O(L) history sum dominates).
  • Large reservoirs are used (N > 200).
  • Multiple cores are available (2+ physical cores).

For small reservoirs with integer-order LIF neurons, the per-step cost is O(N) and the OpenMP overhead (thread creation, synchronization) may exceed the savings. In this case, single-threaded execution can be faster.

BLAS/LAPACK: Linear Algebra

SPIRES uses BLAS (Basic Linear Algebra Subprograms) and LAPACKE (C interface to LAPACK) for matrix operations during training:

  • dgemm: Matrix-matrix multiplication for computing ΦᵀΦ and ΦᵀY in ridge regression.
  • dposv: Solving the symmetric positive definite system (ΦᵀΦ + λI) X = ΦᵀY for the readout weights.

These routines are provided by OpenBLAS (the recommended BLAS implementation for SPIRES), which has its own internal threading.

OpenBLAS Threading

OpenBLAS uses pthreads to parallelize matrix operations. By default, it uses all available cores. This is controlled by the OPENBLAS_NUM_THREADS environment variable:

```sh
export OPENBLAS_NUM_THREADS=4
```

Configuring Thread Counts

Environment Variables

| Variable | Controls | Default |
| --- | --- | --- |
| OMP_NUM_THREADS | OpenMP threads (neuron updates) | Number of CPU cores |
| OPENBLAS_NUM_THREADS | OpenBLAS threads (BLAS/LAPACK) | Number of CPU cores |

Setting Threads at Runtime

```sh
# Use 4 OpenMP threads and 4 BLAS threads
export OMP_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4
./my_spires_program
```

Or set them inline:

```sh
OMP_NUM_THREADS=4 OPENBLAS_NUM_THREADS=4 ./my_spires_program
```

The optimal thread configuration depends on the workload:

| Scenario | OMP_NUM_THREADS | OPENBLAS_NUM_THREADS | Rationale |
| --- | --- | --- | --- |
| Large fractional reservoir, short training | P (all cores) | 1 | Neuron updates dominate; avoid BLAS contention |
| Small LIF reservoir, long training | 1 | P (all cores) | Training dominates; maximize BLAS throughput |
| Balanced workload | P/2 | P/2 | Split cores between simulation and training |
| Multiple reservoirs in parallel | 1 per reservoir | 1 per reservoir | Avoid oversubscription |

Here P is the number of physical CPU cores.

Avoiding Thread Oversubscription

A critical pitfall is oversubscription: requesting more total threads than available CPU cores. If OMP_NUM_THREADS=8 and OPENBLAS_NUM_THREADS=8 on an 8-core machine, a BLAS call from within an OpenMP region could spawn 8 × 8 = 64 threads, leading to severe performance degradation.

SPIRES avoids this internally by ensuring that BLAS calls during training are not nested inside OpenMP regions. However, if you make your own BLAS calls alongside SPIRES, be aware of the interaction.

A safe default for most workloads:

```sh
export OMP_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=1
```

This gives OpenMP full control during simulation and limits BLAS to single-threaded operation, which avoids contention. For the training phase (which does not use OpenMP), BLAS still runs single-threaded, but the O(N³) solve is fast for typical reservoir sizes (N < 2000).

Thread Safety

Each Reservoir is Independent

SPIRES reservoirs are independent, self-contained objects. You can safely use different reservoirs from different threads:

```c
/* Safe: each thread operates on its own reservoir */
#pragma omp parallel for
for (int i = 0; i < num_reservoirs; i++) {
    spires_reservoir *r = NULL;
    spires_reservoir_create(&configs[i], &r);
    spires_train_ridge(r, input, target, len, lambda);
    double *pred = spires_run(r, test, test_len);
    results[i] = evaluate(pred, test_len);
    free(pred);
    spires_reservoir_destroy(r);
}
```

Do Not Share Reservoirs

A single reservoir must not be accessed from multiple threads simultaneously. The internal state (membrane potentials, history buffers, readout weights) is not protected by locks:

```c
/* UNSAFE: sharing a single reservoir across threads */
spires_reservoir *r = /* ... */;
#pragma omp parallel for
for (int i = 0; i < num_trials; i++) {
    /* BAD: concurrent writes to r's internal state */
    spires_step(r, &inputs[i]);
}
```

If you need to run the same configuration with different inputs in parallel, create separate reservoirs for each thread.

Thread-Safe Functions

The following functions are safe to call from any thread, as they only read shared (immutable) data or operate on caller-owned buffers:

  • spires_num_neurons(r) — reads a constant
  • spires_num_inputs(r) — reads a constant
  • spires_num_outputs(r) — reads a constant

All other functions modify the reservoir’s internal state and must not be called concurrently on the same reservoir.

Performance Profiling

Measuring Parallelism Benefits

To measure the speedup from parallelism, compare single-threaded and multi-threaded execution:

```sh
# Single-threaded baseline
OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 time ./my_program

# Multi-threaded
OMP_NUM_THREADS=4 OPENBLAS_NUM_THREADS=1 time ./my_program
```

The speedup depends on the fraction of time spent in parallelizable code (Amdahl’s law). For large fractional reservoirs, the parallelizable fraction is high (above 90%), and near-linear speedup is achievable with up to 4–8 cores.

Identifying Bottlenecks

If parallelism does not provide the expected speedup:

  1. Reservoir is too small: The per-neuron work is insufficient to amortize thread overhead. Use at least N ≥ 200 neurons with fractional models.
  2. History length is too short: For integer-order LIF, the per-neuron cost is O(1), and parallelism overhead dominates.
  3. Memory bandwidth: For very large reservoirs, the bottleneck shifts from computation to memory access speed. Ensure the reservoir state fits in the CPU cache.
  4. Thread oversubscription: Check that total thread count does not exceed physical core count.

Platform-Specific Notes

macOS

macOS provides libomp via Homebrew (brew install libomp). The default Apple Clang compiler requires the -Xpreprocessor -fopenmp flags and explicit linking against libomp:

```sh
gcc -Xpreprocessor -fopenmp -o my_program my_program.c -lomp -lspires ...
```

Alternatively, use GCC from Homebrew (brew install gcc) which has native OpenMP support.

Linux

Most Linux distributions ship GCC with OpenMP support by default. Add -fopenmp to the compiler flags:

```sh
gcc -fopenmp -o my_program my_program.c -lspires ...
```

Without OpenMP

SPIRES can be built without OpenMP. In this case, all neuron updates run sequentially. This is useful for embedded systems, single-core machines, or debugging. The library’s behavior is otherwise identical.

