
Cart-Pole Control

This case study applies a Fractional-Order Spiking Neural Network (FOSNN) to the classic cart-pole balancing problem, a standard benchmark in reinforcement learning and control theory. Unlike the spoken digit recognition task, which tests pattern classification, cart-pole control tests real-time sequential decision making under noisy, non-stationary conditions. The results demonstrate that intermediate fractional orders ($\alpha \approx 0.5\text{--}0.7$) produce controllers that are more robust to observation noise than both classical spiking networks and conventional multilayer perceptrons (MLPs).

Task Description

CartPole-v1 Environment

The experiment uses the CartPole-v1 environment from Gymnasium (the maintained successor to OpenAI Gym). A pole is attached by an unactuated joint to a cart moving along a frictionless track. The agent must apply a horizontal force (left or right) at each time step to keep the pole balanced.

Observation space (4-dimensional continuous):

| Variable | Symbol | Description |
| --- | --- | --- |
| Cart position | $x$ | Horizontal position of the cart |
| Cart velocity | $\dot{x}$ | Horizontal velocity of the cart |
| Pole angle | $\theta$ | Angle from vertical (radians) |
| Pole angular velocity | $\dot{\theta}$ | Angular velocity of the pole |

Action space: Discrete, $\{0, 1\}$ (push left, push right).

Reward: $+1$ for every time step the pole remains upright ($|\theta| < 12°$, $|x| < 2.4$).

Termination: Episode ends when the pole falls beyond $\pm 12°$, the cart leaves the track, or 500 steps are reached. The task is considered “solved” at a mean reward of $\geq 475$ over 100 consecutive episodes.
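The reward and termination rules above can be sketched as a standalone check (thresholds taken from the CartPole-v1 spec quoted above; the helper name is illustrative):

```python
import math

# Thresholds from the CartPole-v1 spec described above.
X_LIMIT = 2.4                      # cart leaves the track beyond |x| = 2.4
THETA_LIMIT = math.radians(12.0)   # pole falls beyond +/-12 degrees
MAX_STEPS = 500                    # episode is truncated at 500 steps

def step_outcome(x, theta, step):
    """Return (reward, done) for one CartPole-v1 time step.

    reward is +1 while the pole is upright and the cart is on the track;
    done is True on failure or when the step limit is reached.
    """
    failed = abs(x) > X_LIMIT or abs(theta) > THETA_LIMIT
    done = failed or step >= MAX_STEPS
    reward = 0.0 if failed else 1.0
    return reward, done
```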

Why Cart-Pole?

Cart-pole is a deceptively simple benchmark that tests several important properties:

  • Temporal credit assignment: Actions have delayed consequences; the agent must learn that an action taken now affects stability several steps later.
  • Continuous observation / discrete action: The controller must threshold a continuous state into a binary decision.
  • Instability: The upright equilibrium is unstable; the controller must actively compensate for perturbations.
  • Noise sensitivity: In the noisy variants, observation noise corrupts the state estimate and challenges the controller’s robustness.

Experimental Setup

Network Architecture

The FOSNN controller uses a spiking reservoir to process the continuous observations and a policy gradient algorithm to learn the action selection:

| Parameter | Value |
| --- | --- |
| Neuron model | FLIF-GL (Grünwald–Letnikov fractional LIF) |
| Input dimensions | 4 (cart-pole observations) |
| Output | 2 (action probabilities via softmax) |
| Training algorithm | REINFORCE (Williams, 1992) |
| Fractional order $\alpha$ | Swept: 0.3, 0.5, 0.6, 0.7, 0.8, 1.0 |
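A minimal sketch of the FLIF-GL membrane update, assuming the standard explicit Grünwald–Letnikov discretization of $D^\alpha V = -V + I$ with a reset-to-zero threshold (constants and reset rule are illustrative, not the paper's exact parameters):

```python
import numpy as np

def gl_weights(alpha, n):
    """GL binomial weights w_k = (-1)^k C(alpha, k) via the standard
    recurrence; |w_k| decays as a power law ~ k^-(1+alpha)."""
    w = np.ones(n)
    for k in range(1, n):
        w[k] = w[k - 1] * (1.0 - (1.0 + alpha) / k)
    return w

def flif_run(current, alpha, dt=1.0, v_th=1.0):
    """Explicit GL step of D^alpha V = -V + I:
        V_n = dt^alpha * (-V_{n-1} + I_n) - sum_{k=1}^{n} w_k V_{n-k},
    with reset to 0 when V crosses v_th.  At alpha = 1 the weights are
    (1, -1, 0, ...) and this reduces to the classical Euler LIF update."""
    T = len(current)
    w = gl_weights(alpha, T + 1)
    V = np.zeros(T + 1)
    spikes = np.zeros(T, dtype=bool)
    for n in range(1, T + 1):
        memory = np.dot(w[1:n + 1], V[n - 1::-1])  # power-law history term
        V[n] = dt**alpha * (-V[n - 1] + current[n - 1]) - memory
        if V[n] >= v_th:
            spikes[n - 1] = True
            V[n] = 0.0
    return V[1:], spikes
```

The `memory` term is the history buffer mentioned later in the discussion: each past membrane value contributes with a power-law weight rather than an exponential one.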

REINFORCE Algorithm

REINFORCE is a policy gradient method that updates the network parameters to maximize expected cumulative reward. The policy $\pi(a \mid s; \theta)$ defines a probability distribution over actions given the current state. The parameter update is:

$$
\theta \leftarrow \theta + \eta \sum_{t=0}^{T} \nabla_\theta \log \pi(a_t \mid s_t; \theta)\, G_t \tag{1}
$$

where $G_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$ is the discounted return from time step $t$, $\gamma$ is the discount factor, and $\eta$ is the learning rate.
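The return $G_t$ can be computed for a whole episode in one backward pass; a minimal sketch:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """G_t = sum_{t'=t}^{T} gamma^(t'-t) r_{t'}, computed via the backward
    recurrence G_t = r_t + gamma * G_{t+1} (O(T) instead of O(T^2))."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G
```

In the update of Eq. (1), each log-probability gradient is then scaled by the corresponding $G_t$ before the summed gradient step.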

In the FOSNN variant, the reservoir processes the observations through its spiking dynamics before the policy head makes the action decision. The fractional memory of the FLIF neurons provides the policy with an implicit state history, reducing the need for explicit recurrence in the policy network.

Baselines

The FOSNN is compared against:

  • Classical SNN ($\alpha = 1.0$): Integer-order spiking network with exponential membrane decay.
  • MLP: A conventional multilayer perceptron with comparable parameter count, trained with REINFORCE.

Experimental Conditions

Three experimental conditions are used to characterize the controller’s properties:

  1. Clean environment: Standard CartPole-v1 with no added noise.
  2. Observation noise: Gaussian noise $\mathcal{N}(0, \sigma^2)$ added to all four observation channels. The noise level $\sigma$ is swept to generate robustness curves.
  3. Progressive node removal: Neurons are removed from the network one at a time (in random order) and performance is re-evaluated after each removal. This tests structural robustness — how gracefully the controller degrades as its computational substrate is damaged.
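Condition 3 can be sketched as a generic ablation loop, where `evaluate(mask)` is a hypothetical stand-in for re-running the controller with the masked neurons disabled and returning its mean reward:

```python
import numpy as np

def progressive_removal(evaluate, n_neurons, rng):
    """Disable neurons one at a time in a random order, re-evaluating
    after each removal; returns the resulting performance curve."""
    mask = np.ones(n_neurons, dtype=bool)   # True = neuron still active
    scores = []
    for idx in rng.permutation(n_neurons):  # random removal order
        mask[idx] = False
        scores.append(evaluate(mask))
    return scores
```

Averaging the resulting curves over many random orders (and seeds) gives the structural-robustness comparison reported below.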

Results

Clean Environment Performance

In the clean environment (no noise), all models eventually solve the task ($\geq 475$ reward over 100 episodes). The key differences are in learning speed and consistency:

  • FOSNN $\alpha = 0.5\text{--}0.7$: Fastest solve time, reaching the 475 threshold in the fewest training episodes. 98% completion rate across random seeds.
  • FOSNN $\alpha = 0.3$: Solves the task but more slowly. The long memory introduces sluggish adaptation to the rapidly changing pole dynamics.
  • Classical SNN ($\alpha = 1.0$): Solves the task but with more variance across seeds.
  • MLP: Solves the task reliably.

Noise Robustness

The critical finding is in the noise experiments. As observation noise $\sigma$ increases:

| Model | Performance under noise |
| --- | --- |
| FOSNN $\alpha = 0.6$ | Best noise robustness. Improved std dev by 4.32% relative to clean baseline. |
| FOSNN $\alpha = 0.5$ | Strong noise robustness, slightly below $\alpha = 0.6$. |
| FOSNN $\alpha = 0.7$ | Good noise robustness. |
| Classical SNN ($\alpha = 1.0$) | Moderate noise robustness. |
| MLP | Worst noise tolerance. Performance degrades most rapidly with increasing $\sigma$. |

The FOSNN at $\alpha = 0.6$ achieves the best noise robustness, improving its standard deviation by 4.32% under noisy observations compared to the clean baseline. This counterintuitive result — the controller becomes more consistent under noise — arises because the fractional memory kernel acts as a temporal low-pass filter, smoothing out observation noise while preserving the slow dynamics of the cart-pole system.

Why Fractional Order Helps Under Noise

The observation noise is i.i.d. (white) at each time step, while the true cart-pole dynamics evolve smoothly. A controller that bases its decisions on only the most recent observation (as an MLP does) is maximally affected by noise. A controller with fractional memory effectively integrates over a history window weighted by the power-law kernel, producing a smoothed state estimate.

The smoothing effect is controlled by $\alpha$:

  • Low $\alpha$: Strong smoothing (long integration window). Suppresses fast, white noise well, but may over-smooth rapid dynamics.
  • High $\alpha$: Weak smoothing (short window). Responsive to dynamics but also to noise.
  • $\alpha \approx 0.6$: Optimal balance for the CartPole-v1 dynamics, which have a natural timescale of tens of steps.

This is formally analogous to a moving-average filter with a power-law window, which is known to be optimal for estimating a signal corrupted by white noise when the signal has a $1/f$-like spectrum.
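A toy numerical check of this smoothing argument (the filter shape is an analogy to the GL kernel's power-law decay, not the network's exact dynamics):

```python
import numpy as np

def power_law_filter(x, alpha, window=50):
    """Causal moving average with weights proportional to k^-(1+alpha),
    mimicking the decay rate of the GL memory kernel."""
    k = np.arange(1, window + 1, dtype=float)
    w = k ** -(1.0 + alpha)
    w /= w.sum()                        # normalize to unit DC gain
    return np.convolve(x, w)[: len(x)]

rng = np.random.default_rng(0)
noise = rng.normal(0.0, 1.0, 10_000)    # i.i.d. observation noise
smoothed = power_law_filter(noise, alpha=0.6)
# For white noise the output variance shrinks by roughly sum(w^2) < 1.
print(noise.std(), smoothed.std())
```

Smaller $\alpha$ spreads the weights over more taps (stronger smoothing); larger $\alpha$ concentrates them on recent samples (weaker smoothing), matching the bullets above.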

Progressive Node Removal (Structural Robustness)

When neurons are progressively removed from the network:

| Model | Structural robustness |
| --- | --- |
| MLP | Most structurally robust. Performance degrades gradually and predictably. |
| FOSNN $\alpha = 0.8$ | Lowest initial degradation among spiking models. |
| FOSNN $\alpha = 0.6$ | Moderate structural robustness. |
| Classical SNN ($\alpha = 1.0$) | Moderate structural robustness. |

The MLP’s structural robustness advantage is expected: in a fully connected network, information is distributed redundantly across all units, so removing any single unit has a small effect. In a spiking reservoir, the sparse recurrent connectivity means that some neurons occupy critical topological positions, and their removal can disproportionately affect performance.

Among the spiking models, higher $\alpha$ (closer to classical) provides better structural robustness. This is because the fractional memory introduces long-range temporal dependencies that are stored in the history of specific neurons. When those neurons are removed, the history is lost and cannot be reconstructed by the remaining network.

Summary of Results

| Property | Best model | Notes |
| --- | --- | --- |
| Learning speed | FOSNN $\alpha = 0.5\text{--}0.7$ | Fastest convergence to 475 reward |
| Noise robustness | FOSNN $\alpha = 0.6$ | 4.32% improvement in std dev under noise |
| Structural robustness | MLP | Gradual degradation under node removal |
| Completion rate | FOSNN $\alpha = 0.5\text{--}0.7$ | 98% across random seeds |

Discussion

The Noise-Robustness Advantage

The most significant finding is that fractional-order spiking networks provide superior noise robustness compared to both classical SNNs and MLPs. This advantage arises naturally from the power-law memory kernel without any explicit noise filtering or state estimation.

In practical control applications, observation noise is ubiquitous — sensors are imperfect, actuators introduce vibrations, and the environment is stochastic. A controller that inherently filters noise through its temporal dynamics is valuable, especially when the added computational cost is minimal (the only overhead is the GL history buffer).

Fractional Order as Implicit State Estimation

The FOSNN’s noise robustness can be understood through the lens of state estimation theory. An optimal controller for the noisy cart-pole would first estimate the true state from noisy observations (e.g., via a Kalman filter) and then apply the optimal control law. The fractional memory kernel approximates this two-step process in a single dynamical system:

  1. The power-law weighted history provides a form of temporal averaging that reduces observation noise.
  2. The spiking reservoir’s nonlinear dynamics provide implicit state estimation by combining observations across time.

This eliminates the need for an explicit filter module, reducing architectural complexity while achieving comparable noise reduction.

Trade-off Between Noise and Structural Robustness

The results reveal an interesting trade-off: the models that are most robust to observation noise (fractional-order SNNs) are less robust to structural damage (node removal), and vice versa. This suggests that fractional memory distributes information temporally but concentrates it structurally, while fully connected architectures (MLPs) distribute information structurally but concentrate it temporally.

A system designer must choose which form of robustness is more important for their application. For noisy environments with intact hardware, the FOSNN is preferred. For environments where hardware failures are likely, the MLP or higher-$\alpha$ SNN may be more appropriate.

