
Scoring Metrics

The scoring configuration tells the AGILE optimizer how to evaluate and rank candidate reservoir configurations. It defines which performance metric to use, how to penalize variance across random seeds, and how to penalize computational cost. These choices shape the optimizer’s search toward configurations that are not only accurate but also robust and efficient.

The spires_opt_score Struct

struct spires_opt_score {
    double lambda_var;   /* variance penalty weight */
    double lambda_cost;  /* computational cost penalty weight */
    int metric;          /* performance metric (enum) */
};

Fields

| Field | Type | Description |
| --- | --- | --- |
| lambda_var | double | Weight for the variance penalty. Higher values favor configurations with consistent performance across random seeds. Range: ≥ 0. |
| lambda_cost | double | Weight for the computational cost penalty. Higher values favor cheaper (faster) configurations. Range: ≥ 0. |
| metric | int | The performance metric to optimize. One of SPIRES_METRIC_AUROC or SPIRES_METRIC_AUPRC. |

Composite Score

The optimizer computes a composite score for each candidate configuration:

$$\text{Score} = \bar{m} - \lambda_{\text{var}} \cdot \sigma_m - \lambda_{\text{cost}} \cdot c$$

where:

  • $\bar{m}$ is the mean of the performance metric across random seeds
  • $\sigma_m$ is the standard deviation of the metric across seeds
  • $c$ is a normalized computational cost measure
  • $\lambda_{\text{var}}$ and $\lambda_{\text{cost}}$ are the penalty weights

The optimizer maximizes this composite score. A configuration that scores highly must have a high mean metric, low variance across seeds, and low computational cost (if cost is penalized).

Performance Metrics

AUROC (Area Under the Receiver Operating Characteristic)

score.metric = SPIRES_METRIC_AUROC; /* value: 0 */

AUROC measures the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example by the classifier. It ranges from 0 to 1, where:

  • 1.0: Perfect discrimination
  • 0.5: Random guessing (no discriminative ability)
  • < 0.5: Worse than random (predictions are inverted)

AUROC is threshold-independent: it evaluates the ranking quality of the output across all possible classification thresholds. This makes it robust to miscalibrated output scales.

When to use AUROC:

  • Balanced or moderately imbalanced classification tasks
  • When you care about overall ranking quality
  • When the cost of false positives and false negatives is roughly equal
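
The pairwise definition of AUROC translates directly into code. Below is a minimal sketch (not the library's implementation) that counts positive–negative pairs, crediting ties half a point:

```c
/* Pairwise AUROC sketch: fraction of (positive, negative) pairs in
   which the positive example is scored higher; ties count as half. */
static double auroc(const double *scores, const int *labels, int n)
{
    long pairs = 0;
    double wins = 0.0;
    for (int i = 0; i < n; i++) {
        if (!labels[i]) continue;            /* i ranges over positives */
        for (int j = 0; j < n; j++) {
            if (labels[j]) continue;         /* j ranges over negatives */
            pairs++;
            if (scores[i] > scores[j])       wins += 1.0;
            else if (scores[i] == scores[j]) wins += 0.5;
        }
    }
    return pairs ? wins / pairs : 0.0;
}
```

The O(n²) pair loop is fine for illustration; production code would sort by score and use ranks.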

AUPRC (Area Under the Precision-Recall Curve)

score.metric = SPIRES_METRIC_AUPRC; /* value: 1 */

AUPRC measures the area under the precision-recall curve, where precision = TP / (TP + FP) and recall = TP / (TP + FN). It ranges from 0 to 1, where:

  • 1.0: Perfect precision and recall at all thresholds
  • Baseline: Equal to the positive class prevalence (e.g., 0.01 for 1% positive rate)

AUPRC is more informative than AUROC when classes are highly imbalanced, because it focuses on the performance of the positive class. A high AUROC can be achieved trivially on imbalanced data by predicting the majority class, but a high AUPRC requires genuinely identifying the rare positive cases.
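
To make the precision and recall definitions concrete, here is a small illustrative helper (not part of the library) that evaluates both at a single threshold; AUPRC aggregates such points over all thresholds:

```c
/* Precision and recall of thresholded predictions.
   precision = TP / (TP + FP), recall = TP / (TP + FN). */
struct pr { double precision, recall; };

static struct pr pr_at_threshold(const double *scores, const int *labels,
                                 int n, double thresh)
{
    int tp = 0, fp = 0, fn = 0;
    for (int i = 0; i < n; i++) {
        int pred = scores[i] >= thresh;
        if (pred && labels[i])       tp++;
        else if (pred && !labels[i]) fp++;
        else if (!pred && labels[i]) fn++;
    }
    struct pr r;
    r.precision = (tp + fp) ? (double)tp / (tp + fp) : 1.0;
    r.recall    = (tp + fn) ? (double)tp / (tp + fn) : 0.0;
    return r;
}
```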

When to use AUPRC:

  • Highly imbalanced classification tasks (e.g., anomaly detection, rare event prediction)
  • When false negatives are costly (missing a positive case)
  • When the positive class prevalence is below 10%

Choosing Between AUROC and AUPRC

| Scenario | Recommended Metric |
| --- | --- |
| Balanced classes (40–60% positive) | AUROC |
| Moderate imbalance (10–40% positive) | AUROC or AUPRC |
| High imbalance (1–10% positive) | AUPRC |
| Extreme imbalance (< 1% positive) | AUPRC |
| Cost-sensitive with equal costs | AUROC |
| Detecting rare events | AUPRC |
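
The table above can be condensed into a simple prevalence-based heuristic. The helper below is hypothetical (the macro values match the enum values documented on this page):

```c
/* Metric values as documented on this page. */
#define SPIRES_METRIC_AUROC 0
#define SPIRES_METRIC_AUPRC 1

/* Hypothetical heuristic: prefer AUPRC below ~10% positive prevalence,
   matching the guidance in the table above. */
static int choose_metric(double positive_prevalence)
{
    return positive_prevalence < 0.10 ? SPIRES_METRIC_AUPRC
                                      : SPIRES_METRIC_AUROC;
}
```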

Variance Penalty

The variance penalty $\lambda_{\text{var}}$ controls how much the optimizer values consistency across random seeds.

When a configuration is evaluated with $S$ random seeds, it produces $S$ metric values $\{m_1, m_2, \ldots, m_S\}$. The mean $\bar{m}$ and standard deviation $\sigma_m$ are computed, and the score is reduced by $\lambda_{\text{var}} \cdot \sigma_m$.

Effect of lambda_var

| lambda_var | Behavior |
| --- | --- |
| 0.0 | No variance penalty; the optimizer seeks the highest mean performance regardless of consistency |
| 0.5 | Moderate penalty; one standard deviation of spread costs half a metric point |
| 1.0 | Strong penalty; equivalent to optimizing the lower bound $\bar{m} - \sigma_m$ |
| 2.0 | Very strong penalty; roughly a two-sigma (≈95% confidence) lower bound |

Practical guidance:

  • For research and benchmarking, use lambda_var = 0.0 or 0.5 to find the highest-performing configuration.
  • For production deployments where reliability matters, use lambda_var = 1.0 or higher to ensure the chosen configuration performs consistently.
  • If using very few seeds (1–2), the variance estimate is unreliable. Either increase the number of seeds or reduce the penalty.
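
A small example of how the penalty can flip a ranking: a candidate with mean 0.90 and std 0.08 beats one with mean 0.87 and std 0.01 at lambda_var = 0.0, but loses at lambda_var = 1.0 (0.82 vs. 0.86). A hypothetical helper:

```c
/* Hypothetical helper: index (0 or 1) of the candidate with the
   higher variance-penalized score. */
static int pick(const double mean[2], const double std[2],
                double lambda_var)
{
    double s0 = mean[0] - lambda_var * std[0];
    double s1 = mean[1] - lambda_var * std[1];
    return s1 > s0 ? 1 : 0;
}
```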

Cost Penalty

The cost penalty $\lambda_{\text{cost}}$ discourages the optimizer from selecting computationally expensive configurations when cheaper alternatives perform nearly as well.

The cost $c$ is a normalized measure that accounts for:

  • Reservoir size: Larger reservoirs (more neurons) are more expensive.
  • Neuron complexity: Fractional neurons with long histories are more expensive per step than simple LIF neurons.
  • Connectivity density: Denser networks have more synaptic computations.
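
The exact normalization is internal to the library. Purely as an illustration of how the three factors above might combine, consider an equal-weight average against reference values (every name here is hypothetical):

```c
/* Illustrative only -- NOT the library's actual cost model.
   Combines neuron count, per-neuron step cost, and connectivity
   density into a single normalized cost. */
static double example_cost(int neurons, double step_cost, double density,
                           int ref_neurons, double ref_step_cost)
{
    double size_term    = (double)neurons / ref_neurons;
    double neuron_term  = step_cost / ref_step_cost;
    double synapse_term = density;  /* fraction of possible connections */
    return (size_term + neuron_term + synapse_term) / 3.0;
}
```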

Effect of lambda_cost

| lambda_cost | Behavior |
| --- | --- |
| 0.0 | No cost penalty; the optimizer chooses the best configuration regardless of computational expense |
| 0.1 | Mild penalty; prefers cheaper configurations when performance is similar |
| 0.5 | Moderate penalty; willing to sacrifice some performance for a significant speedup |
| 1.0 | Strong penalty; aggressively favors cheap configurations |

Practical guidance:

  • For offline analysis where compute time is not critical, use lambda_cost = 0.0.
  • For real-time or embedded applications where inference speed matters, increase lambda_cost to bias toward smaller, faster reservoirs.
  • For balancing accuracy and efficiency, lambda_cost = 0.1 is a good starting point.

Example Configurations

Maximum Performance

Seek the highest AUROC, regardless of variance or cost:

struct spires_opt_score score = {
    .lambda_var  = 0.0,
    .lambda_cost = 0.0,
    .metric      = SPIRES_METRIC_AUROC,
};

Robust Performance

Optimize for consistent AUROC across seeds:

struct spires_opt_score score = {
    .lambda_var  = 1.0,
    .lambda_cost = 0.0,
    .metric      = SPIRES_METRIC_AUROC,
};

Balanced Efficiency

Good AUPRC with a preference for cheaper configurations:

struct spires_opt_score score = {
    .lambda_var  = 0.5,
    .lambda_cost = 0.2,
    .metric      = SPIRES_METRIC_AUPRC,
};

Real-Time Deployment

Strongly favor fast configurations for anomaly detection:

struct spires_opt_score score = {
    .lambda_var  = 1.0,
    .lambda_cost = 0.5,
    .metric      = SPIRES_METRIC_AUPRC,
};

Interpreting the Result

After optimization, the spires_opt_result struct contains:

struct spires_opt_result {
    spires_reservoir_config best_config;  /* optimal configuration */
    double best_log10_ridge;              /* log10 of best ridge lambda */
    double best_score;                    /* composite score */
    double metric_mean;                   /* mean metric across seeds */
    double metric_std;                    /* std of metric across seeds */
};

| Field | Interpretation |
| --- | --- |
| best_score | The composite score (metric mean minus penalties). This is what the optimizer maximized. |
| metric_mean | The raw mean performance metric. Compare this across different scoring configurations to understand the accuracy-cost trade-off. |
| metric_std | The variability across seeds. Lower is better for deployment reliability. |
| best_log10_ridge | The optimal ridge regularization parameter in log space. Use pow(10.0, best_log10_ridge) to obtain $\lambda$. |

Interaction with Budget Levels

The scoring configuration is applied at every budget level. At low-fidelity levels (fewer seeds, less data), the metric estimates are noisier. The variance penalty effectively accounts for this: configurations with high variance at low fidelity are penalized, which is appropriate because truly good configurations tend to show consistent performance even with limited evaluation.

At higher fidelity levels, the metric estimates become more reliable, and the variance penalty becomes a true measure of the configuration’s inherent robustness rather than an artifact of limited evaluation.


