Sharpe Stability Ratio (SSR)
A signal-to-noise measure of how consistently a strategy delivers risk-adjusted performance over time, computed from the rolling Sharpe ratio with a heteroskedasticity- and autocorrelation-consistent (HAC) long-run variance.
Introduced by Bajo Traver and Rodríguez Domínguez (2026) as the temporal-consistency complement to the Sharpe Ratio (SR), the Probabilistic Sharpe Ratio (PSR), and the Deflated Sharpe Ratio (DSR).
Overview & Motivation
The Sharpe Ratio compresses an entire return history into a single scalar. Two strategies with identical full-sample Sharpe ratios may nevertheless differ sharply in their temporal profile: one might deliver steady excess returns across every subperiod, while the other concentrates its gains in a small number of favourable episodes interspersed with mediocrity. This distinction is economically material for due diligence, ongoing monitoring, capital allocation, and performance-fee design, but the classical SR cannot resolve it.
The Probabilistic Sharpe Ratio (PSR) of Bailey and López de Prado (2012) adjusts the SR for finite-sample uncertainty under non-Normal returns, answering "is this Sharpe ratio statistically credible at this frequency over this sample?". The Deflated Sharpe Ratio (DSR) corrects for cross-sectional selection bias arising from multiple testing across strategy candidates. Both metrics evaluate performance at a single temporal aggregation; neither quantifies how risk-adjusted returns evolve within the sample period.
The Sharpe Stability Ratio (SSR) closes this gap. It treats the rolling Sharpe ratio as a time-series object and defines stability as the ratio of its mean to its HAC long-run standard deviation. High SSR indicates persistent risk-adjusted performance across subperiods; low SSR reveals episodic outperformance concentrated in short-lived favourable windows.
The framework is non-trivial because adjacent rolling windows share observations, inducing strong mechanical autocorrelation in the rolling SR sequence even when the underlying returns are independent. Naive variance estimators ignore this dependence and understate temporal uncertainty by orders of magnitude. HAC correction is therefore indispensable for valid inference, and is the central methodological contribution of the SSR framework.
Setup & Notation
Excess returns
Let $r_t$ denote portfolio returns and $r_t^f$ the corresponding risk-free rates. Excess returns are

$$x_t = r_t - r_t^f, \qquad t = 1, \dots, T.$$
The process $\{x_t\}$ is assumed strictly stationary and $\alpha$-mixing — a regularity condition satisfied by standard return models including AR, GARCH, and Markov-switching processes. This guarantees consistency of the HAC variance estimator and validity of the moving-block bootstrap.
Full-sample Sharpe ratio
The classical (point-in-time) Sharpe ratio over $t = 1, \dots, T$ is

$$\widehat{SR} = \frac{\hat\mu}{\hat\sigma}, \qquad \hat\mu = \frac{1}{T}\sum_{t=1}^{T} x_t, \qquad \hat\sigma^2 = \frac{1}{T-1}\sum_{t=1}^{T} (x_t - \hat\mu)^2.$$
The annualized Sharpe is $\widehat{SR}_{\mathrm{ann}} = \sqrt{q}\,\widehat{SR}$, where $q$ is the number of observations per year ($q = 252$ for daily, $q = 52$ for weekly, $q = 12$ for monthly).
Rolling performance process
For a window length $w$, define the rolling mean and rolling sample variance

$$\hat\mu_t(w) = \frac{1}{w}\sum_{i=t-w+1}^{t} x_i, \qquad \hat\sigma_t^2(w) = \frac{1}{w-1}\sum_{i=t-w+1}^{t} \bigl(x_i - \hat\mu_t(w)\bigr)^2, \qquad t = w, \dots, T.$$
The corresponding rolling Sharpe ratio and the rolling performance process are

$$\widehat{SR}_t(w) = \frac{\hat\mu_t(w)}{\hat\sigma_t(w)}, \qquad S_t = \widehat{SR}_t(w), \qquad t = w, \dots, T,$$

giving $n = T - w + 1$ rolling observations.
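As a concrete reference, here is a minimal NumPy sketch of the rolling performance process on the per-period scale (the function name is ours, not the backend's):

```python
import numpy as np

def rolling_sharpe(x, w):
    """Rolling Sharpe ratio over windows of length w.

    x : 1-D array of excess returns.
    Returns the n = len(x) - w + 1 rolling observations S_w, ..., S_T.
    """
    x = np.asarray(x, dtype=float)
    n = len(x) - w + 1
    out = np.empty(n)
    for t in range(n):
        win = x[t:t + w]
        # rolling mean over the window divided by the sample std (divisor w - 1)
        out[t] = win.mean() / win.std(ddof=1)
    return out
```

For daily data with $w = 252$, a ten-year history ($T \approx 2520$) yields roughly 2269 rolling observations.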
The choice of $w$ is a smoothing decision analogous to bandwidth selection in nonparametric estimation: shorter windows produce noisy, high-variance Sharpe estimates, while longer windows yield smoother but more averaged-out dynamics. The window must be long enough for the Central Limit Theorem to justify approximate normality of local Sharpe-type estimators.
Our backend uses paper-aligned fixed defaults — no adaptive shrinkage:
- Daily data ($q = 252$): $w = 252$ (one trading year).
- Monthly data ($q = 12$): $w = 36$ (three years), the value used in the original paper's empirical analysis.
- Other frequencies: $w$ defaults to $q$ (one calendar year of observations).
When the available history is too short to construct at least 60 rolling Sharpe observations at the chosen window, SSR is reported as unavailable rather than estimated from a shrunken window. This is deliberate: a partial-window SSR is harder to interpret and easier to over-trust than an explicit "insufficient data" verdict.
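The defaults and the minimum-history guard above can be sketched as two small helpers (illustrative, not the backend's actual code):

```python
def default_window(q):
    """Paper-aligned default rolling window for q observations per year."""
    if q == 252:        # daily: one trading year
        return 252
    if q == 12:         # monthly: three years, as in the original paper
        return 36
    return q            # otherwise: one calendar year of observations

def has_enough_history(T, w, min_rolling=60):
    """SSR is reported only if at least min_rolling rolling Sharpe
    observations (n = T - w + 1) can be constructed from T returns."""
    return T - w + 1 >= min_rolling
```

With the daily default, `has_enough_history(311, 252)` is the smallest history that clears the guard, since $311 - 252 + 1 = 60$.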
Mathematical Formulation
Mechanical autocorrelation from overlapping windows
Consecutive rolling windows share observations, creating mechanical overlap that induces strong serial dependence even when the underlying returns are i.i.d. For rolling means, Proposition 3.1 in the original paper shows that the autocorrelation function of the rolling sequence satisfies

$$\rho(h) = \max\!\left(1 - \frac{|h|}{w},\, 0\right).$$
Because the rolling Sharpe is a smooth function of the rolling mean and rolling standard deviation, the delta-method linearization

$$\widehat{SR}_t(w) \approx \frac{\mu}{\sigma} + \frac{1}{\sigma}\bigl(\hat\mu_t(w) - \mu\bigr) - \frac{\mu}{2\sigma^3}\bigl(\hat\sigma_t^2(w) - \sigma^2\bigr)$$

transfers the same first-order autocorrelation structure to $S_t$, with higher-order remainder terms vanishing as $w \to \infty$. For $h = 1$, this implies $\rho(1) \approx 1 - 1/w \approx 0.996$ at $w = 252$ — adjacent rolling Sharpe observations are nearly perfectly correlated by construction.
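The mechanical overlap is easy to verify by simulation: even for i.i.d. returns, the rolling-mean series exhibits the triangular autocorrelation profile $1 - h/w$ (a self-contained sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)     # i.i.d. returns: no genuine persistence
w = 252

# rolling means via cumulative sums: m[t] = mean(x[t:t+w])
c = np.concatenate(([0.0], np.cumsum(x)))
m = (c[w:] - c[:-w]) / w

def acf(s, h):
    """Sample autocorrelation of s at lag h."""
    d = s - s.mean()
    return float(np.dot(d[:-h], d[h:]) / np.dot(d, d))

# the empirical ACF tracks the triangular profile max(1 - h/w, 0)
for h in (1, 126, 252):
    print(h, acf(m, h), max(1 - h / w, 0))
```

At lag 1 the correlation sits near $1 - 1/252 \approx 0.996$; at lag $w$ the windows no longer overlap and the correlation collapses toward zero.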
Long-run variance and HAC estimation
The long-run variance of the rolling performance process accounts for all autocovariances:

$$\sigma_{LR}^2 = \gamma_0 + 2\sum_{h=1}^{\infty} \gamma_h, \qquad \gamma_h = \operatorname{Cov}(S_t, S_{t+h}).$$
The naive sample variance ignores the autocovariance terms and is severely anti-conservative under the overlap-induced dependence. We therefore use the Newey and West (1987) HAC estimator with a Bartlett kernel:

$$\hat\sigma_{LR}^2 = \hat\gamma_0 + 2\sum_{h=1}^{B} \left(1 - \frac{h}{B+1}\right)\hat\gamma_h.$$
The Bartlett kernel guarantees $\hat\sigma_{LR}^2 \ge 0$ and applies linearly declining weights to higher-order autocovariances. The bandwidth $B$ is selected by the Andrews (1991) data-driven procedure, which fits an AR(1) approximation to obtain

$$B^{\ast} = 1.1447\,\bigl(\hat\alpha(1)\, n\bigr)^{1/3}, \qquad \hat\alpha(1) = \frac{4\hat\rho^2}{(1 - \hat\rho^2)^2},$$

where $\hat\rho$ is the first-order autocorrelation of $\{S_t\}$. The plug-in bandwidth grows as $O(n^{1/3})$ in the textbook regime, slow enough to ensure consistency yet flexible enough to absorb both the mechanical overlap and any genuine serial dependence in the underlying returns.
Finite-sample cap. Because adjacent rolling windows share observations, the first-order autocorrelation of $\{S_t\}$ is mechanically close to one ($\hat\rho \approx 1 - 1/w \approx 0.996$ for daily data with $w = 252$). The plug-in expression $\hat\alpha(1) = 4\hat\rho^{2}/(1-\hat\rho^{2})^{2}$ then explodes — purely as an artifact of overlap, not of true persistence — and Andrews' AR(1) formula can return bandwidths in the hundreds. We therefore enforce a finite-sample cap

$$B = \min\bigl(B^{\ast},\, B_{\max}\bigr), \qquad B_{\max} \to \infty, \qquad B_{\max}/n \to 0,$$

keeping the selected lag in the consistency regime while preventing hundreds of noisy high-order autocovariances from dominating finite-sample SSR estimates. The cap is conservative — it permits more autocovariance absorption than Andrews' optimal $O(n^{1/3})$ rate, which is appropriate given the strong overlap-induced dependence we are trying to absorb.
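A compact sketch of the capped Newey-West estimator with the Andrews AR(1) plug-in follows; the default cap of $\lfloor\sqrt{n}\rfloor$ here is an illustrative choice consistent with the consistency regime, not necessarily the cap used in the backend:

```python
import numpy as np

def hac_long_run_var(s, b_cap=None):
    """Newey-West (Bartlett) long-run variance of the series s.

    Bandwidth: Andrews (1991) AR(1) plug-in, 1.1447 * (alpha1 * n)**(1/3),
    capped at b_cap (floor(sqrt(n)) by default -- an illustrative cap).
    Returns (long-run variance estimate, selected bandwidth B).
    """
    s = np.asarray(s, dtype=float)
    n = len(s)
    d = s - s.mean()
    rho = np.dot(d[:-1], d[1:]) / np.dot(d, d)        # lag-1 autocorrelation
    alpha1 = 4 * rho**2 / (1 - rho**2) ** 2           # AR(1) plug-in quantity
    b_star = int(np.ceil(1.1447 * (alpha1 * n) ** (1 / 3)))
    if b_cap is None:
        b_cap = int(np.sqrt(n))
    B = max(1, min(b_star, b_cap, n - 1))
    gamma0 = np.dot(d, d) / n
    lrv = gamma0
    for h in range(1, B + 1):
        gamma_h = np.dot(d[:-h], d[h:]) / n
        lrv += 2 * (1 - h / (B + 1)) * gamma_h        # Bartlett weights
    return lrv, B
```

On white noise the estimate reduces to roughly the sample variance, since the selected bandwidth is small and the weighted autocovariances are near zero.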
Sharpe Stability Ratio
Let $\mu_S = \mathbb{E}[S_t]$ denote the population mean of the rolling performance process and $\sigma_{LR}$ its long-run standard deviation. The structural Sharpe Stability Ratio is

$$\mathrm{SSR} = \frac{\mu_S}{\sigma_{LR}}.$$
For hypothesis testing relative to a required Sharpe benchmark $b$, the inferential (benchmark-centered) form is

$$\mathrm{SSR}_b = \frac{\mu_S - b}{\sigma_{LR}}.$$
The benchmark $b$ represents the required efficiency level. The choice $b = 0$ serves as the natural baseline (and, as shown in the elasticity analysis below, eliminates divergences in economically relevant regions). High SSR requires both (i) superior average performance relative to the benchmark and (ii) low temporal dispersion of rolling Sharpe ratios. Two strategies with identical ex-post $\widehat{SR}$ can display markedly different SSR values: a stable strategy attains high SSR; an episodic strategy whose rolling Sharpe oscillates sharply attains low SSR despite comparable aggregate performance.
Annualization convention. Our backend constructs $S_t$ on the annualized Sharpe scale by multiplying the per-period rolling Sharpe by $\sqrt{q}$. The user-supplied benchmark $b$ is therefore also interpreted on the annualized scale (e.g. $b = 0.5$ means "require an annualized Sharpe of 0.5"). The SSR ratio itself is invariant under this rescaling because numerator and denominator are scaled identically — so the t-statistic, p-value, and verdict are all unchanged. The annualization only fixes the scale on which $\mu_S$, $\sigma_{LR}$, and $b$ are reported.
What the API field ssr returns. The default ssr field on the Sharpe inference payload reports the benchmark-centered form $\mathrm{SSR}_b$, since this is the quantity the t-test is built on. The structural form is exposed in parallel as ssr_structural, and ssr_centered mirrors ssr for explicitness. When $b = 0$ the two forms coincide.
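Putting the pieces together, the three payload fields can be sketched as follows (the field names come from the description above; the helper itself is illustrative):

```python
import numpy as np

def ssr_payload(s, sigma_lr, b=0.0):
    """Structural and benchmark-centered SSR estimates from the rolling
    performance series s and its HAC long-run standard deviation sigma_lr."""
    mu_s = float(np.mean(s))
    centered = (mu_s - b) / sigma_lr
    return {
        "ssr": centered,                    # default: benchmark-centered form
        "ssr_structural": mu_s / sigma_lr,  # structural form (ignores b)
        "ssr_centered": centered,           # explicit alias of "ssr"
    }
```

With `b=0.0` all three fields carry the same value, matching the coincidence noted above.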
Statistical Inference
Asymptotic distribution
Under stationarity and mixing, Theorem 3.1 of the original paper establishes consistency of the HAC variance estimator,

$$\hat\sigma_{LR}^2 \xrightarrow{p} \sigma_{LR}^2,$$

and Theorem 3.2 establishes a Central Limit Theorem for the rolling mean,

$$\sqrt{n}\,\bigl(\bar S_n - \mu_S\bigr) \xrightarrow{d} \mathcal{N}\bigl(0, \sigma_{LR}^2\bigr), \qquad \bar S_n = \frac{1}{n}\sum_{t} S_t.$$

Combining the two yields the inferential statistic

$$t_n = \frac{\sqrt{n}\,\bigl(\bar S_n - b\bigr)}{\hat\sigma_{LR}}.$$
Under the boundary null $\mu_S = b$, Theorem 3.3 shows $t_n \xrightarrow{d} \mathcal{N}(0, 1)$; under the fixed alternative $\mu_S > b$, $t_n \to \infty$, so the test is consistent. The benchmark enters only through the null hypothesis; it does not affect the asymptotic variance, which is governed solely by the long-run variance of $S_t$.
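The asymptotic test reduces to a few lines; here is a stdlib-only sketch of $t_n$ and its one-sided standard-Normal p-value:

```python
import math

def ssr_asymptotic_test(s_bar, sigma_lr_hat, n, b=0.0):
    """One-sided test of H0: mu_S <= b via t_n = sqrt(n) (s_bar - b) / sigma_lr_hat.

    Returns (t_n, one-sided p-value).
    """
    t_n = math.sqrt(n) * (s_bar - b) / sigma_lr_hat
    # upper-tail standard-Normal p-value, 1 - Phi(t_n), computed via erfc
    p = 0.5 * math.erfc(t_n / math.sqrt(2))
    return t_n, p
```

For example, a mean annualized rolling Sharpe of 0.9 with long-run standard deviation 2.0 and $n = 500$ tested against $b = 0.5$ gives $t_n \approx 4.47$, deep in the rejection region.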
Moving-block bootstrap (finite-sample inference)
Asymptotic approximations may be unreliable under the strong autocorrelation induced by overlapping windows. The original paper therefore proposes a moving-block bootstrap (MBB) — Künsch (1989), Hall et al. (1995) — that resamples contiguous blocks of the original return series, reconstructs bootstrap return paths, and recomputes the test statistic $t_n^{\ast(r)}$ for each replicate $r = 1, \dots, R$. Block length is selected via the Politis and White (2004) automatic procedure, which adapts to the autocorrelation structure of each series. The bootstrap p-value for the one-sided test against $b$ is

$$\hat p = \frac{1}{R}\sum_{r=1}^{R} \mathbf{1}\bigl\{t_n^{\ast(r)} \ge t_n\bigr\},$$

with the bootstrap statistics recentred under the null.
Our current implementation reports the asymptotic-Normal p-value derived from $t_n$. The moving-block bootstrap is on the roadmap; the original paper recommends it for finite-sample inference under strong overlap-induced autocorrelation.
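For intuition, a minimal MBB sketch with a fixed block length (the paper uses the Politis-White automatic selector; the fixed `block_len`, the recentring convention, and the helper names here are all illustrative):

```python
import numpy as np

def mbb_pvalue(x, w, b=0.0, block_len=25, n_boot=200, seed=0):
    """Moving-block bootstrap p-value for H0: mu_S <= b (illustrative sketch).

    Resamples contiguous blocks of the returns x, rebuilds bootstrap paths,
    recomputes the mean rolling Sharpe on each, and compares the recentred
    bootstrap deviations with the observed exceedance over b.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    T = len(x)

    def mean_rolling_sharpe(r):
        n = len(r) - w + 1
        return float(np.mean([r[t:t + w].mean() / r[t:t + w].std(ddof=1)
                              for t in range(n)]))

    obs = mean_rolling_sharpe(x)
    n_blocks = -(-T // block_len)                   # ceil(T / block_len)
    hits = 0
    for _ in range(n_boot):
        starts = rng.integers(0, T - block_len + 1, size=n_blocks)
        path = np.concatenate([x[s:s + block_len] for s in starts])[:T]
        # null recentring: bootstrap deviation vs. observed exceedance of b
        if mean_rolling_sharpe(path) - obs >= obs - b:
            hits += 1
    return hits / n_boot
```

Block resampling preserves the within-block dependence of the returns, which is what makes the bootstrap valid under serial correlation.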
Three complementary one-sided tests
The original framework formalizes three distinct hypothesis tests, all implementable via either asymptotic or bootstrap p-values:
- Mean rolling performance vs. benchmark. $H_0\colon \mu_S \le b$ against $H_1\colon \mu_S > b$. Test statistic $t_n$ (asymptotic) or its bootstrap analogue $t_n^{\ast}$. This is the test exposed by our API; the verdict on the SSR card reports its one-sided $p$-value at $\alpha = 0.05$.
- SSR exceeds a stability threshold. $H_0\colon \mathrm{SSR} \le c$ against $H_1\colon \mathrm{SSR} > c$ for a user-chosen threshold $c$.
- Pairwise stability comparison. $H_0\colon \mathrm{SSR}^{(A)} \le \mathrm{SSR}^{(B)}$ against $H_1\colon \mathrm{SSR}^{(A)} > \mathrm{SSR}^{(B)}$, with a joint block bootstrap that preserves cross-sectional dependence between the two return series.
Sensitivity & Economic Channels
Treating $\mu_S$, $b$, and $\sigma_{LR}$ as continuously varying, SSR admits clean partial derivatives that map to three distinct economic channels.
Partial derivatives
$$\frac{\partial\,\mathrm{SSR}_b}{\partial \mu_S} = \frac{1}{\sigma_{LR}}, \qquad \frac{\partial\,\mathrm{SSR}_b}{\partial b} = -\frac{1}{\sigma_{LR}}, \qquad \frac{\partial\,\mathrm{SSR}_b}{\partial \sigma_{LR}} = -\frac{\mu_S - b}{\sigma_{LR}^2}.$$

Improvements in average performance translate into SSR gains in proportion to the inverse of temporal dispersion: when instability is low, temporal stability acts as a performance amplifier. Reducing $\sigma_{LR}$ raises SSR nonlinearly, so high-SSR strategies benefit more in absolute terms from stability improvements than low-SSR strategies.
Total differential
The total differential decomposes SSR changes into three economic channels:

$$d\,\mathrm{SSR}_b = \underbrace{\frac{d\mu_S}{\sigma_{LR}}}_{\text{performance}} \;-\; \underbrace{\frac{db}{\sigma_{LR}}}_{\text{exigence}} \;-\; \underbrace{\frac{\mu_S - b}{\sigma_{LR}^2}\,d\sigma_{LR}}_{\text{stability}}.$$

The performance channel reflects improvements in average rolling Sharpe; the exigence channel captures changes in the required benchmark; the stability channel reflects variations in temporal dispersion.
Elasticities
The elasticities of $\mathrm{SSR}_b$ with respect to its three arguments are

$$\varepsilon_{\mu_S} = \frac{\mu_S}{\mu_S - b}, \qquad \varepsilon_{b} = -\frac{b}{\mu_S - b}, \qquad \varepsilon_{\sigma_{LR}} = -1.$$

The unit-elasticity result for $\sigma_{LR}$ is striking: a 1% increase in long-run temporal dispersion produces a 1% decrease in SSR regardless of $\mu_S$ or $b$. SSR penalizes instability in a scale-invariant manner. The performance and benchmark elasticities both diverge as $\mu_S \to b$, which motivates the choice $b = 0$ in empirical applications: at zero benchmark, $\varepsilon_{\mu_S} = 1$ and $\varepsilon_{b} = 0$, eliminating divergence in economically relevant regions.
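The unit elasticity with respect to long-run dispersion can be checked numerically in a few lines (the parameter values are arbitrary):

```python
# Finite-difference check: a 1% rise in long-run dispersion lowers SSR by ~1%.
mu_s, b, sigma_lr = 0.8, 0.0, 0.5

def ssr(m, bench, s):
    return (m - bench) / s

eps = 1e-6
rel_change = (ssr(mu_s, b, sigma_lr * (1 + eps)) - ssr(mu_s, b, sigma_lr)) \
             / ssr(mu_s, b, sigma_lr)
elasticity = rel_change / eps   # close to -1 for any mu_s, b, sigma_lr
```

The same check with any other $(\mu_S, b, \sigma_{LR})$ triple returns the same value, which is exactly the scale-invariance claim.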
Iso-SSR curves: the performance–stability trade-off
For a fixed benchmark $b$, level sets of SSR satisfy

$$\mu_S = b + k\,\sigma_{LR},$$

where $k$ is the SSR level. Iso-SSR curves are upward-sloping straight lines in $(\sigma_{LR}, \mu_S)$ space with slope $k$: an increase in temporal instability of one unit must be compensated by $k$ additional units of average rolling Sharpe to maintain constant SSR. This is the SSR counterpart of the mean–variance trade-off encoded in the classical SR.
How SSR Sits Among SR, PSR, and DSR
The four metrics target distinct dimensions of Sharpe-ratio inference and are designed to be used together. They answer different questions and apply different adjustments.
| Metric | Question | Dimension | Adjustment |
|---|---|---|---|
| SR | Ex-post risk-adjusted return? | Point-in-time level | Mean–variance trade-off |
| PSR | Statistically credible at this sample? | Point-in-time level | Higher moments + effective sample size |
| DSR | Did I cherry-pick from many trials? | Cross-sectional selection bias | Number of trials + cross-strategy dependence |
| SSR | Stable across time? | Time-series consistency | HAC-robust temporal dispersion |
A strategy can score well on SR, PSR, and DSR yet exhibit low SSR — the aggregate looks credible while the time series is lumpy. Conversely, a strategy with high SSR but low DSR may be temporally consistent yet cross-sectionally over-fit. The four dimensions are complementary; SSR fills the unconditional temporal-stability axis that none of the others address.
Interpretation
SSR is dimensionless. Both numerator and denominator are expressed in units of the rolling Sharpe $S_t$, so the ratio itself does not depend on annualization or observation frequency. Inference is performed on the studentized statistic $t_n$, which under the null is asymptotically standard Normal — so SSR magnitudes should be interpreted alongside the t-statistic and one-sided p-value rather than against absolute thresholds borrowed from the SR literature.
Our application surfaces a verdict on the inferential test against $b$:
| One-sided p-value | Verdict | Interpretation |
|---|---|---|
| $p < 0.05$ | Asymptotically stable at 95% confidence | Mean rolling Sharpe is significantly above the benchmark at the 95% level under the asymptotic-Normal HAC test. |
| $0.05 \le p < 0.10$ | Borderline asymptotic stability | Some evidence of persistence, but not at conventional significance. |
| $p \ge 0.10$ | Episodic under asymptotic test | Rolling Sharpe oscillations are too large to attribute aggregate performance to persistent skill at the asymptotic level. |
| $n$ too small | Insufficient data | Rolling sequence too short for reliable HAC inference. |
These p-value thresholds are application-level conventions, not part of the original paper. We require at least 60 rolling Sharpe observations for the asymptotic theory to be reliable, which translates into $T \ge w + 59$ underlying observations: roughly 311 trading days (~15 months) for the daily default, or 95 months (~8 years) for the monthly default. Below this, the SSR card displays an "Insufficient data" verdict rather than an unreliable estimate.
Advantages & Limitations
Advantages
- Distinguishes skill from luck over time: separates genuinely stable risk-adjusted performance from aggregate Sharpe ratios driven by short-lived favourable episodes.
- Orthogonal to PSR and DSR: evaluates a temporal dimension that point-in-time credibility and selection-bias corrections cannot capture.
- Valid inference under serial correlation: the HAC long-run variance handles both mechanical overlap from rolling windows and genuine persistence in returns.
- Scale-invariant penalisation: unit elasticity with respect to $\sigma_{LR}$ means SSR reacts symmetrically to instability across all magnitude ranges.
- Closed-form asymptotic test: standard-Normal test statistic, no parametric model required for the underlying returns.
- Cheap to compute: a single pass over autocovariances with Andrews bandwidth selection adds negligible overhead beyond the rolling Sharpe series itself.
Limitations
- Window choice matters: shorter $w$ produces noisier rolling Sharpe estimates; longer $w$ smooths over potentially informative regime changes.
- Requires sufficient history: asymptotic results are stated as $n \to \infty$; SSR is unreliable for back-tests with fewer than ~60 rolling observations.
- Cannot identify regime shifts: SSR is an unconditional stability metric and complements — rather than replaces — conditional, regime-switching performance models.
- Bandwidth sensitivity: different HAC bandwidth rules can shift inference under heavy tails or near-unit-root persistence; we use Andrews (1991) AR(1) plug-in by default.
- Not a substitute for selection-bias correction: a high-SSR strategy can still be the result of running thousands of trials. SSR should be paired with DSR when a strategy was selected from many candidates.
- Asymptotic vs. bootstrap inference: the original paper recommends moving-block bootstrap for finite-sample inference; our current implementation reports the asymptotic-Normal p-value, with bootstrap support on the roadmap.
References
- Bajo Traver, M., & Rodríguez Domínguez, A. (2026). "The Sharpe Stability Ratio: Temporal Consistency of Risk-Adjusted Performance." SSRN Working Paper No. 6344658. ssrn:6344658.
- Newey, W. K., & West, K. D. (1987). "A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix." Econometrica, 55(3), 703-708. doi:10.2307/1913610.
- Andrews, D. W. K. (1991). "Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation." Econometrica, 59(3), 817-858. doi:10.2307/2938229.
- Künsch, H. R. (1989). "The Jackknife and the Bootstrap for General Stationary Observations." The Annals of Statistics, 17(3), 1217-1241. doi:10.1214/aos/1176347265.
- Politis, D. N., & White, H. (2004). "Automatic Block-Length Selection for the Dependent Bootstrap." Econometric Reviews, 23(1), 53-70. doi:10.1081/ETC-120028836.
- Lo, A. W. (2002). "The Statistics of Sharpe Ratios." Financial Analysts Journal, 58(4), 36-52. doi:10.2469/faj.v58.n4.2453.
- Bailey, D. H., & López de Prado, M. (2012). "The Sharpe Ratio Efficient Frontier." Journal of Risk, 15(2), 3-44. doi:10.21314/JOR.2012.255.
- Bailey, D. H., & López de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality." The Journal of Portfolio Management, 40(5), 94-107. doi:10.3905/jpm.2014.40.5.094.
- Sharpe, W. F. (1994). "The Sharpe Ratio." The Journal of Portfolio Management, 21(1), 49-58. doi:10.3905/jpm.1994.409501.