How to Read This Plot
Left column (Type I error — H₀ is true). Ideal behaviour is a flat line at or below 5% (dashed). The y-axis is zoomed to 25% so that inflation above the nominal level is visible. Any curve that climbs above the dashed line is producing too many false alarms. The p-value (red) drifts upward with successive looks — reflecting optional-stopping inflation — while the BF criterion (green) stays more conservative. The shaded ribbons are 95% Wilson binomial confidence intervals around each estimated rate; because only 50 simulations were run, those ribbons are wide and should be read as indicating the direction of the effect rather than its precise magnitude.
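The ribbon computation is easy to reproduce. Here is a minimal sketch of the Wilson score interval; the function name `wilson_ci` and the count of 3 false alarms in 50 simulations are illustrative assumptions, not values read off the plot:

```python
import math

def wilson_ci(k, n, z=1.96):
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# With only 50 simulations, even a near-nominal observed rate has a wide ribbon:
lo, hi = wilson_ci(k=3, n=50)   # 3/50 = 6% observed Type I rate (hypothetical)
# The interval spans roughly 2% to 16% -- direction, not magnitude.
```

The Wilson interval is preferred over the naive Wald interval here because it behaves sensibly at small n and at proportions near 0, exactly the regime of a Type I error rate estimated from 50 runs.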
Why do red and green look almost the same in both left panels? The p-value and the BIC-based BF both work by comparing how much better the model fits relative to its own noise estimate — they are based on a likelihood ratio, which is self-calibrating.
The LRT statistic is:
\[\Lambda = -2(\ell_0 - \ell_1) \xrightarrow{\;H_0\;} \chi^2(1)\]
In a normal-errors model, the log-likelihood difference simplifies to:
\[\ell_1 - \ell_0 = \frac{N}{2}\log\!\frac{\hat\sigma_0^2}{\hat\sigma_1^2}\]
Under H₀ the condition adds nothing, so \(\hat\sigma_1^2 \approx \hat\sigma_0^2\) and the ratio \(\approx 1\) regardless of the absolute size of \(\sigma\). Scaling the noise scales both variance estimates by the same factor (doubling \(\sigma\) multiplies each by four), and that factor cancels in the ratio. This is why the \(\chi^2(1)\) null distribution, and hence the nominal 5% Type I rate, holds for any \(\sigma\).
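The cancellation can be checked numerically. A minimal sketch with a hand-rolled OLS fit under H₀ (the variable names, seed, and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 60
x = rng.integers(0, 2, N).astype(float)   # binary condition indicator
y = rng.normal(0.0, 1.0, N)               # H0: condition has no effect

def lrt_stat(y, x):
    """Lambda = N * log(sigma0_hat^2 / sigma1_hat^2) for a normal-errors
    regression of y on x (equivalent to -2 * (ell0 - ell1))."""
    s0 = np.mean((y - y.mean())**2)                  # null model: intercept only
    b = np.cov(x, y, bias=True)[0, 1] / np.var(x)    # OLS slope
    a = y.mean() - b * x.mean()
    s1 = np.mean((y - (a + b * x))**2)               # alternative: + condition
    return len(y) * np.log(s0 / s1)

# Doubling the noise scale multiplies both variance estimates by four;
# the ratio, and hence the LRT statistic, is unchanged:
assert np.isclose(lrt_stat(y, x), lrt_stat(2 * y, x))
```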
The BIC-based BF inherits the same property. With one extra fixed-effect parameter in M₁ (\(k_1 = k_0 + 1\)):
\[\text{BF}_{10}^{\text{BIC}} = \exp\!\left(\frac{\text{BIC}_0 - \text{BIC}_1}{2}\right) = \exp\!\left(\ell_1 - \ell_0 - \tfrac{1}{2}\log n\right)\]
The only new term is the fixed penalty \(-\tfrac{1}{2}\log n\), which shifts the distribution of \(\log\text{BF}_{10}\) by a constant — it does not reintroduce any dependence on \(\sigma\).
So Type I error rates for red and green are approximately noise-independent by construction.
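In code the whole criterion is one line. A sketch of the formula above; the log-likelihood difference of 3.0 is an arbitrary illustrative value:

```python
import math

def bf10_bic(loglik_diff, n):
    """BIC-approximate BF10 for one extra fixed-effect parameter (k1 = k0 + 1):
    exp(ell1 - ell0 - (1/2) * log n)."""
    return math.exp(loglik_diff - 0.5 * math.log(n))

# The same fit advantage buys less evidence at larger n:
bf_small = bf10_bic(loglik_diff=3.0, n=30)    # exp(3) / sqrt(30)  ≈ 3.7
bf_large = bf10_bic(loglik_diff=3.0, n=120)   # exp(3) / sqrt(120) ≈ 1.8
```

Quadrupling n halves the BF for a fixed log-likelihood advantage, since the penalty enters as \(1/\sqrt{n}\).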
Why does the red line climb with N but green and purple stay flat (optional-stopping robustness)? This is the central practical difference between the three methods.
Red (p-value) has no memory and no brake. Each new checkpoint is another independent opportunity to cross \(p < .05\). Using the Bonferroni bound as an upper limit, the probability of ever rejecting across \(K\) looks is at most:
\[\Pr(\text{ever } p < \alpha \mid H_0) \leq K\alpha\]
With \(K = 10\) looks at \(\alpha = .05\) the bound is already 50%; in practice the rate is somewhat lower because the looks use cumulative, correlated data, but the direction is clear. There is nothing in the p-value that pushes back against repeated testing.
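The inflation is easy to reproduce in simulation. A minimal sketch using a z-test on cumulative data with known \(\sigma = 1\); the look schedule mirrors the one used in this study, but the one-sample z-test is a simplified stand-in, not the mixed model behind the plot:

```python
import math
import numpy as np

rng = np.random.default_rng(1)
looks = [30, 45, 60, 75, 90, 105, 120]   # checkpoint schedule from the study
sims, alpha = 2000, 0.05

def pval(z):
    """Two-sided p-value for a z statistic (known sigma = 1)."""
    return math.erfc(abs(z) / math.sqrt(2))

ever_reject = 0
for _ in range(sims):
    y = rng.normal(0.0, 1.0, looks[-1])   # H0 is true
    # Optional stopping: reject if p < .05 at ANY look over cumulative data.
    if any(pval(y[:n].mean() * math.sqrt(n)) < alpha for n in looks):
        ever_reject += 1

rate = ever_reject / sims   # noticeably above the nominal 5%
```

Because the looks share data, the realized rate sits well below the \(K\alpha\) bound, but well above 5%, which is exactly the upward drift of the red curve.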
Green (BIC-based BF) has a built-in growing penalty. Recall:
\[\text{BF}_{10}^{\text{BIC}} = \exp\!\left(\ell_1 - \ell_0 - \frac{1}{2}\log n\right)\]
The threshold to reach \(\text{BF}_{10} > 10\) is \(\ell_1 - \ell_0 > \log(10) + \tfrac{1}{2}\log n\). As \(n\) increases, this bar rises by \(\tfrac{1}{2}\log n\). So the BIC penalty acts as an automatic correction for multiple looks: you need progressively stronger evidence to clear the threshold at larger N, which counteracts the extra opportunities provided by sequential testing.
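Concretely, across the checkpoints used in this study, the evidence bar rises like this (a sketch; the function name `evidence_bar` is illustrative):

```python
import math

def evidence_bar(n, bf_threshold=10.0):
    """Log-likelihood advantage needed for BF10 > bf_threshold under the
    BIC approximation: ell1 - ell0 > log(threshold) + (1/2) * log n."""
    return math.log(bf_threshold) + 0.5 * math.log(n)

bars = {n: evidence_bar(n) for n in (30, 45, 60, 75, 90, 105, 120)}
# Going from N = 30 to N = 120 raises the bar by (1/2) * log 4 = log 2 ≈ 0.69.
```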
Purple (brms Savage-Dickey) is protected by coherent Bayesian updating. Under H₀ and low noise, each new batch of data shifts the posterior further toward zero and makes it narrower — the density at zero in the denominator of \(\text{BF}_{10}\) increases rather than decreases. More data therefore drives the BF away from a false positive, not toward one. This is the sense in which Bayesian methods are often described as “immune to optional stopping” — provided the model and prior are correctly specified. The Savage-Dickey BF compares two absolute densities:
\[\text{BF}_{10} = \frac{p(\beta = 0)}{p(\beta = 0 \mid \text{data})} = \frac{\text{prior density at zero}}{\text{posterior density at zero}}\]
The numerator is fixed by the prior — Normal(0, 1) gives a density of 0.40 at zero no matter what. The denominator is how tall the posterior is at zero. Under low noise, data are precise and the posterior is narrow and tall, so the density at zero is high → BF₁₀ stays small even by chance. Under high noise, data are imprecise, the posterior spreads out and flattens, so the density at zero drops — even when the true effect is zero. This spuriously inflates BF₁₀ above 10 more often, raising the false-positive rate. In short: the Savage-Dickey ratio is not self-calibrating with respect to noise — it inherits the sensitivity of the prior width relative to posterior precision, which is a known limitation when the prior is wide relative to the data-generating scale.
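The noise sensitivity can be made concrete with a conjugate-normal sketch. This is an analytic stand-in, not the brms model from the plot: the function `sd_bf10`, the Normal(0, 1) prior, and the standard errors below are illustrative assumptions:

```python
import math

def sd_bf10(beta_hat, se, prior_sd=1.0):
    """Savage-Dickey BF10 for a Normal(0, prior_sd^2) prior on beta and a
    normal likelihood for the estimate beta_hat with standard error se."""
    post_var = 1.0 / (1.0 / prior_sd**2 + 1.0 / se**2)
    post_mean = post_var * beta_hat / se**2
    def npdf(x, mean, var):
        return math.exp(-(x - mean)**2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    # prior density at zero (1/sqrt(2*pi) ≈ 0.40) over posterior density at zero
    return npdf(0.0, 0.0, prior_sd**2) / npdf(0.0, post_mean, post_var)

# The same 2-SE chance fluctuation under H0, at two noise levels:
low_noise  = sd_bf10(beta_hat=0.2, se=0.1)   # precise data: BF10 ≈ 0.7
high_noise = sd_bf10(beta_hat=1.0, se=0.5)   # noisy data:   BF10 ≈ 2.2
```

With precise data the posterior stays tall at zero and the same standardized fluctuation yields a small BF₁₀; with noisy data the posterior flattens and the identical fluctuation is scored as evidence for an effect. This is why the purple curve's Type I error depends on \(\sigma\) while red and green do not.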
Right column (Power — H₁ is true). Here we want the curves to rise toward 100%. Both methods accumulate power as N grows, but the climb is steeper under low noise (top row) than under high noise (bottom row), where even N = 60 leaves power well short of 80%. This is the practical message: more noise requires more data to achieve the same power, regardless of the decision criterion.
Key take-aways for this study:
- In a noisy psycholinguistic task the effect size relative to residual variance determines the minimum viable N — not the sample-size formula for a simple t-test.
- A Bayes Factor criterion controls false positives more strictly than α = .05, but at the cost of needing more data to build up power.
- Sequential designs exploit this by checking at multiple checkpoints (N = 30, 45, 60, 75, 90, 105, 120 here) rather than committing to a fixed N — stopping early when evidence is already decisive, and continuing only when it is not.