Compiler: Microsoft (R) C/C++ Optimizing Compiler Version 19.39.33521 for x64 (Visual Studio 2022 17.9, x64 host/target).
Cross-tested on Compiler Explorer (godbolt.org) across MSVC toolsets:
| Toolset | VS version | /O1 | /O2 |
|---|---|---|---|
| 19.29 | VS 16.11 | ✓ | ✓ |
| 19.30 | VS 17.0 | ✓ | ✓ |
| 19.31 | VS 17.1 | ✓ | ✓ |
| 19.32 | VS 17.2 | ✓ | ✓ |
| 19.33 | VS 17.3 | ✓ | ✗ BUG introduced |
| 19.34–19.39 | VS 17.4 – 17.9 | ✓ | ✗ |
| 19.40 | VS 17.10 | ✓ | ✓ FIXED |
| 19.41–19.44 | VS 17.11 – 17.14 | ✓ | ✓ |
| 19.50 | VS 18.0 (preview) | ✓ | ✓ |
The bug shipped in seven consecutive toolset minor versions, 19.33 through 19.39 (VS 17.3 through VS 17.9), over roughly two years, and was fixed in 19.40 (VS 17.10). The fix appears to be silent: I have not located a release note that describes it.
/O1 produces correct output on every tested version.
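For code that has to build on the affected toolsets, the table above can be turned into a version check. The sketch below assumes the usual mapping from toolset version 19.xx to _MSC_VER 19xx (so 19.33 is 1933 and 19.40 is 1940). The helper name msvc_o2_gather_bug_affected and the macro MSVC_O2_GATHER_BUG_SUSPECT are hypothetical, and the range covers only the versions tested in this report.

```c
#include <stdbool.h>

/* Hypothetical helper: true when an _MSC_VER value falls in the range this
   report found affected under /O2 (19.33 through 19.39, i.e. 1933..1939). */
bool msvc_o2_gather_bug_affected(int msc_ver)
{
    return msc_ver >= 1933 && msc_ver <= 1939;
}

/* Compile-time form of the same check (hypothetical macro name).
   Evaluates to 0 on non-MSVC compilers, where _MSC_VER is undefined. */
#if defined(_MSC_VER) && _MSC_VER >= 1933 && _MSC_VER <= 1939
#define MSVC_O2_GATHER_BUG_SUSPECT 1
#else
#define MSVC_O2_GATHER_BUG_SUSPECT 0
#endif
```

A build could use the macro to select /O1 for the affected translation unit or to enable a runtime self-check. Which action to take is a project decision that this report does not prescribe.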
A 16-line C program produces different output under cl /O2 and cl /O1. The
/O1 build is correct; /O2 is wrong.
> cl /O2 /nologo msvc_repro.c /Fe:O2.exe
> cl /O1 /nologo msvc_repro.c /Fe:O1.exe
> O2.exe
-2.000000 <-- BUG
> O1.exe
0.000000 <-- correct
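Math: du sums to 0 on its own, because the (q == j) and (r == k) sign factors
cancel over the four (q, r) pairs. The dv contributions from p == i and
p == i+1 are equal in magnitude (2.0 each) and opposite in sign (a flips), so
they cancel as well. The correct result is 0.
/O2 drops the p == i half of the gather, leaving -2.0 from p == i+1 only.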
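The cancellation can be checked independently of the optimizer with a flat, per-p tally. The sketch below is not part of msvc_repro.c: gather_contribution and print_gather_tally are hypothetical helpers, and they assume the repro's setup (i = j = k = 1, s[2] = 1, all other s entries zero).

```c
#include <stdio.h>

/* Hypothetical helper, not from msvc_repro.c: sum of the du and dv terms
   contributed by a single value of p, with i = j = k = 1 and s = {0, 0, 1}. */
double gather_contribution(int p)
{
    const int i = 1, j = 1, k = 1;
    const double s[3] = {0.0, 0.0, 1.0};
    double du = 0.0, dv = 0.0;
    double a = (p == i) ? 1.0 : -1.0;

    for (int q = j; q <= j + 1; q++) {
        for (int r = k; r <= k + 1; r++) {
            du += ((q == j) ? 1.0 : -1.0) * ((r == k) ? 1.0 : -1.0);
            dv += a * s[r];
        }
    }
    return du + dv;
}

/* Prints the two halves of the gather and their sum. */
void print_gather_tally(void)
{
    double at_i  = gather_contribution(1);   /* p == i   */
    double at_i1 = gather_contribution(2);   /* p == i+1 */
    printf("p == i: %f, p == i+1: %f, total: %f\n", at_i, at_i1, at_i + at_i1);
}
```

The p == i+1 half on its own is -2, which matches the value printed by the /O2 build.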
See msvc_repro.c. The body is (only relevant lines):

    float s[3], du, dv;
    int main(void) {
        s[2] = 1;
        for (int i = 1; i <= 1; i++)
        for (int j = 1; j <= 1; j++)
        for (int k = 1; k <= 1; k++)
        for (int p = i; p <= i+1; p++)
        for (int q = j; q <= j+1; q++)
        for (int r = k; r <= k+1; r++) {
            float a = (p == i) ? 1.f : -1.f;
            du += ((q == j) ? 1.f : -1.f) * ((r == k) ? 1.f : -1.f);
            dv += a * s[r];
        }
        printf("%f\n", du + dv);
        return 0;
    }

/O2 splits the inner gather into two code paths:
- $LN21: vectorized SSE path. Accumulates into xmm6, which packs du (low) and dv (next slot).
- $LL44: scalar fallback for the p == i case. Updates scalar registers xmm7 (dv) and xmm8 (du) instead of xmm6.

Branching between them (msvc_repro_O2.asm:144-145):

    cmp   esi, 1              ; esi = p, constant-folded i = 1
    jne   SHORT $LN21@main    ; p != i: take vectorized path
                              ; (fall through to $LL44 when p == i)

Final result is read from xmm6 only (msvc_repro_O2.asm:340-348):

    movaps  xmm0, xmm6
    movss   DWORD PTR du, xmm6
    shufps  xmm0, xmm6, 85
    movss   DWORD PTR dv, xmm0
    addss   xmm0, xmm6
    ...
    call    printf

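The immediate 85 in the shufps is 0x55, which selects lane 1 for every output lane, so after the shuffle the low lane of xmm0 holds dv. The sketch below models that readout in portable C rather than with intrinsics. The vec4 type and the functions shufps_model and epilogue_sum_model are illustrative names, not part of the generated code, and the upper two lanes of xmm6 are set to zero purely as an assumption because the listing does not show them.

```c
/* Illustrative model of SSE "SHUFPS dst, src, imm8" on four float lanes. */
typedef struct { float lane[4]; } vec4;

vec4 shufps_model(vec4 dst, vec4 src, unsigned imm8)
{
    vec4 out;
    out.lane[0] = dst.lane[imm8 & 3u];          /* low two lanes come from dst  */
    out.lane[1] = dst.lane[(imm8 >> 2) & 3u];
    out.lane[2] = src.lane[(imm8 >> 4) & 3u];   /* high two lanes come from src */
    out.lane[3] = src.lane[(imm8 >> 6) & 3u];
    return out;
}

/* Mirrors the epilogue: xmm6 = {du, dv, (assumed 0), (assumed 0)}.
   Returns what addss leaves in the low lane of xmm0, i.e. dv + du. */
float epilogue_sum_model(float du, float dv)
{
    vec4 xmm6 = {{du, dv, 0.0f, 0.0f}};
    vec4 xmm0 = xmm6;                           /* movaps xmm0, xmm6     */
    xmm0 = shufps_model(xmm0, xmm6, 85u);       /* shufps xmm0, xmm6, 85 */
    return xmm0.lane[0] + xmm6.lane[0];         /* addss  xmm0, xmm6     */
}
```

With du = 0 and dv = -2 in xmm6, the model returns -2, the value the /O2 build prints.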
xmm7 and xmm8 are reloaded from the prologue spill on exit and never
folded back into the printed result. Every iteration with p == i runs
the scalar path, accumulates into xmm7/xmm8, and that work is discarded.
The contribution from p == i+1 (vectorized path → xmm6) is the only thing
that survives, hence -2.0.
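As a mental model only (this is hand-written C, not decompiled output), the split can be pictured as two sets of accumulators of which only one is read at the end. The names gather_split_model, packed, side_du, side_dv and the merge_side flag are hypothetical. Passing merge_side = 1 shows what folding the scalar registers back in would produce.

```c
/* Hand-written model of the /O2 control flow described above.
   packed[0] and packed[1] stand in for the du and dv lanes of xmm6;
   side_du and side_dv stand in for xmm8 and xmm7. */
float gather_split_model(int merge_side)
{
    const int i = 1, j = 1, k = 1;
    const float s[3] = {0.0f, 0.0f, 1.0f};
    float packed[2] = {0.0f, 0.0f};
    float side_du = 0.0f, side_dv = 0.0f;

    for (int p = i; p <= i + 1; p++)
    for (int q = j; q <= j + 1; q++)
    for (int r = k; r <= k + 1; r++) {
        float a = (p == i) ? 1.0f : -1.0f;
        float t = ((q == j) ? 1.0f : -1.0f) * ((r == k) ? 1.0f : -1.0f);
        if (p != i) {                 /* jne taken: "vectorized" path */
            packed[0] += t;
            packed[1] += a * s[r];
        } else {                      /* fall-through: scalar path    */
            side_du += t;
            side_dv += a * s[r];
        }
    }

    if (merge_side) {                 /* the step the /O2 build lacks */
        packed[0] += side_du;
        packed[1] += side_dv;
    }
    return packed[0] + packed[1];
}
```

gather_split_model(0) returns -2 and gather_split_model(1) returns 0, matching the /O2 and /O1 outputs respectively.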
/O1 emits a single, clean nested-loop structure with proper value-select
branches for the ternary. No vectorized/scalar split. See msvc_repro_O1.asm.
The repro was reduced from a ~95-line tile-init kernel to 16 lines. Each of the following changes makes the bug disappear:
| Change | Result |
|---|---|
| 2 outer loops instead of 3 | bug gone |
| 2 inner loops instead of 3 | bug gone |
| Inner range != 2 (e.g. <= i+2) | bug gone |
| Replace ternary with arithmetic 1 - 2*(p-i) | bug gone |
| Inline the a ternary directly into dv += ... | bug gone |
| Remove the second accumulator (du) | bug gone |
| Index s[] with outer var (s[i]) or constant | bug gone |
| /O1, /Od | correct |
| /O2 /Qvec- (equivalent unavailable in MSVC) | n/a |

So the trigger requires the full combination: 3+3 nested loops, inner loops of
the form inner = outer; inner <= outer+1, a named-local ternary sign, at least
two distinct float accumulators with different sign factors, and an array index
that is one of the inner loop variables.
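One of the rows above, replacing the ternary with 1 - 2*(p-i), is easy to apply at source level because the two forms agree on the only values p takes (i and i+1). The sketch below is a variant of the repro with that substitution. The function name gather_arith_sign is hypothetical, and it uses locals instead of the repro's globals so that it is self-contained. Whether the substitution sidesteps the miscompile on a given toolset rests on the table's observation, not on anything this snippet can prove.

```c
/* Variant of the repro using an arithmetic sign instead of the ternary.
   For p == i the sign is +1; for p == i+1 it is -1. */
float gather_arith_sign(void)
{
    float s[3] = {0.0f, 0.0f, 1.0f};
    float du = 0.0f, dv = 0.0f;

    for (int i = 1; i <= 1; i++)
    for (int j = 1; j <= 1; j++)
    for (int k = 1; k <= 1; k++)
    for (int p = i; p <= i + 1; p++)
    for (int q = j; q <= j + 1; q++)
    for (int r = k; r <= k + 1; r++) {
        float a = (float)(1 - 2 * (p - i));   /* replaces (p == i) ? 1.f : -1.f */
        du += ((q == j) ? 1.f : -1.f) * ((r == k) ? 1.f : -1.f);
        dv += a * s[r];
    }
    return du + dv;
}
```

On a correct build the function returns 0, the same value the /O1 build of the original repro prints.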
Single-stepping cl /O2 /Zi builds with cdb confirmed:
- The je/jne branch at msvc_repro_O2.asm:144-145 always matches p == 1 (= constant-folded i) on the first inner iteration, every outer iteration.
- Setting a breakpoint at the vectorized path's first array load showed it never executes when p == 1.
- Patching the je to nops with eb recovered a non-zero result for the -x (i.e. p == i) configuration of the original 95-line program, confirming the spurious branch was the source of the lost work.

- REPORT.md: this file
- msvc_repro.c: 16-line repro
- msvc_repro_O2.asm: cl /O2 /Fa /nologo msvc_repro.c
- msvc_repro_O1.asm: cl /O1 /Fa /nologo msvc_repro.c

Hit while debugging a numerical kernel: a B^TB curl-gather iteration that
should be mirror-symmetric under coordinate reflection produced asymmetric
results. The -x boundary forcing case stagnated at iter 0 with
max_correction = 0, while the mirror-image +x case converged normally.
The diagnosis above explains exactly that pattern: the kernel had the
3+3 loop structure with the named ternary sign, and one half of the gather
was silently dropped.