LLVM no longer honours -stack-alignment=4 #21809

Closed
llvmbot opened this issue Oct 31, 2014 · 24 comments
Labels
backend:X86, bugzilla (Issues migrated from bugzilla)

Comments


llvmbot commented Oct 31, 2014

Bugzilla Link: 21435
Resolution: FIXED
Resolved on: Nov 17, 2015 16:49
Version: trunk
OS: All
Attachments: bug1347703.ll
Reporter: LLVM Bugzilla Contributor
CC: @asl, @rnk

Extended Description

LLVM stopped honouring -stack-alignment=4.

This can easily be reproduced with the attached bug1347703.ll doing

$ llc -o - -mtriple=i686-pc-mingw32 -mattr=+avx -stack-alignment=4 bug1347703.ll | grep '<and.*[er]sp'
$

which produces nothing, showing that the stack pointer is never aligned.

However if one disables frame-pointer elimination then it works again:

$ /home/jfonseca/work/vmware/llvm/llvm/build/linux-x86_64/Debug+Asserts/bin/llc -disable-fp-elim -o - -mtriple=i686-pc-mingw32 -mattr=+avx -stack-alignment=4 bug1347703.ll | grep '<and.*[er]sp'
andl $-32, %esp
andl $-32, %esp

$

This doesn't seem like the right behavior -- using the frame pointer to align the stack is an implementation detail; there are probably other ways, and if the frame pointer is indeed necessary it should be employed automatically.

This regression happened between LLVM release 3.2 and release 3.3. I strongly suspect r168627 -- llvm-mirror/llvm@1243922


rnk commented Oct 31, 2014

Based on the input IR and what I've seen of the backend, this looks like some kind of phase ordering problem.

The input test case has no allocas, so we don't think we need to realign the stack for any highly aligned allocations. Then register allocation (or vector legalization) runs, and we are forced to spill some large vector values. This introduces a highly aligned stack allocation, but by then it's too late: we can't go back and add the realignment prologue.
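
For illustration -- a minimal sketch with placeholder names, not the attached test case -- IR of roughly this shape runs into the problem: no allocas, but enough live 256-bit values that the allocator has to create 32-byte spill slots after the frame decisions were already made:

```llvm
; Sketch only (hypothetical function, not from bug1347703.ll): there are no
; allocas, but with enough simultaneously live <8 x float> values the register
; allocator must spill YMM registers, introducing 32-byte-aligned stack slots
; after the "do we need to realign?" decision has already been taken.
define <8 x float> @spill_pressure(<8 x float> %a, <8 x float> %b,
                                   <8 x float> %c, <8 x float> %d) {
entry:
  %m0 = fmul <8 x float> %a, %b
  %m1 = fmul <8 x float> %c, %d
  %m2 = fmul <8 x float> %a, %d
  %m3 = fmul <8 x float> %b, %c
  ; ...more independent 256-bit temporaries than there are YMM registers...
  %s0 = fadd <8 x float> %m0, %m1
  %s1 = fadd <8 x float> %m2, %m3
  %r = fadd <8 x float> %s0, %s1
  ret <8 x float> %r
}
```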

See this awesome comment in X86RegisterInfo about it maybe being too late to realign the stack:

bool X86RegisterInfo::canRealignStack(const MachineFunction &MF) const {
  if (MF.getFunction()->hasFnAttribute("no-realign-stack"))
    return false;

  const MachineFrameInfo *MFI = MF.getFrameInfo();
  const MachineRegisterInfo *MRI = &MF.getRegInfo();

  // Stack realignment requires a frame pointer. If we already started
  // register allocation with frame pointer elimination, it is too late now.
  if (!MRI->canReserveReg(FramePtr))
    return false;

  // If a base pointer is necessary. Check that it isn't too late to reserve
  // it.
  if (MFI->hasVarSizedObjects())
    return MRI->canReserveReg(BasePtr);
  return true;
}

Even if that codepath isn't executing, it's possible that there is a similar issue.


llvmbot commented Oct 31, 2014

CC'ing Jim and Bob (as they can see the radar); they should be able to chime in on this one.

IIRC, this was some refactoring I did with Jakob, but I don't recall the full details. It is/was a kind of phase ordering issue, but we never came up with a good solution.


rnk commented Oct 31, 2014

I think we could conservatively assume that any SSA value will need to be spilled to the stack and align the current frame to handle at least that.


llvmbot commented Oct 31, 2014

Reed, your proposal is to revert r168627, correct? It's conservatively correct, but you waste a register in many cases. I would prefer a more targeted approach.

Any reason we shouldn't go with Jose's suggestion: reserve the frame pointer to enable stack-realignment, if the user specifies an alignment larger than the default stack alignment?


llvmbot commented Oct 31, 2014

> Reed, your proposal is to revert r168627, correct? It's conservatively correct, but you waste a register in many cases. I would prefer a more targeted approach.
>
> Any reason we shouldn't go with Jose's suggestion: reserve the frame pointer to enable stack-realignment, if the user specifies an alignment larger than the default stack alignment?

To be clear -- my suggestion was to enable the frame pointer when needed.

I'm not sure that enabling it only when the user specifies a stack alignment different from the default stack alignment covers all cases. Imagine that the function needs to spill AVX 32-byte registers. Wouldn't that require a 32-byte aligned stack pointer?

Or what if the function needed to call another function, passing a 32-byte vector by reference?
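
For illustration, a rough sketch of that second case (the names and types here are placeholders, not taken from the attached test case):

```llvm
; Sketch only: a 32-byte vector passed by reference needs a 32-byte-aligned
; temporary on the stack, even though the incoming stack alignment (e.g. on
; 32-bit Windows) is only 4 bytes.
declare void @consume(<8 x float>*)

define void @caller(<8 x float> %v) {
entry:
  %tmp = alloca <8 x float>, align 32
  store <8 x float> %v, <8 x float>* %tmp, align 32
  call void @consume(<8 x float>* %tmp)
  ret void
}
```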

In other words, it seems to me that the alternatives are to revert r168627, or to disable frame-pointer omission blindly...


rnk commented Oct 31, 2014

> Reed, your proposal is to revert r168627, correct? It's conservatively correct, but you waste a register in many cases. I would prefer a more targeted approach.
>
> Any reason we shouldn't go with Jose's suggestion: reserve the frame pointer to enable stack-realignment, if the user specifies an alignment larger than the default stack alignment?

I feel like if the default stack alignment is 4 that should do the same thing as the user specifying a stack alignment of 4, right?


llvmbot commented Nov 3, 2014

> > Reed, your proposal is to revert r168627, correct? It's conservatively correct, but you waste a register in many cases. I would prefer a more targeted approach.
> >
> > Any reason we shouldn't go with Jose's suggestion: reserve the frame pointer to enable stack-realignment, if the user specifies an alignment larger than the default stack alignment?
>
> To be clear -- my suggestion was to enable the frame pointer when needed.

If the FP isn't available/reserved, then the register allocator uses unaligned loads/stores to fill/spill. Thus, the FP isn't strictly needed for correctness in the cases you've described.


llvmbot commented Nov 3, 2014

> > Reed, your proposal is to revert r168627, correct? It's conservatively correct, but you waste a register in many cases. I would prefer a more targeted approach.
> >
> > Any reason we shouldn't go with Jose's suggestion: reserve the frame pointer to enable stack-realignment, if the user specifies an alignment larger than the default stack alignment?
>
> I feel like if the default stack alignment is 4 that should do the same thing as the user specifying a stack alignment of 4, right?

IIRC, the default stack alignment on x86 is 16 bytes (i.e., -stack-alignment=2). If the user specifies a larger alignment than the default, then we might consider reserving the FP and realigning the stack to the larger, non-default alignment.


llvmbot commented Nov 3, 2014

> IIRC, the default stack alignment on x86 is 16 bytes

It depends on the platform. The default stack alignment for x86 on Windows is 4 bytes.

In fact, it seems this bug is mis-titled: the problem is not that LLVM doesn't honor -stack-alignment=, but rather that the x86 backend is broken for stack alignments (default or overridden) of less than 16 bytes.


llvmbot commented Nov 3, 2014

> If the FP isn't available/reserved, then the register allocator uses unaligned loads/stores to fill/spill. Thus, the FP isn't strictly needed for correctness in the cases you've described.

If I try the attached input, targeting x86_64 Linux with the default stack alignment (which is 16 bytes), and grep for rsp, I get

$ llc-3.4 -o - -mtriple=x86_64-unknown-linux -mattr=+avx bug1347703.ll | grep rsp
subq $40, %rsp
movq %r8, 32(%rsp) # 8-byte Spill
movq %rdi, 24(%rsp) # 8-byte Spill
movq 96(%rsp), %rdi
vmovdqu %ymm1, -32(%rsp) # 32-byte Folded Spill
vmovdqu %ymm1, -64(%rsp) # 32-byte Folded Spill
vmovdqu %ymm1, -96(%rsp) # 32-byte Folded Spill
movq 24(%rsp), %rdi # 8-byte Reload
vandps -32(%rsp), %ymm3, %ymm3 # 32-byte Folded Reload
vandps -64(%rsp), %ymm10, %ymm10 # 32-byte Folded Reload
vandps -96(%rsp), %ymm10, %ymm10 # 32-byte Folded Reload
[...]

The spills are indeed with VMOVDQU, but doesn't VANDPS require the m256 argument to be 32-byte aligned?


llvmbot commented Nov 3, 2014

It has been a while and I don't remember the numbers, but I'm pretty sure that reverting r168627 would cause a significant performance regression. We should figure out how to fix the problems with -stack-alignment=4 some other way.


rnk commented Nov 3, 2014

> > IIRC, the default stack alignment on x86 is 16 bytes
>
> It depends on the platform. The default stack alignment for x86 on Windows is 4 bytes.

This is why I care. :)


I believe there are two workarounds for Jose:

  1. Use the alignstack(16) (or 32?) attribute on the functions in question. If you control the frontend, this is the easiest and least fragile thing to do. (A rough IR sketch follows below.)

  2. Use the -force-align-stack hidden x86 backend command line option. This is probably fragile and will break if and when we re-examine flags embedded in libraries like this.

If you can do 1, we can probably let this issue stand.
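
At the IR level, workaround 1 would look roughly like the sketch below (the function and types are placeholders, and 32 rather than 16 is presumably what is needed once AVX spills are involved):

```llvm
; Sketch only: alignstack(N) on a function definition asks the backend to
; realign the stack in the prologue, regardless of the incoming alignment.
define void @kernel(<8 x float>* %dst, <8 x float>* %src) alignstack(32) {
entry:
  %v = load <8 x float>, <8 x float>* %src, align 32
  %r = fadd <8 x float> %v, %v
  store <8 x float> %r, <8 x float>* %dst, align 32
  ret void
}
```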


llvmbot commented Nov 3, 2014

> I believe there are two workarounds for Jose:
>
> 1. Use the alignstack(16) (or 32?) attribute on the functions in question. If you control the frontend, this is the easiest and least fragile thing to do.

The workaround we took was to set TargetOptions::NoFramePointerElim and TargetOptions::NoFramePointerElimNonLeaf on all x86 and x86_64 builds of Mesa. We were already doing this for non-release builds (to facilitate stack backtraces when debugging/profiling), but after this issue it is clear that frame-pointer omission is not just a minor optimization in LLVM but has far-reaching implications for X86 code generation, so I'm no longer comfortable having release and non-release builds differ on this matter, as most of our automated tests actually run with non-release builds.

> 2. Use the -force-align-stack hidden x86 backend command line option. This is probably fragile and will break if and when we re-examine flags embedded in libraries like this.
>
> If you can do 1, we can probably let this issue stand.

We generate LLVM IR directly (Mesa llvmpipe driver), and I'm used to tweaking things in order to work around LLVM issues, and can do so.

But, for whatever my opinion is worth, I have to say this seems a serious flaw in the LLVM x86 backend. Taking an approach that works in 99% of cases for the sake of performance, instead of one that works in 100%, seems the wrong trade-off for a compiler. And if there are indeed possible workarounds, then I fail to understand why the workarounds are pushed onto the user instead of automatically detecting the problematic situations and deploying the workarounds without intervention. At the very least you should add an assert/error message when this happens -- anything less is just sweeping the problem under the rug...


llvmbot commented Nov 5, 2014

> > > Reed, your proposal is to revert r168627, correct? It's conservatively correct, but you waste a register in many cases. I would prefer a more targeted approach.
> > >
> > > Any reason we shouldn't go with Jose's suggestion: reserve the frame pointer to enable stack-realignment, if the user specifies an alignment larger than the default stack alignment?
> >
> > To be clear -- my suggestion was to enable the frame pointer when needed.
>
> If the FP isn't available/reserved, then the register allocator uses unaligned loads/stores to fill/spill. Thus, the FP isn't strictly needed for correctness in the cases you've described.

But in this case it would have to be forbidden for those fill/spill loads/stores (or maybe it can only happen for loads) to be eliminated and folded into the arithmetic ops -- though only for SSE, not AVX, since the VEX encoding can handle unaligned loads in ordinary arithmetic ops whereas SSE can't.

However, it seems a bit silly to use unaligned stores/loads for spills in any case. I'm aware newer CPUs don't necessarily have much of a performance penalty for that (Intel Core ones in particular), though on some the performance impact can be rather horrendous (anyone still using a P4?).


rnk commented Nov 17, 2015

This is over a year old and the attached LLVM IR no longer parses with TOT llc.

llc's command-line interface isn't exactly stable or user-facing either, so I don't think there's anything to do here.


llvmbot commented Nov 17, 2015

> This is over a year old and the attached LLVM IR no longer parses with TOT llc.
>
> llc's command-line interface isn't exactly stable or user-facing either, so I don't think there's anything to do here.

This bug is not specific to llc...

I provided the llc command line / IR merely to facilitate reproing and fixing the bug.

We saw the bug on Mesa + llvmpipe with JIT.

The problem here is that LLVM is broken when the incoming stack alignment is 4. You can probably repro the issue with clang -mstack-alignment=4 and the right input.

That said, it's clear that nothing happened, so probably nothing will ever happen. On Mac and Linux the ABI (both 32- and 64-bit) states that the stack is always 16-byte aligned, so this is never an issue. (Maybe it matters for Microsoft, since on 32-bit Windows the stack alignment is 4 bytes.)

But if you don't want to fix it, then at the very least mark it as WONTFIX. Closing as INVALID on the grounds that you can no longer repro it is IMO an insult to the time I spent obtaining a small repro, as per the bug guidelines.

LLVM has very few stable interfaces, so if LLVM devs sit on a bug long enough they can close it on the basis that it can no longer be reproduced...


llvmbot commented Nov 17, 2015

bug1347703-r253245.ll
Updated IR, correct as of r253245.


llvmbot commented Nov 17, 2015

Here. Updated LLVM IR.

Basically I took the original IR and massaged it with vim substitute commands:

:%s/getelementptr \({.*}\)\*/getelementptr \1, \1*
:%s/load \(.*\)\* /load \1, \1* 
:%s/getelementptr \(\w\+\)\*/getelementptr \1, \1*


rnk commented Nov 17, 2015

> But if you don't want to fix it, then at the very least mark it as WONTFIX. Closing as INVALID on the grounds that you can no longer repro it is IMO an insult to the time I spent obtaining a small repro, as per the bug guidelines.

Sorry, our bugzilla resolution statuses are really old and sometimes rude. I end up using "invalid" to mean "working as intended". I'm only trying to convey that we aren't taking action because:

  • There isn't a correctness issue, since we'll use unaligned spills and fills
  • Nobody has updated the issue in a year, indicating that very few people rely on the stack alignment TargetOption

It sounds like it's still important to you and you don't like your workaround, so we can reopen it.

To resummarize what's wrong, LLVM is making a bad performance decision: it is choosing to use unaligned vector spills in order to omit the frame pointer. In Chad's case, this helped, but in your case, it hurts. Is that accurate?


llvmbot commented Nov 17, 2015

> To resummarize what's wrong, LLVM is making a bad performance decision: it is choosing to use unaligned vector spills in order to omit the frame pointer. In Chad's case, this helped, but in your case, it hurts. Is that accurate?

The bigger issue was that LLVM used to produce incorrect code, because the vector spills/reloads were unaligned but used instructions which required alignment. Now with AVX (which not only has unaligned movs like SSE, but whose arithmetic instructions don't need alignment either) this is less of an issue (it can still fold the reloads into the arithmetic ops; I just hope it no longer does that if only SSE is available).

But our code will sometimes produce quite massive amounts of reloads/spills (of vector regs), and certainly using unaligned instructions for that sounds like an odd decision. It's true newer CPUs "usually" don't really have huge penalties for that, but it still goes against every optimization manual I've ever seen. You should typically only use unaligned data if you can't control data alignment easily. But certainly for spills/reloads, doing it unaligned makes no sense whatsoever.


rnk commented Nov 17, 2015

I agree, in general, doing unaligned vector loads seems pretty lame. I think in Chad's test case, the vector spills were not in the hot path, and it was on 32-bit x86, where going from 6 to 7 GPRs makes a big difference.


llvmbot commented Nov 17, 2015

> I agree, in general, doing unaligned vector loads seems pretty lame. I think in Chad's test case, the vector spills were not in the hot path, and it was on 32-bit x86, where going from 6 to 7 GPRs makes a big difference.

In our case, we don't have much use for scalar regs :-). But in any case, there just has to be a better solution than using unaligned vector loads/stores for vector spills (on 32-bit, you don't have too many vector regs either).


llvmbot commented Nov 18, 2015

> It sounds like it's still important to you and you don't like your workaround, so we can reopen it.

Thanks.

> To resummarize what's wrong, LLVM is making a bad performance decision: it is choosing to use unaligned vector spills in order to omit the frame pointer. In Chad's case, this helped, but in your case, it hurts. Is that accurate?

When I filed this bug, it was a crash, not a performance issue.

The 1347703 is the number of a VMware internal bug report, and I've just double-checked: this was causing crashes on Windows, in 32-bit processes, on both AVX and non-AVX systems.

But I've just retried, and the crash/correctness issue might indeed have been fixed in LLVM since this was filed.

For example, if I do

llc -o - -mtriple=i686-pc-mingw32 -mattr=-avx,+sse2 -stack-alignment=4 bug1347703.ll

then all accesses to the stack are done with unaligned MOVs. This was not the case before.

I have all llc versions from 3.3 to 3.6, and indeed it seems the issue was fixed going from LLVM 3.3 to 3.4.

Performance was never my worry here. Even with the crash fixed, it makes sense for us to continue using the frame pointer. My main concern was the crash.

So it looks like you can resolve this bug after all!


rnk commented Nov 18, 2015

> Performance was never my worry here. Even with the crash fixed, it makes sense for us to continue using the frame pointer. My main concern was the crash.
>
> So it looks like you can resolve this bug after all!

OK, thanks for looking into it.

I think we'll mark this fixed then. If someone runs into the performance issue, there are tons of workarounds: pass -fno-omit-frame-pointer, -mstackrealign, -mstack-alignment=32, or whatever. Those all use a different codepath from TargetOptions::StackAlignmentOverride, and should avoid the unaligned loads.

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 9, 2021
This issue was closed.