LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filed before November 26, 2021. Use GitHub to submit new LLVM bugs.

Bug 34647 - llc version 4.0 and later takes up to 45 times as long to compile shader code for Mesa
Summary: llc version 4.0 and later takes up to 45 times as long to compile shader code for Mesa
Status: CONFIRMED
Alias: None
Product: new-bugs
Classification: Unclassified
Component: new bugs
Version: 4.0
Hardware: Other Linux
Importance: P normal
Assignee: Unassigned LLVM Bugs
URL:
Keywords: slow-compile
Depends on:
Blocks:
 
Reported: 2017-09-16 18:58 PDT by Ben Crocker
Modified: 2020-08-19 10:38 PDT
CC List: 11 users

See Also:
Fixed By Commit(s):


Attachments
Vertex shader bytecode (6.41 KB, application/octet-stream)
2017-09-16 18:58 PDT, Ben Crocker
Fragment shader bytecode (31.50 KB, application/octet-stream)
2017-09-16 19:01 PDT, Ben Crocker
Bytecode used for bisect operation (31.70 KB, text/html)
2017-10-18 06:48 PDT, Ben Crocker

Description Ben Crocker 2017-09-16 18:58:59 PDT
Created attachment 19167 [details]
Vertex shader bytecode

The Piglit (OpenGL test suite) ext_transform_feedback-max-varyings test
utilizes somewhat unusual shader programs (both vertex and fragment shaders).

The llc compiler prior to 4.0 compiled these programs in acceptable
times: 0.078 seconds for a representative vertex shader and 2.6-4.5
seconds for a representative fragment shader.

The 4.0 and later llc takes MUCH longer to compile the same code:
1.66 seconds for the vertex shader (a factor of roughly 20 slower!)
and 1 minute 55 seconds for the fragment shader (a factor of 25-45
slower!).

I will attach sample vertex shader code (ir_draw_llvm_vs_variant0.bc)
and fragment shader code (ir_fs914_variant0.bc).

The target architecture is PPC64LE.
Comment 1 Ben Crocker 2017-09-16 19:01:35 PDT
Created attachment 19168 [details]
Fragment shader bytecode

Here is the promised fragment shader bytecode.
Comment 2 Simon Pilgrim 2017-09-18 08:52:55 PDT
What were your triple/cpu settings?
Comment 3 Ben Crocker 2017-09-18 09:06:43 PDT
(In reply to Simon Pilgrim from comment #2)
> What were your triple/cpu settings?

% llc --version
LLVM (http://llvm.org/):
  LLVM version 6.0.0svn
  DEBUG build with assertions.
  Default target: powerpc64le-unknown-linux-gnu
  Host CPU: pwr8
Comment 4 Ben Crocker 2017-09-18 11:15:55 PDT
(In reply to Simon Pilgrim from comment #2)
> What were your triple/cpu settings?

Tom Stellard suggested I also supply the -mcpu and -mattr options.
Here they are:

% llc -mcpu=pwr8 -mattr=+altivec,+vsx
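
For reference, a complete timed invocation against one of the attached bytecode files would look roughly like the following (a sketch only; the output file name is arbitrary):

% time llc -mcpu=pwr8 -mattr=+altivec,+vsx ir_fs914_variant0.bc -o ir_fs914_variant0.s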
Comment 5 Simon Pilgrim 2017-09-19 05:17:24 PDT
Build time seems to be in RAGreedy (fragment shader):

llvm::MachineFunctionPass::runOnFunction	99.24 %
- `anonymous namespace'::RAGreedy::runOnMachineFunction	93.59 %	0.00 %
 - llvm::RegAllocBase::allocatePhysRegs	93.52 %	0.00 %
  - `anonymous namespace'::RAGreedy::selectOrSplit	92.44 %	0.00 %
   - `anonymous namespace'::RAGreedy::selectOrSplitImpl	92.20 %	0.00 %
    - `anonymous namespace'::RAGreedy::tryEvict	86.68 %	0.02 %
     - `anonymous namespace'::RAGreedy::canEvictInterference	86.27 %	0.06 %
      - `anonymous namespace'::RAGreedy::canReassign	80.64 %	0.35 %
       - llvm::LiveIntervalUnion::Query::checkInterference	61.62 %	0.31 %
        - llvm::LiveIntervalUnion::Query::collectInterferingVRegs	61.30 %	1.27 %
         - llvm::IntervalMap<llvm::SlotIndex,llvm::LiveInterval * __ptr64,8,llvm::IntervalMapInfo<llvm::SlotIndex> >::const_iterator::find	19.26 %	0.36 %
          + llvm::IntervalMapImpl::LeafNode<llvm::SlotIndex,llvm::LiveInterval * __ptr64,8,llvm::IntervalMapInfo<llvm::SlotIndex> >::findFrom	7.71 %	0.20 %
          + llvm::IntervalMap<llvm::SlotIndex,llvm::LiveInterval * __ptr64,8,llvm::IntervalMapInfo<llvm::SlotIndex> >::const_iterator::treeFind	5.70 %	0.05 %
          + llvm::IntervalMap<llvm::SlotIndex,llvm::LiveInterval * __ptr64,8,llvm::IntervalMapInfo<llvm::SlotIndex> >::const_iterator::setRoot	3.64 %	0.15 %
          + llvm::IntervalMap<llvm::SlotIndex,llvm::LiveInterval * __ptr64,8,llvm::IntervalMapInfo<llvm::SlotIndex> >::rootLeaf	0.99 %	0.28 %
          + llvm::IntervalMap<llvm::SlotIndex,llvm::LiveInterval * __ptr64,8,llvm::IntervalMapInfo<llvm::SlotIndex> >::const_iterator::branched	0.83 %	0.46 %
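
The same hotspot can also be cross-checked without an external profiler by using llc's built-in pass timers; a sketch, reusing the options from comment 4:

% llc -mcpu=pwr8 -mattr=+altivec,+vsx -time-passes ir_fs914_variant0.bc -o /dev/null

If register allocation is indeed the bottleneck, the "Greedy Register Allocator" entry should dominate the resulting pass execution timing report.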
Comment 6 Ben Crocker 2017-10-17 08:26:40 PDT
I did a bisect operation as requested by Nemanja, and here is the result
(please pardon my use of git instead of SVN):

# first bad commit: [0ef3663fb81c9cd73f646728463a6105b5d9b88a] vec perm can go down either pipeline on P8. No observable changes, spotted while looking at the scheduling description.

This certainly looks suspicious, in light of the fact that the
change is in lib/Target/PowerPC/PPCScheduleP8.td.

Here is the text of the commit in the context of the surrounding commits:

commit b89cc7e5e30432b6093664a44ee2e2af9a42f3b6
Author: Nirav Dave <niravd@google.com>
Date:   Sun Feb 26 01:27:32 2017 +0000

    Revert "In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled."

    This reverts commit r296252 until 256-bit operations are more efficiently generated in X86.

    git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@296279 91177308-0d34-0410-b5e6-96231b3b80d8

commit 0ef3663fb81c9cd73f646728463a6105b5d9b88a
Author: Eric Christopher <echristo@gmail.com>
Date:   Sun Feb 26 00:11:58 2017 +0000

    vec perm can go down either pipeline on P8.
    No observable changes, spotted while looking at the scheduling description.

    git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@296277 91177308-0d34-0410-b5e6-96231b3b80d8

commit 3a603f41297cad31be9ce54e1c8c076c76c60ddf
Author: Sanjoy Das <sanjoy@playingwithpointers.com>
Date:   Sat Feb 25 22:25:48 2017 +0000

    Fix signed-unsigned comparison warning

    git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@296274 91177308-0d34-0410-b5e6-96231b3b80d8
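
A bisect like this can be automated with git bisect run and a small wrapper script that rebuilds llc and fails whenever compiling the test bytecode takes too long. A rough sketch (the good/bad revisions, paths, and the 20-second threshold, chosen between the ~6-7 s and ~37-45 s times reported below in comment 8, are illustrative assumptions, not what was actually run):

% git bisect start <bad-rev> <good-rev>
% cat > /tmp/check.sh <<'EOF'
#!/bin/sh
# Rebuild llc at the current bisect revision; exit 125 skips unbuildable revisions.
make -C /tmp/llvm-bisect -j 144 llc || exit 125
# Time the compilation of the test bytecode and fail if it exceeds the threshold.
start=$(date +%s)
/tmp/llvm-bisect/bin/llc -mcpu=pwr8 -mattr=+altivec,+vsx \
    /tmp/ir_fs138_variant0.bc -o /dev/null || exit 125
end=$(date +%s)
test $((end - start)) -le 20
EOF
% chmod +x /tmp/check.sh
% git bisect run /tmp/check.sh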
Comment 7 Nemanja Ivanovic 2017-10-18 01:08:14 PDT
Hi Ben,
do you happen to have the compile times for the same shader code with each of the mentioned revisions? It would be good to see which one results in the largest jump. Then we can investigate why this results in such a large compile-time increase.
Comment 8 Ben Crocker 2017-10-18 06:48:37 PDT
Created attachment 19313 [details]
Bytecode used for bisect operation

Hi Nemanja,

Sorry, I did not keep the compile time information for each of the
individual bisect steps.  HOWEVER, I CAN tell you that, before the
problem commit, the compile time for the shader code was routinely in
the 6-7 second range, while after the problem commit, the compile
time was in the 37-45 second range.

BTW I've attached the specific bytecode file I used for the
bisect operation, ir_fs138_variant0.bc.
Comment 9 Ben Crocker 2017-10-18 06:52:21 PDT
P.S.  Note that I kept my LLVM build in /tmp, i.e. on RAM disk,
so the only disk I/O involved was reading the bytecode file and
writing the assembly language output.
Comment 10 jtony 2017-10-26 13:34:07 PDT
Hi Ben, I tried compiling all three bytecode files from the attachments on our PPC64LE dev machine, with and without the first bad commit you mentioned (0ef3663fb81c9cd73f646728463a6105b5d9b88a), using the options from your comment (-mcpu=pwr8 -mattr=+altivec,+vsx). There is no significant compile-time difference for any of the three bytecode files; I ran the test 10 times with and without that patch, reverting only the problematic patch you identified from Eric Christopher. Can you retry against the latest clang/llvm trunk and see whether you can still reproduce? Also, if you can provide the git hashes for the other three projects (clang, compiler-rt, and test-suite) from when you found the bad llvm commit (they should have time stamps close to the 0ef3663fb81c9cd73f646728463a6105b5d9b88a patch), I can revert all the projects to around that time and test again to see whether I can reproduce. Thank you very much!


The following is one of my test results (there is no visible difference between runs):
 
time `llc fragmentShader.bc  -mcpu=pwr8 -mattr=+altivec,+vsx`

real    0m3.501s
user    0m3.491s
sys     0m0.008s
Comment 11 Nemanja Ivanovic 2017-10-27 08:40:14 PDT
I can reproduce this degradation. I'm not sure how you did your experiment Tony, but I get consistent run times around 0.5s before the first patch and 2.0s after it.
We will continue investigating.
Comment 12 jtony 2017-10-30 13:27:58 PDT
Hi Ben, can you verify that you are not comparing a Release-build llc with a Debug-build llc? We know that the Debug build of llc is significantly slower than the Release build. According to my tests, the Release-build llc compile time for the fragment shader bytecode is always around 3 seconds with or without Eric's patch, while the Debug-build llc compile time is almost 2 minutes with or without Eric's patch. Can you run the test again with both the Release and Debug builds and post your detailed results here if you still believe there is a compile-time degradation? Thanks a lot!
Comment 13 Ben Crocker 2017-10-30 14:23:42 PDT
I see I did not specify my exact build procedure; apologies!

Here it is:

In my LLVM directory, /tmp/llvm-bisect (i.e., on RAMdisk):

% cmake -G "Unix Makefiles" -DLLVM_BUILD_LLVM_DYLIB=ON -DCMAKE_INSTALL_PREFIX=/tmp/local /tmp/llvm-bisect
% make -j 144

I.e., I built with "gcc (GCC) 7.2.1 20170915 (Red Hat 7.2.1-2)", the
system compiler; we are somewhat constrained to use GCC when building
Mesa, LLVM, etc.

But I DID do Debug builds (i.e. let the build type default to Debug),
so maybe that has something to do with the differences in our
experiences.
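
For comparison, the corresponding Release configuration differs only in making the build type explicit (a sketch based on the command above; when -DCMAKE_BUILD_TYPE is not given, LLVM's CMake setup defaults to the much slower Debug build, which is why comment 12 asks about Release vs. Debug):

% cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release -DLLVM_BUILD_LLVM_DYLIB=ON -DCMAKE_INSTALL_PREFIX=/tmp/local /tmp/llvm-bisect
% make -j 144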
Comment 14 jtony 2017-11-03 08:31:12 PDT
I tested reverting the patch from Eric that Ben reported, on the clang/llvm 5.0 branch. Reverting just that patch reduces the Release-build llc compile time for ir_fs138_variant0.bc by about 10% (the other two bytecode files show an even smaller compile-time difference), which means Eric's patch alone accounts for roughly a 10% compile-time degradation on branch 5.0. However, we do see about a 3x-4x compile-time difference for ir_fs138_variant0.bc if I revert everything from Eric's patch onward (from about 0.5 seconds before to about 2.0 seconds after). I will continue looking into this issue. Meanwhile, please use a Release-build llc for compilation in the future, since it is far faster than the Debug-build llc (about 25x-45x faster).
Comment 15 Eric Christopher 2017-11-09 22:06:39 PST
My patch is merely a scheduler description change, at worst it's highlighting a performance problem somewhere else sadly.
Comment 16 jtony 2017-11-15 16:36:26 PST
Comparing release builds, here are the compile-time differences with and without Eric's patch:
- Going back to the first revision with the patch only shows a degradation (2x-3x) with input file "ir_fs138_variant0.bc"
- Just pulling the patch from ToT doesn't show noticeable improvement for any of the input files


The compile-time increase comes from greedy register allocation. The patch changes the instruction scheduling - as it is meant to, which unfortunately means that in this particular case, we produce a schedule that is particularly bad for Greedy RA. Here are a few technical details as I understand them from my investigation:
- Greedy RA greedily assigns physical registers to virtual registers with the longest live range first
- Then it will try to fit in shorter live ranges and will evict virtual registers that were already assigned if it is profitable to do so. When it does this, it needs to re-queue the evicted live range
- It just so happens that with this schedule, we evict more registers so we converge more slowly
- It is possible that the new schedule for this test case produces live ranges that are near worst case for greedy RA (I haven't analyzed the algorithm enough to claim this is the case, but perhaps developers more familiar with this code can comment).
- Furthermore, I assume this near worst case could quite conceivably be produced without this patch with the right test case (again, I haven't confirmed this).

With this information in mind, I think we might have to consider this a limitation and close this PR. Considering Mesa uses LLVM as a JIT, it may be worthwhile to investigate switching to the fast register allocator (e.g., with the option -regalloc=fast). I assume that will produce less optimal register allocation, but it should be faster than the near-optimal greedy register allocator. Let me know what you think about this.
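
To gauge that trade-off offline, the same bytecode can be compiled with each allocator and the times and generated code compared; a rough sketch (output file names are arbitrary, and in Mesa itself the allocator would have to be selected through its own LLVM setup rather than via llc):

% time llc -mcpu=pwr8 -mattr=+altivec,+vsx -regalloc=greedy ir_fs138_variant0.bc -o fs_greedy.s
% time llc -mcpu=pwr8 -mattr=+altivec,+vsx -regalloc=fast ir_fs138_variant0.bc -o fs_fast.s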
Comment 17 Quentin Colombet 2017-11-17 10:13:42 PST
Instead of using the fast regalloc (which is super bad for code quality), you could give the basic allocator a try.
Comment 18 Jakub Kuderski 2020-08-19 10:21:36 PDT
We can still reproduce this bug with LLVM 12.0 using the LLPC pipeline compiler for AMD GPUs.

Our profiles show the same pathological behavior in `tryEvict`.