Created attachment 19167 [details]
Vertex shader bytecode

The Piglit (OpenGL test suite) ext_transform_feedback-max-varyings test uses somewhat unusual shader programs (both vertex and fragment shaders). llc prior to 4.0 compiled these programs in acceptable times: 0.078 seconds for a representative vertex shader and 2.6-4.5 seconds for a representative fragment shader. llc 4.0 and later take MUCH longer to compile the same code: 1.66 seconds for the vertex shader (a factor of 20 slower!) and 1 minute 55 seconds for the fragment shader (a factor of 25-45 slower!).

I will attach sample vertex shader code (ir_draw_llvm_vs_variant0.bc) and fragment shader code (ir_fs914_variant0.bc). The target architecture is PPC64LE.
Created attachment 19168 [details]
Fragment shader bytecode

Here is the promised fragment shader bytecode.
What were your triple/cpu settings?
(In reply to Simon Pilgrim from comment #2)
> What were your triple/cpu settings?

% llc --version
LLVM (http://llvm.org/):
  LLVM version 6.0.0svn
  DEBUG build with assertions.
  Default target: powerpc64le-unknown-linux-gnu
  Host CPU: pwr8
(In reply to Simon Pilgrim from comment #2)
> What were your triple/cpu settings?

Tom Stellard suggested I also supply the -mcpu and -mattr options. Here they are:

% llc -mcpu=pwr8 -mattr=+altivec,+vsx
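Putting the pieces together, a complete timing invocation would look something like the following (a sketch only: it assumes llc is on PATH and the attached bitcode files are in the current directory; the filenames come from the attachments above):

```shell
# Hypothetical end-to-end reproduction; the .bc files are the attachments
# from this report, and output is discarded so only compile time is measured.
time llc ir_draw_llvm_vs_variant0.bc -mcpu=pwr8 -mattr=+altivec,+vsx -o /dev/null
time llc ir_fs914_variant0.bc -mcpu=pwr8 -mattr=+altivec,+vsx -o /dev/null
```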
Build time seems to be in RAGreedy (fragment shader):

llvm::MachineFunctionPass::runOnFunction  99.24 %
- `anonymous namespace'::RAGreedy::runOnMachineFunction  93.59 %  0.00 %
  - llvm::RegAllocBase::allocatePhysRegs  93.52 %  0.00 %
    - `anonymous namespace'::RAGreedy::selectOrSplit  92.44 %  0.00 %
      - `anonymous namespace'::RAGreedy::selectOrSplitImpl  92.20 %  0.00 %
        - `anonymous namespace'::RAGreedy::tryEvict  86.68 %  0.02 %
          - `anonymous namespace'::RAGreedy::canEvictInterference  86.27 %  0.06 %
            - `anonymous namespace'::RAGreedy::canReassign  80.64 %  0.35 %
              - llvm::LiveIntervalUnion::Query::checkInterference  61.62 %  0.31 %
                - llvm::LiveIntervalUnion::Query::collectInterferingVRegs  61.30 %  1.27 %
                  - llvm::IntervalMap<llvm::SlotIndex,llvm::LiveInterval * __ptr64,8,llvm::IntervalMapInfo<llvm::SlotIndex> >::const_iterator::find  19.26 %  0.36 %
                    + llvm::IntervalMapImpl::LeafNode<llvm::SlotIndex,llvm::LiveInterval * __ptr64,8,llvm::IntervalMapInfo<llvm::SlotIndex> >::findFrom  7.71 %  0.20 %
                    + llvm::IntervalMap<llvm::SlotIndex,llvm::LiveInterval * __ptr64,8,llvm::IntervalMapInfo<llvm::SlotIndex> >::const_iterator::treeFind  5.70 %  0.05 %
                    + llvm::IntervalMap<llvm::SlotIndex,llvm::LiveInterval * __ptr64,8,llvm::IntervalMapInfo<llvm::SlotIndex> >::const_iterator::setRoot  3.64 %  0.15 %
                    + llvm::IntervalMap<llvm::SlotIndex,llvm::LiveInterval * __ptr64,8,llvm::IntervalMapInfo<llvm::SlotIndex> >::rootLeaf  0.99 %  0.28 %
                    + llvm::IntervalMap<llvm::SlotIndex,llvm::LiveInterval * __ptr64,8,llvm::IntervalMapInfo<llvm::SlotIndex> >::const_iterator::branched  0.83 %  0.46 %
I did a bisect operation as requested by Nemanja, and here is the result (please pardon my use of git instead of SVN):

# first bad commit: [0ef3663fb81c9cd73f646728463a6105b5d9b88a] vec perm can go down either pipeline on P8. No observable changes, spotted while looking at the scheduling description.

This certainly looks suspicious, in light of the fact that the change is in lib/Target/PowerPC/PPCScheduleP8.td. Here is the text of the commit in the context of the surrounding commits:

commit b89cc7e5e30432b6093664a44ee2e2af9a42f3b6
Author: Nirav Dave <niravd@google.com>
Date:   Sun Feb 26 01:27:32 2017 +0000

    Revert "In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled."

    This reverts commit r296252 until 256-bit operations are more efficiently generated in X86.

    git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@296279 91177308-0d34-0410-b5e6-96231b3b80d8

commit 0ef3663fb81c9cd73f646728463a6105b5d9b88a
Author: Eric Christopher <echristo@gmail.com>
Date:   Sun Feb 26 00:11:58 2017 +0000

    vec perm can go down either pipeline on P8. No observable changes, spotted while looking at the scheduling description.

    git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@296277 91177308-0d34-0410-b5e6-96231b3b80d8

commit 3a603f41297cad31be9ce54e1c8c076c76c60ddf
Author: Sanjoy Das <sanjoy@playingwithpointers.com>
Date:   Sat Feb 25 22:25:48 2017 +0000

    Fix signed-unsigned comparison warning

    git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@296274 91177308-0d34-0410-b5e6-96231b3b80d8
Hi Ben, do you happen to have the compile times for the same shader code with each of the mentioned revisions? It would be good to see which one results in the largest jump. Then we can investigate why this results in such a large compile-time increase.
Created attachment 19313 [details]
Bytecode used for bisect operation

Hi Nemanja,

Sorry, I did not keep the compile-time information for each of the individual bisect steps. However, I CAN tell you that, before the problem commit, the compile time for the shader code was routinely in the 6-7 second range, while after the problem commit it was in the 37-45 second range.

BTW, I've attached the specific bytecode file I used for the bisect operation, ir_fs138_variant0.bc.
P.S. Note that I kept my LLVM build in /tmp, i.e. on RAM disk, so the only disk I/O involved was reading the bytecode file and writing the assembly language output.
Hi Ben,

I tried compiling all three bytecode files you attached on our PPC64LE dev machine, with and without the first bad commit you mentioned (0ef3663fb81c9cd73f646728463a6105b5d9b88a), using the options from your comment (-mcpu=pwr8 -mattr=+altivec,+vsx). There is no significant compile-time difference for any of the three bytecode files; I ran the compile 10 times with and without that patch.

Can you retry this against the latest trunk of clang/llvm and see whether you can still reproduce? I was only reverting the problematic patch you mentioned from Eric Christopher. If you can provide the git hashes for the other three projects (clang, compiler-rt, and test-suite) from when you found the bad llvm commit (they should have a similar timestamp to the 0ef3663fb81c9cd73f646728463a6105b5d9b88a patch), I can revert all the projects to around the bad llvm commit time and test again to see whether I can reproduce.

Thank you very much! The following is one of my test results (there is no visible difference between runs):

% time llc fragmentShader.bc -mcpu=pwr8 -mattr=+altivec,+vsx
real    0m3.501s
user    0m3.491s
sys     0m0.008s
I can reproduce this degradation. I'm not sure how you did your experiment, Tony, but I get consistent run times of around 0.5s before the first patch and 2.0s after it. We will continue investigating.
Hi Ben, can you validate that you are not comparing a Release-build llc with a Debug-build llc? We know that the Debug-build llc is significantly slower than the Release-build llc. According to my tests, the Release-build llc compile time for the fragment shader bytecode is always around 3 seconds with or without Eric's patch, while the Debug-build llc compile time is almost 2 minutes with or without Eric's patch. Can you run the test again for both the Release and Debug builds and post your detailed results here if you still believe there is a compile-time degradation? Thanks a lot!
I see I did not specify my exact build procedure; apologies! Here it is. In my LLVM directory, /tmp/llvm-bisect (i.e., on RAMdisk):

% cmake -G "Unix Makefiles" -DLLVM_BUILD_LLVM_DYLIB=ON -DCMAKE_INSTALL_PREFIX=/tmp/local /tmp/llvm-bisect
% make -j 144

I.e., I built with "gcc (GCC) 7.2.1 20170915 (Red Hat 7.2.1-2)", the system compiler; we are somewhat constrained to use GCC when building Mesa, LLVM, etc. But I DID do Debug builds (i.e., I let the build type default to Debug), so maybe that has something to do with the differences in our experiences.
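For comparison, an explicit Release configuration of the same tree would look roughly like this (a sketch assuming the same /tmp layout as above; adding -DCMAKE_BUILD_TYPE=Release is the only substantive change relative to the commands I used):

```shell
# Hypothetical Release configuration; CMAKE_BUILD_TYPE=Release is the
# change relative to the default (Debug) build described above.
cmake -G "Unix Makefiles" \
      -DCMAKE_BUILD_TYPE=Release \
      -DLLVM_BUILD_LLVM_DYLIB=ON \
      -DCMAKE_INSTALL_PREFIX=/tmp/local \
      /tmp/llvm-bisect
make -j 144
```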
I tested reverting Eric's patch, as reported by Ben, on the clang 5.0 branch. Just reverting that patch reduces the Release-build llc compile time for ir_fs138_variant0.bc by about 10% (the other two bytecode files show an even smaller compile-time difference), which means Eric's patch alone causes only about a 10% compile-time degradation on branch 5.0. But we did see about a 3x-4x compile-time difference for ir_fs138_variant0.bc when I reverted everything including Eric's patch and everything after it (from about 0.5 seconds before to about 2.0 seconds after). I will continue looking at this issue. Meanwhile, please use the Release-build llc for compilation in the future, since it is far faster than the Debug-build llc (about 25x-45x faster).
My patch is merely a scheduler description change; at worst, it's highlighting a performance problem somewhere else, sadly.
Comparing release builds, here are the compile-time differences with and without Eric's patch:

- Going back to the first revision with the patch only shows a degradation (2x-3x) with input file "ir_fs138_variant0.bc"
- Just pulling the patch from ToT doesn't show a noticeable improvement for any of the input files

The compile-time increase comes from greedy register allocation. The patch changes the instruction scheduling - as it is meant to - which unfortunately means that in this particular case we produce a schedule that is particularly bad for Greedy RA. Here are a few technical details as I understand them from my investigation:

- Greedy RA greedily assigns physical registers to the virtual registers with the longest live ranges first
- It will then try to fit in shorter live ranges and will evict virtual registers that were already assigned if it is profitable to do so. When it does this, it needs to re-queue the evicted live range
- It just so happens that with this schedule we evict more registers, so we converge more slowly
- It is possible that the new schedule for this test case produces live ranges that are near worst case for greedy RA (I haven't analyzed the algorithm enough to claim this is the case, but perhaps developers more familiar with this code can comment)
- Furthermore, I assume this near-worst case could quite conceivably be produced without this patch given the right test case (again, I haven't confirmed this)

With this information in mind, I think we might have to consider this a limitation and close this PR. Considering Mesa is a JIT, it may be worthwhile to investigate switching to the fast register allocator (with the option -regalloc=fast). I assume that will produce less optimal register allocation, but it is presumably faster than the near-optimal greedy register allocator. Let me know what you think about this.
Instead of using the fast regalloc (which is super bad for code quality), you could give basic a try (-regalloc=basic).
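To weigh the compile-time trade-off between the allocators on the attached bitcode, a loop like the following could be used (a sketch only: it assumes a Release-build llc on PATH and ir_fs138_variant0.bc in the current directory):

```shell
# Hypothetical comparison of the three allocators mentioned in this thread;
# output is discarded so only compile time is measured.
for ra in greedy basic fast; do
    echo "== -regalloc=$ra =="
    time llc ir_fs138_variant0.bc -mcpu=pwr8 -mattr=+altivec,+vsx \
        -regalloc=$ra -o /dev/null
done
```

Code quality would still need to be checked separately, since basic and fast trade allocation quality for speed.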
We can still reproduce this bug with LLVM 12.0 using the LLPC pipeline compiler for AMD GPUs. Our profiles show the same pathological behavior in `tryEvict`.