[PPC64LE] Remove unnecessary swaps from lane-insensitive vector computations

This patch adds a new SSA MI pass that runs on little-endian PPC64
code with VSX enabled. Loads and stores of 4x32 and 2x64 vectors
without alignment constraints are accomplished for little-endian using
lxvd2x/xxswapd and xxswapd/stxvd2x. The additional xxswapd
instructions hurt performance relative to big-endian code, but in the
general case they are necessary for correct semantics.
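
To see why the swaps are needed at all: lxvd2x loads the two
doublewords of a vector in big-endian element order, so on
little-endian the halves arrive exchanged and xxswapd restores them. A
small C++ simulation of that effect (purely illustrative; nothing
below is from the patch):

  #include <array>
  #include <cassert>
  #include <cstdint>

  int main() {
    // Little-endian memory image of a <2 x i64> vector: element 0 sits
    // at the lowest address.
    std::array<uint64_t, 2> Mem = {0x1111111111111111ULL,
                                   0x2222222222222222ULL};

    // lxvd2x loads the doubleword at the lowest address into the
    // most-significant half of the register. Viewed as little-endian
    // elements, the two halves therefore arrive reversed...
    std::array<uint64_t, 2> AfterLxvd2x = {Mem[1], Mem[0]};

    // ...and xxswapd exchanges the doublewords, restoring the order.
    std::array<uint64_t, 2> AfterXxswapd = {AfterLxvd2x[1],
                                            AfterLxvd2x[0]};
    assert(AfterXxswapd == Mem);
    return 0;
  }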

However, the general case does not apply to most vector code. Many
vector instructions are lane-insensitive; they do not "care" which
lanes the parallel computations are performed within, provided that
the resulting data is stored into the correct locations. Thus this
pass looks for computations that perform only lane-insensitive
operations and removes the unnecessary swaps from loads and stores in
such computations.
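
For a concrete picture of the distinction, here is a hypothetical
classifier in the spirit of the switch this pass uses (see the NOTE
added to the .td files below); the opcodes named are real PPC
instructions, but the function itself is a sketch, not the committed
code:

  #include "PPCInstrInfo.h"  // assumed: pulls in the PPC::* opcode enum
  #include "llvm/CodeGen/MachineInstr.h"

  using namespace llvm;

  // True if every lane of the result depends only on the corresponding
  // lane of the sources, so the instruction is safe when the lanes are
  // kept in swapped (doubleword-exchanged) order.
  static bool isLaneInsensitive(const MachineInstr &MI) {
    switch (MI.getOpcode()) {
    case PPC::VADDFP:   // lane-wise FP add
    case PPC::VADDUWM:  // lane-wise word add
    case PPC::VMULUWM:  // lane-wise word multiply
      return true;
    case PPC::VMULESB:  // "even" lanes differ between BE and LE numbering
      return false;
    default:
      return false;     // conservatively treat unknown opcodes as sensitive
    }
  }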

Future improvements will allow computations using certain
lane-sensitive operations to also be optimized in this manner, by
modifying the lane-sensitive operations to account for the permuted
order of the lanes. However, this patch only adds the infrastructure
to permit this; no lane-sensitive operations are optimized at this
time.

This code is heavily exercised by the various vectorizing applications
in the projects/test-suite tree. For the time being, I have only added
one simple test case to demonstrate what the pass is doing. Although
it is quite simple, it provides coverage for much of the code,
including the special case handling of copies and subreg-to-reg
operations feeding the swaps. I plan to add additional tests in the
future as I fill in more of the "special handling" code.

Two existing tests were affected because they expected the swaps to be
present; those swaps are now removed.


git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@235910 91177308-0d34-0410-b5e6-96231b3b80d8
wschmidt-ibm committed Apr 27, 2015
1 parent 267cdc0 commit dcc4f72
Showing 9 changed files with 971 additions and 2 deletions.
1 change: 1 addition & 0 deletions lib/Target/PowerPC/CMakeLists.txt
@@ -36,6 +36,7 @@ add_llvm_target(PowerPCCodeGen
PPCTLSDynamicCall.cpp
PPCVSXCopy.cpp
PPCVSXFMAMutate.cpp
PPCVSXSwapRemoval.cpp
)

add_subdirectory(AsmParser)
1 change: 1 addition & 0 deletions lib/Target/PowerPC/PPC.h
@@ -39,6 +39,7 @@ namespace llvm {
FunctionPass *createPPCEarlyReturnPass();
FunctionPass *createPPCVSXCopyPass();
FunctionPass *createPPCVSXFMAMutatePass();
FunctionPass *createPPCVSXSwapRemovalPass();
FunctionPass *createPPCBranchSelectionPass();
FunctionPass *createPPCISelDag(PPCTargetMachine &TM);
FunctionPass *createPPCTLSDynamicCallPass();
15 changes: 15 additions & 0 deletions lib/Target/PowerPC/PPCInstrAltivec.td
@@ -11,6 +11,21 @@
//
//===----------------------------------------------------------------------===//

// *********************************** NOTE ***********************************
// ** For POWER8 Little Endian, the VSX swap optimization relies on knowing **
// ** which VMX and VSX instructions are lane-sensitive and which are not. **
// ** A lane-sensitive instruction relies, implicitly or explicitly, on **
// ** whether lanes are numbered from left to right. An instruction like **
// ** VADDFP is not lane-sensitive, because each lane of the result vector **
// ** relies only on the corresponding lane of the source vectors. However, **
// ** an instruction like VMULESB is lane-sensitive, because "even" and **
// ** "odd" lanes are different for big-endian and little-endian numbering. **
// ** **
// ** When adding new VMX and VSX instructions, please consider whether they **
// ** are lane-sensitive. If so, they must be added to a switch statement **
// ** in PPCVSXSwapRemoval::gatherVectorInstructions(). **
// ****************************************************************************

//===----------------------------------------------------------------------===//
// Altivec transformation functions and pattern fragments.
//
15 changes: 15 additions & 0 deletions lib/Target/PowerPC/PPCInstrVSX.td
@@ -11,6 +11,21 @@
//
//===----------------------------------------------------------------------===//

// *********************************** NOTE ***********************************
// ** For POWER8 Little Endian, the VSX swap optimization relies on knowing **
// ** which VMX and VSX instructions are lane-sensitive and which are not. **
// ** A lane-sensitive instruction relies, implicitly or explicitly, on **
// ** whether lanes are numbered from left to right. An instruction like **
// ** VADDFP is not lane-sensitive, because each lane of the result vector **
// ** relies only on the corresponding lane of the source vectors. However, **
// ** an instruction like VMULESB is lane-sensitive, because "even" and **
// ** "odd" lanes are different for big-endian and little-endian numbering. **
// ** **
// ** When adding new VMX and VSX instructions, please consider whether they **
// ** are lane-sensitive. If so, they must be added to a switch statement **
// ** in PPCVSXSwapRemoval::gatherVectorInstructions(). **
// ****************************************************************************

def PPCRegVSRCAsmOperand : AsmOperandClass {
let Name = "RegVSRC"; let PredicateMethod = "isVSRegNumber";
}
14 changes: 14 additions & 0 deletions lib/Target/PowerPC/PPCTargetMachine.cpp
@@ -38,6 +38,10 @@ static cl::opt<bool>
VSXFMAMutateEarly("schedule-ppc-vsx-fma-mutation-early",
cl::Hidden, cl::desc("Schedule VSX FMA instruction mutation early"));

static cl::opt<bool>
DisableVSXSwapRemoval("disable-ppc-vsx-swap-removal", cl::Hidden,
                      cl::desc("Disable VSX Swap Removal for PPC"));

static cl::opt<bool>
EnableGEPOpt("ppc-gep-opt", cl::Hidden,
cl::desc("Enable optimizations on complex GEPs"),
@@ -239,6 +243,7 @@ class PPCPassConfig : public TargetPassConfig {
bool addPreISel() override;
bool addILPOpts() override;
bool addInstSelector() override;
void addMachineSSAOptimization() override;
void addPreRegAlloc() override;
void addPreSched2() override;
void addPreEmitPass() override;
@@ -306,6 +311,15 @@ bool PPCPassConfig::addInstSelector() {
return false;
}

void PPCPassConfig::addMachineSSAOptimization() {
TargetPassConfig::addMachineSSAOptimization();
// For little endian, remove where possible the vector swap instructions
// introduced at code generation to normalize vector element order.
if (Triple(TM->getTargetTriple()).getArch() == Triple::ppc64le &&
!DisableVSXSwapRemoval)
addPass(createPPCVSXSwapRemovalPass());
}

void PPCPassConfig::addPreRegAlloc() {
initializePPCVSXFMAMutatePass(*PassRegistry::getPassRegistry());
insertPass(VSXFMAMutateEarly ? &RegisterCoalescerID : &MachineSchedulerID,
778 changes: 778 additions & 0 deletions lib/Target/PowerPC/PPCVSXSwapRemoval.cpp

Large diffs are not rendered by default.
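
Since the new file is not rendered above, here is a minimal sketch of
its likely shape, assuming only standard MachineFunctionPass
boilerplate; the pass name and factory function appear elsewhere in
this commit, and the step comments paraphrase the commit message:

  #include "PPC.h"
  #include "llvm/CodeGen/MachineFunctionPass.h"

  using namespace llvm;

  namespace {
  struct PPCVSXSwapRemoval : public MachineFunctionPass {
    static char ID;
    PPCVSXSwapRemoval() : MachineFunctionPass(ID) {}

    bool runOnMachineFunction(MachineFunction &MF) override {
      // Gather the vector instructions, reject any computation that
      // contains a lane-sensitive operation, and erase the xxswapd
      // instructions feeding the loads and stores of the rest.
      (void)MF; // sketch only
      bool Changed = false;
      return Changed;
    }
  };
  } // end anonymous namespace

  char PPCVSXSwapRemoval::ID = 0;

  FunctionPass *llvm::createPPCVSXSwapRemovalPass() {
    return new PPCVSXSwapRemoval();
  }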

147 changes: 147 additions & 0 deletions test/CodeGen/PowerPC/swaps-le-1.ll
@@ -0,0 +1,147 @@
; RUN: llc -O3 -mcpu=pwr8 -mtriple=powerpc64le-unknown-linux-gnu < %s | FileCheck %s
; RUN: llc -O3 -mcpu=pwr8 -disable-ppc-vsx-swap-removal -mtriple=powerpc64le-unknown-linux-gnu < %s | FileCheck -check-prefix=NOOPTSWAP %s

; This test was generated from the following source:
;
; #define N 4096
; int ca[N] __attribute__((aligned(16)));
; int cb[N] __attribute__((aligned(16)));
; int cc[N] __attribute__((aligned(16)));
; int cd[N] __attribute__((aligned(16)));
;
; void foo ()
; {
; int i;
; for (i = 0; i < N; i++) {
; ca[i] = (cb[i] + cc[i]) * cd[i];
; }
; }

@cb = common global [4096 x i32] zeroinitializer, align 16
@cc = common global [4096 x i32] zeroinitializer, align 16
@cd = common global [4096 x i32] zeroinitializer, align 16
@ca = common global [4096 x i32] zeroinitializer, align 16

define void @foo() {
entry:
br label %vector.body

vector.body:
%index = phi i64 [ 0, %entry ], [ %index.next.3, %vector.body ]
%0 = getelementptr inbounds [4096 x i32], [4096 x i32]* @cb, i64 0, i64 %index
%1 = bitcast i32* %0 to <4 x i32>*
%wide.load = load <4 x i32>, <4 x i32>* %1, align 16
%2 = getelementptr inbounds [4096 x i32], [4096 x i32]* @cc, i64 0, i64 %index
%3 = bitcast i32* %2 to <4 x i32>*
%wide.load13 = load <4 x i32>, <4 x i32>* %3, align 16
%4 = add nsw <4 x i32> %wide.load13, %wide.load
%5 = getelementptr inbounds [4096 x i32], [4096 x i32]* @cd, i64 0, i64 %index
%6 = bitcast i32* %5 to <4 x i32>*
%wide.load14 = load <4 x i32>, <4 x i32>* %6, align 16
%7 = mul nsw <4 x i32> %4, %wide.load14
%8 = getelementptr inbounds [4096 x i32], [4096 x i32]* @ca, i64 0, i64 %index
%9 = bitcast i32* %8 to <4 x i32>*
store <4 x i32> %7, <4 x i32>* %9, align 16
%index.next = add nuw nsw i64 %index, 4
%10 = getelementptr inbounds [4096 x i32], [4096 x i32]* @cb, i64 0, i64 %index.next
%11 = bitcast i32* %10 to <4 x i32>*
%wide.load.1 = load <4 x i32>, <4 x i32>* %11, align 16
%12 = getelementptr inbounds [4096 x i32], [4096 x i32]* @cc, i64 0, i64 %index.next
%13 = bitcast i32* %12 to <4 x i32>*
%wide.load13.1 = load <4 x i32>, <4 x i32>* %13, align 16
%14 = add nsw <4 x i32> %wide.load13.1, %wide.load.1
%15 = getelementptr inbounds [4096 x i32], [4096 x i32]* @cd, i64 0, i64 %index.next
%16 = bitcast i32* %15 to <4 x i32>*
%wide.load14.1 = load <4 x i32>, <4 x i32>* %16, align 16
%17 = mul nsw <4 x i32> %14, %wide.load14.1
%18 = getelementptr inbounds [4096 x i32], [4096 x i32]* @ca, i64 0, i64 %index.next
%19 = bitcast i32* %18 to <4 x i32>*
store <4 x i32> %17, <4 x i32>* %19, align 16
%index.next.1 = add nuw nsw i64 %index.next, 4
%20 = getelementptr inbounds [4096 x i32], [4096 x i32]* @cb, i64 0, i64 %index.next.1
%21 = bitcast i32* %20 to <4 x i32>*
%wide.load.2 = load <4 x i32>, <4 x i32>* %21, align 16
%22 = getelementptr inbounds [4096 x i32], [4096 x i32]* @cc, i64 0, i64 %index.next.1
%23 = bitcast i32* %22 to <4 x i32>*
%wide.load13.2 = load <4 x i32>, <4 x i32>* %23, align 16
%24 = add nsw <4 x i32> %wide.load13.2, %wide.load.2
%25 = getelementptr inbounds [4096 x i32], [4096 x i32]* @cd, i64 0, i64 %index.next.1
%26 = bitcast i32* %25 to <4 x i32>*
%wide.load14.2 = load <4 x i32>, <4 x i32>* %26, align 16
%27 = mul nsw <4 x i32> %24, %wide.load14.2
%28 = getelementptr inbounds [4096 x i32], [4096 x i32]* @ca, i64 0, i64 %index.next.1
%29 = bitcast i32* %28 to <4 x i32>*
store <4 x i32> %27, <4 x i32>* %29, align 16
%index.next.2 = add nuw nsw i64 %index.next.1, 4
%30 = getelementptr inbounds [4096 x i32], [4096 x i32]* @cb, i64 0, i64 %index.next.2
%31 = bitcast i32* %30 to <4 x i32>*
%wide.load.3 = load <4 x i32>, <4 x i32>* %31, align 16
%32 = getelementptr inbounds [4096 x i32], [4096 x i32]* @cc, i64 0, i64 %index.next.2
%33 = bitcast i32* %32 to <4 x i32>*
%wide.load13.3 = load <4 x i32>, <4 x i32>* %33, align 16
%34 = add nsw <4 x i32> %wide.load13.3, %wide.load.3
%35 = getelementptr inbounds [4096 x i32], [4096 x i32]* @cd, i64 0, i64 %index.next.2
%36 = bitcast i32* %35 to <4 x i32>*
%wide.load14.3 = load <4 x i32>, <4 x i32>* %36, align 16
%37 = mul nsw <4 x i32> %34, %wide.load14.3
%38 = getelementptr inbounds [4096 x i32], [4096 x i32]* @ca, i64 0, i64 %index.next.2
%39 = bitcast i32* %38 to <4 x i32>*
store <4 x i32> %37, <4 x i32>* %39, align 16
%index.next.3 = add nuw nsw i64 %index.next.2, 4
%40 = icmp eq i64 %index.next.3, 4096
br i1 %40, label %for.end, label %vector.body

for.end:
ret void
}

; CHECK-LABEL: @foo
; CHECK-NOT: xxpermdi
; CHECK-NOT: xxswapd

; CHECK: lxvd2x
; CHECK: lxvd2x
; CHECK-DAG: lxvd2x
; CHECK-DAG: vadduwm
; CHECK: vmuluwm
; CHECK: stxvd2x

; CHECK: lxvd2x
; CHECK: lxvd2x
; CHECK-DAG: lxvd2x
; CHECK-DAG: vadduwm
; CHECK: vmuluwm
; CHECK: stxvd2x

; CHECK: lxvd2x
; CHECK: lxvd2x
; CHECK-DAG: lxvd2x
; CHECK-DAG: vadduwm
; CHECK: vmuluwm
; CHECK: stxvd2x

; CHECK: lxvd2x
; CHECK: lxvd2x
; CHECK-DAG: lxvd2x
; CHECK-DAG: vadduwm
; CHECK: vmuluwm
; CHECK: stxvd2x


; NOOPTSWAP-LABEL: @foo

; NOOPTSWAP: lxvd2x
; NOOPTSWAP-DAG: lxvd2x
; NOOPTSWAP-DAG: lxvd2x
; NOOPTSWAP-DAG: xxswapd
; NOOPTSWAP-DAG: xxswapd
; NOOPTSWAP-DAG: xxswapd
; NOOPTSWAP-DAG: vadduwm
; NOOPTSWAP: vmuluwm
; NOOPTSWAP: xxswapd
; NOOPTSWAP-DAG: xxswapd
; NOOPTSWAP-DAG: xxswapd
; NOOPTSWAP-DAG: stxvd2x
; NOOPTSWAP-DAG: stxvd2x
; NOOPTSWAP: stxvd2x

1 change: 0 additions & 1 deletion test/CodeGen/PowerPC/vsx-ldst-builtin-le.ll
@@ -1,7 +1,6 @@
; RUN: llc -mcpu=pwr8 -mattr=+vsx -O2 -mtriple=powerpc64le-unknown-linux-gnu < %s > %t
; RUN: grep lxvd2x < %t | count 18
; RUN: grep stxvd2x < %t | count 18
; RUN: grep xxswapd < %t | count 36

@vf = global <4 x float> <float -1.500000e+00, float 2.500000e+00, float -3.500000e+00, float 4.500000e+00>, align 16
@vd = global <2 x double> <double 3.500000e+00, double -7.500000e+00>, align 16
1 change: 0 additions & 1 deletion test/CodeGen/PowerPC/vsx-ldst.ll
@@ -12,7 +12,6 @@
; RUN: llc -mcpu=pwr8 -mattr=+vsx -O2 -mtriple=powerpc64le-unknown-linux-gnu < %s > %t
; RUN: grep lxvd2x < %t | count 6
; RUN: grep stxvd2x < %t | count 6
; RUN: grep xxswapd < %t | count 12

@vsi = global <4 x i32> <i32 -1, i32 2, i32 -3, i32 4>, align 16
@vui = global <4 x i32> <i32 0, i32 1, i32 2, i32 3>, align 16
