Lecture 10

4 Software Pipelining (global scheduling)

• loop-carried dependencies limit the effectiveness of loop unrolling
• example Taylor series for $$e^x$$ (at $$a=0$$): $$e^x=\frac{x^0}{0!}+\frac{x^1}{1!}+\frac{x^2}{2!}+\cdots$$
• initial register values: F0 = x, F2 = i = 0, F4 = x^i = 1, F6 = i! = 1, F8 = e^x (running sum) = x^0/0! = 1
instr   mem1  mem2  fp1               fp2               int/br
Loop:1  -     -     ADD.D F2,F2,1.0   -                 -
2       -     -     -                 -                 -
3       -     -     -                 -                 -
4       -     -     MUL.D F4,F4,F0    MUL.D F6,F6,F2    -
5       -     -     -                 -                 -
6       -     -     -                 -                 -
7       -     -     DIV.D F10,F4,F6   -                 SUB R1,R1,1
8       -     -     -                 -                 -
9       -     -     -                 -                 BGTZ R1,Loop
10      -     -     ADD.D F8,F8,F10   -                 -
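For reference, each pass through the loop above adds one Taylor term to the running sum. A minimal Python sketch of the same computation follows, with each statement annotated by the instruction it mirrors. The function name `exp_taylor` and the parameter `n_terms` (standing in for the initial value of R1) are assumptions made for illustration, not part of the lecture.

```python
import math


def exp_taylor(x, n_terms):
    """Sequential version of the loop: one Taylor term per iteration."""
    i = 0.0       # F2 = i
    xi = 1.0      # F4 = x^i
    fact = 1.0    # F6 = i!
    total = 1.0   # F8 = x^0/0!
    for _ in range(n_terms):   # R1 counts the remaining iterations
        i += 1.0               # ADD.D F2,F2,1.0
        xi *= x                # MUL.D F4,F4,F0
        fact *= i              # MUL.D F6,F6,F2 (needs the new i)
        term = xi / fact       # DIV.D F10,F4,F6 (needs both products)
        total += term          # ADD.D F8,F8,F10 (needs the quotient)
    return total
```

Every statement depends on the one before it, which is why the schedule above leaves most issue slots empty.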
• Kernel Dataflow
• software pipelined loop (without prolog or epilog)
instr   mem1  mem2  fp1               fp2               int/br
Loop:1  -     -     ADD.D F2,F2,1.0   -                 -
2       -     -     MUL.D F4,F4,F0    MUL.D F6,F6,F2    BGTZ R1,Loop
3       -     -     DIV.D F10,F4,F6   ADD.D F8,F8,F10   SUB R1,R1,1

The reads of F2, F4, F6, and F10 assume no stalling: each one picks up the value written in the previous iteration.
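To see why reading the previous iteration's values still gives the right sum, the kernel can be simulated with every read taking the register contents from before the pass. The sketch below is only an illustration under stated assumptions: the prolog (which the notes omit) is modeled by running the kernel with its later stages switched off, the epilog is skipped so work still in flight is simply discarded, and the names `kernel_pass` and `exp_pipelined` are hypothetical.

```python
import math


def kernel_pass(state, x, active=(True, True, True, True)):
    """One pass of the 3-cycle kernel; all reads see the pre-pass values."""
    f2, f4, f6, f10, f8 = state
    add_i, mul, div, acc = active   # which pipeline stages are enabled
    return (
        f2 + 1.0 if add_i else f2,  # ADD.D F2,F2,1.0
        f4 * x if mul else f4,      # MUL.D F4,F4,F0
        f6 * f2 if mul else f6,     # MUL.D F6,F6,F2  (old F2)
        f4 / f6 if div else f10,    # DIV.D F10,F4,F6 (old F4, F6)
        f8 + f10 if acc else f8,    # ADD.D F8,F8,F10 (old F10)
    )


def exp_pipelined(x, n_terms):
    state = (0.0, 1.0, 1.0, 0.0, 1.0)  # F2, F4, F6, F10, F8
    # Assumed prolog: fill the pipeline one stage at a time.
    state = kernel_pass(state, x, (True, False, False, False))
    state = kernel_pass(state, x, (True, True, False, False))
    state = kernel_pass(state, x, (True, True, True, False))
    for _ in range(n_terms):           # steady-state kernel
        state = kernel_pass(state, x)
    return state[4]                    # F8; no epilog, in-flight work is dropped
```

In steady state each pass finishes one iteration's accumulate while starting the increment for an iteration three ahead, so the accumulated sum covers the same terms as the sequential loop.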

• if the VLIW detected dependencies and stalled, the software-pipelined schedule above would not work
• rotating registers would be needed
• software pipelined loop with rotating registers
instr   mem1  mem2  fp1                fp2                int/br
Loop:1  -     -     ADD.D RR0,1.0      -                  -
2       -     -     MUL.D RR5,RR1,F0   MUL.D RR6,RR2,RR0  BGTZ R1,Loop
3       -     -     DIV.D RR7,RR1,RR2  ADD.D F8,F8,RR3    RRB+=4
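The renaming effect of RRB can be sketched as a small register file in which a logical name RRn maps to a physical slot offset by the rotating register base. This is a rough model, not a description of any particular machine: the class name `RotatingRegisters`, the file size of 8, and the mapping direction (physical = (n + RRB) mod size) are all assumptions chosen so that a value written to RR5 is read as RR1 after `RRB+=4`, as in the schedule above.

```python
class RotatingRegisters:
    """Toy rotating register file: logical RRn -> physical (n + RRB) mod size."""

    def __init__(self, size):
        self.phys = [0.0] * size
        self.rrb = 0  # rotating register base

    def _index(self, n):
        return (n + self.rrb) % len(self.phys)

    def read(self, n):
        return self.phys[self._index(n)]

    def write(self, n, value):
        self.phys[self._index(n)] = value

    def rotate(self, step):
        # Models RRB += step at the end of a kernel iteration.
        self.rrb = (self.rrb + step) % len(self.phys)


rr = RotatingRegisters(8)   # size 8 is an assumption for this demo
rr.write(5, 2.5)            # a result written to RR5 in one iteration
rr.rotate(4)                # RRB += 4
value = rr.read(1)          # the same physical slot is now named RR1
```

Because each iteration writes to fresh physical slots, a result from the previous iteration is not overwritten before it is consumed, even if the hardware stalls.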

• Memory disambiguation is determining whether a load access and a store access are aliased (refer to the same memory location)
• This cannot always be done at compile time
• loads usually cannot be hoisted above stores because they cannot be disambiguated at compile time
• example
SW  R12, (R4)   0 stall
LW  R6, (R8)    0 stall     hoisting LW above SW is incorrect if R4==R8
ADD R5, R6, R7  1 stall
SW  R5, (R10)   0 stall
• loads and stores are often not aliased, so speculatively hoist the load above the store and add a guardian (check) instruction at the load's original location
LW.S    R6, (R8)
SW      R12, (R4)
LW.C    R6, (R8)
SW      R5, (R10)
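The effect of the guardian instruction can be sketched by comparing the original order with the hoisted order on a toy memory. This is a behavioral sketch only: how real hardware remembers the speculative load's address is not covered in these notes, so the `watched` variable is an assumption, as are the function names `in_order` and `speculative`. The `ADD R5,R6,R7` from the original sequence is included so that both versions store the same result.

```python
def in_order(mem, r4, r8, r10, r12, r7):
    """Original order: the load follows the store it may alias."""
    mem = dict(mem)
    mem[r4] = r12        # SW  R12,(R4)
    r6 = mem[r8]         # LW  R6,(R8)
    r5 = r6 + r7         # ADD R5,R6,R7
    mem[r10] = r5        # SW  R5,(R10)
    return mem


def speculative(mem, r4, r8, r10, r12, r7):
    """Hoisted order: load early, then check at the original location."""
    mem = dict(mem)
    r6 = mem[r8]         # LW.S R6,(R8): speculative load
    watched = r8         # assumed: the load's address is remembered
    mem[r4] = r12        # SW   R12,(R4)
    if r4 == watched:    # LW.C R6,(R8): reload only if the store aliased
        r6 = mem[r8]
    r5 = r6 + r7         # ADD  R5,R6,R7
    mem[r10] = r5        # SW   R5,(R10)
    return mem


memory = {0: 10, 4: 20, 8: 30}
# Not aliased (R4 != R8) and aliased (R4 == R8) cases give the same result.
same_when_distinct = speculative(memory, 0, 4, 8, 99, 1) == in_order(memory, 0, 4, 8, 99, 1)
same_when_aliased = speculative(memory, 4, 4, 8, 99, 1) == in_order(memory, 4, 4, 8, 99, 1)
```

In the common non-aliased case the check does nothing and the load latency is hidden behind the store; only the aliased case pays for a second load.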
• IA-64 (Itanium)
ld8.a   r6=[r8];;   advanced load
st8     [r10]=r5