2023/05/15 - 15:35
- what is forwarding?
→ it sure is (really: passing a result straight from one pipeline stage to a later instruction's stage, instead of waiting for it to be written back to the register file)
- What types of hazards were there?
→ data hazard:
⇒ we have:
• format: target ← src1 op src2
• R1 ← R2 + R3
• R4 ← R1 - R5
• R6 ← R1 ^ R7
• the value in R1 (target) will be used in all the following instructions as sources.
⇒ In the WB stage, we push R1 back to the GPR file. Only once the GPR has been updated can a later instruction read the value. Ergo, if a second instruction is fetched and tries to read R1 before WB has occurred, we discover a dependency. At that point, we would stall until the first instruction managed to get to the WB stage. This wastes 2 clock cycles
→ However, with forwarding, we can avoid this issue.
⇒ What stage makes the calculation that produces R1?
• the Execution stage of the first instruction — the one that produces R1.
• so the two instructions after it are the ones affected
⇒ forwarding introduces a bus from the ALU output back to its input
⇒ R1 is trapped in the latch between the ALU and the next stage. It is sitting in ALUout.
⇒ The value of ALUout moves down a stage at each clock cycle. So we need a bus of whatever precision these calculations are, and the bus goes from the output of the ALU to the input, and also from the output of the MEM stage latch back to the input.
⇒ We only need two tap points because of the two clock cycles that are lost (JK IT'S ONE BUS but it's connected to both things)
⇒ the two things that can write to the bus : ALU out (at the ALU stage) and latch from the MEM stage.
→ each stage of the pipe has to be independent.
→ the hardware checks for sources that use targets that haven't been written yet. There is hardware that checks for these dependencies to decide whether to forward values or not. It only has to check the first two instructions after the value is produced, since the third one will not run into an issue, thanks to half-cycle clocking for read/write (WB writes in the first half of the cycle, ID reads in the second half)
→ this is the case of a data hazard on a register-register instruction
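The dependency check described above can be sketched in Python. This is a rough model, not real hardware: the function name and the (target, value) bookkeeping are made up for illustration; only the two-latch comparison window comes from the notes.

```python
# Sketch of the forwarding check: hardware compares the current
# instruction's source registers against the targets of the two previous
# instructions still in flight (EX/MEM and MEM/WB latches). The
# third-oldest instruction is safe: half-cycle clocking lets its WB write
# land before the register read in the same cycle.

def forwarding_sources(curr_srcs, ex_mem_target, mem_wb_target):
    """For each source register, report where its value should come from."""
    plan = {}
    for src in curr_srcs:
        if src == ex_mem_target:        # produced one instruction ago
            plan[src] = "forward from ALUout (EX/MEM latch)"
        elif src == mem_wb_target:      # produced two instructions ago
            plan[src] = "forward from MEM/WB latch"
        else:
            plan[src] = "read from GPR file"
    return plan

# R1 ← R2 + R3 ; R4 ← R1 - R5 ; R6 ← R1 ^ R7
print(forwarding_sources(["R1", "R5"], ex_mem_target="R1", mem_wb_target=None))
print(forwarding_sources(["R1", "R7"], ex_mem_target="R4", mem_wb_target="R1"))
```

Note the if/elif order: if the same register were the target of both in-flight instructions, the newer value (EX/MEM) wins.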
There are other hazards:
Example: writeback data hazard
Code snippet:
R1 ← R2 + R3
R4 ← Mem[R1]
Mem[R1 + 12] ← R4
- which stage needs R1?
→ The ALU needs R1 on line 2 for the memory address calculation (R1 + 0).
→ the next line needs R4 which needs R1, but it doesn't need it until the memory stage (4th stage). However R1 is needed in the ALU stage for the address calculation of this instruction.
- Is there a problem?
- The earliest point in time that we know R1 is the ALU stage of the first instruction
→ 3rd clock cycle.
- We need to forward it to the ALU stage of the next instruction so it can perform an address calculation
- this can produce R4, but when do we KNOW R4?
→ After clock cycle 4 of this instruction, because we have to access data memory from the address we calculated.
→ then you need another bus that forwards the value from the end of the memory stage to the beginning of the memory stage.
- We still have to perform the address calculation with R1. How do we get it?
→ ALU out has been shuffled one clock cycle forward, and it needs to be bussed down to the ALU input of the 3rd instruction
→ that is the address calculation that will have R4 put into it.
- We still only need two buses: one goes from ALUout and the MEM-stage latch back to ALUin, and one goes from the MEM output back to the MEM input.
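The second bus (MEM output back to MEM input) can be sketched the same way. This is a hypothetical model with made-up names, assuming R1 = 100 and Mem[100] = 42, just to show the store picking up the load's result without waiting for writeback:

```python
# Sketch of MEM → MEM forwarding: a load's result is handed straight to
# the store that follows it, so the store's MEM stage never reads the
# stale GPR copy of R4.

def store_data(src_reg, gpr, mem_latch):
    """mem_latch is a (target_reg, value) pair at the MEM-stage output."""
    if mem_latch and src_reg == mem_latch[0]:
        return mem_latch[1]          # MEM → MEM forwarding bus
    return gpr[src_reg]              # no hazard: normal register read

# R4 ← Mem[R1] ; Mem[R1 + 12] ← R4, assuming R1 = 100 and Mem[100] = 42
memory = {100: 42}
mem_latch = ("R4", memory[100])      # load result at its MEM-stage output
memory[100 + 12] = store_data("R4", {"R4": 0}, mem_latch)
print(memory[112])                   # 42, without waiting for R4's WB
```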
Can we break this?
- is there a register-register error that would break this down?
When Forwarding Fails
R1 ← Mem[R2]
R4 ← R1 - R5
R6 ← R1 ^ R7
R8 ← R1 v R9
When do we know R1?
- Mem access stage (4th cycle)
But the next instruction needs R1 in the execution stage (cycle 3) (so now we're up the gum tree)
When we perform a load and immediately try to consume the value, we have to stall. There is no way to avoid it.
Forwarding the result of the memory access to an earlier stage would mean going backwards in time.
We can forward the result of mem to an execute stage as long as it stalls 1 cycle.
the next instruction also has to stall, because it cannot enter decode while the stalled instruction is still occupying that stage (the pipeline would go out of sync)
R1 ← Mem[R2]    IF ID EX MEM WB
R4 ← R1 - R5       IF ID X  EX MEM WB    (1-cycle stall, then MEM → EX forward of R1)
R6 ← R1 ^ R7          IF X  ID EX MEM WB
R8 ← R1 v R9                IF ID EX MEM WB
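The stall rule above can be stated as a tiny Python sketch (the function and its arguments are invented for illustration; the rule itself is from the notes):

```python
# Load-use stall rule: a value produced in MEM can be forwarded to the
# next instruction's EX only after a 1-cycle bubble, because MEM finishes
# after that EX would have needed the value.

def stalls_needed(producer_op, distance):
    """distance = how many instructions later the consumer appears."""
    if producer_op == "load" and distance == 1:
        return 1          # MEM result can't reach EX of the very next instr
    return 0              # ALU results (or farther consumers) forward cleanly

print(stalls_needed("load", 1))   # R4 ← R1 - R5 right after R1 ← Mem[R2]: 1
print(stalls_needed("load", 2))   # consumer two instructions later: 0
print(stalls_needed("alu", 1))    # register-register producer: 0
```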
The compiler tries to minimize the chance that things are going to stall.
EX:
if we have
a = b + c
d = e - f
if b and c are loads, we will load them first, but not attempt to do any calculations on them right away. Instead we will load e (another useful operation) to pad the clock cycles.
then we will perform the add.
then we will load f as padding, then store a. Delaying the store means we don't stall waiting on the load of f.
then we perform the operation on e and f, and then store it.
this leaves all the delay until the last operation.
LD b         IF ID EX MEM WB
LD c            IF ID EX MEM WB
LD e               IF ID EX MEM WB
ADD a=b+c             IF ID EX MEM WB    (c forwarded from its MEM stage)
LD f                     IF ID EX MEM WB
ST a                        IF ID EX MEM WB    (a forwarded from the ADD)
SUB d=e-f                      IF ID EX MEM WB    (f forwarded from its MEM stage)
ST d                              IF ID EX MEM WB    (d forwarded from the SUB)
this diagram is kinda a mess, add a picture of the board
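The scheduling win can be checked with a toy stall counter. This is a crude sketch, not a real scheduler: the instruction strings and the substring-based "does the next instruction read it?" test are simplifications for illustration.

```python
# Original vs compiler-scheduled order for a = b + c ; d = e - f.
original  = ["LD b", "LD c", "ADD a=b+c", "ST a",
             "LD e", "LD f", "SUB d=e-f", "ST d"]
scheduled = ["LD b", "LD c", "LD e", "ADD a=b+c",
             "LD f", "ST a", "SUB d=e-f", "ST d"]

def count_load_use_stalls(code):
    """Count 1 stall whenever a load feeds the very next instruction."""
    stalls = 0
    for prev, curr in zip(code, code[1:]):
        if prev.startswith("LD"):
            dest = prev.split()[1]           # e.g. "b" from "LD b"
            if dest in curr.split("=")[-1]:  # crude check of curr's sources
                stalls += 1
    return stalls

print(count_load_use_stalls(original))   # 2 (ADD after LD c, SUB after LD f)
print(count_load_use_stalls(scheduled))  # 0: every load is followed by padding
```

Same work in both orders; reordering alone removes both load-use stalls, which is exactly the point of the schedule above.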