2023/05/10

Hazards

- the holy grail is a CPI of 1, each stage is executing a separate instruction, independent of one another
- what could interfere with this?
→ if we had R[3] = R[7] + 75
? = R[3] + ?
→ we won't know this result until at least 2 clock cycles later
→ this is called a data hazard
→ what other hazards are there?
⇒ case: conditional
⇒ an if statement that modified the program counter. which stage is it going to alter it?
• during the memory stage
• you would end up losing 2 clock cycles as you flush the incorrect ones that started before the mem stage
• this is call a control hazard
• sometimes the correct one will be sequential and you wont have to flush anything, but you dont know what will happen
→ another hazard:
⇒ if your're not pipelining and you only have one instruction, once you perform the fetch, you dont need any of the fetch hardware because you've changed focus to the decode hardward. only 1/5th of your architecture is used at any one time.
⇒ You dont need both the ALU and the add unit, since they'll never be used concurrently. you could just add with the ALU since it wont be used for another 2 clock cycles
⇒ but if you're trying to pipeline, you need a dedicated add unit since the ALU will be active most of the time
⇒ this is call a structural hazard
⇒ another structural hazard: competition for the same hardware resource
• you have instructio nmemory and data memory. if we dont pipeline, both can be hung on the same bus. but as soon as you pipeline, you can't get away with that, because fetches on instruction memory will happen concurrently with accesses to data memory. you then need two buses. you need to budget your hardware
- 3 hazards we have to design our way around.
- without pipelining, instruction mem and data mem only need to be accessed every 5 clock cycles. with pipelining, they need to be accessed every cycle. separate buses necessary
- need 5 different instructio registers because the IR is necessary in each stage for each instruction
→ a whole bunch of bookkeeping needed in hardware.
→ register NPC needs to be duplicated 3 times
→ A twice, B 3 times
→ all because there are 5 instructions present in the cpu at a time.
→ this adds additional hardware complexity.
- another constraint:
→ we want to write to and read from the general purpose registers
→ there is potentially a conflict
→ one instruction may be writing to registers at the same time as another one reads.
→ this seems like a structural conflict.. but we can get around it
→ reading and writing is not time consuming. the trick is to write on the first HALF of the clock cycle (state it updated) Aand read on the second half of the clock cycle.
→ (we've wiggled our way out of that constraint)
- set of physical latches that retain the relevant registers. so we have access to 5 different versions.
→ pushes the IR from the instruction along to the next phase to make room for the next one
→ stores everything that may need to be propagated along the line, such as a, b, NPC
→ set of registers
→ managed by hardware
- these are as wide as the machine word length.

In theory, we can get a CPI of 1. In practice, hazards reduce the efficiency that we can actually squeeze out.
some simple “back of the envelope style” calculations:

example 1:

w.r.t the naive 5 stage piped and unpiped arhictectures, let
- clock period: 0.5ns
alu and branch take 4 cycles
memory take 5
req of alu: 40%
freg of branch: 20%
freq of mem: 40%
additional complexity of a pipeline addds 0.1ns to the period

assume IC unchanged
CPU time == CPI and tclk == sum of CPI*freq of each time of instruction

some instructions will take 4 cycles and some will take 5.
(freq of 4 cycle instruction) * 4 + (freq of 5 cycle instruction) * 5
0.6*4 + 0.4*5
result * 0.5ns = 2.2ns
for unpipelined

for pipelined(case of instruction level parallelism):
assume “pipeline is full”
ideal CPI is 1 because the pipeline is full
tclk = 0.6, because pipeline
cpu time = 0.6ns

speedup: 2.2/0.6 = 3.666

example 2:

- architecture that doesnt have separate instruction and data buses
- CPI of pipe without stalls is 1
- data load/store are 40% of the mix
- clock frequency of the the pipe encountering stalls is 1.05 times higher than the pipe without stalls (but having the complexity of dual instruction and data buses)

conflict: data memory and instruction fetch
we have to jam a stall cycle into the pipe.
always stall the later instruction.

Case of architecture with structural hazard:

- cPI = CPI ideal + CPI stalls
- ideal has a frequency of 100% and costs 1 clock cycle (100% * 1)
- 40% of the time we encounter a stall, which costs 1 cycle (0.4 * 1)
- (100% * 1) + (0.4 *1) OR (0.6 * 1 + 0.4 * 2) = 1.4
- clock frequency: tclk = 1/1.05 f
- cpu time = 1.4/1.05f = 1.33/f where f is the frequency

Ideal case (no stalls)

- CPU time = CPI ideal * tclk = 1 * 1/f
- speedup overall = time(old)/ time(new) = 1.33
- always trying to avoid stall cycles
- if such a stall is so detrimental to the operation of the pipeline, are there any conditions under which stuctural hazards might be permitted?
→ if you dont have enough space to have two buses on your motherboard
→ space is always critical
→ maybe you have enough time that the stalls arent really a problem
→ depends on application context

Pipeline data hazards:

ex: writeback data hazard
decode and writeback have the half clockcycle.
since the next clock cycle needs r1, you can feed r1 back into the alu, the next instruction can use it.
otherwise, we would have to wait untiil the value of r1 was written to the gpr.
by putting it back through the ALU we dont have to wait until writeback ot use the value
this is called forwarding
it has an expense because you have to support a bus from the output of the alu back to its input, but you avoid stall cycles.
we can identify whether its needed in a future instruction by inspecting the state of the latches. r1 is in the IR of the instructions, as either a source or a target.

Index