Multi-issue: reducing CPI below 1

So far we've said thou shalt measure performance in terms of CPU time.
- CPU time, we've said, is equal to IC x CPI x tclk
- Where CPI is the most problematic. It is the most variable because of
→ sequence of instructions,
→ type of instruction,
→ architecture
- IC is pretty much fixed (determined by the compiler), and so is the clock period; we're saying both are out of the CPU's control.
- CPI is something we can alter.
- How can we minimize it?
→ pipelining: it's like a manufacturing process (an assembly line): divide the work into a series of equally sized steps, where each step takes one clock cycle.
→ some instructions might take 20 steps, others might take 2. But by dividing into steps, we get to the point where at every clock cycle we get an answer to SOMETHING, since something will be exiting the pipe.
→ different things can cause stalls. we classified their sources:
⇒ Data hazard
• RAW
• WAW
• WAR
⇒ Structural hazard
• the same piece of hardware can't be used simultaneously by two different instructions
• two instructions trying to writeback in the same cycle.
• competition for a hardware resource
⇒ Control hazard
• knowing the outcome of a branch
• at each clock cycle you want to fetch a new instruction, but if you don't know which way the branch is going you don't know which one to fetch
→ How do we combat these?
⇒ Branch prediction: combats control hazards
• Branch delay slot: relies on the compiler to find an instruction it can place after the branch to fill the time spent resolving it. Not the same as branch prediction.
• Branch target buffer: after the first time you come across a branch, you know its address (and where it goes). You stick that in a table; the next time you fetch from that address you get a hit in the table, so you know you're looking at a branch instruction even though you've only just fetched it.
◇ You immediately know you're looking at a conditional, and then you can consult your branch prediction.
◇ What we haven't covered yet: at the end of the day we still have to wait until the branch is resolved.
⇒ Data hazards:
• Tomasulo's algorithm
• All about register renaming, so you can spot independence between instructions; once an instruction has its operands, it can commence execution (see the hazard-classification sketch after this list)
⇒ Structural hazards:
• Be careful with how things are scheduled
• Comes back to Tomasulo: a dynamic way of scheduling execution
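
Rough Python sketch of the hazard-classification idea above. The (dest, [sources]) encoding of an instruction is made up for illustration; the point is just which registers get compared:

    # Hypothetical encoding: instruction = (dest_register, [source_registers]).
    # Classify how a later instruction `j` depends on an earlier instruction `i`.
    def classify_hazards(i, j):
        i_dest, i_srcs = i
        j_dest, j_srcs = j
        hazards = []
        if i_dest in j_srcs:
            hazards.append("RAW")  # j reads what i writes (true dependence)
        if i_dest == j_dest:
            hazards.append("WAW")  # both write the same register (output dependence)
        if j_dest in i_srcs:
            hazards.append("WAR")  # j overwrites a register i still needs to read
        return hazards

    # e.g. i = ADD R1,R2,R3 and j = SUB R4,R1,R5 -> RAW on R1
    print(classify_hazards(("R1", ["R2", "R3"]), ("R4", ["R1", "R5"])))  # ['RAW']

Only RAW is a true dependence; WAW and WAR are name dependences, which is why register renaming (Tomasulo) can remove them.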

Another way we can minimize CPI:
- right now the optimal CPI is 1. Why can't we do better?
- Because once we've filled the pipeline we can only complete one writeback per clock cycle. Two instructions can't occupy the same pipeline stage simultaneously.
- Even when we use Tomasulo, we can only write one result to the common data bus at a time.
- In other words, to do better than that, we have to fetch more than one instruction per clock cycle. Because if we could fetch two per clock cycle, they would be in writeback at the same time (see the quick calculation after this list).
- This is a structural hazard, so the amount of hardware duplication increases a lot. But whatever, we just need more hardware packed in there. This is not a huge deal because of the reduction in size of the components.
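
A quick back-of-the-envelope check in Python (the numbers are made up) of why issue width sets the floor on CPI: one completion per clock gives CPI = 1 at best, and an N-wide machine ideally gets CPI = 1/N:

    # Ideal (no-stall) CPI is 1 / issue_width: once the pipeline is full it
    # completes at most `issue_width` instructions every clock cycle.
    def ideal_cpi(issue_width):
        return 1.0 / issue_width

    instructions = 1_000_000  # hypothetical instruction count
    for width in (1, 2, 4):
        cycles = instructions * ideal_cpi(width)
        print(f"width {width}: CPI = {ideal_cpi(width):.2f}, cycles = {cycles:,.0f}")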

- Two paradigms to reduce CPI below 1
→ fetch one instruction, but it has several operations packed inside it
⇒ this has always been Intel's approach, even if they didn't realize it
⇒ Complex instructions mean that one ISA instruction maps to several lines of microcode.
⇒ this didn't work because it set up a bunch of constraints, e.g. we always need two integer operations and two FP operations per packed instruction. How many pieces of code do we write that meet that criteria? Not many. What we tend to get is a couple of integer instructions, then a burst of FP, then some more integer... or we could just be using integers the whole time. Only a small number of applications can meet such strict criteria; it tends to be stuff like digital signal processing, not general-purpose computing. (See the bundle sketch after this list.)
→ superscalar: fetching more than one instruction
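
Sketch of the "pack several operations into one fetched instruction" idea (VLIW-style bundles). The slot layout here, two integer + two FP, is just the hypothetical constraint from the notes; anything the compiler can't fill becomes a NOP, which is exactly why general-purpose code wastes so much of the bundle:

    # Hypothetical bundle: fixed slots for 2 integer ops and 2 FP ops.
    # Any slot the compiler can't fill from the program becomes a NOP.
    def pack_bundle(int_ops, fp_ops, int_slots=2, fp_slots=2):
        ints = int_ops[:int_slots]
        fps = fp_ops[:fp_slots]
        return (ints + ["NOP"] * (int_slots - len(ints)) +
                fps + ["NOP"] * (fp_slots - len(fps)))

    # Typical integer-heavy code: the FP slots are wasted.
    print(pack_bundle(["ADD R1,R2,R3", "SUB R4,R1,R5"], []))
    # ['ADD R1,R2,R3', 'SUB R4,R1,R5', 'NOP', 'NOP']  -- half the issue width idle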


Superscalar processors
- we always have to fetch one integer and one FP instruction. We're back to the same problem: this represents something you're hoping the compiler will achieve, but in practice it doesn't happen.
- The implication is that we have two fetch components, two decode, execution units for both integer and FP operations, and duplicate MEM and WB units. Everything needs to be completely independent.
- Superscalar processors have to pipeline the issue stage
→ stage 1 tests for hazards between the instructions fetched together
→ stage 2 tests for hazards against instructions already issued (sketched below)
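
Rough Python sketch of that two-stage issue check for a dual-issue machine. The (dest, [sources]) instruction encoding and the function names are made up; the point is just what gets compared against what at each stage:

    # Hypothetical encoding: instruction = (dest_register, [source_registers]).
    def conflicts(earlier, later):
        """True if `later` has a RAW, WAW, or WAR dependence on `earlier`."""
        e_dest, e_srcs = earlier
        l_dest, l_srcs = later
        return e_dest in l_srcs or e_dest == l_dest or l_dest in e_srcs

    def issue_stage(fetched_pair, in_flight):
        first, second = fetched_pair
        # Stage 1: hazards between the two instructions fetched together.
        pair_ok = not conflicts(first, second)
        # Stage 2: hazards against instructions already issued and still in flight.
        first_ok = all(not conflicts(prev, first) for prev in in_flight)
        second_ok = pair_ok and all(not conflicts(prev, second) for prev in in_flight)
        issued = [first] if first_ok else []
        if first_ok and second_ok:  # issue strictly in order
            issued.append(second)
        return issued

    in_flight = [("R7", ["R8", "R9"])]
    pair = (("R1", ["R2", "R3"]), ("R4", ["R1", "R5"]))  # RAW between the pair
    print(issue_stage(pair, in_flight))  # only the first instruction issues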

Dynamic scheduling:
- we can kind of already do this: Tomasulo gives us out-of-order execution.
- we want to just accept whatever instructions we are given and leave it to Tomasulo to sort them out. That's the hope, anyway.
- duplicate the fetch, decode, and issue components
→ all fetch does is generate an address and get something back from it.
→ this means we need twice as many buses, since we're fetching twice as many instructions per cycle.
→ Then we have decode, which answers the question of dependences between the two fetched instructions
→ issue looks for hazards further down the pipeline.
→ after that, Tomasulo will sort it out.
→ What actually happens? Here's the pipeline diagram for two iterations of a simple loop:

(columns are clock cycles; F = fetch, D = decode, I = issue, E = execute,
 W = writeback, X / - = stall, R = branch resolution)

First iteration:
    F0 <-- M[R1]         F D I E E W
    F4 <-- F0 + F2       F D I - - - E E E W
    M[R1] <-- F4         F D I E X X X X X E
    R1 <-- R1 - 8        F D I X E W
    BNE (R1, R2)         F D I X X R          <-- branch resolution
    (wrong-path instr)   F D I ... flushed, then re-fetch of the correct path

Second iteration:
    F0 <-- M[R1]         F D I E E W
    F4 <-- F0 + F2       F D I X X X E E E W
    M[R1] <-- F4         F D I E X X X X X E
    R1 <-- R1 - 8        F D I X E W
    BNE (R1, R2)         F D I X X R          <-- we can't commence execution until we know this result
    (wrong-path instr)   F                    flushed
    (next iteration)     F                    (BTB hit)


What is the new bottleneck from fetching two instructions?
- the issue is we have a ton of time between branches where we can't do anything. What is that?
- Branch resolution
- Enter: branch prediction and branch target buffer
- these let us predict the outcome of the branch without waiting for it to resolve (toy BTB sketch after this list).
- However...
- The fetches keep happening, but the execution stage still has to wait for branch resolution. This delay compounds: as more fetches come in, the stalls get added to each consecutive instruction.
- All the next instructions will get stuck in the queue with nowhere to go
- This is the problem
- The next thing we have to fix is the decoupling of issue and branch resolution
- commence execution speculatively, and when branch resolution finalizes, we either flush a bunch of stuff or keep going
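
Toy Python sketch of a branch target buffer combined with a 2-bit predictor (the indexing and sizes are simplified and made up): a hit on the fetch address tells you it's a branch before decode, and the counter supplies a guessed direction so fetch doesn't have to wait for resolution:

    class BranchTargetBuffer:
        """Toy BTB: maps a branch's own address to its target plus a 2-bit counter."""
        def __init__(self):
            self.entries = {}  # pc -> {"target": addr, "counter": 0..3}

        def lookup(self, pc):
            """On fetch: a hit means 'this is a branch'; predict taken if counter >= 2."""
            e = self.entries.get(pc)
            if e is None:
                return None  # miss: just fetch pc + 4 as usual
            return e["target"] if e["counter"] >= 2 else pc + 4

        def update(self, pc, target, taken):
            """After resolution: install/refresh the entry and train the counter."""
            e = self.entries.setdefault(pc, {"target": target, "counter": 1})
            e["target"] = target
            e["counter"] = min(3, e["counter"] + 1) if taken else max(0, e["counter"] - 1)

    btb = BranchTargetBuffer()
    btb.update(0x40, target=0x100, taken=True)  # first encounter, resolved taken
    print(hex(btb.lookup(0x40)))                # next fetch of 0x40 predicts 0x100

This is what lets the second loop iteration in the diagram above fetch the right target immediately (the "BTB hit").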

Speculative branch resolution?
- we add another component, the reorder buffer.
- When the resolution goes the opposite way from the speculation, you can flush the machine: the speculative state was only ever held in the reorder buffer, so the architectural state is preserved.
- Only release (commit) what's at the head of the reorder buffer.
- reservation stations no longer push results directly to the registers; they push to the reorder buffer.
- this is because we don't know yet whether the instruction will need undoing.
- if the head of the queue is an instruction whose result we already know, we write it to the machine
- but if the head of the queue is an unresolved branch, it won't let anything else write to the registers until the branch gets resolved (sketched below)
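
Rough Python sketch of that commit rule (the field names and structure are hypothetical, not any particular machine): results arrive out of order, but only the head of the buffer may update architectural state, and a mispredicted branch at the head flushes everything younger:

    from collections import deque

    class ReorderBuffer:
        """Toy ROB: results enter out of order, but retire strictly in order."""
        def __init__(self):
            self.entries = deque()

        def allocate(self, dest, is_branch=False):
            entry = {"dest": dest, "value": None, "ready": False,
                     "is_branch": is_branch, "mispredicted": False}
            self.entries.append(entry)
            return entry

        def write_result(self, entry, value=None, mispredicted=False):
            # Reservation stations broadcast here instead of straight to the registers.
            entry.update(value=value, ready=True, mispredicted=mispredicted)

        def commit(self, registers):
            """Retire ready entries from the head; stop at anything unresolved."""
            while self.entries and self.entries[0]["ready"]:
                head = self.entries.popleft()
                if head["is_branch"] and head["mispredicted"]:
                    self.entries.clear()  # flush all younger, speculative work
                    return "flush"
                if head["dest"] is not None:
                    registers[head["dest"]] = head["value"]  # in-order architectural update
            return "ok"

    regs = {}
    rob = ReorderBuffer()
    branch = rob.allocate(None, is_branch=True)  # unresolved branch sits at the head
    add = rob.allocate("R1")
    rob.write_result(add, 42)   # R1's result is ready...
    rob.commit(regs)
    print(regs)                 # {} -- nothing commits past the unresolved branch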
