- up until 2005ish the division unit was not pipelined.
- making things more complex (adding more transistors) requires more power
- we're always trying to make transistors smaller and faster, which creates a balancing act for power consumption


- ARM has CPUs that can do asynchronous/self-clocking (takes less power)

- Even though instructions are released sequentially, they don't necessarily finish in the order they were released; that introduces even more data hazards.

- pipeline diagram example (cycles 1–26). In the stage columns: X = stall cycle, M1–M7 = FP multiply stages, A1–A4 = FP add stages, R = branch resolve. Each row's F begins in the first cycle the fetch slot is free.

Without forwarding:

```
F0 <-- MEM[R2]    F D E M W
F4 <-- MEM[R3]    F D E M W
F0 <-- F0 X F4    F D X X M1 M2 M3 M4 M5 M6 M7 M W
F2 <-- F0 + F2    F X X D X X X X X X X X A1 A2 A3 A4 M W
R2 <-- R2 + #8    F X X X X X X X X D E M W
R3 <-- R3 + #8    F D E M W
R5 <-- R4 - R2    F D X E M W
BNEZ R5           F X D X X R (branch resolve)
(next instr)      F
```

With forwarding (>> = result forwarded directly to the consumer):

```
F0 <-- MEM[R2]    F D E M W
F4 <-- MEM[R3]    F D E M >> W
F0 <-- F0 X F4    F D X >> M1 M2 M3 M4 M5 M6 M7 M W
F2 <-- F0 + F2    F X D X X X X X X A1 A2 A3 A4 M W
R2 <-- R2 + #8    F X X X X X X D E M W
R3 <-- R3 + #8    F D E M W
R5 <-- R4 - R2    F D X E >> M W
BNEZ R5           F X D >> R (branch resolve)
(next instr)      F
```

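To get a feel for where those stalls come from, here's a rough dependence-based scheduling model in Python. It is a simplified sketch, not a reproduction of the diagram above: the latencies (load 3, FP multiply 7, FP add 4, integer 1) and the 2-cycle no-forwarding penalty are assumptions chosen only to show the qualitative effect.

```python
# Rough sketch: when can each instruction start executing, given its operand
# dependences? Latencies below are illustrative assumptions, not course data.

def schedule(instrs, forwarding):
    """instrs: list of (dest, sources, exec_latency). Returns start-of-execute cycles."""
    ready = {}          # register -> earliest cycle its value can be consumed
    starts = []
    slot = 1            # in-order issue: at most one new instruction per cycle
    for dest, srcs, lat in instrs:
        # without forwarding, model waiting for writeback instead of a bypass path
        extra = 0 if forwarding else 2
        operands = max([ready.get(s, 0) + extra for s in srcs], default=0)
        start = max(slot, operands)
        starts.append(start)
        ready[dest] = start + lat
        slot = start + 1    # the following instruction cannot start any earlier
    return starts

program = [
    ("F0", ["R2"], 3),        # F0 <-- MEM[R2]
    ("F4", ["R3"], 3),        # F4 <-- MEM[R3]
    ("F0", ["F0", "F4"], 7),  # F0 <-- F0 x F4
    ("F2", ["F0", "F2"], 4),  # F2 <-- F0 + F2
    ("R2", ["R2"], 1),        # R2 <-- R2 + #8
    ("R3", ["R3"], 1),        # R3 <-- R3 + #8
    ("R5", ["R4", "R2"], 1),  # R5 <-- R4 - R2
    ("R5b", ["R5"], 1),       # BNEZ R5 (no real dest; dummy name for the model)
]

for fwd in (False, True):
    label = "with forwarding" if fwd else "without forwarding"
    print(label, schedule(program, fwd))
```

Even this toy model shows the same qualitative picture as the diagrams: forwarding trims a few cycles, but the long multiply/add chain still serializes everything behind it.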

- This structure is inefficient. We spend too much time stalling, and forwarding can't soak up that many of the lost cycles.
- The “holy grail” of dynamic scheduling is for the hardware to figure that stuff out.

Sequence:
F0 ← F2 / F4 -- commence execution
F6 ← F0 + F8 -- stall
F8 ← F10 - F14 -- commence execution
F6 ← F10 X F8

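To make the conflicts in that sequence explicit, here's a small sketch that classifies the hazards between every ordered pair of instructions. The registers mirror the example above; the function and its representation are just for illustration.

```python
# Classify RAW / WAR / WAW hazards between every ordered pair of instructions.
def hazards(instrs):
    """instrs: list of (dest, sources); returns (earlier, later, kind) tuples."""
    found = []
    for i, (d_i, s_i) in enumerate(instrs):
        for j in range(i + 1, len(instrs)):
            d_j, s_j = instrs[j]
            if d_i in s_j:
                found.append((i, j, "RAW"))  # later one reads what the earlier one writes
            if d_j in s_i:
                found.append((i, j, "WAR"))  # later one overwrites a register the earlier one reads
            if d_j == d_i:
                found.append((i, j, "WAW"))  # both write the same register
    return found

seq = [
    ("F0", ["F2", "F4"]),    # F0 <-- F2 / F4
    ("F6", ["F0", "F8"]),    # F6 <-- F0 + F8   (stalls: RAW on F0)
    ("F8", ["F10", "F14"]),  # F8 <-- F10 - F14
    ("F6", ["F10", "F8"]),   # F6 <-- F10 x F8
]
print(hazards(seq))
# Only the RAW edges are true dependences; the WAR/WAW ones are just name conflicts.
```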
The way around this is to have the hardware give the registers different names to eliminate conflicts

F0 ← F2 / F4
F6(S) ← F0 + F8
MEM[0 + R1] ← F6(S)
----------------------------------
F8(T) ← F10 - F14
F6 ← F10 X F8(T)

the trick is: match each source against a previous instruction's target; that picks out the RAW hazards (the true dependences).

by relabelling those RAW links (and giving every new target a fresh name), the hardware can tell which instructions are safe to resequence.

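A tiny sketch of that renaming step: every destination gets a fresh name (T0, T1, ... are invented tags), and every source is rewritten to the newest name of the register it reads, so only the true RAW links survive.

```python
from itertools import count

def rename(instrs):
    """instrs: list of (dest, sources). Returns the same sequence with fresh destination names."""
    fresh = (f"T{i}" for i in count())
    newest = {}                                   # architectural register -> its newest name
    out = []
    for dest, srcs in instrs:
        srcs = [newest.get(s, s) for s in srcs]   # sources follow the newest name: RAW links kept
        newest[dest] = next(fresh)                # destination gets a brand-new name: WAR/WAW gone
        out.append((newest[dest], srcs))
    return out

seq = [
    ("F0", ["F2", "F4"]),
    ("F6", ["F0", "F8"]),
    ("F8", ["F10", "F14"]),
    ("F6", ["F10", "F8"]),
]
for (d, s), (rd, rs) in zip(seq, rename(seq)):
    print(f"{d:3} <- {s}   becomes   {rd:3} <- {rs}")
```

After renaming, the subtract and the multiply no longer touch the add's F6/F8 names, so they are free to run ahead of it.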
Issue: a queue of all the instructions you fetch.
- a list of instructions in the order they were fetched
- we look at the head: depending on which hardware unit is involved, we push the instruction to that unit, assuming the unit is free.
- load/store instructions wind up at the address unit
- the address unit has two more queues (store buffer, load buffer); only the head of each can access memory
- that forces in-order execution relative to the other loads/stores
- reservation stations are where register renaming takes place (sketched below)
- commence operation ASAP... i.e. as soon as both source values are known
- each functional unit has a set of reservation stations
- an unknown source value becomes a reservation station reference... because that station must be the one producing the value

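A minimal sketch of what issuing into a reservation station could look like, assuming a register file dict `regs`, a `producer` table mapping each register to the tag of the station that will write it, and a `stations` dict; all of these names are invented for illustration, not the actual hardware tables.

```python
class ReservationStation:
    def __init__(self, op):
        self.op = op
        self.val = {}   # source operands whose values are already known
        self.tag = {}   # source operands still waiting: which station will produce them

    def ready(self):
        # commence operation ASAP: as soon as every source value is known
        return not self.tag

def issue(op, dest, sources, rs_tag, regs, producer, stations):
    rs = ReservationStation(op)
    for s in sources:
        if s in producer:
            rs.tag[s] = producer[s]   # value unknown: keep a reference to the producing station
        else:
            rs.val[s] = regs[s]       # value known: copy it into the station
    producer[dest] = rs_tag           # the rename: dest will now come from this station
    stations[rs_tag] = rs
    return rs
```

For example, a multiply that needs F0 from a load still in flight issues with `rs.tag["F0"]` set to the load's station tag, and `rs.ready()` stays False until that value is broadcast.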
What's the bottleneck now?
- reservation stations can be a bottleneck if you run out of them, but they're not the biggest one
- the real issue: we release instructions in order to their respective execution units, but all of those units share a common data bus, and they're all trying to update the GPRs and the reservation stations over it

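Continuing the same sketch: the common data bus can be pictured as a single broadcast per cycle that both the register file and every waiting reservation station snoop. With only one bus, completed results have to queue up for it, which is why it becomes the choke point.

```python
def broadcast(rs_tag, value, dest, regs, producer, stations):
    # every waiting reservation station snoops the bus and captures the value
    for rs in stations.values():
        for src, t in list(rs.tag.items()):
            if t == rs_tag:
                rs.val[src] = value
                del rs.tag[src]
    # the register file updates only if this station is still the newest producer of dest
    if producer.get(dest) == rs_tag:
        regs[dest] = value
        del producer[dest]
```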
