- up until 2005ish the division unit was not pipelined.
- making things more complex (adding more transistors) requires more power
- we're always trying to make transistors smaller and faster, which creates a balancing act for power consumption


- ARM has CPUs that can do asynchronous/self-clocking (takes less power)

- Even though instructions are released sequentially, they don't necessarily finish in the order they were released; that introduces even more data hazards.

- pipeline diagram example (cycles 1–26). In the stage columns: X = stall cycle, M1–M7 = FP multiply stages, A1–A4 = FP add stages, R = branch resolve. Each row's F begins in the first cycle the fetch slot is free.

Without forwarding:

```
F0 <-- MEM[R2]    F D E M W
F4 <-- MEM[R3]    F D E M W
F0 <-- F0 X F4    F D X X M1 M2 M3 M4 M5 M6 M7 M W
F2 <-- F0 + F2    F X X D X X X X X X X X A1 A2 A3 A4 M W
R2 <-- R2 + #8    F X X X X X X X X D E M W
R3 <-- R3 + #8    F D E M W
R5 <-- R4 - R2    F D X E M W
BNEZ R5           F X D X X R (branch resolve)
(next instr)      F
```

With forwarding (>> = result forwarded directly to the consumer):

```
F0 <-- MEM[R2]    F D E M W
F4 <-- MEM[R3]    F D E M >> W
F0 <-- F0 X F4    F D X >> M1 M2 M3 M4 M5 M6 M7 M W
F2 <-- F0 + F2    F X D X X X X X X A1 A2 A3 A4 M W
R2 <-- R2 + #8    F X X X X X X D E M W
R3 <-- R3 + #8    F D E M W
R5 <-- R4 - R2    F D X E >> M W
BNEZ R5           F X D >> R (branch resolve)
(next instr)      F
```

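To get a feel for where those stalls come from, here's a rough dependence-based scheduling model in Python. It is a simplified sketch, not a reproduction of the diagram above: the latencies (load 3, FP multiply 7, FP add 4, integer 1) and the 2-cycle no-forwarding penalty are assumptions chosen only to show the qualitative effect.

```python
# Rough sketch: when can each instruction start executing, given its operand
# dependences? Latencies below are illustrative assumptions, not course data.

def schedule(instrs, forwarding):
    """instrs: list of (dest, sources, exec_latency). Returns start-of-execute cycles."""
    ready = {}          # register -> earliest cycle its value can be consumed
    starts = []
    slot = 1            # in-order issue: at most one new instruction per cycle
    for dest, srcs, lat in instrs:
        # without forwarding, model waiting for writeback instead of a bypass path
        extra = 0 if forwarding else 2
        operands = max([ready.get(s, 0) + extra for s in srcs], default=0)
        start = max(slot, operands)
        starts.append(start)
        ready[dest] = start + lat
        slot = start + 1    # the following instruction cannot start any earlier
    return starts

program = [
    ("F0", ["R2"], 3),        # F0 <-- MEM[R2]
    ("F4", ["R3"], 3),        # F4 <-- MEM[R3]
    ("F0", ["F0", "F4"], 7),  # F0 <-- F0 x F4
    ("F2", ["F0", "F2"], 4),  # F2 <-- F0 + F2
    ("R2", ["R2"], 1),        # R2 <-- R2 + #8
    ("R3", ["R3"], 1),        # R3 <-- R3 + #8
    ("R5", ["R4", "R2"], 1),  # R5 <-- R4 - R2
    ("R5b", ["R5"], 1),       # BNEZ R5 (no real dest; dummy name for the model)
]

for fwd in (False, True):
    label = "with forwarding" if fwd else "without forwarding"
    print(label, schedule(program, fwd))
```

Even this toy model shows the same qualitative picture as the diagrams: forwarding trims a few cycles, but the long multiply/add chain still serializes everything behind it.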

- This structure is inefficient. We spend too much time stalling, and forwarding can't soak up that many of the lost cycles.
- The “holy grail” of dynamic scheduling is for the hardware to figure that stuff out.

Sequence:
F0 ← F2 / F4 -- commence execution
F6 ← F0 + F8 -- stall
F8 ← F10 - F14 -- commence execution
F6 ← F10 X F8

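To make the conflicts in that sequence explicit, here's a small sketch that classifies the hazards between every ordered pair of instructions. The registers mirror the example above; the function and its representation are just for illustration.

```python
# Classify RAW / WAR / WAW hazards between every ordered pair of instructions.
def hazards(instrs):
    """instrs: list of (dest, sources); returns (earlier, later, kind) tuples."""
    found = []
    for i, (d_i, s_i) in enumerate(instrs):
        for j in range(i + 1, len(instrs)):
            d_j, s_j = instrs[j]
            if d_i in s_j:
                found.append((i, j, "RAW"))  # later one reads what the earlier one writes
            if d_j in s_i:
                found.append((i, j, "WAR"))  # later one overwrites a register the earlier one reads
            if d_j == d_i:
                found.append((i, j, "WAW"))  # both write the same register
    return found

seq = [
    ("F0", ["F2", "F4"]),    # F0 <-- F2 / F4
    ("F6", ["F0", "F8"]),    # F6 <-- F0 + F8   (stalls: RAW on F0)
    ("F8", ["F10", "F14"]),  # F8 <-- F10 - F14
    ("F6", ["F10", "F8"]),   # F6 <-- F10 x F8
]
print(hazards(seq))
# Only the RAW edges are true dependences; the WAR/WAW ones are just name conflicts.
```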
The way around this is to have the hardware give the registers different names to eliminate conflicts

F0 ← F2 / F4
F6(S) ← F0 + F8
MEM[0 + R1] ← F6(S)
----------------------------------
F8(T) ← F10 - F14
F6 ← F10 X F8(T)

the trick is: match each source against a previous instruction's target; that picks out the RAW hazards (the true dependences).

by relabelling those RAW links (and giving every new target a fresh name), the hardware can tell which instructions are safe to resequence.

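A tiny sketch of that renaming step: every destination gets a fresh name (T0, T1, ... are invented tags), and every source is rewritten to the newest name of the register it reads, so only the true RAW links survive.

```python
from itertools import count

def rename(instrs):
    """instrs: list of (dest, sources). Returns the same sequence with fresh destination names."""
    fresh = (f"T{i}" for i in count())
    newest = {}                                   # architectural register -> its newest name
    out = []
    for dest, srcs in instrs:
        srcs = [newest.get(s, s) for s in srcs]   # sources follow the newest name: RAW links kept
        newest[dest] = next(fresh)                # destination gets a brand-new name: WAR/WAW gone
        out.append((newest[dest], srcs))
    return out

seq = [
    ("F0", ["F2", "F4"]),
    ("F6", ["F0", "F8"]),
    ("F8", ["F10", "F14"]),
    ("F6", ["F10", "F8"]),
]
for (d, s), (rd, rs) in zip(seq, rename(seq)):
    print(f"{d:3} <- {s}   becomes   {rd:3} <- {rs}")
```

After renaming, the subtract and the multiply no longer touch the add's F6/F8 names, so they are free to run ahead of it.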
Issue: a queue of all the instructions you fetch.
- a list of instructions in the order they were fetched
- we look at the head: depending on which hardware unit is involved, we push the instruction to that unit, assuming the unit is free.
- load/store instructions wind up at the address unit
- the address unit has two more queues (store buffer, load buffer); only the head of each can access memory
- that forces in-order execution relative to the other loads/stores
- reservation stations are where register renaming takes place (sketched below)
- commence operation ASAP... i.e. as soon as both source values are known
- each functional unit has a set of reservation stations
- an unknown source value becomes a reservation station reference... because that station must be the one producing the value

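A minimal sketch of what issuing into a reservation station could look like, assuming a register file dict `regs`, a `producer` table mapping each register to the tag of the station that will write it, and a `stations` dict; all of these names are invented for illustration, not the actual hardware tables.

```python
class ReservationStation:
    def __init__(self, op):
        self.op = op
        self.val = {}   # source operands whose values are already known
        self.tag = {}   # source operands still waiting: which station will produce them

    def ready(self):
        # commence operation ASAP: as soon as every source value is known
        return not self.tag

def issue(op, dest, sources, rs_tag, regs, producer, stations):
    rs = ReservationStation(op)
    for s in sources:
        if s in producer:
            rs.tag[s] = producer[s]   # value unknown: keep a reference to the producing station
        else:
            rs.val[s] = regs[s]       # value known: copy it into the station
    producer[dest] = rs_tag           # the rename: dest will now come from this station
    stations[rs_tag] = rs
    return rs
```

For example, a multiply that needs F0 from a load still in flight issues with `rs.tag["F0"]` set to the load's station tag, and `rs.ready()` stays False until that value is broadcast.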
What's the bottleneck now?
- reservation stations can be a bottleneck if you run out of them, but they're not the biggest one
- the real issue: we release instructions in order to their respective execution units, but all of those units share a common data bus, and they're all trying to update the GPRs and the reservation stations over it

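Continuing the same sketch: the common data bus can be pictured as a single broadcast per cycle that both the register file and every waiting reservation station snoop. With only one bus, completed results have to queue up for it, which is why it becomes the choke point.

```python
def broadcast(rs_tag, value, dest, regs, producer, stations):
    # every waiting reservation station snoops the bus and captures the value
    for rs in stations.values():
        for src, t in list(rs.tag.items()):
            if t == rs_tag:
                rs.val[src] = value
                del rs.tag[src]
    # the register file updates only if this station is still the newest producer of dest
    if producer.get(dest) == rs_tag:
        regs[dest] = value
        del producer[dest]
```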
