- What is RISC?
- What do we mean when we say “operations”?
→ stuff that performs some sort of computation
→ logical, arithmetic
⇒ they can only manipulate registers, not memory
⇒ this means that LOAD and STORE are how we get things to and from memory, before or after performing the calculation
→ conditionals.
⇒ test a register against zero and conditionally redirect the PC (detailed under Branch/Jump below)
- register transfer language (RTL): a basic form of “assembly” not specific to any one CPU
- we can do operations that look like this:
→ R[x] ← R[y] <op> R[z], where R indicates a register reference and x, y, z index the register bank (General Purpose Registers)
→ R[x] ← R[y] <op> #Imm, where #Imm is an immediate value/constant known at compile time (sign extended; specify integer or unsigned)
→ <op> is either an arithmetic operation or a logical operation. “glorified lookup table”
→ *we assume that all our registers are 64 bit
→ Load-Store instructions:
⇒ load: R[x] ← Mem[R[y] + #Imm]
⇒ store: Mem[R[y] + #Imm] ← R[x]
⇒ again, playing with 3 fields in our instructions.
⇒ x, y: GPR
⇒ mem: data memory. Needs an address: Mem[address]
→ Branch or Jump instructions (Conditionals)
⇒ IF (R[x] <op> ZERO) THEN (PC ← PC + ##Imm)
⇒ first half: test the condition. The compiler does precomputations so you wind up testing against zero (a register that's always set to 0).
⇒ If true, we manipulate the program counter: we add an offset to it that the compiler figured out at compile time. (All three instruction forms are sketched in code right after this list.)
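A minimal sketch of these three instruction classes as plain Python functions. The register list R, the data memory dict Mem, and treating R[0] as the always-zero register are assumptions of this sketch, not part of the notes:

```python
import operator

# Toy model: registers as a Python list, data memory as a dict keyed by byte address.
R = [0] * 32            # general purpose registers (64-bit in the notes; plain Python ints here)
Mem = {}                # data memory
# R[0] is treated as the register that is always zero.

def alu_rr(x, y, z, op):        # R[x] <- R[y] <op> R[z]
    R[x] = op(R[y], R[z])

def alu_ri(x, y, imm, op):      # R[x] <- R[y] <op> #Imm (constant known at compile time)
    R[x] = op(R[y], imm)

def load(x, y, imm):            # R[x] <- Mem[R[y] + #Imm]
    R[x] = Mem[R[y] + imm]

def store(x, y, imm):           # Mem[R[y] + #Imm] <- R[x]
    Mem[R[y] + imm] = R[x]

def branch(pc, x, imm, cond):   # IF (R[x] <cond> 0) THEN PC <- PC + #Imm ELSE PC <- PC + 4
    return pc + imm if cond(R[x], 0) else pc + 4

R[1], R[2] = 5, 7
alu_rr(3, 1, 2, operator.add)           # R[3] = 12
store(3, 0, 0x10)                       # Mem[0x10] = 12
load(4, 0, 0x10)                        # R[4] = 12
print(branch(100, 4, -8, operator.eq))  # 104: R[4] != 0, so fall through to PC + 4
```

Note that the only functions that touch Mem are load and store; every other operation works purely on registers, which is the load-store property described above.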
Now what we're trying to do is build a pipeline, called a dataflow.
All our CPU does is the fetch, decode, execute cycle.
- first clock cycle, FETCH
→ This means that we access instruction memory
→ How do we know which instruction to access? PC tells us. PC is a register. Whatever happens to be there, we put into the Instruction register.
→ We also compute a “next sequential PC”: a predetermined offset of 4 bytes (one instruction) is added to reach the next instruction. This sets up the PC to point to the next sequential instruction.
→ IR ← Mem[PC]
→ NPC ← PC + 4
→ This may or may not actually be the next instruction to run, because maybe we are looking at a conditional. (The fetch step is sketched in code after this block.)
→ Now we have to chop out different fields from that instruction.
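Before the decode details, a minimal sketch of the fetch step just described. The dict-based state and the toy instruction memory imem are assumptions of this sketch:

```python
def fetch(state, imem):
    """IR <- Mem[PC]; NPC <- PC + 4."""
    state["IR"] = imem[state["PC"]]      # whatever the PC points at lands in the instruction register
    state["NPC"] = state["PC"] + 4       # next sequential PC; a branch may override this later
    return state

# Toy instruction memory keyed by byte address, one 4-byte instruction per entry.
imem = {0: ("ADD", 3, 1, 2), 4: ("BEQZ", 3, -8)}
state = {"PC": 0}
fetch(state, imem)
print(state["IR"], state["NPC"])   # ('ADD', 3, 1, 2) 4
```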
- Instruction decode/ Register fetch
→ let's say you pull out the two fields R[y] and R[z] from the instruction (source 1, source 2)
→ go to your bank of GPRs, a physical array. identify whatever happens to be in the registers and jam them into special purpose registers A, B
→ Source 2 could be an immediate value though
→ A ← Reg[IR(source1)] //extract source 1
→ B ← Reg[IR(source2)] //extract source 2
→ Imm (immediate Register) ← (sign ext.) ##IR(Immediate) //extract immediate.
→ you're extracting these fields from the value in the instruction register.
- What can we see?
→ GPR bank
→ Immediate
→ Instruction Register, indirectly
→ PC, from a logical POV but not physically.
- IR is the numerical value representing the fetched instruction, which we then decode.
→ Has the actual bytes that you fetched from PC. 4 bytes.
- During decode, you discover which of those three scenarios you need (see the decode sketch below).
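A minimal sketch of the decode/register-fetch step. The toy IR is a dict of named fields rather than real instruction bits, and the 16-bit immediate width matches the sign-extension discussion later; both are assumptions of this sketch:

```python
def sign_extend(value, bits=16):
    """Interpret a bits-wide bit pattern as a signed number (the sign-extension hardware)."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def decode(state, regs):
    """A <- Reg[IR(source1)]; B <- Reg[IR(source2)]; Imm <- sign-extended immediate field."""
    ir = state["IR"]
    state["A"] = regs[ir["src1"]]
    state["B"] = regs[ir["src2"]]
    state["Imm"] = sign_extend(ir["imm"])
    return state

regs = [0] * 32
regs[2] = 40
state = {"IR": {"src1": 2, "src2": 0, "imm": 0xFFF8}}   # 0xFFF8 is the 16-bit pattern for -8
decode(state, regs)
print(state["A"], state["B"], state["Imm"])   # 40 0 -8
```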
- Execution
→ we've extracted out registers A, B and Immediate.
→ R[y] is now conveniently in special purpose register A
→ one of 4 cases depending on the ID stage:
⇒ Memory Reference: ALU(out) ← A + Imm (R[y] + ##imm)
⇒ Register-Register ALU operation: ALU(out) ← A opcode B
⇒ Register-Immediate ALU operation: ALU(out) ← A opcode Imm
⇒ Branch: ALU(out) ← NPC + Imm; Cond ← A opcode 0
• Test: look at a register A, compare it to 0 (another SP register)
• opcode: equality/inequality, <, > etc
• NPC: PC + 4. If the condition succeeds, we will add an immediate value to the NPC to indicate the jump. Otherwise the default next instruction is executed (pc + 4)
• these are done in parallel. Don't act on it yet, just perform the calculation; we haven't changed the PC yet.
→ We have to cover all scenarios.
→ ALU(out) is another (special purpose) register. The ISA cannot see it, but it's local to the piece of hardware that has to perform the operation.
→ We are gradually keeping one foot in the ISA (software) POV and one in the hardware POV. In the hardware POV, everything takes a finite amount of time, ergo you have to be mindful of your signal's propagation (the clock gets placed in the center of the CPU). A sketch of the execute step follows this block.
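A minimal sketch of the execute step's four cases. The kind selector and the use of Python operator functions for <op> are assumptions of this sketch:

```python
import operator

def execute(state, kind, op=operator.add):
    """Compute into ALU(out) (and Cond for branches); nothing architectural changes yet."""
    if kind == "mem":                                # memory reference: effective address
        state["ALUout"] = state["A"] + state["Imm"]
    elif kind == "rr":                               # register-register ALU operation
        state["ALUout"] = op(state["A"], state["B"])
    elif kind == "ri":                               # register-immediate ALU operation
        state["ALUout"] = op(state["A"], state["Imm"])
    elif kind == "branch":                           # branch target and condition, in parallel
        state["ALUout"] = state["NPC"] + state["Imm"]
        state["Cond"] = op(state["A"], 0)            # e.g. operator.eq for "branch if zero"
    return state

state = {"A": 6, "B": 4, "Imm": -8, "NPC": 104}
execute(state, "branch", operator.eq)
print(state["ALUout"], state["Cond"])   # 96 False: A != 0, so this branch won't be taken
```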
- Next step: MEMORY
→ PC ← NPC (update Program Counter)
→ then one of the following (see the memory-step sketch after this block):
⇒ Memory reference:
• LMD ← Mem[ALU(out)]
◇ fetching from DATA memory (not instruction memory) and putting it in another register that the ISA can't see
• Mem[ALU(out)] ← B
◇ put whatever is in B into data memory at that address
• ALU calculates the addresses.
⇒ Branch
• IF (cond) THEN PC ← ALU(out) (the value calculated in previous step)
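A minimal sketch of the memory step. The data-memory dict dmem and the kind selector are illustrative assumptions:

```python
def memory(state, dmem, kind):
    """Update the PC, then do at most one data-memory access."""
    state["PC"] = state["NPC"]                   # PC <- NPC (default: next sequential instruction)
    if kind == "load":
        state["LMD"] = dmem[state["ALUout"]]     # LMD <- Mem[ALU(out)]
    elif kind == "store":
        dmem[state["ALUout"]] = state["B"]       # Mem[ALU(out)] <- B
    elif kind == "branch" and state["Cond"]:
        state["PC"] = state["ALUout"]            # a taken branch overrides the sequential PC
    return state

dmem = {0x20: 99}
state = {"NPC": 8, "ALUout": 0x20, "B": 7}
memory(state, dmem, "load")
print(state["LMD"], state["PC"])   # 99 8
```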
- Write-back
→ we need to update the registers to reflect the new values.
→ the thing we're writing is either the contents of register ALU(out) or LMD (the value we just loaded from memory); see the write-back sketch after this list
→ Reg-reg:
⇒ Reg[target] ← ALU(out)
→ Reg-imm
⇒ Reg[target] ← ALU(out)
→ Load:
⇒ Reg[target] ← LMD
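A minimal sketch of write-back. The kind selector and target (the destination-register field decoded from the IR) are illustrative assumptions:

```python
def write_back(state, regs, kind, target):
    """The destination register gets either ALU(out) or LMD; stores and branches write nothing."""
    if kind in ("rr", "ri"):
        regs[target] = state["ALUout"]   # Reg[target] <- ALU(out)
    elif kind == "load":
        regs[target] = state["LMD"]      # Reg[target] <- LMD
    return regs

regs = [0] * 32
write_back({"ALUout": 12, "LMD": 99}, regs, "load", target=5)
print(regs[5])   # 99
```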
- conditional (branch) and store instructions only need 4 clock cycles, since they skip write-back.
cycle 1: fetch
- PC accesses Instruction memory and jams the value into IR
- there's a component that just adds 4 to the value in PC and shoves that into NPC
cycle 2: decode
- chop up the bits
- depending on the content of IR, decide which registers you need as your source values. look them up in GPR bank. They are addressable.
- stick the values into A and B
- A and B are connected to different multiplexors, controlling 2 inputs to the ALU.
- by the end of the clock cycle, resolve which registers you need as operands, and jam their values in to A and B
- the other scenario is an immediate value. decode it from the IR and stick its value into Imm. Make sure it is sign extended
→ a piece of hardware that makes sure sign extension happens: it takes a 16-bit number and spits out a 32-bit one
→ connected to the same mux as B, different from A.
→ no point putting them in the same multiplexor because only one would get forwarded.
- decode is about setting up the right values in the multiplexor so we can actually do the calculation we want.
- the result of the upcoming calculation will be stored in ALU(out)
cycle 3: execute
- having loaded everything, we can do a calculation
- the conditional component is activated; depending on the result, either the value from ALU(out) or the value of NPC is taken as the new PC
cycle 4: memory
- have an address from ALU(out)
- could be performing a store, in which case simply write to memory
- if performing a load, put it in LMD
- update PC
cycle 5: writeback
- update general purpose registers
Different instructions might take different amounts of time
- branch and store: 4 cycles
- others: 5 cycles
- ALU instructions IDLE in the MEM stage, so an optimal implementation (without pipelining) would complete ALU instructions here, in 4 cycles instead of 5 (rough cycle counts are sketched below).
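Rough cycle accounting for the unpipelined datapath. The per-class counts follow the notes above; the example program is made up:

```python
# Cycles per instruction class on the unpipelined 5-stage datapath.
CYCLES = {
    "alu":    5,   # IF, ID, EX, MEM (idle), WB -- 4 if the idle MEM slot is optimized away
    "load":   5,   # IF, ID, EX, MEM, WB
    "store":  4,   # IF, ID, EX, MEM (no write-back)
    "branch": 4,   # IF, ID, EX, MEM (no write-back)
}

program = ["load", "alu", "alu", "store", "branch"]
total = sum(CYCLES[i] for i in program)
print(total, total / len(program))   # 23 cycles total, average CPI 4.6 without pipelining
```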
If we couldn't do pipelining, each instruction would take 5 clock cycles. If you could perfectly fill the pipe, CPI would be 1. This is the motivation for instruction-level parallelism: every component of the machine is executing a different instruction in parallel (see the quick comparison below).
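A back-of-the-envelope comparison, assuming a perfectly filled 5-stage pipe with no stalls or hazards (the instruction count N is arbitrary):

```python
N, STAGES = 1000, 5
unpipelined = N * STAGES               # each instruction occupies the whole machine for 5 cycles
pipelined = STAGES + (N - 1)           # fill the pipe once, then retire one instruction per cycle
print(unpipelined / N, pipelined / N)  # CPI 5.0 vs CPI 1.004, approaching 1 as N grows
```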