2023/06/28 - 15:23
Question: What is the resulting CPU time associated with each cache policy under the above conditions? (see pdf)
CPU time = IC x (CPI(exe) + MemStallCycles/Instr) x tclk
CPI(exe) → the ideal CPI (execution only, no memory stalls)
MemStallCycles/Instr → stall cycles caused by memory misses, per instruction
CPI(ideal)
= SUM( f(i) CPI(i)) where i is the instruction type
= 0.26 + 0.09 x 2 + 0.05 + 0.60 (the last term is the remaining ~60% of the instruction mix at CPI 1, needed to reach the stated total)
= 1.09
Mem stall cycles / instr → memory accesses / instr → miss rate and miss penalty
We need to split this into instruction accesses and data accesses.
Case of instructions:
frequency of instruction accesses is 100% (every instruction fetch accesses memory)
instruction cache has a miss rate of 0.5%
miss penalty = 50 cycles
0.5% x 50 = 0.25
Case of data:
As soon as we have a write buffer, we can ignore store instructions. They don't impact CPI because the buffer just chugs along in the background.
In this example, only the write-through cache has a write buffer.
freq:
WB: freqload + freqstore
WT: freqload (write buffer)
WB:
On a miss, if the block being replaced is clean, we only have one transfer. But if it's dirty, we have to write it back to MM first, so it involves two transfers.
Clean: costs 50 cycles, 70% of the time
Dirty: costs 100 cycles, 30% of the time
Weighted average: 0.7 x 50 + 0.3 x 100 = 65 cycles.
Now we can go back to memory stall cycles:
WT:
fload(26%) x miss penalty(50 cycles) x miss rate (1.5%)
= 0.195
WB:
(26 + 9)% x miss penalty (65 cycles) x miss rate (1.5%)
= 0.34125
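Putting the numbers above together, a minimal sketch in C, assuming the frequencies, miss rates, and penalties quoted in these notes; IC and tclk are left at 1 (the real values are in the PDF), so the printouts are effectively the CPI of each policy:

```c
#include <stdio.h>

/* Recompute the effective CPI for write-through (WT) and write-back (WB),
 * using the numbers from the notes above. IC and tclk are left at 1 because
 * the actual values are in the referenced PDF. */
int main(void)
{
    /* Instruction mix (from the CPI(ideal) calculation) */
    double f_load = 0.26, f_store = 0.09;

    /* CPI with an ideal memory system */
    double cpi_ideal = 1.09;

    /* Instruction fetches: 100% of instructions, 0.5% miss rate, 50-cycle penalty */
    double stalls_ifetch = 1.00 * 0.005 * 50;                 /* = 0.25 */

    /* Data accesses: 1.5% miss rate */
    double miss_rate_data = 0.015;

    /* WT: write buffer hides stores, so only loads stall; penalty 50 cycles */
    double stalls_data_wt = f_load * miss_rate_data * 50;     /* = 0.195 */

    /* WB: loads and stores both count; avg penalty 0.7*50 + 0.3*100 = 65 cycles */
    double penalty_wb = 0.7 * 50 + 0.3 * 100;
    double stalls_data_wb = (f_load + f_store) * miss_rate_data * penalty_wb; /* = 0.34125 */

    double cpi_wt = cpi_ideal + stalls_ifetch + stalls_data_wt;
    double cpi_wb = cpi_ideal + stalls_ifetch + stalls_data_wb;

    /* CPU time = IC * CPI * tclk; with IC = tclk = 1 this just prints the CPI */
    printf("WT: CPI = %.5f\n", cpi_wt);   /* 1.09 + 0.25 + 0.195   = 1.535   */
    printf("WB: CPI = %.5f\n", cpi_wb);   /* 1.09 + 0.25 + 0.34125 = 1.68125 */
    return 0;
}
```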
Reducing miss rate
3 performance properties:
- hit time
- miss rate
- miss penalty
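These three combine into the standard average-memory-access-time expression, AMAT = hit time + miss rate x miss penalty. A minimal sketch in C; the numbers are placeholders, not values from the exercise above:

```c
#include <stdio.h>

/* How the three properties interact: average memory access time (AMAT).
 * The example numbers below are placeholders, not from the exercise above. */
static double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

int main(void)
{
    /* e.g. 1-cycle hit, 2% miss rate, 50-cycle penalty -> 2.0 cycles on average */
    printf("AMAT = %.2f cycles\n", amat(1.0, 0.02, 50.0));
    return 0;
}
```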
If we look at things from the point of view of miss rate, we can talk about compulsory misses, a.k.a. cold-start misses. The larger the cache, the larger the number of compulsory misses needed to fill it. But this number is not really important since it only happens at start.
The two important ones are:
Capacity
- the entire program does not reside within the cache
- flushed blocks later require reloading
Conflict
- too many MM blocks mapped to the same cache block (direct-mapped and set-associative)
Reducing miss rate might have a negative effect on hit time. E.g. with direct mapping, there's no searching for a hit, you know right away. But the expense is that the number of conflicts goes up.
In practice, this means we lose a lot of temporal locality: blocks that conflict evict each other even when they were just used.
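A small sketch of that trade-off: in a direct-mapped cache the line is computed straight from the address, so the lookup checks exactly one line (fast hit), but addresses sharing an index always evict each other. The cache geometry here (64-byte blocks, 256 lines) is a made-up example, not from the lecture:

```c
#include <stdio.h>

/* Hypothetical direct-mapped cache: 64-byte blocks, 256 lines (16 KiB). */
#define BLOCK_SIZE 64u
#define NUM_LINES  256u

/* Direct mapping: the line is fully determined by the address, so a lookup
 * checks exactly one line (fast hit), but addresses that share an index
 * keep evicting each other (conflict misses). */
static unsigned line_index(unsigned addr)
{
    return (addr / BLOCK_SIZE) % NUM_LINES;
}

int main(void)
{
    unsigned a = 0x00010000, b = 0x00014000;   /* 16 KiB apart */
    printf("index(a) = %u, index(b) = %u\n", line_index(a), line_index(b));
    /* Both map to line 0: accessing a, b, a, b, ... misses every time,
     * even though both blocks are "hot" (lost temporal locality). */
    return 0;
}
```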
What if we make larger caches?
This would reduce miss rate, but increase hit time. A bigger cache takes longer to search.
This is why on-chip caches are usually small: we want to minimize the time to resolve a hit.
Larger caches are used deeper in the memory hierarchy (registers at the top of the pyramid, then L1, L2, then MM, then disk).
At L1, the most important property is hit time, because it's closest to the registers.
As we go deeper in the hierarchy, we want to shift toward minimizing misses, because the penalty increases as you go down: there's more handshaking involved and you've gone from on-chip to off-chip.
This means that the further down you go, the more we want to optimize the miss rate.
Larger block sizes
There's a certain (shallow) sweet spot in terms of block size. You're trading off the penalty of transferring a larger block against the reduced diversity of blocks in the cache. E.g. if you transfer in a large cache block, you can take advantage of it by sequentially accessing everything in the block (see the sketch below).
Diversity refers to the range of distinct addresses represented in the cache.
Once you pick a cache setup, you're stuck with it.
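A sketch of the sequential-access payoff mentioned above, assuming 64-byte blocks and 4-byte ints (my assumption, not lecture values): one miss brings in 16 elements, so a sequential scan misses roughly once every 16 accesses, while a stride of 16 ints touches a new block every time.

```c
/* Spatial locality vs. block size (assumed: 64-byte blocks, 4-byte ints,
 * so 16 array elements share one cache block). */
#define N (1 << 20)
int a[N];

long sum_sequential(void)
{
    long s = 0;
    for (int i = 0; i < N; i++)       /* ~1 miss per 16 elements: each miss   */
        s += a[i];                    /* brings in 15 subsequent hits          */
    return s;
}

long sum_strided(void)
{
    long s = 0;
    for (int i = 0; i < N; i += 16)   /* one access per block: every access    */
        s += a[i];                    /* is (roughly) a miss                   */
    return s;
}
```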
Higher associativity:
reduces conflict misses
Penalty: higher access times
Way prediction: predict which way of the set to check first, so the common case gets close to direct-mapped hit time.
Compiler optimizations
relevant to the project
We're familiar with row-major vs column-major accessing.
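A reminder sketch of what that looks like for a C array (C stores 2-D arrays row-major, so the first loop walks memory sequentially while the second jumps N elements per access; N here is an arbitrary example size):

```c
#define N 1024
double m[N][N];

double sum_row_major(void)   /* friendly: consecutive elements, good spatial locality */
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

double sum_col_major(void)   /* hostile: stride of N doubles, roughly one miss per
                                access once the touched blocks no longer fit in cache */
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```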
Blocking:
divides the rows/columns of the matrix into smaller chunks (tiles) so that they can fit in cache, since a full row/column of the array will not fit in cache once it reaches a certain size (see the sketch below).
This is an increasingly important concept since neural-network / ML algorithms are based on matrix multiplication.
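A minimal sketch of blocked (tiled) matrix multiplication; the matrix size N and tile size B are assumptions for illustration, not project requirements — in practice you pick B so that roughly three B x B tiles fit in the cache level you're targeting.

```c
#define N 1024
#define B 32   /* tile size: pick so ~3 * B * B * sizeof(double) fits in cache */

/* C += A * B, blocked so each tile of A, B and C is reused while it is still
 * resident in cache, instead of streaming whole rows/columns through it. */
void matmul_blocked(const double A[N][N], const double Bm[N][N], double C[N][N])
{
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++) {
                        double s = C[i][j];
                        for (int k = kk; k < kk + B; k++)
                            s += A[i][k] * Bm[k][j];
                        C[i][j] = s;
                    }
}
```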