- Why is miss penalty an issue? What contributes to miss penalty? Where does it come from?
→ cost of accessing a bus
→ block size
→ Why don't we just transfer really small blocks?
⇒ this would affect miss frequency
⇒ you then decrease the amount of spatial locality you get
→ handshaking involves time to resolve a hit... does technology have an impact, depending on how the cache is made? What are the options for cache memory...
⇒ SRAM
• static
• all about speed
• this one is all transistors
• hierarchy close to the CPU relies more on SRAM
• less dense / fewer addresses, implicitly faster
⇒ DRAM
• dynamic
• all about density
• this one has a capacitor, which needs to be recharged
• as you go down the hierarchy you rely more on DRAM
• slower, which impacts handshaking time
• also holds more, which affects search time
- we're going to get a gap between the rate the CPU operates at and the rate memory can keep up
- SRAM can mostly keep up, but as we go deeper into the hierarchy things slow down
- we wind up with a mismatch, which is where multilevel caches come into play, so we can blend properties. Depending on the level we're at in the hierarchy, we'll emphasize SRAM or DRAM.
- AMAT for L1 cache = hit time L1 + miss rate L1 * miss penalty L1
→ miss penalty L1 = hit time L2 + miss rate L2 * miss penalty L2
- miss penalty is recursive (see the sketch after the miss rate definition below)
- local cache miss rate = (number of misses at that cache level) / (number of CPU memory references that reach that cache level)
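The recursion can be written as a small sketch. This is my own illustrative helper (not from the lecture), assuming each level is described by a hit time and a local miss rate, and that the last level's miss penalty is the main-memory access time:

```python
def amat(hit_times, local_miss_rates, miss_penalty_last):
    """Average memory access time for a multilevel cache.

    hit_times[i]        -- hit time of cache level i (clock cycles)
    local_miss_rates[i] -- misses at level i / references reaching level i
    miss_penalty_last   -- penalty once the last cache level misses
    """
    if not hit_times:
        return miss_penalty_last
    # the miss penalty of this level is just the AMAT of the levels below it
    return hit_times[0] + local_miss_rates[0] * amat(
        hit_times[1:], local_miss_rates[1:], miss_penalty_last)

# e.g. a two-level hierarchy: 1-cycle L1, 10-cycle L2, 100-cycle memory penalty
print(amat([1, 10], [0.04, 0.5], 100))  # ≈ 3.4
```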
Example:
Consider the number of misses per 1000 instructions
- at L1, its 40
- at L2, its 20
- we'll always find the number of misses at L1 is higher, but the L2 miss penalty is going to be greater
- Assume L2 penalty is 100 clk cycles
- L2 hit time is 10 clk (L2 hit time is also gonna be pretty high)
- L1 hit time is 1 clk
- # of data references per instruction: 1.5
→ what kind of operation does this characterize? What instructions?
→ loads and stores
- What's the average memory access time?
→ Miss rate L1: 40/1000 → 4%
→ Miss rate L2: 20/40 → 50%
⇒ 40 are the misses from L1
⇒ L1 has more of the things that are easy to hit on, so the misses in L2 are more of the ugly stuff.
→ Global miss rate at L2: 20/1000 → 2%
→ What we actually want is avg mem access time...
→ AMAT = HT + MR * MP
→ avg mem access time = hit time + miss rate * miss penalty
→ MP(L1) = HT(L2) + MR(L2) * MP(L2)
→ AMAT(L1) = 1 + (4% * (10 + 50%*100)) = 3.4 clock cycles
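The same calculation as a quick sketch (the numbers are the lecture's, the variable names are mine):

```python
# Plugging in the numbers from the example above
misses_L1, misses_L2 = 40, 20            # misses per 1000 instructions
references = 1000                        # the notes use 1000 as the reference count
hit_L1, hit_L2, penalty_L2 = 1, 10, 100  # clock cycles

miss_rate_L1 = misses_L1 / references          # 40/1000 = 4%
miss_rate_L2 = misses_L2 / misses_L1           # 20/40 = 50% (local: only L1 misses reach L2)
global_miss_rate_L2 = misses_L2 / references   # 20/1000 = 2%

miss_penalty_L1 = hit_L2 + miss_rate_L2 * penalty_L2  # 10 + 0.5*100 = 60
amat = hit_L1 + miss_rate_L1 * miss_penalty_L1        # 1 + 0.04*60
print(amat)                                           # ≈ 3.4 clock cycles
```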
What if we wanted stalls per instruction?
- When we talk about stalls, we're actually talking about memory references. So we should talk about the difference between hits and misses, and how frequently we generate those requests.
- Stalls per instruction will involve AMAT.
→ (AMAT(3.4) - hit time (1)) * data refs per instruction (1.5)
→ (3.4 - 1) * 1.5 = 3.6 clock cycles per instruction
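And the stalls-per-instruction step, again as a small sketch using the lecture's numbers:

```python
amat, hit_time_L1 = 3.4, 1   # clock cycles, from the AMAT calculation above
refs_per_instruction = 1.5   # data references (loads/stores) per instruction

stalls_per_instruction = (amat - hit_time_L1) * refs_per_instruction
print(stalls_per_instruction)  # (3.4 - 1) * 1.5 ≈ 3.6 clock cycles per instruction
```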
Memory hierarchy is the glue that we use to combine properties.
Another example
Let a 2-level cache be configured with a directly mapped L1 cache that has a 4% miss rate and a hit time of 1 clock cycle. The L2 cache may be directly mapped or 2-way set associative, as below. What hit time would the 2-way L2 cache need to have in order to better the performance of the 2-level cache configured with directly mapped caches at both levels?
| L2 configs | directly mapped | 2-way set associative |
|---|---|---|
| hit time | 10 clock cycles | unknown |
| miss rate | 25% | 20% |
| miss penalty | 200 clk | 200 clk |
AMAT = HT(L1) + MR(L1) * (HT(L2) + MR(L2) * MP(L2))
= 1 + 0.04(10 + 0.25 * 200)
= 3.4
= target for max hit time of 2-way assoc L2
3.4 = 1 + 0.04(HT(2way) + 0.2 * 200)
or HT(2way) = 2.4/0.04 - 40 = 20
so HT(2way) < 20 cycles in order to be better
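The same working as a sketch, solving the AMAT equality for the 2-way L2 hit time (variable names are illustrative):

```python
# Baseline: directly mapped L1 and directly mapped L2 (values from the table above)
hit_L1, miss_L1 = 1, 0.04
hit_L2_dm, miss_L2_dm, penalty_L2 = 10, 0.25, 200
amat_dm = hit_L1 + miss_L1 * (hit_L2_dm + miss_L2_dm * penalty_L2)   # 3.4 cycles

# 2-way L2: solve amat_dm = hit_L1 + miss_L1 * (ht_2way + miss_L2_2way * penalty_L2)
miss_L2_2way = 0.20
ht_2way_max = (amat_dm - hit_L1) / miss_L1 - miss_L2_2way * penalty_L2
print(ht_2way_max)  # ≈ 20 -> the 2-way L2 has to hit in under 20 cycles to win
```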
L2 cache policies
- critical word first
→ request the word causing the miss first and forward it to the CPU
- early restart
→ interface to main memory fetches the words of the block in the order they are stored; the CPU restarts as soon as the requested word arrives
- priority to read misses over writes
- victim cache
→ mimics a wastepaper basket
→ you can always dip your hand into the wastepaper basket and pull stuff back out again
→ the idea: at some point I'm going to run out of space, so I identify something in the cache and call it a victim. It hasn't actually moved down the memory hierarchy; it's still in the cache. If a CPU request then hits on a victim-marked entry, you clear the mark and switch it to something else. Over time you effectively collect statistics about the least recently used things: you mark items as victims and move the marks around.
→ in other words: a subset of cache locations is marked as victims. On a CPU request that hits something marked as a victim, you say it clearly can't be a victim because it's needed, so you clear the mark and mark something else instead. On the other hand, if the cache is full, you look for a V-marked location and kick that out.
→ V is just a label (a rough sketch of this marking scheme is below)
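A rough sketch of the victim-marking idea as described above, assuming a single fully associative set; the class name, the tag-keyed dictionary, and the random choice of which entry to mark are my own illustrative assumptions, not from the lecture:

```python
import random

class VictimMarkedCache:
    """One fully associative set in which some entries carry a victim ('V') mark."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = {}       # tag -> block data
        self.victims = set()  # tags currently marked as victims

    def _mark_new_victim(self, exclude=None):
        # pick some other resident, unmarked line and label it 'V'
        candidates = [t for t in self.lines if t != exclude and t not in self.victims]
        if candidates:
            self.victims.add(random.choice(candidates))

    def access(self, tag):
        if tag in self.lines:                 # hit
            if tag in self.victims:           # clearly can't be a victim: we need it
                self.victims.discard(tag)
                self._mark_new_victim(exclude=tag)
            return True
        if len(self.lines) >= self.capacity:  # miss with a full set: evict a victim
            if not self.victims:
                self._mark_new_victim()
            evicted = self.victims.pop()
            del self.lines[evicted]
        self.lines[tag] = "block data"        # miss: bring the block in
        return False
```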
2023/07/10 - 15:12
- First level of cache is designed for minimizing hit time
→ directly mapped
→ small
- Subsequent cache levels can use “best of both worlds” models and will be larger.