- Non-blocking caches - stall reduction on cache misses
→ If a cache is blocking, a miss means you can't generate any further hits on memory until that stall has been resolved
→ This can create a lot of problems, especially with out-of-order execution
→ A non-blocking cache can still service further requests (hits) while it is servicing a miss
→ That raises the question: how far does one go in trying to support that?
⇒ According to the graph, for integer operations it doesn't seem to matter how many hits we can resolve under a miss, because they perform about the same
⇒ On the FP side, being able to resolve hits under an increasing number of misses shows a big reduction in stall time
⇒ If you have a machine capable of out-of-order execution, you have to have a non-blocking cache (a toy model of the difference follows)
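A toy model of that difference (my own sketch with illustrative assumptions, not from the lecture): one access issues per cycle, a miss takes 15 cycles to resolve, and at most one miss can be outstanding. With a blocking cache, hits queued behind a miss wait it out; with hit-under-miss, they complete while the miss is serviced.

```python
# Toy model: one access issues per cycle; a miss takes MISS_PENALTY cycles
# to resolve; at most one miss may be outstanding. Numbers are illustrative.
MISS_PENALTY = 15

def stall_cycles(accesses, hit_under_miss):
    """Total cycles spent stalled for a stream of 'hit'/'miss' accesses."""
    stalls = 0
    miss_ready = 0          # cycle at which the outstanding miss resolves
    t = 0                   # current cycle
    for a in accesses:
        if a == "miss" or not hit_under_miss:
            if t < miss_ready:          # must wait for the outstanding miss
                stalls += miss_ready - t
                t = miss_ready
        if a == "miss":
            miss_ready = t + MISS_PENALTY
        t += 1                          # issue the access
    return stalls

stream = ["miss"] + ["hit"] * 10
print(stall_cycles(stream, hit_under_miss=False))  # 14: hits wait out the miss
print(stall_cycles(stream, hit_under_miss=True))   #  0: hits complete under it
```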
- Hardware instruction pre-fetch
→ When we service a miss, we also read the next sequential block into the cache.
⇒ From the CPU's point of view, it doesn't see that happen. But the odds are that if we need one block, we're likely to need the next one, so we prefetch the next block.
⇒ Assumes that when we access something, we will want to access something else in the region later; we prefetch in the direction of increasing addresses (see the sketch after this list).
→ This has an impact on low-power applications because we may be fetching things we don't actually use.
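A minimal sketch of next-sequential-block prefetch (my own illustration; the one-entry prefetch buffer, class name, and ignored capacity are assumptions, not from the lecture):

```python
# Sketch of next-sequential-block hardware prefetch.
# On a cache miss to block b, fetch b into the cache and b+1 into a
# one-entry prefetch buffer; a later hit in that buffer avoids the full miss.

class PrefetchingCache:
    def __init__(self):
        self.cache = set()       # resident block numbers (capacity ignored here)
        self.prefetch = None     # single prefetched block number

    def access(self, block):
        if block in self.cache:
            return "hit"                  # 1 cycle
        if block == self.prefetch:
            self.cache.add(block)         # promote from prefetch buffer
            self.prefetch = block + 1     # keep prefetching forward
            return "prefetch hit"         # 1 cycle in the example below
        self.cache.add(block)             # full miss: 15 cycles in the example
        self.prefetch = block + 1         # prefetch the next sequential block
        return "miss"

c = PrefetchingCache()
for b in [7, 8, 9, 3]:
    print(b, c.access(b))   # 7: miss, 8 and 9: prefetch hits, 3: miss
```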
example:
- 64K cache
- prefetching reduces the miss rate by 20% (i.e., the prefetch buffer catches 20% of misses)
- prefetch hit: 1 clock cycle
- miss on both cache and prefetch: 15 cycles
- data references per instruction: 22%
- Table 1 details misses per 1000 instructions for different cache sizes
- What is the effective miss rate with prefetching? How much bigger would a data cache need to be to match the average access time if prefetching were not available?
| Size | Instruction Cache | Data Cache | Unified Cache |
|------|-------------------|------------|---------------|
| 8KB  | 8.16              | 44         | 63            |
| 16KB | 3.82              | 40.9       | 51            |
| 32KB | 1.36              | 38.4       | 43.3          |
| 64KB | 0.61              | 36.9       | 39.4          |
First, test the prefetch buffer only when the cache misses:
- model it as a multilevel cache
Avg memory access time = hit time (1) + miss rate* × miss penalty
→ the miss penalty folds in the prefetch buffer: miss penalty = prefetch hit rate (20%) × (hit time (1) + miss rate × (1 − prefetch hit rate) × miss penalty (15))
→ *miss rate, from the table: 36.9 misses per 1000 instructions. How many data references? 22% per instruction → 1000 instructions generate 220 data references
⇒ miss rate = 36.9/220 ≈ 16.8%
AMAT = 1 + 16.8% × 20% × (1 + 16.8% × 80% × 15)
     ≈ 1 + 0.0336 × 3.01
     ≈ 1.1 ← AMAT with prefetch
This is the target value for a cache without prefetch
Case of no prefetch:
AMAT = hit time + miss rate × miss penalty = 1.1 ← target value
miss rate is the unknown:
1.1 = 1 + miss rate × 15
miss rate ≈ 0.67%
The system would need a miss rate of 0.67% to match the prefetch performance. There's no way we're going to get a cache big enough for that.
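Checking the multilevel-model arithmetic with a short script (a sketch that just reproduces the formula above, not a general cache model):

```python
# Reproduce the multilevel-model arithmetic from the notes.
hit_time, miss_penalty, prefetch_hit = 1, 15, 0.20
miss_rate = 36.9 / 220   # misses per 1000 instr / data refs per 1000 instr

amat = hit_time + miss_rate * prefetch_hit * (
    hit_time + miss_rate * (1 - prefetch_hit) * miss_penalty)
print(round(amat, 2))    # 1.1  <- AMAT with prefetch

# Miss rate a no-prefetch cache would need to match this AMAT:
print(round(100 * (amat - hit_time) / miss_penalty, 2), "%")  # ~0.67 %
```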
Case of a serial test on the prefetch buffer:
keep the cache and prefetch buffer independent, not modeled as a multilevel hierarchy
AMAT = {cache hit} + {prefetch hit} + {prefetch miss}
AMAT = 1 + {miss rate × prefetch hit rate × 1} + {miss rate × (1 − prefetch hit rate) × miss penalty}
     = 1 + 16.8% × 20% + 16.8% × 80% × 15
     = 3.046 (new target value for no prefetch)
case of no prefetch:
target cache miss rate: (3.046 − 1) / 15 → 13.6%
try a large 256KB cache → 32.6/220 ≈ 14.8% ← the 64KB cache with serially tested prefetch still beats a 256KB cache with no prefetch
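And the serial-test arithmetic, checked the same way (again a sketch reproducing the notes' formula):

```python
# Reproduce the serial-test arithmetic from the notes.
hit_time, miss_penalty, prefetch_hit = 1, 15, 0.20
miss_rate = 36.9 / 220

amat = (hit_time
        + miss_rate * prefetch_hit * 1                    # prefetch hits
        + miss_rate * (1 - prefetch_hit) * miss_penalty)  # full misses
print(round(amat, 3))                                     # 3.046

print(round(100 * (amat - hit_time) / miss_penalty, 1), "%")  # 13.6 % target
print(round(100 * 32.6 / 220, 1), "%")  # 14.8 %: 256KB, no prefetch, still worse
```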
Out-of-order processors require non-blocking caches.
Prefetching attempts to anticipate cache placements.
Not good for low-power applications.
A small, direct-mapped cache is the only way to keep hit time small.