There are other types of multiprocessor
We've been looking at single instruction, single data
there's also single instruction, multiple data
Laptops, desktops, anything with a visual interface will have some capability of doing that, because it has a GPU.
GPUs have become workhorses for anything related to neural networks
We have a single algorithm and multiple sources of data.
We slice the data into different regions, each mapped to a different region of memory
A pixel has 4 dimensions: RGB and opacity
perform a calculation to map input state to target state.
The GPU goes from a high-level description to a low-level pixel description
The same calculations are going on in every different region of memory (pixel)
The same thing happens for neural networks, but they use convolution, which is similar but with more phases in between.

Convolution: a sum-of-products operation, where w is the mask (the weights) and x is whatever the input pixels are
∑ wx
It acts as a feature detector in whatever the original image was. You go to every possible location and ask whether the pixels match the mask. Repeat it over and over and over and over. For one single mask, you wind up with a new image that represents the degree to which the mask exists in the entire original image.
The new image is a ghostly embodiment of how much the mask appears in the original image
Neural networks involve an absurd amount of computation, which is why we use a GPU and not a CPU
You introduce a bottleneck as you go through the layers
because at each level of convolution you create a slightly smaller image (you don't allow overlap between the convolution regions)
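A minimal sketch of that sum of products in Python (the function name, the toy mask, and the stride choice are just for illustration, not from the lecture): at every placement of the mask the output pixel is ∑ wx, and with a stride equal to the mask size the regions don't overlap, so each layer's output image shrinks.

```python
import numpy as np

def convolve2d(image, mask, stride=1):
    """Slide `mask` over `image`; at each position compute the
    sum of products sum(w * x). With stride equal to the mask
    size the regions don't overlap, so the output shrinks."""
    mh, mw = mask.shape
    oh = (image.shape[0] - mh) // stride + 1
    ow = (image.shape[1] - mw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            region = image[i*stride:i*stride+mh, j*stride:j*stride+mw]
            out[i, j] = np.sum(mask * region)   # sum of products, ∑ wx
    return out

# Toy example: a 2x2 "diagonal" mask applied to a 6x6 image.
image = np.random.rand(6, 6)
mask  = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
feature_map = convolve2d(image, mask, stride=2)  # non-overlapping -> 3x3 output
print(feature_map)
```

The same arithmetic runs at every mask position, which is exactly the kind of work SIMD hardware is built for.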

The classic example of convolution is if I wanted to recognize a number 6 vs a number 8. Now I'm not interested in trying to recognize 6 or 8 based on where the individual pixels are, but if I can recognize some feature relative to another feature, that might be all I need to distinguish a 6 from all other numbers.
You take the useful things to recognize and you make them masks.
You want to be able to recognize the building blocks of objects no matter where they are in the image
The convolution will result in an image that only highlights the spots of the image that matched the mask


You can't have conditionals in single instruction multiple data, because as soon as you do, it means one path might differ from another, the cores become out of sync, and the model falls apart.
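One hedged illustration of how data-parallel code sidesteps this (a NumPy sketch; the variable names are my own): instead of an `if` per element, both outcomes are computed for every element and a mask selects the result, so every lane executes the same instructions.

```python
import numpy as np

x = np.random.randn(8)

# Scalar style: a per-element conditional would make lanes diverge.
#   for each xi: y = xi * 2 if xi > 0 else -xi

# SIMD style: every lane does the same work; a mask selects the answer.
mask = x > 0
y = np.where(mask, x * 2.0, -x)   # both paths computed, then selected
print(y)
```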


We could hypothetically have multiple instruction single data
That's like saying you have a single piece of data that multiple algorithms are being applied to
there isn't much in the way of hardware support that follows it

the last one, multiple instruction multiple data, is the most open ended but the hardest to get right
there is less direct hardware support. it's more specialised
what we tend to see is single-single, the second one is GPUs, and after that you see application-specific architectures.
so let's say hardware platforms that were implemented/designed just to implement deep learning frameworks. lots of cloud infrastructure might be dedicated to implementing deep learning frameworks. if you want to use a voice activated application, it doesn't happen on your phone, it happens in the cloud somewhere.
although, as time goes on, it increasingly appears on your cellphone too


We're going to look at memory (the implications for memory)


- multithreading: as we observed before, we've waved our hands in terms of saying multiple threads are potentially nice, because what do multiple threads guarantee that's useful in terms of instruction-level parallelism? what is the ideal organization of instructions from the POV of instruction-level parallelism?
→ INDEPENDENCE, orthogonality
→ a sequential set of instructions that are not dependent on each other
- the compiler tries its best, but a go-to is running multiple threads simultaneously where the threads are different applications.
- The OS can mix and match because the threads will have nothing to do with one another
- or at least that's a naive way of looking at it....

- The bottleneck is always memory
- even if you could run 100 threads on a cellphone, the problem you're going to face is memory.
- shared memory architectures, multi-core, tend to look like this: all within the same CPU, within the same chip, we might have, say, 4 cores.
- each core has on-chip cache.
- at some point you have to get data and instructions on and off chip. the memory architecture is going to have to assume shared memory, whether it's some level of cache or main memory itself.
- there is always a memory bottleneck.
- all the cores are the same, all the caches are the same, on a single multi-core CPU
- if we want to go beyond about 12 cores, then we start duplicating things across multiple CPUs, but then you need some kind of interconnection. you need some sort of a network to link all of them together. That is where things get messy again
- it's like a small high-speed version of the internet, just to join all your CPUs together. and another holy grail is trying to get optical interconnect to join all the cores up.


- the typical constraint is that the core and its cache are coupled. one core can't access the cache of another core. the only way any sort of sharing can happen is at another cache level, which costs a lot in terms of access time.

- as soon as you go to the interconnection network level, stuff can be anywhere. so no prizes for guessing that parallelism is not for free. as you increase the amount of parallelism, you spend more and more time trying to find the data you're supposed to be using. the communication overhead becomes an ever larger cost.

- The basic challenges to parallel processing in general:
→ cost of communication
→ algorithmic limitations
⇒ which means not everything is parallelizable.
⇒ deep learning is parallelizable; it's easy to get locality of data with an algorithm that's the same everywhere. what happens in one spatial region doesn't really interfere with what happens in another location. it can be done as a systolic array, another form of pipelining, where the output from one pass becomes the input to the next.
- if we wanted a task in which our design objective was a speedup of 80 using 100 parallel processors, how much of the task would have to be parallelized vs how much is serial?
→ we are back to Amdahl's law.


Amdahl's law:
speedup = 1 / ((1 - fraction(enhanced)) + fraction(enhanced) / speedup(enhanced))

in this case, we've said that the total speedup is 80, and the speedup from the enhancement is 100

80 = 1 / ((1 - fraction(enhanced)) + (fraction(enhanced) / 100))
fraction(enhanced) = 79/79.2 = 0.9975
which means only 0.25% of the application can execute sequentially.
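A quick check of that arithmetic (the function name is just illustrative; the rearrangement follows directly from Amdahl's law above):

```python
def fraction_enhanced(overall_speedup, enhancement_speedup):
    """Solve Amdahl's law for the fraction that must be enhanced:
    S = 1 / ((1 - f) + f / s)  =>  f = (1 - 1/S) / (1 - 1/s)."""
    S, s = overall_speedup, enhancement_speedup
    return (1 - 1 / S) / (1 - 1 / s)

f = fraction_enhanced(80, 100)
print(f)        # 0.99747... ≈ 0.9975 (i.e. 79/79.2)
print(1 - f)    # ≈ 0.0025, so only about 0.25% may stay sequential
```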


intro to shared memory multiprocessors.




2023/07/24 - 15:08


cache coherence
when we have our memory hierarchy and we have some data in main memory, and someone performs a read, that means we have the item of data in the local cache. then B comes along and reads it too. they both have a copy. then Core A writes to the location.... so we have to do something to make sure the copies at the various levels of memory are in sync. at this point we say we've lost coherence, because we've got different instances of state distributed across the machine
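A toy sketch of that loss of coherence (pure Python, just to make the sequence concrete; the dict-based "caches" are an assumption of the sketch, not real hardware):

```python
# Toy model: main memory plus a private cache per core, with no invalidation.
main_memory = {"X": 0}
cache_A, cache_B = {}, {}

cache_A["X"] = main_memory["X"]   # core A reads X -> gets a local copy
cache_B["X"] = main_memory["X"]   # core B reads X -> gets its own copy

cache_A["X"] = 42                 # core A writes X in its cache only

# Without a coherence protocol the three copies now disagree:
print(cache_A["X"], cache_B["X"], main_memory["X"])   # 42 0 0 -> coherence lost
```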
We can split this concept up into coherence vs consistency
this is where things get fuzzy

So we want coherent memory if program order is preserved.
what might that mean?
- so we can say that some processor P relative to some address location X. and this processor of course can write to memory, and it can read. and if that's the case, we're talking about read after write. we say that it's going to be maintained for the case of a single processor
- we can jazz it up a bit more and talk about two processors
- a coherent view for two processors P and P', where they both want to access the same memory location
→ processor P might write, and P' might read.
→ that gives us consistency. consistency is maintained under what conditions.

consistency is about waiting long enough for the write to finish before the read commences. that in itself is tricky. so actually we're looking for serialization.

Write serialization (write after write) by different processors.
consistency is all about the timing.
consistency refers to the minimum time before the written values are available.

we need to alert the system to the fact that we're about to perform a write, and only after that do we perform it. that's where serialization comes into play, which is counter to the parallelism we've been aiming for up to this point
one of the tricks is this concept of write invalidate. sure, we're going to perform replication and migration, but that doesn't capture the issue. it's all about serialization, because we have to alert the system to writes. reads are for free. so we've got coherence vs consistency
coherence is all about what values are returned when you're performing a read and write to the same location. whereas consistency is about the temporal effects: how long do you have to flag something as a write, and what temporal locks do you have to enforce.

directory and snooping
directory-based we have to use under the distributed, interconnection-network scenario
snooping we can get away with under our multi-core CPU
because they're all going to sit there snooping the addresses that are requested on the bus. it keeps things simple, but it makes things complicated enough. the communication medium linking the caches is snooped. each individual core has a record of what addresses it has in its local cache, and when someone places a request for something it has locally, it says, "I have a copy of that."
whereas the directory-based schemes mean you have to have a copy of directories across the communication network that's linking up the distributed cores, and that's complicated because the directories can't have a copy of everything. so we're going to concentrate on snooping
the trick is
- on a write, first introduce the step called write invalidate
- this kind of acts like a dirty flag.
- when a write happens, the address gets invalidated locally. it doesn't get updated in main memory until another location tries to read from it


each cache slot has 4 states
M modified: unique to this cache and dirty. this cache is the only one with a copy
E exclusive: cache slot is the same as in memory, and not present in any other cache
S shared: cache slot is the same as in memory, and another cache is also using it / has a copy
I invalid: the cache slot does not contain valid data.

now we can ask what happens if we attempt to perform reads and writes.
read hit:
- the cpu cache has a hit and there's no interaction with main memory necessary. that processor gets a local copy because it's already in cache. it can merrily read from cache and there's no change in slot state
read miss:
- 4 scenarios
→ exclusive: a different cpu's cache has the clean copy
⇒ what's that going to provoke? if one cpu (A) has the data, we're going to copy that to B, and as soon as we do that then 2 cpus have the data. so their state goes from exclusive to shared.
⇒ invalid means the cpu that made the request didn't have it, so it was invalid.
⇒ now what's going to happen is this: B is the read miss. A has a copy, and it's exclusive. B is invalid. A is the source and B is the destination. to get that to happen, A will get pushed to shared and B will get pushed to shared
→ shared: multiple sources exist, like saying there's an A and A' and they both have the same thing. they are already in shared state. that causes B (invalid) to flip to shared. then you have 3 in shared
→ modified: in that case A (which has the copy) is in modified state, aka the dirty bit is set. both of them are going to go to shared. once they are both pushed to shared, then the data can be pushed
⇒ the thing about modified is that you're flagging the fact that you're the source.
→ invalid: invalid means that B is making a read request and there are no other copies in any other core's cache, so you have to go to memory. this means that you're going to go from invalid to exclusive, because you're the only processor that has that piece of data.

write hit:
- shared: performing a write will result in you getting a more local copy, ergo you register a dirty slot. it write-invalidates, and because we're performing a write to it, it is modified relative to main memory.
→ broadcast the invalidate message, indicate that we've modified that slot.
→ why do we invalidate the shared locations? because when we do that, it tells the other locations that we're going to change it... it serializes the process. any other slots that are sharing this slot will go to invalid, and the one we wrote to will go to modified. because it's modified, we know it's out of sync with main memory. then we know we have to push it to main memory if we want to replace that cache slot.
- exclusive: we want to write against it and there are no other cache slots involved, so it changes to modified. we set the dirty flag.
- modified: if it's already modified and we perform another write, the state doesn't change.

write miss:
- now we introduce this interesting scenario, a “read with intent to modify”. this implies that once this miss is serviced, the state of the memory location is going to be modified. because it's a write miss, what's the first step?
- the important thing on a write miss is that as soon as you get the slot, you're immediately going to make it dirty. you're immediately going to overwrite the copy of the data that you get from servicing the miss.
- two scenarios:
- modified: another cache has a hit, but it's modified. that cache has the most up-to-date version of the data. We push the modified slot into invalid, because once we pull the slot into our cache we're just going to immediately change it. so our version goes to modified and the other modified one goes to invalid.
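A rough sketch of the transitions walked through above, written as a lookup table (the event names and the table itself are my own simplification; real MESI also involves bus messages, write-backs, and data movement that are omitted here):

```python
# Simplified MESI next-state table.
# Key: (current state of this cache's slot, event as seen by this cache).
MESI_NEXT = {
    # this core's own accesses
    ("I", "read_miss_no_other_copy"):  "E",  # fetched from memory, sole copy
    ("I", "read_miss_other_has_copy"): "S",  # the supplier also drops to S
    ("I", "write_miss"):               "M",  # read with intent to modify
    ("S", "write_hit"):                "M",  # broadcast invalidate to sharers
    ("E", "write_hit"):                "M",  # no one else to invalidate
    ("M", "write_hit"):                "M",  # already dirty, no change
    ("E", "read_hit"):                 "E",  # read hits never change state
    ("S", "read_hit"):                 "S",
    ("M", "read_hit"):                 "M",
    # what happens to *another* cache's slot when it snoops our request
    ("E", "other_read_miss"):          "S",  # it supplies the data, both end up shared
    ("S", "other_read_miss"):          "S",
    ("M", "other_read_miss"):          "S",  # supply the dirty data, then share
    ("E", "other_write_miss"):         "I",  # our copy is invalidated
    ("S", "other_write_miss"):         "I",
    ("M", "other_write_miss"):         "I",  # hand over the up-to-date data first
    ("S", "other_write_hit"):          "I",  # invalidate broadcast from a sharer
}

print(MESI_NEXT[("E", "other_read_miss")])  # -> 'S'
print(MESI_NEXT[("S", "write_hit")])        # -> 'M'
```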



There's a whole pile of caveats that come with this. an underlying assumption that doesn't hold true in practice is that each operation can complete before something else happens. some of this is wrapped up in these messages like “read with intent to modify”: we want to flag as soon as possible what the intent of core B is. it comes back to whether what we're trying to do is parallel or unpredictable. we can find ourselves trying to introduce a zero-wait-state environment. this can result in us trying to implement locking systems in the applications, or hoping that the OS has made sure that the threads on B are different than the ones on A. the points in time where you'll see requests like that are parallel applications... that means that someone had to sit there inserting conditions for the threads to be aligned. only when threads are synchronized can you access each other's memory, but the synchronization implies you've serialized things.

We can find ourselves coming across true sharing misses vs false sharing misses
how much impact might this have on operations

Coherence miss example

8-core scenario. each core has L1 and L2 cache
snooping is performed on the L2 bus
an L2 miss costs 15 cycles
cores operate at 3 GHz with a CPI of 0.7 and a load/store frequency of 40%
what is the coherence miss rate (CMR) per processor if no more than 50% of L2 bandwidth can be consumed by coherence traffic?

# cache cycles available
= (clock rate / cycles per request) * 50% threshold

the clock rate we said was 3 GHz....
so we want memory references * clock frequency * number of cores * coherence miss rate
the memory references are the load/store references / CPI → 40% / 0.7 → (0.4/0.7) * 3 GHz * 8 * CMR

look at the photo from the lecture, because it's hard to make sense of this from the notes alone
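In the meantime, a sketch of how the numbers above combine (the setup values come straight from the problem statement; the final figure isn't written in the notes, it just falls out of the arithmetic):

```python
clock_hz       = 3e9    # 3 GHz
cpi            = 0.7
ls_frequency   = 0.40   # loads/stores per instruction
cores          = 8
cycles_per_req = 15     # cost of servicing an L2 (coherence) request
bus_share      = 0.50   # at most 50% of L2 bandwidth for coherence traffic

# coherence requests the L2 bus can absorb per second
available_reqs = clock_hz / cycles_per_req * bus_share        # = 1e8 per second

# memory references generated per second across all cores
mem_refs = (ls_frequency / cpi) * clock_hz * cores            # ≈ 1.37e10 per second

# coherence miss rate that saturates the allowed share of the bus
cmr = available_reqs / mem_refs
print(cmr)   # ≈ 0.0073, i.e. roughly 0.7% of memory references
```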
