Benchmarking
- what is a meaningful measure of performance? what are you measuring when you benchmark a CPU?
- What is included when you benchmark a CPU?
→ A CPU has:
⇒ Memory, in particular, Cache + RAM
• we have to execute SOMETHING to measure performance, and it will be sitting in memory somewhere.
⇒ could also add Disk, Network, but prof wants to focus on a minimalist visualization of the CPU
⇒ OS. can't do anything without OS
• some OS's do certain tasks better than others.
→ Why don't we care about the visual interface?
⇒ because that would mean thinking about the GPU.
- These components ^ are what we are measuring performance on.
But now, what is a a meaningful measure of performance if I'm working with the building blocks mentioned? (memory, OS)
- Key measurement criteria:
→ Time
→ clock speed
→ throughput
→ latency
- these all measure different things.
- Latency and Throughput:
→ applications that concentrate on latency and throughput are called transaction processing.
⇒ e.g. banks, cloud services. maximize the number of customer asks. but the end user does not care about the number of transactions that are being processed, you only care about yours. however the banks themselves love to increase the number of transactions per second. however when you're competing with many other users using the bank, your latency goes up.
⇒ Transaction processing are all about maximizing the use of the resource, but minimizing latency for the end user.
→ the average desktop computer is not using transaction processing however.
- Time and Clock Speed:
→ interested in the time taken for some meaningful task to be completed. This time has only an indirect correlation with clock speed, because clock speed is about “how thinly you slice the pizza”.
→ in the 90s, everything was about increasing clock speed. but if your clock period is longer and you get 20 things done, vs if your period is shorter and you only get 1 thing done, which is better?
→ youre only interested in the time it takes between the moment you hit execute to the time you get your result. the actual number of clock cycles is not really important.
→ same thing about the number of cores. Why do we wind up with multiple cores? where does multi core come from?
⇒ so we can execute in parallel. but what does that mean?
• there are several ways to look at it:
• algorithmic parallelization. someone had to think really hard about how to split it up so that each part runs in parallel.
• compilation. this is the case for intruction level parallelism. aka, each instruction does not make use of the result of the previous instruction, as much as possible. However we tend to write our code s.t. each line is dependent on the previous one, and the compiler has to rewrite this code to run in parallel.
• the OS runs multiple programs in parallel. the way we use computers, we have several applications running that we expect the CPU to be able to handle in paralell.
◇ As long as these programs are independent from each other, we're a happy bunny
◇ this runs into complications when the programs start to run out of memory, as they are competing for the same resource.
• we essentially circle back to latency and throughput.
- The memory hieracrchy is there to make it seem like everything is running smoothly together on the CPU.
- Time Elapsed
→ this is the gold standard
-
What does CPU time reflect?
CPU time = tclk * instruction count * CPI
- tclk = clock period. inverse of frequency. fixed duration. a function of the physical geometry of the circuit. puts a finite amount of complexity on the circuit. finite amount of propagation delay during which the circuit has to settle, because at the end of the clock cycle there needs to be nothing else changing. reflects physical property of CPU.
→ A transistor is the smallest building block we've got, and it takes a finite amount of time to switch. But we have to look at the most complex feature, its number of transistors, and make sure it can complete its circuit in one clock cycle.
- instruction count = number of instructions actually executed. unroll number of times a loop gets executed, some conditionals dont get executed... etc
- CPI = cycles per instruction. why don't all instructions take the same number of clock cycles to execute? as long as an instruction is using the same physical circuit, it will use the same amount of time. Let's say addition, subtraction use one circuit, multiplication uses another, logic uses another. the worst one is division. (division is evil). ~try to avoid division~ because it takes the longest.
→ types of circuit:
→ logic
→ +/-
→ *
→ %
→ integer, floating point
⇒ floating points take more time. they are more complex
⇒ integer circuits are faster. this is due to complexity of the corresponding hardware circuit
→ these are the only real circuits that are built in by default in hardware. everything else is probably software.
→ predictability also impacts CPI.
⇒ how can different instructions be less or more predictable
• branching (conditionals, loops)
◇ exiting the loops is unpredictable behaviour. you dont really know WHEN the condition will tell you to jump out of the loop
◇ ignoring users “users are a complete pain in the ass. replace them by robots”
• memory access
◇ accessing sequential memory is fast. but if the addresses you need to access are all over the place, this makes a mess of your memory hierarchy. each cache miss wastes thousands of clock cycles.
This is the purpose of benchmarking
for a particular application, how well-behaved or not is CPI.
Tclk and instruction count is pretty much fixed. CPI is the big unknown.
Other factors affecting CPI and IC
- IC
→ efficiency of compiler
- CPI
→ sensitivity to sequencing and instruction mix
- ISA
→ RISC vs CISC?
→ in cisc you could do:
⇒ Mem[A] = Mem[B] + Mem[C]
⇒ in one instruction
⇒ this is highly variable because you dont know where those memory addresses are. if theyre in cache, it could be fast, but if theyre not in cache “you could be sat there twiddling your thumbs” hehe
⇒ directly operating on the memory
⇒ CISC minimized instruction count, but made CPI much larger since the circuit complexity is much higher.
→ vs in RISC, the only thing that can access memory are load & store. you can only do executable operations on registers. you cant directly change anything in memory, you have to first load it into a register.
⇒ Reg[b] = Mem[B]
⇒ Reg[c] = Mem[C]
⇒ Reg[a] = Reg[b] + Reg[c]
⇒ Mem[A] = Reg[a]
→ this is where instruction level parallelism comes into play. the first two lines are not dependent on one another, so they could be serviced at the same time.
→ Instruction level parallelism is about chopping things up small enough that you can still do useful stuff while other parts stall.
→ CISC was big in the early days, when compilers were not good. As compilers got more powerful, RISC became more of an option. this meant the hardware could be a lot simpler, and we could focus on using hardware to do other things. One of the reasons we wound up with multi core was the hardware guys said “the cpu is so ridiculously complicated that if we add any more complexity, it will just get slower”. They just duplicated what they already had, because making it more complicated would just make the core slower. So they offloaded it to the OS/software people to sort it out.
Pragmatics
- all 3 features are inter-related. cannot impact one wihtout impacting others.
- CPI is difficult to estimate without appropriate benchmarking.
Example:
CPU time
- tclk and ‘IC’ are common in the question.
- CPI is given for different types of instructions.
- therefore CPI (overall) = Sum (over i element of instruction types) of freq(i) * CPI(i)
- always looking to minimize
- two design choices:
→ decrease CPI of FP SQRT to 2
→ decrease all FP instructions to 2.5
What tasks do we use for benchmarking?
- speed of processes we use often
→ time to get the outcome is most important (from a user's POV)
- Power consumption
→ Particularly on a cellphone
- How much overlap is there between those things?
→ It's proportional. They're in conflict. Power consumption is a measure, we don't want to increase it. We want something for nothing.
→ If we can minimize power consumption and also minimize time to get what we want, there's a tradeoff.
⇒ such is life
→ Efficiency plays into power consumption
→ Secondary effects:
⇒ throughput
⇒ Heat dissipation also comes into power consumption.
⇒ smaller is better.
→ None of these things are aligned.
→ Malcolm says a cellphone and a laptop are not real computers... why?
⇒ They are specialized. They're designed to be portable. What set of constraints does that result in?
⇒ a whole bunch of tradeoffs, in favor of power consumption. There's also a heat issue. The space issue plays into the heat issue. How satisfied would we be if all our compute tasks took half an hour? This would have an impact of power consumption and such.
- One size never fits all.
- Creating benchmark means we have to come up with tasks that exercise the function of our system in different ways, in accordance with our desired context.
- Fundamentals parts of a computer:
→ CPU
→ Memory
→ OS
→ Compiler
⇒ indicates the sequence of instructions.
⇒ when benchmarking, consider what optimizations to included. don't want to include optimizations that favour a particular CPU or memory architecture. You want to be as neutral as possible (for any potential user)
- We have to made decisions about what our performance metrics are.
- If youre benchmarking using tasks that dont exercise the different parts of our system in a way that's relevant to our userbase, we're not gonna get a good picture of how our system is performing.
- What mix of tasks should be included in any one benchmark?
→ Real programs are most representative, but even then
- Floating point programs tend to be more loopy than integer operation-based programs.
- SPEC is one of the larger benchmarking organizations.
→ Last CPU benchmark was released in 2017
→ Splits the applications in to integer and floating point.
→ Written in C, C++, Fortran. This is because in Java or Python, there are other elements that are not predictable across all operating systems or devices. For Java in particular, it becomes blurry whether you're benchmarking the VM or the hardware itself. There are too many other variables involved to definitively say you're benchmarking the CPU
→ Why is there more fortran under FP?
⇒ Science tends to use Fortran to describe their scientific applications.
→ Lets look at the applications
⇒ most of the FP operations are very scientific and niche. The only general purpose applications are really image manipulation, maybe ray tracing.
⇒ A few of the integer operations are general purpose, like video and data compression.
→ It's a collection of CPU applications with the I/O aspects pulled out.
→ KLOC stands for thousands of lines of code
Selecting Benchmarking program suites
- Should be meaningful to users.
- should exercise different parts of the systems
- Can't assume real applications because they typically step outside of just hte CPU, and touch on I/O etc.
- Use modified versions of real applications.
Example: different CPUs
- Smaller cache results in lower latency, as resolving a cache hit is much faster.
- Similar pattern observed with smaller main memory.
- A wider bus on the computer means more pins to support it.
- The cost of all this is space.
- The new chip has more on-chip cache and smaller off-chip. It's the fastest.
- If your CPU can service a hit with an outstanding miss, then it might have an impact during those scenarios where adding more cache space didnt seem to fix anything.
- You want the compiler to find ways of padding the code so that more stuff that's independent is near each other.
- This is called orthogonality.
- The more orthogonal, the more parallel your code can be.
Virtual memory: how much space does the application attempt to address. How much space is visible to the application. A virtualization of the address space even though the real address space is probably not contiguous.
The working space that your application is attempting to use.
Things like: data.
- Similar pattern observed with smaller main memory.
Index