sadie noah

motivation
- the need for speed
- traditional computers will soon reach a point where we can't dissipate the power they consume any more
- approx computing lets us trade off some accuracy in exchange for energy efficiency, lower latency, simpler design
- you can do that at different levels of the stack, from the hardware up to the algorithmic level
- focus on hardware side
- accuracy vs energy
→ trade this off at different levels
→ application dependent
→ more acceptable to have errors
→ when:
⇒ energy intensive applications
⇒ non critical operations

architecture
- full adders and half adders are foundational for exact computing
- half adder has 2 operands; a full adder also takes a carry-in as a third input
- exact multi-bit adder: the ripple carry adder, e.g. 4 full adders chained for 4 bits
- note: the execution is sequential because one adder can't start until the previous one produces its carry-out
- the bottleneck is the carry out
- ripple carry adders get modified into more complex designs to fight the bottleneck, but then they consume more power
- Generate, propagate, delete logic
→ g = A AND B (a carry is generated here regardless of the carry-in)
→ p = A XOR B (an incoming carry is propagated through)
→ d = NOT A AND NOT B (neither generate nor propagate: the carry is deleted)
- these signals are mutually exclusive
- we can redefine the carry-out in terms of these signals
- the carry-in depends on what happened in the previous adder
- we redefine the carry-out at each position in terms of P, G, D
- G, P, D allow every carry-out to be calculated in parallel (expansion below)
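In lookahead form (standard carry-lookahead algebra, not copied from the slide; + is OR, juxtaposition is AND), every carry depends only on the primary inputs and c_0:

```latex
\begin{aligned}
g_i &= a_i b_i, \qquad p_i = a_i \oplus b_i, \qquad d_i = \overline{a_i}\,\overline{b_i},\\
c_{i+1} &= g_i + p_i c_i,\\
c_1 &= g_0 + p_0 c_0,\\
c_2 &= g_1 + p_1 g_0 + p_1 p_0 c_0,\\
c_3 &= g_2 + p_2 g_1 + p_2 p_1 g_0 + p_2 p_1 p_0 c_0 .
\end{aligned}
```

All the carries can therefore be evaluated at once, but each step toward the MSB adds one more AND term and widens the OR gate, which is the blow-up described below.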
- the carry lookahead adder (CLA) is faster than a ripple carry adder... at what cost?
→ half adders compute G, P, D for each bit pair
→ which are forwarded to the lookahead logic (this part went by too fast)
→ the boolean logic gets more complex as you go up the circuit towards the MSB: every position adds an extra AND gate, and the OR gate needs to accommodate more inputs
→ CLA adders get huge for wide operands, e.g. when we have to add 20 bits
→ so we can cascade five 4-bit CLA blocks instead of one flat 20-bit CLA
- ripple carry: slow, sequential, O(n) delay
- CLA: parallel, O(log n) delay, but more power and area
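Not from the lecture, but a toy bit-level sketch of the two strategies; it checks that the expanded lookahead formulas reproduce the ripple-carry sums without ever waiting on the previous stage's carry:

```python
def ripple_carry_add(a, b, n=4, cin=0):
    """Exact n-bit ripple carry adder: stage i must wait for stage i-1's carry."""
    s, c = 0, cin
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s |= (ai ^ bi ^ c) << i
        c = (ai & bi) | (c & (ai ^ bi))       # carry-out of stage i
    return s, c

def cla_add(a, b, n=4, cin=0):
    """Same sums, but every carry comes from the expanded G/P formulas,
    so in hardware they could all be evaluated at the same time."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]   # generate
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]   # propagate
    c = [cin]
    for i in range(n):
        ci = g[i]                 # c_{i+1} = g_i + p_i g_{i-1} + ... + p_i...p_0 c_0
        for j in range(i):
            term = g[j]
            for k in range(j + 1, i + 1):
                term &= p[k]
            ci |= term
        prod = cin
        for k in range(i + 1):
            prod &= p[k]
        c.append(ci | prod)
    s = 0
    for i in range(n):
        s |= (p[i] ^ c[i]) << i   # sum bit = p_i XOR carry-in of stage i
    return s, c[n]

# the two exact adders agree on every 4-bit input pair
assert all(ripple_carry_add(a, b) == cla_add(a, b) for a in range(16) for b in range(16))
```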

- OLOA adder, has an exact portion and an approx portion
- no communication between exact and approx, meaning they can happen in parallel
- optimized OLOA: still discards the least significant bits
- hardware-optimized OLOA has marginally more hardware area and power consumption
- carry mask adder can dynamically change whether it acts as approximate or exact
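The slide itself isn't reproduced here, so this is only a minimal sketch of the exact/approximate split, assuming the lower part is replaced by a bitwise OR (as in LOA-style designs) and sends no carry into the exact upper part:

```python
def lower_or_approx_add(a, b, approx_bits=4):
    """Approximate adder: exact upper part, OR-approximated lower part.
    The two halves never exchange a carry, so they can operate in parallel."""
    mask = (1 << approx_bits) - 1
    lower = (a & mask) | (b & mask)                   # cheap OR instead of a real addition
    upper = (a >> approx_bits) + (b >> approx_bits)   # exact add, no carry-in from below
    return (upper << approx_bits) | lower

a, b = 0b0110_1011, 0b0011_0110
print(lower_or_approx_add(a, b), a + b)   # 159 vs 161: error confined to the low-order bits
```

As I understand the last bullet, a carry-mask adder would add a control signal that re-enables the real carry logic in the lower part, so the same circuit can switch between exact and approximate operation at runtime.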

Approximate multiplier design
- conventional multiplier: partial products accumulate to a result
- goal is to introduce approximation into these stages and measure how the result performs compared to an accurate multiplier
- at input:
→ 2 main strategies
⇒ dynamic segment method
• pass in the input operand
• starting from the first nonzero bit, choose how many bits to keep; we picked 6, so it keeps the 6 leading bits from the first nonzero bit onward and discards the rest (see the sketch after this list)
⇒ static segment method
• take a fixed number of bits from a section. could be first k bits, last k bits, or k bits in the middle.
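A rough reconstruction of the two segmentation strategies (my own sketch, assuming the 6-bit dynamic segment includes the leading one and the discarded bits are simply truncated):

```python
def dsm_segment(x, k=6):
    """Dynamic segment method: keep k bits starting at the leading 1 and
    discard everything below. Returns (segment, shift needed to restore weight)."""
    if x == 0:
        return 0, 0
    shift = max(x.bit_length() - k, 0)    # number of low bits dropped
    return x >> shift, shift

def ssm_segment(x, k=6, start=0):
    """Static segment method: always take the same fixed k-bit slice
    (first k, last k, or k bits from the middle), wherever the leading 1 is."""
    return (x >> start) & ((1 << k) - 1)

def approx_multiply_dsm(a, b, k=6):
    sa, sha = dsm_segment(a, k)
    sb, shb = dsm_segment(b, k)
    return (sa * sb) << (sha + shb)       # small k x k multiply, then restore magnitude

print(approx_multiply_dsm(1234, 5678), 1234 * 5678)   # roughly 2-3% error with k = 6
```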
- allowing for some error results in better performance in power, delay, and area.
- approximation at partial product generation:
→ employ an under-designed multiplier ....
→ circuit is much simpler
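The lecture only said "under-designed multiplier"; one well-known example from the literature (which may or may not be the one shown) is a 2x2 block that is exact everywhere except 3 x 3, trading a single wrong output for much simpler logic:

```python
def udm_2x2(a, b):
    """Under-designed 2x2 multiplier block: returns 7 instead of 9 for 3 x 3,
    which lets the output fit in 3 bits and shrinks the gate-level circuit.
    Larger approximate multipliers are assembled from such blocks."""
    return 7 if (a, b) == (3, 3) else a * b

# wrong in exactly 1 of the 16 possible input pairs
print(sum(udm_2x2(a, b) != a * b for a in range(4) for b in range(4)))   # -> 1
```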

Floating point
- we choose which values we want to represent based on what we want to use them for
- IEEE formats give a high dynamic range vs fixed point representation
- mantissa bit width determines how much precision we can get, exponent gives us how much dynamic range
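For reference (standard IEEE arithmetic, not shown on the slide): a normalized float with p mantissa bits, e exponent bits, stored fraction m, and stored exponent E has the value

```latex
x = (-1)^{s}\left(1 + \frac{m}{2^{p}}\right) 2^{\,E - \mathrm{bias}},
\qquad \mathrm{bias} = 2^{e-1} - 1,
```

so p sets the relative precision (steps of about 2^-p) and e sets the dynamic range.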
- disadvantages of IEEE
→ memory reference bottlenecks
→ loading more precision than we end up using
→ dataset may not require all that dynamic range
- for ML applications
→ push towards a mix of 16- and 32-bit floating point
→ transfer values as FP16, translate to 32 bit if no GPU
→ use CPU FP unit or GPU 16 bit unit
→ half the memory storage and transferring smaller values
→ converting between the formats can cause overflow/underflow because the exponent ranges differ
→ FP16 has only 5 exponent bits, which is too narrow for neural networks
→ google's bfloat16 has 8 exponent bits, the same dynamic range as 32-bit, but only 7 bits of mantissa, which drops precision (quick numbers after this list)
→ then we can use mixed precision in DNNs: bfloat16 in convolution layers and FP32 for accumulation and normalization
→ native support in hardware in newer chips
→ we can choose diff performance metrics for diff applications
⇒ you can take the specs of your data and measure:
• energy usage
• memory usage
• in supervised learning: compare accuracy vs established model
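A quick sanity check of the range/precision trade-off above, computing the largest finite value and the relative precision straight from each format's bit widths (FP16 = 5 exponent / 10 mantissa bits, bfloat16 = 8 / 7, FP32 = 8 / 23):

```python
def fp_stats(exp_bits, man_bits):
    """Largest finite value and unit roundoff implied by an IEEE-style bit layout."""
    bias = 2 ** (exp_bits - 1) - 1
    max_finite = (2 - 2.0 ** -man_bits) * 2.0 ** bias   # all-ones exponent is Inf/NaN
    epsilon = 2.0 ** -man_bits                          # relative spacing around 1.0
    return max_finite, epsilon

for name, e, m in [("FP16", 5, 10), ("bfloat16", 8, 7), ("FP32", 8, 23)]:
    mx, eps = fp_stats(e, m)
    print(f"{name:9s} max ≈ {mx:.3g}   eps ≈ {eps:.1e}")

# FP16      max ≈ 6.55e+04   eps ≈ 9.8e-04   <- narrow range, overflows easily in training
# bfloat16  max ≈ 3.39e+38   eps ≈ 7.8e-03   <- FP32-like range, much coarser precision
# FP32      max ≈ 3.4e+38    eps ≈ 1.2e-07
```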

applications
- where it shines
→ sacrificing accuracy for computational efficiency
⇒ image, audio, and video processing: humans can't perceive small changes in color or framerate
⇒ scientific and data driven
→ IOT
⇒ can send data to cloud computing environment for processing
⇒ or process data locally
• must be energy efficient
• can do approx calculations and get acceptable results
- approx adders and multipliers
- a lot of ML problems dont have a perfect solution anyway, so approx is fine
→ search engine, recommendations
- even if there is a perfect solution, some imperfection is expected
- models are resilient to noise, and approximations are kinda like another type of noise
- some applications are less error tolerant
→ critical medical infrastructure
- algorithmic solutions
→ for the dataset he showed, you can prune up to 80% before accuracy starts to drop (rough sketch below)
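The pruning itself wasn't shown; a minimal sketch of the usual magnitude-based approach (an assumption: pruning network weights by zeroing the smallest ones):

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.8):
    """Zero out the smallest-magnitude fraction of the weights (e.g. 80%)."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

w = np.random.randn(1000)
pruned = magnitude_prune(w, sparsity=0.8)
print((pruned == 0).mean())   # ≈ 0.8 of the weights are now zero
```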






- Why was the particular approach to computation adopted in preference to a traditional computer architecture?

-



- What are the architectural innovations supporting the proposed approach to computation?

-



- What advantages and disadvantages result from adopting the proposed approach to computation?

-



- Given the unique properties of the proposed computing platform, how did the authors go about measuring performance?

-



- What are the 'killer applications' for the proposed approach to computation?


-
