



#### Fall 2011 Prof. Hyesoon Kim







### Why Power?

### Power density continues to get worse



Georgia Tech College of Computing

Fred Pollack, Intel, Micro-32 Keynote



### **Power Dissipation in CMOS**



### $P_{tot} = P_{dyn} + P_{sta} = C_L V_{dd}^2 f + V_{dd} I_{leak}$
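As a rough numeric illustration of the equation above, the sketch below evaluates both terms for one operating point. The function name and all parameter values are assumptions chosen for illustration, not measurements of a real chip.

```python
def total_power(c_load, v_dd, freq, i_leak):
    """Return (dynamic, static, total) power in watts.

    Implements P_tot = C_L * V_dd^2 * f + V_dd * I_leak.
    """
    p_dyn = c_load * v_dd ** 2 * freq   # switching power
    p_sta = v_dd * i_leak               # leakage power
    return p_dyn, p_sta, p_dyn + p_sta


# Hypothetical operating point: 20 nF switched capacitance, 1.2 V, 3 GHz, 10 A leakage.
p_dyn, p_sta, p_tot = total_power(c_load=20e-9, v_dd=1.2, freq=3e9, i_leak=10.0)
print(f"dynamic = {p_dyn:.1f} W, static = {p_sta:.1f} W, total = {p_tot:.1f} W")
```

Because the dynamic term is quadratic in $V_{dd}$, lowering the supply voltage reduces it faster than lowering the frequency alone.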





### **Power Basics**

- Power vs. Energy
- Dynamic power vs. Static power
  - Dynamic: "switching" power
  - Static: "leakage" power
  - Dynamic power dominates, but static power is increasingly important





### Why do we care?

- 1) Higher power increases the CPU cost
  - Thermal cost: keeping the devices below the specified temperature
- 2) It increases the cost of power delivery





### **Power power power**



Figure: power over time (x-axis: CPU cycles)

- Max Power
- Worst-case application
- Average power
- Thermal power: running average of the worst-case application's power over several seconds; used to choose the cooling option
- Transient power (power delivery), standby power (battery life)

Reducing Power in High-performance Microprocessors, Tiwari et al. '98
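The thermal-power definition above is a running average over a window of a few seconds. A minimal sketch of that averaging, assuming power samples arrive at a fixed interval; the function name, window length, and sample values are hypothetical:

```python
from collections import deque


def running_average_power(samples, window):
    """Yield the average of the last `window` power samples (watts)."""
    recent = deque(maxlen=window)   # old samples fall out automatically
    for p in samples:
        recent.append(p)
        yield sum(recent) / len(recent)


# Hypothetical trace: a short 120 W spike inside a mostly 60 W workload.
trace = [60, 60, 120, 120, 60, 60, 60, 60]
averaged = list(running_average_power(trace, window=4))
print("max instantaneous power:", max(trace))
print("max running-average power:", max(averaged))
```

The averaged peak is lower than the instantaneous peak, which is why the cooling option is sized from thermal power rather than from max power.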



### Where does Power Go?







### **Power Reduction Techniques**

- Voltage scaling
- Clock gating
- Utilize circuit design techniques
- Low power logic synthesis
  - Non-critical path → low-power circuit (slower, but that does not matter off the critical path)
- Specific circuit technology
  - Reduce the activity factor, AF (domino circuit)








## **DVS (Dynamic Voltage Scaling)**

- The O/S controls the processor speed: it finds the minimum voltage required for the desired speed.
- DVFS (Dynamic Voltage and Frequency Scaling): e.g., Intel's CPU throttling technology, SpeedStep
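A minimal sketch of the selection step described above: pick the slowest operating point that still meets the requested speed, then estimate the dynamic-power saving using $P_{dyn} \propto V_{dd}^2 f$. The operating-point table and function names are hypothetical and do not describe SpeedStep itself.

```python
# Hypothetical (frequency in GHz, minimum stable voltage in V) operating points.
OPERATING_POINTS = [(1.0, 0.80), (1.6, 0.90), (2.2, 1.00), (2.8, 1.10), (3.4, 1.20)]


def select_operating_point(required_ghz):
    """Return the slowest (frequency, voltage) pair that meets the demand."""
    for freq, volt in OPERATING_POINTS:          # sorted by ascending frequency
        if freq >= required_ghz:
            return freq, volt
    return OPERATING_POINTS[-1]                  # demand exceeds the top speed


def relative_dynamic_power(freq, volt, ref=OPERATING_POINTS[-1]):
    """Dynamic power relative to the top operating point, using P ~ V^2 * f."""
    ref_freq, ref_volt = ref
    return (volt ** 2 * freq) / (ref_volt ** 2 * ref_freq)


freq, volt = select_operating_point(required_ghz=2.0)
print(freq, volt, round(relative_dynamic_power(freq, volt), 2))
```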



### DTM (Dynamic Thermal Management) Techniques

- DTM: software and hardware techniques at run-time to control a chip's operating temperature
- Thermal package is designed for normal operating conditions rather than worst case
- Key goals:
  - To provide inexpensive hardware/software responses
  - Reduce power
  - Minimize the impact on performance




Dynamic Thermal Management for High-Performance Microprocessors, Brooks and Martonosi '01


### DTM

- Trigger mechanisms
  - Temperature sensors, on-chip activity counters
- DTM Response mechanisms
  - Clock frequency scaling
  - Voltage and frequency scaling
  - Decode throttling (PowerPC G3)
  - Speculation control
  - I-cache toggling (disabling instruction cache)
  - Migration computation

Dynamic Thermal Management for High-Performance Microprocessors, Brooks and Martonosi '01; Temperature-Aware Microarchitecture: Modeling and Implementation, Skadron et al. '04





### **DTM: Migration Computation**



- A spare unit is located in a cold area of the chip
- When the primary unit reaches 81.6 °C, issue is stalled; instructions that are ready to write back are allowed to complete.
- All instructions then use the second register file.
- When the primary register file cools to 81.5 °C, the process is reversed

Temperature-Aware Microarchitecture: Modeling and Implementation, Skadron et al. '04
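The two thresholds above form a small hysteresis band. The sketch below models only that switching decision, assuming a temperature reading for the primary register file is available each interval; the function and variable names are hypothetical, and the stall and drain steps are reduced to comments.

```python
STALL_THRESHOLD_C = 81.6   # primary unit too hot: switch to the spare
RESUME_THRESHOLD_C = 81.5  # primary unit cooled: switch back


def choose_register_file(primary_temp_c, using_spare):
    """Return True if the spare register file should be used next."""
    if not using_spare and primary_temp_c >= STALL_THRESHOLD_C:
        return True    # stall issue, let write-backs finish, migrate to spare
    if using_spare and primary_temp_c <= RESUME_THRESHOLD_C:
        return False   # primary has cooled, reverse the process
    return using_spare  # otherwise keep the current choice


# Hypothetical sequence of primary register file temperatures (deg C).
using_spare = False
history = []
for temp in [81.0, 81.6, 81.55, 81.5, 81.2]:
    using_spare = choose_register_file(temp, using_spare)
    history.append(using_spare)
print(history)
```

Using two thresholds instead of one keeps the design from toggling between register files on every small temperature fluctuation.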




### Leakage Power Trend



- As technology scales, leakage power consumption increases
- Leakage power/current increases as temperature increases

Design Challenges of Technology Scaling, Borkar '99




- Body-bias control
- Dual-threshold domino circuits
- Input vector control (e.g., applying all 0s to the inputs of a NAND gate)
- Power gating



### **Power Gating**



- A sleep signal turns off the supply voltage
- Saves both dynamic power and leakage power

Microarchitectural Techniques for Power Gating of Execution Units, Hu et al.

## Pipeline Gating (Manne et al. '98)



Determine when a branch is likely to be mispredicted and gate the pipeline


- Use confidence estimator
- Other metrics (number of instructions)
- JVM
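One way to turn a confidence estimator into a gating decision is to count unresolved low-confidence branches and stall fetch once that count reaches a threshold. The sketch below illustrates the idea only; the class name, method names, and threshold value are assumptions, and the confidence estimator itself is treated as an external input.

```python
class PipelineGate:
    """Counts unresolved low-confidence branches and gates fetch."""

    def __init__(self, threshold=2):
        self.threshold = threshold          # hypothetical gating threshold
        self.low_confidence_in_flight = 0

    def on_branch_fetched(self, high_confidence):
        if not high_confidence:
            self.low_confidence_in_flight += 1

    def on_branch_resolved(self, was_high_confidence):
        if not was_high_confidence:
            self.low_confidence_in_flight -= 1

    def fetch_gated(self):
        """Fetch stalls while too many risky branches are unresolved."""
        return self.low_confidence_in_flight >= self.threshold


gate = PipelineGate(threshold=2)
gate.on_branch_fetched(high_confidence=False)
print(gate.fetch_gated())   # one risky branch in flight
gate.on_branch_fetched(high_confidence=False)
print(gate.fetch_gated())   # two risky branches in flight
gate.on_branch_resolved(was_high_confidence=False)
print(gate.fetch_gated())   # back to one
```

Gating fetch in this state avoids spending energy on instructions that are likely to be squashed.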



### **Clock Gating**

- Adds logic to a circuit to prune the clock tree
- Reduces dynamic power consumption
- Power-up delay (timing problem)
- Variations in current



### Benefits of Power Gating and Clock Gating

- Clock gating → reduces dynamic power consumption but not static power consumption
- Power gating → eliminates both dynamic and static power consumption
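The difference between the two techniques can be seen with an idealized model of a unit that is idle part of the time. The sketch assumes an ungated unit keeps burning its full dynamic power while idle, and it ignores power-up delay and other gating overheads; the function name and all numbers are hypothetical.

```python
def average_power(p_dyn, p_sta, idle_fraction, technique):
    """Average power of one unit that is idle for `idle_fraction` of the time.

    Idealized: clock gating removes all dynamic power while idle,
    power gating removes dynamic and static power while idle.
    """
    busy = 1.0 - idle_fraction
    if technique == "none":
        return p_dyn + p_sta
    if technique == "clock_gating":
        return busy * p_dyn + p_sta
    if technique == "power_gating":
        return busy * (p_dyn + p_sta)
    raise ValueError(f"unknown technique: {technique}")


# Hypothetical unit: 4 W dynamic, 1 W static, idle half of the time.
for technique in ("none", "clock_gating", "power_gating"):
    print(technique, average_power(4.0, 1.0, 0.5, technique))
```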






### **Architecture Power Simulators**

- SimpleScalar (Performance simulator)
- Wattch : Dynamic Power Simulator
- HotLeakage: Leakage current simulator
- HotSpot: thermal simulator
- McPAT: power, area, and timing modeling framework



### Wattch (Brooks et al. '00)



$P_d = C V_{dd}^2 a f$, where $C$ is the load capacitance, $V_{dd}$ the supply voltage, $a$ the switching activity, and $f$ the clock frequency.


- Array structures
- Fully associative content-addressable memories (CAM)
- Combinational logic and wires
- Clocking



### Array structure vs. CAM







| Hardware structures        | Model Type              |
|----------------------------|-------------------------|
| Instruction cache          |                         |
| Wakeup logic               |                         |
| Issue selection logic      |                         |
| Instruction window         |                         |
| Branch predictor           |                         |
| Register file              |                         |
| TLB                        |                         |
| Load/Store Queue           |                         |
| Data Cache                 |                         |
| Integer functional units   |                         |
| FP functional units        |                         |
| Global clock               |                         |

Wattch (Brooks et al. '00)




### **Metrics**

- Power
- Energy
- EDP (Energy Delay Product)
- EDDP (Energy Delay^2 Product): puts more emphasis on performance
- EPI (Energy per instruction)
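The metrics above can be computed from three measurements of a run: average power, execution time, and instruction count. A small sketch with hypothetical numbers; the function name and dictionary keys are illustrative choices.

```python
def metrics(avg_power_w, exec_time_s, instructions):
    """Return the energy-related metrics for one program run."""
    energy = avg_power_w * exec_time_s            # joules
    return {
        "power_w": avg_power_w,
        "energy_j": energy,
        "edp": energy * exec_time_s,              # energy-delay product
        "eddp": energy * exec_time_s ** 2,        # energy-delay^2 product
        "epi_j": energy / instructions,           # energy per instruction
    }


# Two hypothetical designs running the same 1e9-instruction program.
fast = metrics(avg_power_w=100.0, exec_time_s=1.0, instructions=1e9)
slow = metrics(avg_power_w=40.0, exec_time_s=2.0, instructions=1e9)
for name in ("energy_j", "edp", "eddp"):
    print(name, "fast:", fast[name], "slow:", slow[name])
```

In this example the slower design uses less energy, but the faster design wins on EDP and by a wider margin on EDDP, which shows how the delay exponent shifts the emphasis toward performance.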



### **Review: Performance vs. Power**

- Cooling capacity also decides the maximum power
- Back-of-the-Envelope calculation:
  - 3.8 GHz CPU at 100W
  - Dual-core: 50W per CPU
  - $P \propto V^3$: $V_{orig}^3/V_{CMP}^3 = 100W/50W \rightarrow V_{CMP} \approx 0.8 V_{orig}$
  - $f \propto V$: $f_{CMP} \approx 3.0GHz$
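The back-of-the-envelope numbers can be checked directly. The cubic relation follows from $P \propto V^2 f$ combined with $f \propto V$; the variable names below are illustrative.

```python
f_orig_ghz = 3.8          # single-core frequency from the example
p_orig_w = 100.0          # single-core power budget
p_core_w = p_orig_w / 2   # per-core budget in the dual-core chip

# P ~ V^3  =>  V_cmp / V_orig = (P_core / P_orig)^(1/3)
v_ratio = (p_core_w / p_orig_w) ** (1.0 / 3.0)

# f ~ V  =>  frequency scales by the same ratio
f_cmp_ghz = f_orig_ghz * v_ratio

print(f"V_cmp = {v_ratio:.2f} * V_orig, f_cmp = {f_cmp_ghz:.1f} GHz")
```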





### **Runtime Power Monitoring**



- Measuring current
- Real time power monitoring
- Power(Ci) = AccessRate(Ci) \* ArchitecturalScaling(Ci) \* MaxPower(Ci) + NonGatedClockPower(Ci)
- MaxPower  $\rightarrow$  proportional to area
- AccessRate  $\rightarrow$  dynamic events
- MaxPower, ArchitecturalScaling, NonGatedClockPower: determined from empirical data

Isci and Martonosi '03
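The per-component power equation above can be sketched as a small function that is summed over components. All constants and access rates below are hypothetical placeholders; in the actual methodology they are fitted from empirical data and hardware performance counters.

```python
def component_power(access_rate, arch_scaling, max_power, nongated_clock_power):
    """Power(Ci) = AccessRate * ArchitecturalScaling * MaxPower + NonGatedClockPower."""
    return access_rate * arch_scaling * max_power + nongated_clock_power


# Hypothetical per-component constants; real values are fitted empirically.
COMPONENTS = {
    #  name:        (arch_scaling, max_power_w, nongated_clock_power_w)
    "l1_cache":     (1.0, 8.0, 0.5),
    "integer_exec": (1.0, 12.0, 1.0),
}

# Hypothetical access rates derived from performance counters (events/cycle).
access_rates = {"l1_cache": 0.25, "integer_exec": 0.5}

total = 0.0
for name, (scaling, max_power, clock_power) in COMPONENTS.items():
    power = component_power(access_rates[name], scaling, max_power, clock_power)
    total += power
    print(f"{name}: {power:.2f} W")
print(f"total: {total:.2f} W")
```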

## Examples of Processor Components

- Access Rate calculation
- Use hardware performance counters to get events

| Component         | Access rate                                                                                                                                                         |
|-------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Bus Control       | $\frac{IOQ \ Allocation}{\Delta Cycles_1} + \frac{Bus \ Ratio \cdot FSB \ Data \ Activity}{\Delta Cycles_2}$                                                        |
| Front End BPU     | $\frac{8 \cdot ITLB \ Reference}{\Delta Cycles_1} + \frac{Branch \ Retired}{\Delta Cycles_2}$                                                                       |
| Secondary BPU     | $\frac{Branch \ Retired}{\Delta Cycles_2}$                                                                                                                          |
| L1 Cache          | $\frac{Ld Port Replay + St Port Replay}{\Delta Cycles_1} + \frac{Front End Event}{\Delta Cycles_2}$                                                                 |
| MOB               | $\frac{MOB \ Load \ Replay}{\Delta Cycles_2}$                                                                                                                       |
| Trace Cache       | $\frac{Uop\ Queue\ Writes}{\Delta Cycles_1}$                                                                                                                        |
| Integer Execution | $2 \cdot \left(\frac{Uop Queue Writes}{\Delta Cycles_1} - FP \ Exe. \ Access \ Rate\right) - L1 \ Cache \ Access \ Rate - \frac{Branch \ Retired}{\Delta Cycles_2}$ |
| L2 Cache          | $\frac{BSQ \ Cache \ Ref}{\Delta Cycles_1}$                                                                                                                         |
| DTLB              | L1 Cache Access Rate + MOB Access Rate                                                                                                                              |
| ITLB              | $\frac{TLB \ Ref}{\Delta Cycles_1} + \frac{BPU \ Fetch \ Req}{\Delta Cycles_2}$                                                                                     |


- Use training benchmarks to stress particular units
- Pentium 4-based design
- Different architectures have different units



### **Max Power Calculation**

Runtime\_power\_component = AccessRate × MaxPower

- Allowable maximum power consumption per architectural unit






### **Power Breakdowns**



Isci and Martonosi '03



### **GPU Power Breakdown**




Hong and Kim '10

## Hardware Performance Counters

- Built-in counters inside the hardware
- Example counters
  - Branch mispredictions, cache misses, retired instructions, pipeline bubbles, DRAM traffic
  - Tens (even hundreds) of event counters, but typically only a few can be read simultaneously
- Software
  - Typically Windows/Linux (Linux requires kernel recompilation)
  - PAPI, PerfMon, VTune, etc.








http://download.intel.com/technology/itj/q32000/pdf/thermal\_perf.pdf




### **Thermal Model**



Thermal behavior is modeled using an RC circuit

Temperature-Aware Microarchitecture: Modeling and Implementation, Skadron et al. '04
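To give a feel for the RC analogy, the sketch below simulates a single thermal node: power plays the role of current, temperature the role of voltage, with one thermal resistance to ambient and one thermal capacitance. This is a one-node simplification under assumed parameter values, not the multi-node network a full thermal simulator would use.

```python
def simulate_temperature(power_w, r_th, c_th, t_ambient, dt, steps):
    """Forward-Euler integration of C * dT/dt = P - (T - T_ambient) / R."""
    temp = t_ambient
    trace = []
    for _ in range(steps):
        d_temp = (power_w - (temp - t_ambient) / r_th) / c_th
        temp += d_temp * dt
        trace.append(temp)
    return trace


# Hypothetical block: R = 0.8 K/W, C = 10 J/K, 50 W, 45 C ambient.
trace = simulate_temperature(power_w=50.0, r_th=0.8, c_th=10.0,
                             t_ambient=45.0, dt=0.1, steps=1000)
steady_state = 45.0 + 50.0 * 0.8   # T_ambient + P * R
print(f"after 100 s: {trace[-1]:.2f} C (steady state {steady_state:.2f} C)")
```

The temperature approaches $T_{ambient} + P \cdot R$ with time constant $RC$, so short power spikes raise the temperature far less than sustained power does.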



### **Temperature Map Measurements**

- Use an IR camera (Jose Renau, UCSC)







### **Power Virus**

- Maximum power consumption code
- How?
  - Use data from L1 or L2
  - Pipelines and queues are maintained full
  - Sustained for a longer period (a meaningful program)






### **Other Issues**

- Power consumption
  - Not only CPUs
  - Memory, I/O devices, other units

