# **Energy-Aware Computing** Lecture 5: Simplescalar/Wattch howtos

### Outline

- How to build, configure and run
- Understanding the results
  - in particular the power-related ones
- Simplescalar code navigation
- Wattch code navigation

#### How to build the executable

- Download the annotated, slightly refactored version from http://www.inf.ed.ac.uk/teaching/courses/eac/hl.tar.gz
- Decompress and untar it somewhere suitable
- First time only:
  - `make config-alpha`
  - `make symlinks`
- After source code changes:
  - `make`

### Configure processor

- Create a default configuration file conf:
  - `./sim-outorder -dumpconfig conf`
- Huge number of options
  - options after bugcompat are for decay/leakage
  - but watch out: your added options may appear later!
- Most options are self-explanatory and have a comment line describing the available values
- Also explained in the user guide

#### Run a benchmark

- Download a benchmark from http://www.inf.ed.ac.uk/teaching/courses/eac/traces/
- Open the conf file and change:

```
-max:inst 0 to -max:inst 1000000
-fastfwd 0 to -fastfwd 10000000
```

- The above options are just for testing
  - you should think carefully about how to configure, simulate and organize files when experimenting
- `./sim-outorder -config conf <trace> >& trace.out`

# Reading simulation output

- Simulation outputs a large text file containing:
  - Some CACTI output, in a weird order
    - check for error messages
  - Leakage power with detailed breakdown
  - Dynamic power per unit, per operation
    - strangely starts with some processor parameters
  - Simulation configuration options & notes
  - Number of fast-forwarded instructions
  - Lots of simulation "statistics"/counters:
    - up to mem.ptab\_miss\_rate
  - Dynamic power results
    - up to max\_cycle\_power\_cc3
  - Leakage power results

#### Power results

- XYZ power consumption (at the beginning of the output file, from dump\_power\_stats()) is in Watts
  - crossover\_scaling accounts for short-circuit power.
Fixed at 20% in Wattch
- (total) XYZ\_power is a measure of *energy*:
  - for each cycle that XYZ is used, its power is added up. If you multiply this sum by the cycle time (variable Period in powerinit.c), you get the actual energy
- avg\_XYZ\_power is XYZ\_power/sim\_cycles
  - (multiplied by Period) a measure of energy per cycle
- sim\_cycles is the total number of cycles in the full simulation; it does not include fast-forwarding

#### Power "statistics"

- rename\_power includes RAT, dependency check logic (DCL), instruction decode
- bpred\_power includes BTB, RAS, local and global predictors, chooser
- icache\_power includes I\$ and I-TLB
- alu\_power includes integer and FP ALUs
- fetch\_stage\_power = icache\_power + bpred\_power
- dispatch\_stage\_power = rename\_power
- issue\_stage\_power = alu + resultbus + dcache + dcache2 + window + lsq
- total\_power = rename\_power + fetch\_stage\_power + issue\_stage\_power + regfile\_power + clock\_power

# Clock gating

- A method for saving power when idle
  - More details in a future lecture
- Wattch has 4 "modes":
  - XYZ: no conditional clocking
  - XYZ\_cc1: simple conditional clocking
  - XYZ\_cc2: aggressive ideal conditional clocking
  - XYZ\_cc3: aggressive non-ideal conditional clocking

# Simplescalar code navigation

- We saw some of the data structures last time
- Most of the code is in sim-outorder.c
  - a beast of ~5,500 lines of code
- You will probably need to look at cache.c
  - just 1,621 lines, most of them for leakage/decay (TBI)
- You shouldn't need to touch any other file
  - unless you add instructions or make other major changes to the processor

### Understanding simplescalar

- Start with sim\_main() at the end of sim-outorder.c
- Don't get bogged down in minor details; grasp the basics
- Then read the main functions:
  - ruu\_fetch(), ruu\_dispatch(), ruu\_issue(), lsq\_refresh(), ruu\_writeback(), ruu\_commit()
- Could take a full day's work, probably
more
- Do this; it will save a lot of debugging time later

# sim\_main()

- perform the fast-forward phase
- forever do
  - ruu\_commit()
  - ruu\_release\_fu()
  - ruu\_writeback()
  - lsq\_refresh()
  - ruu\_issue()
  - ruu\_dispatch()
  - ruu\_fetch()

- Every iteration is a *single cycle*
- Multiple instructions are handled inside most functions in a loop
  - superscalar machine: many instructions per cycle

### ruu\_fetch()

- Fetch and predict a number of instructions
- It stalls on I\$ misses and branch mispredictions
  - using variable ruu\_fetch\_issue\_delay in sim\_main()
- Instructions are placed in the fetch\_data[] circular buffer
- I\$ and iTLB are accessed to update their "status" and to provide a latency, which determines hit/miss
  - the instruction itself is actually fetched from memory
- Branch predictor can cheat
  - the instruction opcode is passed to it

# ruu\_dispatch()

- Pick from fetchQ and decode instructions (in order)
  - in reality it also executes them
  - uses a C switch statement and lots of macros
- Breaks loads/stores into an effective address calculation (into the RUU) and the load/store itself (into the LSQ)
- Register renaming and dependency checking
  - using ruu\_link\_idep, ruu\_install\_odep
- Checks if operands are ready and places the instruction into readyQ
- Checks for misprediction, sets spec\_mode, keeps recovery info.
# ruu\_issue()

- Get the next ready instruction from readyQ
- Stores complete immediately
- Loads check the LSQ, then access D\$ and dTLB
- All instructions (except stores) try to get an appropriate functional unit
  - fu = res\_get(fu\_pool, MD\_OP\_class(rs->op))
- Schedule a future event for completion
  - eventq\_queue\_event(rs, sim\_cycle + latency)

# lsq\_refresh()

- Scheduling for loads/stores
- Scan the LSQ in order
  - Store with unknown address: stop scanning
  - Store with unknown data: remember its address in std\_unknowns
  - Ready load matching an address in std\_unknowns: don't issue
  - Other ready loads move to readyQ

# ruu\_writeback()

- Gets events from eventQ, if the time is right
- If "recover instruction": squash the pipe, correct the PC, set ruu\_fetch\_issue\_delay
- Update the rename table
- Broadcast the result to consuming instructions
  - They may become ready; place them in readyQ

### ruu\_commit()

- Scan the RUU in order
- If the instruction is not complete (not written back), stop
- If store: get a memory port, access D\$ and dTLB
- Release the LSQ entry for loads/stores
- Release the RUU entry

#### How wattch works

- At initialisation phase (run once)
  - Calculate (calculate\_power) and report power per unit (dump\_power\_stats)
    - stored in the power C structure
  - Clear the cumulative power (energy) global variables
- Per simulation cycle:
  - clear\_access\_stats() at the beginning of the cycle
  - Calculate "power" at the end of the cycle in update\_access\_stats()
- Most of the code is in power.c, ~2,600 lines of code
- Lots of access counter updates in sim-outorder.c

### Now it's your turn

- You will add a filter cache (for data accesses only) to a processor and evaluate it
- A filter cache is a small, highly associative level-0 cache. The level 1 D\$ is only accessed when the filter cache misses
- Details in the coursework #1a handout
  - worth 5% of total course marks
- I will release a model answer after the deadline