Control Flow Graphs


Transcript of Control Flow Graphs

Page 1: Control Flow Graphs

λ Fernando Magno Quintão Pereira

PROGRAMMING LANGUAGES LABORATORY
Universidade Federal de Minas Gerais - Department of Computer Science

PROGRAM ANALYSIS AND OPTIMIZATION – DCC888

CONTROL FLOW GRAPHS

The material in these slides has been taken from the "Dragon Book", Secs 8.5 – 8.7, by Aho et al.

Page 2: Control Flow Graphs

Intermediate Program Representations

• Optimizing compilers and human beings do not see the program in the same way.
– We are more interested in source code.
– But source code is too different from machine code.
– Besides, from an engineering point of view, it is better to have a common way to represent programs written in different languages and targeting different architectures.

[Figure: several front ends (Fortran, COBOL, Lisp) feed a common optimizer, which in turn feeds back ends for different architectures (ARM, x86, PowerPC).]

Page 3: Control Flow Graphs

Basic Blocks and Flow Graphs

• Usually compilers represent programs as control flow graphs (CFG).

• A control flow graph is a directed graph.
– Nodes are basic blocks.
– There is an edge from basic block B1 to basic block B2 if program execution can flow from B1 to B2.

• Before defining basic blocks, we will illustrate this notion by showing the CFG of the function on the right.

void identity(int** a, int N) {
  int i, j;
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
      a[i][j] = 0;
    }
  }
  for (i = 0; i < N; i++) {
    a[i][i] = 1;
  }
}

What does this program do?

Page 4: Control Flow Graphs

The Low Level Virtual Machine

• We will be working with a compilation framework called The Low Level Virtual Machine, or LLVM, for short.

• LLVM is today the most used compiler in research.
• Additionally, this compiler is used in many important companies: Apple, Cray, Google, etc.

Page 5: Control Flow Graphs

Using LLVM to visualize a CFG

• We can use the opt tool, the LLVM machine-independent optimizer, to visualize the control flow graph of a given function:

$> clang -c -emit-llvm identity.c -o identity.bc

$> opt -view-cfg identity.bc

• We will be able to see the CFG of our target program, as long as we have the tool DOT installed in our system.

Page 6: Control Flow Graphs

The CFG that LLVM produces

How are the basic blocks formed?

• LLVM represents programs as sequences of instructions called bytecodes. These instructions do not target a specific machine.

• Basic blocks, the nodes in the CFG, are made of sequences of bytecodes.

Page 7: Control Flow Graphs

Basic Blocks

• Basic blocks are maximal sequences of consecutive instructions with the following properties:
– The flow of control can only enter the basic block through the first instruction in the block, i.e., there are no jumps into the middle of the block.
– Control will leave the block without halting or branching, except possibly at the last instruction in the block.

How can we identify basic blocks from linear sequences of instructions?

Page 8: Control Flow Graphs

Identifying Basic Blocks

• The first instruction of a basic block is called a leader.
• We can identify leaders via these three properties:
1. The first instruction in the intermediate code is a leader.
2. Any instruction that is the target of a conditional or unconditional jump is a leader.
3. Any instruction that immediately follows a conditional or unconditional jump is a leader.
• Once we have found the leaders, it is straightforward to find the basic blocks: for each leader, its basic block consists of the leader itself plus all the instructions up to, but not including, the next leader. A sketch of this procedure is given below.
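The sketch below is not from the slides: it is a minimal illustration of how the three leader rules can be turned into code. It marks leaders in a toy linear instruction sequence and then cuts the sequence at each leader to form basic blocks. The Instr struct and its fields are hypothetical placeholders for a real intermediate representation.

// Minimal sketch (not from the slides): partitioning a linear instruction
// sequence into basic blocks using the three "leader" rules.
#include <cstdio>
#include <iterator>
#include <set>
#include <vector>

struct Instr {
  const char *text;   // textual form, only for printing
  bool isJump;        // conditional or unconditional jump?
  int target;         // index of the jump target, -1 if not a jump
};

std::vector<std::vector<int>> buildBasicBlocks(const std::vector<Instr> &code) {
  std::set<int> leaders;
  if (!code.empty()) leaders.insert(0);                        // rule 1: first instruction
  for (int i = 0; i < (int)code.size(); ++i) {
    if (code[i].isJump) {
      if (code[i].target >= 0) leaders.insert(code[i].target); // rule 2: jump target
      if (i + 1 < (int)code.size()) leaders.insert(i + 1);     // rule 3: instruction after a jump
    }
  }
  // A block runs from a leader up to, but not including, the next leader.
  std::vector<std::vector<int>> blocks;
  for (auto it = leaders.begin(); it != leaders.end(); ++it) {
    int begin = *it;
    auto next = std::next(it);
    int end = (next == leaders.end()) ? (int)code.size() : *next;
    std::vector<int> block;
    for (int i = begin; i < end; ++i) block.push_back(i);
    blocks.push_back(block);
  }
  return blocks;
}

int main() {
  // 0: i = 0            1: if i >= N goto 5   2: a[i] = 0
  // 3: i = i + 1        4: goto 1             5: ret
  std::vector<Instr> code = {
      {"i = 0", false, -1}, {"if i >= N goto 5", true, 5}, {"a[i] = 0", false, -1},
      {"i = i + 1", false, -1}, {"goto 1", true, 1}, {"ret", false, -1}};
  auto blocks = buildBasicBlocks(code);
  for (size_t b = 0; b < blocks.size(); ++b) {
    std::printf("B%zu:\n", b);
    for (int i : blocks[b]) std::printf("  %s\n", code[i].text);
  }
}

On this small loop, the rules produce four basic blocks: the initialization, the loop test, the loop body, and the final return.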

Page 9: Control Flow Graphs

DCC 888

λ Universidade Federal de Minas Gerais – Department of Computer Science – Programming Languages Laboratory

LOCAL OPTIMIZATIONS

Page 10: Control Flow Graphs

Local Optimization

• Code optimization techniques that work in the scope of a basic block are called local optimizations.
– DAG-based optimizations
– Peephole optimizations
– Local register allocation
• Code optimization techniques that need to analyze the entire control flow graph of a program are called global optimizations.
– Most of the optimizations that we will see in this course are global optimizations.

1) Can you think about a concrete local optimization?

2) Can you think about any global optimization?

3) Can you think about any larger scope of optimization?

Page 11: Control Flow Graphs

DAG-Based Optimizations

• Some of the earliest code optimizations performed by compilers relied on a directed acyclic graph (DAG) representation of the instructions in the basic block. These DAGs are constructed as follows:
– There is a node in the DAG for each input value appearing in the basic block.
– There is a node associated with each instruction in the basic block.
– If instruction S uses variables defined in statements S1, ..., Sn, then we have edges from each Si to S.
– If a variable is defined in the basic block, but is not used inside it, then we mark it as an output value.

Page 12: Control Flow Graphs

Example of DAG Representation

• In the program on the left, the first occurrences of b, c, and d are input values, because they are used in the basic block but are not defined before their first use.

• We say that the definitions of c and d, at lines 3 and 4, are output values, because these definitions are not used in the basic block.

1: a = b + c

2: b = a – d

3: c = b + c

4: d = a – d

What is the DAG representation of this basic block?

Page 13: Control Flow Graphs

Example of DAG Representation

1: a = b + c

2: b = a – d

3: c = b + c

4: d = a – d

Could you design a simple algorithm to create DAGs from basic blocks?

Page 14: Control Flow Graphs

Example of DAG Representation

• For each input value vi:
– create a node vi in the DAG
– label this node with the tag in
• For each statement v = f(v1, ..., vn), in the sequence defined by the basic block:
– create a node v in the DAG
– create an edge (vi, v) for each i, 1 ≤ i ≤ n
– label this node with the tag f

1: a = b + c

2: b = a - d

3: c = b + c

4: d = a - d

1) Can you think about any optimization that we could perform in general?

2) Could you optimize this DAG in particular?

Page 15: Control Flow Graphs

Finding Local Common Subexpressions

• For each input value vi:
– create a node vi in the DAG
– label this node with the tag in
• For each statement v = f(v1, ..., vn), in the sequence defined by the basic block:
– create a node v in the DAG
– create an edge (vi, v) for each i, 1 ≤ i ≤ n
– label this node with the tag f

1: a = b + c

2: b = a - d

3: c = b + c

4: d = a - d

How could we adapt our algorithm to reuse expressions that have already been created?

Page 16: Control Flow Graphs

Finding Local Common Subexpressions

1. For each input value vi:
   1. create a node vi in the DAG
   2. label this node with the tag in
2. For each statement v = f(v1, ..., vn), in the sequence defined by the basic block:
   1. if the DAG already contains a node v' labeled f whose children are exactly (v1, ..., vn), in this order:
      1. let v' be an alias of v in the DAG
   2. else:
      1. create a node v in the DAG
      2. create an edge (vi, v) for each i, 1 ≤ i ≤ n
      3. label this node with the tag f

1: a = b + c

2: b = a - d

3: c = b + c

4: d = a - d

How can we check this property?

Page 17: Control Flow Graphs

Value Numbers

• We can associate each node in the DAG with a signature (lb, v1, …., vn), where lb is the label of the node, and each vi, 1 ≤ i ≤ n is a child of the node– We can use this signature

as the key of a hash-function

– The value produced by the hash-function is called the value-number of that variable

• Thus, whenever we build a new node to our DAG, e.g., step 2.1 of our algorithm, we perform a search in the hash-table. If the node is already there, we simply return a reference to it, instead of creating a new node.

What is the result of optimizing this DAG?
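Below is a minimal sketch of this idea (not from the slides), assuming every expression has exactly two operands: the signature of each operation is used as a key in a map, and a repeated signature returns the existing value number instead of creating a new node. Commutative operators, discussed on the next slide, could be handled by sorting the two operand numbers before building the signature.

// Minimal sketch (not from the slides) of local value numbering: each
// expression signature (op, operand value numbers) is hashed; if the
// signature was seen before, the existing value number is reused.
#include <cstdio>
#include <map>
#include <string>

struct ValueNumbering {
  std::map<std::string, int> table;      // signature -> value number
  std::map<std::string, int> varToNum;   // variable name -> current value number
  int next = 0;

  int numberOf(const std::string &var) { // unseen variables are input values
    auto it = varToNum.find(var);
    if (it != varToNum.end()) return it->second;
    return varToNum[var] = next++;
  }

  // v = op(a, b): returns the value number assigned to v and reports whether
  // the expression was already available (a local common subexpression).
  int assign(const std::string &v, char op, const std::string &a,
             const std::string &b, bool &reused) {
    std::string sig = op + std::to_string(numberOf(a)) + "," +
                      std::to_string(numberOf(b));
    auto it = table.find(sig);
    reused = (it != table.end());
    int vn = reused ? it->second : (table[sig] = next++);
    varToNum[v] = vn;                    // v now names this DAG node
    return vn;
  }
};

int main() {
  ValueNumbering vn;
  bool reused;
  vn.assign("a", '+', "b", "c", reused); std::printf("a = b + c  reused=%d\n", reused);
  vn.assign("b", '-', "a", "d", reused); std::printf("b = a - d  reused=%d\n", reused);
  vn.assign("c", '+', "b", "c", reused); std::printf("c = b + c  reused=%d\n", reused);
  vn.assign("d", '-', "a", "d", reused); std::printf("d = a - d  reused=%d\n", reused);
}

On the running example, only the last instruction, d = a - d, reuses a value number: it has the same signature as b = a - d. The addition c = b + c is not a common subexpression, because b was redefined between the two additions.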

Page 18: Control Flow Graphs

Value Numbers

• Value numbers allow us to refer to nodes in our DAG by their semantics, instead of by their textual program representation.

Page 19: Control Flow Graphs

Finding more identities

• We can use several tricks to find common subexpressions in DAGs:
– Commutativity: the value number of x + y and y + x should be the same.
– Identities: the comparison x < y can often be implemented by t = x - y; t < 0.
– Associativity: the sequence
  a = b + c
  t = c + d
  e = t + b
  can be rewritten as
  a = b + c
  e = a + d

Do you know any other cool identity?

Page 20: Control Flow Graphs

Value Numbers

Dead code elimination: which nodes can be safely removed from the DAG?

Page 21: Control Flow Graphs

Dead Code Elimination

• We can eliminate any node if:
– No other node uses the value of this node, i.e., it is a root of the DAG.
– This node is not marked as an output node.

• We can iterate this pattern of eliminations until we have no more nodes that we can remove.

• Which nodes could we eliminate if (-, d) were not an output node?

• Which nodes could we eliminate if (+, c) were not an output node?

Page 22: Control Flow Graphs

Dead Code Elimination

• We can eliminate any node if:
– No other node uses the value of this node, i.e., it is a root of the DAG.
– This node is not marked as an output node.

• We can iterate this pattern of eliminations until we have no more nodes that we can remove.
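The following is a minimal sketch (not from the slides) of this iterative elimination over a toy Node structure: a node is removed when no live node lists it as an operand and it is not an output value. Removing one node can expose new candidates, hence the loop until a fixed point.

// Minimal sketch (not from the slides): iteratively delete DAG nodes that
// have no users and are not marked as output values.
#include <cstdio>
#include <vector>

struct Node {
  const char *label;
  std::vector<int> operands;   // indices of the nodes whose values it uses
  bool isOutput;
  bool dead;
};

void eliminateDeadNodes(std::vector<Node> &dag) {
  bool changed = true;
  while (changed) {
    changed = false;
    for (int n = 0; n < (int)dag.size(); ++n) {
      if (dag[n].dead || dag[n].isOutput) continue;
      bool hasUser = false;               // is n an operand of any live node?
      for (const Node &m : dag)
        if (!m.dead)
          for (int op : m.operands)
            if (op == n) hasUser = true;
      if (!hasUser) { dag[n].dead = true; changed = true; }
    }
  }
}

int main() {
  // a = b + c; t = a * d (never used); e = a - d (output value).
  std::vector<Node> dag = {
      {"b", {}, false, false}, {"c", {}, false, false}, {"d", {}, false, false},
      {"+ (a)", {0, 1}, false, false},
      {"* (t)", {3, 2}, false, false},   // no users, not an output: dead
      {"- (e)", {3, 2}, true, false}};
  eliminateDeadNodes(dag);
  for (const Node &n : dag)
    std::printf("%-8s %s\n", n.label, n.dead ? "removed" : "kept");
}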

Page 23: Control Flow Graphs

Algebraic Identities

• We can exploit algebraic identities to optimize DAGs:
– Arithmetic identities:
  • x + 0 = 0 + x = x; x * 1 = 1 * x = x; x - 0 = x; x / 1 = x
– Reduction in strength:
  • x² = x * x; 2 * x = x + x; x / 2 = x * 0.5
– Constant folding:
  • evaluate expressions at compilation time, replacing each expression by its value

Could you provide a rationale for each of these reductions in strength?

Reduction in strength optimizes code by replacing some sequences of instructions by others, which can be computed more efficiently.
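The sketch below is not from the slides; it shows what these rules look like when applied to a single binary operation over a toy Expr type. The rules it implements are exactly the identities, the 2 * x strength reduction, and the constant folding listed above.

// Minimal sketch (not from the slides): simplifying one binary operation
// with algebraic identities, strength reduction and constant folding.
#include <cstdio>
#include <string>

struct Expr {               // either a constant or a named variable
  bool isConst;
  long value;               // valid when isConst
  std::string name;         // valid when !isConst
};

std::string show(const Expr &e) {
  return e.isConst ? std::to_string(e.value) : e.name;
}

// Rewrites "a op b" into a cheaper form when one of the rules applies.
std::string simplify(char op, const Expr &a, const Expr &b) {
  if (a.isConst && b.isConst) {                                   // constant folding
    long r = op == '+' ? a.value + b.value
           : op == '-' ? a.value - b.value
           : op == '*' ? a.value * b.value : a.value / b.value;
    return std::to_string(r);
  }
  if (op == '+' && b.isConst && b.value == 0) return show(a);     // x + 0 = x
  if (op == '+' && a.isConst && a.value == 0) return show(b);     // 0 + x = x
  if (op == '-' && b.isConst && b.value == 0) return show(a);     // x - 0 = x
  if (op == '*' && b.isConst && b.value == 1) return show(a);     // x * 1 = x
  if (op == '*' && a.isConst && a.value == 1) return show(b);     // 1 * x = x
  if (op == '*' && b.isConst && b.value == 2)
    return show(a) + " + " + show(a);                             // 2 * x = x + x
  if (op == '/' && b.isConst && b.value == 1) return show(a);     // x / 1 = x
  return show(a) + " " + op + " " + show(b);                      // no rule applies
}

int main() {
  Expr x{false, 0, "x"}, zero{true, 0, ""}, one{true, 1, ""},
       two{true, 2, ""}, three{true, 3, ""};
  std::printf("x + 0  -> %s\n", simplify('+', x, zero).c_str());
  std::printf("x * 1  -> %s\n", simplify('*', x, one).c_str());
  std::printf("x * 2  -> %s\n", simplify('*', x, two).c_str());
  std::printf("2 + 3  -> %s\n", simplify('+', two, three).c_str());
}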

Page 24: Control Flow Graphs

Peephole Optimizations

• Peephole optimizations are a category of local code optimizations.

• The principle is very simple:
– the optimizer analyzes sequences of instructions.
– only code that is within a small window of instructions is analyzed each time.
– this window slides over the code.
– once patterns are discovered inside this window, optimizations are applied.

How to find the best window size?

Page 25: Control Flow Graphs

Redundant Loads and Stores

• Some memory access patterns are clearly redundant:

load R0, m
store m, R0

• Patterns like this can be easily eliminated by a peephole optimizer.

2) Why is this optimization only safe inside a basic block?

1) This code is very naïve. How could it appear in actual assembly programs?
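Below is a minimal sketch (not from the slides) of a peephole pass with a window of two instructions over a toy instruction list: it drops a store that writes back the value just loaded from the same location. As question 2 suggests, this is only safe when no jump can enter between the two instructions, which is guaranteed inside a basic block.

// Minimal sketch (not from the slides): a two-instruction peephole window
// that removes "store m, R" when it immediately follows "load R, m".
#include <cstdio>
#include <string>
#include <vector>

struct Inst {
  std::string op;   // "load", "store", or anything else
  std::string reg;  // register operand
  std::string mem;  // memory operand
};

std::vector<Inst> peephole(const std::vector<Inst> &code) {
  std::vector<Inst> out;
  for (size_t i = 0; i < code.size(); ++i) {
    // Window of size two: look at code[i] and code[i + 1].
    if (i + 1 < code.size() && code[i].op == "load" && code[i + 1].op == "store" &&
        code[i].reg == code[i + 1].reg && code[i].mem == code[i + 1].mem) {
      out.push_back(code[i]);  // keep the load, drop the redundant store
      ++i;
      continue;
    }
    out.push_back(code[i]);
  }
  return out;
}

int main() {
  std::vector<Inst> code = {{"load", "R0", "m"}, {"store", "R0", "m"}, {"add", "R1", "R0"}};
  for (const Inst &in : peephole(code))
    std::printf("%s %s %s\n", in.op.c_str(), in.reg.c_str(), in.mem.c_str());
}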

Page 26: Control Flow Graphs

Branch Transformations

• Some branches can be rewritten into faster code. For instance:

if debug == 1 goto L1
goto L2
L1: ...
L2: ...

if debug != 1 goto L2
L1: ...
L2: ...

• This optimization crosses the boundaries of basic blocks, but it is still performed on a small sliding window.

Page 27: Control Flow Graphs

Jumps to Jumps

• How could we optimize this code sequence?

goto L1
...
L1: goto L2

Page 28: Control Flow Graphs

Jumps to Jumps

• How could we optimize this code sequence?

goto L1
...
L1: goto L2

goto L2
...
L1: goto L2

Why is the code on the right faster?

Could we remove the third line altogether?

Page 29: Control Flow Graphs

Jumps to Jumps

• How could we optimize this code sequence?

goto L1
...
L1: goto L2

goto L2
...
L1: goto L2

goto L2
...

• We can eliminate the second jump (the instruction at L1), provided that we know that there is no jump to L1 in the rest of the program, and L1 is preceded by an unconditional jump.
• Why?

• Notice that this peephole optimization requires some previous information gathering: we must know which instructions are targets of jumps.

Page 30: Control Flow Graphs

Jumps to Jumps

• How could we optimize this code sequence?

if a < b goto L1
...
L1: goto L2

We saw how to optimize the sequence on the right. Does the same optimization apply on the code below?

goto L1
...
L1: goto L2

Page 31: Control Flow Graphs

Jumps to Jumps

• How could we optimize this code sequence?

if a < b goto L1
...
L1: goto L2

if a < b goto L2
...
L1: goto L2

Nothing new here. The optimization is the same as in the previous case. And if there is no jump to L1, we can remove it from the code, as it becomes dead-code after our transformation.

Page 32: Control Flow Graphs

Jumps to Jumps

• How could we optimize this code sequence?– Under which assumptions is your optimization valid?

goto L1
...
L1: if a < b goto L2
L3:

if a < b goto L2
goto L3
...
L3:

Page 33: Control Flow Graphs

Jumps to Jumps

• How could we optimize this code sequence?– Under which assumptions is your optimization valid?

• In order to apply this optimization, we need to make sure that:
– there is no jump to L1;
– L1 is preceded by an unconditional goto.

Why are these assumptions necessary?

goto L1
...
L1: if a < b goto L2
L3:

if a < b goto L2
goto L3
...
L3:

Page 34: Control Flow Graphs

Reduction in Strength

• Instead of performing reduction in strength at the DAG level, many compilers do it via a peephole optimizer.
– This optimizer allows us to replace sequences such as 4 * x by x << 2, for instance.

4 * x is a pattern that is pretty common in loops that range over 32-bit words. Why?

Page 35: Control Flow Graphs

Machine Idioms

• Many computer architectures have efficient instructions to implement some common operations.
– A typical example is found in the increment and decrement operators:

incl %edi
addl $1, %edi

Are there any other machine idioms that we could think about?

Why is incl better than addl $1?

Page 36: Control Flow Graphs

Local Register Allocation

• Registers are memory locations which have very fast access times.

• However, registers exist in small number. For instance, the 32-bit x86 architecture has only eight visible registers.

• A key optimization consists in mapping the most used variables in a program onto registers.

• Compilers usually need to see the entire function (or even the whole program) to perform a good register allocation.

• But, if we need to do it quickly, we can only look into a basic block. In this case, we are doing local register allocation.

1) What are the most used variables in a program?

2) Why is it better to see the entire function, instead of a basic block, to do better register allocation?

3) Why would we be willing to do register allocation quickly?

4) How could we find registers for the variables in a basic block?

Page 37: Control Flow Graphs

Registers vs Memory

_arith_all_mem:
## BB#0:
	subl	$12, %esp
	movl	16(%esp), %eax
	movl	%eax, 8(%esp)
	movl	20(%esp), %eax
	movl	%eax, 4(%esp)
	movl	24(%esp), %eax
	movl	%eax, (%esp)
	movl	8(%esp), %ecx
	addl	4(%esp), %ecx
	imull	%eax, %ecx
	movl	%ecx, %eax
	shrl	$31, %eax
	addl	%ecx, %eax
	sarl	%eax
	addl	$12, %esp
	ret

_arith_reg_alloc:
## BB#0:
	movl	4(%esp), %ecx
	addl	8(%esp), %ecx
	imull	12(%esp), %ecx
	movl	%ecx, %eax
	shrl	$31, %eax
	addl	%ecx, %eax
	sarl	%eax
	ret

int arith(int a1, int an, int N) {
  return (a1 + an) * N / 2;
}

The program that uses registers is substantially faster than the program that maps all the variables to memory. It is shorter too, as we do not need so many loads and stores to move variables to and from memory.

2) Can you think about an algorithm to find registers for the variables inside a basic block?

1) Why do we not need to worry about "memory allocation"?

Page 38: Control Flow Graphs

Local Register Allocation

allocate(Block b) {
  for (Inst i = b.first_inst; i != b.last_inst; i++) {
    for (Operand o = i.first_operand; o != i.last_operand; o++) {
      if (o is in memory m) {
        r = find_reg(i)
        assign r to o
        add "r = load m" before i
      }
    }
    v = i.definition
    r = find_reg(i)
    assign r to v
    for (Operand o = i.first_operand; o != i.last_operand; o++) {
      if (i is the last use of o) {
        return the register bound to o to the list of free registers
      }
    }
  }
}

1) Why do we need to insert this load instruction in our code?

2) When are registers mapped into memory?

3) How would be the implementation of find_reg?

Page 39: Control Flow Graphs

Spilling

If the register pressure is too high, we may have to evict some variable to memory. This action is called spilling. If a variable is mapped into memory, then we call it a spill.

find_reg(i) {
  if there is a free register r
    return r
  else
    let v be the variable in a register whose next use is furthest after i
    let r be the register assigned to v
    if v does not have a memory slot
      let m be a fresh memory slot
    else
      let m be the memory slot assigned to v
    add "store v m" right after the definition of v
    return r
}

1) Why do we spill the variable that has the last use after the current instruction i?

2) Why do we need to insert this store in our code?

3) How many stores can we have per variable?

Page 40: Control Flow Graphs

Spilling the Furthest Use

• Usually we spill the variable that has the furthest use from the spilling point.

• This strategy is called the Belady♧ Algorithm, and it is also used in operating systems, when deciding which page to evict to disk in the virtual memory system.

But we must be careful when doing local register allocation, because if a variable is used after the basic block, we must assume that it has a use at the end of the block.

♧: A Study of Replacement algorithms for a Virtual Storage Computer (1966) – IBM Systems Journal
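Below is a minimal sketch (not from the slides) of the furthest-use choice, assuming we know, for each instruction of the block, which variables it reads: among the variables currently in registers, we pick the one whose next use is furthest away. For simplicity, every variable is treated as live past the block, i.e., as having a use at the block's end.

// Minimal sketch (not from the slides) of Belady-style spill selection.
#include <cstdio>
#include <string>
#include <vector>

// Next use of 'var' strictly after position 'pos'; uses[i] lists the
// variables read by instruction i. A variable that is live after the block
// counts as being used at position uses.size().
int nextUse(const std::string &var, int pos,
            const std::vector<std::vector<std::string>> &uses, bool liveOut) {
  for (int i = pos + 1; i < (int)uses.size(); ++i)
    for (const std::string &u : uses[i])
      if (u == var) return i;
  return liveOut ? (int)uses.size() : (int)uses.size() + 1; // never used again
}

std::string chooseSpill(const std::vector<std::string> &inRegs, int pos,
                        const std::vector<std::vector<std::string>> &uses) {
  std::string best;
  int bestDist = -1;
  for (const std::string &v : inRegs) {
    // Simplification: assume every variable is live after the block.
    int d = nextUse(v, pos, uses, /*liveOut=*/true);
    if (d > bestDist) { bestDist = d; best = v; }
  }
  return best;
}

int main() {
  // Uses per instruction: an instruction c = a + b would list {"a", "b"}.
  std::vector<std::vector<std::string>> uses = {
      {"a", "b"}, {"b", "c"}, {"a", "c"}};
  // With a and b in registers just after instruction 0, a's next use
  // (instruction 2) is further away than b's (instruction 1), so spill a.
  std::printf("spill %s\n", chooseSpill({"a", "b"}, 0, uses).c_str());
}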

Page 41: Control Flow Graphs

Example

1) How many registers will we need to allocate this program?

2) How many memory slots will we need to fit the spills?

Let's assume a simple architecture with two registers only.

Page 42: Control Flow Graphs

Example

Page 43: Control Flow Graphs

Example

Where can we fit variable c?

Spilling is in order. Which variable should we spill?

Page 44: Control Flow Graphs

Example

We only have two registers, and three variables, a, b, and c, are simultaneously alive. We had to choose between a and b to spill. Because a has the furthest use, we chose to evict it.

Page 45: Control Flow Graphs

Example

1) We need to reload variable b. Where?

2) Where can we put variable e?

Page 46: Control Flow Graphs

Example

Page 47: Control Flow Graphs

Example

Why is it not necessary to spill any variable (e, for instance) to allocate f?

Page 48: Control Flow Graphs

Example

How do we allocate "h = f - g"?

Page 49: Control Flow Graphs

Example

Page 50: Control Flow Graphs

Optimality

• If there exists an assignment of variables to registers that requires at most K registers, then our algorithm will find it.
– If registers had different sizes, e.g., singles and doubles, then this problem would be NP-complete.

• On the other hand, if we need to spill, then the problem of minimizing the cost of loads and stores is NP-complete♤.

1) Can you prove that the algorithm is optimal for register assignment?

2) Any intuition on why minimizing this cost is an NP-complete problem?

♤: On local register allocation, SODA, 1998

Page 51: Control Flow Graphs

DCC 888

λ Universidade Federal de Minas Gerais – Department of Computer Science – Programming Languages Laboratory

A QUICK INTRO TO LLVM

Page 52: Control Flow Graphs

LLVM is a Compilation Infra-Structure

• It is a framework that comes with lots of tools to compile and optimize code.

$> cd llvm/Debug+Asserts/bin$> lsFileCheck count llvm-dis llvm-stressFileUpdate diagtool llvm-dwarfdump llvm-symbolizerarcmt-test fpcmp llvm-extract llvm-tblgenbugpoint llc llvm-link macho-dumpc-arcmt-test lli llvm-lit modularizec-index-test lli-child-target llvm-lto notclang llvm-PerfectSf llvm-mc obj2yamlclang++ llvm-ar llvm-mcmarkup optllvm-as llvm-nm pp-trace llvm-sizeclang-check llvm-bcanalyzer llvm-objdump rm-cstr-callsclang-format llvm-c-test llvm-ranlib tool-templateclang-modernize llvm-config llvm-readobj yaml2objclang-tblgen llvm-cov llvm-rtdyld llvm-diffclang-tidy

Page 53: Control Flow Graphs

LLVM is a Compilation Infra-Structure

• Compile C/C++ programs:

$> echo "int main() {return 42;}" > test.c$> clang test.c $> ./a.out $> echo $?42

clang/clang++ are very competitive when compared with, say, gcc, or icc. Some of these compilers are faster in some benchmarks, and slower in others. Usually clang/clang++ have faster compilation times. The Internet is crowded with benchmarks.

Page 54: Control Flow Graphs

Optimizations in Practice

• The opt tool, available in the LLVM toolbox, performs machine independent optimizations.

• There are many optimizations available through opt.
– To get an idea, type opt --help.

Page 55: Control Flow Graphs

Optimizations in Practice

$> opt --help

Optimizations available:
  -adce                   - Aggressive Dead Code Elimination
  -always-inline          - Inliner for always_inline functions
  -break-crit-edges       - Break critical edges in CFG
  -codegenprepare         - Optimize for code generation
  -constmerge             - Merge Duplicate Global Constants
  -constprop              - Simple constant propagation
  -correlated-propagation - Value Propagation
  -dce                    - Dead Code Elimination
  -deadargelim            - Dead Argument Elimination
  -die                    - Dead Instruction Elimination
  -dot-cfg                - Print CFG of function to 'dot' file
  -dse                    - Dead Store Elimination
  -early-cse              - Early CSE
  -globaldce              - Dead Global Elimination
  -globalopt              - Global Variable Optimizer
  -gvn                    - Global Value Numbering
  -indvars                - Induction Variable Simplification
  -instcombine            - Combine redundant instructions
  -instsimplify           - Remove redundant instructions
  -ipconstprop            - Interprocedural constant propagation
  -loop-reduce            - Loop Strength Reduction
  -reassociate            - Reassociate expressions
  -reg2mem                - Demote all values to stack slots
  -sccp                   - Sparse Conditional Constant Propagation
  -scev-aa                - ScalarEvolution-based Alias Analysis
  -simplifycfg            - Simplify the CFG
  ...

What do you think each of these optimizations do?

Page 56: Control Flow Graphs

Levels of Optimizations

• Like gcc, clang supports different levels of optimizations, e.g., -O0 (default), -O1, -O2 and -O3.

• To find out which optimization each level uses, you can try:

$> llvm-as < /dev/null | opt -O3 -disable-output -debug-pass=Arguments

In my system (LLVM/Darwin), -O1 gives me:-targetlibinfo -no-aa -tbaa -basicaa -notti -globalopt -ipsccp -deadargelim -instcombine -simplifycfg -basiccg -prune-eh -inline-cost -always-inline -functionattrs -sroa -domtree -early-cse -lazy-value-info -jump-threading -correlated-propagation -simplifycfg -instcombine -tailcallelim -simplifycfg -reassociate -domtree -loops -loop-simplify -lcssa -loop-rotate -licm -lcssa -loop-unswitch -instcombine -scalar-evolution -loop-simplify -lcssa -indvars -loop-idiom -loop-deletion -loop-unroll -memdep -memcpyopt -sccp -instcombine -lazy-value-info -jump-threading -correlated-propagation -domtree -memdep -dse -adce -simplifycfg -instcombine -strip-dead-prototypes -preverify -domtree -verify

llvm-as is the LLVM assembler. It reads a file containing human-readable LLVM assembly language, translates it to LLVM bytecode, and writes the result into a file or to standard output.

Page 57: Control Flow Graphs

Virtual Register Allocation

• One of the most basic optimizations that opt performs is to map memory slots into variables.

• This optimization is very useful, because the clang front end maps every variable to memory:

int main() {
  int c1 = 17;
  int c2 = 25;
  int c3 = c1 + c2;
  printf("Value = %d\n", c3);
}

$> clang -c -emit-llvm const.c -o const.bc

$> opt -view-cfg const.bc

Page 58: Control Flow Graphs

Virtual Register Allocation

• One of the most basic optimizations that opt performs is to map memory slots into variables.

• We can map memory slots into registers with the mem2reg pass:

int main() {
  int c1 = 17;
  int c2 = 25;
  int c3 = c1 + c2;
  printf("Value = %d\n", c3);
}

$> opt -mem2reg const.bc > const.reg.bc

$> opt -view-cfg const.reg.bc

How could we further optimize this program?

Page 59: Control Flow Graphs

Constant Propagation

• We can fold the computation of expressions that are known at compilation time with the constprop pass.

$> opt -constprop const.reg.bc > const.cp.bc

$> opt -view-cfg const.cp.bc

What is %1 in the left CFG? And what is i32 42 in the CFG on the right side?

Page 60: Control Flow Graphs

One more: Common Subexpression Elimination

int main(int argc, char** argv) {
  char c1 = argc + 1;
  char c2 = argc - 1;
  char c3 = c1 + c2;
  char c4 = c1 + c2;
  char c5 = c4 * 4;
  if (argc % 2)
    printf("Value = %d\n", c3);
  else
    printf("Value = %d\n", c5);
}

$> clang -c -emit-llvm cse.c -o cse.bc

$> opt -mem2reg cse.bc -o cse.reg.bc

$> opt -view-cfg cse.reg.bc

How could we optimize this program?

Page 61: Control Flow Graphs

One more: Common Subexpression Elimination

$> opt -early-cse cse.reg.bc > cse.o.bc

$> opt -view-cfg cse.o.bc

Can you intuitively tell how CSE works?

Page 62: Control Flow Graphs

Writing an LLVM pass

• LLVM applies a chain of analyses and transformations on the target program.

• Each of these analyses or transformations is called a pass.

• We have seen a few passes already: mem2reg, early-cse and constprop, for instance.

• Some passes, which are machine independent, are invoked by the opt tool.

• Other passes, which are machine dependent, are invoked by the llc tool.

• A pass may require information provided by other passes. Such dependencies must be explicitly stated.

– For instance: a common pattern is a transformation pass requiring an analysis pass.
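As an illustration of such a dependency (this sketch is not part of the slides), a pass written for the legacy pass manager declares what it needs by overriding getAnalysisUsage. The analysis used here, LoopInfo, is just an example, and its exact class name changes across LLVM versions.

#include "llvm/Pass.h"
#include "llvm/IR/Function.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm;
namespace {
  // Hypothetical pass: it only reports how many top-level loops each
  // function has, so it requires the LoopInfo analysis and preserves the IR.
  struct NeedsLoops : public FunctionPass {
    static char ID;
    NeedsLoops() : FunctionPass(ID) {}
    // Tell the pass manager to run LoopInfo before this pass.
    virtual void getAnalysisUsage(AnalysisUsage &AU) const {
      AU.addRequired<LoopInfo>();
      AU.setPreservesAll();
    }
    virtual bool runOnFunction(Function &F) {
      LoopInfo &LI = getAnalysis<LoopInfo>();
      unsigned n = 0;
      for (LoopInfo::iterator L = LI.begin(), E = LI.end(); L != E; ++L)
        ++n;
      errs() << F.getName() << " has " << n << " top-level loops\n";
      return false; // the IR was not modified
    }
  };
}
char NeedsLoops::ID = 0;
static RegisterPass<NeedsLoops> Y("needsLoops", "Example of a pass that requires an analysis");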

Page 63: Control Flow Graphs

Counting Number of Opcodes in Programs

Let's write a pass that counts the number of times that each opcode appears in a given function. This pass must print, for each function, a list with all the opcodes that showed up in its code, followed by the number of times each of these opcodes has been used.

Function foo

add: 4

alloca: 5

br: 8

icmp: 3

load: 11

ret: 1

select: 1

store: 9

int foo(int n, int m) {
  int sum = 0;
  int c0;
  for (c0 = n; c0 > 0; c0--) {
    int c1 = m;
    for (; c1 > 0; c1--) {
      sum += c0 > c1 ? 1 : 0;
    }
  }
  return sum;
}

Page 64: Control Flow Graphs

Counting Number of Opcodes in Programs

#define DEBUG_TYPE "opCounter"#include "llvm/Pass.h"#include "llvm/IR/Function.h"#include "llvm/Support/raw_ostream.h"#include <map>using namespace llvm;namespace { struct CountOp : public FunctionPass { std::map<std::string, int> opCounter; static char ID; CountOp() : FunctionPass(ID) {} virtual bool runOnFunction(Function &F) { errs() << "Function " << F.getName() << '\n'; for (Function::iterator bb = F.begin(), e = F.end(); bb != e; ++bb) { for (BasicBlock::iterator i = bb->begin(), e = bb->end(); i != e; ++i) { if(opCounter.find(i->getOpcodeName()) == opCounter.end()) { opCounter[i->getOpcodeName()] = 1; } else { opCounter[i->getOpcodeName()] += 1; } } } std::map <std::string, int>::iterator i = opCounter.begin(); std::map <std::string, int>::iterator e = opCounter.end(); while (i != e) { errs() << i->first << ": " << i->second << "\n"; i++; } errs() << "\n"; opCounter.clear(); return false; } };}char CountOp::ID = 0;static RegisterPass<CountOp> X("opCounter", "Counts opcodes per functions");

This line defines the name of the pass, in the command line, e.g., opCounter, and the help string that opt provides to the user about the pass.

Our pass runs once for each function in the program; therefore, it is a FunctionPass. If we had to see the whole program, then we would implement a ModulePass.

Page 65: Control Flow Graphs

A Closer Look into our Pass

struct CountOp : public FunctionPass {
  std::map<std::string, int> opCounter;
  static char ID;
  CountOp() : FunctionPass(ID) {}
  virtual bool runOnFunction(Function &F) {
    errs() << "Function " << F.getName() << '\n';
    for (Function::iterator bb = F.begin(), e = F.end(); bb != e; ++bb) {
      for (BasicBlock::iterator i = bb->begin(), e = bb->end(); i != e; ++i) {
        if (opCounter.find(i->getOpcodeName()) == opCounter.end()) {
          opCounter[i->getOpcodeName()] = 1;
        } else {
          opCounter[i->getOpcodeName()] += 1;
        }
      }
    }
    std::map<std::string, int>::iterator i = opCounter.begin();
    std::map<std::string, int>::iterator e = opCounter.end();
    while (i != e) {
      errs() << i->first << ": " << i->second << "\n";
      i++;
    }
    errs() << "\n";
    opCounter.clear();
    return false;
  }
};

We will be recording the number of occurrences of each opcode in this map, which binds opcode names to integer numbers.

This code collects the opcodes. We will look into it more closely soon.

This code prints our results. It is a standard loop on an STL data structure. We use iterators to go over the map. Each element in a map is a pair, where the first element is the key, and the second is the value.

Page 66: Control Flow Graphs

Iterating Through Functions, Blocks and Insts

for (Function::iterator bb = F.begin(), e = F.end(); bb != e; ++bb) {
  for (BasicBlock::iterator i = bb->begin(), e = bb->end(); i != e; ++i) {
    if (opCounter.find(i->getOpcodeName()) == opCounter.end()) {
      opCounter[i->getOpcodeName()] = 1;
    } else {
      opCounter[i->getOpcodeName()] += 1;
    }
  }
}

We go over LLVM data structures through iterators.

• An iterator over a Module gives us a list of Functions.

• An iterator over a Function gives us a list of basic blocks.

• An iterator over a Block gives us a list of instructions.

• And we can iterate over the operands of the instruction too.

for (User::op_iterator O = I.op_begin(), E = I.op_end(); O != E; ++O);

for (Module::iterator F = M.begin(), E = M.end(); F != E; ++F);

Page 67: Control Flow Graphs

Compiling the Pass

• To compile the pass, we can follow these two steps:

1. Generally we save the pass into /llvm/lib/Transforms/DirectoryName, where DirectoryName can be, for instance, CountOp.

2. We build a Makefile for the project. If we invoke the LLVM standard Makefile, we save some time.

# Path to top level of LLVM hierarchy
LEVEL = ../../..

# Name of the library to build
LIBRARYNAME = CountOp

# Make the shared library become a
# loadable module so the tools can
# dlopen/dlsym on the resulting library.
LOADABLE_MODULE = 1

# Include the makefile implementation
include $(LEVEL)/Makefile.common

Page 68: Control Flow Graphs

Running the Pass

• Our pass is now a shared library, in llvm/Debug/lib, if we have compiled our LLVM distribution with the -Debug directive, or in llvm/Release/lib, if we have not.

• We can invoke it using the opt tool:

$> clang -c -emit-llvm file.c -o file.bc

$> opt -load CountOp.dylib -opCounter -disable-output file.bc

• Remember, if we are running on Linux, then our shared library has the extension ".so", instead of ".dylib", as in the Mac OS.

Just to avoid printing the resulting binary bytecode file

Page 69: Control Flow Graphs

LLVM Provides an Intermediate Representation

• LLVM represents programs, internally, via its own instruction set.
– The LLVM optimizations manipulate these bytecodes.
– We can program directly on them.
– We can also interpret them.

int callee(const int* X) {
  return *X + 1;
}

int main() {
  int T = 4;
  return callee(&T);
}

$> clang -c -emit-llvm f.c -o f.bc

$> opt -mem2reg f.bc -o f.bc

$> llvm-dis f.bc

$> cat f.ll

; Function Attrs: nounwind ssp
define i32 @callee(i32* %X) #0 {
entry:
  %0 = load i32* %X, align 4
  %add = add nsw i32 %0, 1
  ret i32 %add
}

♤: Example taken from the slides of Gennady Pekhimenko "The LLVM Compiler Framework and Infrastructure"

Page 70: Control Flow Graphs

LLVM Bytecodes are Interpretable

• Bytecode is a form of instruction set designed for efficient execution by a software interpreter.
– They are portable!
– Example: Java bytecodes.
• The tool lli directly executes programs in LLVM bitcode format.
– lli may compile these bytecodes just-in-time, if a JIT is available.

$> echo "int main() {printf(\"Oi\n\");}" > t.c

$> clang -c -emit-llvm t.c -o t.bc

$> lli t.bc

Page 71: Control Flow Graphs

What Does the LLVM IR Look Like?

• RISC instruction set, with typical opcodes
– add, mul, or, shift, branch, load, store, etc.
• Typed representation.
• Static Single Assignment format
– That is something smart, which we will see later in the course.
• Control flow is represented explicitly.

%0 = load i32* %X, align 4
%add = add nsw i32 %0, 1
ret i32 %add

switch i32 %0, label %sw.default [
  i32 1, label %sw.bb
  i32 2, label %sw.bb1
  i32 3, label %sw.bb2
  i32 4, label %sw.bb3
  i32 5, label %sw.bb4
]

switch (argc) {
  case 1: x = 2;
  case 2: x = 3;
  case 3: x = 5;
  case 4: x = 7;
  case 5: x = 11;
  default: x = 1;
}

This is C

This is LLVM

Page 72: Control Flow Graphs

We can program directly on the IR♧

♧: although this is not something for the faint of heart.

This is C:

int callee(const int* X) {
  return *X + 1;
}

int main() {
  int T = 4;
  return callee(&T);
}

This is LLVM:

; Function Attrs: nounwind ssp
define i32 @callee(i32* %X) #0 {
entry:
  %tmp = load i32* %X, align 4
  %add = add nsw i32 %tmp, 1
  ret i32 %add
}

; Function Attrs: nounwind ssp
define i32 @main() #0 {
entry:
  %T = alloca i32, align 4
  store i32 4, i32* %T, align 4
  %call = call i32 @callee(i32* %T)
  ret i32 %call
}

$> clang -c -emit-llvm ex0.c -o ex0.bc

$> opt -mem2reg -instnamer ex0.bc -o ex0.bc

$> llvm-dis < ex0.bc

Which opts could we apply on this code?

Page 73: Control Flow Graphs

Hacking the Bytecode File

This is the original bytecode:

; Function Attrs: nounwind ssp
define i32 @callee(i32* %X) #0 {
entry:
  %tmp = load i32* %X, align 4
  %add = add nsw i32 %tmp, 1
  ret i32 %add
}

; Function Attrs: nounwind ssp
define i32 @main() #0 {
entry:
  %T = alloca i32, align 4
  store i32 4, i32* %T, align 4
  %call = call i32 @callee(i32* %T)
  ret i32 %call
}

This is the optimized bytecode:

; Function Attrs: nounwind ssp
define i32 @callee(i32 %X) #0 {
entry:
  %add = add nsw i32 %X, 1
  ret i32 %add
}

; Function Attrs: nounwind ssp
define i32 @main() #0 {
entry:
  %call = call i32 @callee(i32 4)
  ret i32 %call
}

Can you point all the differences between the files?

$> clang ex0.hack.ll
$> ./a.out
$> echo $?
5

We can compile and execute the bytecode file:

Page 74: Control Flow Graphs

Understanding our Hand Optimization

We did, by hand, some sort of scalarization, i.e., we are replacing pointers with scalars. Scalars are, in compiler jargon, the variables whose values we can keep in registers.

; Function Attrs: nounwind ssp
define i32 @callee(i32* %X) #0 {
entry:
  %tmp = load i32* %X, align 4
  %add = add nsw i32 %tmp, 1
  ret i32 %add
}

; Function Attrs: nounwind ssp
define i32 @main() #0 {
entry:
  %call = call i32 @callee(i32 4)
  ret i32 %call
}

int callee(const int* X) {
  return *X + 1;
}

int main() {
  int T = 4;
  return callee(&T);
}

int callee(int X) {
  return X + 1;
}

int main() {
  int T = 4;
  return callee(T);
}

; Function Attrs: nounwind ssp
define i32 @main() #0 {
entry:
  %T = alloca i32, align 4
  store i32 4, i32* %T, align 4
  %call = call i32 @callee(i32* %T)
  ret i32 %call
}

; Function Attrs: nounwind ssp
define i32 @callee(i32 %X) #0 {
entry:
  %add = add nsw i32 %X, 1
  ret i32 %add
}

Page 75: Control Flow Graphs

Generating Machine Code

• Once we have optimized the intermediate program, we can translate it to machine code.

• In LLVM, we use the llc tool to perform this translation. This tool is able to target many different architectures.

$> llc --version

Registered Targets:
  alpha    - Alpha [experimental]
  arm      - ARM
  bfin     - Analog Devices Blackfin
  c        - C backend
  cellspu  - STI CBEA Cell SPU
  cpp      - C++ backend
  mblaze   - MBlaze
  mips     - Mips
  mips64   - Mips64 [experimental]
  mips64el - Mips64el [experimental]
  mipsel   - Mipsel
  msp430   - MSP430 [experimental]
  ppc32    - PowerPC 32
  ppc64    - PowerPC 64
  ptx32    - PTX (32-bit) [Experimental]
  ptx64    - PTX (64-bit) [Experimental]
  sparc    - Sparc
  sparcv9  - Sparc V9
  systemz  - SystemZ
  thumb    - Thumb
  x86      - 32-bit X86: Pentium-Pro
  x86-64   - 64-bit X86: EM64T and AMD64
  xcore    - XCore

Page 76: Control Flow Graphs

Generating Machine Code

$> clang -c -emit-llvm identity.c -o identity.bc

$> opt -mem2reg identity.bc -o identity.opt.bc

$> llc -march=x86 identity.opt.bc -o identity.x86

	.globl	_identity
	.align	4, 0x90
_identity:
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	xorl	%eax, %eax
	movl	20(%esp), %ecx
	movl	16(%esp), %edx
	movl	%eax, %esi
	jmp	LBB1_1
	.align	4, 0x90
LBB1_3:
	movl	(%edx,%esi,4), %ebx
	movl	$0, (%ebx,%edi,4)
	incl	%edi
LBB1_2:
	cmpl	%ecx, %edi
	jl	LBB1_3
	incl	%esi
LBB1_1:
	cmpl	%ecx, %esi
	movl	%eax, %edi
	jl	LBB1_2
	jmp	LBB1_5
LBB1_6:
	movl	(%edx,%eax,4), %esi
	movl	$1, (%esi,%eax,4)
	incl	%eax
LBB1_5:
	cmpl	%ecx, %eax
	jl	LBB1_6
	popl	%esi
	popl	%edi
	popl	%ebx
	ret

• Once we have optimized the intermediate program, we can translate it to machine code.

• In LLVM, we use the llc tool to perform this translation. This tool is able to target many different architectures.

Page 77: Control Flow Graphs

A Bit of History

• Some of the first code optimizations were implemented only locally

• Fortran was one of the first programming languages to be optimized by a compiler

• The LLVM infrastructure was implemented by Chris Lattner during his PhD at UIUC

• Kam, J. B. and J. D. Ullman, "Monotone Data Flow Analysis Frameworks", Acta Informatica 7:3 (1977), pp. 305-318

• Kildall, G. "A Unified Approach to Global Program Optimizations", ACM Symposium on Principles of Programming Languages (1973), pp. 194-206

• Lattner, C., and Adve, V., "LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation", CGO, pp. 75-88 (2004)