In this section we try to describe a few of the many transformations that are
applied to a C program to convert it to CIL. The module that implements this
conversion is about 5000 lines of OCaml code. In contrast a simple program
transformation that instruments all functions to keep a shadow stack of the
true return address (thus preventing stack smashing) is only 70 lines of code.
This example shows that the analysis is so much simpler because it has to
handle only a few simple C constructs and also because it can leverage on CIL
infrastructure such as visitors and pretty-printers.
In no particular order these are a few of the most significant ways in which
C programs are compiled into CIL:
-
CIL will eliminate all declarations for unused entities. This means that
just because your hello world program includes stdio.h it does not mean
that your analysis has to handle all the ugly stuff from stdio.h.
- Type specifiers are interpreted and normalized:
int long signed x;
signed long extern x;
long static int long y;
// Some code that uses these declaration, so that CIL does not remove them
int main() { return x + y; }
See the CIL output for this
code fragment - Anonymous structure and union declarations are given a name.
struct { int x; } s;
See the CIL output for this
code fragment - Nested structure tag definitions are pulled apart. This means that all
structure tag definitions can be found by a simple scan of the globals.
struct foo {
struct bar {
union baz {
int x1;
double x2;
} u1;
int y;
} s1;
int z;
} f;
See the CIL output for this
code fragment
- All structure, union, enumeration definitions and the type definitions
from inners scopes are moved to global scope (with appropriate renaming). This
facilitates moving around of the references to these entities.
int main() {
struct foo {
int x; } foo;
{
struct foo {
double d;
};
return foo.x;
}
}
See the CIL output for this
code fragment
- Prototypes are added for those functions that are called before being
defined. Furthermore, if a prototype exists but does not specify the type of
parameters that is fixed. But CIL will not be able to add prototypes for those
functions that are neither declared nor defined (but are used!).
int f(); // Prototype without arguments
int f(double x) {
return g(x);
}
int g(double x) {
return x;
}
See the CIL output for this
code fragment - Array lengths are computed based on the initializers or by constant
folding.
int a1[] = {1,2,3};
int a2[sizeof(int) >= 4 ? 8 : 16];
See the CIL output for this
code fragment - Enumeration tags are computed using constant folding:
int main() {
enum {
FIVE = 5,
SIX, SEVEN,
FOUR = FIVE - 1,
EIGHT = sizeof(double)
} x = FIVE;
return x;
}
See the CIL output for this
code fragment - Initializers are normalized to include specific initialization for the
missing elements:
int a1[5] = {1,2,3};
struct foo { int x, y; } s1 = { 4 };
See the CIL output for this
code fragment - Initializer designators are interpreted and eliminated. Subobjects are
properly marked with braces. CIL implements
the whole ISO C99 specification for initializer (neither GCC nor MSVC do) and
a few GCC extensions.
struct foo {
int x, y;
int a[5];
struct inner {
int z;
} inner;
} s = { 0, .inner.z = 3, .a[1 ... 2] = 5, 4, y : 8 };
See the CIL output for this
code fragment - String initializers for arrays of characters are processed
char foo[] = "foo plus bar";
See the CIL output for this
code fragment
- String constants are concatenated
char *foo = "foo " " plus " " bar ";
See the CIL output for this
code fragment
- Initializers for local variables are turned into assignments. This is in
order to separate completely the declarative part of a function body from the
statements. This has the unfortunate effect that we have to drop the const
qualifier from local variables !
int x = 5;
struct foo { int f1, f2; } a [] = {1, 2, 3, 4, 5 };
See the CIL output for this
code fragment
- Local variables in inner scopes are pulled to function scope (with
appropriate renaming). Local scopes thus disappear. This makes it easy to find
and operate on all local variables in a function.
int x = 5;
int main() {
int x = 6;
{
int x = 7;
return x;
}
return x;
}
See the CIL output for this
code fragment
- Global declarations in local scopes are moved to global scope:
int x = 5;
int main() {
int x = 6;
{
static int x = 7;
return x;
}
return x;
}
See the CIL output for this
code fragment - Return statements are added for functions that are missing them. If the
return type is not a base type then a return without a value is added.
The guaranteed presence of return statements makes it easy to implement a
transformation that inserts some code to be executed immediately before
returning from a function.
int foo() {
int x = 5;
}
See the CIL output for this
code fragment - One of the most significant transformations is that expressions that
contain side-effects are separated into statements.
int x, f(int);
return (x ++ + f(x));
See the CIL output for this
code fragment
Internally, the x ++ statement is turned into an assignment which the
pretty-printer prints like the original. CIL has only three forms of basic
statements: assignments, function calls and inline assembly.
- Shortcut evaluation of boolean expressions and the ?: operator are
compiled into explicit conditionals:
int x;
int y = x ? 2 : 4;
int z = x || y;
// Here we duplicate the return statement
if(x && y) { return 0; } else { return 1; }
// To avoid excessive duplication, CIL uses goto's for
// statement that have more than 5 instructions
if(x && y || z) { x ++; y ++; z ++; x ++; y ++; return z; }
See the CIL output for this
code fragment - GCC’s conditional expression with missing operands are also compiled
into conditionals:
int f();;
return f() ? : 4;
See the CIL output for this
code fragment - All forms of loops (while, for and do) are compiled
internally as a single while(1) looping construct with explicit break
statement for termination. For simple while loops the pretty printer is
able to print back the original:
int x, y;
for(int i = 0; i<5; i++) {
if(i == 5) continue;
if(i == 4) break;
i += 2;
}
while(x < 5) {
if(x == 3) continue;
x ++;
}
See the CIL output for this
code fragment - GCC’s block expressions are compiled away. (That’s right there is an
infinite loop in this code.)
int x = 5, y = x;
int z = ({ x++; L: y -= x; y;});
return ({ goto L; 0; });
See the CIL output for this
code fragment
- CIL contains support for both MSVC and GCC inline assembly (both in one
internal construct)
- CIL compiles away the GCC extension that allows many kinds of constructs
to be used as lvalues:
int x, y, z;
return &(x ? y : z) - & (x ++, x);
See the CIL output for this
code fragment
- All types are computed and explicit casts are inserted for all
promotions and conversions that a compiler must insert:
- CIL will turn old-style function definition (without prototype) into
new-style definitions. This will make the compiler less forgiving when
checking function calls, and will catch for example cases when a function is
called with too few arguments. This happens in old-style code for the purpose
of implementing variable argument functions.
- Since CIL sees the source after preprocessing the code after CIL does
not contain the comments and the preprocessing directives.
- CIL will remove from the source file those type declarations, local
variables and inline functions that are not used in the file. This means that
your analysis does not have to see all the ugly stuff that comes from the
header files:
#include <stdio.h>
typedef int unused_type;
static char unused_static (void) { return 0; }
int main() {
int unused_local;
printf("Hello world\n"); // Only printf will be kept from stdio.h
}
See the CIL output for this
code fragment