C-Like Compiler

For Cornell's compilers class, I worked with two teammates to design and implement a compiler for a procedural C-like language, called Eta. Here's a diagram of what our compiler looked like:

alt text

Eta programs look like this:

use io

sort(a: int[], len:int): int[] {
  i:int = 0
  n:int = len
  while (i < n) {
    j:int = i
    while (j > 0) {
      if (a[j-1] > a[j]) {
        swap:int = a[j]
        a[j] = a[j-1]
        a[j-1] = swap
      }
      j = j-1
    }
    i = i+1
  }

  return a
}

main(args: int[][]) {
  data:int[] = {...}
  data = sort(data, 26)
  print(data)
}

The first stage of our compiler, the lexer, breaks the program up into lexical tokens. For example, a program that looks like this:

use io

main(args: int[][]) {
  x:int = 4;
  y:bool = true;
}

would produce this data:

1:1 use
1:5 id io
3:1 id main
3:5 (
3:6 id args
3:10 :
3:12 int
3:15 [
3:16 ]
3:17 [
3:18 ]
3:19 )
3:21 {
4:3 id x
4:4 :
4:5 int
4:9 =
4:11 integer 4
4:12 ;
5:3 id y
5:4 :
5:5 bool
5:10 =
5:12 true
5:16 ;
6:1 }

which specifies the location of each token in the source file.

The next stage, the parser, takes the output of the lexer and forms an abstract syntax tree (AST). The AST for the above program looks like this:

(
 (
  (use io)
 )
 (
  (main((x int))(int)
   (
    (= (x int) 4)
    (= (y bool) true)
   )
  )
 )
)

At this point, we know the syntactic form of the program. So, we can raise errors because of things like syntax errors.

The next stage is the type checker, which traverses over the AST. When there are type errors, it raises an error. For example, an assignment like y:int = true would raise a type error because y is declared to have type int but a value of type bool gets assigned to it.

Once we have verified the program free of syntax and type errors after parsing and type checking, we generate our intermediate representation, which is a lowered representation of the program that is closer to the level of assembly code. This is also where we perform our optimizations, like register allocation and common subexpression elimination.

Finally, we generate assembly code from the intermediate representation. The intermediate representation and assembly outputs get pretty lengthy for nontrivial programs, so I've omitted them here. But by the end of this pipeline, we've generated assembly that conforms to the System V ABI and can be linked and executed!