This set of notes covers CIS 4521 at UPenn, which is about compilers and interpreters. The goal of this course is to understand the inner workings of compilers and how they are built, as well as to write our own compiler from a high-level, type-safe language to x86.
Compilers
A compiler is a program that translates a source language to a target language. Typically, the source language is a higher-level language and the target language is a low-level one, such as assembly or machine code.
High-level (source) code is optimized for human readability and programming. It generally has the following properties:
- Expressive: matches human ideas of grammar/syntax
- Redundant: contains more info than needed to help catch errors
- Abstract: exact code does not fully determine instructions, but should determine high-level intent
Low-level code, on the other hand, is optimized for hardware. It is hard for people to read and understand, removes redundancy and ambiguity, and loses all information about programmer intent. Most assembly languages, for example, are practically unreadable.
Compilers let users translate high-level, readable languages into low-level, hardware-optimized instructions. Some languages are farther from machine code than others, like JavaScript compared to C. However, no matter the source language, a compiler generally has the same goals in translation:
- Source-level expressiveness for the task
- Low-level code optimization, e.g. removing dead code (see the sketch after this list)
- Reasonable translation efficiency, generally < $O(n^3)$
- Maintainable code
- Correctness, of course
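To make the dead code point concrete, here is a minimal OCaml sketch of such an optimization. The statement type and the pass are invented for illustration, not the course's actual AST or optimizer; the idea is simply that a branch guarded by a constant condition can be replaced by the side that actually runs.

```ocaml
(* A toy expression/statement language, invented for this example. *)
type exp =
  | Const of int
  | Var of string

type stmt =
  | Assign of string * exp
  | If of exp * stmt list * stmt list
  | Seq of stmt list

(* A branch guarded by a constant condition is partly dead code:
   only one side can ever run, so the other side can be dropped. *)
let rec eliminate (s : stmt) : stmt =
  match s with
  | Assign _ -> s
  | If (Const 0, _, else_br) -> Seq (List.map eliminate else_br)
  | If (Const _, then_br, _) -> Seq (List.map eliminate then_br)
  | If (cond, then_br, else_br) ->
      If (cond, List.map eliminate then_br, List.map eliminate else_br)
  | Seq stmts -> Seq (List.map eliminate stmts)

(* Example: "if 0 then x := 1 else x := 2" simplifies to "x := 2". *)
let _ = eliminate (If (Const 0, [Assign ("x", Const 1)], [Assign ("x", Const 2)]))
```

Real compilers discover constant conditions through earlier analyses (constant propagation, folding) rather than finding them literally in the source, but the cleanup step looks much like this.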
Programming languages are defined to describe computation precisely. Therefore, translation between them can also be described precisely, which means a compiler can be proven logically correct with respect to its source and target languages. There is a lot of ongoing research into proving compilers correct.
Compilers are clearly very complex programs. However, we can break them down into a series of program representations that get closer to the target language at each step. These are referred to as intermediate representations (IRs). Languages like Rust and C++ (via Clang) have compilers that compile down to an IR, specifically LLVM IR, which we will cover later. While the series of steps is not universal across compilers, a single compiler generally follows this format:
- Semantic Analysis: type checking, error checking, syntax checking
- Optimization: dead code elimination, common subexpression elimination, function inlining, register allocation
- Code Generation: target language instruction selection (sketched below)
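As a small illustration of the code generation phase, here is a hedged OCaml sketch of instruction selection for arithmetic expressions. Both the expression type and the "instruction" type are toy inventions (far simpler than real x86), and the stack-machine strategy is just one easy way to make the translation total.

```ocaml
(* Toy source expressions and toy target instructions, both invented
   for illustration and much simpler than real x86. *)
type exp =
  | Const of int
  | Add of exp * exp
  | Mul of exp * exp

type ins =
  | Pushi of int   (* push an immediate onto the stack *)
  | AddTop         (* pop two values, push their sum *)
  | MulTop         (* pop two values, push their product *)

(* Instruction selection via a stack-machine strategy: emit code that
   leaves each subexpression's result on the stack, then combine. *)
let rec select (e : exp) : ins list =
  match e with
  | Const n -> [Pushi n]
  | Add (l, r) -> select l @ select r @ [AddTop]
  | Mul (l, r) -> select l @ select r @ [MulTop]

(* (1 + 2) * 3 becomes: Pushi 1; Pushi 2; AddTop; Pushi 3; MulTop *)
let _ = select (Mul (Add (Const 1, Const 2), Const 3))
```

A production code generator would instead pick register-based instructions and rely on register allocation, but the shape of the problem, mapping tree-structured expressions to linear instruction sequences, is the same.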
The compiler taken as a whole can be much more flexible. It can have many stages, but here are some typical ones, in order:
- Lexical Analysis: Outputs a token stream (see the lexer sketch after this list)
- Parsing: Outputs an abstract syntax structure
- Disambiguation: Edits the abstract syntax structure
- Semantic analysis: Annotates the abstract syntax structure
- Translation: Outputs IR code
- Control Flow Analysis: Constructs a control flow graph
- Data Flow Analysis: Constructs an interference graph
- Register Allocation: Assembly code generation prep
- Code Emission: Outputs the final target code
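To give a feel for the first stage, here is a minimal OCaml sketch of lexical analysis over an invented arithmetic syntax. It is a hand-rolled scanner; real front ends often use lexer generators and more careful error handling, but the output is the same kind of thing: a token stream for the parser to turn into an abstract syntax structure.

```ocaml
(* Toy tokens for an invented arithmetic language. *)
type token =
  | INT of int
  | PLUS
  | STAR
  | LPAREN
  | RPAREN

(* A very small hand-written lexer: walk the characters, skip whitespace,
   group digit runs into INT tokens, and map punctuation to tokens. *)
let lex (src : string) : token list =
  let n = String.length src in
  let rec go i acc =
    if i >= n then List.rev acc
    else match src.[i] with
      | ' ' | '\t' | '\n' -> go (i + 1) acc
      | '+' -> go (i + 1) (PLUS :: acc)
      | '*' -> go (i + 1) (STAR :: acc)
      | '(' -> go (i + 1) (LPAREN :: acc)
      | ')' -> go (i + 1) (RPAREN :: acc)
      | '0' .. '9' ->
          (* consume the whole run of digits *)
          let j = ref i in
          while !j < n && src.[!j] >= '0' && src.[!j] <= '9' do incr j done;
          let value = int_of_string (String.sub src i (!j - i)) in
          go !j (INT value :: acc)
      | c -> failwith (Printf.sprintf "unexpected character %c" c)
  in
  go 0 []

(* "(1 + 2) * 3" lexes to [LPAREN; INT 1; PLUS; INT 2; RPAREN; STAR; INT 3] *)
let _ = lex "(1 + 2) * 3"
```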
Optimizations are done at many of these stages, and different source languages may require additional or different stages. Of course, assembly is not the end of the process.
Assembling and Linking
Once the compiler translates everything to assembly code, a program called the assembler converts it to machine code. This is a relatively straightforward process, as assembly is meant to be a low-level, human-readable representation of the machine’s ISA. The next tool is the linker, which links together multiple compiled and assembled machine code files so they can execute together and access one another’s capabilities. This is how things like libraries or packages are handled. Once that is finished, you end up with fully resolved executable machine code, which a loader then loads into memory, creating an executable image.
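As a rough mental model of the linker's core job, symbol resolution, here is a simplified OCaml sketch. The object-file representation and the check are invented for illustration; a real linker (e.g. GNU ld) also relocates code and patches addresses, which this ignores.

```ocaml
(* A highly simplified model of an object file: just the symbols it
   defines and the external symbols it references. *)
type obj = {
  name : string;
  defines : string list;
  references : string list;
}

(* "Linking" here just checks that every referenced symbol is defined by
   some object in the program; real linkers also patch in addresses. *)
let link (objs : obj list) : (unit, string) result =
  let defined = List.concat_map (fun o -> o.defines) objs in
  let unresolved =
    List.concat_map
      (fun o -> List.filter (fun s -> not (List.mem s defined)) o.references)
      objs
  in
  match unresolved with
  | [] -> Ok ()
  | s :: _ -> Error ("undefined symbol: " ^ s)

(* main.o calls printf, which libc.o provides, so this "links". *)
let _ = link [
  { name = "main.o"; defines = ["main"]; references = ["printf"] };
  { name = "libc.o"; defines = ["printf"]; references = [] };
]
```

Dropping the `libc.o` entry from the example would produce `Error "undefined symbol: printf"`, which is the conceptual analogue of the familiar "undefined reference" errors a real linker reports.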