Skip to content

CIS 4521 Notes 1.1

Published: at 08:39 PM

This set of notes covers CIS 4521 at UPenn, which is compilers and interpreters. The goal of this course is to understand the inner working of compilers and how they are made, as well as program our own compiler from a high-level, type-safe language to x86.

Table of Contents

Open Table of Contents

Compilers

A compiler is a program that translates a source language to a target language. Typically, the source language is a higher-level language, and the target language is a low-level language, typically assembly or machine code.

High-level (source) code is optimized for human readability and programming. It generally has the following properties:

Low-level code, on the other hand, is optimized for hardware. It is hard for people to read and understand, removes redundancy and ambiguity, and loses all information about programmer intent. Most assembly languages, for example, are practically unreadable.

Compilers play the role of allowing users to compile the high-level readable languages into low-level hardware-optimized instructions. Some languages are farther from machine code than others, like JavaScript compared to C. However, no matter the source language, a compiler primarily has the same goals in translation:

Programming languages are defined to describe computation precisely. Therefore, translation can be precisely described. This means that a compiler can be logically correct with respect to the source and target languages. There is a lot of ongoing research for proving compilers to be correct.

Compilers are clearly very complex programs. However, we can break them down into a series of program representations. These program representations get closer towards our target language at each step. These are referred to as intermediate representations (IR). Languages like Rust, C++, and Python all have compilers that compile down to an IR, specifically LLVM IR, which we will cover later. While the series of steps is not universal across compilers, a single compiler generally follows a format:

The entire compiler takes as a whole can be much more flexible. It can have many stages, but here are some typical ones, in order:

Optimizations are done at many of these stages, and different source languages may require more different stages. Of course, assembly is not the end of the process.

Assembling and Linking

Once the compiler translates everything to assembly code, a program called the assembler converts it to machine code. This is a relatively straightforward process, as assembly is meant to be a very low-level human-readable version based off the machine’s ISA. The next object is the linker, which links together multiple compiled and assembled machine code files to execute together, and be able to access one other’s capabilities. This is how things like libraries or packages are handled. Once that is finished, you end up with fully resolved executable machine code, that is then loaded onto the CPU’s cache with a loader, creating an executable image.