CS8395 — Homework 2

Ghidra and LLVM

This assignment is due March 27, 2023 at 11:59PM Central time.


HW1 had you use a sandbox to exploit a buffer overflow vulnerability and to construct a payload using MetaSploit. In this homework, we will use Ghidra, a tool created by the NSA for analyzing binaries, and LLVM, a compiler framework, to analyze and transform programs. Both are quite common tools used in the security community to conduct a variety of analyses.

While you should read this spec in its entirety, you can find the specific items to turn in by searching for "Turn-in".

Turn-in for HW2

Read this assignment document and complete the tasks described. The final deliverable is a zip file containing your written PDF report, your modified crackme (if you patched it), and a copy of your InstrumentFunctions.cpp LLVM pass. There is no length requirement or limit for the report, but I expect this will take 5 pages or fewer (depending on the size of your screenshots where appropriate).

Ensure that your name, VUNetID, and email address appear at the top of each of page.

Ensure that you have two sections to your submission labeled: Task 1 and Task 2.

I strongly recommend the use of LaTeX to complete this (and all) assignments. LaTeX is the lingua franca for scientific communication in computer science — every peer-reviewed publication I have submitted and reviewed has been written in LaTeX. You can use overleaf.com to help you write LaTeX documents. I have also created a template that you can copy, available here.

Starter Package

You can find the starter package at kjl.name/cs8395/hw2-baked.zip. You will find a file called crackme contained in the part-1 directory, and starter code in the part-2 directory.

Task 1: Crackme with Ghidra

A Crackme is a toy program that contains a small puzzle to solve. They are training tools for reverse engineering, a critical part of security. "Cracking" the crackme often consists of determining a "flag" or special value that causes the crackme to behave in some desirable way. In this assignment, you will download a crackme and use Ghidra to help determine or circumvent two value checks.

You can download the Crackme for this assignment in the starter package. This is a stripped 32-bit Linux ELF binary. You can run the program in the terminal with ./crackme password1 password2.

Example execution and output of crackme.

Your goal is to either (1) find the correct values of password1 and password2 or (2) to patch the program so that it produces the desired output:

Desired output of crackme.

Using Ghidra

These instructions assume you are continuing to use your Kali VM from HW1. If you are not using a Kali VM, Ghidra works across platforms, but you may have a slightly different version or process for getting it running on your own system.

Ghidra is a powerful binary analysis tool produced by the NSA. It is open source and free to download. If you are using your VM from HW1, you can install it from the command line:

sudo apt install ghidra

When you have installed Ghidra, you can start it by running ghidra on the command line. When it opens, it will prompt you about projects. You can create a new empty project (you can call it hw2 or whatever you like), then click File and click Import File..., then navigate to the HW2 Crackme file provided above. When you open the file, it will prompt you with information it determines about the crackme:

Import File window

Press OK, then double click the file to open it. It will prompt you to analyze the file; do so with the default options.

File analysis

Once the analysis is complete, you are shown a window with a large amount of assembly code in it. This is the main result of lifting the crackme binary to assembly. The analysis Ghidra conducts makes several pieces of information available for helping to understand how the program works. First, on the left side of the window, you can use the Symbol Tree to navigate to the start of the main function for this program.

Finding main

Within this view, you can navigate the assembly and decompiled code on the right to see how the program executes. Note that I don't want to give away details of the inner workings of the program (indeed, that is your goal in this HW). Instead, I will provide several hints to help you crack this crackme.

Turn-in for Task 1: Ghidra Crackme

For this task, you must document your experience using Ghidra to crack the input program. Turn in the following:

  • Document that you are able to run the crackme program. Show the output of the program before you crack it (i.e., the "Incorrect" 0/2 output). A screenshot suffices.
  • Since there are several ways to crack the program, document that you can run the program with the correct output shown (i.e., the "Correct" 2/2 output). A screenshot suffices.
  • In addition, you must describe how you cracked the program. If you found values for the passwords, show their values.
  • If you patched the program, provide a copy of your cracked program in your submitted zip file.

Task 2: LLVM and program analysis

For this task, you will use LLVM to analyze and instrument a given program to manually implement stack canaries. You will do this by creating an LLVM Pass to transform an input program so that each function contains additional code that implements the stack canary.

You are implementing a temperamental stack canary protector in this assignment. However, it is not a complete implementation and this should not be used to protect critical software. Indeed, part of this assignment is writing why this solution is insufficient in practice.

Program Instrumentation and Modification

When you write software, you generally rely on a tool to build, compile, or interpret it. gcc or clang for C/C++, javac for Java, python for Python. These language processing tools automatically convert your source code to an executable form that the CPU is happy with. Suppose that, over time, we figure out what different vulnerabilities look like in source code. Rather than manually updating all of our projects, can we automatically transform our programs to contain fewer vulnerabilities?

We can build programs that analyze and transform programs for us. This is often referred to as static analysis — the systematic examination of programs to discover properties that are important. For example, we might use a static analysis to determine if it's ever possible that a particular buffer is transmitted over the network (see Information Flow; you could check if variable x contains data that ever reaches a function call, send()). Static analysis is used for a variety of powerful security analyses. In this assignment, you will build a program transformation tool that hardens input programs against stack overflow attacks.

LLVM Primer

LLVM is a set of tools that lets us do these automated program analyses (it used to stand for "Low-level virtual machine," but this is a terribly outdated name, so we just call it LLVM). It is an extremely powerful, widely-used compiler framework (indeed, if you've used clang before, LLVM is its parent project; the Clang front-end uses LLVM IR under the hood). LLVM allows you to work with, analyze, and transform programs during the compilation process. While it has historically been used to implement compiler optimizations (e.g., to make programs faster or smaller), people use LLVM for a variety of creative analyses.

LLVM and Intermediate Representations

At its core, LLVM is part of a compiler toolchain (clang) that produces a robust "Intermediate Representation" (IR). In Compiler Theory, an IR is a representation of a program that permits some analysis that is helpful for compilation. For example, a parser produces an Abstract Syntax Tree of the program to verify that a given source code file contains syntactically valid source code.

We frequently work with an IR called a "Control Flow Graph" (CFG). You saw CFGs in part 1 of HW2. This is a graph structure that represents how the CPU flows through instructions in the program. For example, consider the function below:

The foo function would have a CFG that looks like:

        if (x < 5)
     /            \
   true          false
x = x + 1       x = x - 1
     \            /

Each function in a program can be represented as a CFG. Indeed, LLVM gives us access to Basic Blocks (nodes) in the CFG. Both Abstract Syntax Trees and Control Flow Graphs are examples of Intermediate Representations used by compiler toolchains like LLVM.

For this assignment, you will use LLVM to make a pass over the instructions contained in a program's CFG to create and enforce stack canaries.

Stack Canaries

Recall from lecture we discussed how stack overflow attacks work. An attacker controls a buffer in which they place malicious code to run. The attacker overwrites a saved return address that is placed on the stack when a function is first called. One way of detecting stack overflow attacks is through the use of "stack canaries." Loosely, we place a known random value on the stack at the beginning of a function. If an attacker smashes the stack, they will overwrite that stack canary. Thus, we can check whether the canary has been changed right before the function returns. Unless the attacker knows the exact value we chose as the canary, we can tell if that value has been overwritten.

In this task, your job is to write a tool that can automatically transform a given C program so that it contains code to generate and check stack canaries. You will use LLVM to produce this tool.

Intuition for Task 2

Suppose we have a program that contains a stack overflow vulnerability. We could potentially protect the program by detecting if the stack has been corrupted. In particular, if we place a magic value on the stack, then check if that value changed later, we can see if an attacker overwrote parts of the stack they were not supposed to. Consider a "before" and "after" version of such a program:

Example for stack canaries

Getting Started

In LLVM, you can create "passes" that run over a given input program. A pass takes in a program as input, and produces a program as output. You will create a pass that, when applied to a given program, transforms all of the functions in that program to contain code that creates and checks stack canaries. The resulting transformed program can be compiled and executed like any other.

First, you must install LLVM. For this assignment, please use LLVM 13. On Kali, you can use

sudo apt-get install llvm-13

After installing it, there's a whole lot of code placed under /usr/lib/llvm-13/include/llvm.

If you ever need clarification on how to invoke a particular LLVM method, you can often grep through that directory. For example,

grep -r 'ReturnInst' /usr/lib/llvm-13/include/llvm

Building the starter LLVM Pass

First, ensure you have llvm 13 installed

sudo apt-get install llvm-13

You will also need to set the LLVM_DIR environment variable to point to LLVM's location.

export LLVM_DIR=/usr/lib/llvm-13

(you can add this to your ~/.bashrc file to ensure that gets set every time you start your VM).

In this package, there is starter code under the "pass" directory. To build the LLVM pass, we use cmake. If you are working in the part-2 directory:

cd pass
mkdir build
cd build

If the build is successful, it produces a file called pass/build/lib/libInstrumentFunctions.so. The .so file is your LLVM pass module. We use it to automatically transform an input file.

To use the pass against an input file, we first create LLVM IR of the subject program (i.e., the program to be transformed) by using clang:

clang -m32 -emit-llvm -S -fno-stack-protector -static  subject.c  -o subject.ll

This produces a file called subject.ll that contains the raw LLVM IR. This is similar to assembly code, but it is in a particular format that can be readily analyzed by LLVM passes and related tools.

Once you have the LLVM IR, you can run the pass against it:

$LLVM_DIR/bin/opt -load-pass-plugin ./pass/build/lib/libInstrumentFunctions.so --passes="cs8395-hw" subject.ll -S -o instrumented.ll
Alternatively, there is also a helper script to run the pass once you have it built:
./plug.sh subject.ll instrumented.ll

This produces a file that is transformed according to the pass tool you generate. Once you have the instrumented LLVM IR, you can compile it:

$LLVM_DIR/bin/llc -m32 instrumented.ll
clang -m32 -fno-stack-protector -static instrumented.s -o instrumented.o

You use llc to lower the LLVM IR to raw x86 assembly. Then, we compile it with clang to produce a binary (you can likely use gcc as well). Finally, instrumented is a normal program you can run directly:


There is also an invoke.sh script you can use (it's slightly better than the one for HW1).

./invoke.sh ./instrumented.o input_file
Understanding the starter code

The starter code contains a functioning LLVM pass that instruments the program to print out every function the program executes. For example, consider the program below:

By default, this program just prints out 13. However, after running the starter LLVM transformation, the program outputs:

In main
In foo
In bar
In foo

That is, we have instrumented the program so that, when it runs, it prints out each function that is executed. We have used a static analysis to change the behavior of the program once it is compiled. Your task is to implement an LLVM transformation pass that uses and enforces a stack canary in each function.

Requirements for Task 2

You must modify only the file ./pass/lib/InstrumentFunctions.cpp to implement the stack canary protection.

The starter code instruments each function in the subject program with a call to printf that makes the program print the name of each function that executes. In addition, the starter code also creates a (currently unused) stack canary variable in each function. You must make your LLVM pass emit instructions that (1) load the existing canary from memory, and (2) confirm that the canary has not changed, all before the function returns.

At a minimum, your code must transform the subject.c program such that:

Notably, you don't have to do anything fancier — ordinarily, you might expect the program to abort (or similar). However, to make the implementation more straightforward, you only have to detect and print when the canary was overwritten with a different value.

More concretely, you need LLVM code that, for each function, creates new "if-then" logic before the function returns. If the canary is changed, print the name of the function. Otherwise, the canary is unchanged and execution proceeds as normal. Example outputs below.
Example outputs pre- and post-transformation
  1. Original, uninstrumented program with malicious payload that shells out to echo Hello, world:

  2. Instrumented program, running with a clean, non-malicious input:

  3. Instrumented program, running with a malicious input:

Hints and Commentary

The way LLVM works is by considering "instructions" in the IR. Instructions in LLVM can be things like add, subtract, compare, jump, branch, call, return, etc. You can see how the starter code emits an instruction to allocate space for the new canary, then stores a randomly-generated value in that new space.

I will note that LLVM has a bit of a learning curve. There are tons of instructions, methods, and types available throughout its codebase. My suggestion is to focus on the following keywords (documentation available via google):

You can likely start at return instructions and work backwards to add code that checks if the canary has been changed. The canary is already created for you by the starter code in each function — you just need to complete the implementation by adding the conditional check before the function returns.

At the top of each function, the starter code uses an IRBuilder to create (1) an AllocaInst and (2) a StoreIntr. In LLVM IR, programs allocate memory (like on the stack) by using an "Allocate" instruction. In particular, we build a 32-bit Integer and allocate it on the stack. Then, we store a randomly-initialized value in that newly-allocated space. You can see in the code how these LLVM instructions get placed into the existing control flow graph of the function (there are helper methods from the IRBuilder class that make it more straightforward).

In addition, the starter code also places a printf call in the function which, at runtime, tells you the name of the function that is currently executed. Since the requirement is that you print out the name of the function whose stack is smashed, you can adapt this code. The difference is that we only want the print to execute if the canary was changed. Given what you can see about how LLVM instructions are placed, how would you change the printf-related instructions so that they are guarded by an if-then check before the function returns?

Turn-in for Task 2: LLVM Stack Canary Protector

You must implement manual stack canary protection with the LLVM pass as described above. In your submission, please include:

  • A copy of your modified InstrumentFunctions.cpp file that implements your pass.
  • In your writeup, address the following:
    • Demonstrate that your LLVM pass can successfully transform the subject.c program such that it prints out the name of the function whose stack is smashed during execution. Screenshots of the program before/after applying your transformation are acceptable.
    • Describe how you implemented your solution. What resources did you use? What parts of LLVM did you use? How did you decide what to do for each step (e.g., loading the canary, comparing with the original value, etc.)?
    • What are the limitations of this approach to implementing stack canaries? What could go wrong with this approach? How would you compromise it? What assumptions would you have to make about an attacker who would be able to compromise this technique?
    • Now that you have an idea of the types of transformations LLVM is capable of performing, what other security analyses do you think are possible or useful?

What to turn in for HW2

you must submit a single .zip file called vunetid.zip. While you can work with others in the class conceptually, please submit your own copy of the assignment. Your zip file must contain:

Use the submission system (VU login required) to submit your zip file.