CS8395 — Homework 2

Ghidra and LLVM

This assignment is due March 27, 2023 at 11:59PM Central time.

Introduction

HW1 had you use a sandbox to exploit a buffer overflow vulnerability and to construct a payload using MetaSploit. In this homework, we will use Ghidra, a tool created by the NSA for analyzing binaries, and LLVM, a compiler framework, to analyze and transform programs. Both are quite common tools used in the security community to conduct a variety of analyses.

Ghidra is an example of a tool that lifts binary programs to a higher-level intermediate representation that makes it easier to analyze and understand programs. Since malware is usually distributed without source code (or constructed manually in assembly), lifting to an IR is often the first step in analysis and transformation.

Other examples of lifting tools are IDA Pro, McSema, Radare, and Angr. These tools can be integrated with scripts and plugins depending on the specific application.

LLVM is a poorly-named set of tools that can be used to automatically transform input programs. In this homework, you will use LLVM to automatically transform an input program to instrument it to create and enforce stack canaries.

While you should read this spec in its entirety, you can find the specific items to turn in by searching for "Turn-in".

Turn-in for HW2

Read this assignment document and complete the tasks described. The final deliverable is a zip file containing your written PDF report, your modified crackme (if you patched it), and a copy of your InstrumentFunctions.cpp LLVM pass. There is no length requirement or limit for the report, but I expect this will take 5 pages or fewer (depending on the size of your screenshots where appropriate).

Ensure that your name, VUNetID, and email address appear at the top of each of page.

Ensure that you have two sections to your submission labeled: Task 1 and Task 2.

I strongly recommend the use of LaTeX to complete this (and all) assignments. LaTeX is the lingua franca for scientific communication in computer science — every peer-reviewed publication I have submitted and reviewed has been written in LaTeX. You can use overleaf.com to help you write LaTeX documents. I have also created a template that you can copy, available here.

Starter Package

You can find the starter package at kjl.name/cs8395/hw2-baked.zip. You will find a file called crackme contained in the part-1 directory, and starter code in the part-2 directory.

Task 1: Crackme with Ghidra

A Crackme is a toy program that contains a small puzzle to solve. They are training tools for reverse engineering, a critical part of security. "Cracking" the crackme often consists of determining a "flag" or special value that causes the crackme to behave in some desirable way. In this assignment, you will download a crackme and use Ghidra to help determine or circumvent two value checks.

You can download the Crackme for this assignment in the starter package. This is a stripped 32-bit Linux ELF binary. You can run the program in the terminal with ./crackme password1 password2.

Example execution and output of crackme.

The output consists of three lines:

Correct or Incorrect, depending on whether the first password you provide is correct.
Correct or Incorrect, depending on whether the second password you provide is correct.
Score, which sums how many passwords you got correct.

Your goal is to either (1) find the correct values of password1 and password2 or (2) to patch the program so that it produces the desired output:

Desired output of crackme.

Using Ghidra

These instructions assume you are continuing to use your Kali VM from HW1. If you are not using a Kali VM, Ghidra works across platforms, but you may have a slightly different version or process for getting it running on your own system.

Ghidra is a powerful binary analysis tool produced by the NSA. It is open source and free to download. If you are using your VM from HW1, you can install it from the command line:

sudo apt install ghidra

When you have installed Ghidra, you can start it by running ghidra on the command line. When it opens, it will prompt you about projects. You can create a new empty project (you can call it hw2 or whatever you like), then click File and click Import File..., then navigate to the HW2 Crackme file provided above. When you open the file, it will prompt you with information it determines about the crackme:

Import File window

Press OK, then double click the file to open it. It will prompt you to analyze the file; do so with the default options.

File analysis

Once the analysis is complete, you are shown a window with a large amount of assembly code in it. This is the main result of lifting the crackme binary to assembly. The analysis Ghidra conducts makes several pieces of information available for helping to understand how the program works. First, on the left side of the window, you can use the Symbol Tree to navigate to the start of the main function for this program.

Finding main

Within this view, you can navigate the assembly and decompiled code on the right to see how the program executes. Note that I don't want to give away details of the inner workings of the program (indeed, that is your goal in this HW). Instead, I will provide several hints to help you crack this crackme.

In the decompiled code, you can see several pertinent pieces for determining whether the two command line inputs are correct. For example, at the very bottom of main, you can see the printf line that displays the score, which references local_24 (again, this assumes you are using the Kali VM; the offsets may be labeled differently if you're using a different setup). Thus, you can work backwards to find instructions that modify this location.
Tools like Ghidra often provide a "Cross reference" feature that lets you see places in the code that refer to the same location. For example, with local_24 described above, you can find it in the disassembly, right click it, go to References, and click "Show references to local_24":

You can also rename such symbols as you figure out what they mean (e.g., "local_24" might better be described as "score"; you don't have to do this, but it may help you keep track of things).
There are two passwords to discover or subvert in this assignment. Are there strings you can find throughout the binary itself to help determine one of the passwords? In the Tools menu, check out the Defined strings option to explore string literals embedded within the executable itself.
You can extract a control flow graph from each function. From the Graphs menu, click Code Flow and it will generate a graphical representation of the function you are currently analyzing. For main, it looks something like this:

(you'll have to zoom in on your Ghidra to see the details). This graph shows you the instructions that execute, and shows edges where the flow of execution may diverge. For example, in main, there appears to be a branch (specifically, a jz instruction) whose outcome determines whether the function y or z gets called. Note that such graphs are generated only for a single function — function calls do not produce edges but are instead treated as a single instruction.
Control flow can be tricky. The second password involves a bit more complexity for determining both whether it's correct, but also how the correctness is reported. Indeed, reporting the outcome of evaluating the second password may not involve a straightforward branch or jump instruction.
You can patch instructions to change what happens in the program. For example, if you find in a function that contains a branch, you can overwrite the branch instruction with a NOP (no operation) instruction to "skip over" the evaluation done before the branch. That is, you can force a particular path to execute if you carefully overwrite instructions. If you want to do so, right click an instruction, then click "Patch Instruction." When you patch an instruction, you must replace both the opcode and, where relevant, the operands. In addition, the x86 instruction set is variable length, which means that each instruction takes a different number of bytes to represents. For example, suppose we wanted to patch this JZ instruction with a NOP:

After patching with a NOP, we get:

Because the NOP is a single 0x90 byte, but the JZ was two bytes long, we have a leftover byte that also must be patched. Moreover, if you need to insert a longer instruction, you must be careful of overwriting too many bytes.
If you patch the program, you can have Ghidra output the new version by clicking File, then "Export Program...". You can save the executable, then run it in the terminal (you may have to use chmod u+x output_filename to get it to execute properly).

Turn-in for Task 1: Ghidra Crackme

For this task, you must document your experience using Ghidra to crack the input program. Turn in the following:

Document that you are able to run the crackme program. Show the output of the program before you crack it (i.e., the "Incorrect" 0/2 output). A screenshot suffices.
Since there are several ways to crack the program, document that you can run the program with the correct output shown (i.e., the "Correct" 2/2 output). A screenshot suffices.
In addition, you must describe how you cracked the program. If you found values for the passwords, show their values.
If you patched the program, provide a copy of your cracked program in your submitted zip file.

Task 2: LLVM and program analysis

For this task, you will use LLVM to analyze and instrument a given program to manually implement stack canaries. You will do this by creating an LLVM Pass to transform an input program so that each function contains additional code that implements the stack canary.

You are implementing a temperamental stack canary protector in this assignment. However, it is not a complete implementation and this should not be used to protect critical software. Indeed, part of this assignment is writing why this solution is insufficient in practice.

Program Instrumentation and Modification

When you write software, you generally rely on a tool to build, compile, or interpret it. gcc or clang for C/C++, javac for Java, python for Python. These language processing tools automatically convert your source code to an executable form that the CPU is happy with. Suppose that, over time, we figure out what different vulnerabilities look like in source code. Rather than manually updating all of our projects, can we automatically transform our programs to contain fewer vulnerabilities?

We can build programs that analyze and transform programs for us. This is often referred to as static analysis — the systematic examination of programs to discover properties that are important. For example, we might use a static analysis to determine if it's ever possible that a particular buffer is transmitted over the network (see Information Flow; you could check if variable x contains data that ever reaches a function call, send()). Static analysis is used for a variety of powerful security analyses. In this assignment, you will build a program transformation tool that hardens input programs against stack overflow attacks.

LLVM Primer

LLVM is a set of tools that lets us do these automated program analyses (it used to stand for "Low-level virtual machine," but this is a terribly outdated name, so we just call it LLVM). It is an extremely powerful, widely-used compiler framework (indeed, if you've used clang before, LLVM is its parent project; the Clang front-end uses LLVM IR under the hood). LLVM allows you to work with, analyze, and transform programs during the compilation process. While it has historically been used to implement compiler optimizations (e.g., to make programs faster or smaller), people use LLVM for a variety of creative analyses.

LLVM and Intermediate Representations

At its core, LLVM is part of a compiler toolchain (clang) that produces a robust "Intermediate Representation" (IR). In Compiler Theory, an IR is a representation of a program that permits some analysis that is helpful for compilation. For example, a parser produces an Abstract Syntax Tree of the program to verify that a given source code file contains syntactically valid source code.

We frequently work with an IR called a "Control Flow Graph" (CFG). You saw CFGs in part 1 of HW2. This is a graph structure that represents how the CPU flows through instructions in the program. For example, consider the function below:

The foo function would have a CFG that looks like:

        ENTRY
          |
        printf("Hello...");
        if (x < 5)
     /            \
   true          false
x = x + 1       x = x - 1
     \            /
        printf("Goodbye...");

Each function in a program can be represented as a CFG. Indeed, LLVM gives us access to Basic Blocks (nodes) in the CFG. Both Abstract Syntax Trees and Control Flow Graphs are examples of Intermediate Representations used by compiler toolchains like LLVM.

For this assignment, you will use LLVM to make a pass over the instructions contained in a program's CFG to create and enforce stack canaries.

Stack Canaries

Recall from lecture we discussed how stack overflow attacks work. An attacker controls a buffer in which they place malicious code to run. The attacker overwrites a saved return address that is placed on the stack when a function is first called. One way of detecting stack overflow attacks is through the use of "stack canaries." Loosely, we place a known random value on the stack at the beginning of a function. If an attacker smashes the stack, they will overwrite that stack canary. Thus, we can check whether the canary has been changed right before the function returns. Unless the attacker knows the exact value we chose as the canary, we can tell if that value has been overwritten.

In this task, your job is to write a tool that can automatically transform a given C program so that it contains code to generate and check stack canaries. You will use LLVM to produce this tool.

Intuition for Task 2

Suppose we have a program that contains a stack overflow vulnerability. We could potentially protect the program by detecting if the stack has been corrupted. In particular, if we place a magic value on the stack, then check if that value changed later, we can see if an attacker overwrote parts of the stack they were not supposed to. Consider a "before" and "after" version of such a program:

Example for stack canaries

"Before"

"After"

On the left, we have an unaltered program that has a stack overflow vulnerability. On the right, we created some random int values on the stack, then check their value right before the function returns. This way, if an attacker tries to smash the stack inside the foo function, they will overwrite the part of the stack that contains foo_canary. Unless the attacker knows the exact value of the canary (yes, yes, it's right there in the code, discuss that in your writeup), it is very likely that the attacker will dump a random value there. If they do, we can detect that and intervene before they hijack control (e.g., by aborting the program).

Getting Started

In LLVM, you can create "passes" that run over a given input program. A pass takes in a program as input, and produces a program as output. You will create a pass that, when applied to a given program, transforms all of the functions in that program to contain code that creates and checks stack canaries. The resulting transformed program can be compiled and executed like any other.

First, you must install LLVM. For this assignment, please use LLVM 13. On Kali, you can use

sudo apt-get install llvm-13

After installing it, there's a whole lot of code placed under /usr/lib/llvm-13/include/llvm.

If you ever need clarification on how to invoke a particular LLVM method, you can often grep through that directory. For example,

grep -r 'ReturnInst' /usr/lib/llvm-13/include/llvm

Building the starter LLVM Pass

First, ensure you have llvm 13 installed

sudo apt-get install llvm-13

You will also need to set the LLVM_DIR environment variable to point to LLVM's location.

export LLVM_DIR=/usr/lib/llvm-13

(you can add this to your ~/.bashrc file to ensure that gets set every time you start your VM).

In this package, there is starter code under the "pass" directory. To build the LLVM pass, we use cmake. If you are working in the part-2 directory:

cd pass
mkdir build
cd build
cmake -DLLVM_INSTALL_DIR=$LLVM_DIR ../
make

If the build is successful, it produces a file called pass/build/lib/libInstrumentFunctions.so. The .so file is your LLVM pass module. We use it to automatically transform an input file.

To use the pass against an input file, we first create LLVM IR of the subject program (i.e., the program to be transformed) by using clang:

clang -m32 -emit-llvm -S -fno-stack-protector -static  subject.c  -o subject.ll

This produces a file called subject.ll that contains the raw LLVM IR. This is similar to assembly code, but it is in a particular format that can be readily analyzed by LLVM passes and related tools.

Once you have the LLVM IR, you can run the pass against it:

$LLVM_DIR/bin/opt -load-pass-plugin ./pass/build/lib/libInstrumentFunctions.so --passes="cs8395-hw" subject.ll -S -o instrumented.ll

Alternatively, there is also a helper script to run the pass once you have it built:

./plug.sh subject.ll instrumented.ll

This produces a file that is transformed according to the pass tool you generate. Once you have the instrumented LLVM IR, you can compile it:

$LLVM_DIR/bin/llc -m32 instrumented.ll
clang -m32 -fno-stack-protector -static instrumented.s -o instrumented.o

You use llc to lower the LLVM IR to raw x86 assembly. Then, we compile it with clang to produce a binary (you can likely use gcc as well). Finally, instrumented is a normal program you can run directly:

./instrumented.o

There is also an invoke.sh script you can use (it's slightly better than the one for HW1).

./invoke.sh ./instrumented.o input_file

Understanding the starter code

The starter code contains a functioning LLVM pass that instruments the program to print out every function the program executes. For example, consider the program below:

By default, this program just prints out 13. However, after running the starter LLVM transformation, the program outputs:

In main
In foo
In bar
In foo
13

That is, we have instrumented the program so that, when it runs, it prints out each function that is executed. We have used a static analysis to change the behavior of the program once it is compiled. Your task is to implement an LLVM transformation pass that uses and enforces a stack canary in each function.

Requirements for Task 2

You must modify only the file ./pass/lib/InstrumentFunctions.cpp to implement the stack canary protection.

The starter code instruments each function in the subject program with a call to printf that makes the program print the name of each function that executes. In addition, the starter code also creates a (currently unused) stack canary variable in each function. You must make your LLVM pass emit instructions that (1) load the existing canary from memory, and (2) confirm that the canary has not changed, all before the function returns.

At a minimum, your code must transform the subject.c program such that:

If no stack smashing has occurred, no additional output is produced.
If the stack is smashed and the function's canary is overwritten, you print the name of the function that was executing when the stack was smashed.

Notably, you don't have to do anything fancier — ordinarily, you might expect the program to abort (or similar). However, to make the implementation more straightforward, you only have to detect and print when the canary was overwritten with a different value.

More concretely, you need LLVM code that, for each function, creates new "if-then" logic before the function returns. If the canary is changed, print the name of the function. Otherwise, the canary is unchanged and execution proceeds as normal. Example outputs below.

Example outputs pre- and post-transformation

Original, uninstrumented program with malicious payload that shells out to echo Hello, world:

Instrumented program, running with a clean, non-malicious input:

Instrumented program, running with a malicious input:

While the original program is susceptible to the buffer overflow attack from HW1, the instrumented program now includes canary protection that causes it to print "Function X in bad shape!" if the canary is overwritten. However, if a clean input is provided, the program behaves normally without additional output.

Hints and Commentary

The way LLVM works is by considering "instructions" in the IR. Instructions in LLVM can be things like add, subtract, compare, jump, branch, call, return, etc. You can see how the starter code emits an instruction to allocate space for the new canary, then stores a randomly-generated value in that new space.

I will note that LLVM has a bit of a learning curve. There are tons of instructions, methods, and types available throughout its codebase. My suggestion is to focus on the following keywords (documentation available via google):

AllocaInst
CreateLoad, CreateStore
Instruction
SplitBlockAndInsertIfThen and SplitBlockAndInsertIfThenElse
IRBuilder, SetInsertPoint

You can likely start at return instructions and work backwards to add code that checks if the canary has been changed. The canary is already created for you by the starter code in each function — you just need to complete the implementation by adding the conditional check before the function returns.

At the top of each function, the starter code uses an IRBuilder to create (1) an AllocaInst and (2) a StoreIntr. In LLVM IR, programs allocate memory (like on the stack) by using an "Allocate" instruction. In particular, we build a 32-bit Integer and allocate it on the stack. Then, we store a randomly-initialized value in that newly-allocated space. You can see in the code how these LLVM instructions get placed into the existing control flow graph of the function (there are helper methods from the IRBuilder class that make it more straightforward).

In addition, the starter code also places a printf call in the function which, at runtime, tells you the name of the function that is currently executed. Since the requirement is that you print out the name of the function whose stack is smashed, you can adapt this code. The difference is that we only want the print to execute if the canary was changed. Given what you can see about how LLVM instructions are placed, how would you change the printf-related instructions so that they are guarded by an if-then check before the function returns?

Turn-in for Task 2: LLVM Stack Canary Protector

You must implement manual stack canary protection with the LLVM pass as described above. In your submission, please include:

A copy of your modified InstrumentFunctions.cpp file that implements your pass.
In your writeup, address the following:
- Demonstrate that your LLVM pass can successfully transform the subject.c program such that it prints out the name of the function whose stack is smashed during execution. Screenshots of the program before/after applying your transformation are acceptable.
- Describe how you implemented your solution. What resources did you use? What parts of LLVM did you use? How did you decide what to do for each step (e.g., loading the canary, comparing with the original value, etc.)?
- What are the limitations of this approach to implementing stack canaries? What could go wrong with this approach? How would you compromise it? What assumptions would you have to make about an attacker who would be able to compromise this technique?
- Now that you have an idea of the types of transformations LLVM is capable of performing, what other security analyses do you think are possible or useful?

What to turn in for HW2

you must submit a single .zip file called vunetid.zip. While you can work with others in the class conceptually, please submit your own copy of the assignment. Your zip file must contain:

hw2.pdf, a single PDF file containing the written elements required in this assignment. You should have a Task 1 and Task 2 section in your writeup.
If you cracked the crackme program by patching it, you must provide a copy of your modified program (called vunetid-crackme).
A copy of your InstrumentFunctions.cpp file that implements the LLVM-based stack canary protection.

Use the submission system (VU login required) to submit your zip file.