PA2 — The Lexer

PA2 is due 9/15 at 11:59pm Central.

Programming assignments 2 through 5 will direct you to design and build an optimizing compiler for Cool. Each assignment will cover one component of the compiler: lexical analysis, parsing, semantic analysis, code generation, and optimization. Each assignment will ultimately result in a working compiler phase which can interface with the other phases.

You must complete this assignment using Python.

You may work in a team of up to two people for this assignment. You may work in a team for any or all subsequent programming assignments. You do not need to keep the same teammates. The course staff are not responsible for finding you a willing team. If you want to work on a team, you must register your group on the autograder before submitting!

Goal

For this assignment you will write a lexical analyzer, also called a scanner, using a lexical analyzer generator. You will describe the set of tokens for Cool in an appropriate input format and the analyzer generator will generate actual code. You will then write additional code to serialize the tokens for use by later compiler stages.

Specification

You must create three artifacts:

A program that takes a single command-line argument (e.g., file.cl). That argument will be an ASCII text Cool source file. Your program must either indicate that there is an error in the input (e.g., a malformed string) or emit file.cl-lex, a serialized list of Cool tokens. Your program's main lexer component must be constructed by a lexical analyzer generator. The "glue code" for processing command-line arguments and serializing tokens should be written by hand. If your program is called lexer, invoking lexer file.cl should yield the same output as cool --lex file.cl. Your program will consist of one or more Python files.
A plain ASCII text file called readme.txt describing your design decisions and choice of test cases. See the grading rubric. A few paragraphs should suffice.
Testcases good.cl and bad.cl. The first should lex correctly and yield a sequence of tokens. The second should contain an error.
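
The hand-written glue code for the first artifact could start from something like the sketch below. The helper names output_path_for and parse_args are hypothetical and are not part of the assignment; only the file.cl to file.cl-lex naming comes from the specification above.

```python
def output_path_for(source_path):
    """Return the output path for a source file, e.g. 'file.cl' -> 'file.cl-lex'."""
    return source_path + "-lex"


def parse_args(argv):
    """Return the single source path from an argv-style list, or None.

    argv is expected to look like sys.argv: the script name followed by
    exactly one argument, the Cool source file.
    """
    if len(argv) != 2:
        return None
    return argv[1]


# Example with a hand-built argument list (a real program would pass sys.argv).
path = parse_args(["main.py", "file.cl"])
print(output_path_for(path))
```

The remaining steps (running the generated lexer and writing the serialized tokens) would follow once the path is known.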

You must use the Python ply library for this assignment. This will allow you to specify tokens using regular expressions.

Students have asked previously about creating a lexer from scratch (i.e., without lex). While you are welcome to try, it is not recommended. There are numerous odd corner cases that make the task a significant undertaking without corresponding pedagogical value.

Line Numbers

The first line in a file is line 1. Each successive '\n' newline character increments the line count. Your lexer is responsible for keeping track of the current line number.
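
As a minimal illustration of this counting rule, the sketch below computes the line number of a character offset by counting the newline characters before it. The function name line_number_at is hypothetical; a generated lexer would normally update a counter as it scans rather than recount from the start.

```python
def line_number_at(source, offset):
    """Return the 1-based line number of the character at `offset`.

    The first line is line 1, and every '\n' before the offset adds one.
    """
    return 1 + source.count("\n", 0, offset)


source = "Backslash not\n   allowed\n"
print(line_number_at(source, source.index("Backslash")))
print(line_number_at(source, source.index("allowed")))
```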

Error Reporting

To report an error, write the string

ERROR: line_number: Lexer: message

to standard output and terminate the program. You may write whatever you want in the message, but it should be fairly indicative. Example erroneous input:

Backslash not allowed \

Example error report output:

ERROR: 1: Lexer: invalid character: \
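
One possible way to produce this report is sketched below. The function names are hypothetical, and the exit status of 1 is an assumption: the specification above only says to write the message to standard output and terminate.

```python
import sys


def format_lex_error(line_number, message):
    """Build the report string in the 'ERROR: line_number: Lexer: message' form."""
    return f"ERROR: {line_number}: Lexer: {message}"


def report_lex_error(line_number, message):
    """Print the report to standard output and terminate the program."""
    print(format_lex_error(line_number, message))
    sys.exit(1)  # assumed exit status; the handout does not specify one


# Formatting only, so the example does not exit.
print(format_lex_error(1, "invalid character: \\"))
```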

The .cl-lex File Format

If there are no errors in file.cl your program should create file.cl-lex and serialize the tokens to it. Each token is represented by a pair (or triplet) of lines. The first line holds the line number. The second line gives the name of the token. The optional third line holds additional information (i.e., the lexeme) for identifiers, integers, strings and types. For example, for an integer token the third line should contain the decimal integer value.

Example input:

Backslash not
   allowed

Corresponding .cl-lex output:

1
type
Backslash
1
not
2
identifier
allowed
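
The serialization step can be sketched as follows, assuming each token is held as a (line_number, name, lexeme) tuple with lexeme set to None for tokens that carry no third line. That tuple layout and the function name are illustrative choices, and ending the output with a final newline is also an assumption; compare against the reference compiler's output to confirm.

```python
def serialize_tokens(tokens):
    """Serialize (line_number, name, lexeme) tuples into .cl-lex text.

    Each token becomes two lines (line number, token name), plus a third
    line holding the lexeme when one is present.
    """
    lines = []
    for line_number, name, lexeme in tokens:
        lines.append(str(line_number))
        lines.append(name)
        if lexeme is not None:
            lines.append(str(lexeme))
    return "".join(line + "\n" for line in lines)


# The tokens from the example input above.
example_tokens = [
    (1, "type", "Backslash"),
    (1, "not", None),
    (2, "identifier", "allowed"),
]
print(serialize_tokens(example_tokens), end="")
```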

The official list of token names is:

at case class colon comma divide dot else equals esac false fi identifier if in inherits integer isvoid larrow lbrace le let loop lparen lt minus new not of plus pool rarrow rbrace rparen semi string then tilde times true type while

In general the intended token is evident. For the more exotic names:

at = @, larrow = <-, lbrace = {, le = <=, lparen = (, lt = <, rarrow = =>, rbrace = }, semi = ;, tilde = ~.
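
For reference, the punctuation tokens can be written as a lookup table. The sketch below is illustrative only: the entries not spelled out above (such as plus = + and times = *) are the evident readings of the names, and match_symbol is a hypothetical helper showing why two-character symbols such as <- and <= have to be tried before <.

```python
# Symbol text -> official token name.
SYMBOL_TOKENS = {
    "@": "at", ":": "colon", ",": "comma", "/": "divide", ".": "dot",
    "=": "equals", "<-": "larrow", "{": "lbrace", "<=": "le", "(": "lparen",
    "<": "lt", "-": "minus", "+": "plus", "=>": "rarrow", "}": "rbrace",
    ")": "rparen", ";": "semi", "~": "tilde", "*": "times",
}


def match_symbol(text, pos):
    """Return (token_name, length) for the longest symbol at text[pos:], or None."""
    for length in (2, 1):  # try two-character symbols first
        candidate = text[pos:pos + length]
        if len(candidate) == length and candidate in SYMBOL_TOKENS:
            return SYMBOL_TOKENS[candidate], length
    return None


print(match_symbol("x <- 1", 2))
print(match_symbol("a < b", 2))
```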

The .cl-lex file format is exactly the same as the one generated by the reference compiler when you specify --lex. In addition, the reference compiler (and your upcoming PA3 parser!) will read .cl-lex files instead of .cl files.

You can invoke the reference compiler for a given something.cl program:

$ ./cool --lex something.cl

This will produce a corresponding something.cl-lex file that shows the lexed tokens. You can use this for your own testing as you create your own lexer for PA2.

Lexical Analyzer Generators

You must use a lexical analyzer generator library for this assignment. This assignment uses Python and its corresponding lexical analyzer generator library, ply, which you can install with pip.

$ pip3 install ply

Most programming languages have a similar lexer generator library, and most of these are modeled on lex (or its successor flex), the original lexical analyzer generator for C. Thus, you may find it handy to refer to the Lex paper or the Flex manual. As you read, mentally translate the C code references into Python.
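
To give a feel for how a lex-style specification looks in ply, here is a minimal sketch covering only five of the official token names. The regular expressions are simplified illustrations rather than the full Cool rules, the helper lex_text is hypothetical, and raising an exception in t_error is a stand-in chosen to keep the example short (a real submission would report the error as described under Error Reporting).

```python
import ply.lex as lex

# A small subset of the official token names, chosen for illustration.
tokens = ("identifier", "integer", "plus", "larrow", "lt")

# Simple tokens are regular-expression strings named t_<token name>.
t_plus = r"\+"
t_larrow = r"<-"
t_lt = r"<"
t_integer = r"[0-9]+"
t_identifier = r"[a-z][A-Za-z0-9_]*"  # simplified pattern, for illustration only

# Spaces and tabs are skipped without producing tokens.
t_ignore = " \t"


def t_newline(t):
    r"\n+"
    # ply does not count lines by itself; this rule updates the counter.
    t.lexer.lineno += len(t.value)


def t_error(t):
    # Called when no rule matches the current character.
    raise SyntaxError(
        f"ERROR: {t.lexer.lineno}: Lexer: invalid character: {t.value[0]}"
    )


def lex_text(source):
    """Return (line, token name, lexeme) triples for `source`."""
    lexer = lex.lex()  # builds a lexer from the t_ rules in this module
    lexer.input(source)
    return [(tok.lineno, tok.type, tok.value) for tok in lexer]


print(lex_text("x <- 1 +\n y"))
```

ply adds string-defined rules in order of decreasing regular-expression length, which is why `<-` is matched in preference to `<` here.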

Commentary

You can do basic testing with something like the following:

$ cool --out reference --lex file.cl
$ python3 my-lexer.py file.cl
$ diff -b -B -E -w file.cl-lex reference.cl-lex

You may find the reference compiler's --unlex option useful for debugging your .cl-lex files.

Need more testcases? Any Cool file you have (including the one you wrote for PA1) works fine. The contents of cool-examples.zip should be a good start. There's also one among the PA1 hints. You'll want to make more complicated test cases — in particular, you'll want to make negative testcases (e.g., testcases with malformed string constants).

Video Guides

NOTE: Some of these video guides are from a previous offering of a similar course at the University of Virginia. The assignment for this semester has changed slightly. While they are still relevant, you are responsible for completing the assignment according to this course's grading rubric.

A number of Video Guides are provided to help you get started on this assignment on your own. The Video Guides are walkthroughs in which the instructor manually completes and narrates, in real time, the first part of this assignment — including a submission to the grading server. They include coding, testing and debugging elements.

If you are still stuck, you can post on the forum, approach the TAs, or approach the professor. The use of online instructional content outside of class weakly approximates a flipped classroom model. Click on a video guide to begin, at which point you can watch it fullscreen or via YouTube if desired.

Python + PLY

C/C++ + Flex

What To Turn In For PA2

You must turn in several files to the autograder:

readme.txt — your README file
good.cl — a novel positive testcase
bad.cl — a novel negative testcase
main.py — your lexer. You can optionally include up to 20 *.py Python source files, but the autograder will invoke your submission using python3 main.py.

Working In Teams

You must pre-register your team on the autograder for each assignment. Follow the instructions on the autograder, using your teammate's Vanderbilt email address to create the team. Each team gets 5 submissions per day, shared between members.

You may complete this project in a team of one or two members. Teamwork imposes burdens of communication and coordination, but has the benefits of more thoughtful designs and cleaner programs. Team programming is also the norm in the professional world.

Students on a team are expected to participate equally in the effort and to be thoroughly familiar with all aspects of the joint work. All members bear full responsibility for the completion of assignments. One member turns in one solution for each programming assignment; each member receives the same grade for the assignment. Teams may not be dissolved in the middle of an assignment.

Grading Rubric

PA2 Grading (out of 85 points total):

61 points — for autograder tests (-1 point per failed test, minimum score of 0)

There are 32 positive test cases (i.e., expected to lex properly), and 29 negative test cases (i.e., expected to produce an error message).

8 points — for a clear description in your readme.txt

8 — thorough discussion of design decisions (including handling of strings and comments) and choice of test cases; a few paragraphs of coherent English sentences should be fine
4 — vague or hard to understand; omits important details
0 — little to no effort

8 points — for valid and novel good.cl and bad.cl files (i.e., 4 points each)

4 — wide range of test cases added, stressing most Cool features and an error condition, novel files
2 — added some tests, but the scope is not sufficiently broad
0 — little to no effort or submitted part of course files as test cases

8 points — for code cleanliness

8 — code is mostly clean and well-commented
4 — code is sloppy and/or poorly commented in places
0 — little to no effort to organize and document code