I've wanted to write a compiler for a while now. I'm fascinated with how code can be processed into something that is understood natively by a processor. I'm less fascinated with interpreters, which is unfortunate, since there seems to be a lot more material on writing an interpreter and interpreter design on the internet. It makes sense. They're easier to write. Code can be processed incrementally, and really, it's just a translation of one language's actions into another.
Compilers on the other hand, generate completely stand-alone code. In a sense, you're still writing an interpreter, as the compiler is tasked simply with a translation job. The complexity comes from the translation being between a relatively high level to the lowest possible level (somewhat).
Before actually trying to implement a compiler, I need to do some reading. I did some preliminary reading, and I've decided that I'll have the greatest chance of success if I choose a language with a relatively simple grammar. C is pretty simple, especially if you omit some of the fancier stuff (flexible declaration locations, function pointers, inline functions) but it's still pretty long. I chose Scheme. The grammar consists basically of two parenthesis, number, character, and boolean literals, and identifiers. Everything else is a function (within some obvious limits, looking at you, macros).
You can write a compiler for scheme and then implement the core functions as you go. In fact, the scheme interpreter I'm using to learn scheme, gambit (which is also a compiler), omits lots of common functions found in the MIT scheme interpreter. If you look at the string library for MIT Scheme, you can count on about 80% of the functions on the page not being available in gambit.
Onto the fun part.
I decided to start tackling Project Euler problems to learn Scheme. I've always found that the first 10-20 problems pretty much put you through a language's paces. You can see my current progress on github in my projectEuler repository (no cheating!). Not only have I learnt a lot of Scheme over the last few weeks, I've learned a lot about functional programming, and it's been a real mental challenge to warp my mind into the functional way of thinking.
I thought I'd showcase a few of the cool things that I did, and some of the novel (probably convoluted approaches) I took.
;;Project Euler Problem #1
;; If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5,
;; 6 and 9. The sum of these multiples is 23.
;; Find the sum of all the multiples of 3 or 5 below 1000.
;;
;; Answer : 233168
;;v2
(begin
(define sumDivisibleBy (lambda (n end)
(let ((count (quotient end n)))
(* count (+ count 1) 0.5 n))))
(display (- (+ (sumDivisibleBy 3 999) (sumDivisibleBy 5 999)) (sumDivisibleBy 15 999)))
(newline)
)
So, this one's pretty simple. We define a function that finds the sum of all numbers divisible by n that are less than end. We know that n can fit into end end/n times (fourth grade here), so we simply use the formula for the sum of a series of integers to find the sum of the numbers, then multiply that by n. Simple enough.
Problem 3 isn't really important to this following bit of code. I was planning on writing a prime sieve for this problem, for memoization of the prime factorization algorithm (to keep it from continuously rechecking if a number was prime), but sieves are usually useful when you know the upper bound that you need to target. You could stop and re-sieve with a higher bound, but I didn't want to muck around with that logic at the time.
Anyway, here's an interesting range function:
; returns a list containing every 'increment' integer between
; start (inclusive) and end (exclusive)
(define range (lambda (start end increment)
(if (not (or (and (< start end) (> increment 0)) (and (> start end) (< increment 0))))
'()
(begin
(if ((if (< start end) > <=) start end)
'()
(begin
(cons start (range (+ start increment) end increment))))))))
Again, MIT-scheme provides a method quite like this built-in, but gambit did not. I think the lack of a huge library really helped me learn better. I was continuously forced to go and write methods that would have otherwise been provided for me.
The interesting thing here is in the second if
statement, and in the recursiveness (it's a really slow range function, I wrote a faster one using vectors later) of the function. The first conditional checks to make sure that the function wasn't called with an increment = 0
or a lower bound that is greater than the upper bound for increment > 0
and vice versa. It returns an empty list ('()
) if these checks fail. The second if statement checks to make sure that the function is still within the bounds of the range. That is, if start < end
for increment > 0
or vice versa. What's interesting about it, is that there is an if
inside an if
which returns an operator for the outer if
to use.
(if (< start end) > <=)
This code will cause the outer if statement to compare the start and end with the appropriate comparator.
However, it should be noted that this code could also be written much more concisely as:
(if (= start end))
But isn't the code above so much cooler?
This problem is simple, for the first 100 natural numbers, find the difference between the sum of squares and the square of the sum. Hey, didn't I just write a range function? This one was super simple.
(define findDiff (lambda (n)
(let* (
(sumSquares (apply + (map (lambda (n) (expt n 2)) (range 1 (+ n 1) 1))))
(squareSum (expt (apply + (range 1 (+ n 1) 1)) 2)))
(- squareSum sumSquares))))
The sum of squares is simply mapping a lambda that is, sort of currying expt
with a power of 2 over the range, and then applying the +
function to the resulting list. The square of the sum is even easier, we just apply +
to a range, and then call expt
. Take the difference, and that's the sum of the squares.
This problem was strangely harder than problem 6, but it's probably more in the way I solved it. The challenge is to find the first number that's divisible by all numbers 1 through 20. I'm not going to show all the code here. I'll explain the helper functions used, but there's too much code to put in one place. In actuality, the solution boils down to five lines that process lists.
The math involved is pretty hairy, so I'll summarize it. Turns out, to find the LCM of two numbers, you can use their prime factorizations. For each prime number that is represented in the factorizations, find the highest power it is raised to, then multiply each prime that is represented raised to it's highest power.
(define findFirstDivisibleUpTo (lambda (u)
(let* (
(factorizations (map (lambda (n) (hist-supplement '() n)) (map factorize (range 2 (+ u 1) 1))))
(neededPrimes (filter prime? (range 2 (+ u 1) 1)))
(counts (map (lambda (p) (map (lambda (n) (hist-get-value n p)) factorizations)) neededPrimes))
(greatestCounts (map greatest counts)))
(apply * (map (lambda (n) (expt (list-ref neededPrimes n) (list-ref greatestCounts n))) (range 0 (length neededPrimes) 1))))))
(display (findFirstDivisibleUpTo 20))(newline)
This is probably terrible code, but I just absolutely love the process that used here. It follows the mathematical process almost exactly.
The first line does a lot,
(factorizations (map (lambda (n) (hist-supplement '() n)) (map factorize (range 2 (+ u 1) 1))))
On the far right, we map the factorize function onto a range from 2 to u (the argument to the findFirstDivisibleUpTo
function). factorize
returns a list containing the prime factorization of a number. This generates a list of lists. Then, for each list (which contains a prime factorization) we map the hist-supplement
function onto it. This function accepts a histogram, and supplements a list of numbers into it, the number of times the number appears in the argument is the frequency of that number which is supplemented into the histogram. A histogram is a list with dotted pairs that contain the frequency that each number appears. So, what this code has done has generated found the power that each prime is raised to in each prime factorization.
Here's some sample execution.
↪ gsi
Gambit v4.7.3
> (load "5.scm")
> (factorize 40)
(2 2 2 5)
> (hist-supplement '() '(5 5 2))
((2 . 5) (1 . 2))
> (map factorize (range 2 21 1))
((2)
(3)
(2 2)
(5)
(2 3)
(7)
(2 2 2)
(3 3)
(2 5)
(11)
(2 2 3)
(13)
(2 7)
(3 5)
(2 2 2 2)
(17)
(2 3 3)
(19)
(2 2 5))
> (map (lambda (n) (hist-supplement '() n)) (map factorize (range 2 21 1)))
(((1 . 2))
((1 . 3))
((2 . 2))
((1 . 5))
((1 . 2) (1 . 3))
((1 . 7))
((3 . 2))
((2 . 3))
((1 . 2) (1 . 5))
((1 . 11))
((2 . 2) (1 . 3))
((1 . 13))
((1 . 2) (1 . 7))
((1 . 3) (1 . 5))
((4 . 2))
((1 . 17))
((1 . 2) (2 . 3))
((1 . 19))
((2 . 2) (1 . 5)))
Now, just for bookkeeping, we need a list of possible primes we could have encountered while calculating the prime factorizations of the first u
numbers. This one is simple, it requires a filter function which I wrote, it's a classic filter, it returns a new list containing only elements in the source list for which a function returned true for.
(neededPrimes (filter prime? (range 2 (+ u 1) 1)))
Here's what we get if we run this line.
> (filter prime? (range 2 21 1))
(2 3 5 7 11 13 17 19)
Here's where it gets fun. The next line kind of rotates the factorizations list into a matrix. A nested mapping is performed. For every prime number in needed primes, we map the hist-get-value
function onto the factorizations
list. This means for each prime number, we get a list of exponents that that prime is raised to in each factorization, 2 through u
. This is handy, and you'll see why in the next step.
> (map (lambda (p) (map (lambda (n) (hist-get-value n p)) factorizations)) neededPrimes)
((1 0 2 0 1 0 3 0 1 0 2 0 1 0 4 0 1 0 2)
(0 1 0 0 1 0 0 2 0 0 1 0 0 1 0 0 2 0 0)
(0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1)
(0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0)
(0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0)
(0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0)
(0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0)
(0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0))
Remembering the algorithm, we needed to figure out the highest power that each prime was raised to. This is, thankfully, a simple mapping. I wrote a greatest function that returns the greatest number in list, and I map it over each list above.
> (map greatest counts)
(4 2 1 1 1 1 1 1)
Now we're getting somewhere. We have that list of primes that were represented, so, now it's just a matter of raising each prime in that list to the corresponding power, and then multiplying them together.
> (apply * (map (lambda (n) (expt (list-ref neededPrimes n) (list-ref greatestCounts n))) (range 0 (length neededPrimes) 1))))))
232792560
This method is hilariously fast. And Scheme's native handling of large numbers is awesome (especially for Project Euler).
↪ time gsi upTo1000Test.scm
7128865274665093053166384155714272920668358861885893040452001991154324087581111499476444151913871586911717817019575256512980264067621009251465871004305131072686268143200196609974862745937188343705015434452523739745298963145674982128236956232823794011068809262317708861979540791247754558049326475737829923352751796735248042463638051137034331214781746850878453485678021888075373249921995672056932029099390891687487672697950931603520000
0.54 real 0.51 user 0.01 sys
Compared to the brute force approach, it's nowhere close. For perspective, the brute force solution takes ~25 seconds to find the first number that is divisible 1-20. Though to be honest I wasn't going for speed here. I just found it interesting. What's neat is that, as soon as I read the description of the algorithm on Wikipedia, my mind instantly started thinking in terms of lists and mappings. My brain has become infected with Scheme.
where now?
I think I'm pretty comfortable with Scheme now. In fact, I probably could have stopped a while ago. The next step is reading the Dragon Book so I can start writing a compiler. I've found a copy and have started taking notes while I'm on trains and when I have idle moments. I've put the notes I'm taking up on github in the sssc repository under the lit section. Ideally I'll put the notes up for each chapter that I read, along with homework problems and solutions.
goals
I initially want to just get a lexer and a parser working. Then I'll begin working on the intermediate representation. After that, I'll focus on code generation for MIPS, since I have an emulator I'm comfortable with from a class I took (ECE 2035, in the readme there's a guide to setting up an executable copy of MiSaSiM).
After MIPS... well, I think it'd be fun to generate AVR assembly, or possibly PIC assembly (for the MPLAB assembler). That way I could actually write some code for an ATMega or PIC16 in Scheme, which would be quite fun actually.
Way further down the road, probably after I've taken my advanced computer architecture class, and another VLSI class, I'd like to actually come up with my own simple architecture and simple assembler, with an implementation in Verilog, and then target that. That way I could have built pretty much an entire system. So, we'll see where this goes. I'm excited for now.