CS116 - From NAND to Tetris

"Any sufficiently advanced technology is indistinguishable from magic."
-- Arthur Clarke

For many computer scientists, computers work by magic. Stanford tries to help more computer scientists understand computers by teaching low level classes. By "low level" classes, I mean classes where students program closer to the hardware. CS107, the first class in C (rather than C++ or Java), does a lot of this. I loved CS107, but even after taking it, there was still a lot of magic in how computers worked. CS107 made me understand how programs work if we accept that assembly works, but not understanding how assembly was implemented was a significant gap in understanding.

As a result, I was excited to learn about "From NAND to Tetris." It was taught by a visiting professor, and the course is available for free to anyone who's interested at http://www.nand2tetris.org/.

The idea of the course is starting at NAND, the smallest subcomponent of a chip, and to keep implementing higher level abstractions until you have Tetris.

Note that most of the implementations that I discuss below are probably not what you would see in the real world. That's because this course was about conceptually understanding the full computer stack and not about optimizing each component.

Still, after taking this class, all of the magic should be gone.

NAND and Logic
"NAND" stands for "Not AND." NAND is a logical operator that is the inverse of AND. So, AND(p, q) is true if p is true and q is true, and it is false otherwise. NAND(p, q) is true if p is false, q is false, or both are false. NAND isn't the only possible starting point (NOR works just as well), but the hardware implementations are pretty efficient and general, so we went with that.

With only NAND, you can create all of the other logical gates. For instance NOT(p) = NAND(p, p). Now, we have a NOT chip using only NAND. AND(p, q) = NOT(NAND(p, q)). OR(p, q) = NAND(NOT(p), NOT(q)). NOR = NOT(OR(p, q)). XOR(p, q) = AND(NAND(p, q), OR(p, q)).

Some logical chips that I hadn't thought about before were the multiplexers (MUX) and demultiplexers (DMUX). MUX(p, q, select) outputs p if select is false and q if select is true. Another way to think about it is that MUX will output true if select and q are both true, or if select is false and p is true. In chips, a MUX is used kind of like an "if" for programmers (well, more like a ternary "sel? q: p"). Since a DMUX does the opposite of a MUX, and a MUX turns several values into just one, a DMUX needs to output two different values. DMUX(p, sel) outputs {p, 0} if sel is 0, and {0, p} if sel is 1.

We can also add in multi-bit gates. A bit is a binary digit. So far, all of the variables I have discussed were one bit (true or false), but we might want to do things with several bits. For instance, if we have a 16 bit computer (a computer that thinks in terms of numbers between 0 and 2^16 - 1), we might want to have a 16-bit bitwise AND. We might want to also have multi-way functions. For instance, an 8-way OR would be true if any of 8 individual bits were true (the OR implementaion a couple of paragraphs ago was a two way OR). We might want a 4-way DMUX that would have a 2-bit selector, which could select between 4 possible outputs.

Of course, while I have been talking about "true" and "false" here, those just refer to voltages, where "true" might mean a high voltage.

Logic is well and good, but we want to do stuff with numbers. We want to add!

Instead of thinking about true and false, I can think of 1 and 0. Then, when adding two digits, I have to worry about two things: the value in my current place and the carry. If I only care about adding two bits, then that's easy. If we have a "half adder" that can add two digits, we can build a "full adder" out of that (we need to string together two half-adders to make a full adder because I have to worry about p, q, and the carry in any given place). I can extend a full adder to add 16 bit numbers by stringing together 15 full adders and a half adder.

Now, I have everything that I need to make a basic ALU -- arithmetic logic unit. Our ALU took two 16-bit inputs and six control bits. You can zero the x input, negate the x input, zero the y input, negate the y input, set the output to x+y or to x & y (bitwise AND), and negate the output. This lets you compute lots of arithmetic functions, including 0, 1, -1, x, y, !x (! is shorthand for NOT), -x, x+1, x-1, x+y, x-y, x & y, x | y (| is shorthand for OR). And everything you can do to x, you can also do to y.

Implementing the ALU was a challenge to think about because it require really grasping the declarative programming model. Most programmers learn imperative programming styles, where you tell the computer commands, and it carries those commands out one by one. In declarative programming, on the other hand, there is no concept of things happening sequentially. You have a bunch of chips, and you're putting electricity through them. The electricity goes through all of them all the time as long as there's a circuit. That's why you need to use MUX: there is no IF. As a result, you have to have chips that to compute everything -- you always have to run the chips that negate x and that zero x (but that won't ever be used sometimes because of the MUXing).

Sequential Logic
Learning about sequential logic was the most conceptually important part of the course for me. Given only circuits and some other primitives (like a crystal that oscillates at a particular frequency, providing us a "clock signal" that lets us act periodically), we can go from declarative programming to imperative programming.

First, make a Data Flip Flop chip. A DFF chip retains its value until it is given a new value. It outputs the old value on every clock tick. In this class, we didn't actually implement a DFF; we just used it.

One part of sequential logic is memory. Next, make a bit of memory. This is a thin layer on top of a DFF that lets you choose whether or not you're changing the data in addition to getting the output. Once you have one bit, a register is just an array of 16 bits. Addressing into RAM requires a DMUX to pass the load bit to the correct register.

Another part of sequential logic is the current instruction. We need a program counter that goes up by one with every instruction, can be reset (when the computer reboots), and that can be set to a particular instruction (when we have control structures like IF or functions or GOTOs.

A Computer
At this point, we almost have a computer. A computer has two components that need to be connected: the CPU and the memory (including instructions).

The memory is pretty much just what we implemented. It will load values from the address specified by the CPU and give that value to the CPU. It will write a value that the CPU tells it to an address specified by the CPU if the CPU tells it to.

The CPU is the implementation of our particular assembly code. Whatever assembly instructions exist, they will be encoded as bits in a particular pattern in our instruction part of memory. The CPU will read one instruction and do whatever it says. Assembly instructions, in general, can have the CPU read from memory, write to memory, or jump to a particular instruction. Thus, the CPU needs to get input of a value of memory (from the previous CPU instruction) and an address of an assembly instruction, and the CPU needs to output the address of the next assembly instruction, the result of some ALU calculation, and whether to write that calculation to memory (and if so, what address of memory).

Now we have a computer capable of running assembly code. At this point, the lessons from CS107 kick in.