Building an OS Kernel
Talking to the Metal
In the Fall 2024 semester, Qinghan Chen and I implemented a Unix-like preemptive multitasking kernel for x86 as the term project for CMU’s 15-410, Operating System Design and Implementation. The handout gave us a freestanding C runtime, an LMM physical-frame allocator, and a console driver. Everything else (virtual memory, scheduling, processes, system calls, interrupt handling, software exceptions) we wrote from scratch. By the end we had roughly 9,000 lines of C and 840 lines of x86 assembly across about 40 source files.
I had taken systems courses before, but this was the first time I had to think about an entire machine at once. The semester is structured so you start with nothing booting and end with something that can fork, exec, and run a small shell. There is a particular feeling, the first time you write a piece of code that the processor will dispatch through a hardware vector, where you understand that the rules have changed. I want to describe three of the core design decisions we made.
Page-table modification via an identity-mapped directory
The kernel needs to modify the page tables of other processes during operations such as fork, exec, and page-fault handling. While running in kernel mode, the kernel has to reach those page directories through some virtual address.
So at boot the kernel allocates an additional page directory, sudo_index, that establishes an identity mapping for every physical frame the system will use. To modify any process’s page tables, the kernel saves the running CR3, switches CR3 to sudo_index, modifies the target page directory by treating its physical address as a virtual address, then restores the original CR3. Page-table edits collapse to direct pointer arithmetic on physical frames. The kernel’s virtual address space stays bounded regardless of how much physical memory the system has.
The cost is two CR3 reloads per cross-process modification. In return, it bought us a real reduction in code complexity for the rest of the term. Every time we later had to touch another process’s mappings, the work was straightforward.
Lock acquisition order in interrupt-disabled regions
Mutex acquisition is built on a single XCHG against a lock word. A thread that finds the lock held yields the CPU and retries on its next scheduling slice. The hazard appears at the intersection with interrupt-disabled regions. Some scheduler operations require interrupts to be off, and may also allocate through the kernel heap, which is itself protected by a mutex. If interrupts are disabled before that heap mutex is acquired, the thread holding it cannot be scheduled, and any contender blocks indefinitely.
We had exactly this pattern in two places, and the resulting deadlocks were intermittent, timing-sensitive, and prone to disappearing under instrumentation. Qinghan and I spent two evenings on it before the cause clicked. The rule is plain: acquire the high-level mutex first, then disable interrupts, then perform the work. Inverting those two steps in any path is enough to wedge the system under contention.
Scheduler structure
The scheduler maintains three intrusive queues, each indexed by a hashtable on TID for constant-time membership and removal:
- A runnable queue: threads eligible for execution. The head is the currently running thread; a timer tick rotates head to tail.
- A descheduled queue: threads suspended via the
deschedulesyscall, awaiting a correspondingmake_runnable. - A blocked queue: threads suspended on synchronous events such as
wait().
Per-process metadata tracks the parent PID, page directory, set of live child processes, set of vanished but unreaped children, and threads waiting in wait(). Per-thread metadata tracks TID, kernel-stack base and saved ESP, and any registered software-exception handler.
The actual context switch is short. A small assembly macro pushes general-purpose registers, segment selectors, and CR2/CR3 onto the kernel stack. A C handler then selects the next runnable thread and swaps in its saved stack pointer, and a paired assembly routine pops the new thread’s frame and resumes execution. Routing both timer-driven preemption and voluntary yield() through the same restore path means there is one switching routine to debug, not two. When we eventually had a context-switch bug (we did, an off-by-eight in our segment-register pushes), we only had to find it once.
System calls
We implemented thirty-three user-facing system calls, organized by subsystem:
- Process and thread lifecycle:
fork,thread_fork,exec,vanish,task_vanish,wait,set_status - Thread management:
gettid,yield,deschedule,make_runnable,sleep,get_ticks - Memory:
new_pages,remove_pages - Console I/O:
print,getchar,readline, plus cursor and color controls - Software exceptions:
swexn, with dispatchers for the architectural exception set (divide-by-zero, page fault, general protection, and the remaining standard vectors) - Miscellaneous:
halt,readfile
Each subsystem is partitioned into a C handler and an assembly wrapper. The wrapper saves and restores context and validates user pointers; the C handler implements the syscall logic. Keeping the assembly minimal and uniform across subsystems made review easier toward the end of the term, when both Qinghan and I were too tired to hold the whole control flow in our heads.
The most involved handlers were the expected ones. fork clones the address space and the kernel-stack frame so the child returns from the same syscall instruction. exec loads an ELF image, replaces the address space, and constructs a fresh user-mode stack. vanish and wait cooperate to move a terminating thread into the parent’s vanished-children queue, wake any waiters, and reparent orphans to init. The first time I saw fork return zero in a child and a positive value in the parent, with the rest of the system continuing to run, I had to stop and sit with it for a moment.
Drivers
The timer fires at 1 kHz and drives preemption. The keyboard driver consumes scancodes into a circular buffer on which getchar and readline block. The console writes to VGA text-mode memory under a mutex on the cursor state. The interrupt-wrapper assembly preserves a uniform stack layout across every interrupt source, so a single C-side handler signature serves every vector.
Closing notes
I came out of P3 with a different relationship to systems code. Reasoning about correctness in user-space programs is, mostly, reasoning about your own code. Reasoning about correctness in kernel code requires holding three layers in mind at once: the instruction stream, the page-table state, and the scheduler queue. Any inconsistency between them halts forward progress entirely.
The other shift was about preemption. Every line in the kernel is a potential preemption point, and once that lands you stop fighting it. Design becomes about where you would rather not be interrupted, and arranging the surrounding code so that interruption at those points preserves invariants. The way of thinking has stayed with me in code that lives nowhere near a kernel.
To understand a program you must become both the machine and the program.
Alan Perlis