We all know about DSA—Data Structures and Algorithms. But that's not all there is to programming, no matter the field. I emphasize this especially in software engineering: DSA is almost never the fix on its own.
Today marks over a week and a half of work on my lexer since the beginning of May 2026, when I decided to build a compiler. I've gone through different choices, code patterns, and rewrites. But you know what? I didn't read a single compiler book. Not the Dragon Book, the bible of the field. No PDF, no 2,000 pages read just to turn someone else's experience into mine.
This article isn't saying you shouldn't read books. But if you're a systems engineer like me (and my experience is fairly limited: roughly 2.5 years in C++, plus some years in programming overall), you know tools aren't always the answer. In fact, DSA isn't either. This isn't about toy projects. We're talking about real production-grade projects where every performance detail and every line of code must be understood, not just written to look pretty.
While writing the lexer in C++, coming from Rust after ditching C++ for a while, I hated my comeback. Rust's ownership model and the compiler fighting me on every borrow? I ran from it. Went back to C++ where debugging a struct—which was my token object—meant just printing fields and moving on. No lifetime elision wars, no fighting the borrow checker because my token held a reference to source text that might outlive the lexer stream. In C++, I own my memory, I leak it, I fix it. The cost is visible immediately. In Rust, the cost was hidden behind compiler errors that were technically correct but solved someone else's problem, not mine. I didn't need zero-cost abstractions. I needed to see my token struct in a debugger without wrestling with the language first.
The questions I ended up asking turned out to be simple. Deceptively so.
I noticed that, knowing only the basics of compilers (the stages, which in my case were Lexer → AST → IR → Codegen), I could get there by asking the same questions the first person who attempted it must have asked: about the architecture, about what it solves, and about the coding reality. Because, as we all know, design doesn't map well to code. At best, only the most abstract, minimal design survives: not over-architected up front, but re-architected through code. This is a classic system design phase, but I realized it through grit.
So the problem, as you've probably guessed, is really two problems. First: no manual assembly. Second: high-level code brings a problem of its own, which is how it gets back to its roots.
The best approach (not the only approach, but the best) was to use what we all call a lexer today: a stage that breaks the source stream into labeled tokens.
But why? Why not parse straight from the character stream? Why add a whole stage just to label things?
Because the problem isn't "how do I read code." The problem is how do I transform high-level intent into machine execution without writing machine code by hand. And that problem has two parts: the human writes symbols, the machine needs instructions. The gap between them is where every cost hides.
So you ask the two questions. First: what do I actually need to know about this source? Not "what data structure should I use"—that's the DSA trap. You don't reach for a hash map because hash maps are fast. You reach for it when your problem is "I need to check if I've seen this identifier before in O(1)." The problem first. The structure second.
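To make the "problem first, structure second" point concrete, here is a minimal sketch of the one question from the paragraph above: "have I seen this identifier before?" The class name `SeenIdentifiers` and its methods are hypothetical, invented for illustration; this is not code from my lexer. The hash set earns its place only because the question demands average O(1) membership checks.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_set>

// Hypothetical helper: answers "have I seen this identifier before?"
// in average O(1) time. The name and shape are assumptions for this sketch.
class SeenIdentifiers {
public:
    // Returns true if `name` was already recorded; otherwise records it
    // and returns false.
    bool seen_before(std::string_view name) {
        assert(!name.empty() && "identifiers are never empty");
        // insert() returns {iterator, inserted}. inserted == false means
        // the identifier was already in the set.
        return !names_.insert(std::string(name)).second;
    }

    std::size_t size() const { return names_.size(); }

private:
    std::unordered_set<std::string> names_;
};
```

If the question were instead "give me identifiers in sorted order," the same data would want a different structure. The problem picks the container, not the other way around.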
Second: where does the cost arrive? Design won't show you. Design is clean boxes and arrows. Code reveals it. When I wrote my lexer, I didn't hit the cost in the diagram. I hit it when I realized a recursive descent parser trying to backtrack over raw character streams was burning CPU on re-lexing the same identifier five times. The cost wasn't visible in the "Lexer → Parser" box. It was visible in the profiler, in the branching, in the cache misses from string comparisons.
That's when the engineering choice crystallized. The lexer isn't there because compilers "should have one." It's there because tokenization is the point where you pay the string cost once, then never again. You transform the variable-length, unpredictable, cache-unfriendly character soup into fixed-size, predictable, cache-friendly labels. That's not a fancy design pattern. That's solving the first sub-problem: how do I make the rest of the pipeline fast enough to be usable?
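A rough sketch of what "pay the string cost once" can look like in C++. The `Token` layout, the `TokenKind` values, and the helper `text_of` are assumptions made for illustration, not my actual token struct. The idea is that a token carries a small label plus an offset and length into the source, so later stages compare integers and only go back to the text when they really need it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <type_traits>

// Hypothetical token labels; a real language would have many more.
enum class TokenKind : std::uint8_t {
    Fn, Match, Ident, LParen, RParen, LBrace, RBrace, Colon
};

// Fixed-size token: a label plus a view into the source by position.
// No owned string, so copying a token never allocates.
struct Token {
    TokenKind kind;
    std::uint32_t offset;  // byte offset of the lexeme in the source
    std::uint32_t length;  // byte length of the lexeme
};

static_assert(std::is_trivially_copyable_v<Token>,
              "tokens should be plain data");
static_assert(sizeof(Token) <= 16,
              "tokens should stay small enough to pack densely");

// Recover the original text only when a later stage actually needs it.
inline std::string_view text_of(const Token& t, std::string_view source) {
    assert(static_cast<std::size_t>(t.offset) + t.length <= source.size());
    return source.substr(t.offset, t.length);
}
```

With this shape, a parser checking "is this a left brace?" does a single integer comparison on `kind`. The character data is still reachable through `text_of`, but nothing touches it on the hot path.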
Now, why can't the parser live in the lexer? Why not just build the AST while tokenizing?
Because the lexer solves a linear problem. It walks left-to-right, one pass, no memory of nesting depth. The parser solves a non-linear problem. Take a complex recursive function—nested lambdas, match arms, closures capturing environments. The lexer sees this:
FN IDENT LPAREN IDENT COLON IDENT RPAREN LBRACE MATCH IDENT LBRACE...
Flat. Stateless. A conveyor belt of labels. It has no stack to track "this brace closes the lambda, not the match arm." It cannot, by design, handle recursion because recursion requires a tree, and trees require a builder that remembers where it is in the structure.
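Here is a small sketch of that conveyor belt. The function `lex`, the `Kind` enum, and the toy syntax it accepts are all assumptions for illustration; the input `fn f(x: T) { match x {` is simply one source line that would produce the token sequence shown above. Notice what is missing: no stack, no depth counter, no knowledge of which brace belongs to what.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical labels matching the token stream in the text.
enum class Kind { Fn, Match, Ident, LParen, RParen, LBrace, RBrace, Colon };

// One left-to-right pass. The only state is the cursor `i`.
std::vector<Kind> lex(std::string_view src) {
    std::vector<Kind> out;
    std::size_t i = 0;
    while (i < src.size()) {
        const unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }
        if (std::isalpha(c) || c == '_') {
            const std::size_t start = i;
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) ||
                    src[i] == '_')) {
                ++i;
            }
            const std::string_view word = src.substr(start, i - start);
            assert(!word.empty());
            // The string comparison happens here, once per word.
            if (word == "fn") out.push_back(Kind::Fn);
            else if (word == "match") out.push_back(Kind::Match);
            else out.push_back(Kind::Ident);
            continue;
        }
        switch (c) {
            case '(': out.push_back(Kind::LParen); break;
            case ')': out.push_back(Kind::RParen); break;
            case '{': out.push_back(Kind::LBrace); break;
            case '}': out.push_back(Kind::RBrace); break;
            case ':': out.push_back(Kind::Colon); break;
            default: break;  // unknown characters are skipped in this sketch
        }
        ++i;
    }
    return out;
}
```

A real lexer would report unknown characters instead of skipping them and would keep source positions, but the shape stays the same: a loop, a cursor, and labels coming out the other end.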
The parser sees the same tokens, but at depth 3, inside a match arm, inside a closure, knowing exactly which brace closes which scope. The lexer sees RBRACE and thinks "end of something." The parser knows "end of match arm, inside lambda, inside function." That's the difference between a label and a structure.
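The difference can be sketched in a few lines. The function `closing_scopes`, the `Scope` enum, and the `Lambda` keyword token are hypothetical names for illustration, and this is far from a full parser: it only shows the one piece of memory the lexer lacks, a stack that records what each open brace belongs to, so that every RBRACE can be matched to the scope it ends.

```cpp
#include <cassert>
#include <vector>

// Hypothetical token labels; Lambda stands in for whatever introduces
// a closure in the language.
enum class Kind { Fn, Match, Lambda, Ident, LBrace, RBrace };

// What an open brace belongs to.
enum class Scope { Function, MatchBody, LambdaBody, Block };

// Walks a balanced token stream and reports, for each RBRACE in order,
// which scope it closes. Precondition: braces are balanced.
std::vector<Scope> closing_scopes(const std::vector<Kind>& tokens) {
    std::vector<Scope> stack;   // the memory the lexer does not have
    std::vector<Scope> closed;
    Scope pending = Scope::Block;  // what the next LBRACE will open
    for (const Kind k : tokens) {
        switch (k) {
            case Kind::Fn:     pending = Scope::Function;   break;
            case Kind::Match:  pending = Scope::MatchBody;  break;
            case Kind::Lambda: pending = Scope::LambdaBody; break;
            case Kind::LBrace:
                stack.push_back(pending);
                pending = Scope::Block;
                break;
            case Kind::RBrace:
                assert(!stack.empty() && "unbalanced closing brace");
                closed.push_back(stack.back());
                stack.pop_back();
                break;
            case Kind::Ident:
                break;
        }
    }
    return closed;
}
```

Fed a function containing a lambda containing a match, the three closing braces come back as match body, lambda body, function, in that order. The stack depth grows with nesting, which is exactly the state a flat tokenizer should not have to carry.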
Could you force the lexer to track a stack? Could you make it "smart"? Sure. Now you've merged two problems into one stage. And when that recursive function nests ten levels deep, your lexer isn't tokenizing anymore—it's predicting, it's branching, it's holding state that grows with input complexity. The cost arrives in CPU exhaustion, in stack overflows, in the maze of entry and exit points where you can't tell if you consumed the full path or just the happy path. Burnout in code, not in design.
And here's what they don't tell you: the burnout hits the engineer before it hits the machine. The junior who reads three compiler books before writing a line of code. The team that designs for 1M users at 100 users. The developer who builds a distributed system because "microservices are best practice" when a monolith would have shipped in a week. The cost arrives in the human first—in the paralysis of premature abstraction, in the exhaustion of solving problems you don't have yet. That's why these stages exist. Not because a book prescribed them. Because that problem needed that solution.
The AST exists because some problems are inherently non-linear, and pretending they're linear doesn't make them linear—it makes them expensive. The parser is where you accept that cost upfront, where you build the explicit tree, where you make recursion manageable by giving it a structure that matches its nature. The AST isn't standard because every language has different non-linear truths. SIRL's match arms with explicit types need different nodes than C's switch statements. The AST shape is dictated by what your language actually does, not by what some book says it should look like.
Same for the IR. Same for the lexer. We know the stages (Lexer → Parser → AST → IR → Codegen), but the implementations diverge because the problems diverge. A JIT compiler might skip the AST for hot paths because its problem is latency, not optimization. An embedded compiler might use a different IR because its problem is register pressure, not vectorization.
This is the grit. The design phase gives you the illusion that you've solved it. The code phase reveals you haven't. Design doesn't map to code. It maps to intention. Code maps to reality. And reality is where the cost lives—in the branching, in the memory layout, in the cache lines, in the non-linear paths that design documents politely ignore.
Books give you answers to questions you don't have yet. The problem is yours; the answer should be too. Not from blind DSA application. Not from following a chapter. From asking what you actually need to know, and having the guts to let the code show you where the cost lives.
So the lexer stays dumb. The parser stays recursive. The AST stays language-specific. Not because it's elegant. Because each stage solves exactly one problem, pays exactly one cost, and exposes exactly one interface to the next stage. That's systems engineering. Not DSA for DSA's sake. Problem first. Cost second. Structure last.
And when you finally lower to IR and codegen to assembly, you've answered both original problems. No manual ASM. High-level code returned to its roots. But the path there wasn't found in a book. It was found by asking what the problem actually is, where the cost actually arrives, and having the grit to let code—not design, not someone else's experience—tell you the truth.
Two questions. Everything else is just typing.