gitgood.sh — from the trenches

Driver, not passenger

Lukasz Wisniewski — Sun, 17 May 2026 00:00:00 +0000

The label AI has always tracked hype, not capability. The research field has been an active discipline since the Dartmouth workshop in 1956 — McCarthy, Minsky, Shannon. The target was general intelligence — and hasn't moved.

The label is industry shorthand. Nobody who builds game AI thinks they're building intelligence. They're shipping a system that looks alive for the few minutes the player is paying attention to it. The same applies to LLMs.

LLMs are statistical autocomplete

LLMs are built on top of neural nets. The job is simple: given a chunk of text, guess the next token. Train one long enough — on most of the internet, on every book in the library, on a few decades of code — and the guesses start to look smart. None of it is intelligence — it's a very thorough study of what token usually follows what.

The model doesn't know. It predicts. No experience, no emotion, no knowledge, no critical thinking, no wisdom — none of the human factors that produce understanding. There's a model of what comes next, one token at a time.

What an LLM can produce, once harnessed, is absolutely remarkable.

Vibes all the way down — what a freaking nonsense

The term vibe coding makes me puke 🤮 "What you vibing?" — one may ask. Just button it... Andrej Karpathy came up with the term in early 2025. Everyone has been parroting it since, without thinking.

I don't want to have anything to do with vibe coders. Vibe coding is letting an LLM run loose, building black boxes with the attitude of "ahh whatever, if it breaks I ask clanker to fix it." That's asking for trouble. It may build you a simple thing — but a demo has nothing to do with a production-ready system.

The outcome is a tangled mess. Spaghetti code looks organised next to it. No structure. No architecture. Everything dumped in one file. Unnecessary code everywhere. Security holes bigger than the ones in Swiss cheese. Performance issues. Just a random mess assembled from random snippets that almost works.

AI assisted development

This is the term I like to use. I've been an incredibly heavy LLM user since day one — I hate it as much as I love it. My favourite joke about all this: if AI ever gets sentient, I'm at the top of the hit list. The amount of abuse I offload onto the models to find the right workflow has made sure of it.

There's no silver bullet. Everyone is different. Everyone works differently. Everyone has to find their own way to make it work. It works for me, and believe me, I put that thing through its paces daily.

One thing is undeniable: LLMs do better in some areas than others. Code generation tracks the training data — the more there is for a language, the better the output. For popular ones, decent. For niche ones, worse. For Odin, the models absolutely suck 🤷

When pointing an LLM at a codebase, the quality of that codebase matters a lot. Garbage in, garbage out. The model imitates what it reads. The same applies when generating from nothing: the training data quality is questionable, so the output is mediocre at best. This is where moderation comes in.

How I incorporate LLMs into my daily engineering

AI is just another automation tool. Nothing else. All it does is save me time, sometimes a lot of it. Most programming work is boring anyway: simple, repetitive code. In the grand scheme of things, the internet is basically CRUD. Generating code speeds up my workflow because I don't have to type it.

The difference is I'm in full control. I own the architecture. I decide the code organisation. I follow every line. I push back when needed. I ask for changes. I give directions. I ask for specific implementations. I verify what the model claims. I refuse output I don't understand. I iterate. I adjust manually. I refactor what doesn't fit. Small iterations work. "Build me a Facebook" doesn't 😂 The model is the tool. I'm still the engineer. If the result isn't code I'd have written by hand, it has no place in the codebase.

The code that ships under my name is mine — accelerated.

I use the LLM to research things I don't know — cross-referencing sources, parsing specs, drafting documentation. Faster than the manual route.

LLMs are invaluable when I'm learning new concepts and new languages. The rate at which I can pick up new ground is intoxicating — weeks of reading compressed into hours. Moderation, again. Never trust an LLM. Never. If it tells me something I don't know, I cross-reference. They hallucinate — that's how they work.

And of course, writing — which LLMs excel at. But there's a big but...

Ghost in the prose

I'm sure it doesn't surprise you: this very article was written with an LLM's help. The distinction matters: with help, not generated. There is no point hiding it. I'm allergic to slop. The kind of AI writing that "delves" into every paragraph and opens every post with "in an era of". I won't ship it.

Yeah, I could prompt it. Give it a title. Ask it to fill in some generic SEO-friendly text. No voice. No personality. Nothing interesting. Just slop. The internet is already drowning in it. I won't be adding to it.

Working on a RAG system recently, I discovered something fun: AI detectors are worthless. I fed one the full text of The Last Wish by Andrzej Sapkowski — published in 1993. The verdict: AI-generated, 60% confidence. (Yes, kids — The Witcher is from the 80s.)

Here's how I write. Hours of thinking. Collecting notes and anecdotes. Research. A draft written by hand. Then the LLM comes in — proofreading, grammar, awkward phrasing. More reworking. Shuffling lines around. Reading the thing a hundred times to make sure the words flow, the rhythm holds, the whole thing is coherent. Writing is a long, hard process of constant refactoring. The same applies to code. Neither will ever be perfect — it just has to be good enough. Make your own judgment based on that.

The model isn't in `git blame`

Unless it's vibe coding, then it is. There is absolutely nothing wrong with AI-generated code. None of it bothers me. What bothers me is companies bragging about the percentage of AI-generated code in their codebase. "80% of our code is AI-generated." It doesn't matter. It's all hype-driven. It's not an achievement. Code is a means to an end: shipping a product. The end user doesn't care if the app was written in Rust, Ruby, or PHP. They don't care if it was written by AI. They want to get their stuff done. They want to use the product. If they don't, you're in trouble, but if the code is good enough, the code isn't the problem. The product is.

Shipping a reliable product is a choice. A decision. Engineering is a craft — care, knowledge, judgment, experience. How you get there doesn't matter. A Sunday driver and an F1 driver both have a car. The car needs a driver.

What matters is the end result: software that's reliable, performant and secure.

TDD is a costume

Lukasz Wisniewski — Fri, 15 May 2026 00:00:00 +0000

If you write tests close to the code, your tests start shaping it. They expose tangled dependencies. They signal when a function is doing too much. They make you think about the interface before you commit to internals. This is the design feedback TDD takes credit for. You get most of it from any test written close to the code — not just from tests written first.

There's a broader version of the argument: writing tests first enforces design upfront. It doesn't. Design happens in your head before you write a line of anything. Tests are one way to commit a design, not a prerequisite for having one.

There are many ways to test software — manual, end-to-end, contract, integration, property-based, load, exploratory. TDD evangelism lives in exactly one of them: unit tests. The rest of the surface area is somebody else's problem.

TDD looks great on toy examples. Hey, let's TDD a stack. Push, pop, peek — three methods, no I/O, no dependencies, no surrounding system. That's not reality. Reality is a flaky third-party API, a payment gateway that times out, a schema that changed last week...

What actually happens

Someone announces they "do TDD." Pause. Repeat the announcement to anyone who walks past their desk. Then describe — at length — the colors of their feedback loop. Red. Green. Refactor. Like the alphabet, in case you forgot.

Bob

Not that long ago, I had an interview with a TDD evangelist. Let's call him Bob — to spare him the exposure. Bob wears a TDD hat the way some people wear tin foil — proud, tight, slightly off. Bob told me he writes tests first because — and I'm quoting — "I know what part of the system was tested." Two things went through my head:

How does someone like this actually get employed?
I'm talking to a badly trained model.

Frankly, I still don't fully understand what Bob meant by it. I suppose it was his winning argument. I disengaged with "I've never had your problems, so I don't know what you're going on about, mate." Not the most diplomatic interview move. An interview is a two-way process though — I'd rather wait for the right opportunity.

The only honest record of what's been tested is the code. Bob's argument collapses on first contact with a team. Code is shipped by groups of people. If "what was tested" lives only in one engineer's head, the reviewer, the on-call and the new hire six months from now all get nothing. That's not engineering discipline. It's personal note-keeping wearing a methodology hat.

The screenshot trophy

Then there's the trophy variant: a screenshot of green checkmarks running down a test list, posted to social media or a Slack channel for everyone to see. Some people share photos of their food. But posting a passing test as a milestone is the tell of someone new, trying to impress everyone with the bare minimum.

Testing is standard engineering practice. It's part of the job, not a milestone.

Tests are not magic

Every test is more code. And every line of code is a maintenance liability. The more code in the repository — written, copy-pasted, AI-generated — the higher the probability of bugs. Tests don't escape that math. They're code too.

Tests reduce bugs. They aren't a magic safety net. A test existing doesn't mean the code under it is bug-free; it just means it was checked. And every test you write is another surface to maintain, another file that has to be updated when the underlying behaviour changes, another thing that can be wrong. A bad test will lie about what the code does. A flaky test will train the team to ignore failures. A test that exists only to bump coverage is dead weight.

A test suite is only as good as the engineer who wrote it. If the engineer doesn't understand what should be true about the system, the tests will codify the misunderstanding — and now the misunderstanding has a green checkmark next to it.

Not all code earns a test

Plenty of code doesn't need a test:

Language features
Standard library functions
The framework's own well-tested behaviour

Writing tests for 1 + 1 because the coverage report says branch coverage dropped is busywork with a green checkmark on it.

Behavioural testing is what matters — what the function does for the system, not what each line evaluates to. The arithmetic is the language's job. The edge cases aren't — floating-point rounding, integer overflow, the boundaries where the runtime starts doing unexpected things. Those earn their tests.
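A small sketch of what an edge case that earns its test might look like. `split_bill` is a hypothetical function invented for this example, and keeping money in integer cents is one common way around float rounding, not the only one. The tests are plain functions with `assert`, so they run with or without a test runner.

```python
def split_bill(total_cents: int, people: int) -> list[int]:
    """Split an amount in cents so the shares always add up to the total."""
    if people <= 0:
        raise ValueError("people must be positive")
    base, remainder = divmod(total_cents, people)
    # The first `remainder` people pay one extra cent each.
    return [base + 1 if i < remainder else base for i in range(people)]


def test_float_addition_is_not_exact():
    # The reason the function above works in integer cents.
    assert 0.1 + 0.2 != 0.3


def test_shares_add_up_when_total_does_not_divide_evenly():
    shares = split_bill(1000, 3)
    assert shares == [334, 333, 333]
    assert sum(shares) == 1000


def test_zero_people_is_rejected():
    try:
        split_bill(1000, 0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for zero people")


test_float_addition_is_not_exact()
test_shares_add_up_when_total_does_not_divide_evenly()
test_zero_people_is_rejected()
```

None of these check that `divmod` works. They check the rule the system depends on: no cent goes missing, and nonsense input is refused.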

A unit isn't a function — it's a unit of behaviour that can be one function or a whole module. The size doesn't matter. The boundary does.

What gets tested is judgment — the thing in production with stakes, the business rule that has to hold, the boundary condition the language won't catch. Coverage requirements treat every test as equal. That's the bug.

Monkey tests

The moment coverage becomes a CI gate, behaviour changes. The team stops asking "does this work?" and starts asking "does this branch get visited?" The result: code that walks the runtime through every line, asserts nothing meaningful and ticks the coverage percentage up.

Coverage doesn't tell you what was tested. It tells you what was hit. A 95% coverage number means the runtime walked over 95% of the lines. It says nothing about whether any of those lines did the right thing under any of the inputs they'll actually see.

A strict coverage threshold trains the team to write monkey tests. The number goes up, the bug count stays the same and the suite gets heavier every sprint with code that asserts nothing.
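Here is a minimal sketch of the difference, assuming a made-up business rule: a discount capped at 50%. `apply_discount` and both test names are hypothetical. Both tests visit every line and both branches of the function, so a coverage tool would be expected to score them the same.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical business rule: the discount is capped at 50%."""
    if percent > 50:
        percent = 50
    return price_cents - price_cents * percent // 100


def monkey_test():
    # Visits both branches, so coverage looks complete.
    apply_discount(1000, 10)
    apply_discount(1000, 90)
    # No assertion: this still passes if the cap is removed or the math is wrong.


def behavioural_test():
    assert apply_discount(1000, 10) == 900  # ordinary discount
    assert apply_discount(1000, 90) == 500  # the cap holds
    assert apply_discount(1000, 50) == 500  # boundary value


monkey_test()
behavioural_test()
```

Delete the cap and only one of these two goes red. The coverage number can't tell them apart.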

Where tests pay off

People like to say a well-tested codebase gives you fearless refactoring. True — to a degree. Rename a function, restructure a module, swap an internal implementation — and the suite tells you what broke. It's a double-edged sword though — passing tests don't mean nothing broke, only that the tests still pass. Test counts, coverage numbers, methodology stickers on a laptop — none of that ships better software.

The size of the test suite is influenced by the language choice. Statically typed languages catch the dumb mistakes at compile time: wrong type, missing field, function signature mismatch. The type checker is built in, not bolted on. Dynamically typed languages don't have that backstop, so they bolt one on: TypeScript on top of JavaScript, mypy on top of Python. Different language, different overhead. What gets tested is what the tooling won't catch.
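A rough illustration in Python with type hints. `Order` and `format_total` are hypothetical names. The assumption is that a static checker such as mypy is run over the file; Python itself ignores the hints at runtime.

```python
from dataclasses import dataclass


@dataclass
class Order:
    order_id: int
    total_cents: int


def format_total(order: Order) -> str:
    """Render the order total as a decimal string, e.g. 1999 -> '19.99'."""
    return f"{order.total_cents / 100:.2f}"


print(format_total(Order(order_id=1, total_cents=1999)))  # 19.99

# A static type checker is expected to reject the call below before any test runs,
# because a str is not an Order. Without one, it only fails at runtime with
# AttributeError, and only if some test or user happens to reach it.
# format_total("42")
```

With the checker in place, nobody needs a unit test for "what if someone passes a string". The test budget goes to what the tooling can't see.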

The ceremony doesn't matter. Balancing a pogo stick on a beach ball doesn't matter. What matters is the end result: software that's reliable, performant and secure.

Exorcisms like this are the engineering version of insisting tea must be stirred clockwise — that stirring it anti-clockwise spoils it. The tea doesn't care. The ritual is for the person performing it — they might as well sacrifice a chicken on the last day of the month while they're at it.