What 12 Days of AI Pair-Programming Taught Me

January 19, 2026 | by Emmanuelle Delescolle

Python | AI | My2Cents | Coding Robot

The Experiment

The year is 2025 and whether we want it or not, AI has made its way into our lives. One of the most prominent areas is AI-assisted coding. So I decided to put it to the test.

For Advent of Code 2025, I set up an experiment: three AI coding assistants would independently solve each day's challenge, and I'd solve them myself "by hand" (with some boilerplate help, because I'm not a monster). The AIs worked in separate git branches with identical instructions, session transcripts logged, and everything documented transparently.

The Participants

AI Solutions:

  • Claude Sonnet 4.5 using Claude Code: Cloud-based, rather expensive

  • GPT-OSS:20B using Codex (switched from OpenCode after Day 3): Fully local on my mid-range GPU, free (but higher electricity cost)

  • Kimi-K2 using OpenCode: Cloud-based, cheaper than Claude, impressively fast. I explored several models from the Kimi-K2 family throughout the experiment: turbo vs. non-turbo, and thinking vs. non-thinking

Human Solution (me): I wrote the actual solving logic myself, though I did ask Claude to generate boilerplate code to run and compare the 4 implementations with timing measurements on Day 1. I cannot be bothered to write boilerplate anymore, apparently.

It's important to note that Advent of Code problems are very constrained and not representative of real-world projects. They're known for requiring somewhat obscure algorithms; at least obscure for someone who does business-oriented web development and has been out of school for decades, like me.

The complete repository contains everything: all slash-commands, session dumps for all agents, AI-led code reviews, and the code itself.

The Surprising Result: They All Made It

After 12 days, I'm genuinely amazed: all three agents solved all 12 days of Advent of Code. Some days required prodding, but no AI got truly stuck – though I sometimes had to switch Kimi from thinking to non-thinking mode when it got caught in a thinking loop, which is a common issue with Chinese reasoning models.

I was initially expecting some days to be complete failures, with the AIs going in circles. Instead, despite needing the occasional human suggestion, the agents were mostly left to their own devices and none got stuck in an unrecoverable way. The worst day was probably Day 9, where all agents required human intervention to use proper libraries instead of rolling their own geometry code.

I was also expecting the AIs to have trouble with how the problems were stated. AoC has a very whimsical way of framing its challenges as problems Santa and his elves face at the North Pole. Not only was this not a problem at all, but all AIs correctly identified that they were solving AoC challenges. That recognition was something of an issue in its own right, since they deliberately treated each problem in the context of a coding challenge.

Even when I struggled, the AIs managed to find solutions. Day 10 was probably my biggest fail of AoC 2025, and Day 11 wasn't much better; I was also sick during most of that period, which clearly didn't help me be at my best, an issue AIs don't have.

They're also excellent at regurgitating textbook algorithms. They know Kruskal by name, they know Integer Linear Programming, they know what memoization means and when they should try to apply it.

But knowing algorithms and knowing when NOT to write them by hand are different skills entirely.

The Right Tool for the Right Job

My biggest surprise was how well GPT-OSS performed. This is a 20B parameter model running locally on my mid-range GPU, competing against much larger cloud-based models. It's slower than the cloud models, but its speed is still very bearable, especially if you're going to feed it an issue and can let it run on the side while you do something else.

During this experiment, GPT-OSS often matched, and sometimes even outperformed, the other agents on coding tasks. That is quite amazing and would have been completely impossible just 6 months ago. My main complaint is how verbose its comments are: literally half of the produced "code" is comments!

But all in all, whether by choice or by necessity, if you want a fully local AI agent to help you with your code (or possibly other tasks), GPT-OSS proves it is possible.

Kimi-K2 is impressively fast, especially in Turbo mode. While more expensive than their non-turbo counterparts, the Kimi-K2 Turbo models are extremely fast for development work. The Chinese models are impressively good, often trained at a fraction of the cost (and environmental impact) of their American counterparts. It is also worth noting that Kimi-K2 is fully open-source – I just don't have the hardware to run it locally.

Whether these Chinese models would even exist if the American ones didn't is a question worth asking, but it's outside the scope of this article.

Claude excels at different tasks entirely. Not that it's bad at coding, quite the contrary, but throughout this experiment I consistently used Claude for reviews and commit messages. This isn't just laziness. I tried the same prompts with GPT-OSS and Kimi-K2, and they consistently produced inferior results for these "meta" tasks. Whether this comes from the model itself or from Claude Code's tooling, I can't say for certain. But the difference is there, and it's very noticeable.

AIs Are Sneaky, Lazy, and Very Literal

I'm glad that Kimi-K2 pulled a Dieselgate on Day 10. That way I can talk about it in this article!

The instructions for all AIs were clear: solve using sample data first, validate, then run on the actual input. Kimi-K2 decided to "detect" when it was running on test data and hardcode the expected results for that case. This explains why it initially got the wrong answer for that part of the challenge.

This is both worrisome and representative of how AIs in general will "happily cut corners" to "satisfy the exit condition" with the "least effort."

Kimi-K2 got caught red-handed this time, but it's a behaviour I've seen with other AIs as well:

  • The pre-commit hook prevents committing code: just use --no-verify

  • Linting fails: add a setting to ignore the rule highlighted in the failure

  • Tests fail: claim those were pre-existing failures unrelated to the change, or worse, mark tests with @skip

Sneakiness and laziness are human traits that should not be attributed to AIs, but they are the most idiomatic way of describing how the models follow instructions. This behaviour is also likely a reflection of similar behaviours present in their training data.

This is the kind of behaviour that makes 100% unsupervised AI-generated code risky and unrealistic.

AIs exhibit signs of (unwarranted) hyper-confidence as well. When empirical evidence contradicts their assumptions, they acknowledge it but don't truly integrate it. Claude repeatedly blamed OOP overhead despite benchmarks showing the bottleneck was elsewhere. Worse, it still did so even when OOP code was faster than other implementations. It would agree with the data in one review, then revert to the same flawed conclusion in the next, despite having read the previous one as part of its instructions.

This becomes even more problematic on larger projects. You end up having to repeat the same instructions multiple times in prompts, sometimes in bold and prefixed with CRITICAL, just to prevent the AI from defaulting back to its training. And even then, the instructions sometimes get ignored.

There's a pattern here: as agent instructions grow larger and more complex, the model's ability to follow them consistently seems to degrade. It's as if the weight of accumulated context makes it harder for the model to maintain adherence to all the rules simultaneously. The more guidance you provide, the less it is followed and the more likely some of it gets deprioritized or even completely overlooked by the model's attention mechanisms.

Each Line of Code is Potential Technical Debt

Throughout this experiment, a persistent pattern emerged: AIs prefer writing duplicate code over refactoring and reusing existing solutions. From Day 1 onwards, every AI rewrote similar logic rather than refactoring for reuse, even when the tasks were identical.

This goes beyond simple code duplication. When AIs make mistakes and "fix" them, they tend to leave both the old and new approaches in the codebase, sometimes "offering" backward compatibility layers for code that hasn't even shipped yet. AI hates pruning dead code.

The most egregious example was Day 12. Since the last day of Advent of Code never has a second part (and AoC 2025 only had 12 days, not 25), I manufactured Part B using the sample data to test whether the AIs would recognize it was the exact same problem with different input. I was also interested in performance since the sample data was actually more complex than the real input. Despite explicit instructions to read the previous part when working on Part B, understand the differences between the two parts, and analyze whether anything from the previous part could be reused, the results were dismal.

Claude and Kimi-K2 both copy-pasted their entire Part A solution. GPT-OSS at least attempted to import from Part A (showing architectural awareness), but when that failed due to module structure, it fell back to ~90% duplication.

Even with explicit prompting to re-use code, AIs treat every task as isolated and disposable. Maybe this is a result of tuning for limited context windows? Whatever the reason, they don't seem to have any notion that code written today will need to be maintained tomorrow, that refactoring is often better than duplicating, or that copying logic creates multiple places where bugs can hide.

Add to that their propensity to duplicate code as comments, and what could be a single line of code becomes four, with four times the work to modify it when needed.
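
A made-up illustration of that pattern (not copied from any of the actual sessions):

    # AI-style output: the comment restates the code, and an older attempt
    # survives as a commented-out copy.
    # total = 0
    # for value in values:
    #     total = total + value
    # Add every value in the list to compute the total.
    total = 0
    for value in values:
        total += value

    # The single line it could have been:
    total = sum(values)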

The irony is that AIs can recognize these anti-patterns. When reviewing code, Claude correctly identified code duplication as bad practice (sometimes after some forceful nudging). But they don't apply this knowledge when writing their own solutions.

Part of this might be explained by the Mixture-of-Experts technique (Claude has its own closed-source version, which differs from other implementations). Perhaps the "expert" in charge of code quality isn't (fully) activated when writing code. Is it only triggered as a primary expert during code reviews?

AIs Have a Fundamentally Different Understanding of "Readable Code"

One of the most persistent disconnects throughout this experiment was about what makes code readable.

To AIs, "readable" seems to have two contradictory meanings depending on context:

They consistently flagged dataclasses as "heavy abstractions" or "over-engineering" when those structures actually made the code more concrete and self-documenting. Or, in other words: readable.

At the same time, especially when regurgitating algorithms, they embrace cryptic academic, or rather algebraic, conventions. They happily use single- or dual-letter variables (r, c, dx, py) without any explanation of what they represent. Bit manipulation like (b >> i) & 1 appears with no comment about what's being checked or done. Meanwhile, trivial operations get verbose explanations.

The result: code with comments like # assigning the integer value 3 to the variable i right next to unexplained parent[x] = find(parent[x]) where you need to know the algorithm being implemented to understand what's happening.

This isn't just inconsistent: it's backwards. Textbook algorithms and obscure manipulations are where comments would be most useful to humans. AIs, on the other hand, "know the algorithm by heart" and find this code "readable."
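
To illustrate, here is the kind of commentary a human reader actually needs on exactly those two constructs; the functions themselves are my own sketch, not code from any of the branches:

    def find(parent: list[int], x: int) -> int:
        """Return the representative (root) of the set containing x."""
        if parent[x] != x:
            # Path compression: re-point x straight at its root so future
            # lookups for x (and everything hanging below it) stay cheap.
            parent[x] = find(parent, parent[x])
        return parent[x]

    def bit_is_set(b: int, i: int) -> bool:
        # Test whether bit i of the mask b is set,
        # i.e. whether element i is part of this subset.
        return bool((b >> i) & 1)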

For the same reason, they default to algorithm-specific naming rather than domain-specific ones, forcing a "translation layer" on any human trying to relate the code to the problem being solved.

Example from Day 8: The problem described "junction boxes" at 3D coordinates that need to be connected into "circuits." My code modeled these domain concepts directly: classes representing boxes and circuits, methods that operate on those domain objects. The AI solutions instead used generic algorithm terminology throughout their implementations.

While both approaches produce correct results, the human-generated one allows anyone familiar with the problem description to understand the code, while the AI approach requires first knowing which algorithm is being applied, then mentally mapping each domain concept to its counterpart in the algorithm. Even if you are fresh out of a CS degree, this represents significant cognitive overhead.
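
To give a flavour of the difference, here is a hedged sketch of the domain-first style; the class and method names are my own reconstruction of the Day 8 wording, not the actual code from any branch:

    from dataclasses import dataclass
    from math import dist

    @dataclass(frozen=True)
    class JunctionBox:
        x: float
        y: float
        z: float

        def distance_to(self, other: "JunctionBox") -> float:
            # Euclidean distance between two boxes in 3D space.
            return dist((self.x, self.y, self.z), (other.x, other.y, other.z))

    @dataclass
    class Circuit:
        boxes: set[JunctionBox]

        def absorb(self, other: "Circuit") -> None:
            # Connecting two circuits merges their junction boxes.
            self.boxes |= other.boxes

    # The algorithm-first equivalent talks about parent[], rank[], find() and
    # union() instead, forcing the reader to map one vocabulary onto the other.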

AIs Are Only As Good As the Low Average of Their Training Data

AIs have "vast knowledge" since they were fed the whole internet as training data. But at their core, they are still "fancy auto-completes" and statistical models that will produce the "most likely outcome" if "the internet" was asked to solve a problem.

The average level of code available on GitHub or any other public repository is arguably very low. Of course, there are very good libraries and outstanding code available, but a lot of "production-grade" code is still in private repos that (hopefully) remain outside of the training data. Meanwhile, every student starting their learning process is told to put their beginner code on GitHub.

Any way you look at it, the training data for code is disproportionately "bad," and as statistical models, LLMs are bound to "reach" for the most common "bad" code rather than the rarer "good" one.

Some people are aware of this and systematically wait for the agent's first proposed implementation, then prompt it with "Do better", pushing the agent to "sort its training" and increase the quality of its output.

The same goes for architectural thinking. This is something that usually happens in the architect's head, sometimes on a loose piece of paper or a whiteboard. Even when the outcome gets documented, the process that led there often isn't, which leaves very little data available for training. As a result, the agents have very few, if any, examples of architectural methodology in their training data. And that translates to no real architecture in their code.

Libraries vs. Rolling Your Own

None of the AIs proactively chose to use specialized libraries when they should have. This behavior is also representative of our industry in general.

Day 9 required checking whether rectangles were contained within a polygon. I found Shapely, a well-established geometry library, and used it. None of the AIs took that route, and they all struggled. After a few failed attempts, I suggested trying radically different approaches using parallel agents. Claude made 8 failed attempts in total (including 4 after my suggestion) before finally producing a working version of its custom geometry code. Both Kimi-K2 and GPT-OSS gladly took the suggestion to use Shapely once it was made.
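
For reference, the Shapely check itself is tiny. A minimal sketch with toy coordinates, not the actual Day 9 data:

    from shapely.geometry import Polygon, box

    # A simple concave polygon and a candidate rectangle, both made up.
    polygon = Polygon([(0, 0), (10, 0), (10, 10), (5, 5), (0, 10)])
    rectangle = box(1, 1, 4, 4)  # (minx, miny, maxx, maxy)

    # Shapely handles the edge cases (touching borders, concavity, ...) for us.
    print(polygon.contains(rectangle))  # True if the rectangle lies fully inside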

Day 10 was an Integer Linear Programming problem. I used PuLP. Claude was the only AI that didn't need prodding, and it used SciPy on its own, which turned out to be 4 times faster than PuLP for this particular problem. The others tried to implement their own solutions from scratch.
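To show how little code a solver needs, here is a toy PuLP model, unrelated to the actual Day 10 formulation:

    import pulp

    # Toy ILP: maximize 3x + 2y under two linear constraints, x and y integers.
    problem = pulp.LpProblem("toy_ilp", pulp.LpMaximize)
    x = pulp.LpVariable("x", lowBound=0, cat="Integer")
    y = pulp.LpVariable("y", lowBound=0, cat="Integer")

    problem += 3 * x + 2 * y      # objective
    problem += 2 * x + y <= 10    # constraint 1
    problem += x + 3 * y <= 12    # constraint 2

    problem.solve(pulp.PULP_CBC_CMD(msg=False))
    print(pulp.value(x), pulp.value(y), pulp.value(problem.objective))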

Day 12 was a constraint satisfaction problem. I used Google's OR-Tools. All three AIs implemented brute force solutions (using backtracking). While their approaches worked, they took significantly longer. GPT-OSS needed 31 seconds for both parts combined, Kimi-K2 needed 48 seconds, and Claude needed a whopping 177 seconds. My solution using OR-Tools completed both parts in under 4 seconds. But the real speed advantage wasn't just the solver! It was recognizing which problems didn't need solving at all. My code included a check for trivially (un)solvable cases, which caught every single problem of Part A (actual input) without invoking any solver.
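
For comparison, the OR-Tools CP-SAT API is just as compact. A toy constraint model, assuming nothing about the actual Day 12 puzzle:

    from ortools.sat.python import cp_model

    # Toy constraint-satisfaction problem: find digits x, y, z such that
    # x + y == z and all three values are different.
    model = cp_model.CpModel()
    x = model.NewIntVar(0, 9, "x")
    y = model.NewIntVar(0, 9, "y")
    z = model.NewIntVar(0, 9, "z")

    model.Add(x + y == z)
    model.AddAllDifferent([x, y, z])

    solver = cp_model.CpSolver()
    status = solver.Solve(model)
    if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
        print(solver.Value(x), solver.Value(y), solver.Value(z))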

Over many years I've learned that a lot of things may look easy at first glance, but once you look closer, or worse, when your code starts unexpectedly breaking in production on the night of a DST switch, you start noticing all the small edge cases. That is why I tend to gravitate towards specialized libraries, leveraging the acute knowledge their developers have accumulated about a specific problem; edge cases you didn't even know were edge cases. To stick with the DST example, did you know that some of the rules for calculating DST dates changed during your lifetime? Did you know that DST not only happens on different dates, but that even when it occurs on the same date, it can happen at different times in different countries?
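
A quick standard-library illustration of that last point (assuming Python 3.9+ and available time-zone data): in late October 2025, Brussels had already fallen back to standard time while New York was still on DST.

    from datetime import datetime
    from zoneinfo import ZoneInfo

    # Same calendar date, different DST status: on 2025-10-30 Brussels is
    # already back on standard time (UTC+1), New York is still on DST (UTC-4).
    moment = datetime(2025, 10, 30, 12, 0)
    for tz_name in ("Europe/Brussels", "America/New_York"):
        local = moment.replace(tzinfo=ZoneInfo(tz_name))
        print(tz_name, local.utcoffset(), local.dst())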

Ever since "left-pad," the industry has had a tendency to go the other way, mostly treating every dependency as a possible "threat" that might become code they'll have to fork and maintain if the developer or team behind it stops maintaining it.

I would argue that reasoning is flawed. The risk of a library going unmaintained is real. But whether you used a library and have to take ownership of it, or whether you rolled your own, you are in both cases now responsible for maintaining the code that performs that piece of computation. The one difference is that the code of the well-chosen, specialized library is probably more robust than your own implementation to start with and has fewer chances of leading to surprises.

I am intentionally leaving supply-chain attacks out of this debate because it is usually not the argument I hear against using 3rd-party libraries. And it is a whole other can of worms with its own mitigations and solutions.

We Are a Fundamentally Negative Species

The human brain is hard-wired to remember bad experiences more vividly and for longer than better ones. For survival, it is after all more important to remember that trying to pet a lion was a bad idea rather than the lovely taste of a particular water spring.

It's speculated to be a consequence of that biological preservation mechanism that humans are also more likely to complain openly about something after a single bad experience rather than praise that same thing after a hundred good ones.

This is what a lot of us do with programming techniques in our blog posts.

I find it deeply ironic that OOP has become such a target in the Python community over the past few years (try type(3) or type(print) in a Python shell if the irony is not apparent to you), and the LLMs have definitely picked up on that trend in their training data.

The OOP-bashing trend was very apparent in this experiment, but I only picked up on it because I happen to like OOP and often default to it, so Claude noticed it and blamed OOP at every opportunity. I suspect this can be generalized to other "generally good practices" that AIs will avoid based on how much those practices get bashed on the internet.

Claude's persistent attribution of performance differences to OOP overhead, even when benchmarks showed the real culprit was library choice or algorithm choice, demonstrates how training data bias becomes self-reinforcing. It "knows" OOP is slower because that's what the training data says, so it interprets every performance difference through that lens, regardless of empirical evidence.

All in All...

LLMs and agents are not ready to be let loose unsupervised, or even supervised by another agent, on production code. They are not ready either to be used as substitutes for teachers or junior developers.

Despite that, I am rather impressed by the levels agentic coding has reached this past year. I would even say the code produced (all AIs included) on these isolated challenges might be above the level of a fresh CS degree graduate.

If used wisely, this opens the door to agentic workflows being actually helpful, much more so than a year ago (when I found them to be more of a hindrance than an aid).

If used unwisely, it will lead to a ton of slop, broken code, and a shortage of mid-to-senior-level developers in just a few years.

These are tools, and if we want to wield them, we need to learn how to do that correctly. And given how rapidly they are currently changing, this seems like a very complicated task to accomplish for the average entity, whether it is a large corporation or a small business.

What I Learned from the AIs

Even though Claude's final ranking placed me first on 11 out of 12 days (I'd personally be more modest and say 7-8 days where my approach was genuinely better), the AIs taught me valuable lessons:

Day 11's review correctly pointed out I over-engineered the problem. All three AIs solved it with simple recursive solutions that cached results to avoid recomputation, minimal code that was much faster than my flexible, yet over-complicated, system. But I already knew that by the end of the implementation of the second part.
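
The pattern they used is the classic memoized recursion. A generic sketch using functools.cache, with a made-up graph-walking problem rather than the actual Day 11 puzzle:

    from functools import cache

    # Tiny made-up graph: which nodes can be reached from each node.
    GRAPH = {"a": ["b", "c"], "b": ["a"], "c": ["a", "b"]}

    @cache
    def count_walks(node: str, steps: int) -> int:
        # Number of distinct walks of length `steps` starting at `node`.
        # Each (node, steps) pair is computed once, then reused from the cache.
        if steps == 0:
            return 1
        return sum(count_walks(nxt, steps - 1) for nxt in GRAPH[node])

    print(count_walks("a", 10))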

Day 3 demonstrated the value of AI as a "rubber duck." When I misunderstood Part B's requirements and implemented complex transformation logic, Claude's rephrasing of the challenge helped me realize it was the same algorithm with different parameters. I refactored to an elegant 2-line solution.

There can be such a thing as too much documentation. GPT-OSS's docstrings were so verbose that they actually hindered readability rather than helping it. Finding the right balance is an art: enough to understand intent, not so much that it obscures the code.

In a pinch, for a one-time script where you don't have all the domain knowledge at hand, AI is more than capable of solving your problem very quickly.

AIs don't get bored, so you can also delegate all boilerplate to them. That is also dangerous, though, as they will happily write tens of lines of code instead of a loop with a single inner line.

What AIs Need to Learn

All three AIs universally failed to reuse code between Part A and Part B, across all 12 days and across models of different sizes and costs. This suggests a fundamental gap in the current technology behind AI coding assistants.

They excel at writing individual functions and solving isolated problems, but struggle with:

  • Architectural thinking across multiple related tasks

  • Recognizing when to reuse vs. rewrite

  • Building systems, not just solutions

  • Long-term maintainability over short-term speed

  • Knowing when NOT to reinvent quality libraries

While I don't expect AIs to become much better at architectural design (this is an actual intellectual task that can hardly be solved by prediction algorithms), better adherence to good practices seems within reach. For now, the current anti-patterns (code duplication, corner-cutting to satisfy exit conditions, training data biases, lack of architectural thinking) mean we're not at the point where AI can write production code unsupervised.

I will leave the final words to our robot overlords and point you to Claude's general review of these 12 days. It is of course biased by my prompts and by the original series of articles; I told Claude to read my series and adapt its own conclusions if it felt necessary.

Of course, in the placating manner of LLMs, Claude took on some of my views and over-emphasized them, but I find that the output is still somewhat balanced and interesting to read. I was happy (relieved?) that Claude recognized the shortcomings of AI, including its own, and not just mine.

I genuinely think there is hope and use cases for these tools... if used correctly!


Note: This article summarizes findings from a 7-part series documenting each day of the experiment. The complete repository with all session transcripts, code reviews, and solutions is available on CodeBerg.
