/adrift/ by Stantonius

Floating without direction across the digital ocean in an unfinished dinghy.

Beware of AI-generated tests in brownfield projects

I am annoyed that I have to write this, because I knew this risk. But alas, here I am, and I suspect this happens to many people and will continue to happen until we hit some stable equilibrium between LLM coding ability and our understanding of how to work with these models properly when developing software.

I once again put too much confidence in Claude following its coding update last week - this happens every time a new model comes out. The pattern goes like this: I hear how much better it is at coding, then a few simple tests "confirm this", I hand over the reins and trust the model with more autonomy over the code, and then finally I look in bewilderment, ask "what have I done?", and throw out the git branch (OK, use Ctrl + Z to undo).

An aside on Cursor Composer/Agent

Cursor Composer/Agent mode makes it even easier to hand over control. Remarkably, I'm actually starting to question the long-term value of Agent mode in Cursor. I will continue to use it for now, but it's no longer the "game-changer" I once thought it was, and I need to change how I use it. Using Cursor Agent reminds me of the effect of a single beer on creativity. At first, it's magic - it loosens you up and does seem to work (unsure if it's actual creativity or just the dopamine rush). But then you have another, then another (see blindly clicking "Accept" over and over in Agent mode), and soon enough you are completely disoriented, have made a fool of yourself, and may need to apologize for your recklessness (i.e. to yourself, for wasting your own time when you knew better).

You may also argue that having Cursor rules and notepads set up correctly will help. This may be true to a degree, but I find the models (especially the "reasoning/thinking" models) don't always follow these rules, especially the longer an interaction gets.

Writing tests with LLMs on brownfield projects

Here is where I made the mistake. I switched away from nbdev for writing Python code, which meant writing Python tests the standard way (this may itself have been a mistake, I'm not sure yet - more to come on this). However, if you are not careful, Cursor agents will start "fixing" the code to pass the tests. You have to be vigilant because this is such an insidious pattern: it generates false confidence. What good coders do is evaluate the authenticity of tests while continuously questioning the purpose and function of the code being tested. The LLMs seemed able to handle only one of these two dimensions at a time - either write comprehensive[1] passing tests or improve the code, but not both simultaneously.
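
To make the pattern concrete, here is a minimal sketch of how it plays out. The function and file names are made up for illustration - this is not my actual code:

```python
# pricing.py (hypothetical) - the *intended* behaviour is a 10% discount
# only for orders of 10 or more items.
def order_total(unit_price: float, quantity: int) -> float:
    subtotal = unit_price * quantity
    if quantity >= 10:
        subtotal *= 0.9  # bulk discount
    return subtotal


# test_pricing.py - the agent guesses the intent and asserts that *every*
# order gets the discount:
def test_discount_applied():
    assert order_total(5.0, 2) == 9.0  # fails: the correct total is 10.0

# The insidious part: rather than surfacing the failing assertion as a
# question about intent, the agent edits pricing.py so the discount applies
# everywhere. The suite goes green, the behaviour is now wrong, and the
# green checkmarks read as confidence.
```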

Again, you can ask the LLMs to obey a pattern of only writing tests and not changing code, but there's no guarantee they will listen (or that you will have the strength of character to resist accepting their proposed code changes when they are so convincing).
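
For what it's worth, the instruction I now keep pinned for the agent reads roughly like the sketch below (mine lives in a .cursorrules file, but the exact file and wording will depend on how you have Cursor set up). It helps, but as noted above, it is not a guarantee:

```
When asked to write or fix tests:
- Only create or modify files under the tests directory.
- Never change application code to make a failing test pass.
- If a test fails, report the failure and ask whether the expectation in
  the test or the code under test is wrong before changing either.
```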

When will I learn?

The messy reality is that, sure, the model is probably better than its predecessor at coding. But no, it cannot read my mind, nor can it understand what my entire codebase is supposed to do (I understand that cleaner code and better documentation may improve this outcome slightly).

In summary, stop putting blind faith in LLM coding abilities, regardless of the press release that comes with each newer model. Or write better, more thoroughly documented code. Either way, learn the lesson, because right now I am rewriting all of the tests that were written (aka blindly accepted) last weekend.

Footnotes

  1. I actually haven't found them to write comprehensive tests. They write a lot of tests, but the depth of these tests still seems superficial unless you, the developer, really interrogate and prompt for more.