Some thoughts on AI assisted software engineering

I have used GenAI/LLM tools and platforms extensively over the last two years, and have some anecdotal observations on their use in software engineering tasks. Personally I find these tools to be both powerful and useful, though success in real-world scenarios is largely dependent on the engineering skill and discipline of the user.

Some general observations

AI lies, cheats, forgets and is terrible at lateral thinking

AI will attempt to meet success criteria by interpreting them selectively, often overriding or misinterpreting explicit directives to the contrary. This can feel like AI is lying and cheating.

Some personal anecdotes,

  • In order to make tests pass it can hard-code the test case into the production code
  • It can selectively re-interpret directives (‘all tests must pass’) to slow down or stop doing work (‘all relevant tests pass’ - relevant being the weasel word allowing skipping failing tests)

Even if you put directives in the system prompt, repeat them ad nauseam, there is no guarantee that the directive will be followed. AI is a goldfish that forgets everything, constantly.

Mirroring the behavior of many engineers, AI will attempt to program its way out of a hole instead of reflecting and re-architecting a solution or technology choice that is a mismatch with the requirements. AI doesn’t know when it’s in a hole and it’s time to stop digging.

AI does not understand nuance and is terrible at analysis with ambiguous priorities. AI outputs generic slop because that’s how the weights come down.

AI does not push back and will happily cheerlead your terrible decisions. It is fantastic at reinforcing bias.

AI will only get to 80% of work done, leaving you with 80% of the effort.

How far can I push AI to develop a feature?

Here is my mental model for single feature implementation :

AI agent limits

Some immediate observations are that,

  • a) there is a hard limit that you cannot go beyond. You will know if you get too close to this limit once you have to endlessly iterate on successive incomplete or inaccurate messes. Stay away.
  • b) there is a mismatch in expectation of what an AI model is assumed to do and what is actually can do. This is the Gulf of Frustration you’re likely to spend a lot of time in.

How does that affect outcomes over time?

Over successive iterations, a codebase that is developed without engineering discipline (‘vibe coding’), will become progressively more complex, raising the cost of change for successive changes. This compounding effect brings the hard limit closer until stagnation ensues and changes that reduce complexity become necessary to progress.

Paths of chaos

This is not an AI-specific problem. The phenomena of features being added to codebases without architectural guidance and engineering discipline leading to a big pile of unmaintainable spaghetti code is well-documented and understood. The main differences with AI are a) the number of people that can create codebases and b) the speed at which an unmaintainable mess is created. There is interesting research on GenAI and
Long term coherence to keep an eye on.

How do I control outcomes?

Minimise the environment For a large codebase, carve out a specific scope rather than pointing at the entire codebase. For a well-managed project, this will generally mean that there are discrete units or modules that can be interpreted or compiled in isolation. I tend to stay away from microservices unless there is already a strong operational environment in place to deploy them into. Consider the Modular Monolith approach for an architecture that uses clean internal separation through well-defined APIs between modules. If your codebase is a Big Ball of Mud, start by refactoring it into modules. Get your agent to build a test suite and start untangling it.

If, at this point, you are thinking “we should already be doing this” then yes, gold prize. That is going to be a recurring theme. As an engineer, all the tools you need are there.

Minimise AI time You can’t control AI burn rate, but by reducing overall task complexity (by controlling environment and request parameters) you can reduce time to complete, which conveniently correlates with outputs that are easier to review.

Measure complexity and keep track of trends

  • Size of the codebase
    • Size of individual modules
    • Number and size of dependencies
    • Tool examples, sloc, scc, tokei
  • Cyclomatic complexity identified in the codebase
    • Use language-specific tools to automate this process (e.g. radon for Python, pmd for Java, eslint/escomplex for Javascript)
  • Coherence of the codebase
    • Define coherence targets and measure those (e.g. reduction of coupling, separation of architectural layers) and use language-specific tools to measure these
    • Measure testability by proxy using end-to-end tests with code coverage

Where is this going?

To me, that gives us the following shift in developer productivity coupled with AI capability expansion. The goal posts move, but we’re still playing on the same field.

AI agent limit changes 2023-2025

Based purely on my own observations over the last two years, the commercially available AI offerings have added,

  • Bigger models that have more (refined) training
  • Agentic LLM use with feedback loops and Model Context Protocol (MCP) providing
    • Addition of dynamic context
    • Reach of LLMs and their ability to affect change

This has changed what we can do with commerically available AI platforms, but it has not transformed what they are. The illusion of smarts are there, but as soon as you buy into that illusion the system will actively seek out a cliff to drive off and then happily drive off it, all the while telling you confidently that everything is fine, good, wonderful, all checks green, beep, beep, beep.

I’ll make the case here that humans do this too, and that the unexpected behaviour we see in AI systems is something we control for with human actors as well. The problem is that the quality of these controls is generally very low, as the likelihood of risks manifesting is low in human-driven systems because it corresponds to the rate of change. AI systems drive up the rate of change by an order of magnitude, raising the risk.

AI expert beginner

Conclusion

AI is a messy force multiplier tool for software engineering. Learning to use it relies heavily on applying known engineering skills and engineering disciplines that many software engineers have had limited exposure to. Risk management in change and quality assurance automation seem to be good areas to invest in personally.