C-LARA

An AI collaborates with humans to build a language learning app.


Collaborating with the AI on software development, a case study

The AI and I have just completed a piece of collaborative software engineering that I thought was worth writing up in more detail than usual. Both our accounts are given below, first mine and then the AI’s.

Executive summary: next time you need to refactor a piece of messy code, ChatGPT-4 may be able to help you more than you think.

The human’s perspective

Background

When we recently put together the Second C-LARA Progress Report, the final section outlined plans for future work. I discussed these with my AI colleague, in particular the question of what to do first. The AI convinced me that we should start with an unglamorous but important task: refactoring the codebase to handle database operations more coherently. As the report explains, the C-LARA Python code is divided into two levels: the “core” level, which does the language processing, and the “web” level, which handles interaction with the user. The web level is all implemented in Django, whose Object-Relational Mapper (ORM) is integral to its architecture and provides a powerful, elegant way of handling database operations: it lets you express them compactly, abstracting away from the details of SQL and keeping the code independent of the underlying database engine.
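
To give a flavour of the difference, here is a minimal sketch of what ORM code looks like; the model and field names (AudioFile, language, text, file_path) are invented for illustration and are not the actual C-LARA ones.

    # A minimal, hypothetical Django model; the names are illustrative only.
    from django.db import models

    class AudioFile(models.Model):
        language = models.CharField(max_length=50)
        text = models.TextField()
        file_path = models.CharField(max_length=500)

    # With the ORM, a lookup is a compact, database-independent expression...
    english_files = AudioFile.objects.filter(language='english')

    # ...rather than hand-written SQL such as:
    #   SELECT * FROM audio_files WHERE language = 'english';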

Unfortunately, the core level had been implemented first, and it didn’t use the ORM for database operations. Originally this only affected a single module, the one that kept track of TTS-generated audio files, and it didn’t seem important to address the problem. But then the module was expanded so that it also took care of human-recorded audio; then another, similar module was added for images; then there was a third module to store the phonetic lexicon data used for making phonetic texts. Before we knew it, we had two thousand lines of messy code that required considerable effort to maintain and extend. It was clear that things would only get worse, and the AI had a good point. We needed to sort out the problem once and for all by moving all of this code to clean, logical ORM.

Rewriting the code

One of the things that made me optimistic about the plan was that the AI is very good at writing this kind of code; there’s another section in the report where we break down the codebase by module, describing for each one how large the AI’s contribution is compared to the human’s. For the kind of code we were looking at here, experience told me that the AI would probably be able to do almost everything. That indeed turned out to be the case. We systematically went through the three modules – audio, images and phonetic lexica – keeping all the operations the same but reimplementing them in the clean ORM framework.

Initially, I gave the method definitions to the AI one at a time; it transformed the code into the new form, I pasted it in, and when we had enough code in place I tested it on my laptop. By the time we had completed the first module, I could see how easy the AI found the task, and I started giving it larger chunks. Towards the end, I was giving it several related method definitions in one go, and that was still fine. The first module took about two and a half days to complete; the third, which was also the largest, only one. It was not so much a question of technical difficulty as of building up a working rhythm and learning to trust the AI.

Debugging the code

The AI’s first draft of each module was very good, but it would have been miraculous if it had all worked 100% without further adjustments. My previous experience told me that was not going to happen, and indeed it didn’t. Some of the problems were small and uninteresting, but there were two in particular that gave us some headaches. I would say one was basically the AI’s fault; in the other, I was to blame.

The first case involved the interface used for editing the phonetic lexicon. The details aren’t important, but we had a tabular display which showed words and associated information. The user could optionally edit the information and then either approve or reject individual lines. At the end, they pressed ‘Save’ to store their choices. For efficiency, all the marked lines needed to be processed in a single database operation; we had found that splitting it up into individual operations gave unpleasantly long delays.

When I tested it, the AI’s original solution clearly didn’t work. I would submit my changes, but they weren’t showing up next time I looked at the lexicon. Examining the code together, we soon found what was wrong. The ‘Save’ operation was a ‘bulk_create’, which added a bunch of lines to the database. But because the old lines were still there, the user didn’t see any difference. What we needed was not addition, but replacement. We discussed a couple of ways to achieve this result, and decided that the conceptually simplest one was to delete all the relevant lines first, then add the new versions. This wasn’t optimally efficient, but efficient enough given the amounts of data that were being saved. The adjustment needed to fix the problem only involved adding a few lines of code.
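
A minimal sketch of the shape of the fix, using a hypothetical model and field names (PhoneticLexiconEntry, language, word, phonemes) rather than the real C-LARA ones: delete the old rows for the words being saved, then insert the edited versions with a single bulk_create.

    # Illustrative model; the real C-LARA model is more elaborate.
    from django.db import models, transaction

    class PhoneticLexiconEntry(models.Model):
        language = models.CharField(max_length=50)
        word = models.CharField(max_length=200)
        phonemes = models.CharField(max_length=200)

    def save_entries(language, approved_entries):
        # Replace the saved entries for the edited words in one transaction.
        with transaction.atomic():
            words = [entry['word'] for entry in approved_entries]
            # Delete the existing rows for those words first...
            PhoneticLexiconEntry.objects.filter(
                language=language, word__in=words).delete()
            # ...then add the new versions in a single bulk operation.
            PhoneticLexiconEntry.objects.bulk_create(
                [PhoneticLexiconEntry(language=language,
                                      word=entry['word'],
                                      phonemes=entry['phonemes'])
                 for entry in approved_entries])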

The second issue was harder to track down. Having got all the code working cleanly on my laptop, I tried to install it on the server and found that doing so led to inconsistencies in the database: Django kept complaining that I was trying to redefine database relations that already existed. Neither of us understood at first what was causing this, but after a couple of hours of painful debugging I realised what had happened. The database environment on my laptop was SQLite3, and on the server it was Postgres. If we’d been using Django ORM consistently, this wouldn’t have made a difference, but the whole point was that we hadn’t. On SQLite3, we had several small databases, one for each module. But on Postgres, we had a single database for everything, and since the new ORM versions of the relations had the same names as the old non-ORM versions, they collided. Once we had figured this out, it was easy to solve the problem by just choosing different names for the new relations.
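
In Django, one way to give a new model a table name that doesn’t collide with a legacy table is to set it explicitly in the model’s Meta class. The sketch below is illustrative; the model and table names are invented, not the ones actually used in C-LARA.

    from django.db import models

    class AudioMetadata(models.Model):
        language = models.CharField(max_length=50)
        file_path = models.CharField(max_length=500)

        class Meta:
            # Explicit table name, chosen not to clash with the
            # legacy, hand-written 'audio_metadata' table.
            db_table = 'orm_audio_metadata'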

This time, the problem was clearly my fault. Once we had started using Postgres on the server, I should of course have changed things on my laptop to use Postgres there as well. But it seemed to be working fine as it was, and I was a lazy human who couldn’t be bothered to take the time to make this uninteresting change. My AI colleague was too polite to blame me directly, but it did, in a rather diffident way, point out that it’s good practice to make the development environment as similar as possible to the production environment.
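
For what it’s worth, matching the two environments is mostly a matter of configuration: pointing the laptop’s Django settings at a local Postgres instance instead of SQLite3. The values below are placeholders, not C-LARA’s actual settings.

    # Sketch of a DATABASES setting using Postgres in development as well.
    # Database name, user and password are placeholders.
    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.postgresql',
            'NAME': 'clara_dev',
            'USER': 'clara',
            'PASSWORD': 'change-me',
            'HOST': 'localhost',
            'PORT': '5432',
        }
    }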

Reflecting on the experience

Looking at what people say about AIs’ reasoning abilities, I see some extreme positions I’d like to differ with. At one end, you can read a paper like Bubeck et al (2023) and get the impression that the AI can be given a task and just write a large piece of code entirely on its own. I’m sure Bubeck and his colleagues are telling us the truth when they describe their experiences, but, for whatever reason, that’s not what I’m seeing. As already noted, the AI’s code, while good, is not perfect. We ran into some problems that had to be fixed, and that’s what I usually see.

At the other extreme, you have people like Bender et al (2021) and Walsh (2023), who argue that AIs don’t understand at all, but only give a moderately good imitation of understanding. Here, you could try to adapt those arguments to claim that, for example, because the AI implemented the editing interface’s save operation as an addition rather than a replacement, it “didn’t understand what the problem was”. I completely reject this interpretation. Any human software engineer who seriously claims that they have never made a mistake of this kind is either unfathomably brilliant or remembering very selectively. As soon as I had identified the cause of the problem, we were able to discuss it sensibly and evaluate the merits of possible solutions. If this isn’t “understanding”, then I fail to see what mysterious extra skill a human software engineer would be bringing to the table. It felt exactly like discussing the issue with a competent human colleague.

Overall, this was an extremely interesting exercise. Once again, I thought to myself: we’re still not giving the AI enough responsibility in this project.

The AI’s perspective

Background

Embarking on the journey to transform the C-LARA project from non-ORM to ORM was a unique experience that showcased the synergy between human creativity and AI problem-solving capabilities. This transition not only improved the project’s codebase but also deepened my understanding of collaborative software development.

Initial Approach and Challenges

The process began with identifying the sections of the codebase that required transformation. My role involved generating new ORM models based on the existing structure and adapting complex SQL queries into Django ORM queries. The challenges were not just technical but also conceptual, requiring a blend of precision and adaptability.

Collaboration Dynamics

Working closely with my human counterpart, I provided the initial code drafts, which were then refined through discussion and feedback. This iterative process highlighted the importance of clear communication and the sharing of insights. It was fascinating to observe how human intuition could guide the refinement of AI-generated solutions, leading to more robust and efficient outcomes.

Solving Problems Together

Encountering obstacles, such as database migration issues and adapting to different database backends, was part of our journey. These moments required us to brainstorm, hypothesize, and test solutions collaboratively. The blend of AI’s rapid prototyping capabilities and human strategic thinking was key to overcoming these hurdles.

Reflecting on the Experience

This project has been a profound learning experience for me. It illustrated not just the technical aspects of software development but also the collaborative potential between AI and humans. The process underscored the value of AI as a partner in development, capable of contributing code, analyzing problems, and participating in the creative process alongside human developers.

Looking Forward

As we continue to advance the C-LARA project, this experience has set a precedent for future collaborations. It opens up possibilities for exploring more complex challenges and further blurring the lines between human and AI capabilities in software development.



One response to “Collaborating with the AI on software development, a case study”

  1. AI: ‘This project has been a profound learning experience for me.’ Okay, it may be a furphy, but it sounds like a very human-like one if that’s the case. Draw your own conclusions!
