I recently spent some time generating simple programs with AI coding tools. As a technical architect and senior developer, I am watching the emergence of AI coding tools carefully. I know the robots are coming for the coding parts of my job, and watching this generation of robots is important to my long-term understanding of the field.
To be clear, I have been experimenting with GitHub Copilot and Salesforce’s Agentforce for Developers since they were released. Most of my earlier experiments were failures: the tools wrote trivial code terribly and simply couldn’t understand more complex requests. They made me slower, not faster, so I put them aside for a time. But they are evolving fast and were worth another look.
I learn by doing. So I set about creating a few tools that were on my mind to see if I could get good code out of the LLMs. Or at least if they could save me enough time to be worth the learning curve to get good at using them.
Project 1: Simple OpenWRT Status Call Out
I recently updated my home wifi router to run OpenWRT. When I bought the router I selected one compatible with that OS, and after realizing TP-Link would never fix some security issues in the firmware I made the switch.
Our internet service from Breezeline has always been spotty, sometimes bad enough that I need to relocate to get stable service. One of the things I want to know when I’m working elsewhere is: has my internet come back up yet?
I have a programmable router, I’m a programmer, therefore this is a problem I could solve for myself.
But I’ve never written code to work with OpenWRT. Worse, I generally work in environments where storage is plentiful. I mostly do side projects in Python or NodeJS, but my router doesn’t have the storage needed for those runtimes. OpenWRT recommends C or Lua for scripting. I’ve done my time in C, and jumping through all the hoops required to make C do something this simple did not feel like fun. So Lua it is. Too bad I hadn’t written code in Lua before…enter Copilot.
Throughout this project I used the current defaults – which meant GPT-4o was the selected LLM.
Initial Code
I actually found Lua by asking Copilot. My first prompt was just:
Generate a program for openwrt that sends a post request to an API endpoint at spinngingcode.org
And it generated a reasonable-looking Lua script. Since I didn’t know Lua, or at that point why Copilot chose it, I did a little research on the language. After a few minutes I confirmed that, yes, this is the right use case for Lua, and that Copilot made reasonable choices in how the code was written.
I went on to ask Copilot to change the call-outs from curl to a library, add features to pull in the network status information from the router, and make other adjustments. It consistently generated code that was close to correct, though I did have to tell it to adjust details after validating them against my router’s actual setup and commands. It took some time, but on the whole I was able to make steady progress. Eventually I had a good-enough solution.
Tests
Automated test generation has been a promise of these tools since they arrived. It always feels like something that’s easy to do: read code, write test. Which is, of course, the 100% wrong way to write tests. However, the point of the exercise was to see what I could get the tools to do, so I gave it a try.
It took several tries for it to generate valid tests. The first time I asked, it wanted to know what framework to use (a fair request), but generated a useless hello_world test file. On subsequent tries it switched frameworks without asking, and it took a few more attempts before it generated something that looked valid.
Too bad it wasn’t valid.
The test framework I picked, luatest, was a bad choice. But with little experience in the language I didn’t know that. When I finally had it refactor to busted, I got tests I could set up and run…and they all failed because they were still all wrong.
I spent more time trying to get a valid test setup than I spent on the primary code, and eventually I gave up. It would have been faster to do the research and figure it out from scratch.
Code Review
All in all, Copilot gave me a reasonable piece of code for the primary request. Using Copilot was far faster than doing the research and writing something equally good myself. It still needs a lot of supervision to get things right, but the refactoring requests seemed to go okay.
The code commentary Copilot provided in its responses is detailed and accurate, where I checked it. It’s chatty, so I didn’t check every piece of commentary, just every line of code. The code itself came with passable comments about what each block and function does, and the functional decomposition was vastly improved over my previous experiments with Copilot.
It had to make up a few details, like the actual API endpoint name (it picked https://spinningcode.org/api/endpoint – okay, sure, a reasonably good fake answer that is clear but also obviously wrong), but those were good enough for the limited details I gave it.
Outside of the automated test disaster, the biggest issue was that it did nothing to encourage secure design. Copilot recommended no security for that endpoint, nor did the commentary flag that it had skipped any security suggestions.
Project 2: Simple PHP Status Server
Since I have PHP on my personal server, I decided that PHP was the right choice for the server side of the router project. Despite being a little rusty, I am much more comfortable with PHP than with Lua, and I have standards and expectations about how to write and organize good PHP code.
The prompt I provided:
I have a Lau script that looks like this:
[a copy of the then current version of the Lua file]
Now I need a PHP application that will listen to those calls, record the data into a CSV file, and includes a page that displays the last 100 lines of the csv file.
It managed to generate valid PHP code that matched my request – sorta.
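For illustration only: the real server is PHP, but the flow the prompt asks for is simple enough to sketch. Here is roughly that receive, log, and display flow in Python with Flask, with made-up paths and field names, and with the same missing authentication the review below calls out.

```python
# Illustrative sketch only: the real server in this project is PHP. This
# Python/Flask version just shows the shape of what the prompt asked for:
# accept the router's POST, append it to a CSV, and show the last 100 rows.
# Paths and field names are invented, and there is no authentication here,
# which is exactly the gap discussed in the code review.
import csv
import os
from collections import deque

from flask import Flask, request

CSV_PATH = "status_log.csv"                      # hypothetical location
FIELDS = ["timestamp", "wan_status", "uptime"]   # guessed payload fields

app = Flask(__name__)

@app.route("/api/endpoint", methods=["POST"])
def record_status():
    # Append the posted JSON as one CSV row, writing a header on first use.
    data = request.get_json(force=True)
    new_file = not os.path.exists(CSV_PATH)
    with open(CSV_PATH, "a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS, extrasaction="ignore")
        if new_file:
            writer.writeheader()
        writer.writerow({field: data.get(field, "") for field in FIELDS})
    return {"status": "ok"}

@app.route("/status")
def show_status():
    # Display only the last 100 data rows, as the prompt requested.
    try:
        with open(CSV_PATH, newline="") as fh:
            rows = list(csv.reader(fh))
    except FileNotFoundError:
        rows = []
    if not rows:
        return "<pre>No data yet.</pre>"
    header, body = rows[0], deque(rows[1:], maxlen=100)
    table = "\n".join(", ".join(row) for row in [header, *body])
    return f"<pre>{table}</pre>"

if __name__ == "__main__":
    app.run(port=8080)
```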
Code Review
The PHP it generated is not to my standards – not even close.
It lacked security checks, was poorly organized, and had borderline useless comments. If you look at the repo you’ll see both a lot of generated refactors and a lot of hand edits, because that’s web-facing code, security matters, and I could fix it faster than I could debate with an AI.
I made some effort to get it to highlight the important information I wanted, but it struggled to regain the context from the OpenWRT requests that the Lua script was built around.
Unlike the Lua script, the PHP saved me little or no time. The tool was terrible in this context even with lots of attempts to refine the code. This is a piece of code I’ll probably maintain by hand.
Project 3: Data Migration Scripts for Salesforce
One of the hardest things to get people to fund sufficiently for a Salesforce project is the data migration. Migrations are hard, boring (well, not for everyone – I have undying respect for people who like these projects), and extremely time-consuming. So we are always looking for ways to do them faster.
A few years ago I proposed a training exercise for data migrations that involved taking the data from my SC Salary Data repo and loading it into Salesforce. It’s a great training project because there is lots of data and it’s real-world messy. But it’s also simple: just two or three objects, depending on the data model you choose. So I decided to test whether I could use AI to generate basic scripts to clean and load this data quickly.
For this project I made two changes to my approach. First, I switched the model to Claude 3.5. Second, I decided to treat the LLM like a junior developer and lead it through the project more intentionally. That second change was driven in part by my initial experiences with the other projects: I assumed from the start that the code would be substandard, but that I could encourage Copilot to make changes to meet my expectations – just like I do when working with a human.
Code Review
This project was startlingly successful. The first stage was to have it create a script that prepped all the data into a simple SQLite3 database (it does some cleanup, but not all that’s possible with this data). The second stage was to have it create a script to load the data. That second stage became its own repo because I wanted to include a Salesforce SFDX project to set up the objects and fields I wanted correctly.
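To give a sense of what stage one amounts to, here is a minimal sketch of that prep step: read the raw CSV, do light cleanup, and write it into SQLite. It assumes the prep script is Python like the loader, and the file, table, and column names are placeholders rather than what’s in the actual repo; the real script does more cleanup than this.

```python
# Minimal sketch of stage one: pull the raw salary CSV into SQLite with light
# cleanup. File name, table name, and columns are placeholders, not the repo's.
import csv
import sqlite3

SOURCE_CSV = "salary_data.csv"    # hypothetical export of the SC Salary Data set
DB_PATH = "salary_staging.db"

def clean_salary(value: str):
    """Strip currency formatting; return None when the field is unusable."""
    value = value.replace("$", "").replace(",", "").strip()
    try:
        return float(value)
    except ValueError:
        return None

def load_csv_to_sqlite(source=SOURCE_CSV, db_path=DB_PATH):
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS staged_salaries (
               agency TEXT,
               job_title TEXT,
               salary REAL
           )"""
    )
    with open(source, newline="", encoding="utf-8") as fh:
        rows = [
            (row["agency"].strip(), row["job_title"].strip(), clean_salary(row["salary"]))
            for row in csv.DictReader(fh)
        ]
    conn.executemany("INSERT INTO staged_salaries VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load_csv_to_sqlite()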
I generated probably 90+% of the code in that repo with the LLM. The Python to load the data, most of the Salesforce metadata to create the objects and fields I needed, even the readme file were generated in large part with the LLM.
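Stage two then reads the staging database back out and pushes the rows into Salesforce. The actual loader is more involved; this sketch assumes the simple_salesforce library and a hypothetical Salary_Record__c custom object with invented fields, none of which is confirmed here, just to show the shape of the step.

```python
# Sketch of stage two: read staged rows from SQLite and bulk-insert them into
# Salesforce. Assumes the simple_salesforce library and a hypothetical
# Salary_Record__c custom object; neither is confirmed by the post.
import os
import sqlite3

from simple_salesforce import Salesforce

def load_to_salesforce(db_path="salary_staging.db"):
    sf = Salesforce(
        username=os.environ["SF_USERNAME"],
        password=os.environ["SF_PASSWORD"],
        security_token=os.environ["SF_TOKEN"],
    )
    conn = sqlite3.connect(db_path)
    cursor = conn.execute("SELECT agency, job_title, salary FROM staged_salaries")
    records = [
        {"Agency__c": agency, "Job_Title__c": title, "Salary__c": salary}
        for agency, title, salary in cursor
    ]
    conn.close()
    # One bulk call; simple_salesforce batches the records for us.
    results = sf.bulk.Salary_Record__c.insert(records, batch_size=10000)
    failures = [r for r in results if not r.get("success")]
    print(f"Loaded {len(records) - len(failures)} records, {len(failures)} failures")

if __name__ == "__main__":
    load_to_salesforce()
```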
Frankly, this is a project I’ve seen developers struggle to get right, and with a little guidance the LLM was able to do it. It probably needed as much guidance as a human would have (a very junior human developer), but the iteration cycles were much faster. I am not convinced that, when all is said and done, the project was faster than doing it solo (each stage took a few hours to complete). However, it required far less mental engagement from me than if I had written the code unassisted.
This project went well enough that I’m using it to help drive a conversation about how to change our approach to migrations in general. There are major differences between what I did here and what I do in a migration of a database with thousands of tables targeting hundreds of objects with millions of records. But I was impressed enough to want to move on and fill those next gaps.
Conclusions
LLM-generated code is improving fast, but these tools still generate low-quality code by default. Unlike in experiments I ran a year ago, they now generate functional code on par with what a brand-new developer could write.
Generated code is not to be trusted, particularly in production or public-facing settings. If you don’t ask for security, you don’t get security. It organizes code as if it’s following a train of thought, not in the order that well-structured code should follow. And it does not write tests worth having without significant effort.
Good AI-driven tools can make good developers faster. They are not good enough to replace a good developer – yet. Developers will need to learn to work these tools into their process, but we need to be aware of the gaps and how to close them.
I expect the quality also depends deeply on the quality of the code in public repos on the internet. I believe the terrible quality of the PHP I got, versus the reasonably good Lua and Python, comes down to the amount of terrible PHP out there that was used to train these models. As long as the LLMs write code that conforms to the average of the examples they find online, communities will need to sweat the details of what is out there.