Not comparable at all. Power tools work deterministically. A powered chainsaw is not going to have a 0.1% chance of chopping down a completely different tree on the other side of the forest. Of course accidents happen; your hand can slip. But a proper comparison would be if you got a computer to look at a large number of powered chainsaws, had it generate its own in CAD based on what it's seen, and then used that generated power tool. Which, for something as potentially dangerous as a chainsaw, you most likely wouldn't want to do, and you'd want careful human oversight over every part of the design.
There's a certain amount of randomness in how a chainsaw operates… will a given cut cause the chain to be thrown or not? That has a lot to do with how you use the saw, how you maintain the tension on and lubrication in the chain, what you're trying to cut, etc., and the same goes for the LLMs. They are based on statistics and the "heat" that controls how far down the list of top results they reach for their answer, but you are the LLM operator: you choose what to do with its output, how much agency to give it, how thoroughly you review and test its output before using it.
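That "heat" is the sampling temperature. A minimal sketch of the mechanism, using a made-up toy vocabulary and scores purely for illustration:

```python
import math
import random

def sample_token(logits, temperature=1.0):
    """Pick a token from a score distribution.

    Temperature controls how far down the ranked list the sampler
    will reach: at 0 it always takes the top-scoring token (greedy,
    deterministic); higher values spread probability to lower-ranked
    tokens, so repeated runs can give different answers.
    """
    if temperature <= 0:
        # Greedy decoding: deterministic, always the top token.
        return max(logits, key=logits.get)
    # Softmax over temperature-scaled scores (max-subtracted for stability).
    scaled = {tok: s / temperature for tok, s in logits.items()}
    m = max(scaled.values())
    weights = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}
    return random.choices(list(probs), weights=list(probs.values()))[0]

logits = {"the": 3.2, "a": 2.9, "banana": 0.1}  # toy scores, not real model output
print(sample_token(logits, temperature=0))    # deterministic: always "the"
print(sample_token(logits, temperature=1.0))  # non-deterministic draw
```

Same inputs, different outputs on each run at temperature above zero: that is the designed-in randomness the thread is arguing about.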
Inexperienced idiots would use no chain lube while cutting a 20" dbh standing hardwood with a 14" saw, doing a straight plunge cut on the downwind / leaning-to side of the tree, where they will bind the bar in between the base and the rest of the tree if they're lucky enough to even get that far. With enough perseverance they just might drop the tree on their house. It's the same with LLMs, or traditional programming. Put the local high school chess club in charge of the ICBM targeting software? You get what you deserve.
I don't agree. LLMs are probabilistic by design. Chainsaws aren't designed to be probabilistic, and any behaviour that is probabilistic (setting aside philosophical questions about what it is possible to be certain about, YKWIM) is deliberately minimised. You're supposed to be able to give the same model the same prompt twice and get two different answers. You're not meant to be able to use a chainsaw the same way on the same object and have it cut significantly differently. You're inherently leaving much more to chance by using LLMs to generate code, and creating more work for yourself, since you have to review LLM code, which is generally lower quality than human-written code.
And that's a strength as compared with the machines that have attempted 100% determinism since the days of Lady Ada and Charles Babbage. It also makes it a different beast which must be handled differently than a rigid machine.
You’re supposed to be able to give the same model the same prompt twice and get two different answers
Like creative writers. The Late Show monologue wouldn’t be very good if the writers used the exact same formula every night.
You’re inherently leaving much more to chance by using LLMs to generate code
Not if you give proper (complete and testable) requirements. I’d argue that LLMs are no more “unpredictable” than a pool of randomly selected human programmers.
This is where the power of diversity / randomness comes into play: with proper (complete and testable) requirements, the randomized agents, be they LLMs or consultants for hire, will iterate until they meet the requirements or give up / run out of time or resources.
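That iterate-until-the-requirements-are-met loop can be sketched as follows. Everything here is a hypothetical stand-in: `flaky_generator` plays the role of a stochastic agent (LLM or consultant), and the single test plays the role of a complete, testable requirement.

```python
import random

def iterate_until_passing(generate, tests, max_attempts=5):
    """Drive a stochastic code generator against a fixed, testable spec:
    keep asking for candidates until every test passes or the attempt
    budget (time / resources) runs out."""
    for attempt in range(1, max_attempts + 1):
        candidate = generate()
        if all(test(candidate) for test in tests):
            return candidate, attempt  # requirements met
    return None, max_attempts          # gave up / out of budget

# Toy stand-in for a randomized agent: sometimes produces a correct
# sorting function, sometimes a wrong one.
def flaky_generator():
    return sorted if random.random() < 0.5 else (lambda xs: xs)

# The "complete and testable requirement".
tests = [lambda f: f([3, 1, 2]) == [1, 2, 3]]

result, attempts = iterate_until_passing(flaky_generator, tests, max_attempts=50)
```

The point of the sketch: with an objective pass/fail gate, the randomness of the individual agent stops mattering much, because only candidates that meet the spec get through.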
and creating more work for yourself as you have to review LLM code
That's a matter of practices - and code review is a good practice, but in a world where LLMs are writing the code, only LLM code reviewers are going to be capable of keeping up with the flood of code that the LLMs are producing: https://youtu.be/pzkwn3hu1Cc?t=60
which is generally lower quality than human-written code.
That has been changing, rather quickly over the past year. The size of problems that LLMs are solving as well as humans has been increasing steadily for many months.
Now, having said all that, I'm paid to produce code, so I do review everything the LLMs make - nobody is asking me (yet) to make anything at super-human speed, so I'm not asking LLMs to drive the overall development process at super-human speed either. I've been doing this for about 6 months, and the quality of what they produce, and the complexity of things they are able to successfully produce, has been steadily increasing throughout that time. Six months ago, I couldn't ask an LLM to make more than a simple sub-module in things I was working on and get a reasonable result. Today, for most things I'm tasked with, I can have the LLM develop a set of requirements for the whole problem statement, implement to those requirements, develop meaningful tests (six months ago the LLM-generated tests were garbage; lately they're getting to be on par with, or better than, what my human test department colleagues make), and do self-reviews and refinements to a point where the code meets our standards better than code written by our human programmers.
One of the most productive prompts you can give an LLM is: "Review these requirements; identify gaps, ambiguities, conflicts or any other problems that may hinder implementation. Report all findings and suggest potential corrections." You won't get the same result every time, and repeating that prompt in a fresh context window on the "corrected" requirements often leads to additional refinements, but eventually you do end up with a good set of self-consistent and complete requirements. The thing you have to do is read those (extensive) requirements and ensure that they correctly reflect what you are thinking, because any "hallucinations" that creep into the requirements will then be implemented in the code and the tests and sail right on through to the finished product.
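That repeat-in-a-fresh-context workflow is essentially a loop that runs until the review comes back clean. A sketch, where `review` is a hypothetical callable wrapping whichever LLM API you actually use (it is assumed to return the list of findings plus a corrected draft):

```python
def refine_requirements(requirements, review, max_rounds=5):
    """Re-run the review prompt on the corrected requirements in a
    fresh context each round, until the model reports no findings
    or the round budget runs out."""
    REVIEW_PROMPT = (
        "Review these requirements; identify gaps, ambiguities, conflicts "
        "or any other problems that may hinder implementation. "
        "Report all findings and suggest potential corrections."
    )
    for _ in range(max_rounds):
        findings, requirements = review(REVIEW_PROMPT, requirements)
        if not findings:
            break  # converged: nothing left for the model to flag
    # The final gate is human: read the result and check it reflects
    # your intent, since hallucinated requirements would otherwise
    # flow straight into the code and the tests.
    return requirements
```

The loop automates the mechanical repetition; the last paragraph's warning still applies - a human has to read the converged requirements before anything gets implemented against them.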