You can’t compare a statistical token generator to a deterministic algorithmic program.
I could’ve sworn that I saw a headline recently that gcc isn’t deterministic. But maybe that was some really weird edge case or a bug.
I’m fairly certain that clang is non-deterministic and I strongly suspect that gcc is too.
I don’t think that’s true - or if it is, it’s a bug. It would make reproducible builds impossible.
My understanding was that clang isn’t deterministic by default but can be made deterministic with flags. I checked and this is still true, but not for the reasons I thought. I assumed this was due to the way LLVM iterates over functions and basic blocks in a non-deterministic order (because of the way they are laid out in memory) and because some optimisations use heuristics. But it appears LLVM tries to make all optimisation passes deterministic. The remaining non-determinism comes from file paths and timestamps, which can be worked around with the correct flags and some extra work to create reproducible builds.
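For the curious, a minimal sketch of what those workarounds look like. The flag and variable names below (`-ffile-prefix-map`, `SOURCE_DATE_EPOCH`) are from the GCC/Clang documentation and the reproducible-builds spec, but exact support varies by compiler version, so check your toolchain:

```shell
# Strip the absolute build path from debug info, __FILE__, and diagnostics,
# so building from a different directory yields the same object file:
clang -ffile-prefix-map="$PWD"=. -g -c foo.c -o foo.o

# Pin __DATE__ / __TIME__ / __TIMESTAMP__ to a fixed epoch instead of the
# wall clock (honoured by recent GCC and Clang):
SOURCE_DATE_EPOCH=0 clang -c bar.c -o bar.o

# Sanity check: two builds done at different times, from different paths,
# should now produce byte-identical objects.
sha256sum foo.o bar.o
```

Linking adds its own sources of non-determinism (build IDs, archive timestamps), which is part of the "extra work" mentioned above.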
Thank you very much for looking it up! I learned something new thanks to you.
So I guess in the end it is deterministic, because for a given set of inputs it always produces the same output, it’s just that file paths and timestamps must also be considered as input.
Say what you will, Turbo C++ in 1991 was dysfunctional for anything over 5 pages of code, a lot like LLMs a year ago.
No, but a statistical token generator can help you create a deterministic algorithmic program quickly, if you know what you are doing.
And if you don’t know what you’re doing, it’ll probably be a long time before you realize it, because the token generator really wants you to keep paying your subscription.
Honestly if you do know what you’re doing that’s still true. They’re really good at looking like good code which makes it not always obvious when it’s not, even to an experienced developer.
Or maybe more bluntly, they’re really good at volume, not necessarily quality.
There can be a lot of difference between an experienced developer and a good/responsible developer.
Know your limits. Professional engineering has been wrestling with these problems for a long time - unfortunately, practices like professional apprenticeship and sealed drawings have only partially and informally migrated into the software development world.
If you don’t know what you’re doing, you shouldn’t be using powerful tools in the first place, whether that’s heavy-lift cranes, chainsaws, arc welders, or driving an SUV at 80mph…
The day may come when the token generator manipulates you to keep you subscribed, but at this point in time I don’t believe the frontier models are playing those games too extensively - at least not models like GPT and Claude.
Back in the 1990s I was deeply impressed that when my ISP’s service started sucking, I could use their service to search for and find alternate ISPs to switch my subscription to. I wondered how long that would continue - so far, you still can - although since broadband came around much of the U.S. is locked into essentially monopoly providers of last-mile connectivity service.
Hopefully, there will be enough competition among LLM providers that subscribers continue to have choices to move to non-manipulative models.