What exactly would you checksum? All intermediate states that weren’t committed, plus all test run parameters and outputs? If so, how would you use that to detect an LLM? Current agentic LLM tools also make several edits and run tests on the thing they’re writing, then keep editing until the tests pass.
So the presence of test runs and intermediate states isn’t really indicative of a human writing code, and I’m skeptical that distinguishing between the steps a human would take and the steps an LLM would take is any easier or quicker than distinguishing based on the end result.
You could timestamp every change and bit of progress to a file, record test results and output, and produce an approximate algorithmic confidence rating for how bespoke the process of writing that code was. Agentic AI rapidly spits out code the way a machine would, whereas humans take time and think as they go: they make typos and go back to correct them, their tests fail, and their debugging looks different from an agent’s. We need to fingerprint how agents write code, so we can compare what agent-written code looks like when run through this kind of validation against what the same work looks like when a human does it.
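To make that concrete, here is a minimal sketch in Python of the kind of telemetry being described: timestamped edit events plus test outcomes, reduced to a rough "human-likeness" heuristic. Every class name, field, weight, and threshold below is hypothetical and purely illustrative; it is not a working detector, just a sketch of what such a confidence rating could be computed from.

```python
import statistics
import time
from dataclasses import dataclass, field


@dataclass
class EditEvent:
    timestamp: float          # when the change was saved
    chars_changed: int        # size of the delta
    tests_ran: bool = False   # whether a test run followed this edit
    tests_passed: bool = False


@dataclass
class SessionLog:
    events: list[EditEvent] = field(default_factory=list)

    def record(self, chars_changed: int,
               tests_ran: bool = False, tests_passed: bool = False) -> None:
        self.events.append(
            EditEvent(time.time(), chars_changed, tests_ran, tests_passed))

    def human_likeness_score(self) -> float:
        """Toy heuristic: assume humans show irregular pauses between edits
        and smaller deltas, while agents emit large, evenly spaced changes.
        Returns a value in [0, 1]; the weights and thresholds are arbitrary."""
        if len(self.events) < 3:
            return 0.5  # not enough signal either way
        gaps = [b.timestamp - a.timestamp
                for a, b in zip(self.events, self.events[1:])]
        gap_irregularity = statistics.pstdev(gaps) / (statistics.mean(gaps) or 1)
        avg_delta = statistics.mean(e.chars_changed for e in self.events)
        small_edits = 1.0 if avg_delta < 200 else 0.0  # arbitrary cutoff
        return min(1.0, 0.5 * min(gap_irregularity, 1.0) + 0.5 * small_edits)
```

Whether signals like inter-edit timing and delta size actually separate agents from humans is exactly the open question; the sketch only shows where such a rating would plug in.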