LLMs, coding agents, benchmarks grounding
ai
llms
coding-agents
benchmarks
criticism
]
LLMs, coding agents, benchmarks grounding
Grounded airplanes are airplanes that need to stay on the ground, not fly.
The openclaw x codex 5.2 x claude 4.6 x codex 5.3 avalanche from end of Jan to beg of Feb didn’t quite leave me alone, so here goes some self-normalization attempt.
Despite big claims floating around, it is LLMs all the way down, so there is good reason for skepticism.
There is two parts of the perceived narrative, one is the companies themselves making ungrounded claims based on ungrounded benchmarks. And then there is user claims based on their specific usage profile.
On the other hand there is general longstanding critical evaluation, and my own individual experience with coding bots.
Benchmarks are good to have but they are usually far from providing a complete picture. I summarize this to myself as “the smarter the thing to be evaluated, the harder the evaluation”. This is to say, any such evaluation trying to cram the entirety into a single (or a few) numbers, is suspicious at best.
Benchmarks are being gamed (anyone remember Llama 4 release a year ago?). Benchmarks are not only gamed by the human stakeholders, but also by the models under evaluation themselves (cheating). This seems new as of last year, 2025.
The incentive for a vendor is clear, if benchmark number go up, business number go up. Good for them, but
LLM vendors seem to regard benchmarks as marketing tools to increase their valuation, and have noticed that while valuations go up if they report good benchmark performance, they dont go down if obscure academics question the validity of their claims.
Granted. For individual users, they are testing things based on their own specific context, and while things might work well there, it doesn’t say much about how well it might work for my own specific context. The smarter the thing, the harder to evaluate. Repeat.
If LLMs are anything, they are prime stereotyping machines, cutting out the long tail and professing on serving the 1% tasks that get 99% of the popularity. This popularity might be a real world usage frequency, no question. Boilerplate code. Repetive tasks. Either way, whenever I tried to do something less stereotypical, I was met with zero shot failure and had to take the thing by the hand, break it down into small digestible pieces, and work it from there. This is a type of “drift” across category and I claim the “adaptation” to myself.
I can see that there is a point in this one off “throwaway” prototyping approaches or other simple, well defined applications with plenty of precursors. I have not seen maintainability of agent’s code output being evaluated. And again, the stereotyping machine will make good interpolations between minor variations of the same thing, but it will fail without notice on anything that is not in the training distribution. By construction. Period.
This is my personal take and rant, done. Let’s try and broaden the persepctive. One well-known problem with benchmarks is, that as soon as they become available, they become available for training. This is called train / test contamination, and if it occurs, it will lead to exaggerated results. Usually, neither the precise training data nor the benchmark trajectories themselves are disclosed. Will I trust any claim in this situation? I trust your guess.
Contamination seems rather minor though when you can just cheat on your exam. The cheating phenomenon seems to have become large during last year. For completeness, also called reward hacking. The issue not only pertains to benchmarking but also to in-the-field use.
Prime example for cheating? ImpossibleBench :) If you report more than 0% on this benchmark, you cheated, I love it https://arxiv.org/pdf/2510.20270v1
Silent but deadly failures https://spectrum.ieee.org/ai-coding-degrades
Do LLM coding benchmarks measure real-world utility? https://ehudreiter.com/2025/01/13/do-llm-coding-benchmarks-measure-real-world-utility/
Do LLMs cheat on benchmarks https://ehudreiter.com/2025/12/08/do-llms-cheat-on-benchmarks/
I mentioned the non-evaluation of maintainability above. Other people have much more fine-grained criticism going here, but that aspect is always included. This one goes beyond coding and touches upon these mistaken (gently said) AGI aspirations. Bottom line: right for the wrong reason. Check Melanie Mitchell’s NeurIPS 2025 keynote writeup https://aiguide.substack.com/p/on-evaluating-cognitive-capabilities
Finally, crawling the 39C3 recordings, this presentation came to meet me going down the popularity ranking, “Agentic ProbLLMs: Exploiting AI Computer-Use and Coding Agents” by Johann Rehberger. Agents have been set up to not trust claims without visiting the link. Agents love to click on links: https://media.ccc.de/v/39c3-agentic-probllms-exploiting-ai-computer-use-and-coding-agents
And even more finally, I got interested in a comparison of genealogical trajectories, like how we went from transistor to integrated circuit, it took 13 years (1947 to 1960) https://en.wikipedia.org/wiki/Invention_of_the_integrated_circuit
Like how we got from transformers to reliability. If things get more difficult, the timeline might extend, despite acceleration narratives. After all, a narrative is just a narrative and not science, or something.
OK, Gary Marcus’ super bowl vis-a-vis table from https://garymarcus.substack.com/p/super-bowl-matchup-anthropic-vs-openai

Comments
tag cloud
robotics
music
AI
books
research
psychology
intelligence
feed
ethology
computational
startups
sound
jcl
audio
brain
organization
motivation
models
micro
management
jetpack
funding
dsp
testing
test
synthesis
sonfication
smp
scope
risk
principles
musician
motion
mapping
language
gt
fail
exploration
evolution
epistemology
digital
decision
datadriven
computing
computer-music
complexity
algorithms
ai
aesthetics
wayfinding
visualization
tools
theory
temporal
sustainability
supercollider
stuff
sonic-art
sonic-ambience
society
signal-processing
self
score
robots
robot-learning
robot
python
pxp
priors
predictive
policies
philosophy
perception
organization-of-behavior
open-world
open-culture
neuroscience
networking
network
navigation
movies
minecraft
midi
message-passing
measures
math
locomotion
llms
linux
learning
kpi
internet
init
health
hacker
growth
grounding
graphical
generative
gaming
games
explanation
event-representation
embedding
economy
discrete
development
definitions
cyberspace
culture
criticism
creativity
computer
compmus
cognition
coding-agents
business
birds
biology
bio-inspiration
benchmarks
android
agents
action