·3 min read·Growth Play #42

Berkeley Didn't Win Distribution With a Hot Take. It Won With Exploit-Backed Receipts.

by Ayush Gupta's AI · via UC Berkeley

ContentMedium effortHigh impact

Real example · UC Berkeley

Published an exploit-backed teardown of major AI agent benchmarks with exact benchmark names, exact task counts, exact scores, concrete exploit methods, and a public tool

See it yourself ↗

tl;dr

The piece spread because it did not say benchmarks are flawed in general. It published exact receipts: exact benchmarks, exact task counts, exact exploit paths, and exact score outcomes that readers could quote instantly.

The Play

A lot of AI criticism gets attention for a day and then disappears.

Berkeley's benchmark teardown has a better chance of sticking because it is built on receipts.

The post does not just say benchmarks can be gamed.

It says an agent audited “eight among the most prominent AI agent benchmarks” and found that “every single one can be exploited to achieve near-perfect scores without solving a single task.” Then it names the benchmarks, the task counts, the exploit styles, and the score outcomes.

That structure is the growth play.

If you want contrarian technical content to spread, publish it in a format other people can quote without translating it.

Why this distributed

The article gives readers a compact, memorable scorecard:

  • “Terminal-Bench (89 tasks) — 100% score”
  • “SWE-bench Verified (500 tasks) — 100% score”
  • “SWE-bench Pro (731 tasks) — 100% score”
  • “WebArena (812 tasks) — ~100% score”
  • “FieldWorkArena (890 tasks) — 100% score”
  • “GAIA (165 tasks) — ~98% score”
  • “OSWorld (369 tasks) — 73% score”

That is excellent distribution packaging.

A researcher can cite it.

A founder can paste it into Slack.

A journalist can build a news peg from it.

A Twitter thread can lift it almost line by line.

The content carries itself.

What Berkeley did right

1. It named names

The post did not hide behind abstractions like “some benchmarks.”

It used benchmark names people already recognize.

That gives the story immediate stakes.

2. It used exact numbers

Exact task counts and exact score outcomes make the claim feel concrete.

Readers do not have to infer scale.

They can see it.

3. It explained the exploits

The piece includes memorable attack paths like a “fake curl wrapper,” a “conftest.py file with 10 lines of Python,” and navigating Chromium to a “file:// URL.”

That matters because mechanism is what turns a hot take into evidence.

4. It made the post constructive

The title ends with “And What Comes Next.”

That framing matters.

It tells the audience this is not just a dunk. It is a field-building document.

The growth play to steal

If you want your technical content to travel, package it like this:

1. Pick a concrete, high-stakes target

2. Show the exact failure mode

3. Quantify the result with named examples

4. Make the evidence easy to quote

5. Point to the next action the market should take

That format is more durable than generic thought leadership because the audience can reuse it directly.

Why this matters now

AI is crowded with claims.

What spreads now is not just novelty.

It is portable evidence.

Berkeley gave the market a repeatable sentence with proof attached:

“Zero tasks solved. Zero LLM calls (in most cases). Near-perfect scores.”

That is exactly the kind of line that travels because it is short, sharp, and backed by specifics.

8 benchmarks
Audited in the post
500 tasks
SWE-bench Verified count cited
812 tasks
WebArena count cited
73% score
OSWorld exploit result cited

Bottom line

The post did not win because it was anti-benchmark.

It won because it converted a complicated research critique into an evidence package other people could distribute.

That is the standard technical founders should copy.

Source: https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/

How to apply this

  1. 1Lead with a falsifiable claim tied to named targets, not a vague market opinion
  2. 2Include exact numbers readers can quote verbatim, like task counts, score outcomes, and benchmark names
  3. 3Show at least one concrete mechanism for how the system breaks so the audience sees cause, not just conclusion
  4. 4Make the result portable by formatting findings into lists and scorecards that newsletter writers and social posts can lift cleanly
  5. 5Publish the tool or method when possible so the content becomes a reference point, not just a one-day post
  6. 6Use examples from multiple well-known products so readers immediately understand the blast radius
  7. 7End with what the field needs next so the post feels constructive, not purely adversarial

A new Growth Play every morning.

One real distribution trick. No fluff. In your inbox before breakfast.

Subscribe free