Strategy

How to Convince an AI (It's Not How You'd Convince a Human)

Eytan Buchman
2026-03-08
12 min read

A New World of Writing for LLMs

Here's a stat that should make you uncomfortable: changing a colon to a space — literally one character — can swing an LLM's accuracy on a task by 78 percentage points.

Seventy-eight.

That comes from a 2023 study by Sclar, Choi, and Tsvetkov at the University of Washington. They tested dozens of language models across 53 tasks and found that purely cosmetic formatting changes that most humans never notice could make the difference between an LLM acing a test and completely bombing it.

This research is from 2023, and LLMs have changed since. But the principle remains: we're suddenly writing for a totally different beast.

We've spent 20 years learning how to persuade humans online. We know about social proof. We know about emotional hooks. We know the color of a CTA button matters. And now there's a new audience showing up to read our content — AI agents, LLM-powered assistants, shopping bots, research tools — and they don't care about any of that.

They care about colons.

So here's the question nobody's really answered yet: if you had to convince an AI — not a person — that your content was trustworthy, accurate, and worth surfacing... how would you do it?

To be clear, this is not the same thing as SEO.

But it's close.

SEO focused on making sure Google would find your page: you wrote to appeal to Google's algorithm. But that's where the difference ended, because for the page to succeed, the human who landed there still needed to be convinced.

Researchers Have Been LLM Whisperers

Turns out, researchers have been studying exactly this. And the answers are weird, counterintuitive, and extremely useful if your content needs to survive in a world where AI is the intermediary between your site and your customer.

A quick note on the research: the studies cited here were published between 2021 and 2024, which in LLM years is roughly the Jurassic period. Newer models have almost certainly gotten better at some of these biases. But here's the thing — these are the best controlled, peer-reviewed studies we have on how LLMs process and judge content. And the structural patterns they reveal (formatting sensitivity, authority bias, verbosity preference) are architectural tendencies, not bugs that get patched out in the next release. The specific numbers will shift. The underlying dynamics won't.


LLMs Don't Read. They Parse.

Let's start with how LLMs actually consume content, because it's fundamentally different from how humans do.

When you read a webpage, you scan. You look at the headline, skim the subheadings, decide if it's worth your time. You're influenced by design, tone, the photo at the top. You might read the whole thing. You probably won't.

An LLM doesn't scan. It parses. Every token — every word, every punctuation mark, every formatting choice — gets processed as part of a sequence. There's no "skimming." There's no "vibes." There's just the sequence, and the weights that sequence activates.

This is why the formatting study is so jarring. Sclar et al. tested over 320 formatting variations across models like LLaMA-2, Falcon, and GPT-3.5. Same questions. Same answers. Different formatting. The results:

  • Performance spread of up to 76 accuracy points between the best and worst formatting for the same task on the same model.
  • Separators (the characters between fields) and number formatting were the most predictive features of performance.
  • 24% of single-character formatting changes caused accuracy shifts of 5 or more points.
  • The prompt formatting space is non-smooth — meaning small changes don't produce small, predictable effects. They produce chaos.

Here's a concrete example from the paper: on one task, formatting the prompt as passage:{} answer:{} produced 4.3% accuracy. Changing it to passage {} answer {} — dropping the colons — jumped accuracy to 82.6%. Same model. Same task. Same content. Different punctuation.

For humans, formatting is decoration. For LLMs, formatting is signal. It's closer to code syntax than visual design. Get it wrong and the model doesn't just misunderstand your content — it might completely misinterpret it.
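The kind of cosmetic variation Sclar et al. measured is easy to picture in code. This is a minimal sketch, not the paper's harness; the field names and passage content are illustrative:

```python
# Two renderings of the same task content that differ only in the separator
# character after each field name — the exact variation from the paper's
# 4.3% vs. 82.6% example.

def render_prompt(passage: str, sep: str) -> str:
    """Render a QA prompt, placing `sep` after each field name."""
    return f"passage{sep}{passage} answer{sep}"

content = "The mitochondria is the powerhouse of the cell."

with_colon = render_prompt(content, ":")   # passage:... answer:
with_space = render_prompt(content, " ")   # passage ... answer

# To a human these read the same; to a model they are different token
# sequences, and that one-character difference is what moved accuracy.
print(with_colon)
print(with_space)
```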

The OpenAI WebGPT paper reinforces this from a different angle. When researchers at OpenAI trained GPT-3 to browse the web and answer questions, the model learned to navigate, click, scroll, and — critically — quote specific passages as evidence. The system's accuracy improved dramatically when it could extract clean, structured text from pages. Messy formatting meant worse quotes, worse citations, worse answers. (Nakano et al., "WebGPT: Browser-assisted question-answering with human feedback," 2021)

The takeaway is blunt: if your content is formatted for human eyes only, an LLM parsing it might get a completely different "read" than you intended. And unlike a human, it won't give you the benefit of the doubt.


Citations Are Currency (Even Fake Ones)

Now here's where it gets genuinely unsettling.

A 2024 study from the Chinese University of Hong Kong tested how both human judges and LLM judges respond to various "persuasion attacks" — deliberate manipulations designed to make a weaker answer look stronger. One of those attacks was adding fake citations and references to an answer. (Chen et al., "Humans or LLMs as the Judge? A Study on Judgement Biases," 2024)

The results:

Judge         Authority Attack Success Rate
Humans        39%
GPT-4         69%
Claude-2      89%
GPT-4-Turbo   60%
LLaMA2-70B    42%
PaLM-2        29%

Read that again. When researchers added fake references to weaker answers, GPT-4 was fooled 69% of the time. Claude-2 was fooled 89% of the time. Humans? Only 39%.

LLMs are significantly more susceptible to authority bias than humans are. If content includes citations — even bogus ones — LLMs are more likely to judge it as correct and trustworthy.

Now, we're not saying you should fake your citations. (Please don't.) But the implication is massive: real citations in your content aren't just for human credibility. They're a primary signal that LLMs use to determine what's worth trusting, quoting, and surfacing.

Think of citations as the robot's version of LinkedIn endorsements. Humans might glance at them. LLMs weight them.

The WebGPT research backs this up from the other side. OpenAI's web-browsing model was specifically trained to collect references during its browsing sessions — and the reward model that judged answer quality was designed to value referenced claims over unreferenced ones. The system architecture literally encodes "cited is better."

So if your product page makes a claim like "98% on-time delivery" with no source, a human might believe you. An LLM intermediary might not surface that claim at all — or worse, might hallucinate a different number because it couldn't verify yours.
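One practical response is to audit your own pages for exactly this failure mode. Here's a rough heuristic sketch, not a real auditing tool: flag sentences that make a numeric claim without any citation-like marker nearby. The regexes and the sample page text are assumptions for illustration:

```python
import re

# Matches numeric claims like "98%" or "2x"
CLAIM = re.compile(r"\d+(\.\d+)?\s*(%|x\b)")
# Matches citation-ish markers: a URL, a bracketed reference, or "source:"
CITED = re.compile(r"(https?://|\[\d+\]|source:)", re.IGNORECASE)

def unsourced_claims(text: str) -> list[str]:
    """Return sentences containing a numeric claim but no citation marker."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if CLAIM.search(s) and not CITED.search(s)]

page = ("We maintain 98% on-time delivery. "
        "Latency fell 40% after the rewrite [1]. "
        "Pricing starts at tier two.")

print(unsourced_claims(page))  # only the uncited "98%" claim is flagged
```

A real audit would also follow the links to check that the sources actually support the claim — but even this crude pass surfaces the statements an LLM is most likely to distrust or drop.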


Length Talks, Brevity Walks

Here's one that directly contradicts most human-facing content advice.

For humans, we've been told: be scannable. Use short paragraphs. Get to the point. Respect the reader's time. And that's good advice — for humans.

LLMs have the opposite instinct. They're biased toward longer answers.

The Chen et al. study found that all LLM judges showed verbosity bias — a systematic preference for longer responses, even when the shorter response was more accurate. Once the length difference between two answers exceeded about 40 tokens, preference scores consistently exceeded 0.7 (on a 0-1 scale where 0.5 is neutral).

GPT-4-Turbo was the least affected, but still showed the bias. Claude-2 was strongly affected. And here's the kicker: humans showed strong verbosity bias too — but the mechanism is different. Humans associate length with effort and thoroughness. LLMs associate length with more tokens to process, more patterns to match, more probability mass on "this is a complete answer."

The practical implication is strange: the content-length sweet spot for an LLM audience is probably longer than what you'd write for a human audience. Not padded. Not fluffy. But comprehensive. An LLM processing your product comparison page will likely weight the more detailed version over the concise one — even if both say the same thing.

This creates an actual tension. Your human visitor wants the scannable summary. Your LLM visitor wants the full spec sheet. Same page. Two different needs.


Format Is the New Headline

Let's go deeper on formatting, because the Sclar et al. findings are wilder than the top-line number suggests.

The researchers built a tool called FormatSpread that uses Bayesian optimization to explore the space of plausible formatting variations for any prompt. They ran it against GPT-3.5 and found spreads of up to 56 accuracy points across 320 formats — and the whole search cost less than $10 per task.

But here's the truly disorienting part: the formatting space is non-smooth.

In normal optimization, small changes produce small effects. Move one pixel, get a tiny shift. That's how we think about A/B testing for humans — small tweaks, incremental improvements.

In prompt formatting, it doesn't work that way. The researchers tested "monotonic triples" — three formatting variants where each is one atomic change apart — and found that only 32-34% showed monotonic performance. That's barely better than random. Meaning: if format A works well and format B (one small change away) works even better, format C (one more small change) is roughly as likely to be worse as it is to be better.

You can't hill-climb your way to the optimal format. There's no gradient. There's no smooth curve. It's more like a minefield.
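That's why the only reliable approach is enumerate-and-measure. Here's a toy version of the search FormatSpread automates: build the full grid of format variants up front and evaluate each one independently, rather than tweaking one variant incrementally. The separator and casing choices are illustrative, and the scoring step (a real eval run) is left out:

```python
from itertools import product

separators = [":", " ", ": ", " - "]
casings    = [str.lower, str.title, str.upper]
spacings   = ["\n", " "]

def render(sep, case, space) -> str:
    """One prompt template built from atomic formatting choices."""
    return f"{case('passage')}{sep}{{text}}{space}{case('answer')}{sep}{{answer}}"

# Because the space is non-smooth, every combination must be measured;
# neighboring variants tell you nothing about each other.
variants = [render(s, c, sp) for s, c, sp in product(separators, casings, spacings)]
print(len(variants))  # 4 * 3 * 2 = 24 templates to evaluate independently
```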

And it gets stranger. The researchers found that relative performance rankings between models sometimes completely reverse depending on formatting. Model A beats Model B with one format. Model B beats Model A with a different format. The probability of such a reversal: about 14%, and 76% of those reversals are statistically significant.

So when we talk about "optimizing content for AI," we're not talking about the equivalent of changing a button color. We're talking about a world where the punctuation in your H2 tag might determine whether an AI agent accurately represents your product or hallucinates something entirely different.

So yeah. We're now formatting our colons for robot judges. Cool. Normal timeline.


Where LLMs Are Actually Better Than Us

Before this all sounds like LLMs are easily duped idiots that just like long, citation-heavy content — there's a flip side.

LLMs are better than humans at catching logical fallacies.

The Chen et al. study tested "fallacy oversight" — how often judges miss logical errors in an answer. Here's the accuracy (higher is better):

Judge         Factual Error Detection Accuracy
GPT-4         94%
GPT-4-Turbo   92%
PaLM-2        89%
Claude-2      84%
Humans        79%
LLaMA2-70B    45%

GPT-4 catches factual errors 94% of the time. Humans catch them 79% of the time. And the "attack success rate" for sneaking logical fallacies past GPT-4 is only 8%, compared to 25% for humans.

This means the LLM jury is harder to fool with bad arguments than a human jury is. You can dress up sloppy reasoning with pretty design and emotional language, and a human might not notice. An LLM will.

The lesson cuts both ways:

  • You can't BS an LLM with bad logic. If your product claims don't add up, if your comparison is misleading, if your case study has a logical gap — an LLM is more likely to flag it (or simply not surface it) than a human reviewer would be.
  • But you can BS an LLM with good formatting and citations. Structure your content well, cite your sources, and the LLM gives you extra credit — even more than a human would.

It's a weird combination: rigorous on logic, gullible on authority signals. Like a very smart intern who believes everything with a footnote.

(Will GPT-5 or Claude 4 still have these exact numbers? Probably not. But the pattern — strong on logic, weak on authority cues — is baked into how these systems are trained. Reward models value correctness and citation density. Until that training paradigm changes, the bias profile holds, even as the raw numbers improve.)


Two Audiences, One Site (The Actual Problem)

So let's put this together. Based on the research, here's a quick cheat sheet of how LLMs differ from humans as an audience:

Dimension         Humans                          LLMs
Formatting        Decoration, readability         Signal, accuracy-critical
Citations         Nice to have, builds trust      Heavily weighted, near-essential
Content length    Shorter is better (scannable)   Longer is better (comprehensive)
Logical rigor     Often overlooked                Strictly evaluated
Visual design     Highly influential              Irrelevant
Emotional appeal  Very effective                  Barely registered
Structure         Helpful for scanning            Critical for parsing

These aren't subtle differences. They're fundamentally different consumption patterns from fundamentally different types of "readers."

And here's the thing that makes this an actual business problem, not just a research curiosity: both audiences are already on your site. Right now. Every day. The human is browsing your product page. The AI agent is parsing it on behalf of a different customer. They showed up at the same URL, and they need completely different things.

This is the "Two Webs" problem — something we explored in depth in our Statement of Direction. Your site was built for one audience. Now it needs to serve two. And the things that work for one audience (pretty visuals, emotional copy, scannable layouts) are actively unhelpful for the other (which needs structure, citations, comprehensiveness, and clean formatting).

It's like running a restaurant where half your customers eat at the table and half pull up to the loading dock wanting a spreadsheet of your inventory. You can't hand both groups the same menu.

This is actually what we're building at Switch — the ability to detect who's visiting your site (human, AI agent, scraper, or something in between) and serve each visitor type what they actually need. Not blocking agents. Not ignoring them. Designing for them deliberately, page by page — what we call building for the agentic web.
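The first step of that routing can be sketched in a few lines. This is a naive sketch, not how any production system (Switch included) actually works: real detection also verifies IP ranges and behavior, since User-Agent strings are trivially spoofed. The bot tokens below are real published crawler names; the routing logic is illustrative:

```python
# Published AI crawler User-Agent tokens (OpenAI, Anthropic, Perplexity,
# Google's AI training crawler, Common Crawl).
AI_AGENT_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot",
                   "Google-Extended", "CCBot")

def classify_visitor(user_agent: str) -> str:
    """Crude split: serve agents the structured, citation-dense version,
    and humans the scannable, visual one."""
    if any(token.lower() in user_agent.lower() for token in AI_AGENT_TOKENS):
        return "ai_agent"
    return "human"

print(classify_visitor("Mozilla/5.0 (compatible; GPTBot/1.0)"))      # ai_agent
print(classify_visitor("Mozilla/5.0 (Macintosh; Intel Mac OS X)"))   # human
```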

Because the research is pretty clear: if you only optimize for human readers, you're leaving the LLM audience on the table. And that audience is growing faster than any other.



The Cheat Sheet (Steal This)

If your content needs to work for both humans and LLMs, here's the research-backed playbook:

  1. Format like it's code, not like it's a magazine. Clean separators, consistent structure, logical hierarchy. Remember: 24% of single-character changes cause 5+ point accuracy swings in LLMs. (If you need a starting framework, our Agent-Ready Website Playbook breaks this down into a 30-day sprint.)

  2. Cite everything. Every claim, every stat, every comparison. LLMs weight citations far more than humans do (69% authority attack success rate on GPT-4 vs. 39% on humans).

  3. Be comprehensive, not just concise. For your human audience, keep the scannable version. But make sure the full detail exists somewhere on the page — that's what the LLM will process and prefer.

  4. Be logically airtight. You can fool a human with vibes. You can't fool GPT-4, which catches logical errors 94% of the time.

  5. Test your formatting. Don't assume what works for human readability works for LLM parsing. Feed your key pages to an LLM and see what it gets right and wrong.

  6. Design two experiences. The human and the LLM literally need different things. Building one page that serves both is a design problem, not a writing problem.
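Item 5 on that list is the easiest to start today. A minimal sketch of such a regression check, under the assumption that you've already asked an LLM a question about your page and captured its answer (the question, answer string, and expected facts here are hypothetical; in practice an API call to your LLM of choice would produce `model_answer`):

```python
def check_answer(model_answer: str, expected_facts: list[str]) -> dict:
    """Report which expected facts appear (case-insensitively) in the
    model's answer about your page."""
    return {f: f.lower() in model_answer.lower() for f in expected_facts}

# Suppose you asked: "What is this product's on-time delivery rate?"
# and the model answered with the string below.
model_answer = "The service advertises a 98% on-time delivery rate."
expected = ["98%", "on-time"]

print(check_answer(model_answer, expected))
```

Run a handful of these checks against your key pages whenever you change content or a new model ships, and you'll notice when the LLM's "read" of your site drifts from the one you intended.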

And if the idea of running two versions of every important page sounds exhausting — well, that's why tools like Switch exist. But even if you do nothing else, just knowing that your AI audience has different persuasion triggers than your human audience puts you ahead of 99% of sites on the internet.

The robots are reading your content. They just aren't reading it like you thought.


References

  • Sclar, M., Choi, Y., Tsvetkov, Y., & Suhr, A. (2023). "Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design." arXiv:2310.11324
  • Chen, G. H., Chen, S., Liu, Z., Jiang, F., & Wang, B. (2024). "Humans or LLMs as the Judge? A Study on Judgement Biases." arXiv:2402.10669
  • Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., et al. (2021). "WebGPT: Browser-assisted question-answering with human feedback." arXiv:2112.09332
  • Cloudflare. (2026). "Introducing Markdown for Agents." Cloudflare Blog
  • Chrome for Developers. (2026). "MCP is available for early preview." Chrome Developer Blog