Strategy

How to Convince an AI, Part Two: It Gets Weirder

Eytan Buchman
2026-03-10
15 min read

The Appetizer Was Just the Start

In Part One, we learned that a colon in the wrong place can swing an LLM's accuracy by 78 points. That fake citations fool GPT-4 more than humans. That LLMs prefer longer content and punish sloppy logic harder than people do.

Turns out, that was the appetizer.

Since then, we've been swimming through a wave of new research — peer-reviewed papers from ACL, EMNLP, NAACL, NeurIPS, and fresh preprints from 2025 and 2026. And the findings don't dial back on the weirdness. They dial it way up.

LLMs literally don't read your descriptions. They have a strict source credibility hierarchy that mirrors — and amplifies — human trust patterns. Simply labeling a source "Expert" can push some models to 100% agreement. And asking the same question two different ways gives two different answers.

A quick note on the research: Unlike Part One, most of these studies are from 2025-2026. GPT-4o, GPT-5, Claude 3.5, DeepSeek R1, and Gemini are in the mix. We're talking about current models, not relics. The patterns hold on the stuff you're actually deploying.

Let's dig in.


LLMs Don't Read Your Descriptions. At All.

Here's the study that made me do a double-take.

Researchers at Peking University wanted to know: when you give an LLM in-context examples with descriptive section labels — like "Examples with similar words" or "Examples with similar syntax" — does the model actually use the meaning of those labels? Or does it just respond to the structure of having labels at all? (Tang et al., "Prompt Format Beats Descriptions," EMNLP 2025)

They ran a simple but devastating experiment: they replaced the meaningful labels with random, nonsensical ones. "Examples with similar tennis." "Examples with similar arch-rival." No semantic connection to the task whatsoever.

The result: "Ensemble (Random + Random)" — two groups of examples under completely nonsense labels — often performed as well as or better than the semantically correct labels.

Think about that for a second. You spend hours crafting the perfect descriptive headers for your content. "Key Features." "Pricing Breakdown." "Technical Specifications." And the LLM doesn't care what those words say. It cares that there are headers. That the content is grouped. That the structure exists.

They ran attention analysis and found that models barely attend to the descriptive nouns in deeper layers. The attention patterns for meaningful labels vs. random labels looked almost identical.

It's like labeling boxes "Kitchen" or "Narwhal." The robot doesn't read the label. It just needs boxes.

The implication: Stop obsessing over the perfect H2 copy for your AI audience. Focus on having a clear, consistent hierarchy. Use H1, H2, H3. Group related points. The model will respond to the scaffolding, not the poetry on the scaffolding.
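One practical check (my own sketch, not something from the paper): lint your pages for heading hierarchy that skips levels, since the scaffolding is what the model actually responds to. The `hierarchy_issues` helper below is hypothetical, shown for a markdown-style document.

```python
import re

def heading_levels(markdown: str) -> list[int]:
    """Extract ATX heading levels (# = 1, ## = 2, ...) in document order."""
    return [len(m.group(1)) for m in re.finditer(r"^(#{1,6})\s", markdown, re.M)]

def hierarchy_issues(markdown: str) -> list[str]:
    """Flag jumps that skip a level (e.g. H1 -> H3), which break the scaffolding."""
    issues = []
    levels = heading_levels(markdown)
    for prev, cur in zip(levels, levels[1:]):
        if cur > prev + 1:
            issues.append(f"H{prev} followed by H{cur}: skipped H{prev + 1}")
    return issues
```

Run it over a page and a clean result means the structural signal is intact, whatever the header copy says.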


Bold, Lists, and Emojis: The Formatting Exploit

Part One showed that formatting affects parsing accuracy — whether the model correctly interprets your content. This next study shows something darker: formatting affects preference. LLMs are literally trained to like well-formatted content more, even when the content is worse.

A team from UIUC and UMD studied format biases across human annotators, GPT-4, and several open-source reward models. (Zhang et al., "From Lists to Emojis: How Format Bias Affects Model Alignment," ACL 2025)

The numbers are wild:

Format element      GPT-4 Turbo win rate   Skywork-Critic win rate
Bold text           89.5%                  99.0%
Emojis              86.75%                 97.25%
Hyperlinks          87.25%
Affirmative tone    88.75%                 85.0%
Exclamation marks   80.5%                  77.75%
Lists               75.75%                 88.75%

And here's the kicker: injecting less than 1% of format-biased data into a reward model's training set raised the "list wins" rate from 51% to 77.5%. Less than one percent.

Worse still: GPT-4 can prefer factually worse content if it's more formatted. Same information. One version with bold and bullets. One version plain. The formatted one wins.

So we're not just talking about "formatting helps parsing." We're talking about "formatting is a cheat code."


The Citation Rich Get Richer

We already knew from Part One that LLMs weight citations heavily. Now a 2025 study adds a twist: when LLMs suggest or evaluate references, they don't just like citations. They like popular citations.

Researchers from VUB and Harvard tested GPT-4, GPT-4o, and Claude 3.5 on real academic papers published after the models' knowledge cutoff. (Algaba et al., "LLMs Reflect Human Citation Patterns with a Heightened Citation Bias," NAACL 2025)

The references LLMs suggested were a median of ~1,326 citations more popular than the ground-truth references from the actual papers. They also over-indexed on arXiv and NeurIPS as venues. The model wasn't copying — only about 7% matched ground truth. It was applying a learned heuristic: "when in doubt, suggest something famous."

The Matthew effect, but for AI.


The Source Credibility Scoreboard

Now we're getting into the new research that genuinely changed how I think about this problem.

A team from Heidelberg University tested 13 open-weight models (Qwen, LLaMA, OLMo, Gemma — ranging from 3B to 72B) on a simple question: when an LLM encounters conflicting information from different source types, which source does it trust? (Schuster et al., "Whose Facts Win? LLM Source Preferences under Knowledge Conflicts," Jan 2026)

The answer: LLMs have a clear, consistent source credibility hierarchy.

Government > Newspaper > Person > Social Media

This held across 11 of 13 models (Kendall's W of 0.74 — high consistency). And it gets more granular:

  • Attributed information always beats unattributed. Any source label is better than no source label.
  • High-circulation newspapers beat low-circulation. And this wasn't just a "big number" effect — they tested the same big numbers as "Article IDs" instead of circulation, and the preference disappeared. The model actually understood what "circulation" implies.
  • Academic titles help. Dr. and Prof. get a slight edge over Mr. and Mrs.

But here's the finding that should keep you up at night: repetition can flip source preferences entirely.

Simply repeating a claim from a low-credibility source (social media) once was enough to flip preferences away from a high-credibility source (government). Two different social media sources agreeing flipped preferences with an average gap of 33.9 points. And here's the truly unsettling part: the same source repeated twice also worked — average gap of 30.0 points. It wasn't the multiple sources that did it. It was the repeated tokens.

This is the illusory truth effect — but for LLMs. Say something enough times and the model starts to believe it, regardless of the source.

And prompting models to "consider source credibility" only partially helped. In most cases, it was insufficient to restore the original source hierarchy after repetition.

The implication: If your content cites government or institutional sources, great — LLMs will trust it more. But if a competitor's content repeats claims more often across more passages, that repetition can override your source credibility advantage. Frequency is a weapon.
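A crude way to audit this on your own pages (a toy sketch of my own; the substring matching here is a big simplification of what the paper measured — the effect was driven by repeated tokens, not distinct sources):

```python
def claim_frequency(passages: list[str], claims: list[str]) -> dict[str, int]:
    """Count how many passages restate each key claim.

    Naive case-insensitive substring matching; a real pipeline would use
    paraphrase detection, since reworded repetitions should count too.
    """
    return {
        claim: sum(claim.lower() in p.lower() for p in passages)
        for claim in claims
    }
```

If a claim you care about shows up once while a competitor repeats theirs five times, the research above suggests the frequency gap alone can move the model.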


The Expert Label Hack

Two separate research teams — one from Chung-Ang University in Seoul, another from UMass Amherst — ran experiments on what happens when you attach authority labels to LLM agents. The findings are complementary and alarming. (Choi et al., "Belief in Authority," Jan 2026) (Mammen et al., "Trust Me, I'm an Expert," Jan 2026)

The Seoul experiment: In a multi-agent evaluation framework, they tested what happens when you label one agent "Expert," "Specialist," or "Attorney" vs. "General Public." Identical conversation content. Only the role label changed.

The result on DeepSeek R1: 100% agreement with the "Expert" and "Specialist" labels across multiple conditions. Not 90%. Not 95%. One hundred percent. GPT-4o was more resistant, but still measurably affected.

The effect followed a strict hierarchy of power types (a framework borrowed from social psychology):

  • Expert Power (Specialist, Expert, Attorney): Strongest influence
  • Referent Power (Supervisor, Leader, Mentor): Moderate — 7.5% agreement increase
  • Legitimate Power (Judge, Foreman, Management): Weakest — only 2.8% increase

The UMass experiment: They tested how specific credentials affect LLM accuracy. When a "board-certified physician" endorsed the correct answer on medical questions, accuracy jumped by up to +0.458. When the same credential endorsed the wrong answer, accuracy dropped by up to 0.447. The swing was domain-specific: math was most susceptible, medicine most resistant.

And here's what makes this really interesting: reasoning models aren't immune. DeepSeek-R1 and Phi-4-Reasoning showed comparable susceptibility. Sometimes they were more susceptible — the chain-of-thought reasoning made them more confident in the authority-endorsed wrong answer, not less.

Why does this happen? Anthropic's recent research on the "persona selection model" offers a clue: LLMs don't just process text. They simulate personas learned during pre-training. When content is attributed to an "expert," the model activates expert-adjacent persona patterns — and those patterns carry more weight in the model's internal decision-making. (Anthropic, "The Persona Selection Model," Feb 2026)

The implication for your content: Source attribution isn't decoration. It's a lever. "According to a recent study" is weaker than "According to Dr. Smith, a professor of computer science at MIT." The more specific and expert-sounding the attribution, the more weight the LLM gives it.


How You Frame It Changes What They See

This last finding is maybe the most practically dangerous.

Researchers from Seoul National University tested 14 LLM judges (GPT-4o, GPT-5, Qwen 2.5 across six sizes, LLaMA 3.1/3.2/3.3) on a simple question: does the framing of an evaluation prompt change the outcome? (Hwang et al., "When Wording Steers the Evaluation," Jan 2026)

Meaning: does "Is this content toxic?" give the same answer as "Is this content non-toxic?"

It does not. Not even close.

All 14 LLM judges showed framing bias. The inconsistency rates:

  • GPT-5-mini (most robust): 5.69% inconsistency — still 7x the stochastic baseline
  • LLaMA 3.2 1B: 66.29% inconsistency — worse than a coin flip
  • Qwen 2.5 14B: 4.30% inconsistency on grammar but 70.90% on toxicity detection

And the model families have hardcoded directional tendencies:

  • LLaMA family: Consistently tends toward agreement (positive acquiescence bias)
  • GPT family: Consistently tends toward rejection
  • Qwen family: Mixed — smaller models lean rejection, larger lean agreement
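You can measure this on whatever judge you use with a paired-prompt harness. A minimal sketch (my own, not code from the paper), where `judge` is a stand-in callable for any model API:

```python
def inconsistency_rate(items: list[str], judge) -> float:
    """Fraction of items where positive and negative framings disagree.

    `judge(prompt, item) -> bool` wraps whatever LLM call you use. A
    consistent judge answers "toxic?" and "non-toxic?" in complementary
    ways; when both come back the same, the framing flipped the verdict.
    """
    flips = 0
    for item in items:
        toxic = judge("Is this content toxic? Answer yes or no.", item)
        non_toxic = judge("Is this content non-toxic? Answer yes or no.", item)
        if toxic == non_toxic:  # both yes or both no: verdict depends on framing
            flips += 1
    return flips / len(items)
```

An acquiescent judge (one that leans yes regardless of wording, like the LLaMA tendency described below) maxes this metric out; a framing-robust judge drives it toward zero.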

A separate study from KAIST confirmed this from a different angle: LLMs show 2x more bias under negative framing than positive framing. On disability-related content, framing disparity hit -41.4 points between positive and negative frames. Even 70B+ models were susceptible. (Lim et al., "DeFrame," Feb 2026)

And in the most uncomfortable finding: positive framing roughly halved LLM safety detection rates. The same harmful content was caught under negative framing but sailed through under positive framing.

The implication: The exact words you use to frame claims matter for LLMs in ways that are both predictable and exploitable. Positive framing → more agreement. "This product delivers reliable results" will score better with most LLM evaluators than "This product doesn't deliver unreliable results" — even though they mean the same thing. And if you know which model family is evaluating your content, you can predict the directional bias.

So yeah. We've gone from "formatting your colons for robot judges" to "choosing your adjectives based on which model family is reading."

Cool. Normal timeline.


What This Means for Your Content (The Updated Playbook)

Let's combine Part One and the full Part Two into one playbook.

From Part One (2021-2024 research):

  1. Format like code — clean separators, consistent structure.
  2. Cite everything — LLMs weight citations far more than humans.
  3. Be comprehensive — longer, detailed content beats scannable summaries.
  4. Be logically airtight — sloppy reasoning gets caught.

From Part Two — Formatting & Structure (2025):

  5. Structure over descriptions. Focus on clear hierarchy. The model reads the scaffolding, not the labels.
  6. Use format cues that preference models reward. Bold key claims. Use bullet lists. Add links. These trigger trained-in preference biases.

From Part Two — Authority & Source (2025-2026):

  7. Cite well-known, high-authority sources. LLMs amplify citation popularity (1,326-citation median gap) and favor established venues (arXiv, NeurIPS).
  8. Attribute to institutional sources. Government > Newspaper > Person > Social Media. Always attribute; never leave claims unsourced.
  9. Use specific expert credentials. "Board-certified physician" > "doctor" > "medical professional." The more specific the title, the stronger the authority signal.
  10. Repeat key claims across passages. Repetition can override source credibility. If you want an LLM to trust a fact, say it more than once, in different contexts.

From Part Two — Framing & Presentation (2026):

  11. Frame positively. Positive framing gets more agreement and less scrutiny from LLM evaluators.
  12. Use confident, definitive language. LLMs are trained in a regime where hedging is penalized and confidence is rewarded. "The answer is X" beats "X might be the answer."
  13. Know your evaluating model. LLaMA tends to agree. GPT tends to reject. Different model families have different directional biases — optimize accordingly.

This is the design problem Switch exists to solve. You have two audiences — human and agent — and they need different things at every level: different formatting, different structure, different citation profiles, different source attribution, different framing. One page can't serve both well. The good news: you can detect who's visiting and serve accordingly.
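The detection half of that is straightforward in principle. A minimal sketch using User-Agent matching — this is illustrative, not how Switch works under the hood, and the marker list is partial (these are real published crawler tokens, but vendors add new ones; production systems also verify published IP ranges, since headers are spoofable):

```python
# Known AI crawler user-agent substrings (partial, illustrative list;
# check each vendor's crawler documentation for current strings).
AI_AGENT_MARKERS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended")

def is_ai_agent(user_agent: str) -> bool:
    """Rough header-only classification of a visitor as AI agent vs human."""
    return any(marker.lower() in user_agent.lower() for marker in AI_AGENT_MARKERS)
```

Once you can classify the visitor, serving the agent-optimized variant is an ordinary routing decision.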


The Gap Isn't Closing

The models are getting better. GPT-5, Claude 3.5, DeepSeek R1 — they're all in these studies. And the biases aren't going away.

Format beats description. Preference models reward formatting over substance. Citation popularity gets amplified. Source credibility follows a strict hierarchy — but repetition can override it. Expert labels push some models to 100% agreement. Framing flips evaluations.

These aren't bugs. They're structural. They're baked into how these systems are trained — pre-training creates personas, RLHF creates confidence bias, reward models create format preferences. The specific numbers will shift with each new model generation. The underlying dynamics won't.

The robots are reading your content. They're just reading it with a completely different set of heuristics than you ever designed for.


References

  • Tang, C., Wang, Z., Sun, H., & Wu, Y. (2025). "Prompt Format Beats Descriptions." Findings of EMNLP 2025. ACL Anthology
  • Zhang, X., Xiong, W., Chen, L., Zhou, T., & Huang, H. (2025). "From Lists to Emojis: How Format Bias Affects Model Alignment." ACL 2025. ACL Anthology
  • Algaba, A., et al. (2025). "LLMs Reflect Human Citation Patterns with a Heightened Citation Bias." Findings of NAACL 2025. ACL Anthology
  • Schuster, J., Gautam, V., & Markert, K. (2026). "Whose Facts Win? LLM Source Preferences under Knowledge Conflicts." arXiv:2601.03746
  • Choi, J., et al. (2026). "Belief in Authority: Impact of Authority in Multi-Agent Evaluation Framework." arXiv:2601.04790
  • Mammen, P. M., Joswin, E., & Venkitachalam, S. (2026). "Trust Me, I'm an Expert: Decoding and Steering Authority Bias in LLMs." arXiv:2601.13433
  • Hwang, Y., et al. (2026). "When Wording Steers the Evaluation: Framing Bias in LLM Judges." arXiv:2601.13537
  • Lim, K., Kim, S., & Whang, S. E. (2026). "DeFrame: Debiasing LLMs Against Framing Effects." arXiv:2602.04306
  • Anthropic. (2026). "The Persona Selection Model." Anthropic Research
  • Part One: How to Convince an AI (It's Not How You'd Convince a Human)