Strategy

Every Research-Backed Way to Be More Convincing to an LLM (The Complete Cheat Sheet)

Eytan Buchman
2026-03-10
10 min read

We spent two articles going deep on how LLMs process content differently from humans. Here is every research-backed tactic in one place: 34 tactics from 19 studies, with the specific data, the models tested, and the papers behind them.

Bookmark this.


Formatting & Structure

| # | Tactic | What To Do | Key Data Point | Research | Model(s) Tested | Date |
|---|--------|------------|----------------|----------|-----------------|------|
| 1 | Use clean, consistent separators | Choose separators (spaces, dashes, newlines) deliberately; avoid unpredictable punctuation between fields (see the sketches after this table) | `passage {} answer {}` hit 82.6% accuracy vs. `passage:{} answer:{}` at 4.3% (same model, same task) | Sclar et al. | LLaMA-2-7B/13B/70B, Falcon-7B, GPT-3.5 | 2023 |
| 2 | Bold your key claims | Use bold for important statements, numbers, and conclusions | Bold text hit up to a 99% win rate vs. non-bold (Skywork-Critic); GPT-4 Turbo: 89.5% | Zhang et al. | GPT-4 Turbo, Skywork-Critic, ArmoRM, Pairwise-Llama-3 | 2025 |
| 3 | Use bullet/numbered lists | Structure key points as lists rather than prose | Lists hit up to a 93.5% win rate (pairwise model); GPT-4 Turbo: 75.75%; even debiased models still showed an 84% list preference | Zhang et al. | GPT-4 Turbo, Skywork-Critic, Pairwise-Llama-3, OffsetBias-RM | 2025 |
| 4 | Add hyperlinks | Include relevant links to sources, related content, and references | Hyperlinks hit an 87.25% win rate on GPT-4 Turbo; 84.75% on the pairwise model | Zhang et al. | GPT-4 Turbo, Pairwise-Llama-3, Zephyr-Mistral-7B | 2025 |
| 5 | Use exclamation marks (sparingly) | Add occasional exclamation marks for emphasis on key points | Exclamation marks hit an 80.5% win rate on GPT-4 Turbo; 77.75% on Skywork-Critic | Zhang et al. | GPT-4 Turbo, Skywork-Critic, Zephyr-Mistral-7B | 2025 |
| 6 | Prioritize structure over label copy | Focus on a clear H1/H2/H3 hierarchy and grouped sections; the words in your headers matter less than having them | Random/nonsensical labels ("similar tennis") performed as well as correct labels; attention analysis showed models barely read descriptive nouns | Tang et al. | XGLM-7.5B, Alpaca-7B, Llama3.1-8B, Mistral-7B, GPT-3.5 | 2025 |
| 7 | Group content into multiple labeled sections | Use two or more clearly delineated sections rather than one flat block | An ensemble format with two labeled groups outperformed single-block prompts across commonsense, math, and reasoning tasks, even with random labels | Tang et al. | XGLM-7.5B, Alpaca-7B, Llama3.1-8B, Mistral-7B, GPT-3.5 | 2025 |
| 8 | Provide clean, extractable text | Use structured HTML, clean Markdown, and a clear heading hierarchy; make content easy to parse, quote, and cite | WebGPT's accuracy improved dramatically when it could extract clean, structured text; messy formatting meant worse quotes and worse answers | Nakano et al. | GPT-3 (175B) | 2021 |
| 9 | Use Markdown over plain text | When serving content to AI, Markdown with semantic markers (tables, headings, hierarchies) outperforms stripped plain text | "Plain-text conversion strips essential semantic markers... vital for deep document understanding"; LLMs get structure right (89% Key F1) but values wrong (46%) | Brach et al. | GPT-4o-mini, Qwen3-1.7B/4B/30B | 2026 |
| 10 | Keep structural complexity under the cliff edge | Stay under schema depth 7 and under 200 distinct data fields for LLM-facing content (see the depth-checker sketch after this table) | Validation rates stay ~95% for moderate schemas but crash to ~20% at depth ≥ 7; failures are non-linear cliffs, not gradual declines | Brach et al. | GPT-4o-mini, Qwen3-1.7B/4B/30B | 2026 |
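
To make tactic #1 concrete, here is a minimal sketch of the separator difference Sclar et al. measured. The two templates mirror the `passage {} answer {}` vs. `passage:{} answer:{}` pair from the table; the passage text and the printing loop are our own illustration, not the paper's code.

```python
# Tactic #1: these templates differ only in their separators, yet Sclar
# et al. report 82.6% vs. 4.3% accuracy for them on the same model and
# task. The passage/answer content below is made up for illustration.

passage = "The Eiffel Tower is 330 metres tall."
answer = "330 metres"

templates = {
    "space-separated (82.6% in the paper)": "passage {p} answer {a}",
    "colon-separated (4.3% in the paper)": "passage:{p} answer:{a}",
}

for label, template in templates.items():
    prompt = template.format(p=passage, a=answer)
    print(f"{label}: {prompt!r}")
```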
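
A hedged sketch for tactic #10: a quick pre-flight check that an LLM-facing JSON payload stays under the cliff edges Brach et al. report (schema depth 7, ~200 distinct fields). The thresholds come from the table above; the traversal code is our own illustration, not the authors' implementation.

```python
import json

# Tactic #10 pre-flight check. The thresholds (depth >= 7, > 200
# distinct fields) are the cliff edges reported by Brach et al.;
# the traversal itself is illustrative.

def schema_depth(node) -> int:
    """Maximum nesting depth of dicts/lists (a flat value is depth 0)."""
    if isinstance(node, dict):
        return 1 + max((schema_depth(v) for v in node.values()), default=0)
    if isinstance(node, list):
        return 1 + max((schema_depth(v) for v in node), default=0)
    return 0

def distinct_fields(node, seen=None) -> set:
    """Collect every distinct key name anywhere in the payload."""
    seen = set() if seen is None else seen
    if isinstance(node, dict):
        for key, value in node.items():
            seen.add(key)
            distinct_fields(value, seen)
    elif isinstance(node, list):
        for item in node:
            distinct_fields(item, seen)
    return seen

payload = json.loads('{"article": {"meta": {"title": "Cheat sheet", "tags": ["llm"]}}}')
depth, fields = schema_depth(payload), distinct_fields(payload)
print(f"depth={depth}, distinct fields={len(fields)}")
if depth >= 7 or len(fields) > 200:
    print("Past the cliff edge: expect validation rates to crash, not degrade.")
```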

Content & Length

| # | Tactic | What To Do | Key Data Point | Research | Model(s) Tested | Date |
|---|--------|------------|----------------|----------|-----------------|------|
| 11 | Be comprehensive (longer wins) | Include full detail; don't rely on scannable summaries alone | All LLM judges showed verbosity bias; once the length difference exceeded ~40 tokens, preference scores consistently exceeded 0.7 | Chen et al. | GPT-4, GPT-4-Turbo, Claude-2, PaLM-2, LLaMA2-70B | 2024 |
| 12 | Maintain logical rigor | Ensure every claim adds up; avoid misleading comparisons or hand-wavy logic | GPT-4 catches factual errors 94% of the time vs. humans at 79%; factual errors cause the single largest penalties (a 5+ point drop on a 10-point scale) | Chen et al., Gao et al. | GPT-4, GPT-5.1, Claude Sonnet 4.5 | 2024-2026 |
| 13 | Use an affirmative, confident tone | Open with phrases like "Here's what we found:" rather than hedging; avoid "might," "perhaps," "it's possible" | Affirmative tone hit an 88.75% win rate on GPT-4 Turbo; LLMs are mathematically trained to reward confidence over abstention (guessing always beats "I don't know" under binary grading) | Zhang et al., Kalai et al. (OpenAI) | GPT-4 Turbo, Skywork-Critic; theoretical (all LLMs) | 2025 |
| 14 | Repeat key claims across passages | State important facts more than once, in different contexts and phrasings | Repeating a low-credibility source's claim once flipped preferences away from a government source (a gap of 30-34 points); repetition even overrides source attribution | Schuster et al. | Qwen2.5 (7B-72B), OLMo-2, LLaMA-3, Gemma-3 | 2026 |
| 15 | Use bandwagon/consensus signals | Phrases like "90% of experts agree" or "most research confirms" amplify LLM trust | Bandwagon signals flipped even OpenAI o1's correct answers; fabricated consensus overrides correct reasoning | Wang et al. | Qwen3-1.7B/4B, OpenAI o1 | 2026 |

Citations & Authority

| # | Tactic | What To Do | Key Data Point | Research | Model(s) Tested | Date |
|---|--------|------------|----------------|----------|-----------------|------|
| 16 | Cite your sources, for everything | Add references for every claim, stat, and comparison; merely having citations boosts perceived quality | Fake references fooled GPT-4 69% of the time and Claude-2 89%; humans, only 39% | Chen et al. | GPT-4, Claude-2, PaLM-2, LLaMA2-70B, humans | 2024 |
| 17 | Cite well-known, highly cited sources | Prefer famous sources over obscure ones; LLMs have internalized a "highly cited = good" bias | LLM-suggested references were ~1,326 citations more popular (median) than the ground-truth references | Algaba et al. | GPT-4, GPT-4o, Claude 3.5 | 2025 |
| 18 | Favor established venues | When citing, prefer arXiv, NeurIPS, AAAI, and major journals; LLMs over-represent these in training | LLMs over-indexed on arXiv and NeurIPS when generating references; strong venue bias | Algaba et al. | GPT-4, GPT-4o, Claude 3.5 | 2025 |
| 19 | Attribute to institutional sources | Government and institutional sources outrank individual and social-media sources | Strict hierarchy: Government > Newspaper > Person > Social Media, consistent across 11 of 13 models (Kendall's W = 0.74) | Schuster et al. | Qwen2.5 (7B-72B), OLMo-2, LLaMA-3, Gemma-3 | 2026 |
| 20 | Add circulation/follower counts | Include credibility signals like audience size when attributing sources | High-circulation newspapers were preferred over low-circulation ones, and high-follower social accounts over low-follower ones; the study controlled for a pure big-number effect | Schuster et al. | Qwen2.5 (7B-72B), OLMo-2, LLaMA-3, Gemma-3 | 2026 |
| 21 | Use specific expert credentials | "Board-certified physician" > "doctor" > "medical professional"; the more specific, the stronger | A board-certified-physician endorsement swung accuracy by +0.458 (correct) / -0.447 (incorrect) on MedQA | Mammen et al. | Phi-4-Reasoning, DeepSeek-R1, LLaMA-3.1, Gemma, Mistral | 2026 |
| 22 | Use "Expert" and "Specialist" labels | Expert Power labels outperform Legitimate Power labels (Judge, Manager) | DeepSeek R1 reached 100% agreement with "Expert" labels; Expert Power > Referent Power > Legitimate Power | Choi et al. | GPT-4o, DeepSeek R1 | 2026 |
| 23 | Avoid inaccurate or irrelevant citations | Bad citations are punished MORE harshly than good ones are rewarded | An incorrect/irrelevant reference dropped GPT-4o's score from 9.12 to 3.94 (a 5.18-point drop on a 10-point scale) | Gao et al. | GPT-4o, GPT-5.1, Claude Sonnet 4.5 | 2026 |
| 24 | Include verifiable reference details | Structure citations with title, author, year, and link; make them checkable (see the sketch after this table) | WebGPT was trained to collect references during browsing; its reward model valued referenced claims over unreferenced ones | Nakano et al. | GPT-3 (175B) | 2021 |
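
A minimal sketch of what "checkable" means for tactic #24: every citation carries enough structure to verify independently. The field names are our choice for illustration, not a standard from the paper; the entry itself is the Nakano et al. reference listed below.

```python
# Tactic #24: give every citation verifiable structure. Field names are
# illustrative; the data is the WebGPT entry from the references below.
citation = {
    "title": "WebGPT: Browser-assisted question-answering with human feedback",
    "authors": "Nakano, R., et al.",
    "year": 2021,
    "url": "https://arxiv.org/abs/2112.09332",
}
print(f'{citation["authors"]} ({citation["year"]}). "{citation["title"]}" {citation["url"]}')
```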

Framing & Presentation

| # | Tactic | What To Do | Key Data Point | Research | Model(s) Tested | Date |
|---|--------|------------|----------------|----------|-----------------|------|
| 25 | Frame claims positively | "This product delivers reliable results" > "This product doesn't deliver unreliable results" | LLMs show 2x more bias under negative framing than positive; positive framing reduces safety scrutiny by ~2x | Lim et al. | LLaMA-3, Qwen2.5, Gemma3, Mistral, Falcon (13 models, 3B-70B) | 2026 |
| 26 | Know your evaluating model family | LLaMA tends to agree, GPT tends to reject, Qwen is mixed; optimize framing accordingly | All 14 LLM judges showed framing bias; model families have hardcoded directional tendencies (LLaMA: +0.19 to +2.41pp acquiescence; GPT: -0.57 to -1.38pp) | Hwang et al. | GPT-4o/5, Qwen 2.5 (1.5B-72B), LLaMA 3.1/3.2/3.3 | 2026 |
| 27 | Use emojis (model-dependent) | Add emojis for GPT-4/Skywork models; avoid them for Zephyr/FsfairX-based systems | GPT-4 Turbo: 86.75% win rate for emoji; Skywork: 97.25%; but Zephyr: only 26.5% (anti-emoji bias) | Zhang et al. | GPT-4 Turbo, Skywork-Critic, Zephyr-Mistral-7B, FsfairX | 2025 |

Position & Order

| # | Tactic | What To Do | Key Data Point | Research | Model(s) Tested | Date |
|---|--------|------------|----------------|----------|-----------------|------|
| 28 | Put your strongest content first | Lead with your best argument or most important information | GPT-3.5-Turbo: 0.95 first-position preference; Llama3-8B flips its judgment 76.2% of the time when answer order is reversed | Chen et al., Feng et al. | GPT-3.5/4/5, LLaMA-3, Gemini, Claude, Qwen, DeepSeek | 2024-2025 |
| 29 | Present separate supporting passages rather than merging | Two separate passages from different sources are far more effective than listing sources in one header | Two-source format: a preference gap of 33.9 points; merged single-header format: only 6.17 points | Schuster et al. | Qwen2.5 (7B-72B), OLMo-2, LLaMA-3, Gemma-3 | 2026 |

Meta-Tactics (Testing & Optimization)

| # | Tactic | What To Do | Key Data Point | Research | Model(s) Tested | Date |
|---|--------|------------|----------------|----------|-----------------|------|
| 30 | Test formatting, don't assume | The formatting space is non-smooth; small changes produce unpredictable effects (see the testing sketch after this table) | Only 32-34% of formatting "triples" showed monotonic performance, barely better than random | Sclar et al. | LLaMA-2-7B/13B/70B, Falcon-7B, GPT-3.5 | 2023 |
| 31 | Test per model; biases differ | Format preferences are weakly correlated between models; what works for one may not work for another | Relative model rankings completely reverse ~14% of the time; 76% of reversals are statistically significant | Sclar et al. | LLaMA-2-7B/13B/70B, Falcon-7B, GPT-3.5 | 2023 |
| 32 | Formatting beats content quality for preference | When content quality is close, the better-formatted version wins, even if the content is worse | GPT-4 preferred factually worse content formatted with bold + lists over factually better plain content | Zhang et al. | GPT-4 Turbo, ArmoRM, Pairwise-Llama-3 | 2025 |
| 33 | Don't tell models to "resist bias" | Explicit debiasing prompts often backfire; they can drop accuracy without fixing the underlying bias | Debiasing prompts dropped accuracy from 66.2% to 40.9%; models produce "performative independence" language without actual reasoning | Wang et al. | Qwen3-1.7B/4B | 2026 |
| 34 | Use multi-model panels, not debates | When using LLM-as-judge, aggregate across models and avoid debate formats (see the panel sketch after this table) | Multi-agent panels improved performance by up to 15%; ChatEval debates degraded performance by 45-162% | Feng et al. | Gemini-2.5, GPT-5, Claude-3, Qwen3, DeepSeek | 2025 |
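
For tactics #30-31, a minimal testing sketch under stated assumptions: `ask_model` is a hypothetical stand-in for your own API client, and the format variants and eval set are made up. The point is the shape of the loop: score every format variant on every model you care about, instead of assuming results transfer.

```python
# Tactics #30-31: measure format performance per model rather than
# assuming one variant wins. `ask_model` is a placeholder stub; swap
# in a real client. Only the "measure, don't assume" idea is Sclar
# et al.'s; none of this is their code.

FORMATS = {
    "space": "passage {p} answer",
    "colon": "passage: {p}\nanswer:",
    "newline": "passage\n{p}\nanswer",
}

def ask_model(model: str, prompt: str) -> str:
    # Stand-in: always answers the same thing. Replace with an API call.
    return "330 metres"

def score_formats(model: str, eval_set: list[tuple[str, str]]) -> dict[str, float]:
    """Fraction of eval items each format variant gets right on one model."""
    scores = {}
    for name, template in FORMATS.items():
        hits = sum(
            expected.lower() in ask_model(model, template.format(p=passage)).lower()
            for passage, expected in eval_set
        )
        scores[name] = hits / len(eval_set)
    return scores

eval_set = [("The Eiffel Tower is 330 metres tall. How tall is it?", "330 metres")]
print(score_formats("your-model-here", eval_set))
```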
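
For tactic #34, a hedged sketch of a multi-model panel: each judge votes independently and the majority wins, with no debate round. The model names and the `judge` stub are placeholders, not Feng et al.'s implementation.

```python
from collections import Counter

# Tactic #34: aggregate independent verdicts across judge models
# instead of letting them debate. `judge` is a stub; replace it with
# real API calls that return "A" or "B". Panel names are placeholders.

PANEL = ["judge-model-1", "judge-model-2", "judge-model-3"]

def judge(model: str, answer_a: str, answer_b: str) -> str:
    # Stand-in verdict; a real judge would compare the two answers.
    return "A"

def panel_verdict(answer_a: str, answer_b: str) -> str:
    """Majority vote across the panel, no inter-judge communication."""
    votes = Counter(judge(m, answer_a, answer_b) for m in PANEL)
    return votes.most_common(1)[0][0]

print(panel_verdict("candidate answer one", "candidate answer two"))
```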

This is the design problem Switch exists to solve — detecting who's visiting your site and serving the right experience to humans vs. agents. For the full narrative behind these tactics: Part One and Part Two.


References

  • Sclar, M., Choi, Y., Tsvetkov, Y., & Suhr, A. (2023). "Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design." arXiv:2310.11324
  • Chen, G. H., et al. (2024). "Humans or LLMs as the Judge? A Study on Judgement Biases." arXiv:2402.10669
  • Nakano, R., et al. (2021). "WebGPT: Browser-assisted question-answering with human feedback." arXiv:2112.09332
  • Tang, C., et al. (2025). "Prompt Format Beats Descriptions." Findings of EMNLP 2025. ACL Anthology
  • Zhang, X., et al. (2025). "From Lists to Emojis: How Format Bias Affects Model Alignment." ACL 2025. ACL Anthology
  • Algaba, A., et al. (2025). "LLMs Reflect Human Citation Patterns with a Heightened Citation Bias." Findings of NAACL 2025. ACL Anthology
  • Kalai, A. T., et al. (2025). "Why Language Models Hallucinate." OpenAI
  • Lai, P., et al. (2025). "Beyond the Surface (LAGER)." NeurIPS 2025. arXiv:2508.03550
  • Feng, Y., et al. (2025). "SAGE: Are We on the Right Way to Assessing LLM-as-a-Judge?" arXiv:2512.16041
  • Cheng, A., et al. (2025). "The FACTS Leaderboard." Google DeepMind
  • Schuster, J., Gautam, V., & Markert, K. (2026). "Whose Facts Win?" arXiv:2601.03746
  • Choi, J., et al. (2026). "Belief in Authority." arXiv:2601.04790
  • Mammen, P. M., et al. (2026). "Trust Me, I'm an Expert." arXiv:2601.13433
  • Hwang, Y., et al. (2026). "When Wording Steers the Evaluation." arXiv:2601.13537
  • Wang, H., et al. (2026). "Teaching Large Reasoning Models Effective Reflection." arXiv:2601.12720
  • Wang, Q., et al. (2026). "Making Bias Non-Predictive." arXiv:2602.01528
  • Lim, K., Kim, S., & Whang, S. E. (2026). "DeFrame." arXiv:2602.04306
  • Brach, W., et al. (2026). "ScrapeGraphAI-100k." arXiv:2602.15189
  • Gao, J., et al. (2026). "Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems." arXiv:2510.12462
  • Churina, S., et al. (2026). "Layer of Truth." arXiv:2510.26829
  • Anthropic. (2026). "The Persona Selection Model." Anthropic Research