Prompt Diff Tool

Client-Side

Compare two prompts side by side and highlight differences. Optimize your LLM prompts by seeing exactly what changed between versions.

Prompt Engineering Tips

Use this tool to A/B test your prompts. Small wording changes can significantly affect LLM output quality. Track character and token counts to stay within model context limits. The token estimate assumes roughly 4 characters per token, which is a workable approximation for English text but not an exact count; actual tokenization depends on the model.
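
The counts described above can be approximated in a few lines. This is a minimal sketch, not the tool's actual code: the function names are hypothetical, and rounding the estimate up is an assumption, since the page only states the ratio of about 4 characters per token.

```python
import math

# Rough heuristic from the tip above; real tokenizers vary by model and language.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Approximate token count as characters / 4, rounded up (rounding is an assumption)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def prompt_stats(text: str) -> dict:
    """Character, word, and estimated token counts for one prompt."""
    return {
        "chars": len(text),
        "words": len(text.split()),  # whitespace-separated words
        "est_tokens": estimate_tokens(text),
    }


def stats_delta(prompt_a: str, prompt_b: str) -> dict:
    """Signed change in each count going from prompt A to prompt B."""
    a, b = prompt_stats(prompt_a), prompt_stats(prompt_b)
    return {key: b[key] - a[key] for key in a}
```

For anything close to a context limit, the model's own tokenizer is the safer reference, since a character-based estimate can be off in either direction.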

About This Tool

You're tweaking an LLM prompt to fix a specific failure mode, and you want to compare your iteration against the previous version to see exactly what changed. Long prompts develop subtle drift over time — a word swapped here, an example added there — and the cumulative effect is hard to track when you're three iterations deep.

Drop your two prompt versions side by side. The diff highlights additions, deletions, and changed words. Useful for verifying that a "small tweak" was actually small, for code review when a prompt is part of a pull request, and for documenting what changed when behavior shifts between versions. The diff treats whitespace as a real character — in prompt engineering, whitespace genuinely affects model behavior.

The algorithm is a standard text diff (typically Myers' algorithm or one of its variants). It finds the shortest sequence of insertions and deletions that turns one text into the other, which is equivalent to finding their longest common subsequence and marking everything outside it as added or removed. Word-level diff treats the text as a sequence of tokens (words plus punctuation), which produces readable output for prose. Character-level diff treats every character independently, which catches small punctuation changes but produces visual noise for normal edits. The default is word-level, with a character-level option for fine-grained inspection. Whitespace handling is configurable but defaults to significant, meaning a difference in spacing or line breaks shows as a change rather than being normalized away.
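
The longest-common-subsequence idea can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: it uses a plain dynamic-programming table rather than Myers' algorithm (same kind of result, but quadratic time and memory), and the tokenizer regex and the keep_whitespace flag are hypothetical stand-ins for the word-level and whitespace settings described above.

```python
import re


def tokenize(text, keep_whitespace=True):
    """Split into words, whitespace runs, and single punctuation marks."""
    tokens = re.findall(r"\w+|\s+|[^\w\s]", text)
    if not keep_whitespace:
        # Normalizing whitespace away: spacing differences no longer show up.
        tokens = [t for t in tokens if not t.isspace()]
    return tokens


def lcs_diff(a, b):
    """Return a list of ("equal" | "delete" | "insert", token) operations."""
    n, m = len(a), len(b)
    # lcs[i][j] = length of the longest common subsequence of a[i:] and b[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    # Walk the table from the start, emitting one operation per token.
    ops, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            ops.append(("equal", a[i]))
            i, j = i + 1, j + 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            ops.append(("delete", a[i]))
            i += 1
        else:
            ops.append(("insert", b[j]))
            j += 1
    ops.extend(("delete", t) for t in a[i:])
    ops.extend(("insert", t) for t in b[j:])
    return ops
```

Passing list(text) instead of tokenize(text) gives a character-level diff with the same function, which is one way to see why character-level output gets noisy: every changed word becomes a run of single-character operations.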

A worked example: version 1 of your prompt says "You are a helpful assistant. Answer the user's question concisely and accurately." Version 2 says "You are a helpful, expert assistant. Answer the user's question concisely, accurately, and with citations where relevant." The diff shows ", expert" added after "helpful", and the closing list rewritten: the "and" between "concisely" and "accurately" is replaced by a comma, and ", and with citations where relevant" is added before the final period. Two substantive changes that together can shift the model's behavior: the "expert" framing tends to produce more confident answers, and the citation request changes how the model handles factual claims. Without the diff you might think you only added one thing; the diff makes both changes visible, which matters when one of them is responsible for the behavior shift you're seeing.
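
The worked example above can be reproduced with Python's standard difflib module. This is a sketch for experimentation only: difflib's SequenceMatcher uses its own matching heuristic rather than Myers' algorithm, so the exact grouping of changes may differ from what this tool displays, and the words helper is a hypothetical tokenizer.

```python
import difflib
import re

V1 = "You are a helpful assistant. Answer the user's question concisely and accurately."
V2 = ("You are a helpful, expert assistant. Answer the user's question "
      "concisely, accurately, and with citations where relevant.")


def words(text):
    """Words (apostrophes kept inside a word) and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


def word_changes(old, new):
    """Return (tag, removed_tokens, added_tokens) for each non-equal region."""
    a, b = words(old), words(new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


for tag, removed, added in word_changes(V1, V2):
    print(f"{tag:8} removed={removed} added={added}")
```

Because "and" and "accurately" appear in opposite order in the two versions, a sequence diff can treat only one of them as unchanged; the other shows up as removed in one place and added in another. That is expected behavior for this kind of diff, not a bug in the sketch.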

Why this matters more for prompts than for code: small changes can have outsized effects on model behavior, and the cause is often hard to attribute. A prompt that worked at temperature 0.7 might break at the same temperature after a single word change, and the diff is your only record of what the change was. Treat prompts like code: version them in a file or repo, commit each iteration, and use diffs in code review. Most prompt-engineering pain traces back to losing track of what changed between the working version and the broken one. For JSON or XML prompts, a structured diff compares the parsed structure rather than the raw text, which can be cleaner when the prompt has a defined schema, but plain-text diff handles the typical case fine. For very long prompts (tens of thousands of tokens), splitting into sections and diffing each section is more readable than one massive block diff.
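
For a JSON prompt, the structured comparison mentioned above might look like the sketch below. It is one possible approach, not a feature of this tool: it flattens each document into path/value pairs and compares those, so key order and formatting whitespace are ignored while string values are still compared exactly. The path notation and function names are assumptions made for illustration.

```python
import json


def flatten(value, path=""):
    """Map each leaf of a parsed JSON value to a path such as 'rules[1]'."""
    if isinstance(value, dict) and value:
        leaves = {}
        for key, child in value.items():
            leaves.update(flatten(child, f"{path}.{key}" if path else key))
        return leaves
    if isinstance(value, list) and value:
        leaves = {}
        for index, child in enumerate(value):
            leaves.update(flatten(child, f"{path}[{index}]"))
        return leaves
    # Scalars and empty containers are treated as leaves.
    return {path: value}


def json_diff(old_text, new_text):
    """Compare two JSON documents by path rather than by raw text."""
    old = flatten(json.loads(old_text))
    new = flatten(json.loads(new_text))
    return {
        "added": {p: new[p] for p in new.keys() - old.keys()},
        "removed": {p: old[p] for p in old.keys() - new.keys()},
        "changed": {
            p: (old[p], new[p])
            for p in old.keys() & new.keys()
            if old[p] != new[p]
        },
    }
```

One limitation of this path-based approach: inserting an item at the start of a list shifts every later index, so the result reports many changed paths. Dedicated structured-diff tools handle list alignment more carefully.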

The about text and FAQ on this page were drafted with AI assistance and reviewed by a member of the Coherence Daddy team before publishing. See our Content Policy for editorial standards.

Frequently Asked Questions

Why does whitespace count?
Because LLMs can be sensitive to it. A prompt with section headers separated by blank lines can behave differently from the same prompt without them. Treating whitespace as significant in the diff matches how the model actually reads the input, so a 'cosmetic' change isn't always cosmetic.
Does this work for very long prompts?
Yes, up to the input limits of the form (typically several thousand tokens). For prompts longer than that, splitting them into sections and diffing each makes the comparison easier to read anyway. Massive single-block diffs are hard to scan even when they technically work.
Can I diff structured prompts (JSON, XML)?
Plain-text diff handles structured prompts fine, but it compares them as raw text, word by word or character by character, rather than understanding the structure. Reordered keys or reformatted indentation show up as changes even when the data is identical. For semantically meaningful diffs of JSON or XML, a proper structured-diff tool will give cleaner results.
What's the best workflow for iterating on prompts?
Version your prompts in a file (or even a git repo), commit each iteration, and use this kind of diff to review. Treating prompts like code — with version history, diffs, and comments — makes regression debugging much easier than trying to remember what you changed last Tuesday.
Does word-level vs character-level diff matter?
Word-level is usually more readable for prose. Character-level catches small things like trailing punctuation changes but produces visual noise for normal edits. The default here is word-level with a character-level option for fine-grained inspection.
How do I track behavior changes across prompt versions?
Pair the diff with an eval set — a fixed list of inputs and expected outputs. Run both prompt versions against the eval set and compare results. The diff tells you what changed in the prompt; the eval tells you whether the change improved or regressed behavior on representative cases.
Should I use git for prompts?
Yes, especially for production prompts. Git's diff and merge tools are mature, and code review processes for prompts work well in the same flow as code review for code. Pull requests for prompt changes catch regressions early.
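
The eval-set pairing described in the FAQ above can be sketched as follows. Everything here is hypothetical: run_model stands in for whatever function calls your model, fake_model is a stub so the example runs offline, and exact string matching is a deliberately crude scoring rule that most real eval sets would replace with something more forgiving.

```python
from typing import Callable

# One eval case: (input text, expected output). Contents are made up for illustration.
EvalCase = tuple[str, str]


def score(prompt: str, cases: list[EvalCase], run_model: Callable[[str, str], str]) -> list[bool]:
    """For each case, did the model output match the expected output exactly?"""
    return [run_model(prompt, text).strip() == expected for text, expected in cases]


def compare_versions(prompt_a, prompt_b, cases, run_model):
    """Run both prompt versions on the same cases and report what moved."""
    a = score(prompt_a, cases, run_model)
    b = score(prompt_b, cases, run_model)
    return {
        "pass_a": sum(a),
        "pass_b": sum(b),
        "regressed": [cases[i][0] for i in range(len(cases)) if a[i] and not b[i]],
        "improved": [cases[i][0] for i in range(len(cases)) if b[i] and not a[i]],
    }


def fake_model(prompt: str, text: str) -> str:
    """Stub standing in for a real model call: uppercases if the prompt asks for it."""
    return text.upper() if "uppercase" in prompt else text


cases = [("abc", "ABC"), ("x", "x")]
report = compare_versions("Echo the input.", "Echo the input in uppercase.", cases, fake_model)
print(report)
```

Reading the regressed and improved lists next to the prompt diff is the point: the diff shows which words changed, and the lists show which inputs reacted to them.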