
Instruction format shootout

Prompt Engineering AI Toolkit


The hypothesis

Does the format you deliver skills, prompts, and tools in — JSON, XML, or Markdown — affect how fast and accurately AI coding agents execute?


The concept

AI coding agents like Claude Code and Codex consume instructions in whatever format you hand them — but nobody has tested whether the format itself changes the outcome. This experiment takes identical skills, prompts, and tool definitions, encodes them in JSON, XML, and Markdown, then runs the same tasks in fresh project sessions across multiple models. Same content, different packaging. Does the container change the result?


How it works

  1. Define test skills and tasks — identical content: skill definitions, tool schemas, task prompts
  2. Encode in three formats — JSON, XML, Markdown — semantically identical
  3. Fresh project per run — no context bleed between tests
  4. Run same prompt across models — Claude Code, Codex, others — same task each time
  5. Measure speed, accuracy, output quality — completion time, correctness, adherence to instructions
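Step 2 can be sketched in a few lines. This is a minimal illustration of "same content, different packaging": the skill fields (`name`, `description`, `input`) and the `summarize_file` example are hypothetical placeholders, not the experiment's actual schema.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical skill definition; field names are illustrative only.
skill = {
    "name": "summarize_file",
    "description": "Read a file and produce a three-sentence summary.",
    "input": "path",
}

# JSON encoding
as_json = json.dumps(skill, indent=2)

# XML encoding: one child element per field
root = ET.Element("skill")
for key, value in skill.items():
    ET.SubElement(root, key).text = value
as_xml = ET.tostring(root, encoding="unicode")

# Markdown encoding: heading, prose, bullet list
as_markdown = (
    f"## Skill: {skill['name']}\n\n"
    f"{skill['description']}\n\n"
    f"- input: {skill['input']}\n"
)
```

All three strings carry the same fields and values, so any measured difference between runs comes down to the container, not the content.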

Same instructions, three formats, fresh sessions, multiple models. Pure format comparison.
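A single trial from the flow above might look like the sketch below. The `execute` callable is a stand-in for whatever launches the agent (for example, a CLI wrapper around Claude Code or Codex); it, the task structure, and the correctness check are all assumptions for illustration, not the experiment's real harness.

```python
import time

def run_trial(model, fmt, task, execute):
    """Run one task in a fresh session; record timing and a correctness check.

    `execute` is a hypothetical stand-in for launching the agent.
    """
    start = time.perf_counter()
    output = execute(
        model=model,
        instructions=task["encoded"][fmt],  # same content, one of three formats
        prompt=task["prompt"],
    )
    elapsed = time.perf_counter() - start
    return {
        "model": model,
        "format": fmt,
        "seconds": round(elapsed, 3),
        "correct": task["check"](output),
    }

# Demo with a stubbed executor; replace with a real agent launcher.
def fake_execute(model, instructions, prompt):
    return "42"

task = {
    "encoded": {"json": "{...}", "xml": "<...>", "markdown": "# ..."},
    "prompt": "What is 6 * 7?",
    "check": lambda out: out.strip() == "42",
}

results = [
    run_trial("stub-model", fmt, task, fake_execute)
    for fmt in ("json", "xml", "markdown")
]
```

Each result row pairs a format with a timing and a pass/fail check, which is the shape of data the comparison needs.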


What it explores


What we found


Learnings


Where it goes next

We’ve settled on Markdown as the default for Orchesta’s skill definitions, with JSON for structured tool schemas. The more interesting thread: does format matter less as models improve? We’re planning to re-run this experiment every six months to track whether the gap narrows.
