Does the format you deliver skills, prompts, and tools in — JSON, XML, or Markdown — affect how fast and accurately AI coding agents execute?
AI coding agents like Claude Code and Codex consume instructions in whatever format you hand them, yet whether the format itself changes the outcome is rarely tested. This experiment takes identical skills, prompts, and tool definitions, encodes them in JSON, XML, and Markdown, then runs the same tasks in fresh project sessions across multiple models. Same content, different packaging. Does the container change the result?
Same instructions, three formats, fresh sessions, multiple models. Pure format comparison.
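To make "same content, different packaging" concrete, here is a minimal sketch of one skill definition rendered in each of the three formats using only Python's standard library. The skill itself (`summarize_diff` and its fields) is illustrative, not taken from the experiment:

```python
import json
import xml.etree.ElementTree as ET

# A hypothetical skill definition -- illustrative only, not from the experiment.
skill = {
    "name": "summarize_diff",
    "description": "Summarize a git diff in three bullet points.",
    "steps": ["Read the diff", "Group related changes", "Write three bullets"],
}

# JSON encoding: the structure is explicit, good for parseable output.
as_json = json.dumps(skill, indent=2)

# XML encoding: nesting is explicit but delimiters add overhead.
root = ET.Element("skill", name=skill["name"])
ET.SubElement(root, "description").text = skill["description"]
steps = ET.SubElement(root, "steps")
for step in skill["steps"]:
    ET.SubElement(steps, "step").text = step
as_xml = ET.tostring(root, encoding="unicode")

# Markdown encoding: lightweight, readable, structure is implied.
as_md = "# {name}\n\n{description}\n\n".format(**skill) + "\n".join(
    f"- {step}" for step in skill["steps"]
)

print(as_json)
print(as_xml)
print(as_md)
```

The experiment hands each of these encodings to a fresh agent session and compares speed and accuracy on the same tasks.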
Format primes output — JSON input leads to more structured agent output, making it the right choice when you need parseable responses.
XML’s nesting advantage for conditional logic comes at a token cost — only worth it for genuinely complex branching instructions.
Markdown is the safest default for portable skill definitions — it won’t win any single benchmark but won’t lose badly either.
Optimise for task type first, format second — the task you’re running matters more than how you encode the instructions.
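The token-cost trade-off above can be made concrete. As a rough sketch, using character count as a crude stand-in for token count (real counts are tokenizer-specific), the same conditional instruction carries more delimiter overhead in XML than in Markdown; the instruction text here is a made-up example:

```python
# The same conditional instruction in XML and Markdown.
# Character count is a crude proxy for token cost; real token
# counts vary by model and tokenizer.
xml_version = (
    "<rule><if>tests fail</if>"
    "<then>fix the failing test first</then>"
    "<else>proceed to lint</else></rule>"
)
md_version = "- If tests fail: fix the failing test first; otherwise proceed to lint."

print(len(xml_version), len(md_version))
```

For a one-line rule the XML overhead is modest; it compounds as branching depth grows, which is why nesting only pays off for genuinely complex conditional logic.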
We’ve settled on Markdown as the default for Orchesta’s skill definitions, with JSON for structured tool schemas. The more interesting thread: does format matter less as models improve? We’re planning to re-run this experiment every six months to track whether the gap narrows.