
Instruction format shootout

Prompt Engineering AI Toolkit


The hypothesis

Does the format you deliver skills, prompts, and tools in — JSON, XML, or Markdown — affect how fast and accurately AI coding agents execute?


The concept

AI coding agents like Claude Code and Codex consume instructions in whatever format you hand them — but nobody has tested whether the format itself changes the outcome. This experiment takes identical skills, prompts, and tool definitions, encodes them in JSON, XML, and Markdown, then runs the same tasks in fresh project sessions across multiple models. Same content, different packaging. Does the container change the result?


How it works

  1. Define test skills and tasks — identical content: skill definitions, tool schemas, task prompts
  2. Encode in three formats — JSON, XML, Markdown — semantically identical
  3. Fresh project per run — no context bleed between tests
  4. Run same prompt across models — Claude Code, Codex, others — same task each time
  5. Measure speed, accuracy, output quality — completion time, correctness, adherence to instructions
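Step 2 can be sketched in a few lines. This is a minimal illustration of "same content, different packaging": the skill fields (`name`, `description`, `input`) and the `summarize_file` example are hypothetical placeholders, not the experiment's actual schema.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical skill definition; field names are illustrative only.
skill = {
    "name": "summarize_file",
    "description": "Read a file and produce a three-sentence summary.",
    "input": "path",
}

# JSON encoding
as_json = json.dumps(skill, indent=2)

# XML encoding: one child element per field
root = ET.Element("skill")
for key, value in skill.items():
    ET.SubElement(root, key).text = value
as_xml = ET.tostring(root, encoding="unicode")

# Markdown encoding: heading, prose, bullet list
as_markdown = (
    f"## Skill: {skill['name']}\n\n"
    f"{skill['description']}\n\n"
    f"- input: {skill['input']}\n"
)
```

All three strings carry the same fields and values, so any measured difference between runs comes down to the container, not the content.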

Same instructions, three formats, fresh sessions, multiple models. Pure format comparison.
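A single trial from the flow above might look like the sketch below. The `execute` callable is a stand-in for whatever launches the agent (for example, a CLI wrapper around Claude Code or Codex); it, the task structure, and the correctness check are all assumptions for illustration, not the experiment's real harness.

```python
import time

def run_trial(model, fmt, task, execute):
    """Run one task in a fresh session; record timing and a correctness check.

    `execute` is a hypothetical stand-in for launching the agent.
    """
    start = time.perf_counter()
    output = execute(
        model=model,
        instructions=task["encoded"][fmt],  # same content, one of three formats
        prompt=task["prompt"],
    )
    elapsed = time.perf_counter() - start
    return {
        "model": model,
        "format": fmt,
        "seconds": round(elapsed, 3),
        "correct": task["check"](output),
    }

# Demo with a stubbed executor; replace with a real agent launcher.
def fake_execute(model, instructions, prompt):
    return "42"

task = {
    "encoded": {"json": "{...}", "xml": "<...>", "markdown": "# ..."},
    "prompt": "What is 6 * 7?",
    "check": lambda out: out.strip() == "42",
}

results = [
    run_trial("stub-model", fmt, task, fake_execute)
    for fmt in ("json", "xml", "markdown")
]
```

Each result row pairs a format with a timing and a pass/fail check, which is the shape of data the comparison needs.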


What it explores


What we found


Learnings


Where it goes next

We’ve settled on Markdown as the default for Orchesta’s skill definitions, with JSON for structured tool schemas. The more interesting thread: does format matter less as models improve? We’re planning to re-run this experiment every six months to track whether the gap narrows.
