Email variant generation engine

The hypothesis

Can an AI generate a full set of hypothesis-labelled email test variants — formatted for direct import into SFMC, HubSpot, or Braze — in minutes instead of days?

The concept

The bottleneck in email A/B testing is never the platform. It is the setup: writing alternative subject lines, reworking body copy, adjusting CTAs, briefing design, getting approvals, and configuring splits. This experiment builds a generation engine: give it a base email, a goal, and brand voice guidelines, and it produces a structured set of variants, each with a plain-English hypothesis label explaining exactly what it is testing and why. Output is formatted natively for the marketer’s platform, ready to deploy.

How it works

Input base email and goal — content, subject line, target outcome (e.g. click-through to pricing)
Ingest brand voice — tone guidelines, style rules, vocabulary constraints
Generate variant matrix — subject line tests, body structure tests, CTA tests, layout tests
Label each hypothesis — plain-English: “Variant C tests urgency-driven subject vs. curiosity-driven control”
Format for platform — SFMC, HubSpot, Braze, Klaviyo-ready output

From one email and a goal to a dozen hypothesis-driven variants in minutes, not days.
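
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the engine itself: the class names (BaseEmail, Treatment, Variant), the function names, and the sample copy are all hypothetical, the replacement text would come from a model constrained by brand voice rather than being hard-coded, and the JSON output is a generic structure, not the actual import schema of SFMC, HubSpot, Braze, or Klaviyo. Layout tests are left out for brevity.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class BaseEmail:
    subject: str
    body: str
    cta: str
    goal: str  # target outcome, e.g. "click-through to pricing"


@dataclass
class Treatment:
    dimension: str     # "subject", "body", or "cta"
    name: str          # what the variant tries, e.g. "urgency-driven"
    control_name: str  # how the control is described on the same dimension
    text: str          # replacement copy (hard-coded here; model-generated in practice)


@dataclass
class Variant:
    variant_id: str
    dimension: str
    subject: str
    body: str
    cta: str
    hypothesis: str


# Human-readable names used inside the hypothesis label.
DIMENSION_LABELS = {"subject": "subject", "body": "body", "cta": "CTA"}


def build_variants(base: BaseEmail, treatments: list[Treatment]) -> list[Variant]:
    """Create one variant per treatment, changing a single dimension each time."""
    variants = []
    for index, t in enumerate(treatments):
        if t.dimension not in DIMENSION_LABELS:
            raise ValueError(f"unknown dimension: {t.dimension}")
        variant_id = chr(ord("B") + index)  # "A" is reserved for the control
        fields = {"subject": base.subject, "body": base.body, "cta": base.cta}
        fields[t.dimension] = t.text  # everything else stays identical to the control
        hypothesis = (
            f"Variant {variant_id} tests {t.name} {DIMENSION_LABELS[t.dimension]} "
            f"vs. {t.control_name} control"
        )
        variants.append(Variant(variant_id, t.dimension, hypothesis=hypothesis, **fields))
    return variants


def to_export_json(variants: list[Variant]) -> str:
    """Serialize variants to generic JSON (an assumed shape, not a platform schema)."""
    return json.dumps([asdict(v) for v in variants], indent=2)


# Illustrative input only; none of this copy comes from the experiment.
base = BaseEmail(
    subject="Have you seen what changed?",
    body="We updated our plans. Take a look at what is new.",
    cta="Learn more",
    goal="click-through to pricing",
)
treatments = [
    Treatment("subject", "urgency-driven", "curiosity-driven",
              "Pricing changes Friday: review your plan"),
    Treatment("cta", "benefit-led", "generic", "See what you save"),
]
variants = build_variants(base, treatments)
for v in variants:
    print(v.hypothesis)
```

Changing exactly one dimension per variant is a design choice made for this sketch: it keeps each hypothesis label truthful, because any difference in results can be attributed to the single element named in the label.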

What it explores

Do hypothesis-labelled variants lead to more rigorous testing than ad-hoc A/B splits?
Can AI-generated variants maintain authentic brand voice across 12+ variations?
Which variant dimensions (subject, body, CTA, layout) produce the largest conversion deltas?
Does persona-aware variant generation outperform generic variants?
How much does platform-native formatting reduce marketer setup time?

What we found

Variant generation time dropped from 2–3 days to under 15 minutes
Hypothesis labels made test results more actionable — marketers knew exactly what they learned from each variant
Brand voice consistency scored 8.2/10 in blind reviews
Layout/structure variants (single CTA vs. multiple, text-heavy vs. image-led) outperformed copy-only variants by 2.3x on average
Persona-aware variants outperformed generic variants by 18–25% on click-through
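
For readers who want to reproduce this kind of comparison, the sketch below shows one common way to express a click-through difference as a relative lift and to check it with a two-proportion z-test. It assumes the percentages above are relative lifts, which the write-up does not state, and the counts in the example are invented for illustration; they are not data from this experiment. The function names are hypothetical.

```python
import math


def click_through_rate(clicks: int, delivered: int) -> float:
    """Clicks divided by delivered emails."""
    if delivered <= 0:
        raise ValueError("delivered must be positive")
    return clicks / delivered


def relative_lift(control_rate: float, variant_rate: float) -> float:
    """Relative change of the variant over the control (0.20 means +20%)."""
    if control_rate <= 0:
        raise ValueError("control_rate must be positive")
    return (variant_rate - control_rate) / control_rate


def two_proportion_z(clicks_a: int, n_a: int, clicks_b: int, n_b: int) -> float:
    """z statistic for the difference between two click-through rates (pooled variance)."""
    p_a = click_through_rate(clicks_a, n_a)
    p_b = click_through_rate(clicks_b, n_b)
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / std_err


# Made-up counts: control gets 400 clicks and the variant 480, each from 10,000 sends.
control_ctr = click_through_rate(400, 10_000)
variant_ctr = click_through_rate(480, 10_000)
lift = relative_lift(control_ctr, variant_ctr)
z = two_proportion_z(400, 10_000, 480, 10_000)
print(f"relative lift: {lift:.1%}, z = {z:.2f}")
```

A z value above roughly 1.96 in absolute terms corresponds to significance at the conventional 5% level for a two-sided test, which is a reasonable minimum bar before reading a lift as real.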

Learnings

Hypothesis labels turn tests from “which won” into “what we learned” — this pattern applies to any test surface, not just email.
Brand voice enforcement needs explicit style memory, not just a system prompt — tone drifts across 12+ variants without it.
Subject line variation produces diminishing returns quickly — structural changes to the email body are where real gains live.
Platform-native formatting eliminates the “last mile” friction that often delays deployment by days.
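
One way to picture the "explicit style memory" mentioned above is a small object that holds the voice rules, checks every generated variant against them, and carries approved examples forward into later generation requests. The sketch below is an assumption about what such a component could look like, not a description of how the experiment implemented it; the StyleMemory class, its fields, and the sample rules are hypothetical.

```python
from dataclasses import dataclass, field
import re


@dataclass
class StyleMemory:
    banned_terms: set[str]            # lowercase single words the brand avoids
    max_subject_chars: int = 60       # assumed limit, for illustration only
    approved_examples: list[str] = field(default_factory=list)

    def violations(self, subject: str, body: str) -> list[str]:
        """Return a list of rule violations for one variant (empty if none)."""
        problems = []
        if len(subject) > self.max_subject_chars:
            problems.append(f"subject longer than {self.max_subject_chars} characters")
        words = set(re.findall(r"[a-z']+", f"{subject} {body}".lower()))
        for term in sorted(self.banned_terms & words):
            problems.append(f"banned term: {term}")
        return problems

    def remember(self, subject: str, body: str) -> list[str]:
        """Check a variant and store its subject as an approved example if it passes."""
        problems = self.violations(subject, body)
        if not problems:
            self.approved_examples.append(subject)
        return problems

    def prompt_context(self, last_n: int = 3) -> str:
        """Text to prepend to every generation request, so later variants see the rules
        and the most recent approved examples instead of drifting."""
        lines = [f"Avoid these words: {', '.join(sorted(self.banned_terms))}."]
        lines.append(f"Keep subject lines under {self.max_subject_chars} characters.")
        for example in self.approved_examples[-last_n:]:
            lines.append(f"Approved example: {example}")
        return "\n".join(lines)


memory = StyleMemory(banned_terms={"guaranteed", "hurry"})
ok = memory.remember("See what changed in your plan", "Take a look at the new pricing.")
bad = memory.remember("Hurry! Guaranteed savings inside", "Act now.")
print(ok, bad)
print(memory.prompt_context())
```

The point of the sketch is that the rules and approved examples travel with every request and every output is checked, rather than relying on a single instruction at the start of a long run of variants.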

Where it goes next

The hypothesis-labelling pattern is worth applying beyond email. Any test — landing page, ad creative, lifecycle flow — benefits from a plain-English label that explains what you’re testing and why. We’re exploring this as a standard across Flywheel’s testing surfaces.