BakeMark: Benchmarking AI-Generated Landing Pages
LLMs can generate full web pages from a single prompt. But how do the results actually compare across models and design styles? I built BakeMark to find out.
The Idea
Take a simple theme — an artisan bakery — and generate landing pages across 47 prompts, each targeting a different design direction: minimalist, brutalist, glassmorphism, neomorphism, retro 90s, dark luxury, gradient aurora, corporate clean, and more.
Then run each prompt through four models:
- Claude Haiku 4.5
- GPT-5.4
- Claude Sonnet 4.6
- Claude Opus 4.6
That gives 188 generated pages (47 prompts × 4 models) that you can compare side by side.
How It Works
The main view is a table where each row is a prompt. Click a row to reveal the versions generated by each model. You can filter by model to focus on one at a time.
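Under the hood, the data behind that table can be thought of as a prompt-by-model matrix. Here's a minimal TypeScript sketch of how such a structure and the model filter might look; the type and function names (PromptEntry, GeneratedPage, filterByModel) are illustrative, not BakeMark's actual code.

```typescript
// Hypothetical shape of the benchmark data: one entry per prompt,
// with one generated page per model. Names are illustrative only.
type ModelId = "claude-haiku-4.5" | "gpt-5.4" | "claude-sonnet-4.6" | "claude-opus-4.6";

interface GeneratedPage {
  model: ModelId;
  html: string;        // the full page returned by the model
  generatedAt: string; // ISO timestamp
}

interface PromptEntry {
  id: string;
  style: string;          // e.g. "brutalist", "glassmorphism"
  prompt: string;         // the text sent to every model
  pages: GeneratedPage[]; // one per model, four in total
}

// Filtering the table by model reduces each row to a single page.
function filterByModel(entries: PromptEntry[], model: ModelId): PromptEntry[] {
  return entries.map((entry) => ({
    ...entry,
    pages: entry.pages.filter((page) => page.model === model),
  }));
}
```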
There's also a second exploration path: the Artists Collection. It explores 1,724 prompts inspired by artistic movements and visual directions, across multiple models, as a way to push the boundaries of what these models produce when given more expressive constraints.
What I Learned
- Using Copilot for this was tricky. When you ask a model to generate a page, it sometimes rewrites other models' pages in the process. Keeping each output isolated required careful orchestration.
- Claude Sonnet 4.6 stands out from the pack. Its outputs feel notably different — more opinionated, with stronger design choices. An open question: is that opinionation actually a shortcut to faster, better results?
- Generating prompts from Wikidata was surprisingly fun. I pulled artistic genres and movements from Wikidata to seed the Artists Collection. Some genre names are wonderfully weird — which made for unexpected and creative page variations.
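For the curious, here's roughly what that Wikidata step can look like. This is a sketch rather than BakeMark's exact pipeline: the SPARQL endpoint is the real one, but the Q-id used for art movements (Q968159), the prompt template, and the function names are assumptions made for illustration.

```typescript
// Sketch: pull art movement names from the Wikidata SPARQL endpoint
// and turn them into landing-page prompts for the bakery theme.
const SPARQL_ENDPOINT = "https://query.wikidata.org/sparql";

const query = `
  SELECT ?movementLabel WHERE {
    ?movement wdt:P31 wd:Q968159 .  # instance of: art movement (illustrative Q-id)
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  }
  LIMIT 200
`;

async function fetchMovements(): Promise<string[]> {
  const url = `${SPARQL_ENDPOINT}?query=${encodeURIComponent(query)}&format=json`;
  const res = await fetch(url, { headers: { Accept: "application/sparql-results+json" } });
  const data = await res.json();
  return data.results.bindings.map(
    (b: { movementLabel: { value: string } }) => b.movementLabel.value,
  );
}

// Each movement name becomes one prompt in the Artists Collection.
async function buildPrompts(): Promise<string[]> {
  const movements = await fetchMovements();
  return movements.map(
    (name) => `Design a landing page for an artisan bakery in the style of ${name}.`,
  );
}
```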
Try It
The full benchmark is live at bakemark.famat.me. Browse the prompts, compare the models, and explore the Artists Collection.