BakeMark: Benchmarking AI-Generated Landing Pages
LLMs can generate full web pages from a single prompt. But how do the results actually compare across models and design styles? I built BakeMark to find out.
The Idea
Take a simple theme — an artisan bakery — and generate landing pages across 47 prompts, each targeting a different design direction: minimalist, brutalist, glassmorphism, neomorphism, retro 90s, dark luxury, gradient aurora, corporate clean, and more.
Then run each prompt through four models:
- Claude Haiku 4.5
- GPT-5.4
- Claude Sonnet 4.6
- Claude Opus 4.6
That gives 188 generated pages (47 prompts × 4 models) that you can compare side by side.
How It Works
The main view is a table where each row is a prompt. Click a row to reveal the versions generated by each model. You can filter by model to focus on one at a time.
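Under the hood, the data behind that table can be thought of as a prompt-by-model matrix. Here's a minimal TypeScript sketch of how such a structure and the model filter might look; the type and function names (PromptEntry, GeneratedPage, filterByModel) are illustrative, not BakeMark's actual code.

```typescript
// Hypothetical shape of the benchmark data: one entry per prompt,
// with one generated page per model. Names are illustrative only.
type ModelId = "claude-haiku-4.5" | "gpt-5.4" | "claude-sonnet-4.6" | "claude-opus-4.6";

interface GeneratedPage {
  model: ModelId;
  html: string;        // the full page returned by the model
  generatedAt: string; // ISO timestamp
}

interface PromptEntry {
  id: string;
  style: string;          // e.g. "brutalist", "glassmorphism"
  prompt: string;         // the text sent to every model
  pages: GeneratedPage[]; // one per model, four in total
}

// Filtering the table by model reduces each row to a single page.
function filterByModel(entries: PromptEntry[], model: ModelId): PromptEntry[] {
  return entries.map((entry) => ({
    ...entry,
    pages: entry.pages.filter((page) => page.model === model),
  }));
}
```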
There's also a second exploration path: the Artists Collection. It explores 1,724 prompts inspired by artistic movements and visual directions, across multiple models, as a way to push the boundaries of what these models produce when given more expressive constraints.
What I Learned
- Using Copilot for this was tricky. When you ask a model to generate a page, it sometimes rewrites other models' pages in the process. Keeping each output isolated required careful orchestration.
- Claude Sonnet 4.6 stands out from the pack. Its outputs feel notably different — more opinionated, with stronger design choices. An open question: is that opinionation actually a shortcut to faster, better results?
- Generating prompts from Wikidata was surprisingly fun. I pulled artistic genres and movements from Wikidata to seed the Artists Collection. Some genre names are wonderfully weird — which made for unexpected and creative page variations.
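For the curious, here's roughly what that Wikidata step can look like. This is a sketch rather than BakeMark's exact pipeline: the SPARQL endpoint is the real one, but the Q-id used for art movements (Q968159), the prompt template, and the function names are assumptions made for illustration.

```typescript
// Sketch: pull art movement names from the Wikidata SPARQL endpoint
// and turn them into landing-page prompts for the bakery theme.
const SPARQL_ENDPOINT = "https://query.wikidata.org/sparql";

const query = `
  SELECT ?movementLabel WHERE {
    ?movement wdt:P31 wd:Q968159 .  # instance of: art movement (illustrative Q-id)
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  }
  LIMIT 200
`;

async function fetchMovements(): Promise<string[]> {
  const url = `${SPARQL_ENDPOINT}?query=${encodeURIComponent(query)}&format=json`;
  const res = await fetch(url, { headers: { Accept: "application/sparql-results+json" } });
  const data = await res.json();
  return data.results.bindings.map(
    (b: { movementLabel: { value: string } }) => b.movementLabel.value,
  );
}

// Each movement name becomes one prompt in the Artists Collection.
async function buildPrompts(): Promise<string[]> {
  const movements = await fetchMovements();
  return movements.map(
    (name) => `Design a landing page for an artisan bakery in the style of ${name}.`,
  );
}
```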
Try It
The full benchmark is live at bakemark.famat.me. Browse the prompts, compare the models, and explore the Artists Collection.