Preamble

A couple of summers ago I had what felt like a small idea: ask an LLM to write bedtime stories for kids. Glue images and audio on top. Call it a day.

The plan fit in a paragraph. The system as it stands today does not.

The prototype was called Fable — a script with two HTTP calls and a folder of .txt files. The current incarnation is Fabble (fabble.me): a serverless app on AWS that generates stories, validates them, illustrates them, narrates them, broadcasts a fresh one every week, and handles payments. Occasionally it writes a story I’m proud to read aloud.

This post is not a tutorial. It’s a collection of problems I didn’t expect to have — and an invitation to think about whether you would have expected them.

The constraint that changes everything

Here’s the thing nobody warns you about when you start a generative project for kids: writing a story is not like writing code.

Code compiles or it doesn’t. Tests pass or they fail. A story is read out loud, by a parent, to a child. It can’t be merely correct — it has to land.

Now think about what “landing” means, simultaneously:

  • A coherent plot. Beginning, middle, end.
  • A protagonist that stays the same — same name, same face — across prose, illustrations, audio.
  • Age-appropriate tone. A six-year-old doesn’t need existential crises.
  • Nothing you’d be embarrassed to read aloud.
  • Variety over time. Three “pirate story” requests shouldn’t produce three remixes of the same plot.
  • Cheap enough to keep doing.

Each of these has a tractable answer in isolation. The fun starts when you notice they fight each other. Coherence wants long prompts and big models. Variety wants randomness. Safety wants filters that kill personality. Cost wants short prompts and reuse.

Which brings us to the real question.

How do you not go broke?

Bedrock calls cost money. Image generation costs more. The naive system — every request triggers one text call, three image calls, one audio call — has a quarterly bill that would have killed this project before it started.

So here’s a puzzle: how do you serve N users without generating N stories?

Think about it for a second. Each request resolves to a small set of parameters: theme, setting, character, age band, tone, language. Call each valid combination a fingerprint. The combinatorial space is fixed by configuration: roughly \(F \approx 600\) valid fingerprints. Users don’t invent new themes on a Tuesday.

Now. If I’ve already generated a story for fingerprint \(f\), and a new user asks for the same \(f\), and they haven’t seen that story before — do I need to generate again?

No. I attach them to the existing story. The cost drops from “a Bedrock call plus images plus audio” to “two DynamoDB writes.”
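The lookup can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Fabble's actual code: the in-memory dicts stand in for the DynamoDB tables, and `fingerprint`, `StoryPool`, and the `generate` callback are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical request parameters; the keys mirror the list in the text.
Params = dict[str, str]

def fingerprint(params: Params) -> str:
    """Canonical key: equal parameters always map to the same fingerprint."""
    keys = ("theme", "setting", "character", "age_band", "tone", "language")
    return "|".join(params[k].strip().lower() for k in keys)

@dataclass
class StoryPool:
    # In-memory stand-ins for two tables (an assumption, not the real schema).
    stories: dict[str, list[str]] = field(default_factory=dict)  # fingerprint -> story ids
    seen: dict[str, set[str]] = field(default_factory=dict)      # user id -> story ids

    def serve(self, user: str, params: Params,
              generate: Callable[[Params], str]) -> tuple[str, bool]:
        """Return (story_id, was_generated)."""
        f = fingerprint(params)
        already_seen = self.seen.setdefault(user, set())
        for story_id in self.stories.get(f, []):
            if story_id not in already_seen:
                already_seen.add(story_id)  # cheap path: attach, no generation
                return story_id, False
        story_id = generate(params)         # expensive path: text, images, audio
        self.stories.setdefault(f, []).append(story_id)
        already_seen.add(story_id)
        return story_id, True
```

The second element of the returned tuple is the interesting one: every `False` is a request that was served without touching a model.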

If stories are spread roughly evenly across fingerprints, the hit rate at pool size \(n\) is approximately:

$$ H(n) \;\approx\; 1 - e^{-n/F} $$

With a warm pool — say 1600 stories across 600 fingerprints — roughly nine out of ten requests cost nothing generative. That ratio is the entire reason this project is financially possible.
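To check the arithmetic, here is a small sketch that evaluates the formula next to the occupancy probability it approximates. It assumes stories land uniformly at random across fingerprints and ignores the "user has already seen it" condition, so treat the numbers as a rough guide. The function names are mine.

```python
import math

def hit_rate_approx(n: int, F: int) -> float:
    """H(n) ~ 1 - exp(-n / F), the formula from the text."""
    return 1.0 - math.exp(-n / F)

def hit_rate_occupancy(n: int, F: int) -> float:
    """Chance that a given fingerprint holds at least one of n uniformly placed stories."""
    return 1.0 - (1.0 - 1.0 / F) ** n

def expected_generations(requests: int, n: int, F: int) -> float:
    """Requests that still need a full generation at pool size n."""
    return requests * (1.0 - hit_rate_approx(n, F))

if __name__ == "__main__":
    for n in (300, 600, 1600):
        print(n, round(hit_rate_approx(n, 600), 3), round(hit_rate_occupancy(n, 600), 3))
```

For \(n = 1600\) and \(F = 600\) both functions come out near 0.93, which is where "nine out of ten" comes from.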

Shrinking \(F\) raises the hit rate but kills variety. Growing it does the opposite. The cost-quality trade-off is, quite literally, a tunable knob. Where would you set it?

The weekly broadcast trick

On top of the reuse pool, once a week EventBridge fires a Lambda that generates a single canonical story — full assets — and emails it to all subscribers.

One Bedrock call. One set of images. One audio file. Distributed to N users. Per-user cost tends to zero as N grows.

The pool handles the long tail. The broadcast handles the head. Together they mean that on a quiet week, the marginal cost of a user reading “a new story” is often no generation at all.

A problem you probably didn’t think of

Here’s one that bit me: character consistency across illustrations.

If the protagonist is Tommaso in paragraph one, he should look like Tommaso in all three images. Same hair, same eyes, same outfit. Diffusion models, asked the same prompt twice, will happily produce two entirely different children. Asked the same prompt with the same seed, they’ll produce two slightly different children — which is somehow worse, because the eye is exquisitely sensitive to small face changes.

How would you solve this?

The instinct is “throw a bigger model at it.” That instinct is wrong. The answer is narrowing the space. Generative models are wonderful at filling space and terrible at staying inside lines. So you draw the lines for them — a fixed catalogue of characters with stable visual descriptors, shared prompt fragments reused verbatim across all images of a story, mechanical post-checks on the prose.

The general principle: don’t ask the model to be consistent. Make it impossible for it to be inconsistent.
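A minimal sketch of what "drawing the lines" might look like for image prompts. The catalogue entry, the descriptor text, and the `STYLE` string are all invented for illustration; the only point is that the character fragment is built once and reused verbatim in every prompt of a story.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Character:
    name: str
    visual: str  # stable descriptor, written once and never paraphrased

# Hypothetical catalogue entry; the descriptor is made up for this example.
CATALOGUE = {
    "tommaso": Character(
        name="Tommaso",
        visual="a seven-year-old boy with curly brown hair, green eyes, "
               "a red striped jumper and blue shorts",
    ),
}

STYLE = "soft watercolour children's book illustration"  # assumed house style

def image_prompts(character_id: str, scenes: list[str]) -> list[str]:
    """One prompt per scene; style and character fragments are identical in each."""
    c = CATALOGUE[character_id]
    return [f"{STYLE}. {c.name}, {c.visual}. Scene: {scene}" for scene in scenes]
```

Only the text after `Scene:` varies between the three images, so the model never gets a chance to reinterpret the child.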

Who judges a story?

Mechanical checks catch the obvious — name drift, age band violations, missing characters. They don’t catch tone.
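One of those checks, sketched under assumptions: the function name and the idea of comparing the prose against a set of catalogue names are mine, and a real check would also need to handle nicknames and inflected forms.

```python
import re

def check_characters(prose: str, expected: set[str], catalogue: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the check passed."""
    # Unicode-aware word tokens, so accented Italian words stay intact.
    words = set(re.findall(r"[^\W\d_]+", prose))
    problems = [f"missing character: {name}" for name in sorted(expected - words)]
    # Name drift: a catalogue character who was not requested shows up anyway.
    problems += [
        f"unexpected character: {name}"
        for name in sorted((catalogue - expected) & words)
    ]
    return problems
```

It is deliberately dumb. It cannot tell whether the story is any good, only whether the right people are in it.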

So here’s another puzzle: you’ve generated a story. It’s syntactically fine. The protagonist is consistent. The plot has three acts. But it’s… flat. Or slightly dark. Or patronizing. How do you catch that before it reaches a child?

My answer: LLM-as-judge. After generation, the same model evaluates the story against a short list of criteria. It says yes or no. It doesn’t rewrite: rewriting from bad input tends to turn out worse than regenerating with a different seed.

The judge is conservative on purpose. False negatives (good stories rejected) cost money. False positives (bad stories shipped) cost trust. Trust is harder to recover.

One subtlety worth thinking about: the criteria the judge enforces must be the same criteria the generator prompt is built from. Same vocabulary. When they diverge, you get a system that confidently generates content the same model will then reject. Which is funny exactly once.
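One way to keep the two from diverging is to build both prompts from a single list. A minimal sketch, with hypothetical criteria and prompt wording (none of it is Fabble's actual text), plus a deliberately strict parser for the verdict:

```python
# Hypothetical criteria; the wording is illustrative only.
CRITERIA = (
    "The plot has a clear beginning, middle and end.",
    "The tone is warm and suitable for the stated age band.",
    "Nothing in the story would embarrass a parent reading it aloud.",
)

def _rules() -> str:
    return "\n".join(f"- {c}" for c in CRITERIA)

def generator_prompt(brief: str) -> str:
    return f"Write a bedtime story.\nBrief: {brief}\nThe story must satisfy:\n{_rules()}"

def judge_prompt(story: str) -> str:
    return (
        "Does the story satisfy every criterion below? Answer YES or NO only.\n"
        f"{_rules()}\n\nStory:\n{story}"
    )

def accept(verdict: str) -> bool:
    """Conservative parse: anything other than an unambiguous YES is a rejection."""
    return verdict.strip().upper().rstrip(".") == "YES"
```

Because `CRITERIA` is the single source, editing a criterion changes the generator and the judge in the same commit. The strict `accept` matches the conservative stance: a hedged answer counts as a no.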

The audio problem nobody talks about

The audience speaks Italian. Italian isn’t supported by Amazon Polly’s expressive tier. The older neural voices are competent but flat: correct prosody, no performance. A parent reading aloud beats them on every dimension that matters.

But even if you had a perfect voice, you’d still have a problem: TTS gives you a single mp3. There’s no hook for placing a sound effect at “and the door creaked open.” No music bed underneath. No way to compose.

So here’s the question: how do you build a soundstage on top of a narrator that only gives you back audio?

The key insight — and I’ll leave you to think about how before I say it — is that you don’t need to listen to the audio to know where things happen in it. If you have per-character timing alignment, you can find any phrase on the timeline by string matching alone.
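A minimal sketch of that lookup. It assumes the alignment arrives as two parallel lists, one entry per character and one start time in seconds per entry. That shape is an assumption about the TTS response, and `phrase_start` is a hypothetical helper.

```python
from typing import Optional

def phrase_start(chars: list[str], starts: list[float], phrase: str) -> Optional[float]:
    """Start time in seconds of the first occurrence of phrase, or None if absent.

    Assumes chars holds exactly one character per entry, so an index into the
    joined text is also an index into starts.
    """
    text = "".join(chars)
    index = text.find(phrase)
    if not phrase or index < 0:
        return None
    return starts[index]
```

No audio is decoded anywhere in this function. The timeline position falls out of `str.find`.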

From there: a curated library of ~50 spot SFX, a small LLM call that picks which sounds fit this story and anchors them to literal substrings of the prose, and a final mix step. The result is a narration that performs the story, a bed that sets the room, and a handful of well-placed sounds that punctuate it.
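Anchoring to literal substrings also makes the annotator's output easy to police. Here is a sketch of a filter that drops anything the mixer could not place. The `Cue` shape, the `max_cues` cap, and the library ids are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cue:
    sfx_id: str  # must exist in the curated library
    anchor: str  # must be a literal substring of the prose

def valid_cues(cues: list[Cue], prose: str, library: set[str],
               max_cues: int = 5) -> list[Cue]:
    """Keep only cues that can actually be placed, in reading order."""
    kept = [
        c for c in cues
        if c.sfx_id in library and c.anchor and c.anchor in prose
    ]
    kept.sort(key=lambda c: prose.find(c.anchor))
    return kept[:max_cues]
```

If the model invents a sound that is not in the library, or paraphrases the anchor instead of quoting it, the cue silently disappears rather than landing in the wrong place.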

The economics are surprisingly gentle. The SFX library is generated once, offline. The annotator is a cheap model call. The mix is a few hundred milliseconds of compute. And because it composes with the reuse and broadcast tricks, a weekly broadcast still pays for exactly one narration and one mix.

What this taught me

When I write a backend function and forget about it, it sits in production doing its thing. Nobody reads it. Nobody opens it on a Tuesday evening to feel something. The function is a green checkmark.

A story is the opposite. The output is the entire product.

Every architectural choice in Fabble is, at root, about variance: the catalogue narrows it; the judge prunes it; the reuse pool ensures good outputs get seen more than once; the broadcast amortizes compute across everyone.

The hard problem turned out to be staying honest about quality across a pipeline where every stage adds plausibility and removes verifiability. By the time a story has prose, images, and audio, it looks polished. Looking polished is precisely the failure mode you have to guard against.

For now: it writes stories. Sometimes good ones. And it doesn’t bankrupt me. By the standards of side projects with LLMs in them, that’s already further than I dared plan for.