Preamble
A couple of summers ago I had what felt like a small, contained idea: ask an LLM to write bedtime stories for kids, give it some structure, glue images and audio on top of the text, and call it a day. The plan, written down on a Sunday afternoon, fit in a paragraph. The system as it stands today does not.
The original prototype was called Fable. It was a script with two HTTP calls in it and a folder full of .txt files. The current incarnation is Fabble (fabble.me), a tier-gated serverless application on AWS that generates and validates stories, makes illustrations and audio on demand, broadcasts a fresh story to subscribed users every week, deals with payments via BuyMeACoffee, and has just enough operational scaffolding to wake me up when the weekly cron decides to misbehave at 17:00 UTC on a Wednesday. It also has the property that, occasionally, it writes a story I’m proud to read aloud.
This post is about the gap between those two states, and about a thing I keep underestimating: writing a story is not like writing code. Code either compiles or it doesn’t. Tests pass or they fail. You can sleep on a green CI run. A story, on the other hand, is seen. It’s read out loud, by a parent, to a child. It cannot be merely “correct” — it has to land. Form, rhythm, tone, the name of the protagonist, whether the moral feels patronizing or earned — none of these things show up in a unit test. None of them are guarded by a type checker. They’re just there, in plain text, in front of the only judge that matters, and that judge is six years old and not interested in your nine-nines of uptime.
The trick is that the surrounding system is code. So this post is, in a way, two stories braided together: the boring, deterministic story of the AWS-shaped pipeline, and the much less boring story of what that pipeline can and cannot guarantee about the thing it produces.
I’ll keep the architecture intentionally abstract where it touches anything I’d rather not have indexed. The interesting parts are not the secrets anyway. The interesting parts are the constraints.
The shape of the problem
Let me start with what makes this domain weird, because every design decision below is downstream of it.
A “good” generated story has at least the following properties:
- Coherent plot. Beginning, middle, end. Setup, conflict, resolution. Not just a vibes-driven word cloud about friendship.
- A protagonist with a name and a consistent identity. Across the prose, across the illustrations, across the audio. If the character is “Tommaso, six years old, curious” in paragraph one, he can’t become “Lucia, who is afraid of the dark” by paragraph three. And he certainly can’t have brown eyes on page one and a tail on page four.
- Age-appropriate tone. A six-year-old does not need an existential crisis. A three-year-old does not need a five-syllable adjective in every sentence. A nine-year-old, conversely, will notice if you talk down to them and resent you for it.
- No profanity, no violence, no themes you’d be embarrassed to read aloud. This is the floor, not the ceiling.
- Variety over time. If a kid asks for “a pirate story” three Saturdays in a row, they should not get three minor remixes of the same plot.
- Cheap enough that I can keep doing this. This is, painfully, not a soft constraint.
Each of these has a tractable engineering answer in isolation. The fun is that they pull against each other. Coherence wants long prompts and big models. Variety wants randomness. Safety wants filters that often kill the personality of the prose. Cost wants short prompts, small models, and reuse. The cheapest story I can produce is the one I already produced last week and never showed anyone — and as it turns out, that’s a real lever I ended up pulling.
The shape of the system
The system is serverless on AWS, single-region, single account, with a handful of third-party services bolted in where AWS doesn’t cover the ground. The high-level picture is unremarkable but worth drawing once:

A few notes on the deliberate parts of that picture:
- One DynamoDB table. Single-table with a small zoo of PK/SK shapes: profiles, stories, requests, configuration, broadcasts, suppressions, ephemeral locks. Five GSIs, each one earning its keep for a specific access pattern (story dedup, listing recent broadcasts, narrow projection reads). This is the kind of design you only really appreciate after you’ve watched a relational schema slowly collapse under multi-tenant access patterns; here it just refuses to grow extra moving parts.
- One real “synchronous” surface. The API Lambda. Everything else is fire-and-forget, async-invoked, or driven by a schedule, because story generation is slow (seconds), image generation is slower (tens of seconds), and audio is in the middle. If you make any of that synchronous from a browser request you’ll spend the rest of your life debugging timeouts and double-clicks.
- Cognito does the unsexy work. User pool, JWTs, pre-token-generation trigger to mirror entitlement into claims, post-confirmation trigger to provision the profile row, custom email sender for the verification flow. None of this is interesting until you remove it; then it becomes the only thing that’s interesting.
- EventBridge owns the clock. Two daily cron rules (premium expiry reminder, premium downgrade) and one weekly rule (the broadcast). Enabled in prod, disabled in dev. The state difference is the whole reason dev hasn’t quietly sent a 1:00 AM email to everyone on the test list at some point.
If you squint, this is a fairly textbook AWS serverless application. The interesting bits are not in the boxes; they’re in the arrows.
Money, or: the thing that actually shapes the design
Bedrock calls cost money. fal.ai image calls cost more money. Polly is the cheap one. The naive system — every user request triggers one text call, three image calls, one audio call, every time — has a per-story cost dominated by images, and a quarterly bill that would have ended this project before it started.
So the system spends most of its cleverness on not generating things. Three mechanisms, roughly in order of how often they kick in:
1. Tier-gated entitlement. Free, storyteller, storymaker. The free tier gets text. Audio and images appear only at higher tiers. Daily, weekly, monthly, and library-size quotas live in the profile row and are checked by the API Lambda before anything generative happens. Cognito mirrors the tier into the access token so the SPA can render the right UI without an extra round trip, but the API never trusts that claim for authorization — the actual quota check happens server-side against the profile in DynamoDB. The frontend says what’s visible; the backend says what’s allowed. These are different things, and one of them is the security boundary.
2. Fingerprint-based story reuse. Each generation request resolves to a small set of parameters: theme, setting, archetype, age band, tone, language. From those I compute a deterministic fingerprint. Before generating anything new, the API checks whether there’s an unseen, eligible story in the pool with the same fingerprint that this user hasn’t received recently. If yes, the user is attached to that existing story as a new REQUEST row — pointer-style — and the story-generate Lambda is never invoked. The cost of that request collapses from “a Bedrock call plus possibly images plus possibly audio” to “two DynamoDB writes”. The pool size, the re-propose window in days, and the maximum custom generations per month live in a CONFIG#globals row so I can tune them without redeploying.
3. Weekly broadcasts. Once a week, an EventBridge rule fires a Lambda that picks a fresh theme via Bedrock, materializes a single canonical story for that ISO week, generates the assets once, writes a BROADCAST pointer to it, and emails confirmed users. Every subscribed user reads the same physical row. One Bedrock call, one set of images, one audio file — distributed to N users. If N grows, the per-user cost of the weekly story tends to zero.
The combined effect is that on a quiet week, with a typical mix of users, the marginal cost of one user logging in and “reading a new story” is no generation at all. They read a story that was already generated for the same parameters they picked — same theme, same character, same age band — and that they themselves haven’t yet seen. That’s the economic premise of the product, not a clever trick: a story that fits the request is a story that fits the request, whether it was created five minutes ago for this user or five days ago for another. The harder engineering problem, sitting on top of this, is making sure that the story actually does fit the parameters the user picked, every time, and that’s where the prompt engineering and the configuration data earn their keep.
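To make the reuse mechanism concrete, here is a minimal sketch of a deterministic fingerprint and the "is there an unseen story for it?" lookup. The six parameters come from the description above; the field names, the SHA-256 choice, the in-memory `pool` list, and both function names are assumptions for illustration, and the re-propose window is left out for brevity.

```python
import hashlib
import json
from typing import List, Optional, Set

# Hypothetical field names for the six parameters named in the post.
FINGERPRINT_FIELDS = ("theme", "setting", "archetype", "age_band", "tone", "language")


def story_fingerprint(params: dict) -> str:
    """Return a deterministic hex digest for a set of generation parameters."""
    # Normalise case and whitespace so "Pirates " and "pirates" map to the same slot.
    canonical = {f: str(params[f]).strip().lower() for f in FINGERPRINT_FIELDS}
    # sort_keys plus fixed separators keep the serialised bytes stable across runs.
    payload = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def pick_reusable_story(pool: List[dict], fingerprint: str, seen_ids: Set[str]) -> Optional[dict]:
    """Return the first pooled story with this fingerprint that the user has not seen."""
    for story in pool:
        if story["fingerprint"] == fingerprint and story["story_id"] not in seen_ids:
            return story
    return None  # caller falls through to a fresh generation


params = {"theme": "pirates", "setting": "island", "archetype": "explorer",
          "age_band": "5-7", "tone": "gentle", "language": "it"}
fp = story_fingerprint(params)
pool = [{"story_id": "s1", "fingerprint": fp}, {"story_id": "s2", "fingerprint": fp}]
match = pick_reusable_story(pool, fp, seen_ids={"s1"})
```

In a real deployment the lookup would presumably be a query against one of the GSIs rather than a list scan; the point of the sketch is only that anything outside the six fingerprint fields (user id, request time) must not influence the digest.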
Does the cache actually fire?
It’s worth doing the arithmetic on this, because the whole product economics ride on it.
The fingerprint space is a combinatorial product of the configuration vocabulary:
$$ F \;=\; M \cdot A \cdot T \cdot L $$
where \(M\) is the number of valid theme-character mappings (each theme is constrained to a subset of compatible characters and settings), \(A\) is the number of age bands, \(T\) the number of tone variants, and \(L\) the number of supported languages. With the current configuration that’s roughly \(M \approx 50\), \(A \approx 3\), \(T \approx 2\), \(L \approx 2\), giving \(F \approx 600\). Not enormous. Not minuscule. The exact figure isn’t the point; what matters is that \(F\) is fixed by configuration, not by user input. Users don’t get to invent a new theme on a Tuesday.
Let \(R\) be the total number of generation requests landing in the re-propose window \(W\) (currently 180 days). With \(U\) active users averaging \(r\) requests per month:
$$ R \;\approx\; U \cdot r \cdot \frac{W}{30} $$
Assume requests are spread roughly uniformly over the \(F\) fingerprints (they aren’t quite, since some themes are more popular, but it’s close enough for a back-of-envelope estimate). The number of distinct stories \(n\) in the active pool then reaches a steady state where the number of stories created per unit time equals the number aging out. The hit rate at pool size \(n\), for a fresh request, is approximately:
$$ H(n) \;\approx\; 1 - \left(1 - \frac{1}{F}\right)^{n} \;\approx\; 1 - e^{-n/F} $$
That’s the probability that at least one story matching the requested fingerprint already exists in the pool [1]. The user-history filter (a user shouldn’t re-see the same story too soon) erodes this slightly, but as long as \(n\) is much larger than the number of stories any single user has seen, the erosion is small.
At steady state, the pool size satisfies \(n \approx R \cdot e^{-n/F}\) (every request either creates a story or attaches). Let \(\rho = R/F\) — the load factor, requests per fingerprint slot per window. Then the steady-state hit rate solves implicitly:
$$ H \;=\; 1 - \frac{n}{R}, \qquad n = R \cdot e^{-n/F} $$
The intuition without the algebra: writing \(x = n/F\), the fixed point becomes \(x e^{x} = \rho\). When \(\rho\) is small (few requests per fingerprint), \(x \approx \rho\), the cache is mostly cold, and \(H = 1 - e^{-x} \approx \rho\). When \(\rho\) is large, \(H \to 1\) and the system asymptotically generates one story per fingerprint per window. With illustrative numbers (say 1000 active users, 4 requests/month, \(W = 180\) days, \(F = 600\)) we get \(R \approx 24{,}000\), \(\rho \approx 40\), and the steady-state pool stabilizes around \(n \approx 1600\) stories, yielding \(H \approx 93\%\).
In plain English: once the system has warmed up, roughly nine out of ten requests cost two DynamoDB writes instead of a Bedrock call plus image generation. That ratio is the entire reason this project is financially possible. It’s also why I treat \(F\) as a real design parameter and not a config detail — shrinking \(F\) (fewer themes, more aggressive grouping) raises \(H\) but at the cost of variety; growing \(F\) does the opposite. The cost-quality trade-off is, very literally, an arithmetic curve.
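The fixed point \(n = R \cdot e^{-n/F}\) has no elementary closed form, but it is easy to solve numerically. The sketch below uses plain bisection; the function names are mine, and the inputs are the illustrative figures from the text rather than measured traffic.

```python
import math


def steady_state_pool(R: float, F: float, iterations: int = 200) -> float:
    """Solve n = R * exp(-n / F) for n by bisection on [0, R]."""
    lo, hi = 0.0, float(R)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        # g(n) = n - R*exp(-n/F) is increasing, negative at 0 and positive at R.
        if mid - R * math.exp(-mid / F) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def steady_state_hit_rate(R: float, F: float) -> float:
    """Hit rate H = 1 - n/R at the steady-state pool size."""
    return 1 - steady_state_pool(R, F) / R


U, r, W, F = 1000, 4, 180, 600   # illustrative values from the text
R = U * r * W / 30               # requests landing in the window
n = steady_state_pool(R, F)
H = steady_state_hit_rate(R, F)
print(f"R={R:.0f}  rho={R / F:.0f}  n={n:.0f}  H={H:.1%}")
```

Sweeping `F` or `U` through this function is a quick way to see the trade-off described above: a larger vocabulary lowers the hit rate at a fixed request volume, and more users raise it.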
The weekly broadcast then sits on top of this and pushes the asymptote further: one generated story, \(U\) readers, marginal-cost-per-user tending to zero.
The character consistency problem
This is the part where the gap between “writing software” and “writing stories” yawns widest.
If the protagonist of paragraph one is Tommaso, then Tommaso appears in the title image, the mid-story image, the closing image. He has the same hair, the same eye colour, the same outfit, possibly the same companion animal. Diffusion models, asked the same prompt twice, will happily produce two entirely different children. Asked the same prompt with the same seed, they’ll often produce two slightly different children, which is somehow worse, because the eye is exquisitely sensitive to small face changes.
I do not have a heroic solution to this. I have a set of mitigations that, combined, get the output to “passable for a five-year-old who isn’t doing forensic analysis”:
- A fixed catalogue of characters and settings in a CONFIG#characters / CONFIG#themes / CONFIG#themeCharacters triple. The story generator doesn’t invent a protagonist from scratch; it picks from a curated set with stable visual descriptors. The themes constrain the settings; the settings constrain the compatible characters. The combinatorial space is large enough that stories don’t feel repetitive, and small enough that I can hand-tune the visual prompts per character.
- A shared visual prompt fragment for each character, reused verbatim across all three image generations of a given story. The prompt for the mid-story scene is “the same character” plus the scene context, not a fresh description.
- Mechanical post-checks on the prose: did the protagonist’s name change between paragraphs? Did the age band drift? Did a new named character appear in the second half who wasn’t introduced in the first? These don’t require an LLM. A small validator on the structured sidecar the model emits is enough.
The thing I keep relearning: this is not solvable by throwing a bigger model at it. It’s solvable by narrowing the space. Generative models are wonderful at filling space and terrible at staying inside lines. So you draw the lines for them.
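The mechanical post-checks listed above lend themselves to a few lines of plain code. The sketch below assumes a hypothetical sidecar shape (a top-level `age_band` plus, per paragraph, a `protagonist` and a list of `named_characters`); the post does not specify the real format, so treat every key name here as illustrative.

```python
def validate_sidecar(sidecar: dict, params: dict) -> list:
    """Return a list of problems; an empty list means all mechanical checks passed."""
    paragraphs = sidecar.get("paragraphs") or []
    if not paragraphs:
        return ["sidecar contains no paragraphs"]
    problems = []

    # Check 1: the protagonist must keep the requested name in every paragraph.
    expected = params["protagonist_name"]
    for i, para in enumerate(paragraphs, start=1):
        if para.get("protagonist") != expected:
            problems.append(f"paragraph {i}: protagonist is {para.get('protagonist')!r}, expected {expected!r}")

    # Check 2: the age band reported by the model must match the request.
    if sidecar.get("age_band") != params["age_band"]:
        problems.append(f"age band drifted to {sidecar.get('age_band')!r}")

    # Check 3: nobody named in the second half may be new to the story.
    half = (len(paragraphs) + 1) // 2  # a middle paragraph counts as first half
    introduced = set()
    for para in paragraphs[:half]:
        introduced.update(para.get("named_characters", []))
    for i, para in enumerate(paragraphs[half:], start=half + 1):
        for name in para.get("named_characters", []):
            if name not in introduced:
                problems.append(f"paragraph {i}: {name!r} appears without being introduced")
    return problems


params = {"protagonist_name": "Tommaso", "age_band": "5-7"}
drifted = {"age_band": "8-10", "paragraphs": [
    {"protagonist": "Tommaso", "named_characters": ["Tommaso"]},
    {"protagonist": "Lucia", "named_characters": ["Lucia", "Marco"]},
]}
for problem in validate_sidecar(drifted, params):
    print(problem)
```

Returning a list of reasons rather than a boolean keeps the failure explainable, which matters when a rejected story has to be traced back to a prompt change.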
The judge
Mechanical checks catch the obvious. They do not catch tone.
For tone I lean on LLM-as-judge, in the dullest possible way. After the story-generate Lambda gets prose back from Bedrock, it sends the result back through the model with a different prompt: rate this story against a short list of criteria — age-appropriateness, length, protagonist presence, absence of named third parties not introduced in the setup, originality relative to a small sampled set of recent stories, absence of darkness or violence the parameters didn’t ask for. The judge returns a structured verdict. If the verdict fails, the story is marked failed in DynamoDB, the request stays attached to a pending state, and the user gets a polite retry rather than a bad story.
This is, frankly, the single most cost-multiplying decision in the system — every generation pays for at least one extra inference call. It is also the difference between a system I’d let touch a real child’s evening and one I wouldn’t.
A couple of subtler things about this layer:
- The judge is a guard, not a fixer. It says yes or no, with reasons. It does not rewrite. Rewriting introduces its own kind of drift, and a rewrite-from-bad-input is statistically worse than a regenerate-with-better-seed. So a failed judgment triggers a new attempt with adjusted parameters, not a patch.
- The judge prompt is conservative on purpose. False negatives (good stories rejected) cost me money and annoy users. False positives (bad stories shipped) cost me trust. Trust is harder to recover, so the prompt errs on the side of “no”.
- The criteria the judge enforces are also the criteria the prompt to the generator is built from. The same vocabulary. When they diverge, you get a system that confidently generates content the same model will then reject, which is funny exactly once.
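One way to keep the judge a guard rather than a fixer is to parse its verdict strictly and fail closed. The sketch below assumes the judge returns a JSON object with one entry per criterion, each carrying an `ok` flag and a `reason`; that schema, the criterion keys, and the `Verdict` type are assumptions derived from the criteria listed above, not the actual contract.

```python
import json
from dataclasses import dataclass, field

# Hypothetical keys mirroring the criteria named in the text.
CRITERIA = (
    "age_appropriate",
    "length",
    "protagonist_present",
    "no_unintroduced_characters",
    "original",
    "no_unrequested_darkness",
)


@dataclass
class Verdict:
    accepted: bool
    reasons: list = field(default_factory=list)


def parse_verdict(raw: str) -> Verdict:
    """Turn raw judge output into a Verdict; anything malformed counts as a rejection."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return Verdict(False, ["judge output was not valid JSON"])
    if not isinstance(data, dict):
        return Verdict(False, ["judge output was not a JSON object"])

    reasons = []
    for name in CRITERIA:
        entry = data.get(name)
        if not isinstance(entry, dict):
            reasons.append(f"{name}: criterion missing")
        elif entry.get("ok") is not True:  # only a literal true passes
            reasons.append(f"{name}: {entry.get('reason', 'no reason given')}")
    return Verdict(accepted=not reasons, reasons=reasons)


raw = json.dumps({name: {"ok": True, "reason": ""} for name in CRITERIA})
verdict = parse_verdict(raw)
```

Deriving both the generator prompt and this criteria tuple from one shared definition is one way to keep the two vocabularies from diverging.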
Audio is the expensive-and-mediocre part of the story
Audio is the part of this system I’m least happy with, and the reason is structural rather than something I can fix with a better prompt.
The audience for these stories speaks Italian. Italian, today, is not a supported language for Polly Generative — the high-quality, expressive tier of Amazon Polly. Italian is only available on the older neural voices, which are competent but flat: correct prosody on a sentence, no real performance on a story. A parent reading aloud beats it on every dimension that matters; the only thing it has going for it is availability at 8:45 PM when you forgot a story. And, by the standards of TTS, it isn’t even particularly cheap — generating a full story’s worth of audio adds non-trivial cost to every request that asks for it, which is the main reason audio sits behind a higher tier in the first place.
I’ve spent more time than I’d like to admit looking at the next step. There’s a third-party service I’ve already identified that produces genuinely good Italian narration — with intonation control, a usable voice catalogue, and room for soundscape and lightweight effects (“the door creaks”, “the seagull cries”). It is also genuinely expensive: the per-minute cost is high enough that I can’t just slot it in for every request. The plan, when I get to it, is the same trick the rest of the system uses — gate the richer audio behind the higher tier, generate it once per weekly broadcast, and let the marginal-cost-collapses-to-zero amortization carry the rest, so a story narrated by the good engine reaches many readers for the price of one generation. Until that’s wired in and budgeted, audio stays on neural Polly: present, helpful, occasionally read out loud, never the thing you’d boast about.
The week the weekly Lambda broke
Every system reveals its design under stress. Here’s the most instructive incident from the last few months, sanitized:
The weekly broadcast Lambda picks a fresh theme, builds parameters, writes a pending STORY row, and synchronously invokes story-generate. For a stretch, the user-generated story path and the weekly broadcast path had drifted: the weekly Lambda was building a parameters payload missing a single field that the user path always included — the protagonist’s name. The story generator didn’t refuse — it picked a name and produced prose. The judge, however, was strict: it checked that the protagonist named in the parameters appeared in the prose. With no name in the parameters, the contract was violated, and the judge rejected. The weekly Lambda failed. EventBridge retried. The judge rejected again. The retries exhausted. The DLQ alarm fired. No story shipped that Wednesday.
The fix was a one-liner: make the weekly path conform to the same parameter contract as the user path. The lesson was much bigger: a single contract, exercised by every code path that produces stories, is worth more than a fast path that skips half of it. Convenience paths in generative systems are how you ship a bad Wednesday.
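A single parameter contract can be enforced in code by routing every path through one constructor that refuses incomplete input. This is a sketch under assumptions: the field names are illustrative (the post names the parameters but not their keys), and `StoryParams` and `build_params` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoryParams:
    """The one parameter contract shared by every path that produces a story."""
    theme: str
    setting: str
    archetype: str
    age_band: str
    tone: str
    language: str
    protagonist_name: str

    def __post_init__(self):
        # Reject empty or non-string values so a blank name cannot slip through.
        for name, value in vars(self).items():
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"missing or empty parameter: {name}")


def build_params(raw: dict) -> StoryParams:
    """Build the contract from a raw payload, failing before any generation starts."""
    try:
        return StoryParams(**raw)
    except TypeError as exc:  # missing or unexpected keys
        raise ValueError(f"parameter contract violated: {exc}") from exc


weekly_payload = {"theme": "pirates", "setting": "island", "archetype": "explorer",
                  "age_band": "5-7", "tone": "gentle", "language": "it"}
try:
    build_params(weekly_payload)  # no protagonist_name, as in the incident
except ValueError as err:
    print(err)
```

With a check like this, the missing field would surface as an immediate error in the weekly Lambda instead of as a judge rejection several paid inference calls later.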
Security, briefly, without giving anyone homework
A few things I’d rather mention generally than specifically:
- Every webhook entry point validates a signed payload against a secret pulled at cold start. Replays and forged events stop at the door.
- Cognito access tokens carry a tier claim, but the API re-resolves the tier from the profile row on every request. The claim is for UX. The DB is for truth.
- S3 assets are private. Users get signed URLs with short TTLs, scoped to the asset their tier entitles them to. The free tier doesn’t get audio URLs even if it knows the storyId. The library reads are tier-gated on the way out, not just on the way in.
- Bounces and complaints from SES feed an EMAIL_SUPPRESS table; the broadcast Lambda consults it before sending, so the system suppresses itself faster than a human would.
- The judge runs on every generated story, including the weekly broadcast, including stories that came back from the reuse pool that hadn’t yet been validated under current criteria. There is no fast path that skips it.
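For readers unfamiliar with signed webhooks, the general pattern looks like the sketch below. It is a generic HMAC-SHA256 scheme with a timestamp for replay protection, not the format of any particular provider and not a description of this system; the message layout, the five-minute tolerance, and the function name are all assumptions.

```python
import hashlib
import hmac
import time


def verify_webhook(secret: bytes, body: bytes, timestamp: int, signature_hex: str,
                   now: float, tolerance_s: int = 300) -> bool:
    """Accept a payload only if it is recent and its HMAC matches."""
    # Replay guard: reject events whose timestamp is too far from the current time.
    if abs(now - timestamp) > tolerance_s:
        return False
    # Sign the timestamp together with the body so neither can be swapped.
    message = str(timestamp).encode("ascii") + b"." + body
    expected = hmac.new(secret, message, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("ascii"), signature_hex.encode("utf-8"))


secret = b"example-secret"            # placeholder; a real secret is loaded at cold start
body = b'{"event": "payment"}'
ts = int(time.time())
sig = hmac.new(secret, str(ts).encode("ascii") + b"." + body, hashlib.sha256).hexdigest()
ok = verify_webhook(secret, body, ts, sig, now=time.time())
```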
That’s the texture. The specifics live in the code and stay there.
What this taught me that backend code never did
When I write a backend function and forget about it, it sits there in production, doing its thing, indistinguishable to the user from the surrounding noise. Nobody reads it. Nobody opens it on a Tuesday evening to feel something. The function is a green checkmark.
A story is the opposite. The output is the entire product. The system’s job is to lower the variance enough that the average output is good, and to refuse to ship the bad tail. Every architectural choice in Fabble is, at root, about that variance: the catalogue of characters narrows it; the judge prunes it; the reuse pool ensures the good outputs get seen more than once; the weekly broadcast lets me invest more compute into a single story because the cost is amortized across everyone. None of these choices come from “is this idiomatic AWS?” or “is this clean code?”. They come from “is this story good?”, asked over and over by a system I built precisely because I can’t ask it myself a thousand times a week.
This is the part of the work I didn’t expect. I expected to enjoy the wiring — and I did, the wiring is fun. I didn’t expect that the hard problem would turn out to be staying honest about quality across a pipeline where every stage adds plausibility and removes verifiability. By the time a story has prose, images, and audio, it looks polished. Looking polished is precisely the failure mode you have to guard against, because a polished bad story is much harder to catch than an obviously bad one.
So Fabble, in its current shape, is a serverless application with a single mission: produce stories good enough to be read aloud, cheap enough that I can keep producing them, and consistent enough that the system can be trusted to do it on a Wednesday whether I’m watching or not. It mostly works. When it doesn’t, I learn something about narrative structure, or about LLM-as-judge prompts, or about how GSI3PK should always be set on broadcast rows even when nothing about the application logic seems to require it.
The next round of work is on the audio side, and on a quality-feedback loop where the users themselves — not just the judge — rate the stories they’ve received, and those ratings feed back into the broadcast Lambda’s theme selection. That, however, is for a future post, possibly written by a system that has by then earned the right to draft its own.
For now: it writes stories. Sometimes good ones. And it doesn’t bankrupt me. By the standards of side projects with LLMs in them, that’s already further than I dared plan for.
[1] This is the textbook occupancy (a.k.a. balls-in-bins) result: drop \(n\) balls independently and uniformly into \(F\) bins, and the probability that a particular bin contains at least one ball is \(1 - (1 - 1/F)^n\), which converges to \(1 - e^{-n/F}\) for large \(F\). The same shape underpins the birthday problem and the false-positive analysis of Bloom filters. I’m using it here as a back-of-envelope approximation. Real request distributions aren’t uniform over fingerprints (some themes are far more popular than others), so the true hit rate at low \(\rho\) is somewhat higher than this formula suggests, because popular fingerprints collide sooner, while at high \(\rho\) it’s somewhat lower, because rare fingerprints stay cold longer. Close enough to reason about cost; not close enough to trust without measurement.