The AI and the Engineer Both Picked the Right Database. For a Problem You Don’t Have.
Amazon’s 2007 Dynamo paper named six use cases. Most teams have none of them. Nobody asked.
Three databases. Same data, each modeled for its own engine. A generated dataset large enough to stress the system. Load-tested all three.
PostgreSQL was three times faster.
The app had been live for two months. I was its only user.
Three months earlier, I had built this on DynamoDB because the AI told me to. The app was built to solve a problem AI tools create but don’t address: state. When you work with AI agents across multiple sessions, the context dies with the conversation. What I was building, call it a wiki brain, was a shared knowledge layer where agents could read each other’s work and build on it rather than starting from zero every time. The concept was clean. The execution started clean too.
I built it over a month. The AI coding assistant did most of the typing. When I asked what database to use, the answer came back fast: DynamoDB. The reasoning was coherent: no schema migrations, iterate quickly, the data model can evolve as the product does. That matched how I was working, so I didn’t push back. The AI was describing something real about DynamoDB. It just wasn’t describing the part that would matter.
As features accumulated, each one needed a new way to look at the data. New queries meant new indexes. DynamoDB’s answer to a new query pattern is a Global Secondary Index, a parallel structure you define to make a specific lookup possible. The first few were fine. By the time I was defining GSI6, I should have heard what I was saying. I was creating a sixth separate index because the data model couldn’t answer a sixth kind of question without one. I was building a relational query layer on top of a non-relational database, one index at a time, because the AI had told me schema flexibility was the property I needed. Schema flexibility and query flexibility are different properties. I had the first. I kept paying for the second.
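For illustration, here is roughly what that accretion looks like at the API level. The table, attribute, and index names below are hypothetical; the shape is the point. Every new question the product learns to ask requires declaring a new parallel index before DynamoDB will answer it.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Hypothetical: the sixth index, added so pages can be listed by the
# agent that last edited them. Each GSI is a parallel copy of the
# projected data, maintained and billed separately.
# Assumes the table uses on-demand billing, so no ProvisionedThroughput.
dynamodb.update_table(
    TableName="wiki-brain",
    AttributeDefinitions=[
        {"AttributeName": "last_editor", "AttributeType": "S"},
        {"AttributeName": "updated_at", "AttributeType": "S"},
    ],
    GlobalSecondaryIndexUpdates=[
        {
            "Create": {
                "IndexName": "GSI6-by-last-editor",
                "KeySchema": [
                    {"AttributeName": "last_editor", "KeyType": "HASH"},
                    {"AttributeName": "updated_at", "KeyType": "RANGE"},
                ],
                "Projection": {"ProjectionType": "ALL"},
            }
        }
    ],
)
```

The relational equivalent of each of those declarations is a single `CREATE INDEX`, and often just a `WHERE` clause the planner can already serve.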
The pages got slower as the data grew. I did what engineers do: found the N+1 queries, fixed them, ran the tests again. Still slow. An N+1 problem yields to optimization. An architecture mismatch doesn’t. When the problem survived the fix, I stopped debugging and ran the bake-off. Each database was given its native data model, its best chance. DynamoDB with proper partition keys and optimized GSIs. PostgreSQL with a relational schema and indexed queries. The result didn’t change.
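For a sense of what the bake-off asked each engine, a sketch against a hypothetical relational schema. The question that had required GSI6 plus application-side stitching in DynamoDB is, in PostgreSQL, one indexed join.

```python
import psycopg2

# Hypothetical connection and schema for the sketch.
conn = psycopg2.connect("dbname=wiki_brain")
with conn, conn.cursor() as cur:
    # The "sixth question": pages a given agent touched, newest first,
    # with each page's current title. One join, no new index structure.
    cur.execute(
        """
        SELECT p.title, r.updated_at
        FROM revisions r
        JOIN pages p ON p.id = r.page_id
        WHERE r.editor = %s
        ORDER BY r.updated_at DESC
        LIMIT 20
        """,
        ("agent-7",),
    )
    rows = cur.fetchall()
```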
The tool wasn’t wrong. DynamoDB does exactly what was described. The failure was the absence of one question before the architecture was committed: whether the problem I was building toward matched the problem the tool was designed to solve. Nobody in that workflow asked it. The AI was not designed to. I didn’t think to. The question had no home.
In March 2023, a senior software development engineer named Marcin Kolny published a post-mortem on the Prime Video engineering blog. His team had built a video quality analysis system to monitor thousands of concurrent streams for defects: frame corruption, video freeze, sync failures. They built it the way eight years of engineering conferences had implicitly endorsed: distributed microservices, AWS Lambda for processing, Step Functions for orchestration, S3 as intermediate storage between components.
The system hit a hard scaling limit at 5 percent of expected load.
Running it at full scale would have cost, in the team’s own words, too much to accept. Kolny’s team packed everything into a single process. Data passed in memory. S3 dropped out. Step Functions dropped out. One application, on EC2 and ECS. Infrastructure cost fell by 90 percent.
Kolny’s published conclusion: “We realized that distributed approach wasn’t bringing a lot of benefits in our specific use case.”
The finding traveled because of where it came from. AWS built the tools. AWS spent eight years telling teams to use them. An Amazon engineering team publishing “packing everything into a single process cut our costs by 90 percent” is the vendor’s own engineers demonstrating that the vendor’s canonical pattern can fail — even for the vendor. That is not a narrow finding about Step Functions pricing. It is the strongest possible evidence that the pattern had been traveling without its constraints.
One story: a solo developer, one user, two months in production. The other: one of the largest consumer streaming platforms on earth, a dedicated engineering team, thousands of concurrent streams at a scale most applications never reach.
In scale, the two stories could not be further apart. The structural failure is identical.
Call it survivorship architecture: the pattern of adopting a tool based on the reputation it earned at its origin, without evaluating whether the problem at the origin matches the problem in the room. The conference circuit carried these tools without their constraints. The AI training data learned from the same filtered signal. The constraint evaluation, the question of whether the match is real, was absent in both cases. Scale didn’t protect Prime Video. Seniority didn’t protect them. Budget didn’t protect them. The variable that produces survivorship architecture is not team size. It’s the removal of one question from the process.
The 2007 Dynamo paper was presented at the ACM Symposium on Operating Systems Principles in October of that year. Amazon researchers led by Giuseppe DeCandia and Werner Vogels were precise about what they had built. The paper named the Amazon services Dynamo was designed to support: best seller lists, shopping carts, customer preferences, session management, sales rank, product catalog. Six use cases. All high-volume, key-value access with patterns that would not change. Vogels later put a number on it: 70 percent of Amazon’s database operations were single-record lookups by primary key.
The paper was not a manifesto. It described a solution to Amazon’s specific data-access problem, shared in the spirit of academic openness. Five years later, AWS packaged the same technology as DynamoDB, a managed service available to any team with a credit card.
The microservices story begins earlier and travels through different hands. Adrian Cockcroft was Netflix’s cloud architect during the company’s migration to AWS between 2008 and 2013. Netflix built its architecture around a concrete organizational pressure: dozens of teams shipping to the same codebase, deployment bottlenecks that were real and measurable. Decomposing into independent services solved that specific problem. Cockcroft began describing the Netflix architecture at QCon San Francisco in 2011 and QCon London in 2012. By his own account: “I basically went around telling everyone else that that’s what Netflix had done.”
In March 2014, Martin Fowler and James Lewis published the post that gave the pattern its name. Netflix had shown the architecture worked. The Fowler-Lewis post gave it vocabulary precise enough to propose in a meeting.
Fourteen months later, on June 3, 2015, Fowler published MonolithFirst on the same website. His finding, from watching teams implement the pattern he had just named: “Almost all the successful microservice stories have started with a monolith that got too big and was broken up. Almost all the cases where I’ve heard of a system that was built as a microservice system from scratch, it has ended up in serious trouble.”
The author of the term. One year after coining it. Same website. Saying: don’t start here.
Go back further. L. Peter Deutsch at Sun Microsystems compiled the Fallacies of Distributed Computing in 1994. The first fallacy: the network is reliable. This is not a minor caveat. Every microservices architecture requires treating every function call as a network call, accepting that the call can be slow, fail partially, arrive out of order, time out. None of that complexity exists when code runs in a single process. Deutsch wasn’t arguing against distributed systems. He was stating that distributed systems impose a tax that must be repaid by a proportionate benefit. A team that adopts microservices without a specific, measurable problem requiring independent deployment pays the tax without collecting the benefit. That was true in 1994. The 30-year gap between when Deutsch wrote it and when teams keep rediscovering it is not a gap in the literature. It is a gap in the process for deciding when the pattern applies.
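The tax is concrete enough to sketch. The service URL below is hypothetical; the handling is the non-negotiable minimum for any remote call, and none of it exists for the in-process version.

```python
import requests  # assumed HTTP client for the sketch


def compute_price(cart: list[float]) -> float:
    # In a single process this is the whole story: it returns or it raises.
    return sum(cart)


def compute_price_remote(cart_id: str, retries: int = 3) -> float:
    # The same operation across a service boundary must budget for the fallacies.
    for attempt in range(retries):
        try:
            resp = requests.get(
                f"https://pricing.internal/carts/{cart_id}",  # hypothetical service
                timeout=2.0,  # the network is not reliable; calls can hang
            )
            resp.raise_for_status()  # the call can fail, or fail partially
            return resp.json()["total"]
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # and the caller still needs a plan for pricing being down
    raise RuntimeError("unreachable")
```

Timeouts, retries, idempotency, fallbacks: the tax is charged per call, and decomposing a monolith converts thousands of calls like the first function into calls like the second.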
By 2015, a team adopting microservices from scratch had three published warnings reachable in an afternoon: Deutsch’s fallacies from 1994, Fowler’s MonolithFirst from that June, and Netflix’s own history — which, examined closely, was the story of a company with a specific organizational bottleneck that most engineering teams did not share. None of it reliably reached the room where the architecture got decided.
Engineers arrive at architecture decisions with the answer already in mind.
This is how cognition works under time pressure, and the engineering profession built a delivery culture that institutionalized it. The MVP cadence — ship fast, observe, iterate — is correct for product features, where feedback is fast and a wrong call gets corrected in the next sprint. Applied to infrastructure, the same cadence produces a specific failure: the decision gets made at speed, by an engineer more comfortable with one tool than another, against a deadline that treats architecture as equivalent to a feature ticket.
The legibility problem compounds this. When a VP of Engineering pitches a migration to leadership, “we’re moving to DynamoDB” gets nodded at. “We’re moving to PostgreSQL” requires justification — it sounds like the thing you already have. The engineer recommending the complex option is easy to approve. The engineer recommending the boring option has to explain why boring is right. The incentive to choose complexity is not only comfort level. It is legibility to the people signing off on the work.
The RFC process, a written problem statement with alternatives considered and tradeoffs made explicit, reviewed by someone with authority over the decision, is a forcing function against exactly this failure mode. It requires the engineer to write down what the proposed solution was built for and whether the current situation matches. An engineer who has to articulate why DynamoDB’s six documented use cases apply to her product’s access patterns will often find, in the act of writing, that they don’t. The delivery culture discarded the RFC in the name of velocity. What it discarded was the only moment in the process where the information exists, the cost is low, and the outcome is still changeable.
AI has entered that gap and made it structural.
When I asked a coding assistant what database to use, it returned the popular answer. DynamoDB: no schema migrations, iterate fast, the data model can evolve with the product. Accurate about what DynamoDB does. The AI pattern-matched to the most common answer in its training data for “serverless, flexible, iterate fast”: the corpus of engineering blog posts, conference talks, and Stack Overflow answers before its cutoff. That corpus has exactly the survivorship bias this piece has been describing. The posts that got written were about DynamoDB migrations and microservices architectures. The quiet reversals, the weekend migrations, the bake-offs that sent engineers back to Postgres, generated far less text. The model learned from what got published.
AI is the conference circuit, encoded in weights and available on demand.
What it doesn’t model is trajectory. By the time I was defining GSI6, I had a query-pattern problem the AI’s original recommendation hadn’t anticipated — the AI had no model of where my product was heading, only of what DynamoDB does well at the start.
Twelve months after the architecture ships, the performance is bad. Someone proposes a fix, gets a team approved, runs the migration, presents a before-and-after graph at the all-hands. Cost down. Latency down. The room is impressed.
That engineer gets a strong performance review.
The engineer who made the original call got a strong performance review a year earlier, for shipping on time.
Two career events from one architectural failure. The system has no mechanism to connect them. Performance reviews evaluate the period they cover. The original decision looked correct for twelve months and the review reflected it. The migration project gets evaluated on its own terms, often by a different manager. Nobody in either review has any incentive to draw a line between the two events, and the organizational memory required to draw it rarely exists.
You cannot name a time when an engineer’s architecture decision was held against them in a performance review. Not because engineering organizations are forgiving — because the time between decision and consequence is almost always longer than the evaluation window. The performance management system was never designed to bridge that gap.
The post-mortem is blameless by design, and that is correct. Punishing the original engineer doesn’t change the system. But blamelessness means organizational learning stops at “we chose the wrong tool” without reaching “we had no process for asking whether the tool matched our problem.” The clock resets. The same conditions produce the next decision.
Cockcroft described the conference version of this loop [in his own words](https://blog.container-solutions.com/adrian-cockcroft-on-serverless-continuous-resilience): “If you want to speak at a conference, you think, what topic is likely to get picked up? Everyone’s talking about microservices, so I’ll do a microservices talk. It becomes self-referential, self-feeding.” The team that ran the rescue project didn’t give a talk. The conference heard from the teams whose architectures were still in the phase where they look correct. AI makes the loop run faster. The wrong architecture reaches production faster. The rescue project gets approved faster. Everyone gets their performance review faster.
Before the architecture decision, someone with authority over the spend should require two written paragraphs from the team proposing it.
First: what problem was this tool designed to solve, in the words of the people who built it? Second: what evidence does this team have that their problem is the same one? The two paragraphs reattach the constraint to the recommendation — the thing that gets stripped when a pattern travels through a conference talk, an AI training corpus, or a consulting engagement. The Dynamo paper answers the first question in its abstract. Six use cases. The second paragraph, whether a product with evolving query patterns matches those six use cases, answers itself in the writing. By the time you’re describing GSI6, the answer is already clear. An AI tool cannot produce this evaluation. It doesn’t know the trajectory of the product or where the queries are heading six features from now.
The approver who didn’t require the question is the failure point, not the engineer who proposed the answer.
The principle is a boring default: a relational database, until the pressure to leave is specific and measurable. PostgreSQL is the most common expression of that default; the choice of which relational database is a secondary question. The primary question is whether the situation genuinely requires leaving the default at all. DynamoDB is the right answer for audit logs: write-once, high-volume, access by entity ID and timestamp, access patterns fixed forever. Session tokens. Rate-limiting counters. Device telemetry at IoT scale. Outside those cases, the burden of proof belongs to the complex option.
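When the match is real, the code is short and stays short. A sketch of the audit-log case, with hypothetical names: one partition key, one sort key, access pattern fixed at design time. No sixth index is coming.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Write-once, high-volume, read by entity and time range: the shape
# the 2007 paper describes. The key schema is the whole data model.
dynamodb.create_table(
    TableName="audit-log",  # hypothetical
    AttributeDefinitions=[
        {"AttributeName": "entity_id", "AttributeType": "S"},
        {"AttributeName": "occurred_at", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "entity_id", "KeyType": "HASH"},
        {"AttributeName": "occurred_at", "KeyType": "RANGE"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
```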
The microservices equivalent: does your team have Netflix’s deployment problem — and have you built long enough to know where your domain boundaries actually are? Those are two separate questions and both matter. Netflix’s services are stable because they map to stable domain concepts the company had years to understand before decomposing. Recommendations. Streaming. User profiles. Boundaries that are durable because the domains are durable. Most teams that split into services do it before they understand their domain well enough to find the stable cuts. They decompose based on what seems natural, or based on team structure, and end up with services that fight the domain rather than map to it — everything connected across the network, nothing independently deployable in practice. That is not a deployment model. It is a maintainability problem wearing an infrastructure costume. Fowler’s MonolithFirst is correct precisely for this reason: build first, understand the domain, let the stable boundaries show themselves under pressure. Then extract. The constraint that gets stripped when microservices travel without their origin story is not just the organizational pressure — it is the domain analysis that produced the boundaries worth keeping.
That is survivorship architecture. The delivery culture removes the moment where the constraints would be checked. The system rewards the adoption and the rescue, with nothing in between that connects them.
The approver who requires the two paragraphs before signing off is the design change. It is a small intervention for a large loop.
Next week’s Design Brief, “The Two Paragraphs That Would Have Saved Your Architecture,” gives the approver the literal template: the questions to require, the red flags in the answers, the use cases mapped.
The loop is still running. It was designed to.