Streaming platforms require immediate scalability and high resiliency
If you’re running a streaming platform, your infrastructure doesn’t get time to recover. There’s no tomorrow in streaming. Users press play and expect video to start immediately. If it doesn’t, they’re gone, and many never return. Loyalty in this space is measured in milliseconds. That’s the business reality.
At ProSiebenSat.1 Media SE, one of Europe’s largest broadcasters, this challenge became painfully clear. Their streaming platform, Joyn, serves millions of users across Germany, Austria, and Switzerland. The demand is constant. Viewership spikes happen without warning: a Premier League game, breaking news, the finale of a top series. Systems need to scale now. Scaling isn’t theoretical; it’s existential.
Executives need to understand that resiliency isn’t about surviving rare disasters. It’s about maintaining service during inevitable daily surges. If your infrastructure can’t absorb those shocks in real time, you’ll lose customers and damage your brand.
The expectation is clear: your app launches, a video loads instantly. Whether ten thousand or ten million people hit your system at once, it shouldn’t matter to them, and it can’t break your backend.
Transitioning to a serverless architecture effectively resolved core system failures
Old systems break under pressure. Joyn’s original architecture worked, until it didn’t. Servers were straining. Databases failed during peak demand. It wasn’t due to a fundamental flaw in cloud computing. It was the way the system was designed: single-node databases, no caching, and inconsistent service standards between teams. Basically, things scaled poorly. When traffic spiked, the whole system went down. Users left. Not great.
The team switched to a serverless approach using AWS. That wasn’t a trendy decision. It was practical. Serverless means fewer operational distractions, no patching, no provisioning, no worrying about load balancing at 2 a.m. You focus on writing code. AWS handles the rest. This allowed the engineers to build faster, scale faster, and deploy safer.
Before the shift, deployments took 90 minutes. After it, they take minutes. That improves developer morale and reduces customer-facing downtime. Serverless systems, especially those using managed services like Lambda and Fargate, scale automatically. You don’t need to guess how many servers you need at 9 p.m. on a Friday. The system adapts based on how many users show up.
The business value here is about speed and reliability. If your customers can count on the product working without interruption, they trust it. If your development cycles are short, you respond to market demands faster. Cost savings come later. First, serverless removes the friction. It creates space for innovation because your team stops spending time fighting the infrastructure.
This approach unlocked velocity: better uptime, faster feature rollouts, and simpler operations. That’s what infrastructure should do: get out of your way.
Inconsistent data handling was a significant user-facing issue
If your system shows content inconsistently, you’ve already lost the user. They don’t care what caused it. All they see is failure. At Joyn, users would see a video listed on one screen, try to play it or check its details, and it simply wouldn’t be available. Same app, same video, different result. That’s a data consistency problem. And for a platform delivering millions of streams, this kind of failure creates real trust issues.
The root cause wasn’t complicated. Multiple services were built by different teams, each subscribing to the same Kafka stream but handling data in their own way, some performing validation, some skipping it, others crashing silently. The result was a fragmented backend, where different services had contradictory views of the same content. Nothing synced reliably. A single issue in this chain might take hours to trace.
For decision-makers, the key takeaway is that if you’re scaling, you cannot allow individual teams to roll their own standards. You need architecture that enforces consistency system-wide. Otherwise, failures multiply. Customers don’t wait for a support ticket to get answered. They move on.
The solution wasn’t adding more monitoring or more developers. It was changing the system so these issues couldn’t happen in the first place. That meant enforcing consistency through repeatable, resilient processes, not through after-the-fact corrections.
Implementing the hub and spoke pattern centralized and standardized message routing
To stabilize communication across services, Joyn restructured the architecture using a Hub and Spoke model. It’s straightforward: Kafka remains the system of record, where final event data lives, and Amazon EventBridge takes over message routing within the system. EventBridge Pipes, sitting in the middle, manage validation and transformation before the data gets distributed to consuming services.
This changed everything. Instead of allowing services to pull from Kafka and handle messages however they like, EventBridge acts as the central conduit. Service-to-service messaging is now clean, predictable, and scalable. No shortcuts, no system-specific code for routing. The message enters, passes validation, and reaches every destination in a standard format.
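The validate-then-normalize step in the middle of that flow can be sketched as a small function, the kind of logic an EventBridge Pipes enrichment stage would run before fan-out. The event schema and field names here are illustrative assumptions, not Joyn’s actual contract.

```python
# Sketch of the central validation/transform step: a malformed event is
# rejected once, at the hub, instead of crashing consumers independently.
# REQUIRED_FIELDS and the envelope shape are assumptions for this example.

REQUIRED_FIELDS = {"content_id", "title", "event_type"}

def validate_and_transform(raw_event: dict) -> dict:
    """Reject malformed events and normalize the rest into one standard shape."""
    missing = REQUIRED_FIELDS - raw_event.keys()
    if missing:
        raise ValueError(f"event rejected, missing fields: {sorted(missing)}")
    # Every consumer downstream receives the same normalized envelope.
    return {
        "detail-type": raw_event["event_type"],
        "detail": {
            "contentId": raw_event["content_id"],
            "title": raw_event["title"].strip(),
        },
    }
```

Because every message passes through this one gate, consumers never need their own validation code, and a bad producer fails loudly in one place.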
This made troubleshooting easier and debugging almost instant, because all communication passes through the same pathways. Misbehaving services can’t send rogue data. And everything, from microservices to bigger internal tools, got faster, more reliable, and easier to maintain.
For executives, this structure means stability and speed. You reduce the number of ways your system can break. You speed up onboarding new teams because they don’t need to write custom integrations. They talk to the bus, and the bus handles the rest. You create alignment across engineering without slowing them down with bureaucracy.
The essential value here is control without friction. You enforce standards in a way that makes teams more productive. That’s how you scale without chaos.
Addressing message payload size constraints required a hybrid claim check pattern
There are limits to what different systems can handle, and ignoring them breaks things. Kafka, by design, handles large messages; 30 to 40 megabytes is manageable. EventBridge, on the other hand, has a strict limit of 256 kilobytes. Trying to push full media metadata or rich video-related payloads through EventBridge directly doesn’t work. The technical environment enforces those rules, so you either respect them or build around them.
Full payloads, those large messages, are offloaded to Amazon S3. During processing, EventBridge Pipes handle validation and transformation, then store the data. What gets passed through EventBridge is just a reference key, pointing to where the actual data lives. From there, consumer services independently retrieve what they need from S3.
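The claim check flow can be sketched in a few lines. This is a hedged illustration, not Joyn’s implementation: a plain dict stands in for the S3 bucket, and the key derivation is an assumption for the example.

```python
# Sketch of the claim check pattern: payloads above the EventBridge entry
# limit are stored aside, and only a reference key travels on the bus.
import hashlib
import json

EVENTBRIDGE_LIMIT_BYTES = 256 * 1024  # EventBridge's 256 KB entry limit

object_store: dict = {}  # stand-in for an Amazon S3 bucket

def publish(payload: dict) -> dict:
    """Return the event for the bus: inline if small, a claim check if not."""
    body = json.dumps(payload).encode()
    if len(body) <= EVENTBRIDGE_LIMIT_BYTES:
        return {"inline": payload}
    key = hashlib.sha256(body).hexdigest()  # deterministic object key
    object_store[key] = body                # offload the full payload
    return {"claim_check": key}             # only the reference crosses the bus

def consume(event: dict) -> dict:
    """Consumers resolve the reference themselves, pulling only what they need."""
    if "inline" in event:
        return event["inline"]
    return json.loads(object_store[event["claim_check"]])
```

The bus only ever carries small, uniform events, while arbitrarily large payloads ride on storage that is built for them.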
Why does this matter? Because it removes bottlenecks and keeps each layer of the system operating efficiently within its own limits. You scale your data processing without needing to customize routing or create workarounds for individual services.
For C-suite leaders, understand the impact beyond the code. This pattern eliminates infrastructure stress under heavy load and minimizes cross-service dependencies. You’re not asking every system to process every detail multiple times. You’re reducing data movement costs, improving fault isolation, and maintaining responsiveness. It’s how you scale media-intensive operations without building expensive and fragile custom infrastructure.
And you do it using native, fully managed components, no third-party tools, no complexity tax.
Event-driven architectures provide superior decoupling compared to conventional data replication
You have options when synchronizing data across services. Data replication using something like PostgreSQL’s logical replication (pglogical) works, and it works well, if you’re fine with tightly coupling all your services to one central database. That kind of setup becomes restrictive as your system grows and shifts across regions or business units. Schema changes become bottlenecks. Rollouts need tight coordination. Downtime becomes harder to control.
At Joyn, the model flips the responsibility: the system publishes events, and subscribers decide what to do with them. You no longer rely on a central database to maintain truth for all services. Each service builds and owns its view of the data based on the events it receives.
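That inversion is easy to see in miniature: each subscriber folds the same event stream into its own view, with no shared database in between. The event names and fields below are assumptions made for the illustration.

```python
# Illustrative sketch of event-driven views: two "services" consume the same
# stream and each builds only the state it needs, independently.

events = [
    {"type": "content.published", "id": "ep1", "title": "Pilot"},
    {"type": "content.published", "id": "ep2", "title": "Finale"},
    {"type": "content.removed", "id": "ep1"},
]

catalog_view: dict = {}  # e.g. a browse service: id -> title
live_count = 0           # e.g. an analytics service: number of live titles

for ev in events:
    if ev["type"] == "content.published":
        catalog_view[ev["id"]] = ev["title"]
        live_count += 1
    elif ev["type"] == "content.removed":
        catalog_view.pop(ev["id"], None)
        live_count -= 1
```

Neither view knows the other exists; either one can be rebuilt by replaying the stream, and neither blocks the other’s schema or deployment choices.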
Why does that matter? Because it reduces blast radius and gives teams autonomy. Service A going down shouldn’t affect Service B’s ability to operate. With replication, especially across multiple services, everything depends on a shared source, and that creates points of failure you can’t afford at scale.
From a leadership perspective, event-driven architectures open up operational flexibility. You can run services across different database engines. You can evolve service contracts without rigid table structures tying everyone together. And you reduce coordination costs between teams.
Yes, replication simplifies reporting and creates centralized audit trails, but you trade that for increased fragility in distributed systems. Event-driven models allow you to decouple timelines, technologies, and scaling strategies across your organization. That’s not just a technical improvement, it’s a structural advantage for long-term growth.
Fundamental infrastructure failures originated from misconfigurations rather than inherent limitations of cloud tools
Most cloud outages aren’t caused by platform failures. They’re caused by configuration errors, missing autoscaling rules, uninitialized caches, or improperly set timeouts.
Using AWS Lambda or Fargate doesn’t guarantee anything unless those services are configured properly. Services didn’t fail because Lambda wasn’t scalable. They failed because developers hadn’t defined the rules that tell Lambda when and how to scale. Caches weren’t used, so slow queries overloaded the database. These are basic oversights, not system limitations, and they have major consequences when traffic surges.
The reality is blunt: if you don’t respect the fundamentals, even the most advanced infrastructure breaks. And when it breaks at scale, the cost is more than technical: it’s customer-facing, reputation-damaging, and revenue-draining.
What fixed it wasn’t a more powerful tool. What fixed it was enforcing architecture discipline. Every service should have autoscaling. Every interaction should have circuit breakers and retry logic. Every database call should see minimal load, not because it’s durable, but because upstream logic already offloaded the work.
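The retry and circuit-breaker discipline described here can be sketched minimally. The thresholds and delays below are illustrative placeholders, not Joyn’s production values.

```python
# Minimal sketch of retry-with-backoff plus a circuit breaker: retries absorb
# transient failures, the breaker stops hammering a dependency that is down.
import time

def call_with_retry(fn, attempts=3, base_delay=0.01):
    """Retry a flaky call with exponential backoff before giving up."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 10 ms, 20 ms, ...

class CircuitBreaker:
    """Fail fast after too many consecutive errors instead of piling on load."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def call(self, fn):
        if self.failures >= self.threshold:
            raise RuntimeError("circuit open: dependency presumed down")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the breaker
        return result
```

Wrapping every cross-service call this way is what keeps one slow dependency from cascading into a platform-wide outage.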
Business leaders should expect this level of precision from engineering. Reliable infrastructure doesn’t just happen by switching clouds or adopting the latest service. It’s the result of knowing what matters and making sure every part of the system is designed to handle failure before it happens.
A thorough understanding of SLAs and best practices is essential for achieving high availability
You can’t trust theoretical uptime numbers until you’ve pressure-tested the full system. AWS will tell you that Lambda has 99.99% availability, and that’s true, within their defined SLA. But in production, if you don’t follow best practices, you’ll never experience that level of reliability.
At Joyn, teams quickly found that infrastructure didn’t fail on its own. It failed when developers skipped retry logic, when APIs had no timeouts, and when one small misconfiguration created cascading outages. The numbers on the AWS website don’t protect you unless your implementation aligns with how that number was calculated.
Here’s why it matters. Executives are often handed SLA commitments by vendors and assume those are guarantees. They’re not. They’re baselines. The rest depends on your system’s architecture. A poorly designed app running on high-SLA infrastructure will still crash. A well-designed one running on modest infrastructure can deliver excellence.
The combination that worked best wasn’t the one with the highest stated availability. It was the one implemented correctly, using an application load balancer with Lambda, supported by distributed caching and carefully tuned failovers.
For decision-makers, this underlines a key responsibility: challenge engineering to not only pick reliable tools, but also adopt the practices necessary to realize those tools’ potential. The right infrastructure sets the bar. The architecture and code determine if you can reach it.
Database selection hinges on a tradeoff between operational complexity and scalability
Choosing the right database has upfront and long-term consequences.
If simplicity and near-zero maintenance are the priority, DynamoDB and Aurora Serverless deliver. They remove the need to manage VPCs, subnets, and replication logic. They’re fire-and-forget in terms of provisioning. But there’s a tradeoff. These systems require thoughtful data modeling. If you don’t shape your access patterns early and carefully, their benefits quickly diminish.
If your workload demands relational consistency and complex joins, RDS and Aurora (non-serverless) offer that, but you carry the operational overhead. You manage network layers, monitor for replication lag, configure failover, and prepare for recovery scenarios. These activities add administrative weight and increase response time when things go wrong.
At scale, especially across multiple services and regions, this decision affects cost structure and resilience. An RDS cluster that isn’t actively optimized becomes a bottleneck. A DynamoDB table that lacks key design foresight becomes unpredictable under load.
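What “shaping access patterns early” means for DynamoDB can be shown with a single-table key design. The key formats and entities below are assumptions for the illustration, not a real Joyn schema.

```python
# Sketch of DynamoDB single-table key design: partition and sort keys are
# derived from the queries you must serve, decided before the first write.

def profile_key(user_id: str) -> dict:
    """Fetch one user's profile: a direct get on a known key."""
    return {"pk": f"USER#{user_id}", "sk": "PROFILE"}

def watch_history_key(user_id: str, watched_at: str, video_id: str) -> dict:
    """List a user's watch history: same partition, time-sortable sort key."""
    return {"pk": f"USER#{user_id}", "sk": f"WATCH#{watched_at}#{video_id}"}
```

A query for `pk = USER#42` with `sk` beginning `WATCH#` then returns the history in time order without a scan. The flip side is the tradeoff named above: if a new access pattern appears later, the keys usually have to change with it.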
For C-level leaders, the principle is clear: database choices define performance ceilings and operational cost floors. Whichever direction you take, ensure product and engineering teams are aligned on access patterns, workload expectations, recovery tolerances, and incident budgets. Assume growth. Build with that expectation now, or pay for poor choices later.
Cell-based architecture minimizes failure impact and enhances scalability
When Joyn moved to a cell-based model, everything changed in how the system handled scale and risk. Instead of a single service responding to all traffic, services were split by country, user type, and platform. Germany has dedicated Lambdas. Paid users don’t share compute with free users. Mobile isn’t bundled with web.
This segmentation exposed a clear benefit: localized failure didn’t become platform-wide failure. If Android in Austria breaks, iOS in Switzerland, along with everyone else, stays online. This approach reduced the “blast radius” and improved the team’s ability to troubleshoot, deploy, and recover quickly.
One Lambda function became thirty. Didn’t matter. Code stayed the same. The pipeline handled it. But capacity went from thousands of requests per second to tens of thousands, instantly.
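The cell idea reduces to a routing rule: every request maps to a dedicated deployment for its (country, tier, platform) combination, and a failure stays inside that cell. The naming scheme below is an assumption for the example.

```python
# Illustrative sketch of cell-based routing: one dedicated deployment per
# (country, tier, platform) cell, so failures are contained to one cell.

def cell_for(country: str, tier: str, platform: str) -> str:
    """Derive the cell (e.g. a dedicated Lambda alias) handling this request."""
    return f"playback-{country.lower()}-{tier.lower()}-{platform.lower()}"

# Simulate the failure scenario from the text: Android in Austria breaks.
unhealthy_cells = {"playback-at-free-android"}

def is_served(country: str, tier: str, platform: str) -> bool:
    """A request is served unless its own cell is the one that is down."""
    return cell_for(country, tier, platform) not in unhealthy_cells
```

Every other cell keeps answering; the incident is scoped, debuggable, and recoverable without touching the rest of the platform.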
For executives, the value is control and predictability. A cell-based system gives teams more visibility and reduces the scope of incident impact. It accelerates decision-making because recovery is contained and doesn’t derail the entire platform. It also makes feature delivery more flexible, as rollouts can be targeted and backward-compatible.
This kind of architecture doesn’t slow things down. It de-risks innovation. When properly automated and supported by good monitoring, it enables safer, smarter scaling. And that’s exactly what modern digital experience demands.
A multi-layered caching strategy significantly reduces database load and operational costs
Streaming applications like Joyn serve high volumes of repeat data, things like metadata, thumbnails, user profiles. Pulling this content from the database for every request creates unnecessary strain. This was one of the biggest inefficiencies early on. Without caching, database clusters were oversized just to survive prime-time traffic.
At the edge, CloudFront handled public-facing, frequently repeated requests. Inside compute services, Lambda and Fargate, in-memory caches handled short-lived, high-volume keys. Then, between compute and the database, sat a purpose-built, managed cache that required zero maintenance.
This setup reduced traffic to the underlying databases by over 90%. In some cases, database CPU usage dropped to just 5% during peak hours. That drop translated directly into cost reductions. Smaller clusters did the same job, and the serverless nature of the cache meant zero overhead when idle.
From a business perspective, this kind of shift moves infrastructure spend from high fixed costs to flexible usage-based costs, aligned with demand. You maintain performance while spending less, without compromising on user experience or developer speed.
Caching isn’t optional at scale. It’s a basic economic enabler. Without it, your system is consuming resources inefficiently, and those costs multiply as your platform grows. Backend efficiency isn’t just a developer concern; it’s a financial lever. Use it.
Automation and dynamic compute routing optimize performance while controlling costs
One of the most impactful changes at Joyn was automating how traffic was routed between Fargate and Lambda. These two compute services have distinct characteristics. Lambda offers rapid scale and near-zero cost when idle. Fargate supports longer-running tasks but demands continuous provisioning, which means base costs even when traffic is low.
During peak hours, Lambda absorbed overflow immediately without delay. During quieter periods, Fargate scaled down, or even to zero, while Lambda handled light traffic efficiently.
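The routing decision itself is simple to sketch: keep the provisioned Fargate capacity saturated first, and spill overflow to Lambda, which costs nothing when idle. The capacity figure below is an illustrative assumption, not Joyn’s configuration.

```python
# Hedged sketch of dynamic compute routing: Fargate handles the provisioned
# baseline, Lambda elastically absorbs whatever exceeds it.

FARGATE_CAPACITY_RPS = 1000  # assumed provisioned baseline, requests/second

def route(requests_per_second: int) -> dict:
    """Split incoming load between provisioned Fargate and elastic Lambda."""
    to_fargate = min(requests_per_second, FARGATE_CAPACITY_RPS)
    return {
        "fargate": to_fargate,
        "lambda": requests_per_second - to_fargate,  # overflow only
    }
```

At quiet hours the Lambda share is zero and Fargate can scale its baseline down; at peak, Lambda absorbs the spike without anyone provisioning anything.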
The result was a 60% reduction in compute-related costs for services operating in the 30 to 50 million request-per-day range. Additionally, switching from API Gateway to Application Load Balancer reduced routing costs by around 90%, with minimal changes to code. These adjustments created a major financial impact with very little engineering overhead.
Automation is a force multiplier. It allows the system to make traffic decisions faster than any human could and without involving ops teams. If a service spikes, it auto-scales. If traffic drops, compute costs drop too. Engineers get alerted only when the system can’t self-correct.
For leadership, it’s critical to understand the larger implication. Without this kind of elasticity baked into your infrastructure, you’re either overpaying to stay safe or risking downtime to save money. Automation removes that tradeoff. You get scalability, cost control, and resilience at the same time. That’s a base requirement in any serious tech stack today.
Multi-region deployment, when approached strategically, provides robust protection at a manageable cost
Multi-region used to be an expensive, complicated decision that most teams avoided. But that’s changed. Cloud providers have made cross-region deployment simpler, faster, and more programmable. At Joyn, this shift wasn’t adopted all at once. Over time, the architecture matured into a resilient, scalable, and cost-aware multi-region system.
The core driver is risk exposure. When a service fails, the impact should be limited to one region, not the entire customer base. If your application is centralized in a single AWS region and that region suffers an outage or degraded performance, you’re fully offline.
Not all services need to be global. But critical ones (authentication, media playback, content metadata) must be available regardless of which region fails. The solution involves tiered strategies: backup and restore for less essential systems, active-active or warm standby for the rest. Active-active, particularly with global DynamoDB tables or replicated Aurora deployments, gives write and read access in multiple geographic zones, ensuring continuity in the face of disruption.
For executives, the critical point is accountability. Downtime is no longer just a technical issue; it’s a leadership decision. If your team isn’t prepared for regional outages, or if you’ve made cost-saving choices that push infrastructure risk higher, that’s a business risk you carry directly. However, when multi-region is implemented strategically, with automation, scoped isolation, and the right replication models, it becomes cost-effective insurance that protects your service quality and your brand reputation.
The technology is in place. The only question is whether your organization is structured to use it.
In conclusion
Infrastructure doesn’t win customers, but it absolutely loses them when it fails. At scale, your backend becomes a core part of the product. Users don’t make a distinction. The video buffers, the app crashes, the content looks inconsistent; they see it all as one thing: your brand underperforming.
What the Joyn platform proved is that you don’t need a massive team or unlimited budget to build something resilient. You need clarity on priorities, discipline around tradeoffs, and the willingness to rethink how your systems are designed. Serverless isn’t about trends. Multi-region isn’t about over-engineering. These are smart decisions when aligned with business goals like uptime, user retention, growth, and cost control.
For executives, the mandate is simple. Treat infrastructure as a business risk, not just a technical one. Push for environments that recover fast, scale under pressure, and adapt without manual intervention. Ask the hard questions: What happens during a regional outage? How long does it take to deploy? Who owns failure recovery?
When infrastructure gets out of the way, teams move faster. When systems respond automatically, downtime becomes rare and contained. That’s how you move from reacting to scaling, on your terms, not the platform’s.


