Operational Controls at the BBC


Published: July 17, 2019

Hello, thank you. Thank you for joining me. So yeah, I’m Ross Wilson. I’m a senior software engineer at the BBC.

And when I was preparing for this talk, I was looking at the Dash website to see what the deal was and what kind of topics and themes would be covered.

And right on the homepage, it says scale up, speed up. And that’s really the topic of my talk: how we can use operational controls at the BBC to scale up and also release features faster.

So, we’re going to be talking all about deployment, releasing features, and how we scale as a team.

Introduction to BBC

So just to make sure that we’re all on the same page, what’s the BBC?

So, BBC, British Broadcasting Corporation, we’re a public service broadcaster, primarily in the UK. We’re actually the world’s largest broadcaster if you count the number of staff.

And our mission is to inform, educate and entertain. And as a public service broadcaster, primarily in the UK, we are a non-commercial entity. So you won’t find adverts in the UK on the BBC. Instead we get our revenue from a TV license that the population pay.

And in terms of the things that the BBC does, you might be familiar with BBC News and that kind of thing. But we have 26 domestic TV stations in the UK, 58 radio stations, and an extensive set of online services. It’s very difficult to count how many websites we have; bbc.com, bbc.co.uk, you could say we have two.

And this is for both a UK and an international audience.

So you may be familiar with BBC News. You may use it here in the States, it’s all from the BBC.

BBC Accounts: internal and external users

That’s the BBC, but of course, what do I actually do? I’m a software engineer for BBC Accounts.

BBC Accounts looks a bit like this. It’s a BBC-wide authentication and authorization platform. And basically, that’s just a lot of words to say that we build a sign-in system and a user registration system.

So, we’re both external-facing, anyone in the world can sign up for a BBC account, and then we’re also internal-facing because we integrate with all the various products that the BBC has.

So it’s one account, single sign-on across the entire BBC’s portfolio of products.

And as a user, if you register for an account and sign in, you can then view catch-up and live television on a service called BBC iPlayer, which is our TV and live streaming platform. You can listen to radio, music, and podcasts in a product called BBC Sounds.

And you can also vote online. So shows such as “Strictly Come Dancing,” over here, you might know that as “Dancing with the Stars.” You can do online voting with your BBC account.

And then there’s all kinds of standard things, recommendations for things to watch, that we think you might be interested in, news, weather in your area, that kind of thing.

And as I said, it’s a shared service, so there’s quite a lot of pressure on us: if that goes down, we’ve got all the different products of the BBC to answer for.

So, we care quite a lot about our stability and how we release features.

I’ve been at the BBC for about five years. For the last two years or so, I’ve been working on BBC Accounts. Before that, I was building Smart TV applications. This is one of them.

This is BBC iPlayer. So that’s Smart TV applications on games consoles, streaming sticks, that kind of thing.

And one of my highlights was delivering Wimbledon and the World Cup in 2018 in Ultra-High Definition 4K, as a live stream rather than an on-demand stream. And we’ll talk about some of the challenges and things that we did for that.

Where BBC came from (and where it’s going)

So, some context for where the BBC has been and where the BBC is now.

We have migrated from a kind of traditional, “throw it over the wall” deployment model, where you had individual teams and then you had a production environment and a pre-production set of environments. And it’d be a case of asking a separate team to manage that deployment for you.

As a software team, or as a product team, there wasn’t great visibility into what was happening in that live environment.

This dedicated operations team was quite a small team. And they were responsible for deploying software across the entire BBC.

So, it would take maybe 40 hours to deploy a change: you deploy to a stage environment, let that soak for a bit, and then the next day you could do a deployment to live. (There were no live deployments on a Friday.)

So, if you’ve got a change on a Wednesday, you’re looking at next week before you can get that out.

Now, things have moved on quite a bit from then.

So, you’ve got product and service teams, and we have over 4,000 different deployed services in the cloud at the moment. That’s across AWS primarily, but also GCP and Azure. And that’s the cloud services.

Then we also have a lot of stuff still deployed into our in-house data centers. So, we’ve got a number of data centers. So, we’re split between sort of on-prem and in the cloud.

Now, this talk is going to be very much kind of evidence-based. This isn’t going to be kind of theoretical things you could do. This is stuff that we’re actually doing.

So hopefully, there’s some interesting takeaways that you can get from that.

What are operational controls?

So, operational controls, it’s a bit of a kind of a funny term. I don’t think it’s particularly obvious what that actually means. So let’s just define what we’re talking about.

So, you’ve got the systems, we all have systems. But an operational control is something that you can implement into your system, to change its behavior in a live or pre-prod environment, without having to make a code change or redeploy your code.

And the form of this you might be most familiar with is feature toggles, or feature flags.

And we’re going to talk about different strategies and how they can be used and things like that. So feature toggles and feature flags are like an industry kind of term, whereas flag poles is quite a BBC term.

If you do a Google search for flag poles, you don’t get what we’re talking about. So, flag poles are another form of operational control, and again, we’re going to talk about that.

Then you’ve got rate limiting and concurrency monitoring. That point about Wimbledon and the World Cup in Ultra-High Definition, that comes with concurrency monitoring.

And then lastly, we’ve got circuit breakers as a way of protecting our services from failure. And how we can degrade in a nice manner.

The software development lifecycle

So, deploying software and developing software. Typically, we want to make a code change, we want to test it and we want to ship it. And then we want to carry on.

And you might have seen diagrams like this, this is just one I found. And we’ll sort of skip over the planning, analysis and design stages. And in this talk, we’re going to focus on implementation, testing and integration, and then the deployment.

And crucially, I say deployment, but we also mean releasing. And we’ll get onto the difference between those two terms.

So, we want to iterate quickly around the circle. We want to try and keep that feedback loop super short, because we don’t want to be spending multiple weeks building a feature,

then releasing it to our audience only to realize it’s actually not quite right and we need to go iterate around again.

So, all the kind of standard thing that we’ve been talking about in an agile world for a while, let’s try and move quickly, ship things fast. And those fast feedback loops are important from a product perspective, to make sure we’re building the right thing.

But also from an engineering perspective, we want to be nimble, we want to react quickly. If there’s an issue in live, we want to be able to get on it quick.

So, that’s the kind of context of where we’re at.

So, we’ve got multiple people working on multiple things all at the same time.

How do we make sure we don’t block all these various people?

So, change A, shouldn’t prevent change B from being released and so on.

We’ve also got multiple services across multiple environments. We have an integration environment, test, staging, and production, live.

We also don’t want long-lived feature branches. That’s something that my team sort of share. It’s a trait that we believe in.

It’s also something that the BBC as a whole believes in. So, all the various product teams around the BBC in general, are trying to ship fast without using feature branches.

So, this means that we’ve got regular raising and merging of pull requests to master. And work is often more than one pull request in size, which means that you can have changes in master that shouldn’t be user-facing, because they’re unfinished or not ready to be revealed to all your users.

And we want master to always be releasable. So, we have all these different kind of plates spinning that we need to manage.

And then, of course, you’ve got urgent hotfixes that might need to go out.

If you’ve got changes in the master branch, you’ve got an urgent hotfix to go, you’re going to end up having to do a load of Git magic to try and get the thing resolved.

And that goes against our principle of not having feature branches or like a full release strategy using branches.

Feature flags

So, that brings us back around to feature flags.

So, if we’ve got some code or some particular feature or a change that we don’t want to actually be deployed yet, how do we handle that without it going out when we deploy?

So we can use feature flags. Now, in essence, we’re talking about an if-statement or some kind of conditional.

And we’re going to wrap our change in that conditional. And only if it’s true, will that new code get run. And this allows us to deploy code to production. And as long as the feature toggle is switched off, effectively, there’s no change. The net result will be no change to users.

But it still allows you to ship code.
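To make that concrete, here’s a minimal sketch in TypeScript. This isn’t our actual implementation; the toggle name and the render functions are made up for illustration.

```typescript
// Minimal feature-toggle sketch: the new code path only runs when the
// toggle is on. Toggle name and render functions are hypothetical.
type Toggles = Record<string, boolean>;

function renderSignInPage(toggles: Toggles): string {
  if (toggles["new-sign-in-form"]) {
    // New behaviour, hidden behind the toggle until we choose to release it.
    return renderNewSignInForm();
  }
  // Existing behaviour: with the toggle off, deploying this code changes
  // nothing for users.
  return renderExistingSignInForm();
}

function renderNewSignInForm(): string {
  return "<form><!-- new sign-in form --></form>";
}

function renderExistingSignInForm(): string {
  return "<form><!-- existing sign-in form --></form>";
}
```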

And these feature toggles, you can sort of classify into two different forms. We’ve got temporary and permanent.

So, temporary toggles can be used for feature rollout and experimentation. We’ve got some great examples of how we do that in a minute.

The permanent ones, however, are also quite interesting. So these can be used for incident mitigation and failover.

So, imagine a service you’re reliant on goes down: you can use a feature toggle to turn off that feature or hide it from the user.

Or you might want to pull up some different signposting to say, “We know there’s some issues happening at the moment, how about you try this instead.”

Some kind of degraded experience, rather than just flat out erroring. It allows you to put messaging up during incidents as well. So you may be familiar with status pages, that kind of thing; you can actually integrate those into your app, turn on the feature toggle to say, “Yeah, we know there’s an issue. How about this instead?”

Some guidance for the user.

Mitigating incidents through failover

And then in terms of failover, we’ve got the ability to do things like read-only modes.

So, if you’ve got a service that’s stateful, you’ve got a database somewhere.

What if you had a read replica database, that if the primary goes down and there’s some kind of latency between switching over, rather than your services erroring for that time—what if you just went into a read-only mode?

Pull up some messaging to say to the users, “You can’t make any changes right now, but at least the services are up and running for you to read data.”

You can also do that for static content fallback. That’s something that we do with our video-on-demand and live streaming solution. I talked about iPlayer earlier, I’ve got a great example for that in a minute.

Deploying is different than releasing

So, the key takeaway for me here is that deploying is different to releasing. They’re not the same thing.

Feature toggles allow you to change code and ship it without actually releasing the feature to users.

You can deploy ahead of time, and you can release when everybody’s ready.

So, that might mean in terms of the code, the code is just not ready, it’s got issues, it’s not fully tested. It might be that you need to review that change with a product owner, or maybe your user experience team, your designers, that kind of thing.

Personally, I find we work with our UX team and get designs,

and then the first time that they see those designs in production, in a real-life environment, is sometimes when we ship the feature, which is obviously way too late.

So, if we can put those changes behind a feature toggle, all the code can go out. We unblock ourselves, we get all these great benefits. But then we can release when everyone’s ready. We do soft launches as well.

Deployment artifacts used by the BBC

So, in terms of actual tech, sort of infrastructure that we’re running, we’ve got kind of two deployment artifacts that we care about and we consider.

So we’re primarily on AWS, certainly for my team we are. And we’re not using containers. We’re using something similar to HashiCorp Packer, which you may be familiar with.

Basically, we install our software on a base image, we use CentOS, and then we take a snapshot, that creates an AMI, and then we can just roll out.

We’ve also got static assets. We are a web product. So we have APIs, but we also have user-facing pages, sign-in screens, user registration, that kind of thing. So, we’ve got JavaScript bundles, CSS, images, that kind of thing.

So we all like a diagram. This is an architecture diagram, roughly of what we’re doing in the BBC account team.

We’ve got a user down at the bottom, a bit of load balancing, a CDN with S3 as the origin. And then we’ve got a standard multi-tier kind of setup, with a database at the back with replicas as well.

So, my team primarily looks after this first tier, the user-facing tier, as well as the tier that all the integrating products I talked about, iPlayer, Sounds, and so on, integrate with. At the back, managed by a slightly different team, is the actual underlying authentication platform that we use.

Why use rolling deployments?

And in fact, let’s just jump back.

These instances here are just part of an auto-scaling group. So, when we create a new AMI with our new code version in it, we can just progressively roll out that version.

Now, that’s good, but it’s quite a coarse way of rolling out features.

You’d have little control over the timing of when it’s going to get rolled out. We don’t want to take all the instances out and have no service, and we also don’t want to bring up a ton of instances and switch over super fast.

Because if there’s an issue and we roll out instances one by one, at least it’s only a percentage of the audience that’s going to get affected, and then we can roll back.

Now, the benefit of doing this kind of rolling deployment is that we see no need for doing blue-green deployments.

So rather than duplicating this whole infrastructure again, and then doing either DNS magic or load balancing magic to switch between blue and green, we just have a single set of infrastructure for our production environment.

Of course, the whole thing has been duplicated for a staging environment, test, and integration.

But in terms of our production service, that our members of the public are using, it’s just a single set of ASGs with an AMI that gets rolled out.

This also lets us do some fun stuff with CloudFormation and CloudWatch, where we can do automatic rollbacks if errors or alarms fire during that roll-out process.

So, in terms of the static assets, we’ve got this front-end code, JavaScript, and so on.

But this is super boring as a deployment, which is really great. You’re probably doing this as well.

We’ve got a CDN URL, basically just a proxy through to S3.

We can push up our static assets with a version number, and they’ll never get used until something references them.

So, the actual case of deploying static assets is super boring. And then it opens up the whole conversation about how you release the static assets.

Feature toggles

So, we’ve got a change that we’ve made. We’ve put it behind a feature toggle.

And then comes the super fun bit. I really like this bit.

This is where we can consider release engineering.

So, you’ve got the code out, that’s the boring bit. How do you actually decide to enable the feature toggles to let people at the new feature?

So you’ve gotta have a strategy. There are different strategies you can employ.

On the most basic level, you could just have a simple on/off toggle. Just a boolean toggle switch that you flick, and suddenly the feature is available to the world.

For small features, we’ll often just use this; it has a low overhead.

Another approach is percentage-based rollout. So of course, 100% would be all your users. What you can do in terms of implementation is generate a random number, between 1 and 100, or 0 and 1, whatever.

And then if the percentage you want to target is, say, 20%, and the number is less than 0.2, enable the feature.

It has some downsides, you need to think about whether your feature needs to be sticky for a user. So, if you have a user that comes to your site, goes away, comes back again, if you’re going to roll the dice again, you might have an inconsistent experience. So, it’s something to certainly think about.

Is that important to you, for your feature?
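As a rough sketch of both approaches (the names and the hashing scheme are my own, not the BBC’s): a plain dice roll, and a sticky variant that hashes a stable user ID so the same user always lands in the same bucket.

```typescript
import { createHash } from "crypto";

// Naive percentage rollout: re-rolls the dice on every request, so the same
// user can flip between the old and new experience across visits.
function enabledByDiceRoll(percentage: number): boolean {
  return Math.random() * 100 < percentage;
}

// Sticky percentage rollout: hash a stable identifier into a bucket from
// 0 to 99, so the decision stays consistent for that user.
function enabledForUser(userId: string, percentage: number): boolean {
  const digest = createHash("sha256").update(userId).digest();
  const bucket = digest.readUInt32BE(0) % 100;
  return bucket < percentage;
}

// Example: target roughly 20% of users.
console.log(enabledByDiceRoll(20));           // may differ on every call
console.log(enabledForUser("user-1234", 20)); // always the same answer
```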

And then we’ve got just manual opt-in.

So generally, this is for internal staff users. So, what we quite often do, we’ll build a feature, put it behind a feature toggle, and then our QA team, and our product owners and the engineers, they can enable that feature in a production environment to really get a sense of how it’s going to behave once it’s actually released.

And there’s various different feature toggle implementations. There are commercial providers such as LaunchDarkly and Split, but at the BBC we’ve built our own.

So, generally, we’ve got something that looks like this. This is the UI, little admin interface. Cosmos is our internal deployment system.

I mentioned HashiCorp’s Packer; Cosmos does a very similar thing. It just bakes these AMIs.

It also has something called dials, which are just toggle switches. So these switches here, this is an example of a feature toggle that we’ve got.

We’ve just got the two options here, just a boolean toggle. If you want to do percentage-based rollout, you can do that as well; you just have more options, and you choose the percentage you want.

It’s a lightweight way of providing feature toggles to BBC teams. Under the hood, there’s a little bit of a UI here, but ultimately the source of truth is a JSON file that lives in S3.

There’s no huge database. There aren’t a lot of APIs.

Ultimately, each of the 4,000-odd services that we have has a JSON file that lives in S3.

There’s then an agent installed on our EC2 instances, as part of the bakery process when we build these AMIs, which is just polling S3 every now and again.

And you can be quite efficient about it with ETags and so on.
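The sketch below shows the general shape of that polling loop, assuming the AWS SDK for JavaScript v3. The bucket name, key, poll interval, and the exact 304 handling are assumptions; the real agent is an internal BBC component.

```typescript
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";

// Sketch of an agent polling a per-service toggle file in S3, using a
// conditional GET so unchanged files cost almost nothing. Bucket, key,
// and interval are hypothetical.
const s3 = new S3Client({});
let lastEtag: string | undefined;
let toggles: Record<string, unknown> = {};

async function pollToggles(): Promise<void> {
  try {
    const res = await s3.send(
      new GetObjectCommand({
        Bucket: "example-toggle-bucket",
        Key: "my-service/toggles.json",
        IfNoneMatch: lastEtag, // only download if the ETag has changed
      })
    );
    lastEtag = res.ETag;
    toggles = JSON.parse(await res.Body!.transformToString());
  } catch (err: any) {
    // 304 Not Modified: nothing changed, keep the current toggles.
    if (err?.$metadata?.httpStatusCode === 304) return;
    console.error("Toggle poll failed, keeping last known state", err);
  }
}

setInterval(pollToggles, 30_000); // 30s is an assumed interval, not the BBC's
```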

There’s an audit trail for who changed what. And there’s a role-based access control setup around it as well. So, our dedicated operations team that are on shifts 24/7, they have access to it, as does the service and product team, but other product teams don’t.

We also emit events. So, if you come in here and you change these toggles, we’re emitting events to SNS. Looks a bit like this.

So, this is really important because it radiates visibility of releases. If deployments are going to become boring, that makes releases important.

So, you want to know who, what, when, what’s going on?

We’re pushing to Slack for notifications. And as an innovation side project, I’m looking at how we can push that data to Datadog.

We use Datadog dashboards for our monitoring. Datadog has a feature where you can do vertical markers on these timeboards.

So, you would be able to see a vertical marker to say this feature toggle has been enabled to this percentage of users or whatever.

And then you’d be able to correlate that with any subsequent errors or latency increases, and tie them back to the release.
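As a sketch of that side project (not something we have in production), the toggle-change events could be forwarded to Datadog’s public events API, which is what shows up as overlays and vertical markers on timeboards. The tags and the calling context here are assumptions.

```typescript
// Sketch: forward a toggle change to the Datadog events API so it can be
// overlaid on timeboards. Tags and the calling context are hypothetical;
// DD_API_KEY is read from the environment.
async function emitToggleChangeEvent(
  toggle: string,
  value: string,
  changedBy: string
): Promise<void> {
  const response = await fetch("https://api.datadoghq.com/api/v1/events", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "DD-API-KEY": process.env.DD_API_KEY ?? "",
    },
    body: JSON.stringify({
      title: `Feature toggle changed: ${toggle}`,
      text: `${changedBy} set ${toggle} to ${value}`,
      tags: ["team:bbc-account", `toggle:${toggle}`],
    }),
  });
  if (!response.ok) {
    throw new Error(`Datadog event failed: ${response.status}`);
  }
}

// Example usage, e.g. from whatever consumes the SNS notifications.
emitToggleChangeEvent("new-sign-in-form", "20%", "ross").catch(console.error);
```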

Only production really matters

So, production is really the only environment that matters to us. Like I said, we have these four environments, integration, test, stage, and production. But really, production is the one that we truly care about.

The other environments aren’t representative of reality.

There are various different traffic levels.

We have, obviously, a huge difference in the amount of traffic coming into our production environment versus our stage environment that receives a few requests a second; production receives thousands of requests a second.

You’ve also got different browsers that are being used. Internally, most of us are on Macs, most of us are using Chrome, latest version of Chrome at that.

So, if you’re on a stage environment testing a new feature, maybe a front-end feature that’s dependent on JavaScript in the client, you’re not testing a great breadth of different environments, different operating systems and so on.

And we provide APIs to various product teams at the BBC. And we find people using APIs in weird and wonderful ways that we didn’t plan for. We certainly didn’t document for, such as accessing internal APIs from JavaScript clients, and so on.

That’s not something you can find in a staging environment, because it’s a really synthetic test environment.

You sort of assume that everyone is like a good actor and is using your thing as designed. (Not always true.)

You’ve also got other components in the request chain. In our production environment, we’ve got a traffic management layer, DDoS mitigation, caching, and so on, that can be different to staging environments. In an ideal world, they would be the same, but they’re often not.

Staff mode

We talked about different strategies for turning feature toggles on and off, there’s a fourth one as well, and it’s called staff mode.

This is something that we’re doing at the moment as a way of soft-launching features to BBC staff only.

And the under-the-hood implementation of this is something called JSON Web Tokens. You may be familiar with these; it’s effectively a JSON block that’s signed, so you can assert that it was created by you.

And if you’re a BBC staff member, you get issued with a JWT, which gets stored in a cookie. And then we can use that as a toggle. We can decide, “Okay, you’ve presented a staff toggle…staff cookie to us, we can turn on these features.”

We can integrate that with our analytics provider. And therefore, we can monitor things in our stats and our analytics to say, “This is a staff member, they’ve got the new feature, and this is how they’re using the product.”
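A minimal sketch of the staff-mode check, assuming the widely used jsonwebtoken library; the cookie name, claim name, and secret handling are made up, not the BBC’s real token scheme.

```typescript
import { verify } from "jsonwebtoken";

// Sketch of staff mode: a request that carries a valid, signed staff JWT in
// a cookie gets soft-launched features turned on. Cookie name, claim name,
// and secret are hypothetical.
function isStaffRequest(cookies: Record<string, string | undefined>): boolean {
  const token = cookies["staff-token"];
  if (!token) return false;
  try {
    // Verifying the signature asserts the token was issued by us.
    const payload = verify(token, process.env.STAFF_JWT_SECRET ?? "");
    return typeof payload !== "string" && payload.staff === true;
  } catch {
    return false; // expired, tampered with, or not one of ours
  }
}

// Usage: if isStaffRequest(parsedCookies) is true, enable staff-only toggles.
```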

There are biases with this, of course, if you’re using it to try out a new feature. Generally they’re UK-centric users, with a low cardinality of browsers and devices. Most users at the BBC are on corporate-issued devices, with corporate-issued and mandated software versions.

And of course, the users know the products well, they work for BBC.

Geolocation restrictions

The BBC, or at least the products I work on, have geolocation restrictions for various kinds of rights and legal reasons.

And that can be a problem when we’re trying to actually test features.

So, we’re deployed into Ireland, into AWS data centers, which is outside the EU, or outside the UK, I should say.

And if we run an automated test from that environment, it appears that we’re outside the UK, and you don’t get the feature. So, you can’t run your automated tests on CI from AWS in that data center.

So, using that same JWT approach of issuing staff tokens, we can issue sort of pseudo staff tokens to our testing environments, or to any device really, or script that needs to access the product from the UK, or at least “from the UK.”

Embedded devices

And then we’ve got embedded devices like TVs.

So, I mentioned that earlier in my career at the BBC, I was building Smart TV applications, Smart TVs, games consoles, and so on. But they’re a real pain to test, and a real pain to kind of debug.

So, if you imagine your Smart TV, your 80 inch LCD on the wall, you can’t just open up a dev console and see what the app is doing.

So, what we did is we implemented a hidden debug menu. You type a secret code in on your remote, and it unlocks this debug menu.

And this is on production, and we ship this to the world. And if you know what the code is, you can access the debug menu.

There’s nothing actually that exciting in the debug menu. But what it allows us to do is point the TV application at different environments. You can enable features, get a big list of all the feature toggles, and you can toggle them on. So you can have any combination of features that you want.

And this is great if you’re giving a demo to somebody, if you’ve got your product owner here, or your design team, and you want to actually demo this new feature.

So, you can just go to any retail device (it doesn’t have to be a debug device or have special firmware or anything like that) and you can just enable the features that you need.

It’s also a great way of allowing for rollout of features to particular devices.

So we’ve got these many hundred different Smart TVs, games consoles and streaming sticks. They all have different configurations because they all have different characteristics. Some don’t have colored buttons on the remote. Some have just a basic up, down left, right. Some can do live streaming, some can’t. So all this config already exists.

So, we make use of that: we add to the config for a particular device type that this feature is now enabled. It allows us to roll out features that way.
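In outline, that looks something like the sketch below; the device IDs, capability fields, and feature name are illustrative, not the real configuration format.

```typescript
// Sketch: the per-device-type configuration that already describes device
// capabilities can also carry a list of enabled features. All names here
// are made up.
interface DeviceConfig {
  colouredButtons: boolean;
  liveStreaming: boolean;
  features: string[]; // features rolled out to this device type
}

const deviceConfigs: Record<string, DeviceConfig> = {
  "example-tv-2019": { colouredButtons: true, liveStreaming: true, features: ["uhd-trial"] },
  "example-stick-v1": { colouredButtons: false, liveStreaming: true, features: [] },
};

function isFeatureEnabled(deviceType: string, feature: string): boolean {
  return deviceConfigs[deviceType]?.features.includes(feature) ?? false;
}

console.log(isFeatureEnabled("example-tv-2019", "uhd-trial")); // true
```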

Feature toggles vs flag poles

So, we’ve been talking about feature toggles as a way of sort of releasing features. But you can also use them for operational health.

So, we talked earlier about feature toggles and flag poles. Feature toggles, we know now. Flag poles, we haven’t talked about.

So, flag poles are generally a BBC concept; they go by different names in different companies. It’s an internal-facing way of declaring your service status.

So, BBC Account is a shared authentication platform. We integrate with all these various products at the BBC.

And occasionally, BBC Account has problems. We ship bad code. We make bad assumptions. There’s a bug in live. Or a system that we’re dependent on is in a degraded state.

And we want to announce the status of our system to all these different products.

So, a product like BBC iPlayer, the video-on-demand streaming solution, at the moment in the UK requires you to sign in with a BBC account. It’s a mandatory sign-in.

But if BBC account is down or degraded and users can’t sign in, what do you do? Do you just take out all these other products at the BBC?

Our solution is to expose this flag pole as an API that you can query. Under the hood, it’s again just an S3 file. It’s super resilient. It’s super cheap.

These products can read the flag pole state, see if it’s red or green. They generally go by traffic lights. If they see that it’s red or not green, each product can decide how they want to react to that.

That might be that a product like iPlayer that usually requires mandatory sign-in, might fall into a non-mandatory sign-in mode.

It might just say, “Okay, we’re not going to enforce that people sign-in anymore. You can use the product, so you can continue to watch video and content.”

So as a user, they’re happy, they’ve got the experience that they wanted. And us as the account team haven’t broken all these other products. It takes a bit of pressure off when we’re trying to fix a live incident.
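From the consuming product’s side, the logic is roughly the sketch below; the URL, response shape, and the fallback behaviour are assumptions about how an integrating team might wire it up.

```typescript
// Sketch of how an integrating product might consume the account flag pole.
// The URL and response shape are hypothetical; the real API is internal.
type FlagPoleStatus = "GREEN" | "RED";

async function accountFlagPole(): Promise<FlagPoleStatus> {
  try {
    const res = await fetch("https://flagpole.example.internal/account");
    const body = (await res.json()) as { status: FlagPoleStatus };
    return body.status;
  } catch {
    // If the flag pole itself can't be read, assume the worst and degrade.
    return "RED";
  }
}

// Each product decides its own reaction. An iPlayer-style example: drop the
// mandatory sign-in requirement while the account platform is degraded.
async function isSignInMandatory(): Promise<boolean> {
  return (await accountFlagPole()) === "GREEN";
}
```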

Circuit breakers

Then we’ve got circuit breakers.

So, you may be familiar with circuit breakers as a way of protecting your service from broken or slow endpoints, the dependencies you’re relying on. And they allow you to fail fast.

There’s been quite a few talks about circuit breakers.

I won’t go into a huge amount of detail about them. But we find these super useful as a way of preventing us from pummeling another service while it’s down.

So, if it’s in a degraded state and we’ve got many thousands of requests a second coming in, we’d effectively just proxy those straight through to that backend service and take it out further.

And just as it’s trying to recover, we’re slamming traffic onto it.

So we can, in an automated sense, throw a circuit breaker, stop sending requests to that backend for some kind of time period. And then on a percentage basis, we can let a few requests trickle through to see if the service is now back up.

In an ideal world, you would then ramp back up; that’s something we’re still working on. At the moment we trickle requests through, see if the service is back up, and then we close the circuit. Requests carry on as normal.
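A compact sketch of that open / trickle / close behaviour is below. The failure threshold, cool-down period, and trickle rate are arbitrary numbers, and this isn’t the BBC’s implementation.

```typescript
// Circuit breaker sketch: fail fast while a dependency is down, then let a
// small percentage of requests trickle through to test recovery. Numbers
// here are arbitrary.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 5,
    private readonly coolDownMs = 30_000,
    private readonly trickleRate = 0.05 // 5% of requests probe the backend
  ) {}

  async call<T>(request: () => Promise<T>, fallback: () => T): Promise<T> {
    const open = this.failures >= this.failureThreshold;
    const coolingDown = Date.now() - this.openedAt < this.coolDownMs;
    const probe = Math.random() < this.trickleRate;

    // Circuit open: fail fast with the fallback, except for the occasional
    // probe request once the cool-down has passed.
    if (open && (coolingDown || !probe)) return fallback();

    try {
      const result = await request();
      this.failures = 0; // a successful call closes the circuit
      return result;
    } catch {
      this.failures += 1;
      if (this.failures === this.failureThreshold) this.openedAt = Date.now();
      return fallback();
    }
  }
}
```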

So, this is iPlayer on the web (not the Smart TV application), and this is what it usually looks like when I’m signed in. There’s content, and I’ve got a continue-watching row personalized to me.

But if the service that powers this were to go down, and that could just be a bad deployment, a bad release, or it could be that it’s overloaded (sudden spikes in traffic are something that we deal with), how do we cope?

So, we actually have a scheduled process that runs, and it runs every few minutes. It takes a snapshot of this page as a JSON file.

And then our edge layer for traffic routing, if the backend service is down and it can’t render this page, we have a really lightweight system that can take that JSON file and render a bare-bones page.

It won’t be personalized for the user. It’ll have fewer features, but ultimately, the content will be there and the content will be playable.

So even if the front-end service that’s rendering these pages encounters problems, or its dependencies encounter problems, we can still have a static fallback page.

It’s quite an old-school way of thinking, like HTML files on an FTP server, but it works for us.

And our edge layer can just read that and serve it up instead.
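The edge behaviour amounts to something like the sketch below: try the origin, and if it fails, render a bare-bones page from the most recent JSON snapshot. The URLs and snapshot shape are made up.

```typescript
// Sketch of the edge-layer fallback: serve the real page if the origin is
// healthy, otherwise render a bare-bones page from the last JSON snapshot.
// URLs and the snapshot shape are hypothetical.
interface PageSnapshot {
  title: string;
  items: { title: string; href: string }[];
}

async function servePage(originUrl: string, snapshotUrl: string): Promise<string> {
  try {
    const res = await fetch(originUrl);
    if (!res.ok) throw new Error(`origin returned ${res.status}`);
    return await res.text(); // the full, personalized page
  } catch {
    // Degraded mode: no personalization, fewer features, but the content is
    // still there and still playable.
    const snapshot = (await (await fetch(snapshotUrl)).json()) as PageSnapshot;
    const links = snapshot.items
      .map((item) => `<li><a href="${item.href}">${item.title}</a></li>`)
      .join("");
    return `<html><body><h1>${snapshot.title}</h1><ul>${links}</ul></body></html>`;
  }
}
```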

We also use this in the Smart TV app as well.

So, this is what a playback experience looks like. A little bit covered here by the podium. But we’ve got different playback controls, we’ve got related content and so on.

All this is server-side rendered, even though it’s a TV application. If that server-side rendering service is encountering issues, or its dependencies have issues, we can fall back.

So, for example, an internal service that’s shared across the BBC is a recommendation service (recommends content given a piece of content). If that goes down, we can just get rid of this row at the bottom. The service will degrade rather than completely blow up.

There’s a tick box here, a button that allows you to favorite programs and add them to your list, to your BBC account. If that service has issues, just get rid of the button.

So, we’ve taught it to say, “It’s not working right now.”

And then if all else fails, we do the same approach. We have a static fallback.

So, we have a static server-side rendered snippet of HTML that the client can at least show instead.

It allows the user very basic controls: play, pause, stop, and so on. They don’t get any of those extra features, but at least they’re achieving their goal of watching content.

Live-streaming Wimbledon and the World Cup in UHD 4K

And then we’ve got Wimbledon and the World Cup in Ultra-High Definition 4K. That’s what I teased at the start.

So, the FIFA World Cup 2018, and Wimbledon 2018 and 2019, which just finished in the last week: we’ve been streaming those in Ultra-High Definition 4K as a live stream, rather than video-on-demand.

You might be familiar with the Attenborough programs; the BBC has a Natural History Unit that produces a lot of great nature content.

But all that was done in UHD on-demand, which isn’t actually a huge ask.

Doing it as a live stream, however, there’s a lot of challenges there.

So, this is part of a trial. This isn’t a fully widespread feature that the BBC has rolled out yet.

But for the live streams that we do as IP streams at the moment, we’re looking at around five megabits per second as the bit rate.

For Ultra-High Definition, we’re looking at 36 megabits per second as a stream.

And this is because the technology for encoding video live, in real time, just isn’t there yet to get that bit rate down.

It requires a ton of computation to reduce the bit rate. So at the moment, the workaround is higher bit rate.

It’s always going to be a higher bit rate than HD, of course.

You’ve got four times the number of pixels. And then you’ve got deeper colors as well.

And it all comes down really to this quote. There’s a BBC internet blog that you might find interesting, all about the technologies the BBC uses under the hood. And it’s this quote here: it comes down to distribution capacity. We use content delivery networks.

But there is a fear, when we’re speaking to the content delivery networks and internally, that if everybody had a compatible device, a new UHD device, and streamed in Ultra-High Definition, the UK internet infrastructure just wouldn’t cope.

For an example, the Women’s World Cup, England versus USA, which has just been on: there were 9 million viewers for that program. If we’re looking at 36 megabits per second, that’s over 300 terabits per second of unicast video streams, so we’re talking DDoS levels of traffic at that point.
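As a quick back-of-the-envelope check on that figure: 9,000,000 concurrent viewers × 36 Mbit/s = 324,000,000 Mbit/s, which is roughly 324 Tbit/s of unicast video, comfortably over 300 terabits per second.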

Concurrency limiting

So, our solution was concurrency limiting (and remember, this was a trial).

So, we enabled this for certain streams. It is itself an operational control, and it provides a cap on the number of UHD streams that we will allow.

So, when you come to the product, you find the content, you hit play, you get a modal that says, “Do you want to watch in Ultra-High Definition? Do you want to be part of the trial?”

And we use a service called AWS Kinesis Analytics, a commercial service from AWS. And using that we built something called a counting service.

It’s super simple. You send it a ping, it counts it. You send it a ping when you leave the stream, it subtracts one. (Just a big calculator, really.)

And what that has allowed us to do is keep a live count of how many people are in the stream. And then we can set a limit on it as well.

So after a certain point, some undisclosed figure, when that limit is hit, we then change that dialog, and we say to users, “Sorry, there’s no more capacity for you to watch this in Ultra-High Definition. Here it is in HD; come back later and you’ll maybe get in the trial next time.”
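Stripped of the Kinesis plumbing, the counting-and-capping logic amounts to something like this toy sketch; the cap value and class name are made up.

```typescript
// Toy version of the concurrency-limiting counting service. In reality the
// pings flow through AWS Kinesis Analytics; the cap here is invented.
class UhdConcurrencyLimiter {
  private active = 0;

  constructor(private readonly cap: number) {}

  // Called when a viewer hits play and asks to join the UHD trial.
  tryJoin(): boolean {
    if (this.active >= this.cap) return false; // "no more capacity, here's HD instead"
    this.active += 1;
    return true;
  }

  // Called when a viewer leaves the stream.
  leave(): void {
    this.active = Math.max(0, this.active - 1);
  }
}

const limiter = new UhdConcurrencyLimiter(10_000); // the real figure is undisclosed
if (limiter.tryJoin()) {
  // start the UHD stream, and call limiter.leave() when the viewer exits
}
```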

Deployment isn’t releasing

So, really, the takeaway points that hopefully I’ve talked about are: deployment is different from releasing.

Deployment should be super boring. Release is where it can get really interesting.

That’s where you can do all these fun strategies for how you’re going to actually release your software to your audience. And your audience might be members of the public. It might be internal users, QA teams, product owners, and so on.

Use operational controls to test in production

You should implement operational controls to test in production.

This whole idea of testing in production is something that’s been bubbling away recently; in places like Twitter, you’ll find a lot of conversations about testing in production.

But there is a cost to these feature toggles, of course: you ultimately have to support multiple implementations of your service.

Sometimes it’d be super quick and easy if we could just bring the fix in, ship it, away we go, no ceremony around it.

But the benefits of having feature toggles outweigh that.

There’s extra complexity with automated tests, for example: you’ve now suddenly got two sets of tests to run.

If you’ve got two feature toggles, you’ve now got four combinations. Think of all the different permutations that you can have of features enabled and features disabled; that can suddenly skyrocket. And then on a percentage basis, it can get quite tricky.

So, it comes back to the idea of making sure that it’s super obvious what feature toggles are enabled, what’s the actual experience that your users are having. And we do that through Slack and soon Datadog.

Use flagpoles to create great user experiences

And then flagpoles, that way of radiating your service’s uptime and availability.

You might want to consider the critical user journeys for your apps. In the video streaming example earlier, the critical path is: can users find content? Can they play it, and does the player work?

The player will disappear and they can watch the program for an hour. That’s the main thing that we care about.

What’s nice to have?

It’s really nice that we have all these extra features and we can offer recommendations to keep the user within the app and keep them playing content.

But if those backend services are suffering or they’re degraded, then they’re the kind of things that you can just…you can cut.

This is similar to the microservices architecture. That whole idea that you’ve got multiple small services. If one dies, it hopefully doesn’t take out the entire system.

Thank you.