All right, sounds like I’m on now. You can go to the screen. All right, let’s get a show of hands. How many people here like free money? That’s awesome. I have good news for you guys. So my friends at the Cash Cab are gonna pick you up when you leave and they’re gonna take you to the hotel, or to the airport and that’s the spot to do it. But I’m gonna talk to you about something that’s almost as good, so how to do more with less. See that’s kind of like giving you free money in some ways.
This is about experimentation and how you might apply this into a development sense. I actually was experimented on as I was flying out here, the airline asked me if I wanted to spend $29 to sit in exit row, and they succeeded. So clearly there’s experimentation going on everywhere. I wanna talk a little bit about this in a software development stance or simulation here.
So I’m Brian Lucas, quick bit on me. Optimizely Senior Staff Engineer, run a lot of the production and release developer operation side of getting code out into customer’s hands. I was a founding member at Credible, we IPO’d last year, Sling TV, we had a lot of scaling challenges there. So Optimizely, real quick on this… Optimizely is an experimentation company. We focus significantly on taking one or more variations and presenting it to your customers. So we play in a lot of areas: iOS, Web, backend, frontend.
We started in 2010, with a very simple concept, you could simply drop this into your website, it’s a WYSIWYG editor, you know, go ahead and rearrange some text. Does this green button or does this blue button perform better? Does the cart back out? Do people leave? Fast forward to now, we’re actually doing experimentation everywhere, frontend, backend. If you have an Apple TV, an Android device, there are probably actually experiments running using our platform, or one of the many experimentation platforms out there. Lots of customers primarily all larger enterprise scale, but we also have a wide mix of everyone.
So wanna talk about a few things here, big traits on successful software companies. They all have several common traits here: high velocity, high levels of quality, highly productive developers. I’m from Colorado originally so plays into that, or you know, some people might have the Ballmer peak on that as well, if you know Ballmer peak. But unfortunately, as software complexity grows, so do the costs to develop it. So this includes a lot of things: slowdowns, risks, safeguards. You’ve gotta build massive amounts of process and complexity around that.
So this also leads to process disintegration and that’s actually what we’re talking about today. We’re gonna talk about a few things that you as both engineers, developers, DevOps can all implement in ways to get your code out faster, improve your velocity, make things a more sane environment for you to be in. When you have process disintegration you have a lot of risks, internally expensive validation, QA takes forever, brittle build processes, we’re gonna talk about a lot of these things. Snowflakes, talk a little bit about that, and then customer risks, uncaught bugs. Like can you actually make everything perfect? Probably not. Is this actually something the customer wants? Maybe, maybe not.
The companies that excel or do really well on this, you know, the FAANG companies come to mind, they’ve all mitigated or dealt with a lot of these things. One of the ways they do this is through faster delivery and through getting code out very quickly, releasing with a high degree of certainty, confidence, and minimal change sets. So this is all about velocity. When you get the feedback built out and you have a lot of things turning around giving you that feedback, you know whether this is actually useful, whether this is a meaningful signal, whether, “Hey, I’d rather go out with one day’s worth of changes versus two months of changes, because there’s probably a smaller chance of an uncaught thing creeping out that’s massive, versus two months of changes going out.”
So releasing faster and doing it with experimentation, that’s really what we wanna focus on here. We wanna talk about how you apply experimentation, how the bigger companies apply this to their own development process, and how you can do the same thing. So how to do more with less.
I’m gonna talk about several core concepts, the first one is reducing QA dependencies, sounds perhaps counterintuitive there. But you wanna automate your feedback loops and then adopt experimentation.
All of these things are essential for high velocity, high performing teams, and you need to do each one of these in order to actually get a lot of changes out, a lot of ideas out, and innovate at a rapid pace to address your changing customer demands and needs. Let’s focus on the first one, reducing your QA dependencies.
Reducing QA dependencies
So 50% of time, this is perhaps a little flexible here or there, is generally spent fixing bugs. There’s a lot of tech debt when a company rushes to get their code out to market, “Oh, no, you know, we took a few shortcuts because if we didn’t do this we didn’t get funding.” Okay that’s fine. That starts to pile up. A lot of tech debt builds up.
As everyone here in this room knows, everything works great when it’s on your laptop. You’ve got you know, your… you got some unit tests, that pass or you just have it doing the thing, and your hack week project, or your prototype or whatever, works awesome. So that’s in isolation. What happens when we put them all into the sandbox? Everyone has to play nice. No one’s happy, all the kids in the sandbox can’t share one ball. No one’s happy about this.
This manifests in something like this, high severity bugs or engineering emergencies. So this is something that we had at Optimizely, we still track this pretty regularly. Engineering emergencies tracked as customer impacting outages, or customer impacting bugs. High severity, you guys probably all have your notion of like sev-zero through sev-three or whatever.
All of these things are largely the result of integration problems. And then sometimes they’ll be like third-party things, Amazon outage, or Google outage, or whatever. A large portion of them can probably be attributed to integration. So the first thing you wanna do is consider prioritizing all of these heavy lifting processes that at the back of your mind wanna kick down the road.
Writing comprehensive suites takes a lot of time. And what’s the traditional path that we all take? Well, “Hey, I’m a developer, got a bunch of code, throw it over to our poor QA team over the fence.” No one likes that. We’re initially inclined towards this model, so you guys have seen… some of you or many of you’ve probably seen the test pyramid. This is an inverted one, this is why we’re largely inclined to this because the engineering effort maps as follows.
So it takes a very large amount of effort to build out tens of thousands of unit tests, integration tests, end-to-end tests, and so forth. You know, things driving web browsers with Selenium and what have you. And it doesn’t take a lot of engineering effort to build out a manual QA process, because you just write things down, you say, “Okay, have at it. Good luck.” But it actually translates into this, these things add up quickly. So the validation effort when you don’t invest at the upfront part looks like this. So the manual QA, the end to end, everything else, actually is inverted here in terms of effort. So organizations lose momentum instead of conserving it when they take this approach.
Shifting left with QA
I’m gonna talk about a few things, I’m shifting left, you may have heard the concept shifting left. I’m gonna talk about it with QA right now. Very simply put, here’s what your traditional build process might look like. Some of you may have this process, design, build, deploy, QA time, it’s time to fix and then release. So QA comes back and says, “I’ve got 10 things to put in JIRA.” Okay, great, we fix it, two weeks if you’re on an average Sprint release cycle.
If you can shift portions of this process left then what you can actually find is you’re front-loading a lot of that heavy lifting and moving it. Instead of QA, fixing, releasing, adding up to two weeks, you can find that this takes a much shorter period of time. And you’re just pushing it out into a hot fix process. So one to three days depending on how fast and how much final manual validation you actually wanna perform.
So reducing your QA helps keep momentum while you can do… of course, the upfront cost is higher, but the validation effort when you actually build these things out, can start to look a lot closer to this, where it’s a single bar representing the amount of effort here to actually validate. When all of these things are automated, it all works beautifully and shifts all of that burden upstream, instead of having a QA team do it. The other thing that we really believe in is shifting a lot of the QA on to developers.
So who knows all of the code as well as the developer? It’s probably not the QA team. Who knows all the unit tests and what they covered? Probably not the QA team. Who does? It’s the developer. We actually ask all of our developers to go validate their code and sign off on it before it gets released. And this is something that’s like crowd source, it’s just distributed, “Hey, can you take five minutes, validate your GitHub commit, and then confirm it against our pre-production environment before we go live?” Okay, good. Guess what I just de-risked a little portion of my release?
Keep test signal:noise ratio high
You also wanna keep your test signal-to-noise ratio high. Tests have a good chance of becoming brittle, so you wanna mitigate those things quickly. GitHub actually has a notion of code owners, so if you wanna see, “Hey, who owns this code?” you should adopt that. You can actually point to a group or set of individuals for tests, directories, and so forth, if there are questions in a certain directory.
This is actually an example of how Facebook does something. They have a fail bot or a test warden that automatically catches and quarantines tests. This is actually like the Facebook model of how they take bad tests, put them into a corner, make sure a developer triages them, fixes them, and then they release them back into the development cycle.
So this is a lot of text here, but don’t worry about that. This is just an example of how you can instrument some kind of feedback on your side to make sure that you’re catching those tests that would otherwise slow people down, keep things moving. And you actually want to do this, this is extraordinarily important. If you don’t do this and you don’t catch those poorly performing tests, every single developer that runs an integration build, has a non-zero chance, probably a higher chance of getting hit with this. So that’s it for the QA side.
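As a minimal sketch of that kind of feedback, and not a description of Facebook’s actual system: one simple signal is a test that both passed and failed on the same commit. The record shape and function name here are hypothetical; you would feed it whatever your CI system exports.

```python
from collections import defaultdict


def find_flaky_tests(runs):
    """Return names of tests that both passed and failed on the same commit.

    `runs` is an iterable of (test_name, commit_sha, passed) tuples. This
    record shape is hypothetical; adapt it to whatever your CI exports.
    """
    outcomes = defaultdict(set)
    for test_name, commit_sha, passed in runs:
        outcomes[(test_name, commit_sha)].add(bool(passed))

    # Mixed results on identical code point at the test, not the change.
    flaky = {name for (name, _sha), seen in outcomes.items() if len(seen) == 2}
    return sorted(flaky)


history = [
    ("test_login", "abc123", True),
    ("test_login", "abc123", False),  # same commit, different result: flaky
    ("test_cart", "abc123", True),
    ("test_cart", "def456", False),   # different commit: may be a real regression
]
quarantine = find_flaky_tests(history)
print(quarantine)
```

Tests on that list would be pulled out of the blocking build and handed to their owners to triage, so they stop slowing everyone else down.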
Automating feedback loops
I wanted to talk now about automating your feedback loops, and then we’ll jump in after that into the experimentation side. So automating your feedback loop sounds simple enough. It actually requires a few things or a few prerequisites to happen. One of which is what we all, I hope, do or what we all probably embrace, which is continuous delivery, deployment, and integration. So this is actually how they map, they’re all slightly distinct and slightly different.
But what you’re actually doing with integration is you’re always building and testing your code. That helps shake out the tree like, “Hey, you know, Boto is having a dependency issue. Okay, well, I just did that, I figured that out from my integration build.” Delivery is what we’re getting ready to go out and do with our code, which is what customers will see. And deployment is when it actually hits customers. There’s one small difference between a delivery and a deployment: with delivery you probably want a very slight amount of manual work before release.
So Optimizely used to deploy several times per day to production, we were actually… we would have a 10:00 a.m. and a 2:00 p.m. We would go out multiple times per day, you know, smaller change sets looks good, we had all of our code ready to go in customers hands. If you got it in that morning, it would be live in customers’ hands that afternoon. There’s a few things that we found where we didn’t actually catch… we had tens of thousands of tests, we didn’t catch every little thing. And very basic elements of human intervention or human checking, could have caught this.
So I strongly suggest implementing a model like this because they cover a lot of bases for you. When you build out you get all of that automation, you get all the feedback, and it allows you to create feedback loops and measure all of those things. Whether it’s the health of your deployment, how easy and how confident you are that when you flip the switch or when you need to get code out that morning, things are gonna go off without a hitch. And it actually lets you measure each one of those components.
So don’t need to talk too much on that. There’s probably like five other tracks talking directly on continuous deployment and delivery. But what you actually want is a feedback loop, you wanna build out all of these components that provide that meaningful feedback, primarily because it just reduces surprises, but it also forces you to get that discipline of reducing snowflakes. I’m gonna talk a little bit about that.
And that means like more confidence in your deployment, you know that when you pull that trigger or you push that button, things are gonna work well. If you’re only doing that periodically, meh, maybe one container changed its image or its dependency or something, we don’t know. If you do that multiple times per day you’re pretty certain. Martin Fowler, you may have heard this name before, had this to say on snowflake environments, “Good for a ski resort but bad for a data center.” So avoid things that can’t easily be reproduced or source controlled. Some examples: for code, clearly GitHub, Bitbucket, Jenkinsfiles; for system configuration, use Dockerfiles, Terraform, Kubernetes; for immutable builds, Bazel, or Blaze as it’s referred to internally at Google.
And for people looking at this, this is actually just one of the dashboards we use in build out. These things help us drive a report on the state of it. So again, don’t focus too much on what each of these graphs are, it’s just an example of how you can build out what’s meaningful and an instrumentation of capturing all that data. Datadog has very advanced tools and mechanisms of doing that. This is one of the ways that we track how our developers are tracking code, and how long it actually takes for them to land, or get code merged into the mainline branch.
And then we track a measurement or a metric called DPI, Developer Pain Index. It’s simply a mathematical function of the total time it takes to land or run your full battery of tests, divided by the average success ratio. If that means that your build is only passing 70% of the time, but it takes you 60 minutes, your DPI might actually be closer to like 90, so roughly 90 minutes. And that’s just a representation of how long it takes for that feedback loop to happen.
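To make that arithmetic concrete, here is a rough sketch. It assumes the index is the full test time divided by the pass rate, since that is the reading that lands near the 60-minute, 70% example; the function name is hypothetical and the exact formula used internally may differ.

```python
def developer_pain_index(full_run_minutes: float, pass_rate: float) -> float:
    """Expected minutes to get a green build, assuming each failed run is retried.

    Assumption: DPI = run time / pass rate. A build that passes 70% of the
    time needs about 1 / 0.7 attempts on average before it goes green.
    """
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return full_run_minutes / pass_rate


dpi = developer_pain_index(60, 0.7)
print(round(dpi, 1))  # 85.7
```

A perfectly reliable build has a DPI equal to its run time, so the gap between the two numbers is the cost of flakiness.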
Feedback loops drive constant improvement
So these things drive constant improvement. If you don’t have some of this data, you don’t know what that low hanging fruit is, which is obviously like the best place to start. And then you can start moving into more advanced techniques, more optimized ways of improving things. We just moved from unittest to pytest, and started parallelizing everything on our Python unit tests, which number, again, in the tens of thousands, and it dropped the run from 30 minutes down to four minutes. So you can do a lot of those like low hanging fruit type of things there. These all have a direct correlation to how developers perform and how quickly they’re getting that feedback.
Feedback loop for quality? Code coverage, pretty straightforward. Feedback loop for testing your code? GitHub plus a Slack bot. So instead of building like one-off systems, consider something like this. I’m gonna open my pull request, GitHub exposes nice webhooks, it’s gonna, you know, send it to some endpoint you designate. I just opened my pull request, “I need that feedback.” “Hey, I’m starting it, it’s kicking off.” Oh, no, it failed. Okay, well, instead of me going to the GitHub pull request maybe I can actually just have Slack shoot me a message, and that optimizes the amount of time that I’m waiting.
So I can context switch, instead of saying like, “You know, in the back of my mind I’ve got 10 things. I gotta be checking all 10 of them on GitHub for the pull request status making sure they’re all finished.” Just close that feedback loop as quickly as you can, and that frees up developers to just work quicker, and jump into those things or mitigate those failing tests as quickly as possible. Also another important thing is you wanna track what’s going out, and how much is going out in your release in case you need to either communicate that to your marketing, success team, or internally you need to track down some hairy issue that seems problematic.
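Here is a minimal sketch of that GitHub-to-Slack hop. It assumes the incoming event looks like a GitHub commit status payload with `state`, `sha`, and `target_url` fields (check the webhook documentation for the event you subscribe to), and that you post to a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names and the example URL are hypothetical.

```python
import json
import urllib.request

# Emoji per build state; anything unexpected falls back to a question mark.
ICONS = {
    "success": ":white_check_mark:",
    "failure": ":x:",
    "error": ":x:",
    "pending": ":hourglass:",
}


def build_slack_message(event: dict) -> dict:
    """Turn a build-status event into a Slack incoming-webhook body."""
    state = event.get("state", "unknown")
    short_sha = event.get("sha", "")[:7]
    text = f"{ICONS.get(state, ':grey_question:')} Build {state} for {short_sha}"
    url = event.get("target_url")
    if url:
        text += f" <{url}|details>"  # Slack link syntax: <url|label>
    return {"text": text}


def post_to_slack(webhook_url: str, message: dict) -> None:
    """POST the message as JSON; webhook_url is your own Slack webhook."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        response.read()


example = {
    "state": "failure",
    "sha": "0123456789abcdef",
    "target_url": "https://ci.example.com/1",
}
print(build_slack_message(example)["text"])
```

Keeping the formatting separate from the network call means the message logic can be tested without touching Slack.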
Create a feedback loop for release notes, GitHub and a Jira bot. Like this, “Hey, this is what my ticket is that I’m linking in my template.” Great. Now, I just… we created a simple scraper that parses out those things out of commit messages. Checks Jira for the state and then creates a simple template that we can send out to everybody. So you always wanna prioritize process and frameworks first, not just the tests of the builds, as it kind of plays back into the snowflake system syndrome. You wanna have more focus on the commonalities versus the one-offs if you can avoid it.
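A scraper like that can be quite small. The sketch below is an assumption about how one might look, not the actual tool: it pulls Jira-style keys such as `WEB-101` out of commit messages with a regular expression and renders a plain-text summary. The ticket states are passed in as a dictionary so the example stays self-contained; in practice they would come from a Jira lookup.

```python
import re

# Jira-style issue key: uppercase project key, a dash, then a number.
TICKET_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def extract_ticket_keys(commit_messages):
    """Return ticket keys in first-seen order, without duplicates."""
    keys = []
    for message in commit_messages:
        for key in TICKET_KEY.findall(message):
            if key not in keys:
                keys.append(key)
    return keys


def render_release_notes(keys, statuses):
    """Build a plain-text summary; `statuses` maps key -> state (hypothetical)."""
    lines = ["Release notes:"]
    for key in keys:
        lines.append(f"- {key}: {statuses.get(key, 'unknown')}")
    return "\n".join(lines)


commits = [
    "WEB-101 fix cart total",
    "Refactor logger (OPS-7)",
    "WEB-101 follow-up",
]
keys = extract_ticket_keys(commits)
print(render_release_notes(keys, {"WEB-101": "Done"}))
```

A pattern this loose will also match strings like `UTF-8`, so a real scraper would probably restrict it to known project keys.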
So moves us into the experimentation side of things on how you can supercharge your development cycle. So there’s a famous quote from someone that you may recognize or at least you have a sense of who might have said this. “Doesn’t matter how beautiful your theory is, doesn’t matter how smart you are. If it doesn’t agree with your experiments, it’s wrong.” It sounds like someone that cares a lot about experiments. Richard Feynman. Might have heard of him from many earlier things. Theoretical physicist, story teller. Worked on a variety of atomic and nuclear fission theory, Nobel laureate. A perfect example of why and what you should do with experiments.
Let’s talk about some examples from Google. So it might be that… we’re not getting into the pixels just yet here, but two examples on the screen. One is both are searching for IBM, one is ibm.com, one is the Wikipedia page. So this is actually an experiment that Google ran. They said, “Do people really want to go to the IBM corporate page, or do they want some information about the history of IBM?” So they ran this, this is actually like just something that’s ongoing, kinda changes from time to time, that’s one example of an experiment.
You may have heard some examples of pixel perfection. I know we’ve all heard how Google optimizes down to the byte, they also do it down to the pixel. Two examples here: A and B, they almost look identical. One of them has a very slight difference, the plus sign has a couple pixels different. So this must have been the winner or this was the main difference in that. The reason they try this is because, for them, it’s about what gets users to engage more. That’s an experiment that they’re constantly running and tweaking on.
Ninety-eight hundred search experiments that they ran on their search application, search algorithm, and so forth, in 2016, there’s a lesson here. So Optimizely was actually spun out of Google or a variety of people came from Google. One of the things that they believed very strongly in was experiments. One of the things that you guys can all take away from this is at least engineering is expensive, development is expensive, trying to get approval for trying a change can be very expensive.
You know, product management needs to approve it, or we don’t have the resources to do this thing right now. All of these things started out as, “Hey, I just have an idea, I’ve got an experiment.” “Okay, go ahead and throw it on like our little A/B testing solution, you can try it out.” That’s what they’re excelling at, and that’s what a lot of the companies do so well on. They actually… there’s tens of thousands of Google employees and probably tens of thousands of Facebook and so forth, they all have great ideas but building out engineering changes is expensive.
If you all had your own idea about doing something, think about all the people you have to loop in. And then think about how much time and effort it takes to get it perfect. That’s a lot of time and effort. What if you just carefully veiled it as, “I’ve got an idea I wanna experiment”? What if you just wanted to get this out in front of people and say, “It may or may not work, but it’s an experiment.” I think you’ll actually find that you can get things moved through if you craft it as an experiment, that’s how a lot of things happen at Google and other places.
So the search releases are defined as like a release of the application. Within Google they did 1653 of those. And now, fast forward to tens of thousands of experiments later across the whole platform, this is what the IBM search page looks like today. So they actually have news articles, they’ve got the stock price, they’ve got the founding team, a lot of the Wikipedia information. This has all been done or honed with tens of thousands of experiments.
Let’s talk about Facebook. So there’s a few things that we’ll talk about with Facebook, how they do some of the components here. There isn’t just one version of Facebook, there’s probably 10,000 because there’s a lot of experiments running. So they plan it, innovation by the hour.
One quick example. So Facebook wanted to increase engagement on something important, an election notice. Like, “Hey, election day is coming up, we wanna at least, if not get people out to the polls,” they want to get people to click that they voted. So whether that directly translates into more people going out, or whether it just means like, “Hey, I happened to sign in and I felt compelled to show it.”
So there’s two of them: There’s one that’s just like click, “I voted,” the other is more of a social sharing component. It shows which users and which friends of yours or which first of your connections actually clicked the “I voted.” So you could kind of like share that as a badge, and if you clicked that it would show up to the other people on your News feed, and it would let them know that you also voted. So which one… and again, the primary thing they were working or wanting to track was user engagement.
What do you suppose won out for this? If you guessed the bottom one, you’re correct. This actually had a much higher meaningful turnout, but the way they tested it was they actually did a multivariate or a variation test with group A at the top, group B at the bottom. There could have been a couple more slight differences on how they did this. But the primary thing they wanted to track was, “How do we get people to click ‘I voted’?” Clearly, when you can put that into a control group and a variation group, it’s a lot easier to measure that out.
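To show what measuring a control group against a variation group can look like, here is a small sketch using a standard two-proportion z-test. This is a textbook method chosen for illustration, not necessarily what Facebook or any experimentation platform uses, and the click counts are made up.

```python
import math


def two_proportion_z(control_clicks, control_users, variation_clicks, variation_users):
    """z-score for the difference in click rate between two groups.

    Uses the pooled standard error. Not defined when nobody (or everybody)
    clicked in both groups, since the standard error is then zero.
    """
    p_control = control_clicks / control_users
    p_variation = variation_clicks / variation_users
    pooled = (control_clicks + variation_clicks) / (control_users + variation_users)
    standard_error = math.sqrt(
        pooled * (1 - pooled) * (1 / control_users + 1 / variation_users)
    )
    return (p_variation - p_control) / standard_error


# Made-up numbers: 100 of 1000 clicked in control, 150 of 1000 in the variation.
z = two_proportion_z(100, 1000, 150, 1000)
print(f"z = {z:.2f}")
```

As a rule of thumb, an absolute z-score above about 1.96 corresponds to significance at the 5% level for a two-sided test.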
Shifting left with experimentation
So experimentation is no longer just for marketing teams. We all as developers can own or build out riskier experimental processes wrapped behind experimentation. I’ll talk a little bit about that. So talked about shifting left with QA, you can also shift left with experimentation. So this is what a typical, on a product or feature process might look like. You design it, you build it, launch it, pray for the best, like, “I hope I got everything out in time. I hope all the bugs that I could’ve foreseen are actually there.”
Well, what if you could do something different, shift part of this left using experimentation? I’ll talk about some of the techniques that we’re gonna go into. With deployment flags you can actually do that. So you can actually de-risk certain components of this using feature flags and turn it into an experiment, and then that allows you to quickly iterate, refine. And again, if you recall at the very beginning, “Do customers actually care about this? Do we need to build this? Do we need to invest months on end?”
We can all probably think back to a couple releases or a couple projects that we worked on, where we spent months and months toiling on it, and the customer impact might have been less than we hoped for. May not have been negligible or minimal, but it was probably not as much as we’d hoped for, and our resources could have been better spent elsewhere. Experimentation is great to help answer some of those questions.
So feature flags are simple on-off commands. They gate experimental or risky changes behind a simple on-off way. So it allows you to think about things like if I have code path A, code path B, and I want my use… I think the new code path is the right way to get people to start using my application or increase engagement, or eliminate cart abandonment. Or the old legacy way that I know is battle-tested, and has been in production for a long time. You might wanna actually have both in your code, because what if you need to roll back? Well, you’ve got an expensive deployment cycle.
So if a problem is found at this point in time, you can actually do this using a feature flag without an expensive or risky feature redeployment. So it’s kind of how I think of feature flags like a musician dashboard kind of thing. You have all their knobs and you can turn things on or off, flip switches, it’s analogous to this. There’s a couple tools, there’s actually a lot of companies that are here today even. Go check them out. I think Split and LaunchDarkly are here.
There’s a number of open source alternatives as well. Here’s one of them called Bullet Train: they allow you to build out part of this into your deployment or DevOps lifecycle. There’s also another one from Dropbox: Stormcrow. So they wanna make changes and have them hit production fast. They open source or released a tool called Stormcrow. This is what it looks like, so if you have code path A or B, it’s as simple as saying, “I have a red button, or a blue button, those are my two variations.” You just simply instrument that into your code and then presumably, you might call some additional classes or do something else so that it’s not all wrapped in those nice if-else blocks. But it’s a very simple way to work or have control flow going in there.
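The red button versus blue button idea can be sketched in a few lines. This is a generic in-memory illustration with hypothetical names, not Stormcrow’s or any vendor’s API; a real system would load flag state from a service so it can change without a redeploy.

```python
class FeatureFlags:
    """Tiny in-memory flag store (illustrative only)."""

    def __init__(self, flags=None):
        self._flags = dict(flags or {})

    def is_enabled(self, name, default=False):
        # Unknown flags fall back to the default, so the legacy path stays safe.
        return self._flags.get(name, default)

    def set(self, name, enabled):
        self._flags[name] = bool(enabled)


def button_color(flags):
    """Code path B (new) when the flag is on, code path A (legacy) otherwise."""
    if flags.is_enabled("blue_button"):
        return "blue"
    return "red"


flags = FeatureFlags({"blue_button": False})
print(button_color(flags))  # red

flags.set("blue_button", True)  # flip the switch, no redeploy needed
print(button_color(flags))  # blue
```

Turning the flag back off is the rollback, which is what makes shipping both code paths worthwhile.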
Feature rollouts expand on the feature flag notion. So think of this, if I have a bunch of mobile devices, or I have a certain group of mobile users, I may want to target just one particular group. I may only wanna target iPhone, I may only wanna target Android. But I want them to receive this new experimental or risky feature that I should only have a small group of users seeing. Well, feature rollouts enable that. So every new feature gives you an opportunity to run an experiment.
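One common way to implement that kind of targeting, sketched here under my own assumptions rather than as any particular product’s behavior, is to filter on an attribute such as platform and then hash the user ID into a stable bucket from 0 to 99. The function and flag names are hypothetical.

```python
import hashlib


def bucket(user_id: str, flag_name: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag.

    Hashing the flag name with the user ID keeps each user's assignment
    consistent across requests, and independent between flags.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def in_rollout(user_id, platform, flag_name, platforms, percent):
    """True if the user is on a targeted platform and inside the rollout share."""
    if platform not in platforms:
        return False
    return bucket(user_id, flag_name) < percent


# Target 10% of iPhone users only; Android users never see the feature.
print(in_rollout("user-42", "android", "new_checkout", {"ios"}, 10))  # False
```

Because the bucket is deterministic, raising the percentage only adds users; nobody who already has the feature loses it.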
Another thing that’s pretty neat here that you can think about, actually Istio, the workshop yesterday, was potentially interesting for this reason as well. Traffic Splitting and Canary Builds. So there’s a great argument to be made that no matter how much time and effort you invest in your QA, and no matter how much time and effort you invest in your unit tests, end-to-end tests, integration, and so forth, there’s probably gonna be something that slips through the cracks. We don’t know what it is, we don’t know how serious it is. What if it’s actually like an engineering emergency? What if it’s a small nuisance? We don’t know what that is.
Traffic splitting gives you that option because what it says is, “I wanna go out with a small group of these users receiving my change, but I want you to leave everybody else unimpacted.” So this was rudimentary type of things that Facebook implemented in 2012, they had all employees go to www.latest.facebook.com, everyone else that was a non-Facebook employee went to www.facebook.com. By the way, I think you can still go to www.latest.facebook.com, if you wanna check out the newest Facebook, you know, beta and so on.
So that was in 2012, kind of basic but if you perhaps were a user at that time, or if you were an employee, it’d be a very easy way for you to like test out your own code before your customers did. Now, it’s much more advanced. So with traffic splitting, they’re doing a form of experimentation with traffic splitting, where at the very beginning of a release, employees are still the only ones that get it. This spans a course of several hours. If no issues are found, slowly ramp that up to 1% or 2% of users.
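The ramp described here can be sketched as a small schedule: employees always get the new version, and the share of everyone else only advances when no issues were found. The stage values beyond 1% and 2% are illustrative assumptions, and `user_bucket` stands for any stable per-user number from 0 to 99.

```python
# Percent of non-employee users on the new version at each stage.
# Only the employee-first step and the 1-2% step come from the talk;
# the later stages are illustrative.
RAMP_STAGES = [0, 1, 2, 10, 50, 100]


def gets_new_version(is_employee: bool, user_bucket: int, percent: int) -> bool:
    """Employees always get the release; others only if inside the share."""
    if is_employee:
        return True
    return user_bucket < percent


def next_percent(current: int, issues_found: bool) -> int:
    """Hold the current stage if issues were found, otherwise advance one stage."""
    if issues_found:
        return current
    later = [p for p in RAMP_STAGES if p > current]
    return later[0] if later else current


percent = 0                                         # employees only
percent = next_percent(percent, issues_found=False)
print(percent)  # 1
```

Holding rather than rolling back on the first issue matches the idea of keeping the release at 1% while you dig in.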
What this gives you is, again, a way of triggering or shaking the tree. I may not have caught this in my extensive development or QA cycles. This looks like every possible edge case that I could think of works fine. But when you have enough sizable users and you’re not just simply limiting it to like one group or the other, but it’s actually like a random subset or random sampling, you can actually start to shake out those trees. So long as your version tagging like… and all the metrics that you’re logging contain a certain version or it says it designates new versus old version, you can actually track that very easily.
You can say, “Hey, this bug that is now manifesting in the brand-new version of our code, that 1% of people are on, didn’t exist before.” “All right. Well, before we go up to the rest of the 99% of the users, let’s just keep it right now at the 1%. We can go take a little bit of time.” Right now, even if it was catastrophic like “Oh, you know, my database is down,” it’s not an engineering emergency. It might be if, you know, the service is unavailable for 1% of the users. But clearly that’s far preferable to having it down for everybody.
So you can spend a little bit of time digging and you’ve got those random sets of users. And if you need to, you can actually turn off the flag or simply roll back and you haven’t impacted everybody for that very small nuance of an issue that you may not have caught, or was just prohibitively difficult to catch like a race condition or a scale condition or something like that.
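If your metrics carry a version tag, the comparison between old and new can be automated. This is a minimal sketch with hypothetical names and made-up numbers: it computes an error rate per version and holds the rollout when the new version is worse than the old one by more than a tolerance.

```python
from collections import Counter


def error_rates(events):
    """Error rate per version from (version, is_error) records."""
    totals, errors = Counter(), Counter()
    for version, is_error in events:
        totals[version] += 1
        if is_error:
            errors[version] += 1
    return {version: errors[version] / totals[version] for version in totals}


def should_hold_rollout(rates, new="new", old="old", tolerance=0.01):
    """Hold if the new version's error rate exceeds the old one's by more than tolerance."""
    return rates.get(new, 0.0) > rates.get(old, 0.0) + tolerance


# Made-up traffic: 1000 old-version requests with 10 errors,
# 100 new-version requests with 5 errors.
events = (
    [("old", False)] * 990 + [("old", True)] * 10
    + [("new", False)] * 95 + [("new", True)] * 5
)
rates = error_rates(events)
print(should_hold_rollout(rates))  # True
```

With only 1% of traffic on the new version the sample is small, so in practice you would also want a minimum request count before trusting the comparison.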
So let’s talk about some internal experimentation ideas. Vendor Bake-off. Is one tool better than the other? This is perfect for using or trying to run experiments. As an example, if anyone here is familiar with Sauce Labs or BrowserStack, they have some similar types of products where you can actually test your browsers based on whether they’re running on Internet Explorer, Safari, you can do it on mobile devices and so forth. Contract time, let’s run the two of them and see if everything works fine.
Actually, we’ll skip through this. So if this works fine, things look good, we check out the results, everything looks good. We now have a very clear way of determining who the winner was. If you wanna de-risk your critical new feature or change, you wanna run this through a similar system. Hypothesis: our aggregation of logs takes too long. Well, let’s start into an experiment. We now wanna run this through a counting service. We have a few different ways of tracking this.
So quick quote, “If we put something on production that doesn’t seem to be working we wanna get rid of it quickly.” So shift left, both with QA and experimentation, a solid foundation of tests is gonna help you out later on. You also wanna embrace that experimentation, helps you get things out quicker. Don’t let feature development take priority, you wanna build those feedback loops, you wanna focus on the shared tools, shared libraries, without focusing on one-offs, don’t focus on snowflakes.
And then make experimentation a foundation for your release process. Build trust in your data, run A/A tests so you understand it, build out and make sure you understand the statistics on this. One of my friends here had a good parting thought, he said, “I don’t always test my code but when I do I do it in production.” Why not? Seems like a great idea. And I actually talked with Paul, who you saw at the keynote today. We actually had the same thing. I said, “I’m gonna throw this out, mine is the variation of 10,000.” He said, “All right, go find if you can find who attributed that to Edison,” so I’m gonna check. Fail fast and carry on. Thank you.