Airbnb's journey to a service-oriented architecture at 1,000-engineer scale
Published: July 12, 2018
Willie: Hi, everyone. Hello, thank you for joining us. We hope everyone’s having a great day at DASH today. We want to thank Datadog for organizing this great conference, and also thank you all for joining us this afternoon.
Tiffany: My name is Tiffany. And as introduced, I lead the Shared Services team at Airbnb. Our team is responsible for the common components that help developers build and operate services at Airbnb. And we’re gonna focus a lot on the common services to enable our move to SOA. So I’m really excited to be here today to share our experiences.
Willie: And my name is Willie. I’m the engineering manager for the Observability team at Airbnb. My team’s mission is to make sure that our engineers have the monitoring and introspection tools to effectively develop and operate their services.
Our talk is divided roughly into three sections. I’ll cover largely the high levels, setting up the context leading up to our SOA effort, and closing our talk by generalizing our learnings based on our observations on the industry.
Tiffany: I’ll be walking us through the two SOA attempts that Airbnb made, and sharing a lot about both the successes and the learnings that we had from the experience. Now, take it away, Willie.
A need for SOA
Willie: Thank you. Okay. SOA stands for service-oriented architecture. It’s a concept that’s been around for at least 25 years. How many of you have been a part of an effort to break apart a monolith into a service-oriented architecture? Good, many of you. Just to make sure that we’re on the same page, we’re going to define SOA quite broadly for this talk as a software development model in which you divide up your stack into any number of distributed services rather than a single monolithic application.
When preparing for this talk, we noticed that most existing talks and literature on the topic tend to focus on how to do SOA the right way. At Airbnb, we discovered that it was actually rather difficult for us to apply those learnings because most of our challenges ended up being organizational in nature. So we took inspiration from this observation and decided to make organizational challenges the theme of this talk. We will share the motivation, the challenges, and some of the strategies that we used to overcome those challenges at 1,000-engineer scale, including a look at our first, failed attempt at SOA.
Starting with a monolith
Founded in 2008, Airbnb is a global travel community that offers magical end-to-end trips, including the place that you stay, the people that you meet, and the things that you do. This is the earliest screenshot that I could find of our website from 2008. We’ve come a long way since this simple web page. The next year, we added key features like search, maps, and the listing details page. And all these remain core to our product today.
In 2014, we underwent a major rebrand to redefine how we saw ourselves. For engineering though, this was a major project involving over half of our team redesigning every aspect of the website. It had to be executed without any asset leaks and launch seamlessly with the click of a button.
In 2016, we launched our most ambitious project yet: Airbnb Experiences. It features unique activities designed and led by inspiring local hosts. We built it to create an opportunity for anyone to share their hobbies, skills, or experiences, not just people with places to share.
More recently, we built Airbnb Plus and Airbnb Collections. Airbnb Plus features beautiful homes from exceptional super hosts around the world, and Airbnb Collections help business, family, and really any traveler find a special home just for them.
During my five years at Airbnb, there was a major product launch once every five months, which is a lot to ask of the relatively young engineering team. Such a high velocity of product development presented an incredible challenge for our engineering team.
Like most startups seeking product-market fit, Airbnb started off as a monolith. Unlike an SOA, a monolithic code base is a single-tiered application that serves all of your needs. For early Airbnb, this included functionality like search, payment systems, and fraud prevention, all in the same code base.
There are actually many benefits to a monolith: because they’re simpler, they’re easier to build, develop, monitor, deploy, you name it. Building new features generally requires lower overhead, and it’s easier to onboard new hires. We chose Ruby on Rails as our original framework, and so we affectionately called the monolith Monorail. And with this choice, our founder Nate was able to iterate quickly to achieve early traction and get Airbnb off the ground.
But as many of you already know, monoliths don’t scale without significant investment. When you have a lot of engineers working on a single code base, a lot of contention starts to arise. One problem that we ran into early was that we had so many engineers committing to the same code base, and there were so many comments on our PRs, that we were simply overwhelming our code repository. Another early issue that we ran into was that things like unit testing, builds, and integration testing started to take longer and longer to complete. These build steps tend to scale proportionally to the number of engineers, and so you have to trade off productivity and cost.
There are much harder scaling challenges to a monolith, though, than an unstable repository and slowing build times. As Airbnb’s engineering team grew past 100 engineers, we noticed that we were starting to experience increasingly frequent backlogs in our Monorail deploy chain. On a busy freeway, traffic accidents can quickly cause the entire freeway to suddenly back up, and then it takes time for the emergency responders to arrive. In the meantime, you’re stuck in your car getting nowhere. A bad change in a monolith is just like an accident on a busy freeway. Backups occur very easily because too many engineers are waiting to ship their changes at the same time. Excuse me.
Then there’s the issue of code ownership. When an entire company works on one code base, it can be difficult to clearly delineate who’s ultimately responsible for a given piece of code. Some code goes entirely unowned, while, as I’m sure you’ve all seen, other code has too many engineers developing on it at the same time. And unclear ownership can lead to much bigger problems: most significantly, important engineering concerns like performance, code quality, and scalability become difficult to have accountability for. So for us, these were the most significant challenges with our Monorail: deploy contention and unclear code ownership. And by 2012, these limitations were beginning to hinder our productivity in noticeable ways.
Initial service development
As it so happens, 2012 was also the year when we first started to build services outside of Monorail. Our very first service was a search engine that helped us respond more quickly to user queries based on date and destination. Next, we built… Sorry, my notes are not lining up. Around the same time, we started building a pricing service so we could run machine learning models to perform smart pricing based on the location of a listing and seasonality concerns. In late 2012, we started building a fraud prediction service providing early supervised learning to help us catch bad actors on the website.
What do these three services have in common? There are many correct answers. But the one I was looking for is that these services tend to be CPU bound with low latency requirements, which makes them ill-suited for the Ruby programming language. What we then observed was that the few teams who had their own services were able to deploy on their own timeline whenever they chose. Monorail deploy contention simply didn’t apply to them. And if they did make a mistake, it was also clearer who should be responsible for owning that mistake. This gave service developers a strong sense of ownership and control over their own services.
And best of all, they did not experience the traffic jam that Monorail developers had. And for the rest of us, this is what we saw, and we all wanted to be that guy. And thus began our journey into a service-oriented architecture. And I’ll now hand it over to Tiffany to tell you more about it.
A first try at SOA
Tiffany: Thanks, Willie. So, SOA attempt number one, our goal: power the first endpoint in Airbnb entirely through services. So we started building out services. And what we realized was that we were missing the frameworks and tooling to really support service development. So one of the first things we started investing in instead was a robust HTTP client. What we observed is that as more function calls were translated into, you know, service calls, once a service kind of went bad, this could easily cascade upstream and affect all the other clients. And so by rolling out this robust HTTP client, we were able to introduce circuit-breaking behavior and improve the resiliency of our infrastructure.
What we also noticed, though, was that by standardizing around a single client, we not only solved this problem for existing service calls but also for any new service and any new service-client pair that came online. And this gave us a taste of the high leverage that frameworks can have once you have rolled them out through your infrastructure.
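To make the circuit-breaking idea concrete, here is a minimal Python sketch. It is not Airbnb's HTTP client: the class name, the failure threshold, the reset timeout, and the injectable clock are all assumptions chosen for illustration. The point is only that after repeated failures the client fails fast instead of continuing to call a service that has gone bad.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical breaker; names and defaults are illustrative only."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self.clock = clock                  # injectable so tests can control time
        self.failures = 0                   # consecutive failures seen so far
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of piling more load on a sick service.
                raise CircuitOpenError("downstream marked unhealthy")
            # Timeout elapsed: fall through and let this call act as a trial.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_listing, listing_id)`, where `fetch_listing` is a hypothetical function that performs the HTTP call. Because the logic lives in the shared client, every new service gets the same protection without writing it again.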
Based on that success, we invested in an automated service discovery framework. It’s been open-sourced, and you may know it as SmartStack. And what this was trying to solve was that as we moved from managing a single monolithic app and a single pool of resources to managing multiple services and multiple resources, you know, we had to kind of grow them out over time and replace instances, and we wanted a very low-touch, but again, resilient way of managing this. And so SmartStack provided a transparent way of doing service registration, deregistration, health checking of services, and load balancing of client traffic.
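The four responsibilities listed here (registration, deregistration, health checking, and load balancing) can be sketched with a toy in-memory registry. This is not SmartStack's code or API; the class, its method names, and the round-robin policy are assumptions made purely to show how the pieces relate.

```python
class ServiceRegistry:
    """Toy in-memory registry; a real system would be distributed."""

    def __init__(self):
        self._instances = {}  # service name -> {address: is_healthy}
        self._counters = {}   # service name -> number of picks so far

    def register(self, service, address):
        # New instances are assumed healthy until a check says otherwise.
        self._instances.setdefault(service, {})[address] = True

    def deregister(self, service, address):
        self._instances.get(service, {}).pop(address, None)

    def report_health(self, service, address, healthy):
        # Called by a (hypothetical) health checker after probing an instance.
        if address in self._instances.get(service, {}):
            self._instances[service][address] = healthy

    def healthy_instances(self, service):
        backends = self._instances.get(service, {})
        return sorted(addr for addr, ok in backends.items() if ok)

    def pick(self, service):
        """Round-robin over the currently healthy instances."""
        backends = self.healthy_instances(service)
        if not backends:
            raise LookupError(f"no healthy instances for {service!r}")
        n = self._counters.get(service, 0)
        self._counters[service] = n + 1
        return backends[n % len(backends)]
```

Clients only ever call `pick("search")`; instances coming and going, or failing their health checks, change the answer without any client-side configuration change. That is the "low-touch" property described above.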
With all this investment, you know, it seemed like things were going well; we were starting to build more and more services. But unfortunately, we had underestimated the challenge of migrating that endpoint to services. The SOA attempt was behind schedule and missed several deadlines. And a year into the effort, we called it off. As it happens, you know, with sufficient time and sufficient resources, all of the problems I shared could have been solved. But really we hadn’t approached the project the right way. We’d gone for a bottom-up approach when we should have got everybody on board in the first place. As a result, we hadn’t made a strong case for why Airbnb had to make this transition to services at this point in time. And we kind of remained in the monolith.
What about the contention that Willie shared? Well, we found that just by investing in deploy and build pipelines, we were able to kind of overcome that productivity hurdle. We shared a lot of what we did in a talk called “Democratic Deploys at Airbnb.” But essentially, what we’d learned was that it was more convenient to work in a monolith than it was to work in services. Now, in fact, the two investments we made actually meant that even though there wasn’t an effort to move to an SOA, developers were still building services. And what they realized from building and operating the services was that, you know, there were quite a lot of costs involved. Costs that just didn’t exist in the monolith, which was already there and where you could simply make changes. And so we’d remain in a monolith for, you know, a while. But spoiler alert, we had two attempts to move to an SOA. So what happened? What changed between 2012 and 2016?
A large part of that is that we grew: year over year, you know, our engineering team doubled in size. And what we found with this growing team size was that the problems of the monolith had become magnified. We were now spending 15 hours every week stuck in that traffic jam. And with many more engineers on the team, more people were frustrated by this experience. 5% of all changes to the monolith took upwards of an hour, close to two hours, to deploy to production. And worst of all, in spite of all our efforts invested into further tooling and improvements to the build and deploy systems, we kind of plateaued at 200 commits merged per day. And for an organization rapidly approaching 1000 engineers, that was just not acceptable.
Another problem that increased over time was the problem of unclear ownership in the monolith. And now for a single endpoint, it wasn’t really clear how many teams should be involved and how we get that change rolling. And so that came to be a serious problem for our company. There were many efforts that took a look at this and tried to refactor this for clear ownership boundaries, but really, we were concerned that over time we would just regress. What was stopping us? How could we find a trusted framework to ensure that changes in a monolith had clear ownership forever and ever? Ultimately, we weren’t able to land on one, and such refactoring efforts failed.
And so, now we had a serious problem, we could no longer rely on tooling to improve deploys. And there wasn’t a clear solution for this mixed unclear ownership in the monolith. That’s how we decided it was time to move to an SOA again.
SOA at Airbnb, second attempt
Now, it really did feel like we were attempting to scale an impossible mountain. How do you convince an entire company to change the way they do product development, to move from a monolith into services? And if you think about it, we’d failed once before. And in fact, over the years, the problem had gotten bigger: there was much more code in the monolith, there were many more engineers involved, many more teams and products being affected. And so, we kind of had to look back, really examine why the first attempt failed, and incorporate three strategies in our second attempt to make sure that, you know, it went better this time.
The first of which was to develop a template for success. Now, the first SOA attempt wasn’t really clear about what it was trying to achieve, and didn’t have a clear set of deliverables. And ultimately, it wasn’t able to prove that by moving to services, product development would speed up, or that it would solve the problem of code ownership. And so for us, as we started the second SOA attempt, what we realized was we couldn’t just pick any endpoint in the product flow; we needed an endpoint that was both critical to the business and of reasonable complexity, one that would really prove that SOA could work for any flow in the site. And so, we decided to pick the listing detail page.
And what is this page? This is the page you see when you click into any search result on Airbnb. And for many guests searching for a night to stay, you go through multiple of these listings before you land on that perfect listing. And so you can imagine it’s both critical for our business and under frequent product development.
So a quick poll for the audience. How many teams would you have to work with to commit a change to this page? Any takers? Right. One? Can I have another one? That’s too low. Do you have a number in mind?
Tiffany: 8, 16 at last count. So you can imagine making a change to this page was a quick way to fill up your calendar with lots of meetings.
To get a sense of the complexity, imagine that on this product page, each box is a component, and each component is owned by multiple teams. So you can see that, you know, it’s really difficult to get changes in here, and when things went wrong, you had to kind of figure out “what went wrong here?” And if we could prove that this page could be powered entirely by services, this would definitely be a model for any other page moving to an SOA. And so with a target in mind, we assembled a team of engineers from product and infrastructure to tackle this problem together.
So first of all, what was in the Monorail? Let’s imagine this is Monorail and it’s powering the listing page for both web and native. And what we realized was Monorail was doing multiple things. So first of all, it served as our API layer. It was also rendering like a lot of web traffic. And finally, there was this big block called business logic, whatever you used to power this page. And so, the product teams took a look at it and they were able to kind of clarify which pieces belong to which product and which team. They really did a good job here.
What they also quickly realized was across these different product pieces, there were a lot of shared components. So for example, common data models like the user, you know, the listing, there were also a lot of horizontal, like, concerns like authentication and translations, and all of those needed to be accessible, you know, out of the monolith and services. So the next step for product and infrastructure engineers was to move all this logic out into services. And by doing so, we basically enabled the next step, which is to move the product logic into services. But before we did that, we wanted to make sure that by doing so we could still maintain a consistent API as we moved out of the monolith into services. So we invested also in an API framework.
And at this point, what was the monolith doing? It was just our API layer, but it was also rendering web. And so again, the teams worked together, we refactored the web rendering controllers to use the same API layer, so mobile and web behave more alike. And then finally with that refactor, we were able to separate out a clean common API tier, as well as, you know, server-side rendering for our web.
So as you can tell, this tight collaboration between product and infrastructure was essential to really streamline the migration and identify the critical components needed to move to SOA. As a result, we’re powering the first endpoints entirely through this infrastructure today. And we’ve even written a guide for other product flows to follow our example.
What’s next? So we have a template, you know, we proved it worked for any other product flow. We weren’t done, we had to go from one product flow to all flows in SOA. And to do so our second strategy was to launch SOA as a product.
So as you all know and as we’ve shared, Airbnb launches products really frequently. We’re able to do so because we mobilize this entire company around, you know, the product launch. And really we need that launch date. But this would be a risk for SOA migration because resources could be pulled off to prioritize the next product launch. So we decided that SOA should also be a launch, you know, like everything else. And we realized that the first thing we needed to do was to get alignment, you know, based on our learnings from the first SOA attempt. SOA isn’t just an engineering specification. It really meant for us, like, a new way of doing product development. And so, we needed to get all the product stakeholders on board. This would include product engineering, design, data science. We all needed to understand why we had to move to SOA, you know, and how it would help the business. And so we had to go on road shows.
So starting from the first flow in the homes product, we, you know, worked with the homes product teams, and got commitment to move all of homes product flows to SOA. And then talking to all the enabling organizations across Airbnb, you know, the customer support products, trust and payments who were also committed to moving to services. And then, establishing headcount and resourcing across all the remaining like products at Airbnb.
So, this is great. We’ve got everybody committing, you know, engineers to this migration. We also needed a launch date. And this was important because even though everyone is committed to the importance of SOA, it kind of felt like it was far off like you could wait on it, right? Like, you could wait one more launch, like it could still stay in a monolith. So by setting and committing to a launch date, we’d kind of concretized it. It was something real and it was happening and you had to do it now. And so, our commitment was to freeze feature development in the monolith in 2019.
So this is great. It sounds like, you know, we’ve really got this going, we’ve got this momentum, everyone’s invested, we’re on board. The third strategy, we observed that we wanted to make sure that the SOA migration went smoothly. And especially given that we’ve been used to monolith development, we had to onboard the entire engineering org to a new way of doing things. And so, we decided it was really important to invest in tools, frameworks, and standards for service development. If you think back to what I shared from the first SOA attempt, even though it didn’t work out, we had really critical infrastructure investments like the robust client, as well as automated service discovery that helped us to accelerate the move to SOA even though we didn’t have a dedicated effort.
And so, we also wanted to compete against the perception that working in the monolith was more productive. What was this overhead that people were observing? So two numbers. One, it would take someone many weeks to launch their service in production. And secondly, once you’ve launched a service, you find yourself spending, like almost 24% of your changes to that service would be kind of boilerplate. And this wasn’t something that would kind of exist in monolith development. So to solve the first problem, to speed up the time to create a new service, we invested in a service configuration framework. What were we spending all that time doing?
So I know you can’t see this, but this is like a flow chart of you wanting to make a service, and you getting a service after two or three weeks. And basically, the process was you would commit code in one repository, set up your alerts in another, and then configure your service in another, through different deploy processes. At some point, you need to talk to security for a cert and then talk to SRE for an IAM role. What’s an IAM role?
And so, for many product teams, this just felt like a lot of overhead just to get to the point where they’re writing their first line of logic in the service. And our vision there was that everything about a service should be managed in one place, and there just needs to be one process to deploy it. And so, with the service configuration framework, we simplified and streamlined this process. And so you know, all the configuration for your service lives with the code for your service. And the way you deploy those changes to production is the same way you deploy any code change to production. I wouldn’t be surprised if we have more talks about this, but this is all I can share for now.
Second, we realized that, you know, we could solve both the problem of needing to do so much boilerplate in services, as well as ensuring that the quality of services coming online was more consistent and met a very high quality bar. And so, we invested in a service framework.
Now stepping aside a bit, what is service boilerplate? So let’s say I have a service and I’ve set it up, you know, within a day with the new service configuration framework. And I’ve written a piece of logic, I’m so excited to get this to production. Well, it’s a service, so I need to add an endpoint for it. You know, and with an endpoint, I also need to set up the clients, one for each of the standard service stacks. And this kind of, like, means I’m done, right? Well, no, because it runs in production. So you need to add metrics, you need to add error handling, data validation, all this good stuff. And then that’s not it. Don’t forget, like, you know, you have to have dashboards, alerts, you have to maybe write a runbook, so whoever gets paged about this service knows what to do with it.
And we found that we were doing this over and over for every single service. And if you changed any one part of your endpoint, you had to change your client: a lot of cascading changes, very repetitive. What if instead, you could just focus on defining the interfaces for your service in a strongly typed interface description language, define your request and response models, and have the service framework take care of everything else for you? We share a lot more about this in a blog post called “Building Services at Airbnb.” We link to all these things at the end of our slides.
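As a rough illustration of "define the interface and let the framework do the rest," the Python sketch below uses dataclasses in place of a real interface description language. It is not Airbnb's service framework: the `endpoint` decorator, the `METRICS` counter, and the `GetListing` models are all hypothetical names invented for this example. The handler contains only business logic, while the wrapper adds the validation, metrics, and error counting that would otherwise be repeated in every service.

```python
from collections import Counter
from dataclasses import dataclass

METRICS = Counter()  # stand-in for a real metrics client (assumption)


@dataclass(frozen=True)
class GetListingRequest:
    listing_id: int


@dataclass(frozen=True)
class GetListingResponse:
    listing_id: int
    title: str


def endpoint(request_type, response_type):
    """Wrap a handler with type checks, call metrics, and error counting."""
    def decorate(handler):
        name = handler.__name__

        def wrapped(request):
            if not isinstance(request, request_type):
                METRICS[f"{name}.bad_request"] += 1
                raise TypeError(f"{name} expects {request_type.__name__}")
            METRICS[f"{name}.calls"] += 1
            try:
                response = handler(request)
                if not isinstance(response, response_type):
                    raise TypeError(f"{name} must return {response_type.__name__}")
            except Exception:
                METRICS[f"{name}.errors"] += 1
                raise
            return response

        return wrapped
    return decorate


@endpoint(GetListingRequest, GetListingResponse)
def get_listing(request):
    # Only business logic lives here; the title is placeholder data.
    return GetListingResponse(listing_id=request.listing_id, title="Example listing")
```

Calling `get_listing(GetListingRequest(listing_id=7))` returns a typed response and increments the call counter, while passing a plain dictionary is rejected before the handler runs. In a real framework the same interface definition could also drive generated clients, dashboards, and alerts, which this sketch does not attempt.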
So now with all of these things, you know, I’d like to share again, like, the three strategies we did. One, develop a template for success. Two, launch SOA as a product. And three, invest in tools, frameworks, and standards for service development. I’d like to hand it over to Willie now to talk a bit about where we are in the move to SOA.
Where are we now?
Willie: Thank you, Tiffany. So how far has Airbnb gone in our journey to a service-oriented architecture? So this is just a quick slide to put the strategies and milestones that Tiffany outlined into a bit more of a timeline. Late 2016 was when it started becoming clear that a significant refactor was necessary for our code base, because we were exhausting incremental gains in tooling investments. During this time, we as an organization were starting to really internalize many of the challenges that we faced in our first SOA attempt. We also used this time to realign on the conclusion that SOA remained the best approach forward in order to prevent future refactors of this scale.
By 2017, we got the dedicated staffing and buy-in that we needed. And on top of that, some of our early framework investments started to pay off. Earlier this year, we successfully migrated our most complex web page out of Monorail. And since then, we’ve also migrated our search page out as well.
In terms of results, we’ve seen some good progress. As you may recall, our main motivation for going into service-oriented architecture was developer productivity and code ownership. So this is a screenshot I took directly from an internal presentation. Based on this study, shipping a change in Monorail took on average 15 minutes for this particular team, sorry 115 minutes for this particular team. So where does that 115 minutes go? Well, most of it was actually the amortized time across when a bad change does go out, and figuring out who that change kind of belongs to. And when that same logic was moved into a service, we were able to drop that time to a mere four minutes.
For those of you who were able to make it to Liz and Christina’s Google talk on SLOs, you may be familiar with this. So for ownership, we began setting clear service level objectives, or SLOs, for error rates and latency across service endpoints. What you are looking at is a screenshot of our internal SLO dashboard. These efforts have only been possible because of the clear service interfaces that have been defined by our SOA.
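For readers unfamiliar with SLOs, the following Python sketch shows what checking one endpoint against an error-rate objective and a latency objective might look like. The thresholds, field names, and the choice of a 99th-percentile latency target are assumptions for illustration; the talk does not say how Airbnb's internal dashboard computes its numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    max_error_rate: float  # e.g. 0.01 means at most 1% of requests may fail
    p99_latency_ms: float  # 99% of requests should finish within this time


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[rank - 1]


def evaluate(slo, latencies_ms, error_count):
    """Compare one window of requests (one latency per request) to an SLO."""
    if not latencies_ms:
        raise ValueError("need at least one request to evaluate an SLO")
    error_rate = error_count / len(latencies_ms)
    p99 = percentile(latencies_ms, 99)
    return {
        "error_rate": error_rate,
        "p99_latency_ms": p99,
        "error_rate_ok": error_rate <= slo.max_error_rate,
        "latency_ok": p99 <= slo.p99_latency_ms,
    }
```

A check like this only makes sense per endpoint, with a team accountable for the result, which is why clear service interfaces are a precondition for it.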
Finally, we’ve done a better job on code quality as well. When our product details page was in Monorail, on average, three out of every 10 bugs filed against that page would go out of SLA. Since migrating to our service, we’ve been able to maintain 100% of those bugs being fixed within SLA.
A service-oriented architecture is not without its flaws and costs. Distributed systems are more complex. So by nature, they’re harder to monitor and harder to coordinate. But in terms of the objectives that we outlined for SOA, the project has been really successful.
At the right time
To close this talk, we want to reflect a little bit on sort of what changed between our first attempt and our second attempt. And to do that, I’ll start with a familiar concept. Show of hands, how many of you have technical debt in your code? Every hand should be up right now. Because if you don’t have technical debt, you’re probably not prioritizing enough.
Our Monorail scaling challenges essentially came down to this tradeoff between infrastructure investments for tomorrow and product development today. And our monolith represented the technical debt that we had taken on over the years pursuing ever faster product development.
This is normal. Product-market fit is critical, and so, many startups focus on the product and band-aid their infrastructure together. This is true for even tech giants like Amazon. This is an excerpt from the book “The Everything Store.” Amazon’s original system was built under battle conditions and their CEO refused to lift the foot off the pedal. He wanted to put all the engineers on building new products rather than rewriting the same code. This is a familiar story of engineers wanting to slow down to invest in infrastructure while business needs continue to push for everyone to develop ever more ambitious product launches. And it’s no different from the story of our first SOA attempt.
At some point, though, you hit an infrastructure inflection point, and that equation starts to flip. LinkedIn hit this inflection point in the year following their IPO. This quote is from an article that we found during our research; it’s called “The Code Freeze that Saved LinkedIn.” While we have no particular insight into LinkedIn, this is likely a fairly familiar story. After years of optimizing for feature development, LinkedIn’s code base had incurred enough technical debt that an overhaul of the architecture was necessary.
Each new feature and each additional hire adds just a little bit of complexity to your system and your entire organization. At some point, you can no longer push back a necessary refactor or infrastructure investment because the long-term success of your business compels you to act now. Precisely when you hit this inflection point, though, can be a matter of great debate. And it was something we debated for basically five years at Airbnb. We believe that the timing of this inflection point relative to a company’s evolution has a lot to do with the business environment that you compete in.
For Google, it was very early. To increase search quality, they had to index more data using better algorithms than their competitors. And to do that, they needed to run better algorithms. To do that, you’re largely bounded by scale and cost. So Google had to invest in a scalable and efficient infrastructure earlier than most tech companies.
For Facebook, it came a little bit later, when they expanded beyond colleges. Other social networks of that era had struggled because they didn’t scale to reach the critical network effects that were necessary for this product.
And for both of these advertising giants, uptime is critical. Downtime and bugs mean poor engagement, and engagement is the key to an advertising business.
Airbnb is not an ad-driven business, we’re a community-driven business with unique properties dictated largely by consumers’ travel patterns. In practice, this means our infrastructure handles a comparatively low throughput. But each of those interactions that you have with Airbnb, generally has a higher transaction value. So as a result, developing new features in product is really core to our competitiveness. What we found is that the real push for infrastructure for us is developer productivity.
So if I could leave you with one takeaway, it would probably be this slide. There’s a tendency for infrastructure investments to devolve to the lowest common denominator that the business model demands to stay competitive. And all the learnings that we’ve presented today are essentially a corollary of this thesis. In 2012, our SOA attempt failed because we as an organization were not able to translate the investment into business needs. Part of it was that we hadn’t yet learned this lesson. The other part was that the business imperative was, frankly, a little bit weaker in 2012.
So we would like to leave you with this. Take a moment and think about the struggling engineering efforts, projects, or even entire organizations that you’re working with. What are the macro environments? Does your business model really demand that investment to be competitive? And if you truly believe that it does, don’t make technology the focus of your presentation. Focus on the business requirements and risk factors. Take it upon yourself to help your leaders understand why this is the right investment for your company. Thank you very much.