Strengthening the Foundation: Airbnb's Platform Transformation | Datadog


Published: July 17, 2019

Thank you Jason. Hello, everyone.

I hope all of you are having a great day so far at DASH today, and I also want to thank Datadog for hosting and organizing such an amazing event.

The origins and growth of Airbnb

When most people think of Airbnb, we think of home sharing.

Indeed, that was an amazing idea 12 years ago from our co-founders, Brian and Joe, who pumped up airbeds in their home and hosted the very first three Airbnb guests who were attending a design conference in San Francisco but were not able to find any hotels available.

The moment Brian and Joe opened their home and welcomed our guests, the Airbnb community was born.

That was 12 years ago.

And today when we look back, the innovation and achievement are phenomenal.

In March 2019, which is just a few months ago, we celebrated 500 million guest arrivals.

And on any given night, there is an average of 2 million guests staying at Airbnb, with the choice of more than six million listings around the world across over 191 countries.

Together, there are more than 25,000 experiences hosted by our community in over a thousand cities.

These numbers are amazing and very encouraging.

Scaling Airbnb as the organization grows

However, as we are contemplating the next 10 years, we want to make sure that we can take the same kind of innovation, scaling, maybe a little bit of magic together with our community, our hosts, our guests, and bring it across the entire end-to-end trip, the entire journey, because this is exactly Airbnb’s vision: to provide the end-to-end solution for anyone who travels.

And this is also the motivation behind us building a shared platform to support our products.

On top of our shared platform, we’ve built multiple business units. We have Homes, we have Experiences, we have Luxury for high-end vacation rentals, and we also have an entire entity in China.

Ideally, whenever we spin up a new business or we enhance an existing category, we should be able to reuse the shared technologies, processes, user experiences where possible, because a robust foundation would enable us to expand into different businesses easier and faster, especially when Airbnb now is moving toward service-oriented architecture, SOA.

We are facing a unique challenge: during the migration period, the transition period, we have to make sure we can continuously support the product growth, and at the same time, we have to build and scale our shared platform in order to achieve our long-term goal.

So, in the rest of my talk today, I’m going to share some of the most foundational pieces that we built into our platform from the perspective of building microservices, in the chronological order in which they happened. I will group them into five principles that we follow internally, and I will share how Airbnb benefits from each one of them.

Reusing components and services

Now, let’s take a look at our first principle today.

A shared platform maximizes the reuse of components and services.

I joined Airbnb in 2016, which is the same year Airbnb started its SOA migration from a single Rails application into microservices.

In order to start decomposing the monolith, we first looked at the tiered architecture of the Rails application.

We took a bottom-up strategy, which starts from the bottom layer, the data access layer, and then we identified a few core business data models: for example, listing, pricing, and calendar availability.

And then we decided, okay, let’s design and build the corresponding data services that will encapsulate the database access logic into the data services, and then we can start routing the traffic from hitting the database directly from the Rails application to going to the data services.

And eventually, once the data service owns 100% of the read and write traffic, then we can start to perform further data model optimization or database sharding.

So that was our initial plan. However, as you can imagine, building services, especially data services that require complex data flow, is more than just migrating queries from Rails into data services.

So let’s take a look at an example by using one of our main pages in our core booking flow, the listing product detail page.

In order to display the rich content from this page, data will be loaded, will be fetched from multiple data sources. For example, from the basic listing information to the amenities, to listing pricing, and all the way to the listing pictures.

This is from the product’s perspective, but if we look at it from a service builder’s perspective, this is what we have in mind: a data flow.

So assuming this is a simple data flow from an endpoint in our data service, in order to serve the response, multiple steps will be executed with a predefined sequence on top of the request.

And the steps themselves can be as simple as an in-memory data transformation, or they could be a remote service call.

If there is no dependency between steps, we could potentially execute them in parallel. However, there can be a dependency: for example, certain listing data will not be loaded until we confirm that a user is within a specific experiment.

In that case, the steps will be executed sequentially. So concurrency is a big deal here.
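The difference between independent and dependent steps can be sketched with Python's asyncio. This is only an illustration under assumptions: the step functions, their return values, and the experiment rule are all hypothetical, and the short sleeps stand in for remote calls.

```python
import asyncio


async def load_listing(listing_id):
    await asyncio.sleep(0.01)  # stands in for a remote service call
    return {"id": listing_id, "title": "Cozy loft"}


async def load_amenities(listing_id):
    await asyncio.sleep(0.01)
    return ["wifi", "kitchen"]


async def in_experiment(user_id):
    await asyncio.sleep(0.01)
    return user_id % 2 == 0  # placeholder rule, not a real experiment check


async def load_experimental_photos(listing_id):
    await asyncio.sleep(0.01)
    return ["photo-1.jpg"]


async def listing_detail(user_id, listing_id):
    # These three steps do not depend on each other, so they run concurrently.
    listing, amenities, enrolled = await asyncio.gather(
        load_listing(listing_id),
        load_amenities(listing_id),
        in_experiment(user_id),
    )
    # This step depends on the experiment result, so it has to run afterwards.
    photos = await load_experimental_photos(listing_id) if enrolled else []
    return {**listing, "amenities": amenities, "photos": photos}
```

Calling `asyncio.run(listing_detail(2, 7))` runs the first three steps together and only then decides whether the dependent photo step is needed.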

When traffic comes in, things are getting more interesting because not all the steps will run at the same pace. Some will run faster, some will run slower. And the slow running function, the steps, will potentially become the bottleneck for the entire data flow.

How do we detect it and how do we monitor it?

Failures could also happen from time to time, especially between service communication. When it happens, should we retry, or what is our fallback or circuit breaking strategy?

So even before we talk about any of the business logic, we already see some complexities here.

And the very first lesson that we learned in our early days was that our feature complete code can oftentimes be quite brittle under the operational constraints imposed by SOA. In other words, without a thoughtful plan or a thoughtful implementation, our feature complete code is definitely not production-ready.

The creation and operation of Powergrid

So we summarized a few challenges in front of us at the time at Airbnb, and here are the top three: concurrency, error handling, and observability.

We need to deal with this, especially as we were still at a very early stage in our SOA migration journey. We need to build a solution into the shared platform in order to increase the overall reusability for the future service building.

So as a team, we got together, we designed, built, and introduced the very first major service building block in Airbnb: Powergrid.

At a high level, Powergrid is an in-house developed Java library that simplifies the concurrent data loading, and at the same time, it emphasizes publishing standardized metrics.

One of the high-level design principles of Powergrid is that we want to educate, we want to encourage the service developers to start thinking of building an endpoint in a service as constructing a data flow, a directed acyclic graph, a DAG.

Instead of just putting, randomly putting, functions together in the code, we want to organize it as a DAG, because Powergrid internally organizes the code as a DAG where each node is essentially a function or a step within the flow which represents our business logic.

So in our simple data flow, Powergrid introduces a concept of a node, which is essentially an abstraction around the underlying function. And on top of it, Powergrid exposes the node level API to allow the developers to specify things like concurrency mode or error handling strategy.

For example, if today we have an input node and we want to run three different nodes in parallel, here is what our service developers will write by using Powergrid.

So first of all, we start from an input node, and then we can apply the mapping functions.

And then the developers can configure the node for the concurrency mode. In this example, we’ll do runAsync, which runs these nodes asynchronously with a specific timeout.

So in this case, the developers can focus on building a node, constructing a node, and putting the nodes together as a data flow, as a DAG, while writing sequential, synchronous code.

And behind the scenes, Powergrid will handle the concurrency or asynchronous programming, and at the same time, publish standardized metrics.
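Powergrid is an in-house Java library and its real API is not shown in this transcript, so the following is only a rough Python sketch of the idea: wrap each step in a node, configure it through a chained call, fan one input out to several nodes in parallel, and publish a latency metric per node. The names `Node`, `run_async`, and `fan_out` are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

_POOL = ThreadPoolExecutor(max_workers=8)


class Node:
    """Hypothetical wrapper around one step (function) of a data flow."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn
        self.timeout_seconds = None  # None means wait without a limit

    def run_async(self, timeout_seconds):
        # Fluent setter: returns self so that calls can be chained.
        self.timeout_seconds = timeout_seconds
        return self

    def _timed(self, value, metrics):
        started = time.monotonic()
        try:
            return self.fn(value)
        finally:
            # Standardized metric name derived from the node name.
            metrics[f"{self.name}.latency_seconds"] = time.monotonic() - started


def fan_out(value, nodes, metrics):
    """Run every node on the same input in parallel and collect the results."""
    futures = [(node, _POOL.submit(node._timed, value, metrics)) for node in nodes]
    # The timeout bounds how long we wait for a node; it does not cancel its work.
    return {node.name: future.result(timeout=node.timeout_seconds)
            for node, future in futures}
```

The developer only writes plain synchronous functions; the thread pool and the metric bookkeeping stay inside the helper, which is the division of labor the talk describes.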

Another example is for error handling.

So today, Powergrid also allows the developers to specify a recovery function for any single node so that when an error happens, this specific function will be invoked by Powergrid. The metrics will also be published behind the scenes so that the developers can see the error counts and root causes.
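A per-node recovery function could look roughly like the sketch below. It is not Powergrid's API: `run_with_recovery` and the metric names are hypothetical, and a plain `Counter` stands in for a metrics client.

```python
from collections import Counter


def run_with_recovery(name, fn, recover, value, metrics):
    """Run one node; on failure, count the error and call the recovery function."""
    try:
        result = fn(value)
        metrics[f"{name}.success"] += 1
        return result
    except Exception as error:
        # The exception type is recorded so a dashboard can show root causes.
        metrics[f"{name}.error.{type(error).__name__}"] += 1
        return recover(value, error)
```

For example, a node that loads amenities might recover with an empty list, so the rest of the data flow can still produce a response.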

So to summarize, Powergrid allows the developers to use its Fluent API to construct the data flow and to write concurrency safe code, and at the same time, provide complete observability.

So for any single service that uses Powergrid in Airbnb, we have this centralized dashboard automatically, because it collects all the metrics from the services that are using Powergrid and exposes very consistent information.

For example, per data flow and per node for the execution count, latency, error count, error root cause, and other information.

So, as Powergrid standardizes the server-side implementation and also the metrics publishing, we started to gain significant visibility toward the internals of our services.

However, as more and more services were developed in Airbnb, Rails is no longer the single client that talks to our services.

Instead, we started to see dependencies among the services, and here came our next challenge.

Standardizing communications between components

Principle number two: a shared platform standardizes communications between components.

So besides the three tiers that we briefly mentioned, the presentation, the business logic, and the data access, Rails also handles API routing and executes the common middleware filters.

For example, user authentication or user session injection, and something like risk checks, which are super important to Airbnb’s business. Because all the components are wrapped within a single Rails application, the internal communication is simple and straightforward.

However, in the world of microservices, request life cycle management is completely different from a single Rails application.

For example, for any single user request coming to Airbnb, regardless of how many services it will eventually travel through, the same set of user session injection, user authentication, or risk checks, has to be performed.

And at the same time, we have to make sure that the user contextual information, the request context, can travel through the entire life cycle of the request.

Why is that?

Because there are so many good use cases where we rely on the user contextual information.

For example, with a user country information that can travel through the entire life cycle of a service request, we can actually enable a location-based feature roll-out based on the user country.

And there is another very exciting feature that our service developers rely on, which is distributed tracing.

So with a unique trace ID that travels through the entire request life cycle, we can enable the distributed tracing with Datadog so that the developers can visualize how much time was spent at which service in the entire life cycle of a request from a centralized dashboard.

So, because there are so many use cases that Airbnb relies on, we have to make sure that the request context, the contextual information of a user, can be accessed and also be propagated with the request.

In order to do that, we have to standardize the communication between service and service, right?

So in Airbnb, we built an end-to-end solution starting from an API gateway.

So an API gateway is a service layer that provides request routing, the API routing, and also the middleware platform.

This is also where the request context is formed.

So from a high-level architectural perspective, the API gateway guards the entry for the requests to Airbnb’s internal services.

And at the same time, it allows the developers to register additional middleware filters.

So once the execution results come back from the middleware filters, the API gateway will extract the execution results, form a request context object, and then pass it to the internal services via HTTP request headers.
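As a rough illustration of that gateway step, the sketch below runs a list of registered filters, merges their results into one request context, and turns the context into HTTP headers. The filter functions, the request shape, and the header prefix are all assumptions made for the example, not Airbnb's actual implementation.

```python
def authenticate(request):
    # Hypothetical filter: resolve the caller from a session cookie.
    return {"user_id": request["cookies"].get("session_user", "anonymous")}


def resolve_country(request):
    # Hypothetical filter: derive the user country from request metadata.
    return {"country": request.get("geo_country", "unknown")}


def build_request_context(request, filters, trace_id):
    """Run each registered filter and merge its result into one context."""
    context = {"trace_id": trace_id}
    for middleware_filter in filters:
        context.update(middleware_filter(request))
    return context


def to_headers(context, prefix="x-request-context-"):
    # One header per context attribute; the prefix is an assumption.
    return {prefix + key.replace("_", "-"): str(value)
            for key, value in context.items()}
```

Registering another filter is then just a matter of adding one more function to the list passed to `build_request_context`.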

So in order to define the request context information, we have to use a standardized way to define the schema, right? So in Airbnb, we utilized Thrift IDL to define the request context structure. So IDL stands for Interface Definition Language. So we use IDL to define the schema.

Maybe we just forward, go to the next slide. Yeah, that’d be fine. Cool. Thank you.

So for the Request Context Schema, we can define multiple available attributes, each one with its own strong type.

And we also use annotation to specify what is the request header key that will be associated with the specific attribute to be used when it’s being propagated.
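The talk defines this schema in Thrift IDL, which is not reproduced in the transcript. As a stand-in, the same idea (strongly typed attributes, each annotated with the header key used for propagation) can be sketched with a Python dataclass. The attribute names and header keys here are hypothetical.

```python
from dataclasses import dataclass, field, fields


@dataclass(frozen=True)
class RequestContext:
    # Each attribute has a type and a header key stored as field metadata.
    user_id: int = field(metadata={"header": "x-ctx-user-id"})
    country: str = field(metadata={"header": "x-ctx-country"})
    trace_id: str = field(metadata={"header": "x-ctx-trace-id"})

    def to_headers(self):
        """Serialize the context into headers for an outgoing request."""
        return {f.metadata["header"]: str(getattr(self, f.name))
                for f in fields(self)}

    @classmethod
    def from_headers(cls, headers):
        """Rebuild the typed context from the headers of an incoming request."""
        return cls(**{f.name: f.type(headers[f.metadata["header"]])
                      for f in fields(cls)})
```

Because both directions are derived from the same schema, a service that receives the headers gets back the same typed object the gateway created.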

So once we have the schema defined, the last thing we need to do is put them together.

We have to propagate it through the entire service call chain.

So in Airbnb, we built a standardized RPC Service Client. So RPC Service Client is more than just an HTTP client wrapper. It does a lot more than a traditional HTTP client wrapper.

So first of all, it performs request, response, communication, and also it propagates the request context.

Beyond that, an RPC client also handles exceptions, because it knows how to interpret the response code from the server side so it can take proper action.

And of course, we can utilize RPC client to enhance the security level between service communications.

And overall, all of the standardized metrics are published from the RPC client.

So we can always get complete observability from the client’s perspective to monitor the performance.
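The responsibilities listed above (request-response communication, context propagation, interpreting response codes, and publishing metrics) can be sketched in a few lines. This is an illustrative wrapper only: the class names are hypothetical, and the `transport` callable stands in for a real HTTP client.

```python
from collections import Counter


class RpcClientError(Exception):
    """The server rejected the request (4xx); retrying will not help."""


class RpcServerError(Exception):
    """The server failed to handle the request (5xx)."""


class RpcClient:
    def __init__(self, service, transport):
        self.service = service
        # transport(service, endpoint, payload, headers) -> (status, body)
        self.transport = transport
        self.metrics = Counter()

    def call(self, endpoint, payload, context_headers):
        # The request context travels with every outgoing call.
        status, body = self.transport(
            self.service, endpoint, payload, dict(context_headers))
        # One counter per service, endpoint, and status code.
        self.metrics[f"{self.service}.{endpoint}.{status}"] += 1
        if status >= 500:
            raise RpcServerError(f"{self.service}.{endpoint} returned {status}")
        if status >= 400:
            raise RpcClientError(f"{self.service}.{endpoint} returned {status}")
        return body
```

Since every service uses the same client, the metric names and the error types are identical everywhere, which is what makes a shared client-side dashboard possible.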

So a quick recap for this one.

So we built an API gateway, standardized the request context definition, and we use the RPC client to propagate the request context.

So overall, we standardize the communication between services.

So right at this point, building services, creating services, has become much easier in Airbnb.

But our next challenge was: how do we ensure a higher level of productivity for our service developers?

Increasing development velocity

So here comes our next principle, principle number three.

A shared platform increases development velocity.

So ideally, this is what all the service developers should only focus on, right?

Business logic.

This is what provides the business value for all the service developers.

However, at the time at Airbnb, this was what we expected our service developers to implement.

A lot of the plumbing code and the skeleton code on both the client side and the service side, right?

So usually under the constraint of resources or, you know, deadlines, this was what we ended up having.

Which leaves our service development in an inconsistent and incomplete stage, which is not ideal.

So one question that we asked ourselves was, how can we ensure a much more efficient development flow?

How can we automate as much as possible for our developers?

So in Airbnb, we took the approach of schema-driven service development, for which we built an IDL-based service development flow.

And IDL, again, stands for Interface Definition Language.

And this is built on top of the Thrift IDL framework.

So if we look at our development flow now, the very first step our developers need to focus on is to define a schema for both the service API data model and the request-response structure in Thrift.

And then by utilizing the Thrift framework, we’ll generate the data objects that we can also use to integrate with Powergrid.

So at this stage, the developers can focus on writing business logic.

And then since we have the service API defined in Thrift as well, the client-side code can also be generated together with our RPC client.

And, of course, the schema will drive the creation of the documentation, the dashboards, and the publishing of the metrics.

So let’s take a look at an example to see how we put all of them together.

So let’s say today we are building a listing data service.

And the first step we’ll focus on is, define the listing data model, right?

This listing data schema.

So we use Thrift to define a listing structure with all the available listing attributes.

Each one of them has its own strong type.

Attribute and a strong type, yeah.

And then we can put together by using our existing data model, put together a request-response structure, because this will be used in our endpoint that we create.

And once we have the request-response structure, the service API is ready to be built.

So we have service name, My Service, and endpoints.

For this example, we have Load Listings by IDs, and, of course, we have the request and response, so the input and output for the endpoint.

And we can also use additional information to define things like SLO, Service Level Objective, so that our framework will know how to generate alerts and metrics based on the SLO definition.

So once we have the schema defined, we can generate all these data objects that we can use in the service side.

But one of the most exciting features that we also introduce is that the documentation is also being generated as well.

So for any single service in Airbnb, we have such a single, a central place for any developers to view the information.

And all this information is schema-driven.

For example, if you want to integrate with My Service, first of all, you will take a look at what My Service provides.

And if you want to integrate with any of the endpoints that we provide, here is all the information: the endpoint name, request, response, and all the available attributes and their types.
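To make "schema-driven" concrete, here is a small sketch in which one schema object is the single source for generated documentation. The talk's real schema is written in Thrift IDL; the Python classes, the `slo_p99_ms` field, and the attribute types below are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    request: dict    # attribute name -> type name
    response: dict   # attribute name -> type name
    slo_p99_ms: int  # hypothetical SLO field: 99th percentile latency target


@dataclass
class Service:
    name: str
    endpoints: list


def render_docs(service):
    """Generate a plain-text reference page from the schema alone."""
    lines = [f"Service: {service.name}"]
    for endpoint in service.endpoints:
        lines.append(
            f"  Endpoint: {endpoint.name} (SLO p99 <= {endpoint.slo_p99_ms} ms)")
        for section, attributes in (("request", endpoint.request),
                                    ("response", endpoint.response)):
            for attribute, type_name in attributes.items():
                lines.append(f"    {section}.{attribute}: {type_name}")
    return "\n".join(lines)
```

The same schema object could feed alert thresholds or client stubs, so documentation, alerts, and code cannot drift apart.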

And one of the cool features that we introduce is, what if we can send out a request directly from this page?

So we allow the developers to plug in certain testing requests from here, from this user interface, and to get a sense of what the response looks like.

And this is through our development environment.

So this end-to-end development flow improved the overall experience for our developers.

And of course, we mentioned the metrics as well.

Since we also publish standardized metrics, we built a standard, server-side dashboard.

So with the filters on top, our developers can easily navigate between servers and services.

And since all the information is unified in this specific dashboard, we group the information into different sections.

The users (the developers) can easily consume the data. And of course, the client-side, we have the same filters on top and the grouping for the information. And overall, we provide very consistent information again between the server-side and client-side.

A quick recap again.

So by utilizing the IDL-based service development flow, we greatly increased the velocity for our developers.

But development is not just about coding, right?

Coding is a big part, but there are times we have to take or have to follow certain processes.

And especially if we talk about SOA migration, chances are we have to run through a data comparison to ensure parity between two code paths, like an old code path and a new code path.

Since this is such a common practice, in Airbnb, we also focus a lot on the process.

And let’s take a look at our next principle.

Streamline development and operational processes

Principle number four, a shared platform streamlines development and operational process.

So to further illustrate this principle, let’s take a look at an example.

So this is the Airbnb host-side listing editing product flow.

So from this page, the host, the user, can edit the listing policy, from the check-in time and check-out time to the cancellation policy and other information.

And once the Save button is clicked, the information will be persisted.

Before our SOA migration for this specific product flow, this is what happened after the Save button is clicked.

So Rails still gets a request, and eventually, the request will be routed into data services for data persistence.

And our goal is to come up with a service counterpart so that eventually, once we have these services developed, we can route the traffic through our services directly.

So that is our goal.

But as you can imagine, in order to do the traffic cut-over, we have to ensure accuracy and also that the features are identical in the two tracks.

And if we have tens or hundreds of such product flow, this comparison could take a lot of time—very repetitive and tedious work.

So in order to achieve a better efficiency for our developers, in Airbnb we came up with an offline, request payload-based comparison framework to facilitate this comparison.

So in this approach, we’ll set up a staging environment for comparison purposes, but instead of just comparing the database state after the write operation, we actually compare the request payloads collected from the data service.

So let’s take a look at how it actually works.

So let’s say this is the product flow we want to migrate, and we then implement the service counterpart.

Depending on the ownership and complexity, a single request may travel through multiple business logic services before hitting the data service.

And in order to do the data comparison, we will set up a staging environment so that we can instruct the API gateway to perform a request replay, the staging tier replay.

So in this case, we will replay exactly the same request to a staging tier.

And within the staging tier, we have staging services and the staging database.

Because the replayed request originates from the same user request, it contains exactly the same request context, and also the request key, which is unique to the request.

And once the data service receives the request, what it will do is publish an event to a queue.

So for a production data service, it will publish to a production queue and for staging service to a staging queue.

And within the event, it will contain request context, exactly the same request context the data service received.

And also, it will include the request key, so that once the event’s been published to the queue, our offline comparison framework will consume the event.

So basically, it’s essentially a data pipeline. It will collect all the events with the same request key, and then it will merge the payloads, and then it will compare the payloads between the two environments, the two tiers.

So once the comparison is done, the result will be written to an offline table.

It’s a Hive table that serves as the complete audit log for offline queries.

And also, the metrics will be published to Datadog so that the service developers can easily view what attributes are matching and what are not.
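The core of such a comparison can be sketched as a pure function: group events by request key, pair the production and staging payloads, and report per attribute whether they match. The event shape (`tier`, `request_key`, `payload`) is an assumption for this example; the queues, the Hive table, and the metrics publishing are left out.

```python
from collections import defaultdict

_MISSING = object()  # marks an attribute that one tier did not send at all


def compare_events(events):
    """Return {request_key: {attribute: matches}} for requests seen in both tiers."""
    payloads_by_key = defaultdict(dict)
    for event in events:
        payloads_by_key[event["request_key"]][event["tier"]] = event["payload"]

    report = {}
    for request_key, tiers in payloads_by_key.items():
        if "production" not in tiers or "staging" not in tiers:
            continue  # the counterpart event has not arrived; nothing to compare
        production, staging = tiers["production"], tiers["staging"]
        report[request_key] = {
            attribute: production.get(attribute, _MISSING)
            == staging.get(attribute, _MISSING)
            for attribute in sorted(production.keys() | staging.keys())
        }
    return report
```

A per-attribute result like this is what lets a dashboard show which attributes match and which do not, rather than a single pass or fail per request.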

So after these frameworks were released, a lot of our engineering teams are actually using them to do the comparison.

And this gave us a lot of benefits.

First of all, it separates concerns, because now the developers only need to focus on the business logic and on implementing the comparison function, not the entire infrastructure.

And also this reduces risk because traffic now is replayed at the API gateway level.

So we can eliminate a lot of the potential human error.

And of course, the best benefit is performance.

The comparison is done offline so that it doesn’t add overhead to production.

And of course, observability: all the metrics are published from the frameworks so that engineers can easily view the metrics for the comparison result.

So another quick recap. So far we have been talking about building shared…you know, the foundational pieces into the shared platform, and it will hide all the implementation details and standardize server-client communications and also the development flow and observability.

But in order to grow our shared platform to the next level, let’s also take a look at an area where Airbnb has also spent a lot of time in the last couple of years.

Build for extensibility, scalability, and resilience

Principle number five: a shared platform is built for extensibility, scalability, and resilience.

So one of the common incident patterns in the past in Airbnb, especially in the first few years in our SOA migration, was that the server-side was vulnerable to the incoming bursty traffic.

So this usually caused server-side resource exhaustion and then cascading failures.

Over the years, we have been building multiple service resilience mechanisms into our systems.

And one very important lesson that we learned from our experience was that service resilience is not just about the client-side implementation, and it’s not just about server-side either.

Most of the time, it’s about effective communication between the client and the server.

For example, request queuing is a very common strategy or implementation on the server side to absorb bursty traffic.

The idea behind that is before a request is handled by a worker thread, it will be first put into a request queue with a specific timeout in the queue.

In Airbnb, we implemented our request queuing mechanism with a slight variation, by combining controlled delay and adaptive last in, first out. When the service experiences high load, the controlled delay will kick in so that instead of assigning the usual fixed timeout, it will assign a much smaller timeout, in this case 10 milliseconds, to the incoming requests.

And at the same time, we will reverse the processing sequence from first in, first out, to last in, first out.

So in this way, a much smaller timeout will prevent the request queue from continuously growing, and the reversing of the sequence will favor the last-in request so that it will increase the overall success rate from the server-side.
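A minimal sketch of such a queue follows. Only the 10 millisecond overload timeout comes from the talk; the normal timeout, the depth-based overload check, and the class name are assumptions. In particular, using queue depth as the "high load" signal is a simplification chosen for this example.

```python
import time
from collections import deque


class AdaptiveQueue:
    """Request queue that shortens its timeout and serves newest-first under load."""

    def __init__(self, normal_timeout=0.1, overload_timeout=0.01,
                 overload_depth=100, clock=time.monotonic):
        self.normal_timeout = normal_timeout
        self.overload_timeout = overload_timeout  # 10 ms, as in the talk
        self.overload_depth = overload_depth      # assumed overload signal
        self.dropped = []
        self._clock = clock
        self._items = deque()  # (enqueue time, request), oldest on the left

    def put(self, request):
        self._items.append((self._clock(), request))

    def take(self):
        """Return the next request to process, or None if the queue is empty."""
        now = self._clock()
        overloaded = len(self._items) > self.overload_depth
        timeout = self.overload_timeout if overloaded else self.normal_timeout
        # Evict requests that have already waited longer than the timeout.
        while self._items and now - self._items[0][0] > timeout:
            self.dropped.append(self._items.popleft()[1])
        if not self._items:
            return None
        # Under load serve the newest request first, otherwise the oldest.
        _, request = self._items.pop() if overloaded else self._items.popleft()
        return request
```

Injecting the clock keeps the behavior easy to check in a test without real sleeping.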

However, the first-in request still has a very high chance to time out, right?

So eventually, it will be rejected or discarded by the server side, and if the server is still returning a regular 500 error to the client, the client has no idea what’s happening on the server side.

So here comes our next implementation, which is backpressure propagation.

So back pressure is actually a signal that a server sends out to the client to say, “I’m currently under heavy load. So please take some proper action,” so that by using this effective communication between the client and server, we actually give the client much more information.

For example, say today we have three services, and service C is currently under heavy load. Instead of returning 500 as a status code, it will return a mutually agreed 529 back to the client.

So when the client receives it, it understands that, okay, this is backpressure from service C, so, first of all, I’m not going to…okay, let’s go back.

Assuming that you memorized it.

Yeah. So since it is backpressure, I’m not going to retry because I want to prevent a retry storm.

And at the same time, it will propagate the backpressure back to the upstream so that within the entire service call chain, no one will ever retry on this specific request, which will give service C a chance to recover.
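The effect on a call chain can be sketched with three in-process functions standing in for services A, B, and C. The 529 status code comes from the talk; the retry policy (three attempts on any other 5xx) and all function names are assumptions for the example.

```python
BACKPRESSURE = 529  # the mutually agreed status code from the talk


def call_with_retries(send, max_attempts=3):
    """Call send() -> (status, body), retrying ordinary 5xx failures only."""
    status, body = None, None
    for _ in range(max_attempts):
        status, body = send()
        if status == BACKPRESSURE:
            # Do not retry; hand the signal to our own caller unchanged.
            return BACKPRESSURE, None
        if status < 500:
            return status, body
    return status, body


def build_chain(status_from_c):
    """Wire A -> B -> C and count how many requests reach service C."""
    calls = {"c": 0}

    def service_c():
        calls["c"] += 1
        return status_from_c, None

    def service_b():
        return call_with_retries(service_c)

    def service_a():
        return call_with_retries(service_b)

    return service_a, calls
```

When C answers with a plain 500, each layer retries three times and nine requests reach C for one user request. When C answers with 529, exactly one request reaches it.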

Okay, cool.

And then another very closely related mechanism that we built into Airbnb is the API request deadline.

So for any public API, any single public API in Airbnb, we assigned a carefully chosen request deadline budget.

For example, a request can have a budget of two seconds, which means within the entire life cycle of the request, all the services that process the request have to share these two seconds.

So on the service side, before we process a single request, we will check whether the request has already expired, for example, whether its two seconds have been used up.

If it is already expired, then the server-side will not spend any additional resources to process a request.

And at the same time, if the request is rejected, the server will send a mutually agreed status code back to the client, so the client will know what’s happening and will propagate it back upstream.
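The deadline check itself is small. In the sketch below the budget is turned into an absolute deadline when the request enters the system, and every handler checks it before doing any work. The talk does not name the status code, so a symbolic placeholder is used; a single shared clock is also an assumption, since separate machines would more likely pass along the remaining budget.

```python
import time

DEADLINE_EXCEEDED = "deadline-exceeded"  # placeholder for the agreed status code


def with_budget(payload, budget_seconds, clock=time.monotonic):
    """Attach a deadline to a request as it enters the system."""
    return {"payload": payload, "deadline": clock() + budget_seconds}


def handle(request, work, clock=time.monotonic):
    """Process the request only if its deadline has not passed yet."""
    if clock() >= request["deadline"]:
        # Spend no resources on a request whose caller has already given up.
        return DEADLINE_EXCEEDED, None
    return "ok", work(request["payload"])
```

A downstream service would receive the same `deadline` value with the request, so all services in the chain share one budget.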

So both of these examples let you guys know that communication between a client and server is very, very important.

We have to let the client know what’s happening so that we can take proper action to protect the server.

So in Airbnb, service resilience is not a feature, it is a requirement, which means we are continuously building the resilience mechanism.

We are learning from our system and building what we learn back into our system.

Overall, we want to achieve a much better user experience for our end user.

In conclusion

So to wrap up, to conclude all our five principles today, I want to share two takeaways for all of you.

First of all, everything we build matters.

So when we build, build for production, because platform transformation requires a long commitment and strong discipline.

So always avoid the temptation to cut corners when chasing a deadline because that could potentially put our platform at risk.

So when we build, build for production, and never jeopardize the shared resource.

But that doesn’t mean we want to over-engineer everything, because over-engineering can usually increase overall complexity and increase the time to ship.

So when we scale, scale for tomorrow.

Always have a straightforward story about your scaling plan.

What is your plan for the next six months, next year, next two years?

So when we scale, consider 3x, 5x, but not 20x or 50x.

But if you expect your service to grow 50x in the next three months, please let me know.

I’ll be very, very interested.

Okay.

So build for production, scale for tomorrow.

And Airbnb so far has had a very, very positive experience during our platform transformation.

So my name is Victor.

I want to thank you for joining me today and listening to our story. Thank you.