Interview with Zach Daniel
In this edition of the Community Interview series, we sit down with Zach Daniel: Elixir aficionado and maintainer of the Datadog Elixir APM (Tracing) library known as Spandex. Read on to learn more about Elixir internals, observability concerns, and open source citizenship.
Dan: Hi! I’m Dan with the Datadog Community team. I’m here with Zach, who’s made some great contributions to the Datadog ecosystem. Please introduce yourself!
Zach: Sure, my name is Zach Daniel. I’m a software engineer and I work at an ed tech company out of Chicago called Albert.io. I’ve been writing Elixir now for about two years.
Note: Since this interview was conducted, Zach has parted ways with Albert.io and is now at DockYard, where he continues to bring the Elixir heat!
Dan: What got you into Elixir? Professional? Personal? Mix of both?
Zach: I showed up at Albert.io and we needed to do some high performance stuff—really large scale statistics. We were using Node.js and RethinkDB and we realized that that stack wasn’t really going to cut it for what we needed in the future, so we decided to switch to Postgres as our backing data source.
Zach: We really couldn’t adapt our existing Node.js infrastructure to use a Postgres database, primarily because it lacks good SQL primitives, or a good ORM. We started looking around at the popular web frameworks—things that provided good interactions with a database. We realized that Ecto and Phoenix for Elixir, two of the big community projects, would serve pretty much all of our needs.
Zach: Ecto was in fact just a very high quality tool to use for interacting with our database and as an ORM. That Ecto and Phoenix were so high quality was our biggest reason for going to Elixir. But at the same time we just thought to ourselves, “what’s something we really want to spend the next years working on, that we’re really going to enjoy?” We were happy that we could pick something so fun but also be practical.
Dan: That’s actually a rare combination.
Zach: It is.
Dan: I imagine that one can draw a line from there to participating in the Elixir library for Datadog—now known as Spandex.
Zach: Yeah, it’s actually pretty interesting! Elixir on its own provides a lot more introspection and monitoring capabilities, just sort of in the raw, than most other tools or frameworks do. You can just run :observer.start in an Elixir application and it’s going to pull up a graph of all of your running processes, and it’s going to show you statistics.
Zach: That kind of lulled us a little bit for the first, maybe, six to eight months. We got the idea that it just “came with monitoring”—but that really doesn’t cut it in the long term. Those end up being very primitive interfaces for working with things at scale. Also, a lot of the built-in tracing tools say things like, “don’t use this in production”…
Zach: So those tools exist but none of them really provide what I think is the most important aspect of monitoring: quick determination of issues. Ease of use matters, not just because we want something easy to use, but because there’s so much data, you never know what you’re looking for. So if you just have a UI that shows you all of this stuff, with no context on what’s going wrong or what you might want to look at next, you’re never going to be successful.
Zach: We ended up switching to Datadog specifically because the observer user interface left a lot to be desired. We needed something that was just more snappy—something where I can get a quick overview of what’s happening, or get quick alerts of what’s going wrong.
Dan: That is an excellent reason to switch over, frankly. I imagine at that point you went, “oh wait a minute, now we need a way to get our data from our code into Datadog”?
Zach: Yeah I mean really we knew from the outset that we were going to have to write a client, because there wasn’t one at all. There may have been some small projects, but I’m pretty sure we were the first.
Zach: Honestly, I thought the project was going to be a lot easier than it was. The documentation seemed pretty straightforward: send us a list of your spans and a list of your traces? Sure, this is all pretty cut and dry. But then I got into it and I realized there’s a whole world to the concept of tracing an application, of doing it unobtrusively, of never failing.
Zach: A monitoring tool should always opt to fail itself as opposed to causing whatever you’re doing to fail. Likewise, if your monitoring tool is making your application slower it’s doing a really bad job. It’s difficult to manage all of that state without clogging up the works.
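One way to picture the "fail itself, not the application" principle is a wrapper that contains any crash inside the tracing code. The following is a minimal sketch under stated assumptions: the module `SafeTrace` and its `safely/1` function are hypothetical names for illustration and are not part of Spandex.

```elixir
defmodule SafeTrace do
  # Hypothetical helper, not Spandex's API.
  # Runs a zero-arity function containing tracing work and converts any
  # raise, throw, or exit into an error tuple instead of crashing the caller.
  def safely(fun) when is_function(fun, 0) do
    {:ok, fun.()}
  catch
    kind, reason ->
      # kind is :error, :throw, or :exit; the caller may log it and move on.
      {:error, {kind, reason}}
  end
end
```

With a wrapper like this, application code can treat a tracing failure as a value to log or ignore, so a bug in the monitoring path does not take the request down with it.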
Zach: There was a lot more to learn than I expected. There were a lot of little things, that if you do them correctly when you’re building out your trace data to send it to Datadog, you’ll get this really great UI sugar for it. You get these really fancy breakdowns on successes and failures by status code, for example. The documentation has gotten way better, but when I started the trace stuff was still in beta.
Dan: What in the development process did you find particularly interesting? Particularly difficult? Particularly fulfilling to overcome?
Zach: What I discovered is that to protect the code that you’re working with, and to make it easy to integrate your tracing service, you have to use some sort of state management on the side. So if I say span.start, something else needs to manage where I am in that whole process, and to keep track of it for me. That way I don’t have to weave in the whole trace to everything I’m doing. That clearly comes with a lot of ways to do it wrong; there are all sorts of mistakes you can make there.
Zach: Elixir is an immutable language, even though you have processes and you can store state like that, so I had to learn a lot about practical ways to use OTP to store state. I actually ended up going with a funny feature provided by OTP, one that, deep down in the Erlang documentation, they say they didn’t really want anybody to know about: a simple process dictionary. Using it is generally a bad idea, but the process dictionary is what empowers things like Logger metadata and IEx shell history. It’s a way to fake mutability in this immutable language. I don’t know where my code is running, but there’s always the process dictionary, so I use that to stack the spans and pop them off of that.
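The span-stacking idea Zach describes can be sketched with the standard `Process.put/2` and `Process.get/2` functions. This is an illustration only: the module name `SpanStack`, its functions, and the shape of the span map are assumptions made for the example and do not reflect how Spandex itself is implemented.

```elixir
defmodule SpanStack do
  # Hypothetical module for illustration; not Spandex's actual API.
  # The key under which the stack lives in the calling process's dictionary.
  @key :span_stack_example

  # Push a new span onto the stack for the current process.
  def start_span(name) do
    span = %{name: name, start: System.monotonic_time()}
    Process.put(@key, [span | stack()])
    span
  end

  # Pop the innermost span and record when it finished.
  def finish_span do
    case stack() do
      [] ->
        {:error, :no_span}

      [span | rest] ->
        Process.put(@key, rest)
        {:ok, Map.put(span, :finish, System.monotonic_time())}
    end
  end

  # The innermost open span, or nil when nothing is being traced.
  def current_span, do: List.first(stack())

  defp stack, do: Process.get(@key, [])
end
```

Because the process dictionary is local to each process, calling `SpanStack.start_span("request")` and then `SpanStack.start_span("db_query")` nests the second span inside the first without passing any trace state through function arguments, and spans opened in one process are invisible to every other process.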
Zach: Another thing was just never being a bottleneck to the application when sending data. That proved to be fairly difficult too. We had to implement a lot of back pressure features. When the queue starts rising drastically, you make the clients wait until the queue goes down, which spreads the latency distribution across the system. You need all sorts of concurrency and latency management stuff, and I had to learn to make it into a useful tool that didn’t break things.
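A very simple form of the back pressure described here is a buffering process that clients talk to synchronously: while the buffer is flushing, callers wait in line rather than piling up unbounded work. The sketch below assumes a hypothetical `TraceBuffer` module, a `:max` batch size, and a caller-supplied `:send_fun`; none of these names come from Spandex, and a production client would need timeouts and error handling on top.

```elixir
defmodule TraceBuffer do
  # Hypothetical buffer for illustration; not Spandex's implementation.
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  # Synchronous on purpose: the caller blocks until the buffer has accepted
  # the trace, including any flush that the trace triggers.
  def add(server, trace), do: GenServer.call(server, {:add, trace})

  @impl true
  def init(opts) do
    state = %{
      max: Keyword.get(opts, :max, 3),
      send_fun: Keyword.fetch!(opts, :send_fun),
      traces: []
    }

    {:ok, state}
  end

  @impl true
  def handle_call({:add, trace}, _from, state) do
    traces = [trace | state.traces]

    if length(traces) >= state.max do
      # Flush inline, in arrival order. Other callers queue in the mailbox
      # and wait, which is what slows producers down under load.
      state.send_fun.(Enum.reverse(traces))
      {:reply, :ok, %{state | traces: []}}
    else
      {:reply, :ok, %{state | traces: traces}}
    end
  end
end
```

The design choice is the use of `GenServer.call/2` instead of `GenServer.cast/2`: a cast would return immediately and let the mailbox grow without limit, while a call makes each client share in the cost of a slow flush.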
Dan: When you were developing the library, the Datadog features were still in beta, and the documentation wasn’t yet complete. How did you figure out what features existed, what API endpoints could be called, how they should be called? Was that through looking at other code? Was that through trial and error?
Zach: It was a mix. The primary way was by picking apart the existing Python and Ruby clients. The public Slack channel for Datadog provided more than a few opportunities when I was really confused about how things were working. There was one gentleman, I forget his name, who came to the rescue a few times. Those were, and are, my two main resources.
Dan: Excellent! I’m happy that the Slack channel was useful to you.
Dan: When you were looking at the existing libraries, was there anything there that surprised you? Or anything there that you thought, “wait a minute, I could do this better?” Or, “this is better—I’m going to do it that way?”
Zach: I didn’t really do any sort of quality analysis. What I do remember feeling, and I feel like this often with Elixir tools, is that when writing in Elixir it’s just a lot simpler. I remember thinking that there’s an awful lot of code here that seemed overly complicated. Even now, I’m pretty confident that the Elixir client has significantly fewer lines of code.
Dan: How did you get other people on board with the project? Was that just a professional obligation amongst your teammates or were you getting the word out?
Zach: Internally, we had pretty good buy-in at the time. Datadog’s really slick, and there were lots of things that made it an easy decision, like we didn’t have to host it ourselves, so the Ops team was already happy.
Zach: In terms of the library, I put it on the Elixir Weekly newsletter just because it was a reasonably decent library that I had made. I didn’t make a big deal out of it because I was worried that I didn’t know what I was doing. But then something interesting happened: it got more and more GitHub stars, and people started lodging more and more issues, and I realized, “oh, people are using this in their production applications! I need to step it up.”
Zach: Now there’s another individual who’s on the core team with me: Greg Mefford, who’s also on the core Nerves team for Elixir. We’ve worked through a lot of big organizational changes to make the library more generic.
Dan: So are you still actively contributing to Spandex or have you entered more of a maintenance mode?
Zach: I am actively contributing. I’m building something on the side that is going to heavily utilize the Spandex core primitives—I think it’s going to be a really interesting tool.
Dan: Right on. And hopefully there will be native Datadog integration for that thing that you’re building?
Zach: Well, yeah!
Dan: Glad to hear it! Thanks for taking the time out to speak with me today—and I look forward to your next project!