As a monitoring company, time-series databases (TSDBs) are the foundation we rely on to build a scalable and reliable monitoring solution
at Outlyer. Over the past few years we’ve gone through 3 iterations of our architecture as we scaled up with more metrics
and more customers, and we haven’t always gotten it right. We’ve tested most of the open-source TSDBs
out there, and currently own the top ranked blog on the subject, and also attempted to build our own.
You could say, through a lot of trial and error we’ve become experts in building TSDBs, and know how hard it is to do at scale. This
is why I wanted to share our journey, and as every week it seems another person has released their own open-source TSDB, maybe help
some of you avoid the mistakes we made along the way.
This blog is based on a talk I did at our last DOXSFO meetup so you can either watch the video
or continue reading below:
Our Journey To Building Something That Works & Scales
Version 1: 2013-2015
When we started back in 2013, analytics and containers seemed quite far off on the horizon. With a very small team we opted to go for the simplest solution
of just storing metrics and datapoints. We knew it had to be reliable and performant, and as we had a small team, having something operationaly resilient and
easy to manage was important to us.
That lead us to building our first architecture on Riak. Although Riak is a
simple key/value store, Riak Core, built in Erlang, is very reliable and performant. During testing we literally killed nodes and replaced them,
and Riak “magically” redistributed the data and continued running without any outages.
This worked well for the first year as we built the first version of Outlyer, but as we added more users we started getting performance issues, so we did
what all good engineering teams do when hitting scaling issues, and put the in-memory cache Redis
to act as a write buffer and quicker source to query for the latest metrics that were coming in.
Version 2: 2015-2017
Version 1 actually worked really well for the first year and a half, but by 2015 the limitations of this approach were becoming obvious as more
advanced analytics and containers became a fast approaching reality, and our product would need to advance towards having more complex analytics
capabilities and dimensional metrics.
Riak out of the box, being only a key/value store, only allowed us to do simple, non-dimensional metrics (metric names with no ability to label metrics),
and we knew we’d need to add this capability, as well as a query layer that would enable us to do advanced queries such as adding and dividing time-series
together, calculating percentiles on the fly, and other common queries you’d expect from a TSDB.
We had a choice of either building it ourselves on top of Riak or bringing in another solution already out there. We looked at the obvious
solutions at the time like InfluxDB, as we knew the founder Paul and liked it a lot,
but at the time (2015) its lack of clustering and maturity didn’t convince us we could use it for our needs (they’ve made huge progress since).
Along the way, we stumbled across a TSDB created by the talented Heinz Gies,
called DalmatinerDB. Heinz is an extremely smart and accomplished Erlang programmer, and had
taken the best part we liked about Riak, Riak core, and built a TSDB around it with a query engine that could do advanced analytics on top. It didn’t
support dimensional metrics when we found it, but as the team were excited to use Erlang based on WhatsApp’s success at scale).
Our initial testing showed a single node could write 3 million metrics/second. We felt confident
we could take DalmatinerDB, make it our own and become the main contributors to the project. We even contracted Heinz in to help, essentially sponsoring its
development, and worked closely with him to add a new PostgreSQL index so we could support dimensional labels on metrics.
This is where our journey stumbled. We continued to grow and add some new large customers, and as DalmatinerDB had never operated at the scale we were at, all the
non-obvious issues of building a TSDB started to appear quickly, and we hit some pretty bad performance issues that we ultimately fixed, but only by putting our
entire engineering resources onto the problem. Like an onion, as soon as we fixed one performance problem, the next layer would reveal another issue underneath.
Without going into the full detail of the challenges we hit, here is a short summary:
No Sensible Limits: There were no real limits or protections around what customers could send us in our architecture. This really hit us when one team at
one of our customers decided to dump 30 million metrics on us, send all of their mobile product metrics into Outlyer. This immediately put our team into
fire-fighting mode trying to operate at another level of scale we weren’t expecting to hit until much later.
Container Churn: As containers became more prevailent in 2016, we started to see some container environments churning containers every 6 minutes in some cases
leading to the problem of Series Churn (see more below) and our PostgreSQL index exploded as it had to index more metrics, slowing down our ability to query for metrics.
Conflicting Write & Read Architectures: While we were blown away with being able to write 3 million metrics/second on DalmatinerDB in our initial testing, the challenge
any TSDB needs to handle, is balancing the ideal patterns for writing and reading data in parallel as scale. Fabian Reinartz
from Prometheus put it elegantly:
“There’s obviously a strong tension between the ideal pattern for writing collected data to disk and the layout that would be significantly more efficient
for serving queries. It is the fundamental problem our TSDB has to solve.”
We experienced this problem a lot on DalmatinerDB as we scaled. While we were able to write metrics at scale, the same architecture that enabled us to achieve such
high write volumes, really hurt us when we tried to query the data back to users. For example, DalmatinerDB didn’t push queries down to the nodes, which meant
it had to pull all the datapoints for a query across our network to the query engine, which would then cacluate the time-series the user saw based on the query. This meant that
we started seeing network IO bottlenecks as at times 40GB of raw datapoints would be pulled in for a single query!
In the end, after investing a lot of time and resources, we did the other engineering solution that teams take to solve scaling issues, and essentially sharded all our largest
customers onto seperate DalmtainerDB clusters. This stopped the fires, but we knew that unless we were going to become a time-series database company, we would have to
fundamentally change our approach, instead of trying to peel the onion and hitting all the other unknown scaling issues we would inevitably hit as we grew larger.
The good news is during this time, as we were also writing and helping design the TSDB, we had learned the hard way how to scale and operate a TSDB, and could take
that knowledge to do something that would actually work for us.
Why Building a TSDB is Hard
Monitoring has become a hard problem to solve as our architectures have evolved towards Cloud, Containers and Microservices. The number of metrics seems to increase exponentially
every year as more and more company’s move towards these architectures. The single-node solutions we used to rely on (i.e. Nagios or Graphite) just don’t scale
anymore, at least not without a significant amount of time and effort, effecitvely requiring full time resources to build and manage at scale.
This growth in metrics is often referred to as cardinality. Every unique set of metric name and metric labels represents a single metric. Whereas in 2013 our largest customers
would send us 100,000 metrics (which was considered a lot at the time), nowadays its not uncommon for some of our customers to send us millions of metrics every minute.
With every new service, every new node or container, the number of metrics grows and your solution needs to scale too. For
example Kube-State-Metrics for Kubernetes can easily generate over 100,000 metrics on a single cluster,
and many organisations have multiple clusters.
At this volume, an entirely new architecture is needed, not just in the way you store and query your metrics, but also single-node architectures don’t cut it anymore. If you want to scale,
stay reliable and maintain performance, you end up having to move to a far more complex multi-node architecture to keep up with this volume of metrics.
In addition you have to meet the expectations of your users, primarily developers, who having grown used to instant search results on Google, also expect to be able to send
as many metrics as they want and be able to query them in sub-second response times. As Amy Nguyen from the Pintrest observability team has mentioned in her excellent
Monitoring Design talk, “Do whatever it takes to make it fast”.
To make the problem even harder, as our environments become even more ephemeral with containers, your solution has to handle a problem called Series Churn.
For example, if each container generates 50 metrics to monitor its performance (i.e. CPU, memory etc.), that container ID label on your metrics will change everytime
your containers start and stop, adding another 50 metrics to your index each time, that only existed for the time that container was up and running. If you don’t design for this
you will see your metrics index grow extremely large very fast as we did on DalmatinerDB.
There are some properties of time-series data, especially for our use-case, IT monitoring, which can be leveraged to make the problem
far simpler to solve:
Most Data Is Only Queried In The Last Hour: We did an analysis of our queries back in 2016 and found that the value of data decreases exponentially
after only just one hour, with over 90% of the queries only looking at the last hour of data. This means you can focus most of your efforts on solving for the
newest data, and put anything older onto cheaper, albeit slower, storage and architectures.
Datapoints Are Immutable: Most datapoints come into your monitoring system in order, or at least can be organised into order over a short period of time, and
once written don’t need to be touched again. This means you don’t need to build expensive ACID database transactions,
or worry about inserts and updates once the data is written making the way you store the data much simpler.
You Can Lose a Few Datapoints Here and There: At high volume, losing a handful of datapoints is just not noticable on graphs and alerts. While you want to ensure
you store as many of the datapoints as possible, designing a system that guarentees every datapoint must be written, is a much harder problem to solve. Accepting that
its OK to lose some datapoints in extreme scenarios simplifies the problem without affecting the end user’s monitoring in any noticiable way.
Most Datapoints Don’t Change Much Between Readings: Facebook wrote a paper on this for their own TSBD, Gorrilla DB.
As many datapoints will be the same as the last reading, or only change slightly, this lends itself well to compression,
which allows you to shrink your storage resources substantially.
These challenges as well as the proprties of time-series data discussed, lead us to create a brand new architecture for our TSDB that became the
backbone of our new version of Outlyer we recently released.
Version 3: Our New TSDB Architecture
When we made the decision to move off DalmatinerDB in 2017, and redesign our architecture from the ground up to ensure we didn’t hit the same scaling and performance
issues again, we made two key decisions that helped us design the new TSDB architecture:
- Whatever technologies we used had to have been proven at a larger scale than our current needs. This would mean we could scale up without being the
first person to hit technical issues with the technology and having to solve them ourselves as other people had already lead the charge and solved those
- As a SaaS solution, we could build a more distributed architecture than if we had to provide the solution on-premise and worry about making it easy to install.
What this ultimately means, is we can separate the concerns of the TSDB into seperate microservices and processes that we could scale and evolve independently.
This is in contrast to a monolithic solution which may have a low setup cost, but quickly becomes a full time job to operate and manage at scale. While DalmatinerDB
did separate key parts into different services, we wanted to take this further so we could switch and replace components with higher granularity on the fly as our scale grew.
We performed another evaluation of all the TSDBs out there, looking again at InfluxDB, but also newer solutions that had appeared since 2015 such as Beringei (Gorrilla DB)
from Facebook and Atlas from Netflix.
During this process, now having a very skeptial view of amazing claims that each TSDB made when we were researching, we saw there is no “perfect” TSDB. Each solution
has its pros and cons, the dream of having a solution that you can just write metrics into and scale up linearly is just that, a dream. That’s why everyone seems to
write and release a new TSDB every week, because you can find issues with everything out there.
Using our key decision that we wanted something that was battle tested in production at significant scale already, and our
relationship with Roy Rapoport, who used to head up monitoring at Netflix and had spoken at our DOXSFO meetup last year, we
ultimately decided to go with Atlas, which is written in Scala. What we liked about Atlas, was Netflix has been using it for several
years to collect and query billions of metrics a day, it had sensible limits built in, and was in-memory so performed
super fast when we tested it.
We have learned that Atlas by itself is a component, not the full solution. It doesn’t try to do everything like shared state and replication;
every Atlas node is independent. This solves a lot of complexity and performance issues as the volume of metrics grows, but requires a lot more
tooling to manage Atlas at scale. Because we store 13 months of metrics for customers, as opposed to only 4 weeks like Netflix, putting this
all in memory for the full period would have been cost prohibitive so we had to add disk based storage to our long term nodes, which was fine
as most of the metrics on these nodes would only be queried from time to time.
Understanding this, and taking our lessons learned from Version 2, we were able to design a new architecture that has been running in production for V2
since January. The key design concepts are:
Multi-Tier Architecture: Knowing that the most recent data is the most frequently queried, we decided to split our architecture into seperate
tiers based on the retention period of the data. The most recent data is stored on in-memory nodes and loaded with new datapoints in real time, so they
can be queried instantly for the majority of users, while older data is stored on cheaper, slightly slower disk based nodes. Having multiple tiers also
solves the issue of Series Churn, as old series get removed from the short term tier within a few hours, stopping the index from growing infinitely.
Backup to S3: In parallel, the raw metrics are also written to Amazon S3, which uses a Spark along with scheduled processes in Kubernetes to perform
rollups for the longer term tiers. Using durable and reliable S3 storage as an immutable source of truth separates the concerns of short and and long term
retention. This means if we have issues on any tier, other tiers will continue working unaffected. In addition we have a full backup of all the original
data which we can use to regenerate tiers and reload instances from scratch in minutes if we spot an issue with one of our teirs. We can also use
the raw data to run batch reports for users about their environments later on.
Sharding Everywhere: We built dynamic sharding into the architecture using Consul from day one. This means our larger customers can be put onto
seperate shards, which we can independently scale and tune to their needs, and all our smaller customers can be put on a single shard so we can support them
cost effectively too. As customers grow its really easy to move them to their own shard without them noticing anything, and because each shard is dynamically
configurable via Consul, we can also provide customers with “personalised” retention settings, i.e. store metrics at any resolution they want, for however long
they want so they can control their costs on Outlyer too.
Mirrors Vs. Clustering: Clustering adds a lot of complexity and overhead to solutions as we saw with DalmatinerDB. Eventually your performance bottleneck
will be the amount of chatter between nodes trying to sync data across your network. Therefore every Atlas instance in our short term tier is an independent mirror
across availability zones. This means we have at least 2 nodes with the same data for each shard, but if one goes down, the other node can handle the load while
we automatically recreate and reload the data into the other node without users noticing. We’ve built all the tooling to do this automaticaly using
Auto-Scaling groups, which means when we actually lose a node, the system recovers automatically without any service interruption or one of our team having to
wake up to fix it.
As you can see below, the architecture uses Kafka and multiple microservices to process and query metrics, we’ve essentially unbundled the
TSBD into granular services we can independently scale and evolve. The other service on the diagram called Blend is our query engine, that handles queries from our API,
and using Consul knows to which nodes/tier portions of the query should be delegated. Blend then combines the responses into a single results which the user
sees on a graph.
Note the difference between this approach and that of DalmatinerDB, where raw data was pulled from the storage nodes and aggregated later. Most of
the work is delegated to the nodes, close to the data, with the pre-aggregated results combined by Blend.
While all the detail behind how this architecture works will be left for a seperate blog by our core TSDB engineers Dave & Chris, the results so
far have been fantastic. We’re scaling while loading queries in sub-second response times, with customers doing millions of metrics per shard. In addition
we haven’t had any fires yet, and in fact the architecture has so much resiliancy built in, it auto-heals itself for the few issues we have had without
any of our team having to wake up and fix things.
Hopefully our story will make you think twice before trying to build your own TSDB in house using open-source solutions, or if you’re really crazy, building
a TSDB from scratch. Building and maintaining a TSDB is a full time job, and we have dedicated expert engineers who are constantly improving and maintaing our TSDB,
and no doubt will iterate the architecture again over time as we hit an even higher magnitude of scale down the line.
Given our experience in this complex space, I would sincerely recommend you don’t try and do this at home, and if you have the money you should definitely outsource this
to the experts who do this as a full time job, whether its Outlyer or another managed TSDB solution out there. As so many things turn out in computing, it’s
harder than it looks!
One of the advantages we also had in Version 3, is the immense help and support from the Netflix monitoring team, in particular Brian Harrington who is the
core author of Atlas and has got on calls with us several times to help us build a successful monitoring platform around Atlas. It really does take a strong
team of super talented experts to build a TSDB that performs at scale, and its been awesome collaborating with Brian and the rest of the Netflix monitoring