How to design scalable data systems

Introduction

“Data is a precious thing and will last longer than the systems themselves.” – Tim Berners-Lee, inventor of the World Wide Web.

In today's applications, unlike decades ago, CPU power is no longer the only limiting factor; handling data has become the more challenging part of our applications. From capturing sensor data for farm monitoring, to content created by users on social media, to tracking user behaviour on e-commerce websites, all of it is bombarding our data systems.

A big difference between handling a large number of tasks and handling a large volume of data lies in statefulness. Unlike CPU-intensive tasks, which can be spread across multiple stateless workers, a distributed database deals with stateful queries: to answer a query, a node needs access to the relevant data, which may span the entire database. This small but important difference makes handling data trickier than handling CPU-intensive jobs. Historically, when we had only one database instance, everything was straightforward: by default we had Atomicity, Consistency, Isolation and Durability (ACID). However, today's applications demand far more than a single database node can handle, so we need well-thought-out strategies when designing a scalable data system.

In this article I will review some of the challenges and best practices for designing data-intensive applications.

Scalability

In simple words, scalability is a system's ability to cope with increased load. Let's first take a look at how we've overcome the scalability problem for CPU-intensive tasks. If the load on our application is beyond the power of our server, we can either make the server more powerful (scale up) or distribute the load across multiple servers (scale out). As you may have already guessed, scaling up is not a solution for every application, since there is a limit on how powerful a single machine can get. Theoretically, however, there is no limit on the number of separate machines that can perform similar tasks on our requests. This approach has been used widely in the industry and is accepted as a best practice for handling a large volume of requests.

In the data world we can take a similar approach: we use a combination of asynchronous queues and load balancers to handle the flood of requests. By leveraging tools like Apache Kafka or AWS SQS, we delegate incoming jobs to other processors. The key here is that these services are decoupled and can work independently of one another.
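
As a minimal sketch of this decoupling, here is an in-process stand-in using only Python's standard library rather than Kafka or SQS; the queue, worker, and job names are hypothetical:

import queue
import threading

# A hypothetical in-process stand-in for a message broker such as Kafka or SQS.
jobs = queue.Queue()

def worker():
    # Each worker is stateless: it only needs the job payload, nothing else.
    while True:
        job = jobs.get()
        if job is None:          # sentinel value used to shut the worker down
            break
        print(f"processing {job}")
        jobs.task_done()

# The producer only enqueues work; it never waits for a specific worker.
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

for i in range(10):
    jobs.put(f"request-{i}")

jobs.join()                      # wait until all jobs are processed
for _ in threads:
    jobs.put(None)               # stop the workers
for t in threads:
    t.join()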

However, in a distributed database, dependency is an inherent part of the system. When we want to respond to a query, we cannot rely only on the portion of the data held on the current node; we need the result of the same query from all the nodes in our data cluster. This is where a distributed data store becomes more challenging than a distributed compute cluster.
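
A minimal sketch of this scatter-gather pattern, with each "node" represented by a plain Python dictionary (the shard contents and the count_tweets query are hypothetical):

# Each node holds only a shard of the data; the full answer needs all of them.
nodes = [
    {"alice": 120, "bob": 45},       # shard 1: tweet counts per user
    {"carol": 300},                  # shard 2
    {"dave": 7, "erin": 0},          # shard 3
]

def count_tweets(user):
    # Scatter the query to every node, then gather and combine the results.
    partial_results = [node.get(user, 0) for node in nodes]
    return sum(partial_results)

print(count_tweets("alice"))   # 120, but we still had to ask every node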

The first step toward making any system scalable is to have a good understanding of its current load. As a concrete example, let's look at Twitter. The two primary operations of Twitter are posting a new tweet and reading the home timeline. If we want to design a data system for Twitter, we need to know its load parameters:

  • New tweets are posted at about 6K requests/sec on average.
  • Home timeline reads, i.e. the requests Twitter must handle to display users' home feeds, come in at about 300K requests/sec.

Handling 6K writes per second is not challenging: you can simply insert a new row for each tweet into a global table such as "tweets".

However, creating a user's home feed is trickier. If we're using a relational database, we need to look up all the people the user follows, find all the tweets posted by each of them, and merge the results to build the home timeline. Something like this:

SELECT tweets.*, users.* FROM tweets
  JOIN users ON tweets.user_id = users.id
  JOIN follows ON follows.followee_id = users.id
  WHERE follows.follower_id = current_user

The first version of Twitter actually used this approach. But the query was expensive, and with the number of tweets growing every day, they struggled to keep up with the load of home timeline queries. As a solution they implemented a different approach for loading home timeline tweets: they introduced a cache that keeps each user's home feed, rather than computing it every time the user refreshes their homepage. Whenever someone posts a new tweet, Twitter updates the cached home timeline of every follower of that user. For example, if 10 people follow Alice and Alice posts a new tweet, Twitter updates 10 cache entries.
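
A minimal sketch of this fan-out-on-write idea, with plain Python dictionaries standing in for the follower index and the per-user timeline caches (all names here are hypothetical):

from collections import defaultdict

followers = {"alice": ["bob", "carol", "dave"]}    # who follows whom
timeline_cache = defaultdict(list)                 # user -> cached home feed

def post_tweet(author, text):
    tweet = {"author": author, "text": text}
    # Fan out on write: push the tweet into every follower's cached timeline.
    for follower in followers.get(author, []):
        timeline_cache[follower].insert(0, tweet)

def home_timeline(user):
    # Reading the home feed is now a cheap cache lookup.
    return timeline_cache[user]

post_tweet("alice", "hello world")
print(home_timeline("bob"))   # [{'author': 'alice', 'text': 'hello world'}]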

This approach performed better because it made the home feed read fast, which mattered most to Twitter: the rate of home timeline reads is nearly two orders of magnitude higher than the rate of new tweets. It makes sense to do more work at write time, the less frequent operation, in order to do less work at read time.

However, this approach has its own downside. On average a tweet is delivered to 700 followers, so 6K tweets per second become 4.2M writes per second to followers' home timelines. Worse still, some famous people on Twitter have around 100 million followers, and a single tweet from one of them means updating 100 million cached timelines. Just by reading these numbers you can guess that processing this fan-out in a timely manner is not a trivial task.
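
The write amplification follows directly from the load parameters above; a quick back-of-the-envelope check:

tweets_per_sec = 6_000
avg_followers = 700

# Every tweet is copied into each follower's cached timeline.
timeline_writes_per_sec = tweets_per_sec * avg_followers
print(timeline_writes_per_sec)    # 4200000 writes/sec

# A single tweet from a celebrity with 100M followers:
celebrity_followers = 100_000_000
print(celebrity_followers)        # 100000000 cache updates for one tweet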

So what's the solution? How did Twitter handle this problem? If we think inside the box and try to resolve the issue with either of the two approaches alone, we're out of luck. To find a solution to such a data problem we need to know more about the data and the patterns in which it is written and read. A key parameter in this problem, for example, is the distribution of followers per user and how frequently those users tweet. Neither the first nor the second approach alone can resolve this scalability issue, and that is exactly what Twitter's current approach reflects: a hybrid model.

Basically, based on the characteristics of each user, Twitter uses one of the approaches or a combination of the two. Most users' timelines are read from the cache, since that is fast and efficient. But famous users with many followers, or users who tweet so frequently that the cache fan-out would be massive, are excluded from the fan-out and kept in a separate store. If a user follows any of these people, their timeline is the combination of their cached timeline plus the result of a query for those users' recent tweets, merged at read time.
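
A minimal sketch of this hybrid read path; the celebrities set, recent_tweets store, and the merge rule are all hypothetical simplifications:

# Pre-computed caches built by fan-out on write (celebrities are excluded).
timeline_cache = {"bob": [{"author": "alice", "text": "hello world"}]}
celebrities = {"rihanna"}
recent_tweets = {"rihanna": [{"author": "rihanna", "text": "new album"}]}
following = {"bob": ["alice", "rihanna"]}

def home_timeline(user):
    # Cheap part: the pre-computed cache.
    feed = list(timeline_cache.get(user, []))
    # Expensive part, done only for the few celebrities the user follows.
    for followee in following.get(user, []):
        if followee in celebrities:
            feed.extend(recent_tweets.get(followee, []))
    return feed

print(home_timeline("bob"))   # cached tweets merged with celebrity tweets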

As we saw in this example, scalability doesn't have a one-size-fits-all formula, even for two similar applications. We need a good understanding of our data, our users, and the patterns in which the data is used. Only then can we fine-tune our data system in a way that works for our application.

Choosing the right data model

The most popular database type today is the relational database, in which data is organised into tables. This type of database has been around for almost 50 years, and it is very mature and efficient. However, another family of databases, known as non-relational (NoSQL) databases, has been getting a lot of attention lately. In this section I want to take a look at different data models and when to use them.

Relational databases excel when it comes to ACID guarantees (Atomicity, Consistency, Isolation and Durability). They also perform very well thanks to their query optimisers, especially if we know how to index and store our data. They have a predefined schema and have proven themselves across many applications.

On the other hand, enforcing a schema can be a downside for some use cases: if the data we're storing is inherently schemaless, trying to come up with a predefined schema can be a limiting factor. Another point is that in a relational database we typically avoid duplication by normalising our data, which gives us integrity; however, this can become a bottleneck in some applications, as retrieving data spread across multiple tables may be too expensive for our use case.

Another point is that many applications today are written in object-oriented programming languages, which do not map naturally onto the relational data structure; to use the data in such languages we need to transform it from tables into objects. Although ORMs take care of this issue, we cannot ignore the fact that the data is still being converted from one form to another.
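
A minimal sketch of that translation step, mapping a raw row (a flat tuple, the way a database driver would return it) onto an application-level object; the User class and row layout are hypothetical:

from dataclasses import dataclass

@dataclass
class User:
    id: int
    name: str
    email: str

# A row as a relational database driver would hand it back: a flat tuple.
row = (42, "Alice", "alice@example.com")

# The translation an ORM normally hides from us.
user = User(id=row[0], name=row[1], email=row[2])
print(user)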

Considering all the problems mentioned above, NoSQL databases have emerged to address those issues. For example, a document database stores data in a denormalised way by keeping all the related data in a single document, so we don't need to join multiple tables to retrieve it. This gives better write throughput and faster reads, since all the data is in one place. However, we are sacrificing data integrity: integrity checks need to be implemented in the application code, because the database will not enforce them. Keeping the data valid this way is also less efficient than the way relational database engines enforce it, since in relational databases it happens at a lower layer.
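
As a minimal illustration of the difference, here is the same tweet modelled as normalised rows and as a single denormalised document (the field names are hypothetical):

# Normalised, relational style: the tweet only references its author by id.
users_row  = {"id": 1, "name": "Alice"}
tweets_row = {"id": 10, "user_id": 1, "text": "hello world"}

# Denormalised, document style: everything needed to render the tweet
# lives in one document, so no join is required at read time.
tweet_document = {
    "id": 10,
    "text": "hello world",
    "author": {"id": 1, "name": "Alice"},
}

print(tweet_document["author"]["name"])   # no join, but "Alice" is duplicated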

If many-to-many relationships are very common in your data, a relational database can handle them, but it is much easier to use a graph database for such use cases. Graph databases come with specialised algorithms that relational databases lack: well-known operations like finding the shortest path between two nodes are built in, and if the nature of your application means you need such queries, a graph database will definitely suit you better. Graph databases are also good candidates if you plan to extend your data model later: in a relational database you need to update the schema, whereas in a graph database you can simply add a new type of edge.
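
A minimal sketch of the kind of traversal a graph database makes natural, here written as a breadth-first shortest-path search over a small, hypothetical follower graph:

from collections import deque

# Adjacency list: who is connected to whom.
graph = {
    "alice": ["bob", "carol"],
    "bob":   ["dave"],
    "carol": ["dave"],
    "dave":  ["erin"],
    "erin":  [],
}

def shortest_path(start, goal):
    # Breadth-first search: the first path that reaches the goal is shortest.
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

print(shortest_path("alice", "erin"))   # ['alice', 'bob', 'dave', 'erin']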

As you've seen, all three major categories of databases (relational, document and graph) have their own pros and cons, and you need to decide which one fits your use case.