Scalable data systems

In today's applications unlike decades ago, CPU power is not the only limiting factor anymore and data handling became a more challenging part of our applications. From capturing various sensor data for farm monitoring to the contents created by users in social media to track users' behaviour on e-commerce websites, all and all are bombarding our data systems.

A big difference between handling large quantities of tasks versus data is in their statefulness. Unlike CPU-intensive tasks that need multiple stateless workers, in a distributed database we're dealing with stateful queries. This small but important difference makes handling the data trickier than CPU-intensive jobs. Since the data needs to have access to the entire database in order to respond. Historically, when we had only one database instance, everything was straightforward. By default, we had Atomicity, Consistency, Integrity and durability (ACID) in our data. However, today application data demand far greater power than a single node of a database can handle. Hence we need well-thought strategies when it comes to designing a scalable data system for our applications.

In this article, I will review some of the challenges and best practices for designing data-intensive applications.


In simple words, scalability means the ability of software to cope with increased load. Let's first take a look at how we've overcome scalability problems in CPU-intensive tasks. If the load on our application is beyond the power of our server, we can either make our server more powerful (scale up) or distribute the load between multiple servers (scale out). As you may have already guessed, scaling up is not a solution for all applications since there is a limit on how powerful a machine can get. However, theoretically, there is no limit on the number of separate machines to perform similar tasks on our requests. This approach has been used widely in the industry and has been accepted as a best practice for handling large volumes of requests.

In the data world, we can use an almost similar approach we are using a combination of async queues and load balancers to handle the plethora of requests. By leveraging tools like Apache Kafka, AWS SQS, etc, we delegate the incoming jobs to other processors. The key here is that these services are decoupled and they can work independently of other services.

However, in a distributed database, dependency is an inherent nature of the system. When we want to respond to a query we cannot rely only on the part of data that is withheld in the current node, we need to have the result of the same query for all the nodes in our data cluster and this is where distributed data stores become more challenging comparing to a distributed compute cluster.

The first step toward making any system scalable is to have a good understanding of its current load. As a concrete example let's take a look at Twitter example. The Two primary functions of Twitter are posting a new tweet and reading the home timeline. If we want to design a data system for Twitter, we need to know the load parameters of Twitter:

  • The number of new messages posted on Twitter currently is 6k requests/sec on average.
  • The number of requests Twitter needs to handle for every user to display their home page feed is 300K/sec.

Handling 6K writes/second is not challenging. You can insert a new row for tweets into a global table like "Tweets".

However, creating the home feed of a user would be tricky. If we're using a relational database, then we need to look up all the people they follow, find all the tweets for each of those users, and merge the result to create a single user's home timeline. Something like this:

SELECT tweets.*, users.* FROM  tweets
  JOIN users ON tweets.user_id =
  JOIN follows ON follows.folowee_id =
  WHERE folllows.follower_id = current_user

Actually, the first version of Twitter was using this approach. But since this query was expensive with the increasing number of tweets every day they struggled to keep up with the load of home timeline queries. As a solution, they implemented a different approach for loading the home timeline tweets. They introduced a cache to keep the home feed of every user rather than computing it every time they refresh their homepage. They started to update the home page cache for every user following another user whenever the user posted a new tweet. For example, if 10 people are following Alice, and Alice posts a new tweet, Twitter should update 10 cache entries whenever Alice posts a new tweet.

This approach performed better because the home feed query was fast and this was more important for Twitter also the number of home timeline read requests was almost two times the number of new tweets posted. And it makes more sense that they prefer to do more work for the task whose frequency is lower.

However, this approach had its own downside. On average a tweet needs to be delivered to 700 followers so 6k tweets per second becomes 4.2M writes per second to the followers' home timeline. And also worse some famous people on Twitter have around 100 million of followers and if they post a new tweet then Twitter has to update 100 million records for each of these users. And just by reading this number, you could guess that processing this number of tweets in a timely manner is not a trivial task.

So what's the solution? How Twitter handled this problem. If we think inside the box and just try to resolve this issue by either of approaches then we're out of luck. In order to find a solution for such a data problem we need to know more info about the data and the pattern the data is fetched or being used. For example, a key parameter in this problem is the frequency of tweets by users. Not the first or second approach alone can resolve this scalability issue. That's exactly what Twitter did with their current approach. The hybrid model.

Basically based on the characteristics of their user they use either approaches or a combination of the two. Most user's timeline is read from the cache since it's fast and efficient but for famous users with many followers or users who tweet frequently and the cache update is massive, there is a separate data store and these people. If a user is following any of these people then their timeline would be a combination of their cached timeline plus a result of a select query on the famous people table.

As we saw in this example, scalability doesn't have a one size fits all formula even for two similar applications. We need to have a good understanding of our data, our users and the pattern our data is being used. Then we can fine-tune our data system in a way that works for our application.

Choosing the right data model

The most popular database today is a relational database in which the data is organised into tables. This type of database has been around for almost 50 years and it's very mature and efficient. However, another type of database known as a non-relational database is getting much attention these days. In this section I want to take a look at different data models and when to use them.

Relational databases excel when it comes to ACID (Atomicity, Consistency, Integrity and Durability). They also have a very good performance thanks to their query optimiser especially if we know how to index and store our data in them. They have a predefined schema and they've proven that they're performing well for many applications.

On the other hand, enforcing the schema could be a downside for some use cases, for example, if the data we're storing in our database is inherently schemaless then trying to come up with a predefined schema could be a limiting factor for us. The other point is that in a relational database, we typically try to avoid duplication by normalising our data which brings us integrity, however, this could be another bottleneck in some applications as the cost of retrieving the data through multiple tables could be so expensive for our use case.

The other point is that many applications today are written in object-oriented programming languages, which are not by default compatible with relational data structure and for using the data in such languages we need to transform the data from tables to objects. Although ORMs are taking care of this issue, we cannot ignore the fact that still the data is getting converted from one form to another.

Considering all the problems mentioned above with relational databases, new NoSQL databases have emerged to address those issues. For example, a document database stores the data in a denormalised way by including all the related data in a single document so we don't need to join between multiple tables to retrieve the data. It has a better write throughput and also a faster read since all the data is in one place. However, we're sacrificing the integrity of the data. The data integrity needs to be implemented in the application code and the database will not enforce it. Also, this way of keeping the data valid across our database is less efficient than the way relational database engines enforce it since in relational databases it's happening in a lower layer.

If many-to-many relationships are very common in your database, then again a relational database will be able to handle that but it's much easier to use a graph database for such use cases. There are some special algorithms implemented for graph databases that are missing from relational databases. For example, some well-known algorithms like "shortest path between two points" can be used in a graph database and if the nature of your application is in a way that you need to query for such data, then definitely a graph database will suit you more. Graph databases are also very good candidates if you're going to add new features to your database later. In a relational database, you need to update your schema but in a graph database, you can easily add a new edge to your database.

As you've seen all three major categories of databases (relational, document and graph) have their own pros and cons and based on the use case you need to decide which database fits for your use case.