NoSQL

In 2006, Google released a paper that described its Bigtable distributed structured database. Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.

At first sight Bigtable resembles an RDBMS: each row is identified by a single key, and the data in a row is grouped into related column families. Because all related data lives under one row key, retrieving it is a single lookup by ID rather than a complex join, as it would be in relational SQL.
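The row-key and column-family layout can be pictured with nested dictionaries. This is only a toy model in plain Python, not the Bigtable API; the row key `user#1001`, the family names `profile` and `orders`, and the helper `get_row` are all hypothetical names chosen for illustration.

```python
# Toy model of a Bigtable-style table:
#   row key -> column family -> column -> value
# Plain dicts only; this does not use or imitate any real Bigtable client.
table = {
    "user#1001": {
        "profile": {"name": "Ada", "email": "ada@example.com"},
        "orders": {"2024-01-03": "book", "2024-02-11": "lamp"},
    },
}


def get_row(table, row_key):
    """Return every column family stored under one row key, or None."""
    return table.get(row_key)


row = get_row(table, "user#1001")
print(row["profile"]["name"])  # Ada
print(sorted(row["orders"]))   # ['2024-01-03', '2024-02-11']
```

One lookup by row key returns the profile and the orders together, where a normalized relational design would typically join two tables.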

Bigtable is designed to be distributed on commodity servers, a common theme for all NoSQL databases.

Amazon released a paper of its own in 2007 describing Dynamo, its data storage system. The paper describes a highly available, distributed key-value store used at Amazon. The keys are logical IDs, and the values can be any binary value of interest to the developer.
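The key-value contract itself is small. The following is a minimal in-memory sketch of that contract, not Dynamo's design or API; the class `KeyValueStore` and the key `cart:42` are hypothetical. The point it shows is that the store treats the value as opaque bytes and leaves interpretation to the application.

```python
import json


class KeyValueStore:
    """In-memory stand-in for a key-value store: logical ID -> opaque bytes."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str):
        """Return the stored bytes, or None if the key is unknown."""
        return self._data.get(key)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


store = KeyValueStore()
# The application decides how to encode the value; the store never looks inside.
store.put("cart:42", json.dumps({"items": ["book", "lamp"]}).encode("utf-8"))
cart = json.loads(store.get("cart:42"))
print(cart["items"])  # ['book', 'lamp']
```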

Many open-source NoSQL databases had emerged by 2009. Riak, MongoDB, HBase, Redis, Cassandra, and Neo4j were all created between 2007 and 2009.

Features

  • Schema Agnostic: A schema isn’t required, so you can store information without doing up-front schema design. Relational databases enforce the schema when data is written (schema on write); in most NoSQL databases the structure is interpreted when the data is read (schema on read).

  • Nonrelational: Related information is stored together as an aggregate, which means everything related to a record is stored in that record. In relational database theory, the goal is to normalize your data (that is, to organize the fields and tables to remove duplicate data). In NoSQL databases, especially document or aggregate databases, you often deliberately denormalize data, storing some data multiple times. A common misconception is that relational databases waste space on NULL values and that NoSQL avoids this. In many relational databases a NULL in a variable-length column such as varchar takes little or no storage (the exact cost depends on the engine), so this is not a real advantage of NoSQL databases over relational databases.

  • Highly Distributable: A cluster of servers can be used to hold a single large database. The servers can be commodity machines with no specialized hardware, and adding more of them lets the database scale to handle more data. Storing data across multiple machines and still allowing it to be queried is difficult: a query may have to be sent to every server and the replies combined. NoSQL databases handle that for you. Graph databases are an exception to the highly distributable rule. To answer certain graph queries in a timely fashion, the data generally needs to be stored on a single server, and this remains a hard, largely unsolved problem.

  • Not ACID: ACID compliance means the database is designed so that it absolutely will not lose data:

  • Each operation either completes entirely or has no effect at all (Atomic).

  • Each operation moves the database from one valid state to another (Consistent).

  • Operations on the database don’t interfere with each other (Isolated).

  • When a database says it has saved data, you know the data is safe (Durable).
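To make the ACID guarantees concrete, here is a small sketch of an atomic transaction using Python's built-in `sqlite3` module as a stand-in for an ACID-compliant database. The `accounts` table, the `CHECK` constraint, and the helpers `transfer` and `balances` are assumptions made for this example. A transfer that would overdraw an account fails halfway through, and the database rolls back the half that had already been applied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money in one transaction; True on commit, False on rollback."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # This debit violates the CHECK constraint when src is overdrawn.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


print(transfer(conn, "alice", "bob", 500))  # False
print(balances(conn))                       # {'alice': 100, 'bob': 50}
```

The credit to `bob` ran before the failing debit, yet it is not visible afterwards: the operation either completes entirely or leaves no trace.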

When the Oracle RDBMS was released, it didn’t provide ACID compliance either. It took seven versions before ACID compliance was supported across multiple database updates and tables.

Many NoSQL databases give up support for ACID transactions as a trade-off for increased availability and scalability.
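The scalability side of this trade-off comes from spreading one database across many servers and sending a query to all of them, as described under Highly Distributable. The sketch below imitates that with in-process dictionaries standing in for servers. `NUM_SHARDS`, `shard_for`, `put`, `get`, and `query_all` are hypothetical names, and hashing the key with CRC32 is just one simple placement scheme assumed here, not how any particular product does it.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 3
shards = [dict() for _ in range(NUM_SHARDS)]  # each dict stands in for a server


def shard_for(key: str) -> dict:
    """Pick a shard with a stable hash so a key always lands on the same shard."""
    return shards[zlib.crc32(key.encode("utf-8")) % NUM_SHARDS]


def put(key, record):
    shard_for(key)[key] = record


def get(key):
    """Lookup by key touches exactly one shard."""
    return shard_for(key).get(key)


def query_all(predicate):
    """Scatter the predicate to every shard in parallel, then gather matches."""
    def scan(shard):
        return [rec for rec in shard.values() if predicate(rec)]

    with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
        partial_results = list(pool.map(scan, shards))
    return [rec for part in partial_results for rec in part]


for i in range(10):
    put(f"user:{i}", {"id": i, "country": "NL" if i % 2 else "DE"})

dutch = query_all(lambda rec: rec["country"] == "NL")
print(sorted(rec["id"] for rec in dutch))  # [1, 3, 5, 7, 9]
```

A lookup by key goes to a single shard, which is why it stays fast as servers are added; a query on any other field has to visit every shard and wait for the slowest reply.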

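Many NoSQL databases are not ACID compliant. Rather than coordinating every write across all replicas with a protocol such as two-phase commit, they replicate updates asynchronously and are therefore eventually consistent: replicas can briefly return different values, but they converge once the updates have propagated.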
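Eventual consistency can be illustrated with a toy cluster in which a write is acknowledged by one replica and delivered to the others later. Everything here is an assumption made for the sketch: the classes `Replica` and `Cluster`, the single counter used as a version, and the "highest version wins" rule. Real systems use more elaborate conflict resolution.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (version, value)

    def apply(self, key, version, value):
        """Keep the write only if it is newer than what this replica holds."""
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


class Cluster:
    def __init__(self, names):
        self.replicas = [Replica(n) for n in names]
        self.pending = []  # writes not yet delivered to the other replicas
        self.clock = 0     # simple counter standing in for timestamps

    def write(self, replica_index, key, value):
        """Acknowledge once one replica has the write; others get it later."""
        self.clock += 1
        self.replicas[replica_index].apply(key, self.clock, value)
        self.pending.append((replica_index, key, self.clock, value))

    def sync(self):
        """Deliver all queued writes to every other replica."""
        for origin, key, version, value in self.pending:
            for i, replica in enumerate(self.replicas):
                if i != origin:
                    replica.apply(key, version, value)
        self.pending.clear()


cluster = Cluster(["a", "b", "c"])
cluster.write(0, "x", "v1")
print(cluster.replicas[0].read("x"))  # v1
print(cluster.replicas[1].read("x"))  # None (stale until replication runs)
cluster.sync()
print(cluster.replicas[1].read("x"))  # v1
```

Between `write` and `sync`, two clients reading from different replicas see different answers. That window is what the database accepts in exchange for not blocking writes on every replica.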

NoSQL databases are often used as high-speed caches for web-accessible data on mission-critical systems. If one of these NoSQL systems goes down, though, you lose only a copy of the data — the mission-critical store is often an RDBMS!
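This caching arrangement is often implemented as a cache-aside read. The sketch below uses an in-memory `sqlite3` database as a stand-in for the mission-critical RDBMS and a plain dictionary as a stand-in for the NoSQL cache. The `products` table, the key format `product:<id>`, and the function `get_product` are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stands in for the mission-critical RDBMS
db.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO products VALUES (1, 'lamp')")
db.commit()

cache = {}  # stands in for a NoSQL cache such as a key-value store


def get_product(product_id):
    """Cache-aside read: try the cache, fall back to the database and refill."""
    key = f"product:{product_id}"
    if key in cache:
        return cache[key]
    row = db.execute(
        "SELECT name FROM products WHERE id = ?", (product_id,)
    ).fetchone()
    if row is None:
        return None
    cache[key] = row[0]
    return row[0]


print(get_product(1))  # lamp (loaded from the database, then cached)
cache.clear()          # simulate the cache node going down
print(get_product(1))  # lamp (nothing lost; reloaded from the database)
```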

Types

  • Key-Value Store: This type is similar to a hash map and stores data as key-value pairs. It is extremely fast for writing and reading by key, but it is not well suited to updating many records at once or to querying across the entire store. Key-value stores are widely used as caches (Redis, Riak, memcached, Azure Table Storage).

  • Document Store: Similar to a key-value store, but instead of an opaque file or a single field, the value is a structured document, typically JSON or XML. Because the database understands the document's structure, you can also define secondary indexes on fields inside it. Examples of document store databases include MongoDB and Couchbase.

  • Columnar: This type (also called a wide-column or column-family store) is similar to a relational database, but the columns for each row in a table can be different. It is also very similar to a document store. The main difference is that document stores (e.g. MongoDB) allow arbitrarily complex documents, i.e. subdocuments within subdocuments, lists of documents, etc., whereas column stores (e.g. Cassandra and HBase) allow only a fixed format, e.g. strict one-level or two-level dictionaries. That restriction lets the storage engine be more efficient than a document storage engine. Depending on the storage engine, MongoDB may have to rewrite the whole document on disk if it grows, whereas Cassandra does not: in Cassandra, you can change the value of a column without even reading the row that contains it. This tends to make Cassandra faster for writes. As a rule of thumb, consider MongoDB for read-heavy systems and Cassandra for write-heavy (update-heavy) systems.

  • Graph: Graph stores focus on relationships. They excel at traversing relationships between nodes and finding patterns. For example, you can use a graph store to find friends of friends of friends, out to the nth degree, and it handles these deep traversals quickly.
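The friends-of-friends query from the Graph entry is a breadth-first traversal. The sketch below runs it over a plain adjacency dictionary, not a graph database; the `friends` data and the function `within_degrees` are made up for illustration. A graph store performs the same kind of traversal, but over relationships stored as direct links between nodes.

```python
from collections import deque

# Hypothetical friendship graph: person -> list of direct friends.
friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann"],
    "dan": ["bob", "eve"],
    "eve": ["dan"],
}


def within_degrees(graph, start, max_degree):
    """Breadth-first search returning {person: degree} for degrees 1..max_degree."""
    degrees = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if degrees[person] == max_degree:
            continue  # do not expand beyond the requested depth
        for friend in graph.get(person, []):
            if friend not in degrees:
                degrees[friend] = degrees[person] + 1
                queue.append(friend)
    del degrees[start]
    return degrees


print(within_degrees(friends, "ann", 2))  # {'bob': 1, 'cat': 1, 'dan': 2}
```

In a relational schema the same question needs one self-join per degree, which is why it becomes awkward as n grows.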