Tuesday, October 9, 2012

SQL or NOSQL: Google App Engine – Part 1

SQL or NOSQL: Google App Engine – Part 1: NOSQL
NOSQL is a trending topic and pretty much everyone from Google to Facebook has some flavor of it. In this two part post we try to answer the dilemma of Sql or NOSQL. First part will explain benefits and mechanics of each plain old Sql and new shiny NOSQL. In second part we will specifically look at Google App Engine Datastore and try to answer if it’s a better choice to a given business problem

The Traditional RDBMS

It’s probably safe to say, that the majority of real-world applications rely on some type of RDBMS to store and retrieve their data. There might be plenty of reasons for your application to not choose NOSQL and stick to RDBMS instead. Let’s take a look at the key points RDBMS excel at:
  • Query flexibility
  • Maintaining consistency across the dataset
  • Managing transactions
  • Separation of concerns / Dealing with an ever evolving application(s) underneath it
In order to maintain a consistent dataset, the RDBMS enforces integrity constraints. Every action on the dataset has to take place within a transaction with ACID properties. This guarantees, that whatever happens inside the transaction will never break the consistency of your data.
Whatever new challenges your application might face, as long as your data-model stays the same, the RDBMS will happily process your queries. This separation between database and application is ideal for multiple application accessing the same database. The structure (the relations) in your data  is handled by the RDBMS. This of course requires complete knowledge of how your data is connected (provided by the relations you create), as well as strict compliance with the 12 principles of the relational model. These principles provide a solid theoretical foundation for simple and highly structured storage of (ideally) well-defined data.

A Disturbance in the Force

Unfortunately, all these great features come with a few downsides, to mention a significant few:
  • Entities with variable or complex attributes are not supported well
  • Weak support for hierarchical or graph data
  • No easy way to scale
First and second are both functional deficits: RDBMS need structure, and it can only structure what fits logically inside a relation. You could store complex attributes as binary strings for example, but the RDBMS won’t be able to operate efficiently on them. Variable attributes don’t go well with the static schema and every row is forced  to contain every attribute. Schema updates are slow and require scheduled downtime. From a “relational point of view”, this makes absolute sense. Remember, the RDBMS handles tasks that require knowledge about the structure of your data. It has to be “informed” whenever you intend to change that structure (by adding or removing attributes for example).
Relational databases are simply not great for hierarchical or graph data. These types of modelling require lot’s of one-to-many and many-to-many relationships, which can’t be modeled efficiently in a relational database. Figuratively speaking, you are trying to fit a tree or a mesh into what is essentially a table.
The third weakness is performance related. RDBMS can’t easily scale out horizontally, a huge (quite literally) problem for today’s multi-million user “web-scale” applications.
At a very basic level , databases scale by sharding. If one machine can’t handle the volume anymore, the dataset is split into subsets, shards, that can then be stored on multiple machines. A master-server handles load-balancing and routes each request to the appropriate machine (“slave”). This master-slave configuration is not the only one possible, but for the sake of brevity I won’t go into other models.
And this is the point, RDBMS can’t automatically shard data into subsets, because the information for one application-entity is (usually) stored across several database relations. If you are not familiar with relational data modelling, normalization is a good point to start. Yes, some giant web-scale applications like Twitter run on RDBMS, but developers have to implement an application-specific sharding layer, this is not automatically handled by the RDBMS.

A New Breed: NOSQL

NOSQL, “not only SQL” is a generic term used to describe a variety of databases, many of which are by no means new. But they’ve been experiencing a kind of renaissance lately. NOSQL databases aim to solve the performance problems of RDBMS by putting the structuring work back into the hands of the application programmer – You. Fortunately, applications don’t always require aggregation and structure at database level, which is why those apps can safely benefit from the vast performance improvements NOSQL databases can provide. Our goal will be to find out if your app is one of those, but first let me present a few NOSQL databases (roughly in order of complexity):
  • (Ordered) Key-Value stores (Apache Cassandra, Dynamo, Project Voldemort)
  • Object Stores (AppEngine Datastore)
  • Document stores (CouchDB)
  • Tree & Graph databases (Neo4J, Twitter’s FlockDB)
All of these solve some of the problems, and of course bring some of their own.
We now have a fairly solid understanding of mechanics and constraints of a relational and a NOSQL database. With that as a context we will take a look at Google App Engine datastore in next part of this post. Keep watching.

No comments:

Post a Comment