Cloudy Journey: Schema Design in MongoDB vs Schema Design in MySQL [feedly]

Schema Design in MongoDB vs Schema Design in MySQL
http://www.mysqlperformanceblog.com/2013/08/01/schema-design-in-mongodb-vs-schema-design-in-mysql/

For people used to relational databases, using NoSQL solutions such as MongoDB brings interesting challenges. One of them is schema design: while in the relational world, normalization is a good way to start, how should we design our collections when creating a new MongoDB application?

Let's see with a simple example how we would create a data structure for MySQL (or any relational database) and for MongoDB. We will assume in this post that we want to store people information (their name) and the details from their passport (country and validity date).

Relational Design

In the relational world, the basic idea is to try to stick to the 3rd normal form and create two tables (I'll omit indexes and foreign keys for clarity – MongoDB supports indexes but not foreign keys):

mysql> select * from people;  +----+------------+  | id | name       |  +----+------------+  |  1 | Stephane   |  |  2 | John       |  |  3 | Michael    |  |  4 | Cinderella |  +----+------------+  mysql> select * from passports;  +----+-----------+---------+-------------+  | id | people_id | country | valid_until |  +----+-----------+---------+-------------+  |  4 |         1 | FR      | 2020-01-01  |  |  5 |         2 | US      | 2020-01-01  |  |  6 |         3 | RU      | 2020-01-01  |  +----+-----------+---------+-------------+

One of the good things with such a design is that it's equally easy to run any query (as long as we don't consider joins as something difficult to use):

Do you want the number of people?
```
SELECT count(*) FROM people
```

Do you want to know the validity date of Stephane's passport?

SELECT valid_until from passports ps join people pl ON ps.people_id = pl.id WHERE name = 'Stephane'

Do you want to know how many people do not have a passport? Run

SELECT name FROM people pl LEFT JOIN passports ps ON ps.people_id = pl.id WHERE ps.id IS NULL

MongoDB design

Now how should we design our collections in MongoDB to make querying easy?

Using the 3rd normal form is of course possible, but that would probably be inefficient as all joins should be done in the application. So out of the 3 queries above, only the query #1 could be easily run. So which other designs could we have?

A first option would be to store everything in the same collection:

> db.people_all.find().pretty()  {  	"_id" : ObjectId("51f7be1cd6189a56c399d3bf"),  	"name" : "Stephane",  	"country" : "FR",  	"valid_until" : ISODate("2019-12-31T23:00:00Z")  }  {  	"_id" : ObjectId("51f7be3fd6189a56c399d3c0"),  	"name" : "John",  	"country" : "US",  	"valid_until" : ISODate("2019-12-31T23:00:00Z")  }  {  	"_id" : ObjectId("51f7be4dd6189a56c399d3c1"),  	"name" : "Michael",  	"country" : "RU",  	"valid_until" : ISODate("2019-12-31T23:00:00Z")  }  { "_id" : ObjectId("51f7be5cd6189a56c399d3c2"), "name" : "Cinderella" }

By the way, we can see here that MongoDB is schemaless: there is no problem in storing documents that do not have the same structure.

The drawback is that it is no longer clear which attributes belong to the passport, so if you want to get all passport information for Michael, you will need to correctly understand the whole data structure.

A second option would be to embed passport information inside people information – MongoDB supports rich documents:

> db.people_embed.find().pretty()  {  	"_id" : ObjectId("51f7c0048ded44d5ebb83774"),  	"name" : "Stephane",  	"passport" : {  		"country" : "FR",  		"valid_until" : ISODate("2019-12-31T23:00:00Z")  	}  }  {  	"_id" : ObjectId("51f7c70e8ded44d5ebb83775"),  	"name" : "John",  	"passport" : {  		"country" : "US",  		"valid_until" : ISODate("2019-12-31T23:00:00Z")  	}  }  {  	"_id" : ObjectId("51f7c71b8ded44d5ebb83776"),  	"name" : "Michael",  	"passport" : {  		"country" : "RU",  		"valid_until" : ISODate("2019-12-31T23:00:00Z")  	}  }  { "_id" : ObjectId("51f7c7258ded44d5ebb83777"), "name" : "Cinderella" }

Or we could embed the other way (however this looks a bit dubious as some people may not have a passport like Cinderella in our example):

> db.passports_embed.find().pretty()  {  	"_id" : ObjectId("51f7c7e58ded44d5ebb8377b"),  	"country" : "FR",  	"valid_until" : ISODate("2019-12-31T23:00:00Z"),  	"person" : {  		"name" : "Stephane"  	}  }  {  	"_id" : ObjectId("51f7c7ec8ded44d5ebb8377c"),  	"country" : "US",  	"valid_until" : ISODate("2019-12-31T23:00:00Z"),  	"person" : {  		"name" : "John"  	}  }  {  	"_id" : ObjectId("51f7c7fa8ded44d5ebb8377d"),  	"country" : "RU",  	"valid_until" : ISODate("2019-12-31T23:00:00Z"),  	"person" : {  		"name" : "Michael"  	}  }  {  	"_id" : ObjectId("51f7c8058ded44d5ebb8377e"),  	"person" : {  		"name" : "Cinderella"  	}  }

That's a lot of options! How can we choose? Here is where you should be aware of a fundamental difference between MongoDB and relational databases when it comes to schema design:

Collections inside MongoDB should be designed with the most frequent access patterns of the application in mind, while in the relational world, you can forget how data will be accessed if your tables are normalized.

So…

If you read people information 99% of the time, having 2 separate collections can be a good solution: it avoids keeping in memory data is almost never used (passport information) and when you need to have all information for a given person, it may be acceptable to do the join in the application.
Same thing if you want to display the name of people on one screen and the passport information on another screen.
But if you want to display all information for a given person, storing everything in the same collection (with embedding or with a flat structure) is likely to be the best solution.

Conclusion

We saw in this post one of the fundamental differences between MySQL and MongoDB when it comes to creating the right data structure for an application: with MongoDB, you need to know the data access pattern of the application. This should not be neglected as creating a wrong schema design is a recipe for disaster: queries will be difficult to write and to optimize, they will be slow and they will sometimes need to be replaced by custom code. All that can lead to low performance and frustration.

The next question is: which way is better? And of course, there is no definite answer: MongoDB fans will say that by making all access patterns equal, normalization make them equally bad, and normalization fans will say that a normalized schema provides good performance for most applications and that you can always denormalize to help a few queries run faster.

The post Schema Design in MongoDB vs Schema Design in MySQL appeared first on MySQL Performance Blog.