(Original post with video of talk here)
Ben Engber, CEO and founder of Thumbtack Technology, discusses how to perform tuned benchmarking across a number of NoSQL databases. He describes a NoSQL database comparison across Couchbase, Aerospike, MongoDB, Cassandra, HBase, and others in a way that does not artificially distort the data in favor of a particular database or storage paradigm. This includes hardware and software configurations, as well as ways of measuring to ensure repeatable results.
Ben: Hi, my name's Ben Engber. I'm the founder of a company called Thumbtack Technology. We are a consulting company, with one of our primary practice areas being NoSQL development and advising clients on NoSQL. And the background of this talk is, you know, one of the things that comes up really often when we talk to clients, one of the first things they ask us is, 'What NoSQL database should we use?' And then, you know, the followup is, 'Well, we need to learn a little bit about your business, so let's do some discovery.' It's the correct answer, but it often doesn't go over that well. So, what we wanted to do is have at least a basic baseline that would introduce them to the main concepts right off the bat, and then introduce a deeper discussion based on that.
So, about six months ago, we started researching within our company to do some NoSQL database comparisons and research on the subject. And this presentation presents a way that we can perform a NoSQL database comparison. So, in some ways, what I'm going to do is come in and argue with everything that Will just said about why you can't build an abstraction layer.
Audience: [laughter] Love it. [Laughter]
Ben: No, it's actually a good complement, here, because these products are different, and you've got to be pretty careful in what you're going to do, and comparing them is pretty difficult. So, all the use cases he described are use cases people have, and this is going to walk through our experiences in trying to do this kind of abstraction, and sort of a solution around that. I think it'll go fairly well.
So, as I said, we get interest all the time in NoSQL database comparison, and the people coming to us range from clients that have heard the hot buzzword NoSQL and have decided to use it, but haven't decided what to do, which is often a very bad idea, to very technical companies who may have already implemented various types of NoSQL solutions, but are bumping into performance problems or data modeling problems, and so forth.
Some of the problems people have is that a lot of the published literature out there comparing these is riddled with different kinds of problems. A lot of the benchmark studies, which are often the first place people go, are sort of quick and dirty things where people just do default installs of the databases, run some queries against them, and make conclusions they really can't make. There are a lot of bad reports out there. And then, others delve into somewhat obscure things, like the CAP theorem.
Do people here know what the CAP theorem is? All right, so about half. The CAP theorem is actually very simple to describe. It essentially says, the C is for data consistency, the A is for availability, 'Is your system up?' and the P is for partition tolerance, which essentially means, can you handle network failures. The CAP theorem basically says you can choose two of those. So, it's easy to describe, and also, I would argue, it mostly misses the point when people talk about this.
It misses the point because, if you're implementing one of these systems, you have a network, and your network will fail, so you've got to be partition tolerant. And consistency and availability probably don't mean exactly what you think they mean. And trying to compare things on these dimensions just leads to more confusion. So what we want to do is take some of these theoretical discussions, turn them into business cases, and try to answer some business use cases in technical terms for people.
So, why use NoSQL at all? I actually think most people come up with a version of what Will talked about. You need to handle a large number of transactions, you need to handle a lot of data, and you don’t want to deal with crazy problems and tremendous manual overhead when systems go down and you need to recover and so forth.
The last bullet people come to me with is rapid application development, having schemaless storage used in that way, but it's not really something that's easily quantifiable, so I'm going to focus on the first three bullets.
So, we had this great plan for this new research effort. We were going to do a whole bunch of stuff. The first thing we were going to do was really, really simple: just take a big bunch of databases and do a very simple key-value storage test. We'd use the Yahoo Cloud Serving Benchmark, which is as much of a standard as you can get for this, although it is itself riddled with problems. And we'd take some of the existing studies out there and apply them to a new use case that hadn't really been studied too much. That is, we're still doing key-value storage, but let's measure it on bare metal with solid-state drives, because that is pretty much what people who are handling really high volumes are going to need to buy anyway.
And then, once we had that, we were going to release a series of other reports. We were going to study secondary indices, we were going to bring in all kinds of other databases, we were going to measure data loss, and this was just going to be a nice, quick beginning. And boy, were we wrong on that. This was much, much harder to do than we thought.
It's hard to do not only because these databases are different in terms of how you size your data set, RAM sizes, and their configuration parameters. Fundamentally, they work in very different ways. We've talked about CAP, and we'll go into it, but fundamentally they take very different steps to achieve these different kinds of things, and it's very hard to compare them on equal terms. So, what we quickly found was that the first question we had to cover was, 'How do we even make a fair comparison with these databases?' So, the first thing we did was drastically reduce the number of databases we were going to cover. We brought it down to, as I said, Aerospike, MongoDB, Couchbase, and Cassandra. We chose those because those were the four that clients most often came to us about. Aerospike's probably less known than some of the rest, but in ad tech, which is where we have a strong presence, it's very well known. Plus, they have a database specifically geared towards SSDs, so it was an obvious choice.
Wait, I went backwards here. Alright. Sorry. Why am I going backwards? Alright, time for a keyboard.
Alright. So, fundamentally, once we get these baselines in place, we will just do some basic performance measurements on them, and then after that, we’re going to look into what it means when we encounter hardware failures.
The first thing we did was break these performance baselines into two different scenarios. We called them 'fast' and 'reliable.' Fast essentially meant we were going to configure these databases to serve as much traffic as they can, without regard to the durability of the data. In principle, that simply means less replication and fewer writes to disk: keep all your data in RAM on the nodes. It doesn't necessarily mean not consistent, and we'll talk about why in a minute. Another thing to mention here is that when we say fast, it doesn't mean the data isn't reliable.
The data still gets written to disk, you know, it still gets replicated to other nodes. It just doesn't do it synchronously. And reliable is simply the converse. We're going to take the data, make sure it gets to SSD, and not consider any write valid until it's written to disk and replicated. To describe how to do this, we need to go into how these databases function. As I mentioned, these databases are widely different from one another, so please bear with me. There are going to be four pictures here of what goes on.
Couchbase is, in many ways, the easiest to describe. It's essentially memcache with replication and persistence. This is a hypothetical cluster of six nodes. There's data on various shards spread across it, and we're assuming a replication factor of three here. So, each node is a master for one shard, and there are two slaves for each shard. And Couchbase is a consistent database, but it achieves this very simply: it just always reads and writes from the master node. The fast scenario is very easy to describe: it reads and writes to the master, returns immediately, and doesn't wait for the data to replicate. You have the option of asking Couchbase to wait until that data is replicated. Now you have consistency.
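To make the two modes concrete, here is a minimal sketch, not from the talk, of what they look like in client code, assuming the Couchbase 2.x Python SDK's observe-based durability keywords (the host and bucket names are placeholders):

```python
# Minimal sketch of "fast" vs. "reliable" writes in Couchbase, assuming the
# 2.x Python SDK's persist_to/replicate_to keywords. Host and bucket name
# are placeholders.
from couchbase.bucket import Bucket

bucket = Bucket("couchbase://127.0.0.1/default")

# Fast: acknowledged as soon as the master node has the mutation in memory.
bucket.upsert("user:1", {"name": "alice"})

# Reliable: block until the mutation is persisted to disk on the master and
# replicated to at least one replica before the call returns.
bucket.upsert("user:1", {"name": "alice"}, persist_to=1, replicate_to=1)
```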
MongoDB works very much the same way, but it uses a more traditional MySQL-style master-slave replication model. Essentially, you either wait for the writes to succeed and get propagated, or you don't.
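In MongoDB terms, that same fast/reliable split maps onto write concerns. A minimal sketch with pymongo, where the host, database, and collection names are placeholders:

```python
# Sketch of the fast vs. reliable distinction expressed as pymongo write
# concerns; host, database, and collection names are placeholders.
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://primary.example.com")
db = client.get_database("benchdb")

# Fast: acknowledged by the primary only, no journal flush, no replica wait.
fast = db.get_collection("kv", write_concern=WriteConcern(w=1, j=False))

# Reliable: journaled on the primary and acknowledged by one secondary
# (w=2 with a replication factor of two) before the write returns.
reliable = db.get_collection("kv", write_concern=WriteConcern(w=2, j=True))

fast.insert_one({"_id": "key1", "value": "v1"})
reliable.insert_one({"_id": "key2", "value": "v2"})
```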
Cassandra is where it gets a little more interesting. There's no way you're going to be able to interpret this slide, so bear with me. But I want to talk about it, because this really starts to illustrate why this notion of available and consistent starts getting really confusing, all right? Cassandra, in the fast case: you have a single client that's going to write to a node or read from a node, but which node you write to and which node you read from are completely independent. So, if you write to a node, eventually it's going to make it to some of the other nodes in the cluster. And if you read from a node, that node's eventually going to get the data you want. But as you can imagine, that is truly inconsistent data.
Cassandra enforces consistency in, actually, a simple way. When you write, you can either wait for one, or all, or a quorum of the nodes to acknowledge the write. And the same when you read. So, if the number of nodes you read from plus the number of nodes you write to is more than your replication factor, you're going to get consistent data. That might mean you write to all the nodes every time and read from any one. It might mean you write to one, but you always read from all of them and just take the most recent. Or you write to two and you read from two, and take the most recent. That's all built into Cassandra.
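That rule, read nodes plus write nodes greater than the replication factor, is just a pair of consistency levels on each statement. A minimal sketch with the DataStax Python driver, using the write-all, read-one combination discussed below (contact point, keyspace, and table are placeholders):

```python
# Sketch of Cassandra's tunable consistency: W + R > RF gives consistent
# reads. Contact points, keyspace, and table are placeholders.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["10.0.0.1"]).connect("benchks")

# Write-all: every replica must acknowledge the write (W = RF)...
write = SimpleStatement(
    "INSERT INTO kv (k, v) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ALL)

# ...so a read from any single replica (R = 1) is guaranteed to be current.
read = SimpleStatement(
    "SELECT v FROM kv WHERE k = %s",
    consistency_level=ConsistencyLevel.ONE)

session.execute(write, ("key1", "value1"))
row = session.execute(read, ("key1",)).one()
```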
The complexity here is, if you look at what Cassandra says on its wiki, it says, 'In distributed data systems like Cassandra, consistency usually means that, once a writer has written, all the readers have seen that write.' That's true. That model absolutely enforces that. But, if you think about a case like one writer writing to all the nodes, and you have people reading from just one node while that write is in progress, some reads are getting the new data and some reads are getting the old data. And, sure, when the write is done they'll all get the new data, but it's probably not consistency like most people think of consistency, and certainly not consistency like people moving from relational databases think of consistency. For our test, we used the write-all, read-one model of consistency, because it is durable, for sure. But it's something to keep in mind when you're trying to design your database.
B: Can you take a question?
Ben: Sure.
B: So that means that, when you write, you basically wait for the subsequent writes to the slaves to complete. So let's say, on one of the nodes, the RAID cache battery dies, right? It becomes slow. That means that the whole set of nodes connected will also slow down, is that right?
Ben: Yeah, yeah. That’s a really good question, and we’re going to get into this in some more detail.
B: So, it kind of defeats the redundancy, in a way.
Ben: It doesn't defeat the redundancy, but it does have a dramatic effect on trying to do consistent writes. I will get into that in a bit.
C: One more question. What is the node count? Is the node count relevant to these pictures that you’ve been drawing?
Ben: The node count?
C: Yeah.
Ben: This is assuming a six-node system with a replication factor of three. I'm just drawing it that way because . . .
C: I mean, is that similar to all the other pictures? I just want to make sure that I understand it.
Ben: Yeah. All these pictures I'm showing assume a six-node cluster with a replication factor of three. In fact, our test was four nodes with a replication factor of two.
C: Got it.
Ben: Because we only had four nodes.
C: Yeah.
Ben: This actually illustrates it a little bit better, so I was a little bit generous in the picture.
C: That’s what I just wanted to make sure about.
Ben: Okay.
Aerospike, again, is simpler to describe. It works just like Cassandra when running in fast mode for its writes. For the reliable mode, it actually has ACID semantics, so it does distributed transaction commits across the nodes and behaves in an ACID way.
This is a big grid, which I'm not going to go into now in the interest of time, but this presentation is available online; you can look at it if you want. What I wanted to talk about is consistency on single-node failure. That's kind of the interesting thing we're talking about here, when we talk fast and reliable. If one node goes down, have you lost data? When we run in synchronous mode, no. When we run in fast mode, yes. So, that being said, those were the baselines we set up.
The tests are actually very simple. We put the databases on the cluster. We tuned the crap out of them and made sure everything was optimized with the appropriate hardware, according to the vendor recommendations. All these guys are our partners. They all helped us, you know, make sure everything was tuned, and yelled at us a lot when we made mistakes. Then we loaded the data set to disk, just doing regular inserts, and determined the maximum throughput we could get, either with a balanced read-write workload or with a 95% read, 5% write workload. Then do that again, and there's your latency for various levels of traffic. And then repeat the whole thing for the in-memory, fast case.
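The driver for all of this was YCSB. Here is a minimal sketch of that load-then-run sequence using the stock YCSB command line; the binding name, thread count, and target throughputs are illustrative placeholders, not the study's exact settings:

```python
# Sketch of the YCSB load/run procedure described above. The binding name,
# thread count, and target throughputs are illustrative placeholders.
import subprocess

def ycsb(phase, binding, workload, threads=64, target=None):
    """Invoke YCSB's load or run phase for one database binding."""
    cmd = ["bin/ycsb", phase, binding, "-P", workload, "-threads", str(threads)]
    if target is not None:
        cmd += ["-p", f"target={target}"]
    subprocess.run(cmd, check=True)

# Phase 1: bulk-insert the data set with regular inserts.
ycsb("load", "mongodb", "workloads/workloada")

# Phase 2: replay the workload at increasing target throughputs and record
# the latency at each traffic level.
for ops_per_sec in (50_000, 100_000, 200_000):
    ycsb("run", "mongodb", "workloads/workloada", target=ops_per_sec)
```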
So, talking about inserts. It's not the most important part of this, but you can see the key-value stores were, in fact, extremely fast. Both Couchbase and Aerospike did hundreds of thousands of transactions per second on the SSDs. Here, I made sort of a funny bar on the chart, and I just want to use it to illustrate why it's so hard to compare these.
The issue, in this case, is that Couchbase needs a lot more RAM; it keeps a lot more metadata in RAM for each row. So, there was just no way to compare insert performance back and forth. We had only 200,000,000 rows of data for Couchbase versus 500,000,000 rows of data for Aerospike, so, yes, it loaded quickly, but it's not directly comparable, and there's no way to balance everything out. What we wanted to do was get the right balance of disk to RAM, so that meant different data set sizes.
A more interesting chart is this one. Again, this is for both workloads. You can see the number of transactions we get against the system. When running in fast mode, these things were indeed very fast, ranging from 450 to 900 thousand operations per second, with Cassandra and Mongo trailing drastically. But not all that slow, with 100,000 for Mongo.
When running in our reliable mode, probably the most obvious note here is, well, two obvious things. One, Aerospike was an order of magnitude better than anybody else. And two, Couchbase didn't run at all. Couchbase, in theory, you know, the underlying disk was doing about 40,000 operations per second, but it got bottlenecked by RAM. So, this is something we've been going back and forth with Couchbase on. They hope to release a fix in, you know, one of the minor versions coming up. But right now, don't try to use Couchbase in a synchronous way; that's the main takeaway.
I have a lot of these graphs; I'm only showing you a couple. The main thing to see here is much like Will and Mike said: we saw less than one millisecond latencies all the way up to top capacity on both Aerospike and Couchbase. And even Mongo and Cassandra were fast for most of their throughput levels and trailed off only when they were reaching their absolute peaks. So, these systems were all quite fast.
C: That's what I just want to confirm. So, you're saying that the peaks of many of these are kind of way toward the left side of the graph. I'm trying to understand that.
Ben: Well, yeah. Good question.
Unknown: Can you repeat the question please?
Ben: So, the question was, why are the peaks all the way to the left side of the graph? The answer is, actually, let me go back, because the maximum load that Mongo is getting is way less than the maximum load that the others are getting. So, what you see sort of all the way to the left with Cassandra is actually Cassandra maxing out at about 30,000 operations, and that's . . .
C: That's still a decent throughput, but a fairly low response time. What was the limitation on that?
Ben: Sorry. What is the question?
C: Is it always memory that is the limitation? I mean, usually what happens is there's an exponential climb, similar to the Cassandra in-RAM asynchronous case. Usually, when something maxes out at, say, 30,000 a second, all of a sudden I'll see an exponential climb when it goes to 35,000, and it almost drops off. It's unmeasurable.
Ben: Yeah, on the asynchronous ones, it was definitely just that. I'm sorry, the question was, what limited these things and caused these latency slowdowns? For the synchronous case, it was disk IOPS. For the asynchronous case, it's actually, as I said, a little hard to tell right here. They seem to be, sort of, page swaps, but it wasn't necessarily that easy to know what was going on.
D: Question?
Ben: Sure.
D: I got lost somewhere. When you're talking about synchronous, what are you referring to here? Are you talking about waiting for the write to complete, is that what you mean by synchronous? In other words, the next write won't occur until the first one completes?
Ben: Right, right. So, the question was, what do I mean by synchronous here? I should probably have used consistent terminology through my slides, because I defined the fast and reliable use cases; this just means the reliable use case. That means we will wait for the replica, in this case, our replication factor was two, so there's one replica. We will wait for that replica to confirm that it's been written to before marking the write as okay. And, if possible, we will confirm that the disk has been written. Now, confirming that the disk has been written is not the whole story, because the disk could be written, but the fsync might not have run yet, so it might actually still be in, sort of, controller memory, and that differs from database to database. But the write getting to the other replica is the critical point, because if you lose the node, you still have the data.
E: Question. Aren't these like apples and oranges? Cassandra's model will use eventual consistency, whereas MongoDB and CouchDB work differently in terms of being in RAM, I'll give you that. So, when we're looking at this, I mean, if you wanted to, you could tune it to serve any particular purpose; I just want to understand how these were done. Were these for an ad-related test? Because SSDs will fall over in certain situations; when you have terabytes of data, it's just not going to work. I'm sorry, I didn't catch it earlier in your presentation. What was the focus of this study? What was the consistency across the board? I mean, of the CAP work here, was it C, A, or P, or were these tests just . . .?
Ben: Okay. I’ll try and paraphrase your question.
E: Sure.
Ben: The question is, in these scenarios, when you say consistency, are you really trying to match consistency to consistency? Are you trying to match the same consistency levels for the databases? And the answer is, sort of. The problem with the term consistency is that it means different things, fundamentally. So, you know, Couchbase is always consistent, because it always writes to the master. However, if that master node goes down, you've lost data.
E: You’re screwed. Yeah.
Ben: Yeah. Maybe you're not screwed, and that's perfectly fine. There are lots of good uses for that scenario, and, you know, as a persistent cache, it's fine. So, we're not actually measuring that. We're trying to measure how much you care about data loss. So, what these two scenarios are: the asynchronous represents Cassandra running in inconsistent mode, and all the other databases as well. The synchronous represents waiting for another replica to be populated. In Cassandra, that means using write-all in our test, so it is running Cassandra in consistent mode . . .
E: I think you answered my question. So you are forcing it to be consistent and then measuring it.
Ben: We actually hacked the YCSB code to have a consistent mode, because that’s not something supported [inaudible].
E: [inaudible]
Ben: Okay.
E: Thank you.
Ben: So, I think the conclusions for the performance part are obvious. Aerospike claims to be an SSD-optimized database. They've written their own drivers to access the raw hardware. That was very clearly true in our tests, so that's why they like us. For the asynchronous case, both Aerospike and Couchbase were just unbelievably fast, and the other databases really were pretty fast, too. This only answers half the questions, right? That's the raw performance. The next question is, what does this really mean in the face of node failures? We know you lose data, but what actually happens? So that's what we tried to answer next.
So, we then created a set of tests to disrupt the cluster by taking a node down. This picture down here was our hypothesis; it's what we expected to happen. A node goes down, and we expect the cluster to be unavailable while it tries to figure out what the hell happened. It comes back up, eventually gets to its original throughput, and then the same thing happens when we bring a node back up and it tries to recover. Then there's a period of replication, and then you're done. So, we did this. We modified a whole bunch of parameters: throughput, different ways of killing the nodes, the same scenarios we talked about before. And I'm going to walk you through some pictures quickly that I think illustrate it better than numbers do, all right?
This first set of pictures describes what happens when you run at 75% load in our fast case. We chose 75% load because, if you're taking one node out of your four-node cluster, you should expect to be able to do up to 75%. So, a perfect system will show no impact, but an imperfect system will show stuff. What you can see here is that, in fast mode, these all basically worked similarly to how they did before. Cassandra had some latency penalties while that node was down. Aerospike showed very little, other than a very brief period of downtime. And the others came in between. You can see in Couchbase there's a little bit of replication traffic at the end, because Couchbase erases a node when you rejoin it to the cluster.
F: I’m sorry. Could you just tell me what these graphs are?
Ben: Oh, I’m sorry. The question is, what do these graphs mean? The horizontal axis is time. And the vertical axis is throughput in operations per second.
F: So, the top one is operations per second, and the bottom one is for . . .
Ben: So, the top one is operations per second, the bottom one is latency, and they’re highly correlated.
F: Are all the graphs on the same scale?
Ben: They're not on the same scale. The numbers carry over from the last slide, so you can imagine Aerospike's is . . . We had 300,000 before, so it's 300,000, and that's actually a good point which we'll talk about later. Sometimes you'll see a big dip in one of these graphs. That doesn't mean it's slower than the others, it just means it's slower than it was. It's probably more important to see the dip, because you're going to be using this for capacity planning, but don't compare them in absolute terms.
Just to illustrate what happens here, the next thing we did was run it at 100% capacity, where, as you might expect, when a node goes down, you've lost 25% of your capacity. You can see that all the databases except Mongo, which just uses traditional master-slave, showed an impact, and you can see some of the cost of getting back up to speed a little bit more in some of these other ones. Both Couchbase and Aerospike have to work a bit to get up to that prior level, because they're trying to replicate while running at, essentially, their maximum capacity.
To me this illustrates why we did 75%: you shouldn't be running your hardware at 100% capacity anyway. Probably more interesting are the 75% reliable versions. You see we have a question mark for Couchbase, because we just couldn't get it to run. But this goes, I think, to Mike's question. In the case of Aerospike, when you bring down a node, and you require that you replicate each request, Aerospike does what you would hope something would do, and replicates to a new node. It finds a new place to store the second copy of its data. So, it's able to continue processing transactions throughout the whole period.
When you bring that node back up, well, if you've been doing hundreds of thousands of writes per second for ten minutes, you've got a lot of database operations to replicate, so there's a very severe penalty that lasts for some time as it tries to bring the node back up and bring it up to date. Cassandra and MongoDB are, sort of, the central question of what we ask. If you're running with only two copies of data, and you're saying write to both of them, Cassandra and MongoDB do not relocate that data. Those writes will simply fail. So, they simply have downtime.
The good news is, when they get back up, there's no data to replicate, so it comes right back up. It does for Cassandra, too; that dip is an artifact of a driver, which is not a particularly interesting story.
But I think the central lesson here is, if you want consistent data, have at least three copies for these databases, and don't use write-all, because you're going to see this. In terms of total downtime, you know, it's a little bit chaotic. There seems to be no pattern as we re-ran the test. But the good news is that none of these databases stayed down very long.
When we brought a node down, the worst thing we saw was about 12 seconds of downtime. Couchbase has a little special thing where it takes about 45 seconds, but if you write your own monitoring and recovery scripts, which we did, you can get it down very low. And in bringing a node back up, only Mongo had any real penalty whatsoever, about 30 seconds. Again, this doesn't mean you're at full capacity; this just means how long your cluster is down.
But our reaction to this was, you know, if you're talking about six seconds of downtime, you've probably got other problems bigger than your database, so from a downtime perspective, these all seem to work as advertised.
Another big chart here. Again, if you're interested, please download it. If you have questions, I'm happy to go back to it. Potential data loss here. We've calculated some of the data loss out, based on the results of our test.
C: Can we go back once? I just want to make sure I understand one thing. It says 'node down replication.' What does that mean?
Ben: Oh, right. What does that percentage mean? Node-down replication is the percentage of data that is replicated. 50% means that you have, on average, one and a half copies of your data, right, which is exactly what you'd expect if you took down one of four nodes. And so, for Cassandra we had to [inaudible] there, because Cassandra doesn't really report that number. It reports the number of SSTables, I forget the name of what it has, which it doesn't necessarily clear, so it's a little hard to get, sometimes. And then, when the node comes back, it went up to 100, with the exception of Couchbase, which was up to 76, because it takes some time to recover, basically.
And measuring the time to recover, by the way: when a node comes up, Couchbase wipes out that node and re-replicates everything, so it's always showing you the worst-case replication number. One thing we considered doing was trying to measure how long it takes to achieve full capacity again. The problem is, that's not a very interesting number, because if Mongo's doing 20,000 writes per second and Couchbase is doing 500,000 writes per second, Couchbase has a much bigger problem to solve, so how do you even compare these numbers? What you need to know is that it's going to be there. You'd better plan for it.
C: One more question about this. I guess in the potential data loss column, or row, only Aerospike sync and Cassandra sync are the ones that had none. All the rest have potential data loss?
Ben: Yeah. If you run Couchbase in synchronous mode and it works, that would also be none. And MongoDB in synchronous mode would also be none. Those are not typical write modes for those two databases, and this table would be completely unreadable if I included those two in there, so if it's run synchronously, the answer is none.
C: So, in a typical usage scenario, Aerospike and Cassandra should be your target if you're targeting better reliability?
Ben: MongoDB can do it, too. Let me repeat the question. The question is, should Aerospike and Cassandra be your databases of choice if you're targeting reliability? Of these four, those two are designed explicitly for that case, so that is a natural setting for them. Couchbase claims to have that setting, but we were not able to get it to work. MongoDB has that setting, and we were able to get it to work. For these results, we found Cassandra and Aerospike more compelling, because their numbers were higher for this case. MongoDB is more of a document database, and we're doing a raw key-value store. So, we highlighted certain things, but there's a lot of data behind these tests. Mongo's not a crazy choice.
F: I’m wondering how much of this is really being influenced by the limited number of servers you’re using. If you had 10 or 20 servers would you be experiencing these same latencies of recovery, especially if you have lots of redundancy and replication going on already to prevent, you know, failures.
Ben: The question is, how much of this behavior is an artifact of our cluster size, which is a good segue into the next slide, or two slides from now. The first thing I would say, addressing this, is: what does it say about MongoDB and Cassandra if you have a replication factor of two? Well, it doesn't work if you want to do reliable.
If you had a replication factor of three, it says it won't work if you're using write-all semantics. But, if you're using quorum semantics, it should work. I would have loved to give, sort of, pictures of what's going on in that case, and that's going to have to be chalked up as a to-do. The issue there is just, very simply, that what we were running was bare-metal hardware with an array of SSDs. It's actually pretty expensive to put even a four-node cluster together for this.
There is sort of an analogous test, which we're doing right now: just put these things on EC2 and take a whole bunch of nodes. You're not going to get, sort of, those Aerospike SSD numbers, because there's latency in all the virtualization and so forth. But you can get a better picture of how larger cluster sizes handle failure. That being said, some of these effects are effects of the amount of data.
If you have hundreds of millions of rows that have to be replicated, those hundreds of millions of rows need to be replicated, and they're going to need to be replicated to every node that is picking them up. So obviously, if you have dozens of copies of this data and one node down, it's probably not going to affect you nearly as much. But I'm not aware of too many people who have a lot more than three copies of data floating around. Personally, I'm not really aware of people doing that, because your hardware investment goes through the roof.
G: So, for the reliable scenario, could I assume that I could use a replication factor of two for Aerospike to get similar data reliability to MongoDB and Cassandra?
Ben: Yes.
G: Okay.
Ben: Yes, with certain caveats. If you use two, and you lose both those nodes, you’re still going to lose some data, whereas if you use three, in MongoDB and Cassandra, you’ll still have a copy.
G: So three is probably a good idea, just across the board.
Ben: And five is even better. Right, you have to balance how many copies you want versus how much you want to invest and how much you want to deal with a large cluster. The odds of both nodes going down in these catastrophic ways are not very high, especially because you can bring another node up and start recovering immediately, but it's a possibility. The odds of three going down are getting remote.
These products do have cross data center replication, so your data center getting nuked or something could be an issue there. If you want to get to that highest level of reliability, you know, cross data center replication is something to consider, but that opens a whole new can of worms, because you start to be limited by the speed of light in getting your data to the other data center. So you're definitely not getting one-millisecond response times if you're trying to write to multiple data centers synchronously. And all these databases allow you to tune how you do cross data center replication. That also was well beyond the scope of our lab.
G: Okay, thanks.
Ben: Sorry, did you have a question for me?
H: So, the asynchronous case is the eventually consistent one, right? And it would seem in that case that having greater than two is good, because then you get voting, right? In terms of, if a node goes down, if you have enough copies, then when pulling the data back, if it's coming back from multiple nodes, you can have a voting thing which determines what the data should be.
Ben: Well, that's the Cassandra quorum. It is an eventually consistent system. Is a replication factor of three good because you're allowed to vote on which copy of the data is best? I like talking about Cassandra, because Cassandra works in a very simple way that makes these questions easy to illustrate. The quorum mode of Cassandra is, if you're reading from two copies when you have three, you pick whichever one has the newer timestamp, because if it was written to two, the older one must be the wrong one. That's not exactly eventually consistent, that's fully consistent, because you've written to two copies, and you're always getting the latest data. Eventually consistent is the first case, where you write to one; even if you read from two, neither one of them might have the data yet. So the fast scenario is still eventually consistent, and the reliable scenario is fully consistent. And again, consistency ends up being sort of the wrong thing, because MongoDB and Couchbase will always return a consistent answer, unless something goes down, in which case they won't. Well, I'm starting to repeat myself here. I think I said . . .
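The timestamp-based resolution he describes is easy to picture in a few lines. This is only a toy illustration, not from the talk, and the replica responses are made up:

```python
# Toy illustration of the quorum-read resolution just described: of the
# replicas that answer, keep the value with the newest write timestamp.
def resolve(replica_responses):
    """replica_responses: list of (write_timestamp, value) pairs."""
    return max(replica_responses)[1]

# Two of three replicas answer a quorum read; the stale copy loses.
print(resolve([(1001, "old value"), (1005, "new value")]))  # -> "new value"
```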
I: In Cassandra can you select how many nodes either you’re reading from or writing to on a call by call basis?
Ben: The question is, in Cassandra, can you choose the number of nodes being read from or written to on a call-by-call basis? The answer is yes; that's true for Cassandra, for MongoDB, and for Couchbase. For Aerospike it is a server configuration parameter.
J: After the initial load of data, is the test only reading data, or is it reading and writing at the same time, too? Does the test actually read and write, or only read?
Ben: I’m sorry, can you repeat the question?
J: After the initial write of the test data, are these latencies for only the reads, or are they with writes going on?
Ben: Okay. That’s a great question.
J: Because the reason I'm asking is, the last time I checked, there is a certain read-to-write ratio MongoDB suggests, otherwise it gives high latencies. So, some databases have a certain read-to-write ratio that they suggest, and maybe in my situation I want to respect, or at least get a really good sense of, what the read-to-write ratio is.
Ben: So, the question is, are there reads and writes occurring on these databases concurrently, and what is the ratio of reads to writes? And the latencies we report, are they read latencies or write latencies? The answer is, we ran several scenarios here. The ones we're focusing on here are two different scenarios. One is a concurrent blend of 95% reads and 5% writes; that's sort of a read-heavy, very common application scenario. The other one is a 50-50 mix of reads and writes, and so when I show 'balanced workload' and 'read heavy workload,' that's what those two columns are.
What are the latencies we're reporting on that? The answer is, we have a lot of those graphs, and in the interest of time and boringness, I have only included a very small number in this presentation. If you're interested in the exact read latency and write latency for each scenario, we have the full report. It's a long report that has all the details, so you can examine them to your heart's content. I actually don't remember which ones I included here, whether it's read latencies or write latencies. I think I included some reads, like just one scenario with reads and writes, top to bottom.
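For reference, those two mixes correspond to standard YCSB CoreWorkload properties. Here is a hedged sketch of what such workload files look like; the property names are standard YCSB, but the record and operation counts are placeholders, not the study's figures:

```python
# Sketch of the two workload mixes as YCSB CoreWorkload properties.
read_heavy = {
    "workload": "com.yahoo.ycsb.workloads.CoreWorkload",
    "recordcount": 50_000_000,       # placeholder data set size
    "operationcount": 100_000_000,   # placeholder operation count
    "readproportion": 0.95,          # 95% reads
    "updateproportion": 0.05,        # 5% writes
    "requestdistribution": "zipfian",
}
balanced = dict(read_heavy, readproportion=0.5, updateproportion=0.5)

def write_workload(path, props):
    """Write a YCSB-style properties file."""
    with open(path, "w") as f:
        for key, value in props.items():
            f.write(f"{key}={value}\n")

write_workload("workload_read_heavy", read_heavy)
write_workload("workload_balanced", balanced)
```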
K: I’m curious as to why you covered two versions of Couchbase.
Ben: The question is, why did we cover two versions of Couchbase? The answer was, we were doing Couchbase 1.8, and then we learned that this study was a lot harder than we thought it was going to be, and they released Couchbase 2.0, so we did both.
K: What about comparing experiences before and after?
Ben: They performed really similarly, actually.
K: 2.0 has cross data center replication capabilities [inaudible].
Ben: Right. The point is that Couchbase 2.0 has cross data center replication capabilities. That's true, but since we did not have multiple data centers, it was not one of our test suites. Couchbase 2.0 also has secondary indexes and a bunch of other features. Couchbase 2.0 also completely rewrote its storage engine, which made us think it would have a dramatic impact on performance, especially in the synchronous case, at least. We saw no real significant difference there. Neither one really worked.
Now, again, this is what you get when you test a 2.0.0 release. None of these products were bug-free, I'll put it that way. Cassandra's probably the least buggy, but all of them had issues that cropped up as you ran these kinds of tests. In all of the cases except that one, we were able to work through the issues with the relevant vendors. Yes.
L: The description of this presentation mentioned that HBase was also going to be part of the discussion. Did you do that? Number one. Number two, I look at the size of the Aerospike data set that you produced. When you talk about big data, I'm thinking terabytes and terabytes. Like the gentleman who asked earlier, these tests are on SSDs. I cannot do that. In NoSQL I'm thinking terabytes and terabytes, because that much SSD is hard to get; my I/O would kill me from that perspective. So, did you do HBase, or, in the paper that's on the website, do you have HBase also? Because that would be a true comparison, in my opinion.
Ben: All right, so I think there were two questions there. One is, did we do HBase in our tests. The other one is, is there something . . .
L: About the SSDs, because if I look at the utilization in this Aerospike sheet, it's pretty small, I mean it's just two gigabytes.
Ben: Yes. The second question is about SSD size; we used probably half a terabyte of SSD capacity. The first question, did we do HBase? No, this was part of the original grand plan before we decided to simplify things down. As a company, we do use HBase, and we're actually in the process of trying to quantify it along these same dimensions. HBase itself is built on top of the Hadoop file system, which leads to all kinds of performance issues, and there are companies like MapR Technologies that try to put it on top of a different file system to get better performance.
These are all things we're actually working on for our next report. So, the short answer is no for HBase, but we have some anecdotal evidence that you're not going to see numbers like that. On the SSD size, again, I'm going to have to plead poverty. Half a terabyte of SSDs is expensive, but terabytes and terabytes of SSDs is even more expensive. So, we put in 500 million rows, relatively small rows, for two reasons. One, to get a lot of rows in while still fitting in those SSDs. And two, network bandwidth. When you're talking about hundreds of thousands of transactions per second, it's pretty easy to saturate your network and not learn anything interesting.
M: Just by the way, suggestion wise, VoltDB would be a perfect example to do [inaudible]
Ben: The question is, did we look at VoltDB for this test? We certainly looked at VoltDB. Using VoltDB as a key-value store didn't seem like a prime use case for it. Using VoltDB in a more sophisticated comparison based on secondary index queries versus SQL is an interesting idea. VoltDB, for those who don't know, is a distributed NewSQL database that allows you to run much of ANSI SQL, and run SQL queries across a cluster, if you relax a few things. It just wasn't the best fit for this piece.
N: You mentioned cost effectiveness? When was the last time you looked at the cost effectiveness? Because I'm thinking more from the perspective of the cloud. So this is a B2B or an ad server model, where you're sending the data, requiring terabytes and terabytes, exabytes of data.
O: With exabytes, we probably cannot do it. With a good amount of terabytes, you can do SSDs very fast; you'd be surprised.
N: I'm just thinking of the mainstream case of capturing everything and anything. I wanted to use it for market research, market kind of research, for ads. I was just thinking of that part.
Ben: I would think of it this way. If you have a large amount of data and you need answers extremely quickly, you really have two choices. You either use rotational drives and put as much of your stuff in RAM as possible, so that you're rarely hitting those drives, or you put it on SSDs. If that's not your use case, and you just have tons of data and latency isn't the problem, that's not the question we're trying to answer here.
N: Okay, that’s alright. Thanks again.
Ben: Yes.
O: This is for people who need sub-second, consistent response, high speed in general. That's not the case for everyone at this meeting, you know, who has to fight with databases.
P: So, I realize you didn't do this on EC2, but how hard would it be to set these systems up for SSDs there?
Ben: The question is, how hard would it be to set these systems up with SSDs on EC2? The answer is that it's not hard to set them up. EC2 has high-I/O instances that you can, you know, play with. But EC2 has other properties that make some of these things complicated to analyze. Because it's virtualized, there are sort of unpredictable latencies and so forth. We are doing a sort of EC2 follow-up, but when you're studying, you know, low-latency performance, EC2 is not a natural place to start, because it's got unpredictable latency. So, believe me, it would have made our lives a lot easier if we could have done EC2 for this test.
P: But, in practice, I think most of us are going to be using EC2. For many of us, anyway.
Ben: Well, as I said about larger cluster sizes, you know, it's on our radar. We're working on that right now. We're using EC2 for that, but I don't have meaningful results for you yet.
R: So you spoke earlier about having to do a lot of configuration, and it's often a source of problems in comparing these different systems. How would you recommend an engineer go about doing that, if you didn't have access to the partners, or direct access to. . .
Ben: The question is, given that we do a lot of configuration, and given that we talked trash about earlier studies, I'm paraphrasing, how do you actually configure these systems? This is one of the reasons why we put this together. We go into extreme detail in the papers themselves about how we set these systems up. So, that's a good place to start.
The vendors themselves will help you. If you're installing Aerospike, and you talk to Aerospike and say, 'How do I get better performance using this kind of hardware?' they're going to come in and help you, because this is their bread and butter. For Cassandra, there's just a ton of free documentation on their site: what parameters to relax, in their experience, for EC2 and so forth. And Couchbase has a lot of information on how to size your cluster and so forth. So it's there, but it takes some work. I hope that the papers we released give you a head start on getting there.
S: A similar question, about the time it took to learn and get operational with each of these four stacks. It's probably a good thing that you didn't throw HBase in there, otherwise it would have been crazy, but were there any learnings there?
Ben: The question was, were there learnings on the amount of time it takes to learn these databases well? From our perspective, that was the biggest learning curve. We were, like I said, wildly optimistic. We were like, 'We'll be done in a month,' you know, put a couple of engineers on it. And it took us like three and a half months to get that first paper out, and that was after dramatically reducing the number of databases. So, the big learning is, gee, trying to get fair answers is really hard. Luckily, you probably won't have that exact problem. You don't care if it's fair, you just want to learn your individual database, so divide that by four, at least. The other learning from our side is, be prepared to throw out the majority of your data, which is painful, but you've got to do it. So, we did it. We gathered a lot of data, understood why that data was wrong, and then moved on from it, and it's not here.
S: Oh, you mean in the sense of many intermittent tests that would fail or were inconsistent for one reason or another.
Ben: A lot of things like, 'Gee, that looks kind of funny. Why would that be the case? It doesn't make a whole heck of a lot of sense. Let's investigate on a deeper level what's going on here in terms of replication.' 'That's probably not what you want to do under high load.' 'Let's try messing with this a bit.' 'Oh, that makes more sense.' There was a lot of that going on.
S: I've got another question. Do you have any experience, cursory or otherwise, with some of the relational databases that are trying to bridge the gap, like Postgres adding native JSON support and internal JavaScript support? The benchmarks seem to be fairly close with at least some of them. If you don't have that much hardware and you want to consolidate and pick one, does it make sense to just use a relational database in that kind of use case, or is that not a good use case?
Ben: The question is, does it make sense to use a relational database for this kind of use case? Well, that's the question, right? And the answer is, it's certainly not crazy. In fact, before NoSQL was a thing, that's exactly what we did. You know, a client would come in, and their SQL databases were sort of collapsing, and they were joining everywhere, and we just worked with the application and denormalized, and started storing things in key-value-like ways, and that made it much, much easier to shard the data to multiple servers.
You can use replication to go across. In fact, this is what, you know, Twitter does, for example, if you've looked into something like Gizzard and FlockDB. Have you looked at those at all? So, what these are, is, you know, Twitter basically wants to maintain a graph, all right, who's following whom? It just writes, essentially, single key-value records into lots and lots of MySQL instances, and then has a layer to distribute and balance that load. It's a totally reasonable thing to do. I would argue that the NoSQL route is a lot easier, because these databases have all these features built in.
We did not have to write something like Gizzard to get these clusters to come up and down when we messed with a node. And, when you do something like that with a relational database, you need to frame your operations fairly carefully. As an example, you know, Twitter has a fairly simple use case: they have one-directional relationships between people. Facebook, also using MySQL, has bidirectional relationships. You friend someone, they're your friend back. That means every write needs to go to two different places, and those might be inconsistent, and they have to handle that eventual inconsistency themselves. So, they have repair processes running, trying to find inconsistent data and get it back into sync. That's a lot to deal with.
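A toy sketch of that manual-sharding pattern, not from the talk: route each key to a MySQL shard by hash, and store a bidirectional friendship as two writes that may land on different shards. The shard names are placeholders and the prints stand in for real SQL writes.

```python
# Toy sketch of manual key-based sharding over MySQL.
import hashlib

SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2", "mysql-shard-3"]

def shard_for(user_id: str) -> str:
    """Pick a shard by hashing the key, as a routing layer like Gizzard would."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

def add_friendship(a: str, b: str) -> None:
    # A bidirectional edge is two writes, possibly on two different shards.
    # If one succeeds and the other fails, a background repair process has
    # to find the inconsistency and bring the edges back into sync.
    for shard, row in [(shard_for(a), (a, b)), (shard_for(b), (b, a))]:
        print(f"INSERT INTO friendships VALUES {row}  -- on {shard}")

add_friendship("alice", "bob")
```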
S: And if you were to do that, even if it's stupid, do you see similar performance between the two?
Ben: It's not stupid. The question is, is there similar performance if you do that? We do that as well, with manual sharding, but we have not gotten performance like hundreds of thousands of operations per second doing it. We have gotten performance of, you know, more like ten thousand. Those systems have been in production. They work fine, but they're a pain to manage.
So, for instance, we have an application which, again, if you're familiar with Facebook, is like making your social graph, and we have to hit the shards in the right way. Each shard is using standard master-slave replication somewhere else. Now, we want to change the schema somehow. Do we break replication, update it in one place, update it in the other? We just opted to break replication. It's a lot of maintenance. The advantage is, it's a very well understood technology. So, you're not going to hit that tippy-top level of scale, but it might be enough scale for what you want to do. I would seriously consider the cost of maintaining such an infrastructure, though.
S: Thanks for looking at me and stopping. [sounds like]
T: The question was, is it crazy to use a relational database the way we use a NoSQL database? So, in Will's presentation, we were talking about a NoSQL database and the NoSQL layer around that. The fact is, in our system, we actually use a relational MySQL database as one of the components working in the same framework. Actually, that allowed us to start the project much faster, just using the developers' experience with developing regular SQL databases. So we just used that experience and started faster, and we're absolutely ready to switch that product to a NoSQL database. The way we're using MySQL now is the NoSQL way, so it's absolutely not crazy, and it absolutely works.
You can even have in-memory slaves. These things you can do. Should you? Very rarely, but sometimes.
Ben: Yeah, sometimes.
U: What kind of performance differences do you see between the relational and NoSQL?
T: It’s pretty good. We’re not talking, here, about hundreds of thousands of transactions per second, but it’s sufficient for the task.
U: Okay.
T: Right. So, thousands of transactions per second is pretty good. But the table structure is very simple. It's key-value storage, pretty much the same kind of usage, just in MySQL databases.
U: Being able to see the SQL data is actually one huge factor; you can actually see what's going on there,
S: so that's what makes it a good start, just to develop your application against it, and, you know, verify the data, did it even make sense? [cross talk] And then you can swap it for higher-performance engines.
V: I have a question, just hearing about the SQL and NoSQL thing. One of the things I was recently looking at, I can't remember the name right now, but it was developed by Stripe, the company that makes online payments. They created a project that tails MongoDB's logs and replicates them to Postgres, so that they have their back-end database and can do all the analysis and all that. They use MongoDB as a front end for data, and then they keep replicating it in real time to Postgres, so that they have the back-end data.
[cross talk]
Ben: Essentially, what he mentioned is that there's a company that produces something that can take MongoDB replication logs, or MySQL replication logs, and replicate them to Postgres or other databases. That's what's traditionally called polyglot persistence. You know, we can talk about some of the disadvantages of NoSQL, or maybe the advantages of SQL databases, which is that SQL databases are very, very good at running arbitrary queries, joining against things, and so forth.
These are facilities that you give up when you move to NoSQL, and it's what you trade to get some of this scalability. So, you know, most organizations can't live without that, in the end. When we're talking about these kinds of things, if we're talking about hundreds of thousands of transactions on the front end, you're not doing that in a SQL database, and you're certainly not joining. But you do want to analyze that data, so at some point, you know, most everyone I know is taking that data off the front end, bringing it into some kind of analytical storage, either a SQL database or a data warehouse of some kind, and analyzing the data there. It's extremely common. Most people I know, as a matter of fact, just use sort of home-grown solutions to pull the data out and write it in, but I'm sure there are a ton of products that do it as well.
T: Right. So, probably the best way of doing that, we've found, and we've looked at this extensively, when you have this huge dataset that's very, very highly available and very [inaudible] with low latency, and at the same time you need, you know, analytical storage, is to stream the data as you write it to the NoSQL engine, so it's being fed, you know, through your real-time processing system into your data warehouse. [inaudible] Otherwise, you have to extract a snapshot and save it, which takes a long time, and by the time you're done, it's already changed, so it's very, very painful. So, a little plug: in a month, we're going to be hosting a Vertica meetup here in this office. Vertica is a very hardcore and very fast SQL database, you know, a data warehouse type management database. So, we actually use both [inaudible].
W: Is that what you use primarily? For SQL, do you use MySQL and Vertica, or just Vertica and [inaudible]?
T: MySQL and Vertica. Vertica is good for [inaudible] operations, and you can’t do the transactional stuff there, so anything that has to have transactional-like consistency, we use MySQL.
X: Have you looked at Hadapt?
T: Yes. We have nothing to do with that in the current system.
X: No, Hadapt. H-A-D-A . . .
T: Ah, anything that has an H in its name . . .
[laughter]
Ben: I'm going to resume the presentation, but I presume the next slide is asking for questions. Not quite, but this is probably stuff we talked about. Network throughput was a serious problem in a lot of the tests, I don't know if you recognized that, but it's a problem we had. Another thing is, when we created these failures, we tried to create them a lot of different ways. We did network split-brain, we killed processes rudely, we killed them nicely. We didn't learn too much interesting there.
These applications tended to work unless you did really bizarre things, like one-way network failures. But, it's a lab, it's not the real world. I'm sure you're going to find ways to break these; we just didn't find them. And I already mentioned some of the issues with trying to track replication delays and so forth. Again, I think some of the questions addressed this. We do need to evaluate larger clusters. We're doing that on EC2. The results won't be directly comparable, because we don't have the same latency, but at least we can do larger clusters. And then, we're going to move beyond key-value storage, which is another whole can of worms that we have to find the right baselines for, but we're researching it.
Y: I have a question.
Ben: Sure.
Y: Have you met anybody who's using EC2 as a disaster recovery environment? Where they'd accept lower performance, but, you know, somewhat consistent behavior, if they had to bring it up? So, it's not their main production, but, you know, they can get it to a point where it still works.
You don’t get 5 milliseconds, you get, you know, 15 milliseconds, but it doesn’t die on you?
Ben: The question is, do we know someone who is using EC2 as sort of a hot spare in production? I don't, but I know that, for instance, Netflix runs their whole infrastructure off Cassandra on EC2.
Y: You can’t do that, but the question is [inaudible]
Ben: I don't know anybody who does that. It sounds reasonable, but I don't know anyone who does it. So, if there is a specific database, I heard HBase, which we're already looking at, but just let me know and we'll try to get it in. So lastly, again, with Thumbtack, this is what we do. If you want the presentation, you can download it. It's slightly outdated; I'll update it soon. The papers are available at the other link. They're long, but they have a lot of detail in them. If you want to contact me, there seems to be only one Ben Engber on the Internet, and my handle is bengber for pretty much everything. Any last questions?
Y: Have you looked at Riak?
Ben: No. I am, personally, very interested in Riak. It's an interesting technology. It's something that almost never comes up in our conversations, so it kind of got deprioritized. I kind of pushed for Riak to be included in this study, but it just didn't make sense, because we just never hear interest in it. All right, thank you. Oh, sorry, one more question.
Z: Have you heard anything about some of these real-time aggregating tools, Google's Dremel, or [inaudible], or real-time data stores?
Ben: The question is, do I know much about real-time data stores like Google's Dremel? I already forgot the second one, because the answer is no. This is just leaving my area of expertise; I don't know too much about it. I know companies who are doing real-time analytics on top of Mongo, actually. Chartbeat does that, and a number of other companies do. But I'm not too familiar with the technologies you're asking about.