Original post with audio and slides is here.
Hey guys, you’re listening to g33ktalk. Today, we’ve got a talk by Ted Dunning from the Hadoop meet up in New York last night. Ted is going to talk to us about recent developments in Mahout. Stay tuned.
Ted Dunning: Okay, what I want to do is talk about news from Mahout, as if Mahout were a foreign land. Sometimes I think it is. I am chief application architect for MapR. Cool hat, huh? These hats are on offer. They go to employees— at least the black belt version goes to employee. And so, if you think you qualify for the black belt version, get in touch with us. We’re definitely hiring. You can get in touch with me a lot of ways. First initial, last name at maprtech. First initial, last name at apache dot org. Or, ted.dunning@gmail, ted_dunning @ Twitter. Any way you like. I try to answer everything, if anybody sends me anything, which sometimes means I work too late.
The slides are available tonight at Slide Share. There will also be a landing page that I’ll tweet afterwards, for other information that we have in addition to the slides. Hashtags, in case anybody cares about Twitter— I ought to get that out of my pocket right away, otherwise everybody will hear that going. Let’s do it.
So the biggest news in Hadoop is that 0.8 is coming. That means that the standard refrain on the mailing list, that’s fixed in trunk, will be at least moderated for a little while. There’s an awful lot of bug fixes. There’s some real significant speed ups- the QRD composition code that’s in the numerical linear algebra part has be sped up by 10x, and that has follow-on effects for net speed speedups of 2 or 3x in a lot of the large scale social analysis parts of the system. That’s exciting. Kind of. Kind of like numerical linear algebra always is.
I want to talk tonight a bit about Bayesian Bandits, which is a form of real-time learning, which is in real-time. It’s kind of cool. And super fast K-means. K-means is a form of clustering, and this new algorithm actually can be deployed in an online setting. You can do online learning about a data distribution, which means that, and the online means finite amount of memory, finite amount of time per data sample. So, you do not keep a history of all of the data in memory. You keep a small summary of that. And at any time you can get a high quality K-means clustering of all of the data. Add some more data- bang, you can get a new clustering of all of the data, including the brand new data. And this is kind of a novelty. K-means is almost never that way. So, this is kind of cool. This is why this algorithm is so fast, is because in fact it’s an online algorithm. And, I’ll talk about how fast.
And, in other news, there may be a new edition of Mahout in Action coming up. Manning is talking to the authors about that. That would involve a big revamp of the clustering and unsupervised learning section, and small fixes all through in terms of collecting all the code samples into a GitHub, where it can be kept up to date. And Japanese and Korean version are now released just in time for the new English edition now. But, who knew that that could happen.
So, tonight it’s Baysien Bandits and superfast K-means. Just all kinds of magic coming up.
So, in real-time learning- here’s an example, we have a product to sell. Some people here may be that way. It may even be from a website. We’ve all seen that. So, here’s our brand new wire frame for the very cool website that we’re building in order to sell bogus dog food, okay? It’s the best. It’s now handy one ton packages. You could buy five of them here. Now, we might want to put different pictures in there to make it pretty, make people want to buy it. Maybe it’s a picture of a golden retriever. Maybe it’s a happy family with a child and two parents. Maybe it’s just a pretty girl. Depends on who we’re appealing to, and we don’t always know.
So, which picture do we put? And at the same time, what tagline? There may be an unpredictable problem with the tagline we have there. It may not appeal to everybody. What’s the call to action? Buy five of them may be the wrong thing to say. So, the challenge then is how do we find the best design for this website? How do we- you know, cheesiness is not so good, it doesn’t always work. The best designers should be allowed to fail. If they are not wrong 90% of the time, you’re short changing yourself because if you force them to be right, they’ll be right about half the time and they’ll produce a very small fraction of the good ideas that they would otherwise. So, the important thing here is to get to where they’re wrong 90% of the time, so that we can kill the 90% quickly and we get net 10 times more good ideas. So, that’s what we’re going to do. We want those, those failure rates to be high. Freedom to fail, and that means freedom to succeed.
Now, there’s a problem here. Suppose we have five pictures, 10 taglines- because these guys were being creative- four calls to action, that’s 600 designs with different background colors. And if we just have 10 of these variables, suddenly we have 10 thousands, or millions of designs that we’re trying to test. Now we can’t get 500 conversions on every one of a million designs. We don’t have that much traffic. So we need to learn about what’s good, what’s reasonable, very quickly. It isn’t just websites. It could be online advertising, could be all kinds of things. And the key here is to learn very, very quickly. But we also have to explore. So we kind of want to do A/B testing on absolute steroids in real-time. We have bunch of different versions. We, we assign each visitor a version. We want to have a conversion, a behavior, something we’re trying to cause to happen. And we want to learn very quickly. And we don’t want to give the losers, which we do want to deploy in test, we don’t want to give them very much traffic. We want to [inaudible] that.
Now, before we go further, how many here are Bayesians? So, good, good. The goal for the next five minutes is to increase that fraction substantially. I have a coin. If I were to flip this coin, what’s the probability of heads? The man says 50%. That is not a very controversial answer. What do you think the probability of heads is? Okay. The consensus seems to be that the best estimate is 50%. Okay. Tails. Would you change your mind?
Audience Member: No
Ted: No. No. He has the strength of his convictions. I do it again. Tails. Would you change your mind yet? No. How many times tails would you change your mind? Tails. Okay? So, that’s three in a row. Now, here’s another question. If I flipped the coin already, before I asked if I was going to flip it, what would the probability of heads be? So, now I have already flipped it. It’s one way or the other. You, in the nice white sweater. What’s the probability it is heads. Now, that sounds just like what he says. I now look at it. I know what it is. What would you say? Can you say whether I would say the same thing? Would I say 50%?
Audience Members: No.
Ted: Why would we give different estimates of probability about the same physical event? Hmm? Because I know something you don’t know. Right? If I say it’s now tails, what do you think the probability is? Because you now know something that you didn’t know before, and so, does everybody agree that what you say is probability has to do- well that is so scary when I flip that into the lights- would you agree that your assessment of probability is based on what you know? Is that a fair bet? Okay, so our estimate of probability is based on what we know. It, it’s knowledge. Therefore you are now all Bayesians. So what is the probability of heads?
Audience Member: How many flips?
Ted: Uhh, four tails. Come on. Come on. Somebody.
Audience Member: Four times three times two times one?
Ted: Probability of heads now. Fifty percent the man says. Should always reserve a little bit for surprise. Okay. So you are now all Bayesians. Thank you. You have been anointed. Do not tell frequentists. They may become agitated. You don’t want that conversation. You may be Bayesians, but you are not prepared to face the foe. Okay. Now, the rest of the talk is open source. It’s not MapR. So, let’s go.
Here’s, mathematically, how we express I don’t know. So before I flip it, before I even really tell you it’s a coin, you have no idea if it’s going to be heads or tails. You, you could be all tails. I could be a cheat. Who would imagine that? It could be all heads. It could be somewhere in the middle. This is how we say I don’t know mathematically. The probability of the probability is equal for all possible values. Here’s what happens if we get five heads out of five tails, unlike what we did. Although we could have gone there. There’s a peak around a half, where we started with perhaps an assumption that all possibilities were equal. And now we see this. We might have started with a presupposition, a prior estimate that only 0, 50%, and 100% were plausible numbers, in which case now we know neither 0 nor 100 are plausible, because we’ve seen both, and the only thing left would be a peak in the middle. It’s a nice way of inferring things from evidence. Evidence and presuppositions.
Here’s two heads out of 12 throws. So, there you go. you got it. The idea is we have some range of possibilities, we assign probability to each one of the possibilities, and we adjust our thoughts according to reasonable inference models. And, in fact, we can demonstrate them for pretty weak assumptions. These are the optimal inference. So, even if you must masquerade as a non-Bayesian in polite company, you can now say, yes, but the inference is optimal. So, I merely pretend to be a Bayesian among bad people. So, there you go.
So, here’s another quick thing. So, how do we learn from evidence when we have to choose? And now, in the first case, with the coin, which it appears at odd intervals, I chose when to sample whether to sample when it would come up heads or tails. I chose the coin. But I could have chosen the other coin that you didn’t see. And then I could have chosen either one, and you could have learned about both coins. But what if I let you choose which coin you’re going to flip?
Or, perhaps, let’s pause for a short con game. So, we have now here three shells. Each has a probability of having a pea under it. And so, I’m going to nominate two recorders. You have paper out. You record the right-most shell. Anybody in the middle can record the middle shell, perhaps just in your head. You, in the orange. Remember how many tries, how many peas. Okay. You have paper, you record the one on this side. Okay, consensus. Which one should we try first?
Audience Member: Middle.
Ted: Middle, okay. Awe. He doesn’t have a pea. So, zero for one. Next?
Audience Member: Left.
Ted: Why? Why? We have no data. We have no evidence. We have some evidence that the middle one sucks. Zero for one. We have no evidence for the one on the left being better than anything else. Oh. Write down one for one. Okay, this is getting exciting. Now, which one now? Do you want the on the left, which has a demonstrated 100% record? Or the one on the right that has no data at all?
Audience Member: Right.
Ted: Yeah. Thank you. I needed somebody to say that. Oh. Sadness. So, two of them had a zero for one record. This is kind of slow. Get ready, we’re going to go a little bit faster. You ready? Okay. Did you get that? No. No you didn’t. Let’s go a little big again. You practice. More data. No, no. He’s not watching very carefully. Let’s, let’s help him out. Oddly enough, I arranged for a cheat here. Here we go. So, this is how many we’ve seen. 69% with the ones on the left, 20-30 something percent of the ones in the middle, the one on the right quite rarely. So at this point, with this much data, we’ve had a hundred trials, and these are the percentages. Which one would you lift? Would you never lift the right one again? Small probability for the one the right, more in the middle, more on the left. Is that what you’re saying? Yeah. You got it. This is the seminal theorem that was proven in the 30′s by a guy named Thomson.
Okay, so we’re coming back now. We’re done with the con game. That was not a con game. That was an honest experiment. Unlike the coin, which was a con game. So, we can’t identify winners and losers without trying them. We have to, as the wise man says, we have to try them, even if they look pretty bad. Eh, give them a few more tries. Most of the ones to the apparent winner. And we can never really eliminate a bad streak one.
Okay, so now you understand [ 0:17:08.9] bandits and Bayesian logic, so you know we could never identify with absolute certain we have a correct algorithm that always gives a tiny bit of traffic, because things may not be as we thought with the coin that vanished. And we keep trying the apparent losers.
So, my next question: is there an optimum strategy? It turns out there is. And it is an interesting one. It is based on Thomson’s work. The idea is that instead of estimating the probability, I forced you guys to say estimates. 50% was the universal estimate. But if you had been sampling from a distribution with unknown, just value, you would have said .1 or .9 with equal probability. Sampling is different than estimating. So, this algorithm samples the estimated values, samples P1 and P2 from these distributions. And P3, sorry. And then we pick the shell that has the highest sampled value. This is interesting, because what it does is it has a non-deterministic part at the beginning, but then all the rest is simple and deterministic. So it’s an algorithm that incorporates the non-determinism that’s required to do optimal sampling, but it has nice deterministic methods after that one sampling step. So, it’s as simple as a deterministic algorithm, and yet gives optimal results.
And, in fact, here’s a picture of how it does. The red one is something that’s commonly recommended on the web. It’s called Epsilon Greedy with a tuned parameter. And the bottom one is the Bayesian Bandit with no tuning what-so-ever. It learns all the tuning. And it kicks ass. It’s, you know, roughly three times better by now, at this point, which means that it is- the other one is diverging at a slope three times higher. So, it’s getting better and better and better relative to the other. Here’s a picture, a moving picture.
So, in this experiment, we have five coins, five shells, with conversions of probability that go from zero to one, along this axis. And you can probably just barely see these little arrows. And, the purple one has an abysmal conversion rate. It’s way over on the left. The red one here, and the nearly invisible yellow one right next to it have pretty good and nearly identical conversion probabilities. And so what this system is going to do is it will pick one of these to sample, it will sample it, and then it will use that feedback. Here in the middle is the bandwidth that’s being assigned to each option. The- up at the top is the posterior distribution as in the previous slides. And at the bottom is a magical number that we cannot really know. But since I built this small universe, I can know. It’s called the regret. This is how much worse is it doing than an oracular system that knows exactly the right answer every time.
And so if we do just a quick few samples, you can see the, the posterior distributions changing as it makes samples. And the purple and blue one have lost every time and they’re beginning to have a distinctly biased estimate. Over on, I don’t know, a dozen trials there. Green, or yellow, or whatever it is, is doing quite well. It’s almost perfect record. Red, which is the best, is kind of a 50/50 record. It’s only had a few impressions. If we make a lot more of these, you can see by here red and light green are the acknowledged front runners. And if you look down here at the bandwidth being assigned to them, red and light yellow, light green are being given almost all the bandwidth. The others have been almost conclusively determined to be losers at this point.
And if we go zipping along, red is in the lead here. The others are getting almost no bandwidth, even light green, well, light green comes back. I have to admit that I chose this for just this sort of a, you know, random streak, where it would come back, and so on. But you can see that the arrow bars on red are beginning to become quite narrow, because it’s getting almost all the data. The others have very large arrow bars, except that we know regardless of the arrow bars, they suck. That’s- the algorithm has killed them off almost entirely. You can see it doing just tiny bits of experiment on the losers here. Eh. Never say never sort of thing. And, you know, you’ll occasionally see blue take a tiny bump when it, when it updates. Op! Green just took a hop. And these two have become pretty well characterized to where it says red is like 90% and the regret is now nearly 0, because it’s only ever really giving light green any options, and then only rarely. And then, it only loses a little bit of regret. And so the system has learned, in this case.
But we can do even better. So, let’s move back to the presentation. Here’s the code. This is the code that generated that- Well, this is the code that computed the numbers that were drawn there. The code to draw that was considerably larger. But, aside from that, the idea there is you select the alternative by sampling here, in this line, from the data distribution, and here we select the highest, and we sample it. We, we ask the universe if it’s heads or tails, and we update our statistics. Updating the statistics is as complicated as counting. I can do that. Some days. And so this is really, really a simple algorithm. Simple, but the best. Go with it. Now, it can be more complicated, but the basic idea of we have some uncertainty, we encode that uncertainty in our further actions by sampling, and then somehow pick the best option based on those samples.
So, here’s our original problem: We had the website. We had several variables. We weren’t just picking the best of five websites. We were picking the best of 600 websites with all of these variables involved. So what we really want to do is say what is the conversion rate for each of these 600 websites, or nearly an infinite number if we have lots of variables. How do we pick that based on a parametric description of that? And so what we can do is we can do something like logistic regression, or a neural net. And so we have in this parameters, theta for each design variable, we have values x for the actual value of the designed variable in a particular design. And we’re going to model the probability of winning, or converting, or selling, according to this non-linear rated linear combination. So that the combination function is usually some soft threshold like that. This is sufficient to model all kinds of things. We may later want to model interactions variables as well. There’s- the other name for marketing is variable interaction. That’s- what marketers try to do is they try to come up with synergy, things that add to each other, multiply, propitiate each other, rather than just odd on.
So what we do then is we can sample from a distribution of theta instead of which one we want to convert directly. Then that defines a sample on our expected value for each option. We can now pick the largest and continue as we did. At the cost of that one extra sampling step, and then a conversion to conversion rate, and a more complex learning law, we get these results. And we can even have context variables. So, not just the designed variables of before, but we can say where is the user coming from? We can’t choose where the user’s coming from, but we might have different pictures for different locations: maybe in the Southwest, a rugged He-man sort of thing. In the city, more congenial domestic bliss would be better pictures. We can say what time is it? We can say what day of the week is it? We can say what about the week, you know, is it on a weekend? And we can still learn this parameters theta for all the extra variables as well as for the design variables. This algorithm is sometimes known as random probability matching, sometimes as Bayesian Bandits, sometimes other things, but now has been adopted by Google, by Yahoo!, by Microsoft over the least three or four years because it uniformly produces much superior results.
All of the code to implement this is in Mahout. This could easily be in the 0.8 release if we get enough feedback and participation of people who want this about how to integrate it. How they would like to see it. So this is kind of a call for action. You could make this happen. You either implement this, or you can come to the Mahout mailing list and shout out hey I want that! And this is how I could deploy this. If it doesn’t happen, it won’t be there. So, that’s the first big thing. Everybody cool with that? I mean, as newly anointed Bayesians and all that. You’re cool with that? Okay, good.
Second big feature is something that will be in Mahout in at least the batch form. Not necessarily in the real-time form, but the real-time form, I don’t what it would be used for, it’s so cool. Just other than that, I don’t know. But the batch form has a lot of use. The rationale for K-means is not necessarily a robust clustering. It is not necessarily that we get the same clustering every time we run it. It is not necessary, and this would be corollary- robustness would be a corollary. It is not necessary that we produce a clustering that matches human presuppositions. If I were to cluster a bunch of text, there’s nothing that says I should come up with the same twenty news groups that those texts came from. It might be kind of fun if it did, but that is not what we’re after here.
What it- we are trying to do is characterize a data distribution in a way that generalizes to unseen data, and which compacts, compresses the data, is an economical representation of the data. And if it is an economical representation that generalizes is just as economical for held out data. It is a good representation. That is all I mean by quality here. Partly because I could measure this. And the other part is really difficult to measure, and is in fact an extraordinarily hard thing to do. To recreate the last many thousand years of human culture and language development and come up with exactly the same answer as English speakers on the Internet, in the early 90′s, is much harder than just coming up with a decent set of variables. So let’s do the decent variables. Gold standards are not the issue.
Here’s an example. People say this often: This doesn’t look like something you should cluster, because it’s got the spirally things. And if I had two intertwined spirals, it’s a classic counter-example for spiral- for K-means clustering. But I think that really there’s missing a point there, because, remember we don’t- to put it rustically: We don’t give a shit about how many clusters we do. All we care about this quality estimate. How well does it quantize, how much does it represent the data? And this clustering into 20 clusters represents the data pretty well. If I tell you which cluster it’s in, there’s, you know, there’s the distribution, the error’s going to be that big. Which is pretty small because the diagram is not big, okay. And space is really, really big. So this is a really big representation in four and a skosh bits, I can get an error this big. And maybe if I went to 40 of them, maybe they would do some kind of fancy hopping back and forth, and I might cut the error in half. Of course, that would add one more bit, it would be five bits for not quite having the error that might be better for whatever purpose it would have. And in particular, these 20 clusters do a really good job of diagonalizing the data. If I just express the distance to clusters, cluster is on the vertical axis. And so point number 1,518 is close to two clusters, and really quite far from all others. So with only two or three distances, we can incur this data very, very compactly and very, very accurately. And moreover, if there’s something about this end of the tail, if there’s something about distance along the spiral, if it really is a one dimensional manifold with noise on it, we can model this data and produce a predictive model that is very accurate. Because the clustering has unwound that non-linear structure of that data into a very simple form. That’s where clustering really has high value.
And even if we have what really to most humans looks like four clusters there, we could throw a whole lot of clusters at it. And what we’re left with, and we aren’t going to see the weighting, so the clusters here in the middle of red are going to be much heavier weighted, because they have lots of members, whereas the ones further out are going to have far fewer, but what’s left is decidedly still a nice representation of the data. It records the data representation, the data distribution in a very nice way. It encodes it for us. And that is the basis of this very, very fast algorithm. That a lot of clusters, more than you think are necessary, gives us a very nice approximation of the data distribution. That’s the key intuition.
Here’s some theory. Theory requires Greek alphabet. So, here we go. The idea here is that data is well-clusterable if for some number, K, of clusters, the total sum of the squared distances is a lot less than the sum of the squared distances for a smaller value of K. So if we had five clusters there, and we clustered it into five clusters. Then, if went to find the best four clustering, we would have to put these two together, because that would cause the less, least error there. And now we have an error this big. Before we had an error this big. And so five clusters is distinctly better than four. Now, the reason that this is important is because people over the last ten years, in clustering, have a new insight. Before, people tried to find good clustering algorithms that would find good, robust, repeatable, matches human intuition, on any data. And that turned out to be insane. Because it’s a NP-hard problem, and it would only admit certain kinds of approximations. And it isn’t like anything we ever have to analyze.
So instead the theoreticians have turned to analyzing data like this, which they classify as well-clusterable. And that’s cool, because it has led to a whole new set of theory results which show that we can cluster this data to within an epsilon of the best clustering possible. The epsilon depends on sigma, there. And we can do it in reasonable amounts of time. That’s cool. And, it turns out, that real data isn’t quite this nice. But it isn’t so bad, either. So what works on this theoretical data- I’ll point this way. I see it there, you see it there. I’ll point this way so you’ll be less confused. This guy, his seat is going womp, womp, womp, womp, and that’s just not fair.
Anyway, on well-clustered data, these theoretical constructs work extremely well with very high probability. But, on practical data, they also work pretty dang well, in the sense of generalizable error. So, here’s some algorithms. Here’s, well, not so much an algorithm. Here’s how K-means breaks. What happens is we start putting seeds into the data somewhere. When we’re assigning everything to the closest seed, and we address the seed positions. This is called Lloyd’s Algorithm. It was in folklore until the 80′s, but it was invented a very long time ago. But if we put two seeds into this cluster here, and one seed here, and a seed here, and a seed here, what will happen is Lloyd’s Algorithm can never separate those two seeds. They will stay in that cluster, and the resulting cluster will have to do something like put these two into the same cluster and give us a crappy result. Because it has error like this instead of like that. This is how K-means breaks. And in high dimensions and a large number of clusters, this is almost certainly going to happen. K-means will almost always break. In two dimensions, you can do some fixes and things like that. In high numbers of dimensions with random assignment, it will almost always break.
Now, there is some heuristics, and one of the best is for ball K-means. What we do is we assign the initial centroids in a way that with high probability we get one seed in each real cluster. Not in certainty, but with high probability, we get one seed in there. And if it’s well-clusterable data, BAM!, we find the best, just instantly. In fact, the ball K-means algorithm only assumes we make one pass through the data after a very expensive initialization function.
So, in order to make the algorithm fast, we’ve now made initialization extraordinarily slow. Eh. Can’t have everything. And these are theoreticians speaking. But it turns out that ball K-means is practically speaking, a very good algorithm. But it’s not good enough. It has very high probability of doing a perfect job with two clusters. Everybody who wants two clusters out of their data need not raise their hand, because it’s everybody. Who doesn’t want that? And the probability of successful seeding drops exponentially with k. Eh. What- what’s the good of this. Even at 10, it’s like one chance in 1,000 of working. It’s almost as bad as K-means.
But, there’s an alternative strategy, which is great except that it’s cost has this K cubed term in there. K cubed, for- is the number of clusters cubed. If we want 1,000 clusters, that’s a billion. It doesn’t matter how fast we do it, that’s getting kind of slow. And if we want 100,000 clusters, it’s really big. We don’t even need to say how big. But, you know, like peta instead of giga. It’s really bad.
So this is not good enough for big data. But it’s good enough for kind of sort of big data, where we kind of sort of want a lot of clusters. Okay. That’s cool. That’s good. And for 20,000 documents on 20 news groups, this is good. It’s really good. Demonstratively better than anything in Mahout now.
So, the next trick. You remember how we can see the data structure for lots of clusters? So the idea is let’s do the lot of cluster trick, let’s just go leaping through the data and find a lot of really sucky clusters. We end up kind of slab-dash constructed. And the cool theoretical results are that this gives us a surrogate of the original data for which ball K-means will give us something very good, close to a very good clustering of the real data. Now we’re getting there. The reason that that’s cool is that the costs are low. So we need the surrogate to have K log n, where n is the number of clusters, surrogate clusters. If we guarantee that, then the rest follows through. And the cost of finding the surrogate is yadda, yadda, yadda. All that matters is this. Log of K. Log of K. And log log of n. That’s how fast that the per point sort of thing is. Well there is a d times that, the dimensionality of the data, or the average number of months parts elements. And the cost then of a kind of sloppy but pretty good ball K-means is like that. So, this logs means feasible. In fact, log log means damn near online. Small enough- think about it. A billion. The log of a billion is 30. The log of 30 is 5. Ish. There’s ishes all over these math terms. That’s how you pronounce big O. So, 5-ish is the sort of cost here. Whereas we were looking at billions, or we were talking, you know, a billion times 1,000 cubed before. This is better. This is much, much better. And even the final clustering there, d log k times log k, not so bad. If we had 10,000 clusters, log is 13. 13′s small. Log log of n is small.
In fact, again, if we just kind of blur our eyes and go to the punch line, lots faster. That’s how I pronounce 3,000. Lots faster. And if these are just the orders of magnitude better, it turns out we can implement this effectively, very fast to- and I’ll show how fast in a moment. Basically works- this is the sloppy clustering algorithm. New data point. We have old centroids for the surrogate. Find the nearest one-ish. If it’s within the current threshold, then depending on how much within that, we probabilistically assign this point to that cluster- that centroid. Otherwise it’s a new centroid. The number of centroids is going to grow. It’s going to happen. And so when it gets too big, determined on the fly by how many data points we’ve seen, we will recursively cluster the centroids, and if the number of centroids doesn’t collapse a lot, half, something like that, we will increase the threshold to make things stickier. And eventually the threshold will get big enough so that it either collapses or we don’t’ grow very fast. This algorithm is safer than the published one, and it is guaranteed with high probability to produce a surrogate that will give us good clustering.
Now, to implement this, we still have a problem. We still have to find the nearest centroid. Which, if the number of centroids is really large, because it’s a surrogate, that could suck, because we have to search everyone. But the theorem goes further and it says that we can do a sloppy search. And so, the- in fact the theorem says, if we just project the centroids down to one random direction, one dimensional search, that’s good enough for well-clusterable data. I am, as I said earlier to a customer, a belt and suspenders, and superglue kind of guy. That’s how I like to keep, you know, the wheels on the machine. And so we use four or more of these projections just to probabilistically get better ideas. And the basic idea here is that things that are close are going to be close enough projections. Things that are far apart, like this and this, have a decent chance of being far apart in the projection. And things that are far apart have a small chance of being projected to be close together. And if we do a lot of projections, we’re going to get a pretty good answer. We’re probably going to find the nearest- the actually nearest cluster.
Another way to do it, which turns out slower, and I was surprised by that. I think I’ve- I’ve done something sloppy somewhere- is we can take a bunch of projections. Say, in this case, 64 projections, random direction that way. So now, if I have another vector here, the dot products is positive with this first one because they kind of go the same direction, positive dot product, we re-encode as a 1. Negative dot product encode as a 0. New random direction. This original thing is still a 1. And this one is also a 1, because they’re both kind of toward the audience. Do that 64 times, it gives me a 64 bit number. If I exhort two of these numbers for two vectors, if they are nearly collinear, they will have the same sign in random projections a lot of the time. If they are nearly anti-co-linear, they will have opposite sign, 0 for 1 and 1 for 0 a lot of the time. If they are at right angles, they will have uncorrelated bits.
So, at right angles, that means that the cosine is 0, 32 is the most common XORed bit count. At -1 cosine, 64 is kind of where we are. And so this is a surrogate for finding the cosine between these and if we cash the lengths of these vectors, the cosine is all we need to compute the difference between those two. And so, basically one XOR and a few arithmetic options, we can take the place of an arbitrary dimension, dot product and subtraction, and things like that. So we can replace 1,000 floating point operations with a few projections, which are constant. We just do it once. So, that’s kind of cool.
But in fact, this does not turn out to be much faster. Though there’s 300 projections. Because we project so many fewer times here, this turns out to be faster, and the quality is identical in practice.
Here is some results. So, the first question is, does this submit itself to a parallel speedup? Because otherwise, who cares? Right? It’s going to one core thing is just not, not- it’s so 90′s. So, it turns out if I do a surrogate here of a bunch of clusters, and I do a surrogate on this data over here, I can just union those and I have a surrogate for the data together. It isn’t quite as economical as if I built the surrogate on both of them, but it’s not that much bigger. So, the parallel algorithm, take the trenches of data and do the surrogates on them, union them and take the clustering. Even I could write a Map Reduce like that. We could even do it in PIG, which is great. And on one machine we could even thread it, trivially.
Here’s the results: So, for one core, about 200 microseconds. For ten dimensional data, this is a million data points. So, 200 seconds measured is 200 microseconds per point. Two quarters, 100. Four cores, 50-ish. And the gray line here is perfect speed-up. The red and the blue are two different experimental ones, so there’s not much variation. Once we get up to around 12 to 14 cores, it means we’re probably running into the concurrent garbage collector, or the operating system, audio threads, things like that. So we quit being linear speed-up at that point, but up to that point we’re within a small factor of linear speed-up.
And this applies to MapReduce as well. We could run this on a thousand machines with 16 cores. And at the fastest here, we’re at 20 microseconds per data point for that same speed-up across 1,000 machines. We should be able to cluster a billion data points in 20 seconds plus not produce overhead which is bigger than 20 seconds. This is cool. Record coming up, I’d say. So, we have parallel speed-up.
In terms of quality, the ball K-means compared to the current canopy and mean shift Mahout algorithms gives uniformly better results on 20,000 news group articles. Okay? So ball K-means is good. It’s better than what we have. Streaming K-means to produce a surrogate plus ball K-means on the surrogate produces identical results as far as we can tell to the ball K-means alone. Wow. These theoreticians got it right. That means that the streaming surrogate thing that is so fast to do produces a surrogate that gets us clustering that’s just as good as the ball K-means alone. And both of them are better, either ball K-means alone or streaming followed by ball K-means, are both better than the existing algorithms, which are slightly sophisticated initializations for seeds.
All of these are on small data sets. But that’s the only way we can compare ball K-means against the others. This is with 20 and 100 clusters. But it does give some pretty persuasive evidence that this is a good clustering algorithm. And the figurative merit was mean distance to nearest cluster and median distance to nearest cluster, looking for outliers and things like that. And it is just uniformly better.
So these are some of the things that are coming in Mahout. I’m happy to have questions. I’m happy to have questions later, or now. Some people are shy. Either way. If you’re shy, I’ll call on you now, and torture you. That- that’s fine. This is all open source software. MapR does a mixture of open source and close source software. We try to get the best of both worlds. And we also got the line there, if there are university people here, we do often make our software available for free for research use. You can get this code as part of the trunk on Mahout, or for GitHub. And I love for feedback. Real data, real application is always the best test of these things. So, share the news. Share the love. Tweet. Whatever.
Audience Member: Bayesian Bandit, you call it real time, like does it have a way of accepting your stream of data in, or?
Ted: So the question is, is Bayesian Bandit really real-time, or is just kind of sort of real-time, like everybody pretends is real-time.
Audience Member: Is it, it’s supposed to run on a Hadoop cluster?
Ted: It doesn’t need a Hadoop cluster. It could run on my iPhone. We could do it, but no. It’s much more appropriate in something like a storm cluster, because it accepts feedback in various ways. It’s a- it’s a real-time, it is a real real-time algorithm, because the data is finite. The sufficient statistics are just counts. And the delay- the feedback can be delayed arbitrarily. And so finite memory, very simple, straightforward finite operation is the recipe for real-time. And it is really real-time. Does not need Hadoop to run. Mahout is not about Hadoop machine running. Mahout is about scalable machine learning. Bayesian Bandits are scalable. So, it fits right in. If anybody wants it. Is that a fair question? Fair answer, I mean. It’s a great question.
Anybody else? Come on. We can do better than this. Let’s do it. I’ll assign questions. You, in the white. You, you answered about the coin. Nominate somebody to ask a question. I can tell you didn’t want to ask one. Okay. Somebody else, save this poor lady. Yeah? Man who’s scratching his head. Oh, okay. Somebody back there. Back there. Yeah.
Audience Member: You mentioned Epsilon Greedy, would you compare that to A/B Testing?
Ted: Compare Epsilon Greedy to A/B testing. No. This is a confusion of terms. A/B testing is a general framework for having multiple versions of something and testing it. Many times people use a general term for testing stuff. As a surrogate term for frequentists tests of significance type of testing. Which, is silly. And it is. It’s just silly. It’s- it wastes- it wastes time and it makes it very costly to make mistakes. And so if you want to compare Epsilon Greedy to t-tests and things like that with Bonferroni corrections, there- the t-test is, is abysmal. Just uniformly abysmal. And Steve Scott from Google has a great posting on this recently where he compared t-tests with, well not with Epsilon Greedy, but with Bayesian Bandits, and shows that just substantial over-waste of data with the traditional t-tests.
And so, yeah, there’s a distinct advantage there. Even if you’re doing epochal testing where you, you make a decision every day about how to continue. The Bayesian Bandits give you a very, very natural way to do sequential testing and early termination. They’re beginning to be used in clinical trials for that exact reason. Because if you continue with the test, somebody is going to die. And they probably have an opinion about that. Certainly whoever will sue you has an opinion. So the Bayesian Bandits and variants of them are becoming the new standard of care in clinical trials for exactly that sort of reason.
Now, Epsilon Greedy has been proposed as a super simple algorithm for solving Bayesian Bandits, and with the right magical tuning of epsilon, it can come with an inconstant. But in every practical application, I’ve never seen it come within a small constant of optimal. Most people implement it poorly and don’t understand it. And if you have a large number of options, t-tests- well, like with a parameterized test or contextual test- t-test and Epsilon Greedy become infeasible. They just- instead of just wasting five times as much data as you need, you now waste 10^5 times more data than you need. And that’s why rpm or Thomson sampling, the Bayesian Bandits is now the standard in ad targeting across all the major advertisers. Is that a fair answer?
Audience Member: Yeah.
Ted: Who is this? Yeah.
Audience Member: Hi. I had a question about Bayesian Bandits. Most specifically the five areas which you showed. There it was assumed that there conversion rates were static?
Ted: Oh. Very good question. I shouldn’t even repeat it, it’s so good. So, the question is, in this example that I gave, which I didn’t let it finish, but, but we should, the five areas were assumed to be static. You’re right. This is a toy example. And in real life, they are not. Shit happens. The world changes. 2008 happens. 9/11 happens. The world changes. Weekends happen. Or whatever. Summer vacation starts.
So, the way to handle that here- there are many ways of handling this. But one simple way- very, very simple way- is if you had a simple Bayesian Bandit like this, you could have it forget. So instead of adding, you could take the cold count, multiply it something like .99 and add some fraction of the value you have. So the values could get up to like 100, and then they would kind of stop there. And if things changed, you would have a pretty strong prior, within 100 or so experiments, you would begin to see a radical shift, and over a longer period of time you would converge on the new optimum. And it would give more credence to the possibility that horrible examples were still good.
Now, there’s another way to do that, because here we have five options, and we have nice convergence properties. These five options do not have to be the actual thing we are testing. They could be algorithms for doing the thing that we want to do. So, each one of these five options could be itself a Bayesian Bandit, which was started at different times. Say, say exponentially gradated times. So one would be the wise old elder Bayesian Bandit. Another one could be a much younger Bayesian Bandit; rambunctious and exploratory. It had no idea of the wise and the elder. But if the world changes, that naive one will perform much better quickly than the old one who has a very, very strong presupposition about the world. Now, these Bandits should perform kind of similarly, but the young one should always perform a little bit worse in a static world. But that can change very quickly and dramatically if the world changes interestingly. And so the meta-Bayesian Bandit can have a pretty short forgetting period, much shorter than the long bottle. So it doesn’t have to have as long a period as that. It can have a much, much, much shorter forgetting time. All it has to do is, eh, this one guy really knows what he’s talking about. It has to come to that conclusion pretty- whoa! He’s an idiot. You know that- just as long as it makes those decisions reasonably quickly.
Now there’s knobs. I hate knobs. There’s a knob there for how long is the basic decisioning time. There really isn’t a knob for how long the different Bandits run, because you run them at many scales. And you just kind of- the longest is whatever the longest plausible thing is, usually all of your data. And the shorter one is about as short as it can be. You know, several examples. You can’t imagine learning anything faster than that. And so, the one knob is how fast is the meta model forget. That’s the simplest way that I know of to do this. There are other adoptive things. There’s a long history of radar tracking and Kalman filters and things like that. But the basic intuition is if you make errors that are uncorrelated over time, you probably got a good model and should start tightening down on the model. If you start making correlated errors- and what correlated errors means here is not a clear thing. But it could be related to the entropy of the new examples. If, if weird shit starts happening, that’s equivalent to correlated errors. Weirder than average stuff. That can allow you to adjust the forgetting rate on a model like that. That’s much more subtle. This other way is much more direct, and so I think it’s easier to get right. But that was a good question. I’m sorry I didn’t mention that earlier, except it gives you a chance to ask a good question. Yeah. Cool.
That man stands up. He’s scratching his head, he’s standing up. He must secretly want to ask a question. Well I think I’d like to- Oh, yeah?
Audience Member: [inaudible]
Ted: The question is, how do we handle these interaction terms in some pseudo-principled way that we could argue to our friends who work in statistics still. Yeah. So. He didn’t say all of that. The right way to do that is we have some experience. Even if it’s a start-up, we’ve done some things. We’ve done some ads, we’ve done some placements and designs. That tells us- you know we’ve seen how those behave. That puts bounds on what’s plausible. It’s implausible to have an advertisement that has 10^-5 conversion rate. It’s implausible because even spammers do better than that. And it’s implausible to have one that has better than 10% conversion rate. Ehhh. You know 20-40 or whatever your implausibility bounds are. So that can define our prior distribution. But similarly, in a design sort of thing, we have experience about how big a difference interactions can be. Certainly, they can make things twice as good, if we do two things that match.
I mean, based on interactions, suppose we put the headline in Danish and the call to action in Swahili. Well this is, there’s an interaction variable there. Yeah. And I think it’s probably pretty big. The people who understand the headline probably don’t understand the callt o action, so probably better to have them in the same language. So there’s an interaction variable that’s probably a pretty big effect. But it’s probably not 100x, because the people who guess what the button at the bottom means just by cultural context on the internet. It’s probably big enough that the Danish speakers can pick the African button and vice versa. Or there’s Danish speakers of Swahili. You know, I’m guessing. But you get my point.
So you can put a plausibility bound on how big the interaction effects are. We also know that, based on how absurd my example was, that’s probably pretty rare. So we cannot, again, tighten those bounds down. We can say how big an interaction effect would we just consider like a 101- 100 to 1 sort of chance. It’s not very big. How small could it be? Well, it could be 0, or negative. But again, probably kind of symmetrical.
Andrew Geldman has written a very nice paper which goes through a lot of that sort of reasoning to come up with a fairly universal kind of informative prior for this kind of fitting. And he has a nice recommendation for that. Of about, you know, a t-distribution with one or two, or two and a half degrees of freedom. And that works pretty well, because all that’s doing is saying that we have kind of a few samples, like five or 10 or 20 samples worth of history. And that’s what- how good our experience is. That’s kind of- all we have to do is offset the counts to include these priors. So it’s very simple to implement. And the intuition is that if I were to give you experience on one system, and then give you actual data on another system, how much of that first system could I give you that would give you kind of useful information, and how much would I give you before it harms you. That’s the size of the prior we want. So that’s the mechanism for picking that. Just think about how implausible things are, or think about how much experience is generally useful, and how much is counter-productive. Making you too wise hurts to handle real data. Until you’re stuck, anyways. So that’s how that is.
Now, there is a procedural problem here. I didn’t get any pizza before the talk. So what I’d like to do is, unless there’s somebody with an urgent question, I’d like to adjourn and socialize. Meaning, I get to eat. Does that sound good? Thank you.