MapReduce - tech talks, tutorials, presentations, and community

In this talk, “How RethinkDB Works,” Joe Doliner, Lead Engineer at RethinkDB, will discuss the value of RethinkDB’s flexible schemas, its ease of use, and how to scale a RethinkDB cluster from one to many nodes. He will also talk about how RethinkDB fits into the CAP theorem and about its persistence semantics. Finally, Joe will give a live demo showing how to load and analyze data, how to scale out the cluster to achieve higher performance, and how RethinkDB handles failure when a node is destroyed. This talk was recorded at the meetup at StumbleUpon offices.

Bio…

November 8, 2013
RethinkDB, Video
Distributed Database, json, MapReduce, RethinkDB

QAing New Code with MMS: Map/Reduce vs. Aggregation Framework

(Contributor article by Alex Giamas, Co-Founder and CTO of CareAcross. Originally appeared on the 10gen Blog.)

When releasing software, most teams focus on correctness, and rightly so. But great teams also QA their code for performance. MMS can also be used to quantify the effect of code changes on your MongoDB database. Our staging environment is an exact mirror of our production environment, so we can test code in staging to reveal performance issues that are not evident in development. We take code changes to staging, where we pull data from MMS to determine if feature X will impact performance.

As a working example, we can calculate views across a day using both Map/Reduce and the aggregation framework, then use MMS to compare their performance and how each affects overall DB performance.

Our test data consists of 10M entries in a collection named views in the database named CareAcross with entries of the following style:

{
  userId: "userIdName",
  date: ISODate("2013-08-28T00:00:01Z"),
  url: "urlEntry"
}

Using a simple map/reduce operation, we can emit a 1 for each document and sum these values to get the number of views per userId:

 db.views.mapReduce(function () {emit(this.userId, 1)}, function (k,v) {return Array.sum(v)}, {out:"result"})

The equivalent operation using the aggregation framework looks like this:

db.views.aggregate({$group: {_id:"$userId", total:{$sum:1}}})
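To make the equivalence of the two commands concrete, here is a minimal sketch that emulates both on a small in-memory array in plain JavaScript. The helper names (emulateMapReduce, emulateGroupCount) and the sample documents are hypothetical, and this is not how MongoDB executes either command internally; it only illustrates that both approaches produce one count per userId.

```javascript
// Hypothetical sample documents shaped like the entries in the views collection.
const views = [
  { userId: "alice", date: new Date("2013-08-28T00:00:01Z"), url: "/a" },
  { userId: "bob", date: new Date("2013-08-28T00:00:02Z"), url: "/b" },
  { userId: "alice", date: new Date("2013-08-28T00:00:03Z"), url: "/c" },
];

// Emulates map/reduce: the map function emits (key, value) pairs,
// values are grouped by key, then reduce folds each group into one value.
// Assumption: emit is passed as an argument here, whereas the mongo shell
// provides it as a global inside the map function.
function emulateMapReduce(docs, map, reduce) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  for (const doc of docs) {
    map.call(doc, emit); // `this` is the current document, as in MongoDB
  }
  const result = {};
  for (const [key, values] of groups) {
    result[key] = reduce(key, values);
  }
  return result;
}

// Emulates {$group: {_id: "$userId", total: {$sum: 1}}}:
// keeps one running count per userId.
function emulateGroupCount(docs) {
  const totals = {};
  for (const doc of docs) {
    totals[doc.userId] = (totals[doc.userId] || 0) + 1;
  }
  return totals;
}

const mapReduceCounts = emulateMapReduce(
  views,
  function (emit) { emit(this.userId, 1); },
  (key, values) => values.reduce((a, b) => a + b, 0)
);
const groupCounts = emulateGroupCount(views);

console.log(mapReduceCounts);
console.log(groupCounts);
```

The map/reduce version builds an intermediate list of values per key before folding it, while the group version keeps only a running total. That difference in intermediate work is one reason the two operations can behave differently under load.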

The mapReduce function hits the server at 18:54. The aggregation command hits the server at 19:01.
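MMS shows the server-side impact of each operation. As a complementary client-side check (a different technique from the MMS comparison in this article), a small wrapper can record when each operation starts and how long it takes. The sketch below rests on assumptions: timeOperation is a hypothetical helper, and the two workloads are stand-in functions rather than real database calls. In the mongo shell, one would instead pass functions wrapping the mapReduce and aggregate commands.

```javascript
// Hypothetical helper: runs `operation` and returns its label, start time,
// elapsed wall-clock milliseconds, and result.
function timeOperation(label, operation) {
  const startedAt = new Date();
  const result = operation();
  const elapsedMs = Date.now() - startedAt.getTime();
  return { label, startedAt, elapsedMs, result };
}

// Stand-in workload (an assumption, not a database call): counts up to n.
function countTo(n) {
  let total = 0;
  for (let i = 0; i < n; i++) total += 1;
  return total;
}

const runs = [
  timeOperation("stand-in A", () => countTo(100000)),
  timeOperation("stand-in B", () => countTo(200000)),
];

for (const run of runs) {
  console.log(
    `${run.label}: started ${run.startedAt.toISOString()}, took ${run.elapsedMs} ms`
  );
}
```

Client-side timings include network and shell overhead, so they are best read alongside the server metrics rather than in place of them.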

If we compare these two operations across our data set we will get the following metrics from MMS:

More…

In this talk, Abhijit Lele from Hortonworks discusses YARN architecture and how to get started developing for the next generation of Hadoop. This talk was recorded at the New York Hadoop User Group meetup at Gilt.

Hadoop 2.0 is approaching. Abhijit talks about a defining characteristic of Hadoop 2.0: its next-generation resource management framework, YARN. YARN enables Hadoop to grow beyond its MapReduce origins to embrace multiple workloads spanning interactive queries, batch processing, streaming, and more.

Slides…

September 10, 2013
Hortonworks, Video
big data, Hadoop, MapReduce, YARN

In this talk, Jeff Magnusson, Manager of Data Platform Architecture at Netflix, discusses Lipstick, a tool that visualizes and monitors the progress and performance of Apache Pig scripts. This talk was recorded at the Big Data Guru meetup at Samsung R&D.

While Pig provides a great level of abstraction between MapReduce and dataflow logic, once scripts reach a sufficient level of complexity, it becomes very difficult to understand how data is being transformed and manipulated across MapReduce jobs. The recently open-sourced Lipstick solves this problem. Jeff emphasizes the architecture, implementation, and future of Lipstick, as well as various use cases at Netflix (e.g. using Lipstick to speed up development and improve the efficiency of new and existing scripts).

Slides and Bio…

August 21, 2013
Netflix, Video
big data, Hadoop, MapReduce, Pig

In this talk, we’ll see how recommendation systems are created from data. What’s the algorithm? What’s the evaluation method? What’s the optimization procedure? When does it converge? We’ll talk about parallelizing in order to scale up to “big data” size via the MapReduce framework. Finally, we’ll think about priors and how they are overloaded. Content from this talk draws from chapters in Doing Data Science contributed by David Crawshaw and Matt Gattis.

Bio, etc…

May 15, 2013
Articles
big data, data science, MapReduce, rec sys

How RethinkDB Works by Joe Doliner

QAing New Code with MMS: Map/Reduce vs. Aggregation Framework by Alex Giamas

Hortonworks – Developing Applications with Hadoop 2.0 and YARN by Abhijit Lele

Netflix – Pig With Lipstick by Jeff Magnusson

Mathbabe: A recommendation system and MapReduce by Cathy O’Neil
