In this introduction to Apache ZooKeeper, Camille, from Rent the Runway, explains why ZooKeeper is useful and how you should use it once you have it running. Camille goes over the high-level purpose of ZooKeeper and covers some of the basic use cases and operational concerns. One of the requirements for running Storm or a Hadoop cluster is a reliable ZooKeeper setup. When you’re running a service distributed across a large cluster of machines, even tasks like reading configuration information, which are simple on single-machine systems, can be hard to implement reliably. This talk was recorded at the NYC Storm User Group meetup at WebMD Health.


 

 

The ZooKeeper framework was originally built at Yahoo! to make it easy for the company’s applications to access configuration information in a robust and easy-to-understand way, but it has since grown to offer many features that help coordinate work across distributed clusters. Apache ZooKeeper has become a de facto standard coordination service and is used by Storm, Hadoop, HBase, ElasticSearch and other distributed computing frameworks.

Slides & Bio…

 

In this talk, Adam, from Tumblr, gives an “Introduction to Digital Signal Processing in Hadoop”. Adam introduces the concepts of digital signals, filters, and their interpretation in both the time and frequency domains, and he works through a few simple examples of low-pass filter design and application. The talk is much more application-focused than theoretical, and it assumes no prior knowledge of signal processing. This talk was recorded at the NYC Machine Learning Meetup at Pivotal Labs.

Adam also works through how these filters can be used either on a real-time stream or in batch mode in Hadoop (with Scalding). He also shows some examples of how to detect trendy, meme-ish blogs on Tumblr.
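To give a feel for what a low-pass filter does, here is a minimal sketch in Python. It is not taken from the talk: the moving-average filter, the function name `moving_average`, and the sample data are all assumptions chosen for illustration. Averaging the last few samples damps fast fluctuations while keeping the slower trend.

```python
def moving_average(signal, window):
    """Simple low-pass filter: each output is the mean of the most recent
    `window` samples (fewer at the start, before the window fills up)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    out = []
    running = 0.0
    for i, x in enumerate(signal):
        running += x
        if i >= window:
            # Drop the sample that just fell out of the window.
            running -= signal[i - window]
        out.append(running / min(i + 1, window))
    return out


# Hypothetical hourly activity counts: a slow upward trend plus
# alternating high-frequency noise.
counts = [10 + (i // 4) + (3 if i % 2 == 0 else -3) for i in range(16)]
smoothed = moving_average(counts, window=4)
```

The same per-sample update works on a live stream, since it only needs the last `window` values, or over a stored series in a batch job.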

Slides & Bio…

 

In this talk, Terence Yim, from Continuuity, discusses Weave, a simple set of libraries that allow you to easily manage distributed applications through an abstraction layer built on Hadoop YARN. Weave allows you to use YARN’s distributed capabilities with a programming model that is similar to running threads. This talk was recorded at the Big Data Gurus meetup at Samsung R&D.

 

Slides and Bio…

 

In this talk, a speaker from Cloudera discusses Cloudera’s new open source project, the Cloudera Development Kit (CDK), which helps Hadoop developers get new projects off the ground more easily. The CDK is both a framework and a long-term initiative for documenting proven development practices and providing helpful documentation and APIs that will make Hadoop application development as easy as possible. This talk was recorded at the Big Data Gurus meetup at Samsung R&D.

Slides and Bio…

 

(Original post with video of talk here)

Adam Ilardi: Hi, I’m Adam Ilardi.  I work here at eBay.  I’m an applied researcher.  Why do I choose eBay?  It’s a pretty cool company.  They sell the craziest stuff you’ll ever believe.  There’s denim jean jackets with Nick Cage on the back, and this kind of stuff is all over the place.  So it’s definitely cool.

The New York office is brand new.  It’s less than a year.  What does the New York office do?  Well, we own the homepage of eBay, so the brand-new feed is developed right over there.  You might even see one of the guys.  He’s hiding.  Okay.  And also, all the merchandising for eBay is going to be run out of the New York office.  So that’s billions of dollars worth of eBay business run right out of here.  It’s a major investment eBay has made in New York, which is really cool.

So why you’re here is to find out how we use Scala and Hadoop, and given all the data we have, the two pair very nicely together, as you will see.  All right, so let’s get started.  Okay, these are some things we’ll cover: polymorphic function values, higher-kinded types, Cokleisli’s star operator, some use of macros.


 

In this talk, Joe Crobak, formerly of Foursquare, gives a brief overview of how a workflow engine fits into a standard Hadoop-based analytics stack. He also gives an architectural overview of Azkaban, Luigi, and Oozie, elaborating on some features, tools, and practices that can help build a Hadoop workflow system from scratch or improve upon an existing one. This talk was recorded at the NYC Data Engineering meetup at eBay.

Building a reliable pipeline of data ingress, batch computation, and data egress with Hadoop can be a major challenge. Most folks start out with cron to manage workflows, but soon discover that doesn’t scale past a handful of jobs. There are a number of open-source workflow engines with support for Hadoop, including Azkaban (from LinkedIn), Luigi (from Spotify), and Apache Oozie. Having deployed all three of these systems in production, Joe talks about what features and qualities are important for a workflow system.
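One thing a workflow engine adds over cron is explicit dependencies between jobs. The sketch below is only an illustration of that idea in Python and does not use the API of Azkaban, Luigi, or Oozie; the job names and the `run_order` helper are hypothetical. It uses the standard library's `graphlib` (Python 3.9+) to order jobs so that each one runs after the jobs it depends on.

```python
from graphlib import CycleError, TopologicalSorter


def run_order(dependencies):
    """Return job names ordered so every job comes after its prerequisites.

    `dependencies` maps each job to the set of jobs it depends on.
    Raises CycleError if the jobs depend on each other in a loop.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical pipeline: data ingress, batch computation, data egress.
pipeline = {
    "ingest_logs": set(),
    "sessionize": {"ingest_logs"},
    "daily_report": {"sessionize"},
    "export_to_db": {"daily_report"},
}

order = run_order(pipeline)
```

With cron, the same ordering has to be approximated by spacing out start times, which breaks as soon as an upstream job runs long or fails.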

 

Slides and Bio…

 

In this talk, Abhijit, from Hortonworks, discusses YARN architecture and how to get started developing for the next generation of Hadoop. This talk was recorded at the New York Hadoop User Group meetup at Gilt.

Hadoop 2.0 is approaching. Abhijit talks about a defining characteristic of Hadoop 2.0: its next-generation resource management framework, YARN. YARN enables Hadoop to grow beyond its MapReduce origins to embrace multiple workloads spanning interactive queries, batch processing, streaming, and more.

 

Slides…

 

In this talk, Jeff Magnusson, Manager of Data Platform Architecture at Netflix, discusses Lipstick, a tool that visualizes and monitors the progress and performance of Apache Pig scripts. This talk was recorded at the Big Data Gurus meetup at Samsung R&D. Comments are available here.

While Pig provides a great level of abstraction between MapReduce and dataflow logic, once scripts reach a sufficient level of complexity, it becomes very difficult to understand how data is being transformed and manipulated across MapReduce jobs. The recently open-sourced Lipstick solves this problem. Jeff covers the architecture, implementation, and future of Lipstick, as well as various use cases at Netflix (e.g. using Lipstick to improve the speed of development and the efficiency of new and existing scripts).

 

Slides and Bio…

 

In this talk, a senior software engineer discusses Morphlines, an easy way to build and integrate ETL apps for Hadoop. This talk was recorded at the StumbleUpon offices for Cloudera.

Cloudera Morphlines is a new open source framework that reduces the time and skills necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards.

 

Slides and Bio…

 

This talk is by Abhijit, Solutions Engineer Manager at Cloudera, recorded at the 10gen headquarters in NYC.

Abhijit will explain exactly what the Stinger Initiative has done for Hive, including fast interactive queries and complete SQL compatibility.


Bio…
