In this talk, Joe Crobak, formerly from Foursquare, will give a brief overview of how a workflow engine fits into a standard Hadoop-based analytics stack. He will also give an architectural overview of Azkaban, Luigi, and Oozie, elaborating on some features, tools, and practices that can help build a Hadoop workflow system from scratch or improve upon an existing one. This talk was recorded at the NYC Data Engineering meetup at Ebay.

Building a reliable pipeline of data ingress, batch computation, and data egress with Hadoop can be a major challenge. Most folks start out with cron to manage workflows, but soon discover that doesn’t scale past a handful of jobs. There are a number of open-source workflow engines with support for Hadoop, including Azkaban (from LinkedIn), Luigi (from Spotify), and Apache Oozie. Having deployed all three of these systems in production, Joe talks about what features and qualities are important for a workflow system.

 

Slides and Bio…

 

Continue reading »

Proudly hosted by WPEngine