Data Driven Growth at Airbnb by Mike Curtis – As Airbnb’s VP of Engineering, Mike Curtis is tasked with using big data infrastructure to provide a better UX and drive massive growth. He’s also responsible for delivering simple, elegant ways to find and stay at the most interesting places in the world. He is currently working to build a team of engineers that will have a big impact as Airbnb continues to construct a bridge between the online and offline worlds. Mike’s particular focus is on search and matching, systems infrastructure, payments, trust and safety, and mobile.

Bio…

 

This is an Apache ZooKeeper introduction – In this talk, Camille Fournier, from Rent The Runway, gives an introduction to ZooKeeper. She explains why it’s useful and how you should use it once you have it running. Camille goes over the high-level purpose of ZooKeeper and covers some of the basic use cases and operational concerns. One of the requirements for running Storm or a Hadoop cluster is a reliable ZooKeeper setup. When you’re running a service distributed across a large cluster of machines, even tasks like reading configuration information, which are simple on single-machine systems, can be hard to implement reliably. This talk was recorded at the NYC Storm User Group meetup at WebMD Health.

Interested in the Tech Challenges at Rent the Runway?
If you’re looking for a super smart team working on significant problems in the areas of data science and logistics, don’t miss this opportunity to connect directly with an engineer inside Rent the Runway.

 

 

The ZooKeeper framework was originally built at Yahoo! to make it easy for the company’s applications to access configuration information in a robust and easy-to-understand way, but it has since grown to offer a lot of features that help coordinate work across distributed clusters. Apache ZooKeeper has become the de facto standard coordination service and is used by Storm, Hadoop, HBase, Elasticsearch and other distributed computing frameworks.
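As a rough sketch of what a reliable ZooKeeper setup involves, a minimal replicated ensemble needs only a short zoo.cfg on each member machine; the hostnames below are placeholders:

```
# zoo.cfg – minimal three-node ensemble (hostnames are placeholders)
tickTime=2000        # base time unit in ms for heartbeats and timeouts
initLimit=10         # ticks a follower may take to connect and sync to the leader
syncLimit=5          # ticks a follower may lag before being dropped
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888   # 2888 = quorum port, 3888 = leader-election port
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```

An odd number of servers (three or five) is typical, so a majority quorum survives the loss of one or two members.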

Slides & Bio…

 

(Contributor article “Schema Design for Time Series Data in MongoDB,” by a Solutions Architect at MongoDB and a Director of Product Marketing at MongoDB. Originally appeared on the MongoDB blog)

Data as Ticker Tape

New York is famous for a lot of things, including ticker tape parades.

For decades the most popular way to track the price of stocks on Wall Street was through ticker tape, the earliest digital communication medium. Stocks and their values were transmitted via telegraph to a small device called a “ticker” that printed onto a thin roll of paper called “ticker tape.” While it has been out of use for over 50 years, the idea of the ticker lives on in the scrolling electronic tickers on brokerage walls and at the bottom of most news networks, sometimes two, three or four levels deep.

Today there are many sources of data that, like ticker tape, represent observations ordered over time. For example:

  • Financial markets generate prices (we still call them “stock ticks”).
  • Sensors measure temperature, barometric pressure, humidity and other environmental variables.
  • Industrial fleets such as ships, aircraft and trucks produce location, velocity, and operational metrics.
  • Social networks generate status updates.
  • Mobile devices emit calls, SMS messages and other signals.
  • Systems themselves write information to logs.

This data tends to be immutable, large in volume, ordered by time, and is primarily aggregated for access. It represents a history of what happened, and there are a number of use cases that involve analyzing this history to better predict what may happen in the future or to establish operational thresholds for the system.

Time Series Data and MongoDB

Time series data is a great fit for MongoDB. There are many examples of organizations using MongoDB to store and analyze time series data. Here are just a few:

  • Silver Spring Networks, the leading provider of smart grid infrastructure, analyzes utility meter data in MongoDB.
  • EnerNOC analyzes billions of energy data points per month to help utilities and private companies optimize their systems, ensure availability and reduce costs.
  • Square maintains a MongoDB-based open source tool called Cube for collecting timestamped events and deriving metrics.
  • Server Density uses MongoDB to collect server monitoring statistics.
  • Skyline Innovations, a solar energy company, stores and organizes meteorological data from commercial scale solar projects in MongoDB.
  • One of the world’s largest industrial equipment manufacturers stores sensor data from fleet vehicles to optimize fleet performance and minimize downtime.

In this post, we will take a closer look at how to model time series data in MongoDB by exploring the schema of a tool that has become very popular in the community: MMS, the MongoDB Management Service. MMS helps users manage their MongoDB systems by providing monitoring, visualization and alerts on over 100 database metrics. Today the system monitors over 25k MongoDB servers across thousands of deployments. Every minute thousands of local MMS agents collect system metrics and ship the data back to MMS. The system processes over 5B events per day, and over 75,000 writes per second, all on fewer than 10 physical servers for the MongoDB tier.
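The schema itself is covered in the full post; as an illustrative sketch (in Python, with invented field names, not the actual MMS schema) of the document-per-minute pattern commonly used for this kind of workload: each metric gets one preallocated document per minute with a slot per second, so every incoming sample becomes a small in-place update rather than an insert.

```python
from datetime import datetime, timezone

def minute_doc(metric, ts):
    """Preallocated one-document-per-minute bucket with one slot per second.

    Preallocating all 60 slots keeps the document a fixed size, so
    per-second updates never grow (and thus never relocate) the document.
    """
    return {
        "_id": f"{metric}:{ts.strftime('%Y%m%d%H%M')}",
        "metric": metric,
        "minute": ts.replace(second=0, microsecond=0),
        "vals": {str(s): None for s in range(60)},
    }

def update_spec(ts, value):
    """The {'$set': ...} document a driver would send for one sample."""
    return {"$set": {f"vals.{ts.second}": value}}

# Simulate what the server would do with the $set, without a live database:
doc = minute_doc("cpu.load", datetime(2013, 10, 1, 12, 30, tzinfo=timezone.utc))
spec = update_spec(datetime(2013, 10, 1, 12, 30, 17, tzinfo=timezone.utc), 0.72)
for path, v in spec["$set"].items():
    field, key = path.split(".")
    doc[field][key] = v
```

Because each sample touches a fixed-size document with a single `$set`, writes stay small and in place, which is the kind of access pattern that lets a system sustain tens of thousands of writes per second on modest hardware.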

More…

 

Big Data and Wee Data – We all know MongoDB is great for Big Data, but it’s also great for work on the other end of the scale — call it “Wee Data”. In this talk, MongoDB expert and Principal at Bringing Fire Consulting, Avery Rosen, explains how this type of data is far more common than Big Data scenarios. Avery discusses how just about every project starts with it. In this domain, we don’t care about disk access and indices; instead, we care about skipping past the wheel inventing and getting right down to playing with the data. MongoDB lets you persist your prototype or small-working-set data without making you deal with freeze-drying and reconstitution, provides structure well beyond csv, gets out of your way as you evolve your schemas, and provides simple tools for introspecting data and crunching numbers. This talk was recorded at the New York MongoDB User Group meetup at About.com.

Correction note: At minute 24:30 - Shutterfly’s (not Photobucket’s) migration to MongoDB

Correction note: At minute 49:50 (regarding CouchDB) - CouchDB offers some facilities very similar to MongoDB’s, being a JSON-document database, and it does offer aggregation. However, it carries more configuration overhead in the form of views, and it requires error-prone, difficult-to-diagnose JavaScript-based map-reduce instead of aggregation operations; as such, I maintain MongoDB is a superior choice for wee data projects.

“MongoDB and Wee Data: Hacking a Workflow” will start with theory and proceed to walk through ruby code that shows MongoDB’s place in a working ecommerce site’s data ecosystem.
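To make the contrast with JavaScript map-reduce concrete, here is a hedged sketch (in Python rather than the talk’s Ruby, with invented order data): a MongoDB aggregation pipeline is just plain data passed to the driver, and the same grouping can be evaluated in pure Python to show the semantics without a running server.

```python
# An aggregation pipeline expressed as plain data (what you'd pass to
# collection.aggregate()); no JavaScript map/reduce views required.
pipeline = [
    {"$match": {"status": "complete"}},
    {"$group": {"_id": "$sku", "revenue": {"$sum": "$price"}}},
]

# Invented sample "collection" for an ecommerce site:
orders = [
    {"sku": "dress-01", "price": 40, "status": "complete"},
    {"sku": "dress-01", "price": 35, "status": "complete"},
    {"sku": "gown-02",  "price": 90, "status": "returned"},
    {"sku": "gown-02",  "price": 90, "status": "complete"},
]

# The same grouping evaluated in pure Python, to show what the pipeline means:
revenue = {}
for doc in orders:
    if doc["status"] == "complete":                                  # $match
        revenue[doc["sku"]] = revenue.get(doc["sku"], 0) + doc["price"]  # $group / $sum
```

The pipeline stays declarative and introspectable, which is exactly the “get out of your way” quality that suits wee-data prototyping.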

Slides…

 

In this talk, Adam, from Tumblr, gives an “Introduction to Digital Signal Processing in Hadoop”. He introduces the concepts of digital signals, filters, and their interpretation in both the time and frequency domain, and he works through a few simple examples of low-pass filter design and application. It’s much more application focused than theoretical, and there is no assumed prior knowledge of signal processing. This talk was recorded at the NYC Machine Learning Meetup at Pivotal Labs.

Adam also works through how they can be used either in a real-time stream or in batch-mode in Hadoop (with Scalding).  He also has some examples of how to detect trendy meme-ish blogs on Tumblr.
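As a minimal sketch of the low-pass idea (pure Python with an invented test signal, not Adam’s actual code): a moving average is the simplest low-pass FIR filter, and a window of length 4 exactly cancels a period-4 oscillation while passing a slow trend nearly unchanged.

```python
import math

def moving_average(signal, window):
    """Simplest low-pass FIR filter: convolution with equal weights summing to 1."""
    kernel = [1.0 / window] * window
    return [
        sum(signal[i + j] * kernel[j] for j in range(window))
        for i in range(len(signal) - window + 1)
    ]

# Invented test signal: a slow trend (period 100) plus fast noise (period 4).
n = 200
signal = [
    math.sin(2 * math.pi * t / 100) + 0.5 * math.sin(2 * math.pi * t / 4)
    for t in range(n)
]

# A window of 4 sums exactly one full period of the fast component,
# cancelling it; only the slow trend survives.
smoothed = moving_average(signal, window=4)
```

The same kernel works identically whether it is applied to a real-time stream sample-by-sample or to historical data in a batch Hadoop job; only the plumbing around it changes.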

Slides & Bio…

 

(Contributor article “How We Measured America’s Most Hospitable Cities” by Riley Newman, Head of Analytics/Data Science at Airbnb. Originally appeared on the Airbnb Blog)

By: Andrey Fradkin, Riley Newman & Rebecca Rosenfelt

Lately, we’ve been thinking about how we can promote and share exceptional hosting practices. We know that some hosts on our site consistently receive exceptional reviews. What are the common characteristics of these hosts?

As a first step in our investigation, we created a “Hospitality Index” that measures host quality across cities. Immediately, we saw stark regional trends in host quality and hospitality.

Methodology

To build the index of America’s most hospitable cities, we looked to reviews, our richest source of data about how a trip went. After each trip, we ask guests to rate a number of specific dimensions:


  • Cleanliness — a foundational aspect of any travel experience.
  • Check In — a crucial moment that affects the entire trip.
  • Communication — the primary factor in resolving queries and forestalling any issues.
  • Value — a bit tricky because in some ways it encompasses all the other measures, but capturing a guest’s sense of the overall value of the experience is an important metric.
  • Accuracy — expectation management is key to a smooth Airbnb experience.

There’s a long history of criticism surrounding 5-star review systems. For example, scores tend to be binary (5 or 1). But we can be confident that a 5-star score is a good experience, at minimum. So for the index we looked at the percentage of trips (not reviews, which would be biased by review rates) where guests give 5-star scores for all of the above criteria.
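A sketch of that index computation, with invented trip data and field names, might look like:

```python
CRITERIA = ["cleanliness", "checkin", "communication", "value", "accuracy"]

def hospitality_index(trips):
    """Share of *trips* (not reviews) rated 5 stars on every dimension.

    Unreviewed trips stay in the denominator, which avoids the bias of
    conditioning on a review having been left at all.
    """
    if not trips:
        return 0.0
    perfect = sum(
        1 for t in trips
        if t.get("review") is not None
        and all(t["review"].get(c) == 5 for c in CRITERIA)
    )
    return perfect / len(trips)

# Invented sample: one perfect trip, one with a 4-star value score, one unreviewed.
trips = [
    {"review": {c: 5 for c in CRITERIA}},
    {"review": {**{c: 5 for c in CRITERIA}, "value": 4}},
    {"review": None},
]
```

Averaging this per-trip indicator over a city’s hosts gives a city-level score that can be ranked.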

Bio…

 

In this talk, Terence Yim, from Continuuity, discusses Weave, a simple set of libraries that allow you to easily manage distributed applications through an abstraction layer built on Hadoop YARN. Weave allows you to use YARN’s distributed capabilities with a programming model that is similar to running threads. This talk was recorded at the Big Data Gurus meetup at Samsung R&D.

 

Slides and Bio…

 

In this talk, a speaker from Cloudera discusses Cloudera’s new open source project, the Cloudera Development Kit (CDK), which helps Hadoop developers get new projects off the ground more easily. The CDK is both a framework and a long-term initiative for documenting proven development practices and providing helpful docs and APIs that will make Hadoop application development as easy as possible. This talk was recorded at the Big Data Gurus meetup at Samsung R&D.

Slides and Bio…

 

(Original post with video of talk here)

Adam Ilardi: Hi, I’m Adam Ilardi. I work here at eBay. I’m an applied researcher. Why do I choose eBay? It’s a pretty cool company. They sell the craziest stuff you’ll ever believe. There’s denim jean jackets with Nick Cage on the back, and this kind of stuff is all over the place. So it’s definitely cool.

The New York office is brand new.  It’s less than a year.  What does the New York office do?  Well, we own the homepage of eBay, so the brand-new feed is developed right over there.  You might even see one of the guys.  He’s hiding.  Okay.  And also, all the merchandising for eBay is going to be run out of the New York office.  So that’s billions of dollars worth of eBay business run right out of here.  It’s a major investment eBay has made in New York, which is really cool.

So why you’re here is to find out how we use Scala and Hadoop, and given all the data we have, the two pair very nicely together, as you will see. All right, so let’s get started. Okay, these are some things we’ll cover: polymorphic function values, higher-kinded types, the Cokleisli star operator, and some use of macros.

Continue reading »

 

In this talk, Abhijit, from Hortonworks, discusses YARN architecture and how to get started developing for the next generation of Hadoop. This talk was recorded at the New York Hadoop User Group meetup at Gilt.

Hadoop 2.0 is approaching. Abhijit talks on a defining characteristic of Hadoop 2.0: its next-generation resource management framework, YARN. YARN enables Hadoop to grow beyond its MapReduce origins to embrace multiple workloads spanning interactive queries, batch processing, streaming and more.

 

Slides…
