In this talk, “How RethinkDB Works,” Joe, Lead Engineer at RethinkDB, discusses the value of RethinkDB’s flexible schemas and ease of use, and how to scale a RethinkDB cluster from one node to many. He also covers how RethinkDB fits into the CAP theorem and its persistence semantics. Finally, Joe gives a live demo, showing how to load and analyze data, how to scale out the cluster to achieve higher performance, and even how RethinkDB handles failure when a node is destroyed. This talk was recorded at a meetup at StumbleUpon’s offices.
Data Driven Growth at Airbnb – As Airbnb’s VP of Engineering, Mike Curtis is tasked with using big data infrastructure to provide a better UX and drive massive growth. He’s also responsible for delivering simple, elegant ways to find and stay at the most interesting places in the world. He is currently building a team of engineers that will have a big impact as Airbnb continues to construct a bridge between the online and offline worlds. Mike’s particular focus is on search and matching, systems infrastructure, payments, trust and safety, and mobile.
(Contributor article “Tracking Twitter Followers with MongoDB” by André Spiegel, Consulting Engineer at MongoDB. Originally appeared on the MongoDB blog.)
As a recently hired engineer at MongoDB, I am spending part of my ramping-up training creating a number of small projects with our software to get a feel for how it works, how it performs, and how to get the most out of it. I decided to try it on Twitter. It’s the age-old question that plagues every Twitter user: who just unfollowed me? Surprising or not, Twitter won’t tell you that. You can see who’s currently following you, and you get notified when somebody new shows up. But when your follower count drops, it takes some investigation to figure out who you just lost.
I’m aware there are a number of services that will answer that question for you. Still, I wanted to try this myself.
The Idea and Execution
The basic idea is simple: periodically call Twitter’s REST API to retrieve the follower lists of the accounts you want to monitor, then compare successive lists to figure out who started or stopped following the user in question. There are two challenging parts:
- When you talk to Twitter, talk slowly, lest you hit the rate limit.
- This can get big. Accounts can have millions of followers. If the service is nicely done, millions of users might want to use it.
The second requirement makes this a nice fit for MongoDB.
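The comparison itself is simple set arithmetic. Here is a minimal sketch (my own illustration, not the actual followt code) of computing both directions of change between two polls of the follower list:

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch: given the follower-id set from the previous poll and
// the current one, compute who followed and who unfollowed by set difference.
public class FollowerDiff {

    public static Set<Long> newFollowers(Set<Long> previous, Set<Long> current) {
        Set<Long> added = new HashSet<>(current);
        added.removeAll(previous); // in current but not in previous
        return added;
    }

    public static Set<Long> unfollowers(Set<Long> previous, Set<Long> current) {
        Set<Long> removed = new HashSet<>(previous);
        removed.removeAll(current); // in previous but not in current
        return removed;
    }
}
```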
The program, which I called “followt” and wrote in Java, can be found on GitHub. For this article, let me just summarize the overall structure:
- The scribe library proved to be a great way to handle Twitter’s OAuth authentication mechanism.
- Using the followers/ids endpoint, we can retrieve the numeric ids of 5,000 followers of a given account per minute. For large accounts, we need to retrieve the full list in batches, potentially thousands of batches in a row.
- The numeric ids are fine for determining whether an account started or stopped following another. But if we want to display the actual user names, we need to translate those ids to screen names using the users/lookup endpoint. We can make 180 of these calls per 15-minute window, and up to 100 numeric ids can be translated in each call. To make good use of the 180 calls we’re allowed, we have to make sure not to waste them on individual user ids, but to batch as many requests into each call as we can. The class net.followt.UserDB in the application implements this mechanism, using a BlockingQueue for user ids.
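As a sketch of that batching idea (class and method names are hypothetical, not the actual followt source): producers enqueue numeric ids as they are discovered, and a consumer drains up to 100 at a time, so that none of the 180 calls per window is spent on a single id.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of batching ids for users/lookup.
public class IdBatcher {

    private final BlockingQueue<Long> queue = new LinkedBlockingQueue<>();

    public void submit(long userId) throws InterruptedException {
        queue.put(userId);
    }

    // Blocks until at least one id is available, then drains up to 99 more,
    // yielding at most 100 ids per users/lookup call.
    public List<Long> nextBatch() throws InterruptedException {
        List<Long> batch = new ArrayList<>(100);
        batch.add(queue.take());  // wait for the first id
        queue.drainTo(batch, 99); // grab whatever else is already queued
        return batch;
    }
}
```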
“Understanding and Managing Cassandra’s Vnodes + Under the Hood: Acunu Analytics” – In this talk, the Founder and CTO of Acunu and a Software Engineer at Acunu Analytics share the concept, implementation and benefits of virtual nodes in Apache Cassandra 1.2 and 2.0. They also explain why virtual nodes are a replacement for manual token management, and how to use Acunu Analytics to collect event data, build OLAP-style cubes and run SQL-like queries via a RESTful API, all on top of Cassandra. This talk was recorded at the DataStax Cassandra SF users group meetup.
This is an Apache ZooKeeper introduction – In this talk, Camille, from Rent The Runway, gives an introduction to ZooKeeper: why it’s useful and how you should use it once you have it running. Camille goes over the high-level purpose of ZooKeeper and covers some of the basic use cases and operational concerns. One of the requirements for running Storm or a Hadoop cluster is a reliable ZooKeeper setup. When you’re running a service distributed across a large cluster of machines, even tasks like reading configuration information, which are simple on single-machine systems, can be hard to implement reliably. This talk was recorded at the NYC Storm User Group meetup at WebMD Health.
The ZooKeeper framework was originally built at Yahoo! to give the company’s applications a robust and easy-to-understand way to access configuration information, but it has since grown to offer many features that help coordinate work across distributed clusters. Apache ZooKeeper has become a de facto standard coordination service and is used by Storm, Hadoop, HBase, Elasticsearch and other distributed computing frameworks.
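To make the configuration use case concrete, here is a minimal sketch (my own illustration, not from the talk; the ensemble address and znode path are placeholders) of a client that reads a shared config value from a znode and watches it for changes:

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Every machine in the cluster can read the same znode and gets a
// watch notification when its contents change.
public class ConfigReader implements Watcher {

    private final ZooKeeper zk;

    public ConfigReader() throws Exception {
        zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 10_000, this);
    }

    public String readConfig() throws Exception {
        // true = re-register the watch so we hear about the next change, too
        byte[] data = zk.getData("/app/config", true, null);
        return new String(data, StandardCharsets.UTF_8);
    }

    @Override
    public void process(WatchedEvent event) {
        // Called when the session state or the watched znode changes;
        // a real client would re-read the config here.
        System.out.println("ZooKeeper event: " + event);
    }
}
```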
(Contributor article “Schema Design for Time Series Data in MongoDB” by a Solutions Architect at MongoDB and the Director of Product Marketing at MongoDB. Originally appeared on the MongoDB blog.)
Data as Ticker Tape
New York is famous for a lot of things, including ticker tape parades.
For decades the most popular way to track the price of stocks on Wall Street was through ticker tape, the earliest digital communication medium. Stocks and their values were transmitted via telegraph to a small device called a “ticker” that printed onto a thin roll of paper called “ticker tape.” Though out of use for over 50 years, the idea of the ticker lives on in the scrolling electronic tickers on brokerage walls and at the bottom of most news networks, sometimes two, three and four levels deep.
Today there are many sources of data that, like ticker tape, represent observations ordered over time. For example:
- Financial markets generate prices (we still call them “stock ticks”).
- Sensors measure temperature, barometric pressure, humidity and other environmental variables.
- Industrial fleets such as ships, aircraft and trucks produce location, velocity, and operational metrics.
- Social networks generate status updates.
- Mobile devices produce calls, SMS messages and other signals.
- Systems themselves write information to logs.
This data tends to be immutable, large in volume, ordered by time, and is primarily aggregated for access. It represents a history of what happened, and there are a number of use cases that involve analyzing this history to better predict what may happen in the future or to establish operational thresholds for the system.
Time Series Data and MongoDB
Time series data is a great fit for MongoDB. There are many examples of organizations using MongoDB to store and analyze time series data. Here are just a few:
- Silver Spring Networks, the leading provider of smart grid infrastructure, analyzes utility meter data in MongoDB.
- EnerNOC analyzes billions of energy data points per month to help utilities and private companies optimize their systems, ensure availability and reduce costs.
- Square maintains a MongoDB-based open source tool called Cube for collecting timestamped events and deriving metrics.
- Server Density uses MongoDB to collect server monitoring statistics.
- Skyline Innovations, a solar energy company, stores and organizes meteorological data from commercial scale solar projects in MongoDB.
- One of the world’s largest industrial equipment manufacturers stores sensor data from fleet vehicles to optimize fleet performance and minimize downtime.
In this post, we will take a closer look at how to model time series data in MongoDB by exploring the schema of a tool that has become very popular in the community: MMS. MMS helps users manage their MongoDB systems by providing monitoring, visualization and alerts on over 100 database metrics. Today the system monitors over 25k MongoDB servers across thousands of deployments. Every minute, thousands of local MMS agents collect system metrics and ship the data back to MMS. The system processes over 5B events per day and over 75,000 writes per second, all on fewer than 10 physical servers for the MongoDB tier.
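To make the modeling idea concrete, here is an illustrative sketch (not necessarily MMS’s exact schema) of one widely used time-series pattern: one document per metric per minute, with one field per second, so each sample is a cheap in-place update rather than a new insert.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;

// One document per (host, metric, minute); samples.0 .. samples.59
// hold the per-second values within that minute.
public class MetricWriter {
    public static void main(String[] args) {
        MongoCollection<Document> metrics = MongoClients.create("mongodb://localhost")
                .getDatabase("monitoring").getCollection("metrics");

        Instant now = Instant.now();
        String minute = now.truncatedTo(ChronoUnit.MINUTES).toString();
        int second = (int) (now.getEpochSecond() % 60);
        double value = 42.0; // the sampled metric value

        metrics.updateOne(
                Filters.and(
                        Filters.eq("host", "server-1"),
                        Filters.eq("metric", "cpu"),
                        Filters.eq("minute", minute)),
                Updates.set("samples." + second, value),
                new UpdateOptions().upsert(true)); // create the bucket on first write
    }
}
```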
Big Data and Wee Data – We all know MongoDB is great for Big Data, but it’s also great for work on the other end of the scale — call it “Wee Data.” In this talk, MongoDB expert and Principal at Bringing Fire Consulting, Avery Rosen, talks about how this type of data is far more common than Big Data scenarios; just about every project starts with it. In this domain, we don’t care about disk access and indices; instead, we care about skipping past the wheel-inventing and getting right down to playing with the data. MongoDB lets you persist your prototype or small-working-set data without making you deal with freeze-drying and reconstitution, provides structure well beyond CSV, gets out of your way as you evolve your schemas, and provides simple tools for introspecting data and crunching numbers. This talk was recorded at the New York MongoDB User Group meetup at About.com.
Correction note: At minute 24:30, it was Shutterfly’s (not Photobucket’s) migration to MongoDB.
Correction note: At minute 49:50 (regarding CouchDB) – CouchDB, also a JSON-document database, offers facilities very similar to MongoDB’s, and it does offer aggregation. However, it carries more configuration overhead in the form of views, and it requires error-prone, hard-to-diagnose JavaScript-based map-reduce instead of aggregation operations; as such, I maintain that MongoDB is the superior choice for wee data projects.
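To make that contrast concrete, here is a hypothetical example (collection and field names are mine) of the kind of ad-hoc number crunching meant here, done as a single aggregation call with nothing to configure up front:

```java
import java.util.Arrays;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import org.bson.Document;

// Group a small collection of orders by product and sum the quantities:
// no views to define, no map-reduce functions to write.
public class WeeDataCrunch {
    public static void main(String[] args) {
        MongoCollection<Document> orders = MongoClients.create("mongodb://localhost")
                .getDatabase("shop").getCollection("orders");

        orders.aggregate(Arrays.asList(
                Aggregates.group("$product", Accumulators.sum("totalQty", "$qty"))
        )).forEach(doc -> System.out.println(doc.toJson()));
    }
}
```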
“MongoDB and Wee Data: Hacking a Workflow” will start with theory and proceed to walk through Ruby code that shows MongoDB’s place in a working e-commerce site’s data ecosystem.
This is a friendly lambda calculus introduction by Dustin Mulcahey. LISP has its syntactic roots in a formal system called the lambda calculus. After a brief discussion of formal systems and logic in general, Dustin dives into the lambda calculus and makes enough constructions to convince you that it really is capable of expressing anything that is “computable.” Dustin then talks about the simply typed lambda calculus and the Curry-Howard-Lambek correspondence, which asserts that programs and mathematical proofs are “the same thing.” This talk was recorded at the Lisp NYC meetup at Meetup HQ.
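As a taste of the constructions such a talk builds (this example is standard lambda calculus, not taken from the talk itself): Church numerals encode the natural numbers as functions, and beta reduction does the computing.

```latex
\begin{align*}
  0 &\equiv \lambda f.\,\lambda x.\,x
  \qquad 1 \equiv \lambda f.\,\lambda x.\,f\,x
  \qquad 2 \equiv \lambda f.\,\lambda x.\,f\,(f\,x) \\
  \mathrm{succ} &\equiv \lambda n.\,\lambda f.\,\lambda x.\,f\,(n\,f\,x) \\
  \mathrm{succ}\;1 &= (\lambda n.\,\lambda f.\,\lambda x.\,f\,(n\,f\,x))\,(\lambda f.\,\lambda x.\,f\,x)
  \to_\beta \lambda f.\,\lambda x.\,f\,((\lambda f.\,\lambda x.\,f\,x)\,f\,x)
  \to_\beta \lambda f.\,\lambda x.\,f\,(f\,x) \;=\; 2
\end{align*}
```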
Rails Challenges and Demystifying REST APIs – First talk, by Daniel Kehoe: This talk is for Rails initiates as well as the managers and mentors who help them gain skills and improve. Rails can be difficult for new developers. In this talk, Daniel Kehoe discusses the challenges of learning Rails and how to help all Rails developers overcome the obstacles.
Second talk, by Kirsten Jones: Struggling with integrating web services? Kirsten Jones from 3Scale gives a whirlwind tour of the entire web services stack, from the HTTP layer to authentication models, and demonstrates methods you can use to debug issues when integrating with APIs. Common issues and resolutions are covered as well, to make sure you can get those integrations up and running with a minimum of hair-pulling.
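As a flavor of that HTTP-layer debugging (a hypothetical sketch; the URL is a placeholder), one reliable method is to issue the request yourself and inspect the raw status, headers and body rather than trusting a client library’s view of the exchange:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Issue a request directly and dump everything the server actually said.
public class ApiDebug {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.example.com/v1/items"))
                .header("Accept", "application/json")
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println("Status: " + response.statusCode());
        response.headers().map().forEach((name, values) ->
                System.out.println(name + ": " + values));
        System.out.println(response.body());
    }
}
```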
In this talk, Adam from Tumblr gives an “Introduction to Digital Signal Processing in Hadoop.” Adam introduces the concepts of digital signals and filters, and their interpretation in both the time and frequency domains, and he works through a few simple examples of low-pass filter design and application. The talk is much more application-focused than theoretical, and no prior knowledge of signal processing is assumed. This talk was recorded at the NYC Machine Learning meetup at Pivotal Labs.
Adam also works through how these filters can be used either on a real-time stream or in batch mode in Hadoop (with Scalding), and he shows some examples of how to detect trendy, meme-ish blogs on Tumblr.
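As a flavor of the filters discussed (my own minimal sketch, not code from the talk): the simplest low-pass filter is a moving average, where each output sample is the mean of the last N input samples, so fast wiggles cancel out and slow trends pass through.

```java
// Moving-average low-pass filter over a sampled signal.
public class MovingAverageFilter {

    public static double[] lowPass(double[] signal, int window) {
        double[] out = new double[signal.length];
        double sum = 0.0;
        for (int i = 0; i < signal.length; i++) {
            sum += signal[i];
            if (i >= window) {
                sum -= signal[i - window]; // drop the sample that left the window
            }
            out[i] = sum / Math.min(i + 1, window);
        }
        return out;
    }
}
```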