In this talk, Jeroen, senior data scientist at YPlan, introduces the outlier-selection and one-class classification settings. He then presents a novel algorithm called Stochastic Outlier Selection (SOS). The SOS algorithm computes an outlier probability for each data point. These probabilities are more intuitive than the unbounded outlier scores computed by existing outlier-selection algorithms. Jeroen has evaluated SOS on a variety of real-world and synthetic datasets and compared it to four state-of-the-art outlier-selection algorithms. The results show that SOS performs better while being more robust to data perturbations and parameter settings. Jeroen’s blog post on the subject contains links to the d3 demo. This talk was recorded at the NYC Machine Learning meetup at Pivotal Labs.

What do a terrorist attack, a forged painting, and a rotten apple have in common? The answer is that all three are anomalies: real-world observations that deviate from what is considered normal. Detecting anomalies is of utmost importance because an undetected anomaly can be dangerous or expensive. A human domain expert may suffer from three cognitive limitations that hamper the detection of anomalies: fatigue, information overload, and emotional bias. Outlier-selection and one-class classification algorithms can automatically classify data points as outliers in large amounts of data. During his Ph.D., Jeroen studied to what extent outlier-selection and one-class classification algorithms can support domain experts with real-world anomaly detection.
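
To make the idea of an outlier probability concrete, here is a minimal sketch in the spirit of SOS, not Jeroen’s implementation. It assumes Euclidean distances and a single fixed bandwidth `sigma` for every point, whereas the published algorithm, as we understand it, tunes a variance per point from a perplexity parameter. The function name and the sample points are hypothetical.

```python
import math


def outlier_probabilities(points, sigma=1.0):
    """Return one outlier probability per point (simplified, SOS-style sketch).

    Assumption: a fixed bandwidth `sigma` is shared by all points; the real
    algorithm chooses a per-point variance instead.
    """
    n = len(points)
    binding = []
    for i in range(n):
        # Affinity of point i with every other point decays with squared
        # distance; a point has no affinity with itself.
        affinities = [
            0.0 if i == j
            else math.exp(-math.dist(points[i], points[j]) ** 2 / (2 * sigma ** 2))
            for j in range(n)
        ]
        total = sum(affinities)
        # Normalise so that each row is a probability distribution.
        binding.append([a / total for a in affinities])
    # Point j is an outlier when none of the other points binds to it.
    return [
        math.prod(1.0 - binding[i][j] for i in range(n) if i != j)
        for j in range(n)
    ]


# Four points in a tight cluster and one point far away (made-up data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
probs = outlier_probabilities(points)
for point, prob in zip(points, probs):
    print(point, round(prob, 3))
```

Because every value is a probability between 0 and 1, a threshold such as 0.5 has a direct interpretation, which is the advantage over unbounded outlier scores described above.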

 

About the talk: NoSQL databases seem to be everywhere you look these days, whether it’s 10gen becoming MongoDB, AWS exposing DynamoDB as a service, or a heated argument overheard at a meetup pitting Riak against Voldemort. In all the hubbub, one key-value store is consistently overlooked, even though it is replete with namespacing support, backed by an open standard, and supports a robust and battle-tested authorization scheme: POSIX filesystems.

In this Lyceum, Matt Story will start by introducing the UNIX system calls for file I/O and manipulation. Using that OS-level knowledge, we’ll then cover the higher-level interfaces Python provides for different kinds of files, learning how to work with the file-system while optimizing for both performance and readability, and debunking the myth that the file-system is not fast, scalable or easily distributed.

Matt will end by tying together the concepts we’ve learned so far with a live demo of Axial’s file-system backed message queue, fsq.
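
As a rough illustration of the filesystem-as-key-value-store idea, the sketch below maps namespaces to directories and keys to files. It is not taken from the talk or from fsq; the `FileStore` class and its method names are hypothetical, and it assumes keys are valid file names. Writes go through the thin `os` wrappers around the system calls, while reads use Python’s higher-level `open()`.

```python
import os
import tempfile


class FileStore:
    """Hypothetical key-value store: one directory per namespace, one file per key."""

    def __init__(self, root):
        self.root = root

    def _dir(self, namespace):
        return os.path.join(self.root, namespace)

    def put(self, namespace, key, value):
        directory = self._dir(namespace)
        os.makedirs(directory, exist_ok=True)
        # Write to a temporary file in the same directory first.
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        try:
            remaining = memoryview(value)
            while remaining:
                # os.write may write fewer bytes than requested, so loop.
                written = os.write(fd, remaining)
                remaining = remaining[written:]
            os.fsync(fd)  # ask the OS to flush the data to disk
        finally:
            os.close(fd)
        # Renaming within one filesystem is atomic on POSIX, so readers see
        # either the old value or the new one, never a partial write.
        os.replace(tmp_path, os.path.join(directory, key))

    def get(self, namespace, key):
        with open(os.path.join(self._dir(namespace), key), "rb") as f:
            return f.read()

    def keys(self, namespace):
        return sorted(os.listdir(self._dir(namespace)))


root = tempfile.mkdtemp()  # scratch directory for the demo
store = FileStore(root)
store.put("users", "alice", b"admin")
value = store.get("users", "alice")
keys = store.keys("users")
print(keys, value)
```

Authorization comes for free in this design: ordinary file permissions on the namespace directories decide who may read or write which keys.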

Speaker Bio: Matt is currently Director of Engineering at Axial, where he built fsq as a general-purpose replacement for RabbitMQ, which was both a single point of failure and provided lackluster introspection and debugging capabilities. Prior to Axial, he collaborated on several specific file-backed message queues as an engineering lead at Tablet.

 

In this talk, Radu, from Sematext, gives an introduction to Elasticsearch. Radu starts by explaining what Elasticsearch is and how it can act as your NoSQL data store while providing quick, flexible and scalable search, for example when indexing logs or storing product information so that customers can search it. Radu also gives a demo of the most important functions of Elasticsearch. Key talking points include indexing and searching for documents, text analysis for tweaking the relevance of your searches, facets that pull statistics out of documents, and scaling out for more capacity and fault tolerance. He also touches on performance tuning for indexing, and on monitoring and administering your cluster in production. This talk was recorded at the NYC Search, Discovery and Analytics meetup at Gilt.

 

Golang Series, part 2 of 2: Profiling Go Programs

In this talk, a speaker from CloudFlare gives an in-depth presentation on profiling Go programs. This talk was recorded at the GoSF meetup at Cisco SF.

 

In this presentation, Paul, from Errplane, gives an introduction to InfluxDB, an open source distributed time series database that he created. He talks about why one would want a database built specifically for time series, and covers its API and some of the key features of InfluxDB, including:

• Stores metrics (like Graphite) and events (like page views, exceptions, deploys)
• No external dependencies (self-contained binary)
• Fast: handles many thousands of writes per second on a single node
• HTTP API for reading and writing data
• SQL-like query language
• Distributed to scale out to many machines
• Built-in aggregate and statistics functions
• Built-in downsampling

This talk was recorded at the New York Open Statistical Programming meetup at Knewton.

 

We recently sat down with Daniel Dubrovkine, head of engineering at Artsy, to chat about good software engineering practices and to learn what tips he can share from his experience building Artsy from day one to its present size of around 50 employees.

Technical Debt

As engineers, we are used to objective yes or no answers. The topic of technical debt is fascinating because there is no real yes or no answer, and no formula, for how much technical debt is OK to accumulate before it begins to damage the business. And no business has absolutely no technical debt.

Daniel talks about two kinds of technical debt. The first comes from engineers who are not qualified enough to build a particular part of the application and who make architecture mistakes. The danger of this kind of debt is that when you pile feature upon feature on top of a poorly architected foundation, at some point it all crashes and the software just doesn’t work.

But technical debt does not have to be all bad. Yes, there is good technical debt. Good technical debt is planned: the software architect deliberately leaves ambiguity and unimplemented pieces in the software because the business case has not been completely defined. Once the business case has been completely defined, engineers can write the code that pays off that planned technical debt. Here is Daniel’s video about technical debt.

 

In this talk, Rich, from Datomic, gives an introduction to Datomic as a functional database. He describes Datomic as a database of values, which means that you can write functions that take database values as arguments and return a database value as their result. He also talks about the importance of a durable, consistent database that can be shared across processes. Rich also offers some hands-on use from Clojure. This talk was recorded at the LispNYC meetup at Meetup HQ.
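
The database-as-a-value idea can be sketched without any Datomic API at all. The example below is a toy in Python, not Datomic or Clojure code: a database is an immutable set of hypothetical (entity, attribute, value) facts, and every function takes a database value and returns either a result or a new database value. All names are made up for illustration.

```python
# An empty, immutable database value.
EMPTY_DB = frozenset()


def transact(db, new_facts):
    """Return a NEW database value containing the extra facts; `db` is unchanged."""
    return db | frozenset(new_facts)


def attribute_values(db, attribute):
    """Query a database value: all values stored under `attribute`, sorted."""
    return sorted(value for (_entity, attr, value) in db if attr == attribute)


db_v1 = transact(EMPTY_DB, [("user-1", "name", "Ada")])
db_v2 = transact(db_v1, [("user-2", "name", "Grace")])

# The older value is still intact and can be queried like any other value.
names_v1 = attribute_values(db_v1, "name")
names_v2 = attribute_values(db_v2, "name")
print(names_v1, names_v2)
```

Because `db_v1` never changes, it can be passed to other functions, compared with `db_v2`, or queried later without locks or defensive copies.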

 

Golang Series, part 1 of 2: Talk 1, Go after 2 Years in Production. Talk 2, Using Sourcegraph to Navigate Go Code on Github

Two-part video: In Talk 1, a speaker from Iron.io provides in-depth details on why Go turned out to be the right choice for the Iron.io backend. He talks about issues related to performance, memory usage, concurrency, reliability, and ease of deployment, and goes through key areas in the architecture where Go made the difference. In Talk 2, a speaker from Sourcegraph shows off a tool for navigating GitHub to everywhere a Go function is used or a Go interface is implemented. He also shows how to use it to make your own open source projects better. These talks were recorded at the GoSF meetup at Cisco SF.

 

In this talk on distributed GBM for machine learning, Earl Hathaway, resident Data Scientist at 0xdata, talks about distributed GBM, one of the most popular machine learning algorithms used in data mining competitions. He discusses where distributed GBM is applicable and reviews recent KDD & Kaggle uses of machine learning and distributed GBM. Cliff Click, CTO of 0xdata, also talks about the implementation and design choices of a distributed GBM. This talk was recorded at the SF Data Mining meetup at Trulia.

 

In this introduction to Cassandra, Patrick, Chief Evangelist for Apache Cassandra at DataStax, explains why Cassandra is a key player in database technologies. Large and small companies alike choose Apache Cassandra as their database solution, and Patrick explains why they made this choice. He also discusses Cassandra’s architecture, including data modeling, time-series storage and replication strategies, providing a holistic overview of how Cassandra works and the best way to get started. This talk was recorded at the Big Data Gurus meetup at Samsung R&D.
