Strata Hadoop World NY 2016 - Data science & advanced analytics

Strata Hadoop World NY 2016 had the following interesting talks in its Data science & advanced analytics track.

R for big data by Garrett Grolemund and Nathan Stephens

Garrett Grolemund and Nathan Stephens explore the new sparklyr package by RStudio, which provides a familiar interface between the R language and Apache Spark and communicates with the Spark SQL and the Spark ML APIs so R users can easily manipulate and analyze data at scale.

Machine learning in Python by Andreas Mueller

Scikit-learn, which provides easy-to-use interfaces to perform advanced analysis and build powerful predictive models, has emerged as one of the most popular open source machine-learning toolkits. Using scikit-learn and Python as examples, Andreas Mueller offers an overview of basic concepts of machine learning, such as supervised and unsupervised learning, cross-validation, and model selection.
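To make the cross-validation idea concrete, here is a minimal sketch in plain Python. It is not taken from the talk and does not use scikit-learn's API; the helper names (`k_fold_indices`, `majority_class`, `cross_val_accuracy`) and the toy "model" that always predicts the most common training label are assumptions made purely for illustration.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


def majority_class(labels):
    """Hypothetical stand-in for a model: return the most common label."""
    return max(sorted(set(labels)), key=labels.count)


def cross_val_accuracy(labels, k):
    """Average held-out accuracy of the majority-class baseline over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        # "Fit" on the training folds only, then score on the held-out fold.
        prediction = majority_class([labels[i] for i in train_idx])
        correct = sum(1 for i in test_idx if labels[i] == prediction)
        scores.append(correct / len(test_idx))
    return mean(scores)


if __name__ == "__main__":
    toy_labels = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
    print(cross_val_accuracy(toy_labels, k=5))
```

Each sample is used for testing exactly once and never appears in the training set of its own fold, which is the property that makes the averaged score a fairer estimate than accuracy measured on the training data.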

Practical machine learning by Michael Li and Robert Schroll

Tianhui Michael Li and Robert Schroll of the Data Incubator offer a foundation in building intelligent business applications using machine learning, walking you through all the steps to prototyping and production (data cleaning, feature engineering, model building and evaluation, and deployment) and diving into an application for anomaly detection and a personalized recommendation engine.

Deep learning with TensorFlow by Martin Wicke and Josh Gordon

Martin Wicke and Josh Gordon offer hands-on experience training and deploying a machine-learning system using TensorFlow, a popular open source library. You'll learn how to build machine-learning systems from simple classifiers to complex image-based models as well as how to deploy models in production using TensorFlow Serving.

Guerrilla guide to Python and Apache Hadoop by Juliet Hougland and Sean Owen

Juliet Hougland and Sean Owen offer a practical overview of the basics of using Python data tools with a Hadoop cluster, covering HDFS connectivity and dealing with raw data files, running SQL queries with a SQL-on-Hadoop system like Apache Hive or Apache Impala (incubating), and using Apache Spark to write more complex analytical jobs.

Interactive data applications in Python by Bryan Van de Ven and Sarah Bird

Bryan Van de Ven and Sarah Bird demonstrate how to build intelligent apps in a week with Bokeh, Python, and optimization.

Why should I trust you? Explaining the predictions of machine-learning models by Carlos Guestrin

Despite widespread adoption, machine-learning models remain mostly black boxes, making it very difficult to understand the reasons behind a prediction. Such understanding is fundamentally important to assess trust in a model before we take actions based on a prediction or choose to deploy a new ML service. Carlos Guestrin offers a general approach for explaining predictions made by any ML model.

Data science at eHarmony: A generalized framework for personalization by Jonathan Morra

Data science has always been a focus at eHarmony, but recently more business units have needed data-driven models. Jonathan Morra introduces Aloha, an open source project that allows the modeling group to quickly deploy type-safe accurate models to production, and explores how eHarmony creates models with Apache Spark and how it uses them.

Iterative supervised clustering: A dance between data scientists and machine learning by June Andrews

Clustering algorithms produce vectors of information, which are almost surely difficult to interpret. These are then laboriously translated by data scientists into insights for influencing product and executive decisions. June Andrews offers an overview of a human-in-the-loop method used at Pinterest and LinkedIn that has led to fast, accurate, and pertinent human-readable insights.

How the Washington Post uses machine learning to predict article popularity by Eui-Hong Han and Shuguang Wang

Predicting which stories will become popular is an invaluable tool for newsrooms. Eui-Hong Han and Shuguang Wang explain how the Washington Post predicts what stories on its site will be popular with readers and share the challenges they faced in developing the tool and metrics on how they refined the tool to increase accuracy.

Using parallel graph-processing libraries for cancer genomics by Crystal Valentine

Crystal Valentine explains how the large graph-processing frameworks that run on Hadoop can be used to detect significantly mutated protein signaling pathways in cancer genomes through a probabilistic analysis of large protein-protein interaction networks, using techniques similar to those used in social network analysis algorithms.

Conditional recurrent neural nets, generative AI Twitter bots, and DL4J by Josh Patterson and David Kale

Can machines be creative? Josh Patterson and David Kale offer a practical demonstration—an interactive Twitter bot that users can ping to receive a response dynamically generated by a conditional recurrent neural net implemented using DL4J—that suggests the answer may be yes.

Unlocking unstructured text data with summarization by Michael Lee Williams

Our ability to extract meaning from unstructured text data has not kept pace with our ability to produce and store it, but recent breakthroughs in recurrent neural networks are allowing us to make exciting progress in computer understanding of language. Building on these new ideas, Michael Williams explores three ways to summarize text and presents prototype products for each approach.

Removing complexity from scalable machine learning by Martin Wicke

Much of the success of deep learning in recent years can be attributed to scale (bigger datasets and more computing power), but scale can quickly become a problem. Distributed, asynchronous computing in heterogeneous environments is complex, hard to debug, and hard to profile and optimize. Martin Wicke demonstrates how to automate or abstract away such complexity, using TensorFlow as an example.

Tackling machine-learning complexity for data curation by Ihab Ilyas

Machine-learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details in configuring and deploying ML techniques are the biggest hurdle. Ihab Ilyas explains why leveraging data semantics and domain-specific knowledge is key in delivering the optimizations necessary for truly scalable ML curation solutions.

Recent advances in applications of deep learning for text and speech by Yishay Carmiel

Deep learning has taken us a few steps further toward achieving AI for a man-machine interface. However, deep learning technologies like speech recognition and natural language processing remain a mystery to many. Yishay Carmiel reviews the history of deep learning, the impact it's made, recent breakthroughs, interesting solved and open problems, and what's in store for the future.

Data science and the Internet of Things: It's just the beginning by Mike Stringer

We're likely just at the beginning of data science. The people and things that are starting to be equipped with sensors will enable entirely new classes of problems that will have to be approached more scientifically. Mike Stringer outlines some of the issues that may arise for business, for data scientists, and for society.

Semantic natural language understanding with Spark Streaming, UIMA, and machine-learned ontologies by David Talby and Claudiu Branzan

David Talby and Claudiu Branzan lead a live demo of an end-to-end system that makes nontrivial clinical inferences from free-text patient records. Infrastructure components include Kafka, Spark Streaming, Spark, Titan, and Elasticsearch; data science components include custom UIMA annotators, curated taxonomies, machine-learned dynamic ontologies, and real-time inferencing.

Recent developments in SparkR for advanced analytics by Xiangrui Meng

Xiangrui Meng explores recent community efforts to extend SparkR for scalable advanced analytics—including summary statistics, single-pass approximate algorithms, and machine-learning algorithms ported from Spark MLlib—and shows how to integrate existing R packages with SparkR to accelerate existing R workflows.

Fast deep learning at your fingertips by Amitai Armon and Nir Lotan

Amitai Armon and Nir Lotan outline a new, free software tool that enables the creation of deep learning models quickly and easily. The tool is based on existing deep learning frameworks and incorporates extensive optimizations that provide high performance on standard CPUs.

Model visualization by Amit Kapoor

Though visualization is used in data science to understand the shape of the data, it's not widely used for statistical models, which are evaluated based on numerical summaries. Amit Kapoor explores model visualization, which aids in understanding the shape of the model, the impact of parameters and input data on the model, the fit of the model, and where it can be improved.

A data-driven approach to the US presidential election by Amir Hajian, Khaled Ammar, and Alex Constandache

Amir Hajian, Khaled Ammar, and Alex Constandache offer an approach to mining a large dataset to predict the electability of hypothetical candidates in the US presidential election race, using machine learning, natural language processing, and deep learning on an infrastructure that includes Spark and Elasticsearch, which serves as the backbone of the mobile game White House Run.

Machine-learning techniques for class imbalances and adversaries by Brendan Herger

Many areas of applied machine learning require models optimized for rare occurrences, such as class imbalances, and users actively attempting to subvert the system (adversaries). Brendan Herger offers an overview of multiple published techniques that specifically attempt to address these issues and discusses lessons learned by the Data Innovation Lab at Capital One.

Machine intelligence at Google scale by Kazunori Sato

The largest challenge for deep learning is scalability. Google has built a large-scale neural network in the cloud and is now sharing that power. Kazunori Sato introduces pretrained ML services, such as the Cloud Vision API and the Speech API, and explores how TensorFlow and Cloud Machine Learning can accelerate custom model training 10x–40x with Google's distributed training infrastructure.

Evaluating models for a needle in a haystack: Applications in predictive maintenance by Danielle Dean and Shaheen Gauher

In the realm of predictive maintenance, the event of interest is an equipment failure. In real scenarios, this is usually a rare event. Unless the data collection has been taking place over a long period of time, the data will have very few of these events or, in the worst case, none at all. Danielle Dean and Shaheen Gauher discuss the various ways of building and evaluating models for such data.
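One reason evaluation is tricky for rare events is that plain accuracy can look excellent for a model that never predicts a failure. The sketch below illustrates this with made-up counts (5 failures in 1,000 observations); the numbers, function names, and the two hypothetical predictors are assumptions for illustration, not material from the talk.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating label 1 as the rare 'failure' class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def evaluate(y_true, y_pred):
    """Compute accuracy, precision, and recall for the failure class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Convention assumed here: report 0.0 when a ratio's denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


if __name__ == "__main__":
    # Made-up data: 5 failures followed by 995 healthy observations.
    y_true = [1] * 5 + [0] * 995

    # Hypothetical predictor A: never predicts a failure.
    always_healthy = [0] * 1000

    # Hypothetical predictor B: catches 4 of 5 failures, raises 6 false alarms.
    detector = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 989

    print("always healthy:", evaluate(y_true, always_healthy))
    print("detector:      ", evaluate(y_true, detector))
```

In this toy setup the predictor that never flags a failure scores slightly higher on accuracy than the detector, yet its recall is zero, which is why metrics focused on the rare class tend to be more informative here.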

Predicting patent litigation by Josh Lemaitre

How can the value of a patent be quantified? Josh Lemaitre explores how Thomson Reuters Labs approached this problem by applying machine learning to the patent corpus in an effort to predict those most likely to be enforced via litigation. Josh covers infrastructure, methods, challenges, and opportunities for future research.