Strata Hadoop World NY 2016 - Spark & beyond Track
Strata Hadoop World NY 2016 features the following interesting talks in its Spark & beyond track:
Building machine-learning apps with Spark: MLlib, ML Pipelines, and GraphX by Vartika Singh and Jayant Shekhar
Vartika Singh and Jayant Shekhar walk you through techniques for building and tuning machine-learning apps with Spark MLlib and Spark ML Pipelines, as well as graph processing with GraphX.
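Of the three, GraphX is usually the least familiar. As a taste of what graph processing in Spark looks like, here is a minimal sketch that runs PageRank over a tiny invented follower graph; the vertex names, edge weights, and app name are illustrative only, not material from the tutorial.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("graphx-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Vertices carry a name; edges carry a unit weight.
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
val graph = Graph(vertices, edges)

// PageRank is one of the algorithms that ships with GraphX.
val ranks = graph.pageRank(0.001).vertices
ranks.join(vertices).collect().foreach { case (_, (rank, name)) =>
  println(f"$name: $rank%.3f")
}
```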
Spark camp: Exploring Wikipedia with Spark by Zoltan C. Toth
The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Through hands-on examples, Zoltan Toth explores various Wikipedia datasets to illustrate the programming paradigms best suited to each of these tasks.
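As a taste of the batch-analytics side of such a session, the sketch below loads a hypothetical local dump of Wikipedia pageview counts and lists the most-viewed English-Wikipedia pages; the file path and the assumed line layout are placeholders, not the session's actual dataset.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("wikipedia-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Assumed layout of each line: "project page_title view_count response_bytes".
val pageviews = spark.read.textFile("/data/wikipedia/pageviews.txt")
  .flatMap { line =>
    line.split(" ") match {
      case Array(project, title, views, _) => Some((project, title, views.toLong))
      case _                               => None
    }
  }
  .toDF("project", "title", "views")

// Top ten English-Wikipedia pages by views; a typical batch-analytics step.
pageviews.filter($"project" === "en").orderBy($"views".desc).show(10)
```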
Just enough Scala for Spark by Dean Wampler
Apache Spark is written in Scala. Hence, many if not most data engineers adopting Spark are also adopting Scala, while most data scientists continue to use Python and R. Dean Wampler offers an overview of the core features of Scala you need to use Spark effectively, using hands-on exercises with the Spark APIs.
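A handful of Scala features carry most of the weight in everyday Spark code: case classes (which double as Dataset schemas), function literals, and pattern matching. A minimal self-contained sketch, with invented sensor readings:

```scala
import org.apache.spark.sql.SparkSession

// A case class doubles as the schema of a typed Dataset.
case class Reading(sensor: String, value: Double)

val spark = SparkSession.builder().appName("scala-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val readings = Seq(Reading("a", 1.5), Reading("a", 3.0), Reading("b", 2.0)).toDS()

// A function literal drives the filter; types are inferred.
val high = readings.filter(r => r.value > 1.9)

// Pattern matching destructures each row in a typed map.
high.map { case Reading(sensor, value) => s"$sensor=$value" }.show()
```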
Architecting a data platform by John Akred and Stephen O'Sullivan and Mauricio Vacas
What are the essential components of a data platform? John Akred, Mauricio Vacas, and Stephen O'Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.
The state of Spark and what's next after Spark 2.0 by Ram Sriharsha
Ram Sriharsha reviews major developments in Apache Spark 2.0 and discusses future directions for the project to make Spark faster and easier to use for a wider array of workloads, with an emphasis on API evolution, single-node performance (Project Tungsten Phase 3), and Structured Streaming.
Top five mistakes when writing Spark applications by Ted Malaska and Mark Grover
Ted Malaska and Mark Grover cover the top five things that prevent Spark developers from getting the most out of their Spark clusters. When these issues are addressed, it is not uncommon to see the same job running 10x or 100x faster with the same clusters and the same data, using just a different approach.
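A classic example of the kind of issue such talks cover is shuffling more data than necessary. The sketch below contrasts groupByKey, which ships every value across the network, with reduceByKey, which combines map-side first; the word data is invented, and this is one illustrative fix rather than the talk's full list.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("shuffle-sketch").master("local[*]").getOrCreate()
val words = spark.sparkContext
  .parallelize(Seq("spark", "hadoop", "spark", "kafka", "spark"))
  .map(w => (w, 1))

// Costly: ships every (word, 1) pair across the network before summing.
val slow = words.groupByKey().mapValues(_.sum)

// Cheaper: sums within each partition first, shuffling only partial counts.
val fast = words.reduceByKey(_ + _)

fast.collect().foreach(println)
```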
Tuning Spark machine-learning workloads by Raj Krishnamurthy
Spark's efficiency and speed can help reduce the TCO of existing clusters, because its performance advantages let it complete processing in drastically shorter batch windows at a higher performance per dollar. Raj Krishnamurthy offers a detailed walk-through of an alternating least squares-based matrix factorization workload whose runtime was improved by a factor of 2.22.
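For orientation, here is what a minimal ALS workload looks like with spark.ml; the tiny in-memory ratings, rank, and iteration count are invented, and knobs like these are exactly what such tuning work targets.

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("als-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up (user, item, rating) triples standing in for a real ratings table.
val ratings = Seq(
  (0, 10, 4.0), (0, 11, 1.0),
  (1, 10, 5.0), (1, 12, 2.0),
  (2, 11, 3.0), (2, 12, 4.0)
).toDF("userId", "itemId", "rating")

// Rank and iteration count are typical tuning knobs for this workload.
val als = new ALS()
  .setRank(5)
  .setMaxIter(10)
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("rating")

val model = als.fit(ratings)
model.userFactors.show()
```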
Delivering near real-time mobility insights at Swisscom by François Garillot
Swisscom, the leading mobile service provider in Switzerland, also provides data-driven intelligence through the analysis of its mobile network. Its Mobility Insights team works to help administrators understand the flow of people through their location of interest. François Garillot explores the platform, tooling, and choices that help achieve this service and some challenges the team has faced.
Breaking Spark: The top five mistakes to avoid when using Apache Spark in production by Neelesh Srinivas Salian
Drawing on his experiences across 150+ production deployments, Neelesh Srinivas Salian focuses on five common issues observed in a cluster environment setup with Apache Spark (Core, Streaming, and SQL) to help you improve the usability and supportability of Apache Spark and avoid such issues in future deployments.
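Some of the usual suspects come down to plain configuration. Below is a hedged sketch of two settings that frequently matter in production Spark jobs; the values are illustrative, not recommendations from the talk.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only; the right numbers depend on the cluster and data.
val spark = SparkSession.builder()
  .appName("prod-config-sketch")
  // Kryo is typically much faster and more compact than Java serialization.
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // The default of 200 shuffle partitions rarely matches real data volumes.
  .config("spark.sql.shuffle.partitions", "400")
  .getOrCreate()
```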
A deep dive into Structured Streaming in Spark by Ram Sriharsha
Structured Streaming is a new effort in Apache Spark to make stream processing simple without the need to learn a new programming paradigm or system. Ram Sriharsha offers an overview of Structured Streaming, discussing its support for event-time, out-of-order/delayed data, sessionization, and integration with the batch data stack to show how it simplifies building powerful continuous applications.
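A minimal sketch of the programming model: an event-time windowed word count over a socket source, expressed with the same DataFrame operations used in batch jobs. The host, port, and the assumption that each input line is "timestamp,word" are placeholders, not part of the talk.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder().appName("streaming-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Socket source; each line is assumed to be "yyyy-MM-dd HH:mm:ss,word".
val events = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .as[String]
  .map { line =>
    val Array(ts, word) = line.split(",")
    (Timestamp.valueOf(ts), word)
  }
  .toDF("time", "word")

// Count per word over ten-minute event-time windows; the engine, not the
// application, keeps the state needed to handle out-of-order arrivals.
val counts = events.groupBy(window($"time", "10 minutes"), $"word").count()

counts.writeStream.outputMode("complete").format("console").start().awaitTermination()
```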
Apache Spark in fintech: Building fraud detection applications with distributed machine learning at Intel by Yuhao Yang
Through collaboration with some of the top payments companies around the world, Intel has developed an end-to-end solution for building fraud detection applications. Yuhao Yang explains how Intel used and extended Spark DataFrames and ML Pipelines to build the tool chain for financial fraud detection and shares the lessons learned during development.
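The general DataFrames-plus-Pipelines shape of such an application can be sketched in a few lines. What follows is an invented toy, not Intel's tool chain: two made-up transaction features feed a logistic regression through a single Pipeline.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("fraud-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// label 1.0 marks a (made-up) fraudulent transaction.
val txns = Seq(
  (100.0, 1, 0.0), (25.0, 3, 0.0), (9000.0, 1, 1.0), (8500.0, 2, 1.0)
).toDF("amount", "hourOfDay", "label")

// Feature assembly and the classifier chained as one Pipeline.
val assembler = new VectorAssembler()
  .setInputCols(Array("amount", "hourOfDay"))
  .setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(20)

val model = new Pipeline().setStages(Array(assembler, lr)).fit(txns)
model.transform(txns).select("amount", "probability", "prediction").show()
```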
Spark Structured Streaming for machine learning by Holden Karau and Seth Hendrickson
Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Spark's new Structured Streaming and walk you through creating your own streaming model.
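One common pattern in this space, sketched below under stated assumptions, is to fit a pipeline offline and then score a Structured Streaming source with it; the saved-model path and the "amount,hourOfDay" input format are hypothetical.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("streaming-ml-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical path to a pipeline fitted and saved by a batch training job.
val model = PipelineModel.load("/models/fraud")

// Each incoming line is assumed to be "amount,hourOfDay".
val txns = spark.readStream
  .format("socket").option("host", "localhost").option("port", 9999).load()
  .as[String]
  .map { line =>
    val Array(amount, hour) = line.split(",")
    (amount.toDouble, hour.toInt)
  }
  .toDF("amount", "hourOfDay")

// transform() replays the fitted feature steps and classifier on each micro-batch.
model.transform(txns).select("amount", "prediction")
  .writeStream.format("console").start().awaitTermination()
```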
Choice Hotels's journey to better understand its customers through self-service analytics by Narasimhan Sampath and Avinash Ramineni
Narasimhan Sampath and Avinash Ramineni share how Choice Hotels International used Spark Streaming, Kafka, Spark, and Spark SQL to create an advanced analytics platform that makes business users self-reliant: they access the data they need from a variety of sources, generate customer insights and property dashboards, and make data-driven decisions with minimal IT engagement.
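As a rough sketch of the ingestion side of such a platform, Spark's Kafka source for Structured Streaming lets a stream be registered as a view and queried with plain Spark SQL; the broker address and topic below are placeholders, not Choice Hotels' actual setup.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-ingest-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Placeholder broker and topic; each Kafka record's value is read as a string.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "customer-events")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")

// Registered as a view, the stream is queryable with plain Spark SQL.
events.createOrReplaceTempView("customer_events")
val insights = spark.sql("SELECT count(*) AS events FROM customer_events")

insights.writeStream.outputMode("complete").format("console").start().awaitTermination()
```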
Spark and Java: Yes, they work together by Jesse Anderson
Although Spark gets a lot of attention, most people think of only two languages as supported: Python and Scala. Jesse Anderson proves that Java works just as well. With lambdas, Java even gets syntax comparable to Scala's, so Java developers can have the best of both worlds without having to learn Scala.