Strata Hadoop World NY 2016 - Data Innovations Track

Strata Hadoop World NY 2016 has the following interesting talks in its Data Innovations Track.

Parallel SQL and analytics with Solr by Yonik Seeley

Yonik Seeley explores recent Apache Solr features in the areas of faceting and analytics, including parallel SQL, streaming expressions, distributed join, and distributed graph queries, as well as the trade-offs of different approaches and strategies for maximizing scalability.
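
As a rough illustration of the parallel SQL feature, here is a minimal sketch of sending a SQL statement to Solr's /sql handler (Solr 6+) from Python. The host, the techproducts example collection, and the manu field are assumptions for illustration, not details from the talk.

```python
# Minimal sketch: send a SQL statement to Solr's /sql handler.
# Host, collection ("techproducts"), and field ("manu") are placeholders.
import requests

SOLR_SQL_URL = "http://localhost:8983/solr/techproducts/sql"

params = {
    # "facet" pushes the aggregation into Solr's JSON facet engine;
    # "map_reduce" shuffles tuples to worker nodes for fully parallel SQL.
    "aggregationMode": "facet",
    "stmt": "SELECT manu, count(*) FROM techproducts "
            "GROUP BY manu ORDER BY count(*) DESC LIMIT 10",
}

resp = requests.post(SOLR_SQL_URL, data=params)
resp.raise_for_status()

# Results come back as a stream of tuples; the last tuple is an EOF marker.
for tup in resp.json()["result-set"]["docs"]:
    print(tup)
```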

File format benchmark: Avro, JSON, ORC, and Parquet by Owen O'Malley

Picking the best data format depends on what kind of data you have and how you plan to use it. Owen O'Malley outlines the performance differences between formats in different use cases and offers an overview of the advantages and disadvantages of each to help you improve the performance of your applications.
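
As a companion to this comparison, here is a minimal sketch (not the talk's benchmark) of writing and reading the same data in several formats with PySpark. Paths and columns are made up, and Avro would additionally require the spark-avro package.

```python
# Minimal sketch (not the talk's benchmark): write one DataFrame as JSON,
# ORC, and Parquet, then time a filtered read on each. Paths and columns
# are placeholders; Avro would also need the spark-avro package.
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-comparison").getOrCreate()

df = spark.range(0, 1000000).selectExpr(
    "id", "id % 100 AS category", "rand() AS value"
)

for fmt in ["json", "orc", "parquet"]:
    df.write.mode("overwrite").format(fmt).save("/tmp/bench/" + fmt)

for fmt in ["json", "orc", "parquet"]:
    start = time.time()
    n = spark.read.format(fmt).load("/tmp/bench/" + fmt).where("category = 7").count()
    print(fmt, n, round(time.time() - start, 2), "seconds")
```

The columnar formats generally come out ahead on the filtered read because they scan only the referenced columns, which is the kind of trade-off the talk quantifies.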

JupyterLab: The evolution of the Jupyter Notebook by Brian Granger, Sylvain Corlay, and Jason Grout

Brian Granger, Sylvain Corlay, and Jason Grout offer an overview of JupyterLab, the next-generation user interface for Project Jupyter that puts Jupyter Notebooks within a powerful user interface that allows the building blocks of interactive computing to be assembled to support a wide range of interactive workflows used in data science.

Designing a location intelligence platform for everyone by integrating data, analysis, and cartography by Stuart Lynn and Andy Eschbacher

Geospatial analysis can provide deep insights into many datasets. Unfortunately the key tools to unlocking these insights—geospatial statistics, machine learning, and meaningful cartography—remain inaccessible to nontechnical audiences. Stuart Lynn and Andy Eschbacher explore the design challenges in making these tools accessible and integrated in an intuitive location intelligence platform.

The future of column-oriented data processing with Arrow and Parquet by Julien Le Dem and Jacques Nadeau

In pursuit of speed, big data is evolving toward columnar execution. The solid foundation laid by Arrow and Parquet for a shared columnar representation across the ecosystem promises a great future. Julien Le Dem and Jacques Nadeau discuss the future of columnar and the hardware trends it takes advantage of, like RDMA, SSDs, and nonvolatile memory.
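
A small sketch of what that shared representation looks like in practice with pyarrow (column names and the path are illustrative): the same columnar table moves between memory (Arrow) and disk (Parquet) without a row-by-row rewrite.

```python
# Minimal sketch: build an Arrow table in memory, persist it as Parquet,
# and read a single column back. Column names and the path are placeholders.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pydict({
    "user_id": [1, 2, 3],
    "score": [0.5, 0.7, 0.9],
})

# Arrow is the in-memory columnar layout; Parquet is the on-disk one.
pq.write_table(table, "/tmp/example.parquet")

# Column projection: only the "score" column is read from disk.
roundtrip = pq.read_table("/tmp/example.parquet", columns=["score"])
print(roundtrip.schema)
```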

Big data architectural patterns and best practices on AWS by Siva Raghupathy

Siva Raghupathy demonstrates how to use Hadoop innovations in conjunction with cloud innovations from Amazon Web Services.

Beyond Hadoop at Yahoo: Interactive analytics with Druid by Himanshu Gupta

Himanshu Gupta explains why Yahoo has been increasingly investing in interactive analytics and how it leverages Druid to power a variety of internal- and external-facing data applications.

The Netflix data platform: Now and in the future by Kurt Brown

The Netflix data platform is constantly evolving, but fundamentally it's an all-cloud platform at a massive scale (40+ PB and over 700 billion new events per day) focused on empowering developers. Kurt Brown dives into the current technology landscape at Netflix and offers some thoughts on what the future holds.

Parquet performance tuning: The missing guide by Ryan Blue

Netflix is exploring new avenues for data processing where traditional approaches fail to scale. Ryan Blue explains how Netflix is building on Parquet to enhance its 40+ petabyte warehouse, combining Parquet's features with Presto and Spark to boost ETL and interactive queries. Information about tuning Parquet is hard to find. Ryan shares what he's learned, creating the missing guide you need.
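
For a flavor of the kind of knobs involved, here is a minimal sketch of a few Parquet-related settings in Spark; the values and paths are illustrative, not recommendations from the talk.

```python
# Minimal sketch of a few Parquet-related settings in Spark; values and
# paths are illustrative, not recommendations from the talk.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parquet-tuning")
    # Codec trade-off: snappy decodes faster, gzip produces smaller files.
    .config("spark.sql.parquet.compression.codec", "snappy")
    # Let Spark push filters down so whole row groups can be skipped.
    .config("spark.sql.parquet.filterPushdown", "true")
    # Row-group size (bytes) is the unit of I/O and of row-group skipping.
    .config("spark.hadoop.parquet.block.size", str(128 * 1024 * 1024))
    .getOrCreate()
)

df = spark.read.parquet("/warehouse/events")  # placeholder path

# Sorting within files on a commonly filtered column tightens the min/max
# statistics in each row group, which makes skipping more effective.
(df.sortWithinPartitions("event_date")
   .write.mode("overwrite")
   .parquet("/warehouse/events_tuned"))
```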

The evolution of massive-scale data processing by Tyler Akidau

Tyler Akidau offers a whirlwind tour of the conceptual building blocks of massive-scale data processing systems over the last decade, comparing and contrasting systems at Google with popular open source systems in use today.

Lessons learned running Hadoop and Spark in Docker by Thomas Phelan

Many initiatives for running applications inside containers have been scoped to run on a single host. Using Docker containers for large-scale environments poses new challenges, especially for big data applications like Hadoop. Thomas Phelan shares lessons learned and some tips and tricks on how to Dockerize your big data applications in a reliable, scalable, and high-performance environment.

Streaming analytics at 300 billion events per day with Kafka, Samza, and Druid by Xavier Léauté

Ever wondered what it takes to scale Kafka, Samza, and Druid to handle complex, heterogeneous analytics workloads at petabyte size? Xavier Léauté discusses his experience scaling Metamarkets's real-time processing to over 3 million events per second and shares the challenges encountered and lessons learned along the way.

Alluxio (formerly Tachyon): The journey thus far and the road ahead by Haoyuan Li

Haoyuan Li offers an overview of Alluxio (formerly Tachyon), a memory-speed virtual distributed storage system. In the past year, the Alluxio project experienced a tremendous improvement in performance and scalability and was extended with key new features. This year, the goal is to make Alluxio accessible to an even wider set of users through a focus on security, new language bindings, and APIs.
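
For a sense of how it is consumed, here is a minimal sketch of reading a file through Alluxio from Spark; the only difference from HDFS is the URI scheme. The host and path are assumptions, 19998 is Alluxio's default master RPC port, and the Alluxio client jar must be on the Spark classpath.

```python
# Minimal sketch: read through Alluxio from Spark by swapping the URI scheme.
# Host and path are placeholders; 19998 is Alluxio's default master RPC port,
# and the Alluxio client jar must be on the Spark classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("alluxio-read").getOrCreate()

# Served from Alluxio's memory tier when the file is already cached there.
lines = spark.sparkContext.textFile("alluxio://alluxio-master:19998/logs/events.log")
print(lines.count())
```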

Smart data for smarter firefighters by Bart van Leeuwen

Smart data allows fire services to better protect the people they serve and keep their firefighters safe. The combination of open and nonpublic data used in a smart way generates new insights both in preparation and operations. Bart van Leeuwen discusses how the fire service is benefiting from open standards and best practices.

Data modeling for microservices with Cassandra and Spark by Jeffrey Carpenter

Jeff Carpenter describes how data modeling can be a key enabler of microservice architectures for transactional and analytics systems, including service identification, schema design, and event streaming.
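
As a small illustration of the query-first modeling the talk refers to, here is a hedged sketch using the DataStax Python driver; the keyspace, table, and column names are invented for the example.

```python
# Minimal sketch of query-first modeling with the DataStax Python driver.
# Keyspace, table, and column names are invented for illustration.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS shop
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# One table per access pattern: "orders for a customer, newest first".
# The partition key keeps a customer's orders together on the same replicas;
# the clustering order serves the query without an extra sort.
session.execute("""
    CREATE TABLE IF NOT EXISTS shop.orders_by_customer (
        customer_id uuid,
        order_time  timestamp,
        order_id    uuid,
        total       decimal,
        PRIMARY KEY ((customer_id), order_time, order_id)
    ) WITH CLUSTERING ORDER BY (order_time DESC, order_id ASC)
""")
```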

An introduction to Druid by Fangjin Yang

Cluster computing frameworks such as Hadoop or Spark are tremendously beneficial in processing and deriving insights from data. However, long query latencies make these frameworks suboptimal choices to power interactive applications. Fangjin Yang discusses using Druid for analytics and explains why the architecture is well suited to power analytic dashboards.
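
To make that concrete, here is a minimal sketch of a Druid native timeseries query posted to a broker; the data source, metric names, and broker address are placeholders, and 8082 is the broker's default port.

```python
# Minimal sketch: post a native timeseries query to a Druid broker.
# Data source, metric names, and the broker address are placeholders;
# 8082 is the broker's default port.
import json
import requests

query = {
    "queryType": "timeseries",
    "dataSource": "pageviews",
    "granularity": "hour",
    "intervals": ["2016-09-01/2016-09-02"],
    "aggregations": [
        {"type": "count", "name": "rows"},
        {"type": "longSum", "name": "views", "fieldName": "views"},
    ],
}

resp = requests.post(
    "http://druid-broker:8082/druid/v2/",
    data=json.dumps(query),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()

# Each element covers one granularity bucket within the queried interval.
for bucket in resp.json():
    print(bucket["timestamp"], bucket["result"])
```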