Seminar - Apache Beam and Google Cloud Dataflow

School of Engineering and Computer Science Seminar

Speaker: Dr. Neal Glew (Google Inc)
Time: Thursday 2nd March 2017 at 11:00 AM - 12:00 PM
Location: Cotton Club, Cotton 350


Abstract

Apache Beam is a recent open source framework for big data processing. It unifies batch and continuous stream processing into a single programming model and provides primitives for elementwise processing and aggregation of data. Aggregation requires dividing up potentially unbounded data streams into groups that will be aggregated. Beam provides a notion of windowing that allows independent specification of what is to be computed, where in event time data is of interest, when in processing time computation should happen, and how late data should be dealt with.

The Beam programming model is realised as language-specific SDKs. Currently there is a Java and a Python SDK, and more will likely be developed in the future. Once application developers choose their language they simply code against the SDK, and their pipelines will run on any implementation. Beam also has the notion of a runner: something that can run a Beam pipeline. There are several runners based on other open source projects such as Apache Spark and Apache Flink. There is also a runner for Google's cloud, called Google Cloud Dataflow. Pipelines can be moved easily from one runner to another, and from on-premises systems to different public cloud providers.
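To give a flavour of the windowing idea described above, here is a minimal plain-Python sketch (not the Beam API itself) of the "what" and "where" dimensions: summing values ("what") grouped into fixed event-time windows ("where"). The window size, event data, and function names are illustrative assumptions; triggers ("when") and late-data handling ("how") are omitted.

```python
from collections import defaultdict

def assign_fixed_window(event_time, size):
    # Map an event-time timestamp to the start of its fixed window.
    return event_time - (event_time % size)

def windowed_sum(events, window_size):
    # Group (event_time, value) pairs into fixed event-time windows
    # and sum the values in each window. This mirrors Beam's "what"
    # (a sum) and "where" (fixed windows); triggers and late data
    # are not modelled in this sketch.
    sums = defaultdict(int)
    for event_time, value in events:
        sums[assign_fixed_window(event_time, window_size)] += value
    return dict(sums)

# Events as (event_time_seconds, value), aggregated in 60-second windows.
events = [(5, 1), (42, 2), (61, 3), (119, 4), (130, 5)]
print(windowed_sum(events, 60))  # {0: 3, 60: 7, 120: 5}
```

In Beam's Python SDK the same shape is expressed declaratively, with the windowing specification attached to the data rather than baked into the aggregation logic, so the same sum works over batch and streaming input.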

Google Cloud Dataflow is a fully managed cloud service for running Beam pipelines. It automatically handles details like spinning up workers or other resources, scaling the number of workers dynamically as the job progresses, partitioning data among workers, monitoring, etc. It also integrates well with other Google Cloud services such as Pub/Sub, Datastore, BigTable, and BigQuery.

In this talk I will provide an introduction to the Beam programming model and go over other interesting aspects of Beam and Dataflow.

Bio

Neal grew up in Wellington, obtained a BSc(Hons) from VUW, and a PhD in programming languages (computer science) from Cornell University. He has worked at InterTrust Technologies Corporation, Intel, and now Google. In his twelve years at Intel he worked on Java virtual machines, programming models for parallelism, and a functional language compiler. In his two years at Google he has worked on the Flume project (an inspiration for, and earlier version of, Dataflow and then Beam), and most recently on the shuffle subsystem for Flume.
