Jeff Dean (Google Senior Fellow) – Large Scale Deep Learning with TensorFlow (Aug 2016)


Chapters

00:00:00 Google Brain Project: Research, Tools, and Applications of Neural Networks
00:08:36 Machine Learning Research with TensorFlow
00:13:19 Advancements in Neural Networks for Speech and Image Recognition
00:18:44 Essential Properties of Machine Learning Systems
00:21:52 TensorFlow: Design, Implementation, and Open Source
00:33:03 TensorFlow: Recent Developments and Adoption
00:36:25 Accelerating Machine Learning Research Through Data Parallelism
00:42:52 Data and Model Parallelism Techniques for Deep Learning Training
00:56:53 Extensible Machine Learning Hardware for Deep Learning
01:04:09 TensorFlow: Graph Representation and Execution
01:09:25 Distributed TensorFlow: Graph Execution and Optimization
01:13:01 Placement of Computation for Distributed Machine Learning
01:15:53 TensorFlow: Execution Kernels and Optimization

Abstract

The Evolution and Impact of TensorFlow in Machine Learning and Research

In the rapidly evolving field of machine learning and research, Google’s TensorFlow has emerged as a pivotal framework. This article delves into TensorFlow’s development, its challenges, and aspirations, emphasizing how it has revolutionized machine learning applications like speech and image recognition. The discussion extends to TensorFlow’s scalability, flexibility, and real-world applicability, highlighting its profound impact on the research community and beyond.

Introduction

The merging of computer systems and machine learning expertise has accelerated research, particularly in robotics and language understanding. A key player in this evolution is TensorFlow, an open-source machine learning library. It addresses tasks such as image and speech recognition, where traditional computer systems have struggled but humans excel. TensorFlow’s introduction marked a significant shift in machine learning research, offering a scalable, flexible platform that has been widely adopted across many domains.

TensorFlow represents computations as a dataflow graph. Each node in the graph is an operation (op) with input and output tensors, and each op has one or more implementations, called kernels, optimized for different devices. Automatically generated Python wrappers let users define operations in Python while the underlying implementations live in C++. Optimizer functions extend the graph with ops that compute derivatives of variables with respect to the loss, which is how automatic differentiation for training is implemented. Graphs are serialized via protocol buffers, a compact representation that can be transferred across machines.
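
To make this concrete, here is a minimal sketch using the TensorFlow 1.x-era Python API from around the time of the talk; the softmax model and all names are chosen purely for illustration:

```python
import tensorflow as tf  # TensorFlow 1.x-era API, matching the 2016 talk

# Build a computational graph: each call adds an op (node) to the graph.
x = tf.placeholder(tf.float32, shape=[None, 784], name="x")
W = tf.Variable(tf.zeros([784, 10]), name="W")
b = tf.Variable(tf.zeros([10]), name="b")
logits = tf.matmul(x, W) + b  # MatMul and Add ops

labels = tf.placeholder(tf.float32, shape=[None, 10], name="labels")
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))

# Automatic differentiation: the optimizer extends the graph with ops
# that compute d(loss)/d(variable) for every trainable variable.
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

# The graph is backed by a protocol buffer and can be serialized.
graph_def = tf.get_default_graph().as_graph_def()
print(len(graph_def.node), "ops in the serialized GraphDef")
```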

Each operation in TensorFlow has a corresponding kernel defined in C++. The kernel specifies the computation performed on the input tensors to produce the output tensors, and its compute method is invoked on every step. For the matrix multiply operation, the kernel launches a matrix multiplication over the input tensors A and B. On GPU devices, TensorFlow additionally provides a stream context so the kernel can enqueue work on an execution stream, while other devices, such as TPU cards, use entirely different kernel implementations.
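
The C++ kernels themselves are not visible from Python, but per-device kernel selection can be observed at the Python level. A small sketch, assuming a 1.x-era setup:

```python
import tensorflow as tf

# Pin ops to devices; each op's kernel is selected per device at runtime.
with tf.device("/cpu:0"):
    a = tf.random_normal([1024, 1024])
with tf.device("/gpu:0"):
    b = tf.random_normal([1024, 1024])
    c = tf.matmul(a, b)  # GPU MatMul kernel, if a GPU is available

# log_device_placement prints which device ran each op;
# allow_soft_placement falls back to CPU kernels when no GPU is present.
config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)
with tf.Session(config=config) as sess:
    sess.run(c)
```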

Growth and Adoption of TensorFlow

Originally used for unsupervised learning research and speech recognition, TensorFlow’s growth within Google is evident: it has expanded into computer vision and many other areas, with a sharp increase in the number of machine learning systems built on it. Its popularity is reflected in its GitHub statistics, with extensive community engagement and rapid adoption: over half a million installs within months of its release. TensorFlow’s popularity continues to grow due to its ease of expression, scalability, portability, reproducibility, and real-product deployment capabilities. In 2015, it became the most forked new repository on GitHub, and its standing among open-source deep learning packages was highlighted in a Bloomberg article. The availability of comprehensive tutorials and learning resources further enhances its accessibility.

TensorFlow’s Technical Capabilities

TensorFlow’s core is a graph execution engine that executes computational graphs handed to it by diverse front ends such as Python and C++. It enables concise expression of complex machine learning models and supports large-scale training on clusters of machines and GPUs. This design gives models scalability and portability: they can be trained and deployed across a wide range of platforms and devices. Its reproducibility and real-world applicability have made it a preferred vehicle for moving research ideas into production systems.

To run a computation, a client creates a session and hands its graph to the master process, which serializes and stores it. The client then issues RunStep calls to the master, optionally feeding input values for specific nodes, and the master coordinates execution of the graph across multiple workers and devices, whether CPUs, GPUs, or other specialized hardware.

TensorFlow also supports selective execution of nodes within a graph: by specifying the desired output node, the system determines the minimal set of dependencies needed to compute it and runs only that subgraph, as sketched below. This lets a static graph exhibit dynamic behavior; certain components, such as checkpointing operations, can be executed only periodically, giving flexibility in how different parts of the graph are driven.
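
A small sketch of selective execution via fetches and feeds (all op names illustrative):

```python
import tensorflow as tf

a = tf.placeholder(tf.float32, name="a")
b = tf.constant(3.0, name="b")
c = a * b                # Mul op
d = c + 1.0              # Add op; depends on c
e = tf.sqrt(tf.abs(c))   # independent of d

with tf.Session() as sess:
    # Fetching only d runs just the ops d depends on (a, b, c, d);
    # e and its Sqrt/Abs kernels are never executed.
    print(sess.run(d, feed_dict={a: 2.0}))   # 7.0
    # A separate call can fetch a different subgraph.
    print(sess.run(e, feed_dict={a: -4.0}))  # sqrt(12)
```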

The session interface lets users specify a graph once and then call run many times, which matches the training workflow of running thousands or tens of thousands of steps through the same model. Because the same graph is executed repeatedly, this interface creates substantial opportunities for optimization: TensorFlow plans more work here, with the goal of generating very good code for the particular tensor sizes actually flowing through the model, even when those sizes are not known at compile time.
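
A sketch of that one-graph, many-runs pattern, reusing the names from the earlier example; `next_batch` stands in for a hypothetical input pipeline:

```python
# One graph, many run() calls (x, labels, train_step, loss as defined above).
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for step in range(10000):
        batch_x, batch_y = next_batch()  # hypothetical data-loading helper
        _, l = sess.run([train_step, loss],
                        feed_dict={x: batch_x, labels: batch_y})
        if step % 1000 == 0:
            print("step", step, "loss", l)
```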

Innovations in Machine Learning

TensorFlow has significantly influenced neural network work in speech and image recognition. For example, the error rate in image recognition dropped dramatically from 26% to around 3%, surpassing human error rates in some cases. In speech recognition, TensorFlow’s neural networks have evolved into complex models optimized for end-to-end performance.

Scalability and Distributed Systems

The scalability of TensorFlow is paramount. It can handle large datasets and models, which often must be trained on many machines to finish in reasonable time. The framework’s approach to distributed systems, including data and model parallelism, improves training efficiency, and its flexibility allows computations to be optimized for diverse hardware, including custom machine learning hardware like Google’s TPU. TensorFlow also offers multiple APIs for specifying computations, including Keras and other higher-level layers, so models can be expressed in various ways.

Distributed TensorFlow involves multiple processes, such as clients, a master, and workers, which communicate over an RPC layer like gRPC. Users can supply hints about where particular pieces of computation should be placed, such as on a specific job, task, or device. These hints anchor parts of the graph, allowing the placement algorithm to position the rest correctly. Once placement decisions are made, the graph is partitioned into subgraphs, and send and receive nodes are inserted at the cut edges to handle communication between them; a rendezvous mechanism coordinates the activities of matching send and receive nodes. When a graph executes, each operation is mapped to a corresponding kernel, which runs on the chosen device, such as a CPU or GPU.
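
A minimal sketch of these pieces with the 1.x-era distributed API; the cluster layout, host names, and model are hypothetical:

```python
import tensorflow as tf

# Hypothetical two-job cluster: one parameter server, two workers.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Each process runs a server for its own job/task (gRPC underneath).
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Placement hints anchor parts of the graph to jobs/tasks/devices;
# send/receive nodes are inserted automatically at the cut edges.
with tf.device("/job:ps/task:0"):
    W = tf.Variable(tf.zeros([784, 10]))
with tf.device("/job:worker/task:0/gpu:0"):
    x = tf.placeholder(tf.float32, [None, 784])
    y = tf.matmul(x, W)

with tf.Session(server.target) as sess:
    ...  # issue run calls against the master as usual
```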

Faster Experimental Turnaround Time with High-Performance Computing

Faster experimental turnaround time is essential for efficient machine learning research. Large-scale distributed systems enabled by TensorFlow can accommodate multiple experiments per day, significantly boosting research productivity. TensorFlow abstracts away raw computing hardware complexities, allowing researchers to focus on model specification, while the framework maps models to available hardware.

Data Parallelism for Enhanced Training Speed

TensorFlow leverages data parallelism to speed up training. In this approach, multiple model replicas collaborate to update shared parameters, with each replica processing a different subset of the data. The speedup achieved depends on the model type: dense models show a 30-40x speedup with 50 replicas, while sparse models can support up to 1,000 replicas, which is particularly valuable for applications with large sparse embeddings. Because each step in a sparse model touches only a small fraction of the parameters, gradients from different replicas rarely interfere and shared parameters change more slowly, allowing far more replicas to be used.

In TensorFlow, parameter servers hold the current model parameters; model replicas download parameters, compute gradients on their data shard, and send the gradients back, and the parameter servers apply the updates centrally. Earlier systems used a separate parameter-server service with a limited set of update rules; TensorFlow eliminates this separate system by modeling parameter servers as ordinary devices within the computational graph. Data parallelism can be implemented synchronously, where all replicas wait for each other before parameters are updated, or asynchronously, where replicas update parameters independently, at the cost of some gradient staleness.
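
A sketch of the asynchronous, parameter-server flavor using the 1.x-era helper `tf.train.replica_device_setter`; cluster sizes and the model are illustrative:

```python
import tensorflow as tf

# Each worker builds the same graph; replica_device_setter places the
# variables on parameter-server tasks (round-robin) and the compute ops
# on the local worker.
with tf.device(tf.train.replica_device_setter(
        ps_tasks=2, worker_device="/job:worker/task:0")):
    W = tf.Variable(tf.zeros([784, 10]))  # lives on a ps task
    b = tf.Variable(tf.zeros([10]))
    x = tf.placeholder(tf.float32, [None, 784])
    labels = tf.placeholder(tf.float32, [None, 10])
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
        labels=labels, logits=tf.matmul(x, W) + b))
    # Each replica computes gradients on its own shard of data and
    # applies them to the shared parameters without waiting for others.
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
```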

Advantages and Drawbacks of Data Parallelism

Data parallelism offers several advantages, such as distributing the training data across multiple machines so that mini-batches are processed simultaneously. Run synchronously, this is equivalent to processing a single larger batch on one machine, and it avoids the gradient staleness of asynchronous updates because all gradients are computed against the same parameter values. However, synchronous training is less fault-tolerant than asynchronous training: recovery is required if a machine fails, and because every machine must wait for the slowest one to complete its step, significant speed variation among machines can hinder performance.

Hybrid Approach and Factors Influencing Data Parallelism Efficiency

A hybrid approach, with several synchronous replica groups that operate asynchronously with respect to each other, offers a balance between fault tolerance and performance. The efficiency of data parallelism depends on the ratio of the model’s computation time to the time needed to transmit its parameters over the network. Models with relatively few parameters but extensive floating-point work, such as convolutional and recurrent models, are well-suited to data parallelism because each parameter is reused many times per step, so computation dominates communication.
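
A back-of-the-envelope sketch of that ratio; every number below is made up purely for illustration:

```python
# Hypothetical convolutional model and hardware.
params = 5e6            # 5M parameters, each reused many times per step
flops_per_step = 2.5e11 # forward + backward pass for one mini-batch
compute_tput = 5e12     # 5 TFLOP/s per replica
net_bw = 1.25e9         # 10 Gb/s link, in bytes/s

step_time = flops_per_step / compute_tput     # ~50 ms of compute
xfer_time = 2 * params * 4 / net_bw           # params down + grads up, ~32 ms
print(step_time, xfer_time, step_time / xfer_time)
# When compute time exceeds transfer time, adding replicas scales well;
# a parameter-heavy model with little compute would invert the ratio.
```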

Real-World Applications of Data Parallelism

Data parallelism is widely used in training large-scale models, including ranking models for search and image models for ImageNet. Significant speedups are achieved by using multiple replicas, reducing training time from days to hours.

Synchronous vs. Asynchronous Training

Synchronous training waits for all replicas to complete their steps before parameters are updated and the next step begins, while asynchronous training allows replicas to operate independently. Asynchronous updates introduce gradient staleness and noise, which can cost a little accuracy; synchronous training avoids this and typically reaches somewhat higher accuracy, but it is vulnerable to slow machines. To mitigate stragglers, synchronous training employs backup workers: by taking the first gradients to arrive each step and discarding the rest, the training process can be accelerated without compromising accuracy.
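
A sketch of the backup-worker idea with the 1.x-era `tf.train.SyncReplicasOptimizer`; the replica counts are illustrative:

```python
import tensorflow as tf

# Synchronous replication with backup workers: 50 replicas run, but each
# step aggregates only the first 40 gradients to arrive; the 10 slowest
# results are simply discarded.
opt = tf.train.GradientDescentOptimizer(0.01)
sync_opt = tf.train.SyncReplicasOptimizer(
    opt,
    replicas_to_aggregate=40,  # gradients to wait for per step
    total_num_replicas=50)     # 10 extra replicas act as backups
# train_op = sync_opt.minimize(loss, global_step=global_step)
```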

Flexibility and Hardware Implications: Custom ASICs

TensorFlow’s flexibility extends beyond deep learning research, allowing it to express various computations efficiently on different hardware, including heterogeneous systems. The trend towards custom machine learning hardware for specific deep learning operations has emerged, with examples such as Movidius’ low-power ASIC and Google’s custom ASIC for inference acceleration in data centers. General-purpose CPU performance scaling has slowed, leading to specialization of hardware for different workloads. Deep learning’s primitives, like matrix multiplies, are tolerant of reduced precision and noise, enabling specialized hardware design. Google’s custom machine learning ASIC offers an order of magnitude better performance and performance per watt compared to GPUs and CPUs, enabling computationally expensive models at lower latency. This is crucial for interactive applications. Hardware designers are exploring new possibilities given the relaxed constraints of lower precision and noise, opening up opportunities for specialized hardware tailored to deep learning tasks.
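
To illustrate why reduced precision is tolerable, here is a small NumPy sketch, with sizes and values chosen arbitrarily, that quantizes weights to 8 bits and compares a matrix-vector multiply against full float32:

```python
import numpy as np

rng = np.random.RandomState(0)
W = rng.randn(256, 256).astype(np.float32)
x = rng.randn(256).astype(np.float32)

# Linear 8-bit quantization of the weights.
scale = np.abs(W).max() / 127.0
W_q = np.round(W / scale).astype(np.int8)        # 8-bit weights

y_full = W @ x
y_quant = (W_q.astype(np.float32) * scale) @ x   # dequantize, then multiply

rel_err = np.linalg.norm(y_full - y_quant) / np.linalg.norm(y_full)
print("relative error:", rel_err)  # typically on the order of 1%
```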

Extensibility and Community Contributions

TensorFlow defines standard operations like matrix multiply and element-wise vector add, but users can extend it by defining their operations or using community-developed ones. This extensibility allows TensorFlow to adapt to various domains and applications, fostering a collaborative environment for innovation.
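
At the Python level, arbitrary computation can be wrapped as an op with the 1.x-era `tf.py_func`; full-speed custom ops are instead written as C++ kernels registered via REGISTER_OP and REGISTER_KERNEL_BUILDER. A sketch with a hypothetical function:

```python
import numpy as np
import tensorflow as tf

def leaky_clip(x):
    # Hypothetical custom math, implemented in plain NumPy.
    return (np.clip(x, -1.0, 1.0) + 0.01 * x).astype(np.float32)

inp = tf.placeholder(tf.float32, [None])
out = tf.py_func(leaky_clip, [inp], tf.float32)  # wrap NumPy code as an op

with tf.Session() as sess:
    print(sess.run(out, feed_dict={inp: [-5.0, 0.5, 5.0]}))
```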

Continuous Improvement and Future Prospects

TensorFlow continues to evolve with user feedback and contributions. Enhancements like Python 3.3 support, GPU performance improvements, and the introduction of high-level APIs demonstrate its commitment to balancing low-level control for researchers with ease of use. The future of machine learning hardware and TensorFlow’s role in it is promising, exploring novel architectures tailored for deep learning applications. TensorFlow’s distributed runtime enables multiple processes to run in a distributed system, and its pre-trained models with released parameters facilitate various tasks, including pre-training for image classification.

Conclusion

TensorFlow has revolutionized the field of machine learning, offering a flexible, scalable platform that enables researchers to push the boundaries of technology. Its impact on research, particularly in challenging areas like speech and image recognition, is significant. As TensorFlow continues to evolve, it holds the potential to further transform the landscape of machine learning and real-world problem-solving.


Notes by: Ain