Jeff Dean (Google Senior Fellow) – Large Scale Deep Learning with TensorFlow (Aug 2016)


Chapters

00:00:00 Google Brain Project: Research, Tools, and Applications of Neural Networks
00:08:36 Machine Learning Research with TensorFlow
00:13:19 Advancements in Neural Networks for Speech and Image Recognition
00:18:44 Essential Properties of Machine Learning Systems
00:21:52 TensorFlow: Design, Implementation, and Open Source
00:33:03 TensorFlow: Recent Developments and Adoption
00:36:25 Accelerating Machine Learning Research Through Data Parallelism
00:42:52 Data and Model Parallelism Techniques for Deep Learning Training
00:56:53 Extensible Machine Learning Hardware for Deep Learning
01:04:09 TensorFlow: Graph Representation and Execution
01:09:25 Distributed TensorFlow: Graph Execution and Optimization
01:13:01 Placement of Computation for Distributed Machine Learning
01:15:53 TensorFlow: Execution Kernels and Optimization

Abstract

The Evolution and Impact of TensorFlow in Machine Learning and Research

In the rapidly evolving field of machine learning and research, Google’s TensorFlow has emerged as a pivotal framework. This article delves into TensorFlow’s development, its challenges, and aspirations, emphasizing how it has revolutionized machine learning applications like speech and image recognition. The discussion extends to TensorFlow’s scalability, flexibility, and real-world applicability, highlighting its profound impact on the research community and beyond.

Introduction

The merging of computer systems and machine learning expertise has accelerated research, particularly in robotics and language understanding. A key player in this evolution is TensorFlow, an open-source machine learning library. It addresses tasks such as image and speech recognition, where traditional computer systems have struggled but humans excel. TensorFlow’s introduction marked a significant shift in machine learning research, offering a scalable, flexible platform that has been widely adopted across many domains.

TensorFlow represents computations as a dataflow graph. Each node in the graph is an operation (op) with input and output tensors, and each op has one or more implementations, called kernels, optimized for different devices. Automatically generated Python wrappers let users define operations in Python while the underlying implementations live in C++. Optimizer functions extend the graph with ops that compute derivatives of variables with respect to the loss, which is how automatic differentiation for training is implemented. Graphs are serialized via protocol buffers, a compact representation that can be transferred across machines.
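
To make this concrete, here is a minimal sketch using the TensorFlow 1.x-era Python API from around the time of the talk; the softmax model and all names are chosen purely for illustration:

```python
import tensorflow as tf  # TensorFlow 1.x-era API, matching the 2016 talk

# Build a computational graph: each call adds an op (node) to the graph.
x = tf.placeholder(tf.float32, shape=[None, 784], name="x")
W = tf.Variable(tf.zeros([784, 10]), name="W")
b = tf.Variable(tf.zeros([10]), name="b")
logits = tf.matmul(x, W) + b  # MatMul and Add ops

labels = tf.placeholder(tf.float32, shape=[None, 10], name="labels")
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))

# Automatic differentiation: the optimizer extends the graph with ops
# that compute d(loss)/d(variable) for every trainable variable.
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

# The graph is backed by a protocol buffer and can be serialized.
graph_def = tf.get_default_graph().as_graph_def()
print(len(graph_def.node), "ops in the serialized GraphDef")
```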

Each operation in TensorFlow has a corresponding kernel defined in C++. The kernel specifies the computation performed on the input tensors to produce the output tensors, and its compute method is invoked on every step. For the matrix multiply operation, the kernel launches a matrix multiplication over the input tensors A and B. On GPU devices, TensorFlow additionally provides a stream context so the kernel can enqueue work on an execution stream, while other devices, such as TPU cards, use entirely different kernel implementations.
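
The C++ kernels themselves are not visible from Python, but per-device kernel selection can be observed at the Python level. A small sketch, assuming a 1.x-era setup:

```python
import tensorflow as tf

# Pin ops to devices; each op's kernel is selected per device at runtime.
with tf.device("/cpu:0"):
    a = tf.random_normal([1024, 1024])
with tf.device("/gpu:0"):
    b = tf.random_normal([1024, 1024])
    c = tf.matmul(a, b)  # GPU MatMul kernel, if a GPU is available

# log_device_placement prints which device ran each op;
# allow_soft_placement falls back to CPU kernels when no GPU is present.
config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)
with tf.Session(config=config) as sess:
    sess.run(c)
```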

Growth and Adoption of TensorFlow

Originally used for unsupervised learning research and speech recognition, TensorFlow’s growth within Google is evident: it has expanded into computer vision and many other areas, with a sharp increase in the number of machine learning systems built on it. Its popularity is reflected in its GitHub statistics, with extensive community engagement and rapid adoption: over half a million installs within months of its release. TensorFlow’s popularity continues to grow due to its ease of expression, scalability, portability, reproducibility, and real-product deployment capabilities. In 2015, it became the most forked new repository on GitHub, and its standing among open-source deep learning packages was highlighted in a Bloomberg article. The availability of comprehensive tutorials and learning resources further enhances its accessibility.

TensorFlow’s Technical Capabilities

TensorFlow’s core is a graph execution engine that executes computational graphs handed to it by diverse front ends such as Python and C++. It enables concise expression of complex machine learning models and supports large-scale training on clusters of machines and GPUs. This design gives models scalability and portability: they can be trained and deployed across a wide range of platforms and devices. Its reproducibility and real-world applicability have made it a preferred vehicle for moving research ideas into production systems.

To run a computation, a client creates a session and hands its graph to the master process, which serializes and stores it. The client then issues RunStep calls to the master, optionally feeding input values for specific nodes, and the master coordinates execution of the graph across multiple workers and devices, whether CPUs, GPUs, or other specialized hardware.

TensorFlow also supports selective execution of nodes within a graph: by specifying the desired output node, the system determines the minimal set of dependencies needed to compute it and runs only that subgraph, as sketched below. This lets a static graph exhibit dynamic behavior; certain components, such as checkpointing operations, can be executed only periodically, giving flexibility in how different parts of the graph are driven.
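
A small sketch of selective execution via fetches and feeds (all op names illustrative):

```python
import tensorflow as tf

a = tf.placeholder(tf.float32, name="a")
b = tf.constant(3.0, name="b")
c = a * b                # Mul op
d = c + 1.0              # Add op; depends on c
e = tf.sqrt(tf.abs(c))   # independent of d

with tf.Session() as sess:
    # Fetching only d runs just the ops d depends on (a, b, c, d);
    # e and its Sqrt/Abs kernels are never executed.
    print(sess.run(d, feed_dict={a: 2.0}))   # 7.0
    # A separate call can fetch a different subgraph.
    print(sess.run(e, feed_dict={a: -4.0}))  # sqrt(12)
```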

The session interface lets users specify a graph once and then call run many times, which matches the training workflow of running thousands or tens of thousands of steps through the same model. Because the same graph is executed repeatedly, this interface creates substantial opportunities for optimization: TensorFlow plans more work here, with the goal of generating very good code for the particular tensor sizes actually flowing through the model, even when those sizes are not known at compile time.
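
A sketch of that one-graph, many-runs pattern, reusing the names from the earlier example; `next_batch` stands in for a hypothetical input pipeline:

```python
# One graph, many run() calls (x, labels, train_step, loss as defined above).
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for step in range(10000):
        batch_x, batch_y = next_batch()  # hypothetical data-loading helper
        _, l = sess.run([train_step, loss],
                        feed_dict={x: batch_x, labels: batch_y})
        if step % 1000 == 0:
            print("step", step, "loss", l)
```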

Innovations in Machine Learning

TensorFlow has significantly influenced neural network work in speech and image recognition. For example, the error rate in image recognition dropped dramatically from 26% to around 3%, surpassing human error rates in some cases. In speech recognition, TensorFlow’s neural networks have evolved into complex models optimized for end-to-end performance.

Scalability and Distributed Systems

The scalability of TensorFlow is paramount. It can handle large datasets and models, which often must be trained on many machines to finish in reasonable time. The framework’s approach to distributed systems, including data and model parallelism, improves training efficiency, and its flexibility allows computations to be optimized for diverse hardware, including custom machine learning hardware like Google’s TPU. TensorFlow also offers multiple APIs for specifying computations, including Keras and other higher-level layers, so models can be expressed in various ways.

Distributed TensorFlow involves multiple processes, such as clients, a master, and workers, which communicate over an RPC layer like gRPC. Users can supply hints about where particular pieces of computation should be placed, such as on a specific job, task, or device. These hints anchor parts of the graph, allowing the placement algorithm to position the rest correctly. Once placement decisions are made, the graph is partitioned into subgraphs, and send and receive nodes are inserted at the cut edges to handle communication between them; a rendezvous mechanism coordinates the activities of matching send and receive nodes. When a graph executes, each operation is mapped to a corresponding kernel, which runs on the chosen device, such as a CPU or GPU.
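
A minimal sketch of these pieces with the 1.x-era distributed API; the cluster layout, host names, and model are hypothetical:

```python
import tensorflow as tf

# Hypothetical two-job cluster: one parameter server, two workers.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Each process runs a server for its own job/task (gRPC underneath).
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Placement hints anchor parts of the graph to jobs/tasks/devices;
# send/receive nodes are inserted automatically at the cut edges.
with tf.device("/job:ps/task:0"):
    W = tf.Variable(tf.zeros([784, 10]))
with tf.device("/job:worker/task:0/gpu:0"):
    x = tf.placeholder(tf.float32, [None, 784])
    y = tf.matmul(x, W)

with tf.Session(server.target) as sess:
    ...  # issue run calls against the master as usual
```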

Faster Experimental Turnaround Time with High-Performance Computing

Faster experimental turnaround time is essential for efficient machine learning research. Large-scale distributed systems enabled by TensorFlow can accommodate multiple experiments per day, significantly boosting research productivity. TensorFlow abstracts away raw computing hardware complexities, allowing researchers to focus on model specification, while the framework maps models to available hardware.

Data Parallelism for Enhanced Training Speed

TensorFlow leverages data parallelism to speed up training. In this approach, multiple model replicas collaborate to update shared parameters, with each replica processing a different subset of the data. The speedup achieved depends on the model type: dense models show a 30-40x speedup with 50 replicas, while sparse models can support up to 1,000 replicas, which is particularly valuable for applications with large sparse embeddings. Because each step in a sparse model touches only a small fraction of the parameters, gradients from different replicas rarely interfere and shared parameters change more slowly, allowing far more replicas to be used.

In TensorFlow, parameter servers hold the current model parameters; model replicas download parameters, compute gradients on their data shard, and send the gradients back, and the parameter servers apply the updates centrally. Earlier systems used a separate parameter-server service with a limited set of update rules; TensorFlow eliminates this separate system by modeling parameter servers as ordinary devices within the computational graph. Data parallelism can be implemented synchronously, where all replicas wait for each other before parameters are updated, or asynchronously, where replicas update parameters independently, at the cost of some gradient staleness.
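
A sketch of the asynchronous, parameter-server flavor using the 1.x-era helper `tf.train.replica_device_setter`; cluster sizes and the model are illustrative:

```python
import tensorflow as tf

# Each worker builds the same graph; replica_device_setter places the
# variables on parameter-server tasks (round-robin) and the compute ops
# on the local worker.
with tf.device(tf.train.replica_device_setter(
        ps_tasks=2, worker_device="/job:worker/task:0")):
    W = tf.Variable(tf.zeros([784, 10]))  # lives on a ps task
    b = tf.Variable(tf.zeros([10]))
    x = tf.placeholder(tf.float32, [None, 784])
    labels = tf.placeholder(tf.float32, [None, 10])
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
        labels=labels, logits=tf.matmul(x, W) + b))
    # Each replica computes gradients on its own shard of data and
    # applies them to the shared parameters without waiting for others.
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
```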

Advantages and Drawbacks of Data Parallelism

Data parallelism offers several advantages, such as distributing the training data across multiple machines so that mini-batches are processed simultaneously. Run synchronously, this is equivalent to processing a single larger batch on one machine, and it avoids the gradient staleness of asynchronous updates because all gradients are computed against the same parameter values. However, synchronous training is less fault-tolerant than asynchronous training: recovery is required if a machine fails, and because every machine must wait for the slowest one to complete its step, significant speed variation among machines can hinder performance.

Hybrid Approach and Factors Influencing Data Parallelism Efficiency

A hybrid approach, with several synchronous replica groups that operate asynchronously with respect to each other, offers a balance between fault tolerance and performance. The efficiency of data parallelism depends on the ratio of the model’s computation time to the time needed to transmit its parameters over the network. Models with relatively few parameters but extensive floating-point work, such as convolutional and recurrent models, are well-suited to data parallelism because each parameter is reused many times per step, so computation dominates communication.
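
A back-of-the-envelope sketch of that ratio; every number below is made up purely for illustration:

```python
# Hypothetical convolutional model and hardware.
params = 5e6            # 5M parameters, each reused many times per step
flops_per_step = 2.5e11 # forward + backward pass for one mini-batch
compute_tput = 5e12     # 5 TFLOP/s per replica
net_bw = 1.25e9         # 10 Gb/s link, in bytes/s

step_time = flops_per_step / compute_tput     # ~50 ms of compute
xfer_time = 2 * params * 4 / net_bw           # params down + grads up, ~32 ms
print(step_time, xfer_time, step_time / xfer_time)
# When compute time exceeds transfer time, adding replicas scales well;
# a parameter-heavy model with little compute would invert the ratio.
```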

Real-World Applications of Data Parallelism

Data parallelism is widely used in training large-scale models, including ranking models for search and image models for ImageNet. Significant speedups are achieved by using multiple replicas, reducing training time from days to hours.

Synchronous vs. Asynchronous Training

Synchronous training waits for all replicas to complete their steps before parameters are updated and the next step begins, while asynchronous training allows replicas to operate independently. Asynchronous updates introduce gradient staleness and noise, which can cost a little accuracy; synchronous training avoids this and typically reaches somewhat higher accuracy, but it is vulnerable to slow machines. To mitigate stragglers, synchronous training employs backup workers: by taking the first gradients to arrive each step and discarding the rest, the training process can be accelerated without compromising accuracy.
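
A sketch of the backup-worker idea with the 1.x-era `tf.train.SyncReplicasOptimizer`; the replica counts are illustrative:

```python
import tensorflow as tf

# Synchronous replication with backup workers: 50 replicas run, but each
# step aggregates only the first 40 gradients to arrive; the 10 slowest
# results are simply discarded.
opt = tf.train.GradientDescentOptimizer(0.01)
sync_opt = tf.train.SyncReplicasOptimizer(
    opt,
    replicas_to_aggregate=40,  # gradients to wait for per step
    total_num_replicas=50)     # 10 extra replicas act as backups
# train_op = sync_opt.minimize(loss, global_step=global_step)
```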

Flexibility and Hardware Implications: Custom ASICs

TensorFlow’s flexibility extends beyond deep learning research, allowing it to express various computations efficiently on different hardware, including heterogeneous systems. The trend towards custom machine learning hardware for specific deep learning operations has emerged, with examples such as Movidius’ low-power ASIC and Google’s custom ASIC for inference acceleration in data centers. General-purpose CPU performance scaling has slowed, leading to specialization of hardware for different workloads. Deep learning’s primitives, like matrix multiplies, are tolerant of reduced precision and noise, enabling specialized hardware design. Google’s custom machine learning ASIC offers an order of magnitude better performance and performance per watt compared to GPUs and CPUs, enabling computationally expensive models at lower latency. This is crucial for interactive applications. Hardware designers are exploring new possibilities given the relaxed constraints of lower precision and noise, opening up opportunities for specialized hardware tailored to deep learning tasks.
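
To illustrate why reduced precision is tolerable, here is a small NumPy sketch, with sizes and values chosen arbitrarily, that quantizes weights to 8 bits and compares a matrix-vector multiply against full float32:

```python
import numpy as np

rng = np.random.RandomState(0)
W = rng.randn(256, 256).astype(np.float32)
x = rng.randn(256).astype(np.float32)

# Linear 8-bit quantization of the weights.
scale = np.abs(W).max() / 127.0
W_q = np.round(W / scale).astype(np.int8)        # 8-bit weights

y_full = W @ x
y_quant = (W_q.astype(np.float32) * scale) @ x   # dequantize, then multiply

rel_err = np.linalg.norm(y_full - y_quant) / np.linalg.norm(y_full)
print("relative error:", rel_err)  # typically on the order of 1%
```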

Extensibility and Community Contributions

TensorFlow defines standard operations like matrix multiply and element-wise vector add, but users can extend it by defining their operations or using community-developed ones. This extensibility allows TensorFlow to adapt to various domains and applications, fostering a collaborative environment for innovation.
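
At the Python level, arbitrary computation can be wrapped as an op with the 1.x-era `tf.py_func`; full-speed custom ops are instead written as C++ kernels registered via REGISTER_OP and REGISTER_KERNEL_BUILDER. A sketch with a hypothetical function:

```python
import numpy as np
import tensorflow as tf

def leaky_clip(x):
    # Hypothetical custom math, implemented in plain NumPy.
    return (np.clip(x, -1.0, 1.0) + 0.01 * x).astype(np.float32)

inp = tf.placeholder(tf.float32, [None])
out = tf.py_func(leaky_clip, [inp], tf.float32)  # wrap NumPy code as an op

with tf.Session() as sess:
    print(sess.run(out, feed_dict={inp: [-5.0, 0.5, 5.0]}))
```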

Continuous Improvement and Future Prospects

TensorFlow continues to evolve with user feedback and contributions. Enhancements like Python 3.3 support, GPU performance improvements, and the introduction of high-level APIs demonstrate its commitment to balancing low-level control for researchers with ease of use. The future of machine learning hardware and TensorFlow’s role in it is promising, exploring novel architectures tailored for deep learning applications. TensorFlow’s distributed runtime enables multiple processes to run in a distributed system, and its pre-trained models with released parameters facilitate various tasks, including pre-training for image classification.

Conclusion

TensorFlow has revolutionized the field of machine learning, offering a flexible, scalable platform that enables researchers to push the boundaries of technology. Its impact on research, particularly in challenging areas like speech and image recognition, is significant. As TensorFlow continues to evolve, it holds the potential to further transform the landscape of machine learning and real-world problem-solving.


Notes by: Ain