Jeff Dean (Google Senior Fellow) – Systems and Machine Learning Symbiosis (Mar 2018)


Chapters

00:00:17 Machine Learning Systems for Training and Inference
00:10:43 Cloud Tensor Processing Unit Enables Faster Training of Machine Learning Models
00:14:21 Exploring Machine Learning's Impact on Computer Systems and Hardware
00:21:32 Machine Learning and Computer Systems
00:32:13 Optimizing Machine Learning Training with Large Batch Sizes
00:34:14 Hardware and Software Co-evolutions in Machine Learning

Abstract

The Evolution of Computing: Deep Learning, Google’s TPUs, and the Future of Machine Learning Hardware

The relentless advancement of deep learning has ushered in a new era in computing, marked by the insatiable demand for more powerful computational systems. This evolution has been significantly shaped by Google’s development of custom machine learning hardware, particularly the Tensor Processing Unit (TPU), designed for efficient deep learning model training and inference. These developments highlight a future where machine learning is not just a tool for software, but an integral component of hardware design, promising substantial improvements in computational speed, efficiency, and the ability of systems to adapt to changing algorithms and models.

The Demand for Enhanced Computational Power

Deep learning models require substantial computing power, a demand that has outpaced traditional single-thread performance improvements. This gap has spurred the need for hardware optimized for deep learning’s key characteristics: tolerance for reduced precision and reliance on a small set of dense linear algebra operations. The slowdown of Moore’s Law in recent years has coincided with deep learning’s rapidly growing appetite for computation, and training and inference for these models can demand even more computational power than traditional HPC applications. As a result, companies like Google have built specialized hardware such as TPUs, focusing initially on inference and later expanding to model training. TPUs stand out for their ability to perform linear algebra operations efficiently, backed by a high-bandwidth memory system and the ability to interconnect many chips into powerful pods.
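
To make the reduced-precision point concrete, here is a minimal NumPy sketch (not from the talk) comparing a 16-bit matrix multiply against the 32-bit reference; float16 stands in for the TPU’s bfloat16 format, which NumPy lacks:

    import numpy as np

    # Deep learning tolerates reduced precision: a 16-bit matrix multiply
    # stays close to the 32-bit result while halving memory traffic.
    # (float16 is a portable stand-in here for the TPU's bfloat16.)
    rng = np.random.default_rng(0)
    a = rng.standard_normal((256, 256)).astype(np.float32)
    b = rng.standard_normal((256, 256)).astype(np.float32)

    full = a @ b
    reduced = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)

    rel_err = np.abs(full - reduced).max() / np.abs(full).max()
    print(f"max relative error at 16-bit precision: {rel_err:.4f}")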

Hardware Acceleration for Machine Learning

Machine learning hardware customization offers exciting prospects for optimizing the performance of large-scale machine learning models. Specialization and focus on specific operations can unleash creativity in computer architecture, beyond traditional Moore’s Law improvements.

The Role of TPUs in Accelerating Deep Learning

Google’s TPUs, especially the second generation geared toward training, are pivotal in accelerating deep learning. These specialized chips, with their matrix multiply units and high-bandwidth memory, cater specifically to the heavy computational requirements of training large models on extensive datasets. Their impact is measurable in drastically reduced training times: for instance, training a search ranking model dropped from 132 hours to 9 hours.

Applying Machine Learning to Data Structures and Algorithms

Research is exploring the use of machine learning to replace traditional algorithms and data structures. Learned models can approximate the behavior of B-trees, hash functions, Bloom filters, and more, with potential for improved performance and space utilization.
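
A toy sketch of the learned-index idea (the linear model and error-bound search below are illustrative assumptions, not a production design): a model predicts roughly where a key sits in a sorted array, and a small local search within the model’s worst-case error keeps lookups exact.

    import bisect
    import numpy as np

    # A linear model predicts where a key sits in a sorted array; we
    # record its worst-case error on the data and finish each lookup
    # with a bounded local search, so results stay exact.
    keys = np.sort(np.random.default_rng(1).uniform(0, 1e6, 100_000))
    positions = np.arange(len(keys))

    slope, intercept = np.polyfit(keys, positions, deg=1)  # fit pos ~ a*key + b
    predicted = np.clip(slope * keys + intercept, 0, len(keys) - 1).astype(int)
    max_err = int(np.abs(predicted - positions).max())

    def lookup(key):
        guess = int(np.clip(slope * key + intercept, 0, len(keys) - 1))
        lo = max(0, guess - max_err)
        hi = min(len(keys), guess + max_err + 1)
        i = lo + bisect.bisect_left(keys[lo:hi].tolist(), key)
        return i if i < len(keys) and keys[i] == key else None

    print(lookup(keys[42_000]))  # -> 42000, found via the model's guess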

High-Level Programming and Accessibility

To complement the hardware advancements, Google has also developed high-level programming abstractions in TensorFlow, allowing the same programs to run on CPUs, GPUs, and TPUs with minimal changes. This development not only simplifies the programming process but also enhances scalability and accessibility. TPUs have been made available in Google Cloud, with beta access to Cloud TPU accelerators, democratizing access to this powerful technology.
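
As a hedged illustration of that portability (using today’s tf.distribute API rather than the 2018-era one; the no-argument TPUClusterResolver form is an assumption that depends on the environment), the same model definition can run on CPU, GPU, or TPU with only the strategy selection changing:

    import tensorflow as tf

    # The same model definition runs on CPU, GPU, or TPU; only the
    # distribution strategy changes based on what hardware is found.
    try:
        resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
        tf.tpu.experimental.initialize_tpu_system(resolver)
        strategy = tf.distribute.TPUStrategy(resolver)
    except (ValueError, tf.errors.NotFoundError):
        strategy = tf.distribute.get_strategy()  # default: CPU or one GPU

    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="sgd",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        )
    # model.fit(...) now runs unchanged on whichever device was found.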

Machine Learning for Heuristic Optimization

Many areas of computer systems, including compilers, networking, and operating systems, rely on heuristics for decision-making. Machine learning can be employed to learn and adapt these heuristics, resulting in improved performance and adaptability to specific usage patterns.
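
One minimal way to picture a learned heuristic (an illustrative sketch, not a technique from the talk) is an epsilon-greedy bandit that tries several candidate policies and gravitates toward whichever measures best on the live workload:

    import random

    # Replace a fixed heuristic with a learner that tries candidate
    # policies and favors the one with the best observed reward.
    class HeuristicLearner:
        def __init__(self, heuristics, epsilon=0.1):
            self.heuristics = heuristics
            self.epsilon = epsilon
            self.counts = [0] * len(heuristics)
            self.rewards = [0.0] * len(heuristics)

        def choose(self):
            if random.random() < self.epsilon:      # explore occasionally
                return random.randrange(len(self.heuristics))
            means = [r / c if c else float("inf")   # untried -> try first
                     for r, c in zip(self.rewards, self.counts)]
            return max(range(len(means)), key=means.__getitem__)

        def update(self, i, reward):
            self.counts[i] += 1
            self.rewards[i] += reward

    # e.g. candidate cache-eviction or scheduling policies
    learner = HeuristicLearner(["lru", "fifo", "random"])
    i = learner.choose()
    # ... run the chosen heuristic, measure its performance ...
    learner.update(i, reward=1.0)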

The Challenges and Future of Machine Learning Hardware

Despite these advancements, the rapidly evolving field of machine learning presents significant challenges in hardware design. The unpredictable nature of future algorithmic requirements and the long development timelines of Application-Specific Integrated Circuits (ASICs) like TPUs create a dynamic environment where hardware must be both flexible and future-proof. To address this, research is focusing on new algorithmic ideas and hardware architectures that can handle low-precision training and inference, explore sparsity and embeddings, and increase batch sizes while maintaining accuracy.
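
On the large-batch point, a common recipe is the linear learning-rate scaling rule with gradual warmup (from Goyal et al.’s large-minibatch SGD work; the baseline batch size and learning rate below are illustrative assumptions):

    # Scale the learning rate with batch size, warming up gradually so
    # early updates with the large rate do not diverge.
    base_batch, base_lr = 256, 0.1

    def scaled_lr(batch_size, step, warmup_steps=500):
        lr = base_lr * batch_size / base_batch          # linear scaling rule
        warmup = min(1.0, (step + 1) / warmup_steps)    # gradual warmup
        return lr * warmup

    print(scaled_lr(8192, step=499))  # 3.2: 32x batch -> 32x learning rate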

Opportunities for Machine Learning in Compiler Optimization

Compiler optimizations such as instruction scheduling, register allocation, and loop nest parallelization have traditionally relied on hand-tuned heuristics; machine learning techniques can instead learn these decisions from measured performance.
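
As a toy illustration of the compiler case (the features and weights below are invented for the example, not learned from real measurements), a cost model trained offline on measured runtimes could score candidate loop unroll factors in place of a hand-written rule:

    # A learned cost model scores candidate unroll factors; the compiler
    # picks the cheapest instead of applying a fixed heuristic.
    def features(unroll, trip_count, body_insts):
        return [unroll, trip_count / unroll, body_insts * unroll]

    # Pretend these weights were fit offline from measured runtimes.
    weights = [0.5, 1.0, 0.02]

    def predicted_cost(feats):
        return sum(w * f for w, f in zip(weights, feats))

    candidates = [1, 2, 4, 8]
    best = min(candidates,
               key=lambda u: predicted_cost(features(u, trip_count=1024,
                                                     body_insts=12)))
    print("chosen unroll factor:", best)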

Applying Machine Learning to Computer Systems

An emerging trend is the application of machine learning to improve the performance of computer systems themselves. This includes using machine learning for resource management, scheduling, and security in operating systems and compilers. Machine learning is also being explored as a replacement for traditional algorithms and data structures, promising faster performance and better resource utilization.

Learning in Networking and Operating Systems

Machine learning can be applied to network decisions like backoff strategies and window size adjustments, as well as operating system tasks such as buffer cache management and file system prefetching.
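
For instance, a first-order Markov model over the observed access trace (a simplified sketch, assuming block-level accesses; not a design from the talk) can predict the next block to prefetch where fixed sequential readahead would miss a repeating non-sequential pattern:

    from collections import Counter, defaultdict

    # Learn block-to-block transitions from the access trace and
    # prefetch the most likely successor instead of always block+1.
    transitions = defaultdict(Counter)

    def observe(prev_block, next_block):
        transitions[prev_block][next_block] += 1

    def predict_next(block):
        seen = transitions[block]
        return seen.most_common(1)[0][0] if seen else block + 1  # readahead fallback

    trace = [1, 2, 7, 1, 2, 7, 1, 2, 9]
    for a, b in zip(trace, trace[1:]):
        observe(a, b)

    print(predict_next(2))  # -> 7: a learned, non-sequential pattern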

The Interplay of Hardware and Algorithms

The future of machine learning hardware is intricately tied to its algorithms. As hardware designs evolve, they must provide the primitives needed to compute both current and plausible future algorithms efficiently, a co-evolution that demands a balance between flexibility for experimentation and the efficiency of specialized designs. The development of automated placement techniques and high-level programming abstractions is equally critical: it relieves programmers of low-level optimization concerns, raising the level of abstraction so they can focus on model design and implementation.

Conclusion

The integration of machine learning into hardware design signifies a transformative phase in computing. As this field continues to evolve, it holds the promise of creating systems that are not only faster and more efficient but also capable of adapting to the ever-changing landscape of machine learning algorithms and models. This synergy between hardware and software is poised to unlock new frontiers in computational capability and efficiency, shaping the future of technology.


Notes by: MatrixKarma