John Hennessy (Alphabet Chairman) – Language Consortium Keynote (May 2019)
Chapters
00:01:39 RISC Architecture: A Revolution in Computer Design
John Hennessy’s Leadership at Stanford University: John Hennessy’s leadership at Stanford University had a profound impact on the institution’s reputation and growth. As president, he oversaw the expansion of the engineering quad and established the Knight-Hennessy Scholars Program, the largest fully endowed graduate-level scholarship program in the world. Hennessy takes leadership seriously; he explored the nature and teachability of leadership in his book Leading Matters.
John Hennessy’s Founding of MIPS and Contributions to RISC Architecture: In the early 1980s, Hennessy led the Stanford MIPS project and went on to found MIPS Computer Systems, a pioneer of the Reduced Instruction Set Computer (RISC) architecture. RISC simplified the instruction set and turned the CPU into a simple pipeline for faster processing. The MIPS RISC architecture inspired subsequent processors as well as domain-specific processors like DSPs, GPUs, and TPUs. Hennessy’s contributions to RISC also influenced the PISA (Protocol Independent Switch Architecture) used by Tofino.
John Hennessy’s Textbooks and the Turing Award: Hennessy co-authored, with Dave Patterson, two bestselling computer architecture textbooks that have been in worldwide use for over 30 years. The two received the 2017 Turing Award for their contributions to RISC architecture and for those influential textbooks.
John Hennessy’s Early Experience with Micro-Coded Machines: Before RISC, Hennessy and Patterson worked on micro-coded machines, which had hardware and interpretive overhead. This experience motivated their vision of making compilers more powerful and bringing them closer to the architecture.
The Microprocessor’s Dominance and Future Challenges: Hennessy reflects on the unforeseen dominance of the microprocessor in the computer industry. He emphasizes the need to address future challenges such as energy efficiency, security, and the growing complexity of software.
00:06:25 Computing Technology Evolution and Its Impact
Moore’s Law and Dennard Scaling: Moore’s Law, which predicted a doubling of transistors every two years, has been a guiding principle in the semiconductor industry for decades. Dennard scaling, which predicted that power per square millimeter would remain constant as transistors got smaller, has been an equally important factor in the industry’s progress. Dennard scaling has now ended, however, and Moore’s Law itself is slowing, leading to a slowdown in performance gains.
Instruction-Level Parallelism and Amdahl’s Law: Instruction-level parallelism (ILP) was a key factor in improving performance in the past, but it is now reaching its limits. Amdahl’s Law, which states that the speedup from parallelization is limited by the fraction of the program that must run serially, has not been repealed or overcome. This means that further improvements in ILP will have a diminishing impact on performance.
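To make that limit concrete, here is a minimal Python sketch of Amdahl’s Law (my own illustration, not code from the talk); the 10% serial fraction is an assumed example value.

```python
# Minimal sketch of Amdahl's Law: speedup is capped by the serial fraction.
def amdahl_speedup(serial_fraction, n_processors):
    """Overall speedup when only (1 - serial_fraction) of the work parallelizes."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

# Even with 1024 processors, a 10% serial portion limits speedup to under 10x.
for cores in (2, 8, 64, 1024):
    print(cores, round(amdahl_speedup(0.10, cores), 2))
```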
Shift in Application: The most important computers are no longer on desktops, but in pockets and the cloud. This shift in application has led to a change in how we think about computing.
Performance Trends: Single-core performance growth has slowed dramatically in recent years, from roughly 52% per year to about 3.5% per year. DRAM has also slowed, with each DDR generation arriving later than predicted. Moore’s Law has not stopped outright, but the industry is now off by roughly a factor of 10 from what the law would predict.
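As a rough illustration of how quickly those two growth rates diverge, the short sketch below compounds them over a decade (the rates are the ones quoted above; the ten-year window is my own choice).

```python
# Rough illustration of the growth rates quoted above, compounded over ten years.
years = 10
fast = 1.52 ** years   # ~52% per year, the historical single-core growth rate
slow = 1.035 ** years  # ~3.5% per year, the recent growth rate
print(f"10 years at 52%/yr:  {fast:.0f}x")
print(f"10 years at 3.5%/yr: {slow:.1f}x")
print(f"gap after a decade:  ~{fast / slow:.0f}x")
```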
00:11:35 The End of Moore's Law and the Rise of Energy Efficiency
Moore’s Law Slowdown and Dennard Scaling End: Moore’s Law has slowed, with the cost per transistor now dropping more slowly than before. Dennard scaling, which kept power density roughly constant as transistors shrank, has ended, so packing in more transistors now means more power.
Energy Efficiency and Power Consumption: Energy efficiency has become critical as power consumption increases. Optimizing power becomes a significant issue, especially for battery-powered devices and cloud data centers.
Thermal Power Limit Reached: Processors have reached their thermal power limit, preventing further power increases. Techniques like clock slowdown and core turn-off are used to prevent overheating.
End of the “Magic” Era: The traditional approach of relying on hardware improvements to boost performance without considering efficiency is no longer viable. Software optimizations are now essential for improving energy efficiency.
Cache as an Example: Cache, a beloved concept in computer architecture, is an example of a component that needs to be re-examined for energy efficiency.
00:15:49 Limits of Speculation and Instruction-Level Parallelism
Diminishing Returns in Efficiency: Increasing cache size yields diminishing returns in power efficiency; beyond a point, larger caches burn energy without proportional hit-rate gains. Locality of reference is what makes caches pay off. Deep pipelines and high clock rates likewise hit diminishing returns in efficiency because they depend on ever more speculative execution.
The Dilemma of Speculation: Speculation is essential for high performance in modern processors. Meltdown and Spectre vulnerabilities highlight the security risks associated with speculation. Disabling speculation can significantly degrade performance.
The Challenge of Accurate Branch Prediction: Modern processors rely on accurate branch prediction to achieve high performance. Predicting branches with high accuracy is challenging due to the large number of instructions in flight. Inaccurate branch prediction results in wasted instructions and energy.
The Limits of Speculative Execution: A significant portion of instructions in modern processors are speculatively executed and then discarded. Wasted instructions consume energy and degrade performance. Restoring the state of the processor after incorrect branch prediction also consumes energy.
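The sketch below is a back-of-the-envelope model, with assumed accuracy and window-depth numbers of my own choosing, of why deep speculation throws work away: with many unresolved branches in flight, the chance that all of them were predicted correctly drops quickly.

```python
# Back-of-the-envelope model (assumed numbers, not figures from the talk) of why
# deep speculation wastes work: with many branches in flight, even an accurate
# predictor is frequently wrong about at least one of them.
per_branch_accuracy = 0.95   # assumed prediction accuracy per branch
branches_in_flight = 15      # assumed number of unresolved branches in a deep window

p_all_correct = per_branch_accuracy ** branches_in_flight
print(f"Probability every in-flight branch is predicted correctly: {p_all_correct:.2f}")
print(f"Probability some speculative work gets thrown away:        {1 - p_all_correct:.2f}")
```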
The Shift Towards Multicore Architectures: The challenges of speculative execution have led to a shift towards multicore architectures. Multicore architectures offer increased parallelism and improved energy efficiency.
00:19:17 Limits of Conventional Multicore Processors
Background: John Hennessy discusses the shift from ILP to multicore processors and the associated challenges.
Energy and Performance Trade-off: Energy consumption increases proportionally to the number of active cores. Performance must scale at a similar rate to avoid energy inefficiency.
Amdahl’s Law and Serialization: Amdahl’s Law limits the speedup achievable by parallelizing applications. Serial portions of the code can significantly reduce the overall performance gains. Data centers often encounter serialization issues that limit complete parallelization.
Packaging and Thermal Constraints: Packaging technology improves thermal dissipation by about 5% annually. High-end 24-core processors can run at 3.4 GHz in turbo mode. However, running all cores at this speed exceeds power and thermal limits.
Power Limitations: Running 24 cores at 3.4 GHz would require 255 watts, exceeding the power limits of PC-like devices. 64-core processors would require even more power, making it challenging to dissipate heat effectively.
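A quick sanity check of that power argument, using the figures quoted above plus an assumed package power budget of my own choosing:

```python
# Rough power sanity check for the 24-core example above.
watts_per_core_at_turbo = 255 / 24   # ~10.6 W/core, implied by the 255 W figure quoted
tdp_limit = 165                      # assumed package power budget (placeholder value)

for cores in (24, 64):
    demand = cores * watts_per_core_at_turbo
    status = "exceeds" if demand > tdp_limit else "fits within"
    print(f"{cores} cores at full turbo: ~{demand:.0f} W ({status} a {tdp_limit} W budget)")
```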
Conclusion: The combination of Amdahl’s Law and power constraints limits the scalability of multicore processors. Finding ways to reduce serialization and improve energy efficiency remains a significant challenge.
00:22:08 Challenges and Opportunities in Hardware and Software Efficiency
Thermal Limitations and Instruction Set Efficiency: Thermal dissipation limits the number of cores that can be active in a processor. That power limit compounds the Amdahl’s Law effect, further reducing parallel performance. Instruction set efficiency therefore becomes a key driver for power-constrained devices.
The Shortcoming of Modern Programming Languages: Modern programming languages prioritize software productivity over execution efficiency. Python, as an example, can be highly inefficient in execution compared to C.
Hardware-Centric Approach and Domain-Specific Ideas: General-purpose processors have hit a dead end in performance improvement. Domain-specific architectures are the only viable path forward, and domain-specific languages will be crucial for programming them.
Efficiency Gains through Optimization: A study by MIT researchers demonstrates significant efficiency improvements in matrix multiplication. Optimizations such as using C, parallel loops, memory blocking, and vector instructions resulted in a 65,000-fold speedup. Potential for substantial performance gains through various techniques.
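The small Python sketch below gives only the flavor of that optimization ladder, comparing interpreted loops against a single vectorized BLAS call on modest matrices; the study’s 65,000-fold figure came from larger matrices and the full set of steps (C, parallel loops, memory blocking, vector instructions).

```python
# Interpreted Python loops vs. a vectorized BLAS call for matrix multiplication.
import time
import numpy as np

n = 128
a = np.random.rand(n, n)
b = np.random.rand(n, n)
a_rows, b_cols = a.tolist(), b.T.tolist()   # plain Python lists for the interpreted version

def naive_matmul(a_rows, b_cols):
    """Nested interpreted loops, the starting point of the optimization ladder."""
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a_rows]

t0 = time.perf_counter(); c_naive = naive_matmul(a_rows, b_cols); t1 = time.perf_counter()
t2 = time.perf_counter(); c_blas = a @ b;                         t3 = time.perf_counter()

assert np.allclose(c_naive, c_blas)
print(f"interpreted loops: {t1 - t0:.3f} s, vectorized BLAS: {t3 - t2:.5f} s, "
      f"speedup ~{(t1 - t0) / (t3 - t2):.0f}x")
```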
Tailoring Architectures to Application Needs: Domain-specific architectures aim to achieve performance by closely aligning with application requirements. Examples include GPUs for graphics, network processors, and deep learning accelerators.
00:26:25 Novel Approaches to Processor Design for Domain-Specific Languages
Key Insights: Energy efficiency is crucial for modern computing, and there are significant energy savings to be gained by reducing control overhead and optimizing memory hierarchy usage.
Efficiency Opportunities: Register file access consumes 60 times more energy than a 32-bit addition, and L1 cache access consumes 100 times more energy. Control overhead accounts for a large portion of energy consumption, often exceeding the energy used for the actual arithmetic operations. Caches are efficient when they work, but can cause significant overhead when they miss or have excessive latency.
Domain-Specific Architectures (DSAs): DSAs offer several advantages over general-purpose processors: simpler parallelism models (e.g., SIMD) for greater efficiency; more efficient use of the memory hierarchy through user-controlled mechanisms; elimination of unnecessary accuracy; and programming models that better match the hardware, enabling more efficient compilation.
Domains Suitable for DSAs: Networking and deep learning are promising domains for DSA implementation due to their high interest and rapid growth.
Systolic Arrays: Systolic arrays, which rely on nearest-neighbor communication, offer low energy consumption and high performance. They excel at dense linear algebra operations such as matrix multiplication, which dominate deep learning workloads.
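As an illustration of the idea, here is a toy, cycle-by-cycle simulation of an output-stationary systolic array computing a matrix product with only nearest-neighbor data movement; it is my own sketch of the general technique, not the TPU’s actual design.

```python
# Toy simulation of an output-stationary systolic array computing C = A @ B.
# Each processing element (PE) only ever reads from its left and top neighbors.
import numpy as np

def systolic_matmul(A, B):
    n = A.shape[0]
    a_reg = np.zeros((n, n))     # value held in each PE's horizontal register
    b_reg = np.zeros((n, n))     # value held in each PE's vertical register
    acc = np.zeros((n, n))       # per-PE accumulator for C[i, j]
    for t in range(3 * n - 2):   # enough cycles for all operands to flow through
        new_a = np.zeros((n, n))
        new_b = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                # Operands arrive from the left/top neighbor, or from the skewed edge feed.
                new_a[i, j] = a_reg[i, j - 1] if j > 0 else (A[i, t - i] if 0 <= t - i < n else 0.0)
                new_b[i, j] = b_reg[i - 1, j] if i > 0 else (B[t - j, j] if 0 <= t - j < n else 0.0)
                acc[i, j] += new_a[i, j] * new_b[i, j]   # multiply-accumulate in place
        a_reg, b_reg = new_a, new_b
    return acc

A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
assert np.allclose(systolic_matmul(A, B), A @ B)
```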
Performance and Energy Efficiency of DSAs: DSAs can deliver significantly better performance per watt compared to general-purpose processors and even GPUs. Roofline models illustrate the relationship between arithmetic intensity, memory bandwidth, and arithmetic bandwidth in determining performance.
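A roofline model reduces to a single min(): attainable performance is capped by memory bandwidth at low arithmetic intensity and by peak arithmetic throughput at high intensity. The peak numbers in the sketch below are placeholders, not figures from the talk.

```python
# Minimal roofline sketch: attainable performance is the lesser of the memory-bound
# and compute-bound ceilings. Peak values here are assumed placeholders.
def roofline(arithmetic_intensity_flops_per_byte,
             peak_flops=90e12,        # assumed peak arithmetic throughput (FLOP/s)
             peak_bandwidth=900e9):   # assumed memory bandwidth (bytes/s)
    return min(peak_flops,
               arithmetic_intensity_flops_per_byte * peak_bandwidth)

for ai in (1, 10, 100, 1000):         # FLOPs performed per byte moved from memory
    print(f"intensity {ai:>4}: attainable ~{roofline(ai) / 1e12:.1f} TFLOP/s")
```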
Demand for DSA Performance: The demand for high-performance computing in domains such as deep learning is rapidly growing, driven by the need to train large neural networks.
00:38:06 Rethinking Processor Architectures for Specialized Domains
Compute Demand and Training Time: Domain-specific architectures (DSAs) enable more efficient training of complex models like GANs and reinforcement learning agents, which require immense computational power. The compute needed to train such models has grown far faster than conventional processors can deliver, demanding specialized hardware with massive compute capability.
Object Code Compatibility and DSL Interface: Traditional processors maintain object code compatibility, ensuring backward compatibility with existing software. DSA breaks away from this convention, allowing changes to the underlying architecture as long as the compiler can efficiently compile to it. This flexibility enables rapid development and iteration of domain-specific architectures, such as Google’s Tensor Processing Units (TPUs).
TPUs and Their Advantages: TPUs are specialized hardware designed for training deep learning models. They offer significant advantages in terms of performance and energy efficiency compared to general-purpose processors. TPUs can be stacked to build large-scale supercomputers with high bandwidth memory and liquid cooling.
Rethinking Architecture and Software Models: DSA challenges conventional notions of architecture design and requires rethinking the interface between software models and hardware. It opens up new possibilities for optimizing performance and energy efficiency by tailoring hardware specifically to the needs of particular applications. This approach allows for faster prototyping and experimentation with different hardware designs.
The Importance of Tight Integration: Success in the domain-specific world requires tight integration across multiple levels of the stack, from applications to underlying architecture. It involves understanding application characteristics, compiler optimization techniques, domain-specific languages, and the appropriate hardware architecture. By achieving this integration, it is possible to overcome the limits of Dennard scaling and Moore’s law and deliver remarkable performance for high-performance applications.
Dave Kuck’s Observation: Dave Kuck, who led software for the ILLIAC IV supercomputer, made a poignant observation in 1975: despite having built advanced hardware, the team struggled to use it effectively, highlighting the need for closer integration between hardware and software.
Conclusion: Domain-specific architecture offers a promising approach to address the challenges posed by the limits of Dennard scaling and Moore’s law. It enables the development of specialized hardware tailored to specific applications, delivering remarkable performance and energy efficiency. Tight integration across multiple levels of the stack is crucial for successful implementation of DSA.
00:43:58 Emerging Techniques in Domain-Specific Architectures
Details: Exploring domain-specific architectures for improved performance.
Approaching Domain-Specific Architectures: Exposing memory behavior of code allows for better compilation. Deep pipelining and parallelism can improve performance. Ideas from vector machines can be adapted and utilized.
Addressing Memory Latency: Multi-threading and software prefetch methods are used to tackle memory latency. These techniques require an understanding of the underlying code.
Approximation Techniques: Approximations, such as using 8-bit arithmetic, can yield significant efficiency gains. Different data formats, like 16-bit floating point with a larger exponent and smaller mantissa, are being explored. Converting codes to single precision can enable finer grid sizes and improved predictive results.
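The sketch below illustrates both approximations in a few lines of Python: truncating float32 values to a bfloat16-style format (same exponent width, shorter mantissa) and linear 8-bit quantization with a per-tensor scale. Both are my own illustrations of the general idea rather than any specific accelerator’s implementation.

```python
# Two simple approximations: bfloat16-style truncation and linear int8 quantization.
import numpy as np

def to_bfloat16_like(x):
    """Keep the top 16 bits of each float32 value (8-bit exponent, 7-bit mantissa)."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

def quantize_int8(x):
    """Map values linearly onto int8 with a single per-tensor scale."""
    scale = np.abs(x).max() / 127.0
    q = np.round(x / scale).astype(np.int8)
    return q, scale

x = np.random.randn(5).astype(np.float32)
print("float32 :", x)
print("bfloat16:", to_bfloat16_like(x))
q, s = quantize_int8(x)
print("int8    :", q * s)   # dequantized back for comparison
```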
Noise and Statistical Nature of Deep Learning: Deep learning’s inherent noise and statistical nature make it a suitable domain for approximation techniques. Techniques such as removing neurons with insignificant output values are employed during training.
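A minimal sketch of that pruning idea, using magnitude-based weight pruning with an arbitrarily chosen 90th-percentile threshold (my illustration of the general approach, not the specific technique referenced in the talk):

```python
# Magnitude-based pruning: drop weights whose absolute value is negligible.
import numpy as np

weights = np.random.randn(8, 8)
threshold = np.percentile(np.abs(weights), 90)   # prune the smallest ~90% by magnitude
mask = np.abs(weights) >= threshold
pruned = weights * mask                          # zero out the insignificant weights
print(f"kept {mask.sum()} of {weights.size} weights "
      f"({100 * mask.mean():.0f}% density after pruning)")
```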
Future Trends: Continued exploration of approximation techniques, particularly in the training phase, is anticipated.
00:48:22 Networking for Artificial Intelligence and Beyond
Training and Networking: AI networks may primarily interconnect systems engaged in learning and training processes. Networks will likely have two distinct applications: handling large amounts of data for training and transmitting video content.
Top-of-Rack Switch Capacity: Current top-of-rack switches can handle Netflix’s peak global data traffic at any given moment. The ratio of data handled for training versus video streaming is significant.
ML and Networking Integration: The discussion shifts to the integration of ML and reduced functions within switches. The question arises whether networking experts or ML specialists will lead this integration or if a merger will occur.
Hardware Fracturing: The technology space is becoming more fragmented, leading to variations in hardware and packaging. Different devices may require specialized ML processors, such as for camera or voice recognition.
Beyond First-Generation ML: The discussion moves beyond first-generation ML, acknowledging that this field is still in its early stages. The future of AI is uncertain, but it is expected to evolve and potentially move beyond supervised learning.
00:51:03 Machine Learning Challenges and Opportunities
Challenges in Scaling Supervised Learning: Supervised learning is effective for specific tasks with sufficient labeled data. However, it falls short in achieving artificial general intelligence.
Energy Efficiency Gap: Current ML systems consume significantly more energy compared to the human brain.
Human Learning and Evolution: Human learning involves observation, trial and error, and benefits from millions of years of evolution. ML lacks this natural learning ability and relies on large labeled datasets.
Opportunities for Material Science: Quantum computing and carbon nanotubes hold potential for future advancements. Exploring 3D stacking and innovative packaging technologies could improve energy efficiency.
Long-Term Investment and Benchmarking: Decades of Moore’s Law-driven progress have set a very high benchmark that any replacement technology must beat. Investments in long-term research (10+ years) are necessary to stay competitive.
Potential of Silicon and Optical Integration: Better integration of silicon and optical materials could enhance communication efficiency. Applications include long-distance communication and possibly board-to-board connections.
Affordability of Silicon Fabrication: Silicon fabrication is becoming more accessible in university environments. This enables greater experimentation and research opportunities.
Availability of Fabs and Architectural Innovation: The slowdown in the advancement of leading-edge fabs has made them more accessible, allowing for more architectural innovation in universities and startups. As long as the bleeding edge is avoided, the availability of fabs is expected to increase. Cost-sensitive applications can benefit from this increased availability.
Challenges in Quantum Computing: Feynman’s work on reversible computing and his inspiration for quantum computing are mentioned. A significant challenge in quantum computing is maintaining a large system state in a coherent mode, as even minor disturbances can disrupt it. Building large machines with low error rates in qubits is crucial for creating quantum computers that are interesting enough for practical use.
Side Effects of Quantum Technology: While the physics of quantum technology is fascinating and there are many potential side effects, the realization of a practical quantum computer capable of performing interesting computations remains uncertain.
Abstract
Exploring the Evolution of Computing: From RISC Innovation to Domain-Specific Architectures, Quantum Computing, and Artificial Intelligence
—
In the rapidly evolving landscape of computing, the journey from the pioneering days of Reduced Instruction Set Computing (RISC) to the contemporary focus on Domain-Specific Architectures (DSAs), quantum computing, and artificial intelligence (AI) represents a monumental shift. Central figures like John Hennessy and David Patterson laid the groundwork with RISC, influencing today’s processors, DSPs, GPUs, and TPUs. Their work, culminating in the Turing Award, revolutionized computer architecture. Concurrently, challenges like the slowdown of Moore’s Law and the end of Dennard Scaling have propelled the shift towards energy efficiency and specialized architectures. This article explores the trajectory of these developments, highlighting the critical transition from general-purpose computing to a future dominated by DSAs, quantum computing, and the ever-growing importance of energy efficiency and approximation techniques in processing.
—
John Hennessy’s Leadership and Contributions to RISC Architecture
John Hennessy’s contributions to the computing field are profound and span leadership, research, and education. As president of Stanford University, he oversaw the expansion of the engineering quad and established the Knight-Hennessy Scholars Program, the largest fully endowed graduate-level scholarship program in the world. His commitment to leadership extends to his book, Leading Matters, exploring the nature and teachability of leadership.
Hennessy’s influence in computer architecture began with the Stanford MIPS project in the early 1980s and the founding of MIPS Computer Systems, pioneering the RISC architecture. RISC simplified the instruction set and turned the CPU into a simple pipeline for faster processing. This inspired subsequent processors and domain-specific processors like DSPs, GPUs, and TPUs. Hennessy’s role in RISC also influenced the PISA architecture used by Tofino. His two bestselling computer architecture textbooks, co-authored with Dave Patterson, have been widely used for over 30 years. Together, they received the 2017 Turing Award for their contributions to RISC architecture and their influential textbooks.
The Dilemma of Moore’s Law and Dennard Scaling
While Moore’s Law has been a guiding principle in the semiconductor industry for decades, predicting a doubling of transistors every two years, it has been more an aspiration than a law. The end of Dennard Scaling, which predicted constant power per square millimeter as transistors got smaller, has marked a critical turning point. This slowdown, coupled with the shift in application from desktops to mobile and cloud computing, has emphasized the need for energy efficiency. The thermal power limit of processors, leading to reduced clock speeds and core shutdowns, further accentuates the challenge.
Addressing Performance and Energy Efficiency Crises
The slowdown in single-core performance growth and DRAM development has prompted a reevaluation of processor design. Energy efficiency has become paramount, especially in cloud computing, where the capital costs of servers and cooling/power infrastructure are comparable. Thermal power limits further exacerbate the challenge.
The Paradigm Shift to Multicore Processors and DSAs
Instruction-level parallelism (ILP) and Amdahl’s Law have presented challenges in improving performance. ILP’s limits and Amdahl’s Law, which states that parallelization speedup is limited by the non-parallelizable fraction of the program, have driven the industry towards multicore processors. These processors allow parallel execution of multiple threads or programs, improving performance and efficiency. However, this shift brings thermal dissipation and increased energy consumption challenges. RISC’s focus on instruction set efficiency has become crucial for power-sensitive devices, leading to the rise of domain-specific architectures tailored for specific applications like GPUs for graphics and TPUs for deep learning.
The Role of Domain-Specific Architectures
DSAs, exemplified by Google’s TPUs and GPUs, offer substantial performance gains by tailoring architecture to specific applications. These architectures employ approximation techniques like reduced numerical precision, enhancing efficiency in scenarios ranging from deep learning to weather prediction. The integration of DSAs across application, compilation, DSL, and architecture levels is pivotal for overcoming the limits imposed by traditional computing paradigms.
Quantum Computing and Future Trends
Quantum computing emerges as a frontier in the computing landscape, albeit with significant practical difficulties. Maintaining coherent system states and building large, low-error quantum machines pose substantial challenges. Meanwhile, advancements in material science, like carbon nanotubes and 3D stacking, hold promise for improved energy efficiency and performance in traditional computing.
AI and Networking: A Look at the Future
AI networks may come to primarily interconnect systems engaged in learning and training, with networks serving two dominant workloads: moving the enormous datasets used in training and delivering video content. A single modern top-of-rack switch can already handle Netflix’s peak global traffic at any given moment, which puts the scale of training traffic relative to video streaming in perspective. How machine learning functions will be integrated into switches remains open, as does the question of whether networking experts, ML specialists, or a merger of the two communities will lead that work. Meanwhile, the hardware landscape is fragmenting, with different devices, such as cameras and voice-recognition systems, calling for their own specialized ML processors. All of this is still first-generation machine learning; the field is early, its future is uncertain, and it is expected to evolve beyond today’s supervised learning.
Challenges and Potential Solutions in Computer Architecture
Thermal dissipation limits how many cores can be active at once, and those power limits compound the Amdahl’s Law penalty on parallel performance, making instruction set efficiency a key driver for power-constrained devices. Modern programming languages compound the problem by prioritizing programmer productivity over execution efficiency; Python, for example, can be highly inefficient in execution compared to C. With general-purpose processors at a dead end, domain-specific architectures, programmed through domain-specific languages, are the most viable path forward. The potential gains are large: an MIT-led study of matrix multiplication showed that moving to C, parallelizing loops, blocking for the memory hierarchy, and using vector instructions yielded a roughly 65,000-fold speedup. Domain-specific architectures pursue exactly this kind of gain by aligning hardware closely with application requirements, as GPUs do for graphics, network processors for packet processing, and accelerators for deep learning.
Emerging Trends in Computer Architecture
Research is actively exploring ways to reduce energy consumption in computing, from optimizing memory hierarchy usage and minimizing control overhead to employing systolic arrays. Domain-specific architectures have gained traction because they deliver markedly better performance per watt than general-purpose processors, using simpler parallelism models, user-controlled memory hierarchies, and programming models tailored to the hardware; roofline models capture how arithmetic intensity, memory bandwidth, and arithmetic bandwidth together determine achievable performance. Demand for this kind of specialized compute is growing rapidly, driven above all by deep learning. Realizing it requires rethinking the interface between software models and hardware, and it rewards tight integration across the stack: application characteristics, compiler optimization techniques, domain-specific languages, and the underlying architecture.
Conclusion
The computing world is at a pivotal juncture, transitioning from general-purpose processors to specialized architectures and quantum computing. This shift necessitates a rethinking of architectural and interface designs, emphasizing energy efficiency, parallelism, and domain-specific solutions. The future of computing, influenced by lessons from the past and innovations in the present, looks towards a landscape where domain-specific architectures and quantum computing redefine what’s possible in processing power and efficiency.