Single-Threaded vs. Multithreaded: Where Should We Focus?
Today, with the increasing popularity of multicore processors, one approach to optimizing the processor's performance is to reduce the execution times of individual applications running on each core by designing and implementing more powerful cores. Another approach, which is the polar opposite of the first, optimizes the processor's performance by running a larger number of applications (or threads of a single application) on a correspondingly larger number of cores, albeit simpler ones. The difference between these two approaches is that the former focuses on reducing the latency of individual applications or threads (it optimizes the processor's single-threaded performance), whereas the latter focuses on reducing the latency of the applications' threads taken as a group (it optimizes the processor's multithreaded performance). The obvious advantage of the single-threaded approach is that it minimizes the execution times of individual applications (especially preexisting or legacy applications), but potentially at the cost of longer design and verification times and lower power efficiency. By contrast, although the multithreaded approach may be easier to design and have higher power efficiency, its utility is limited to specific, highly parallel applications; it is difficult to program for other, less-parallel applications.

Of course, the multiprocessor-versus-uniprocessor controversy is not a new issue. My earliest recollection of it is from back in the 1970s, when Intel introduced the 4004, the first microprocessor. I remember almost immediately seeing proposals that said that if we could just hook together a hundred or a thousand of these, we would have the best computer in the world, and it would solve all of our problems. I suspect that no one remembers the result of that research, as the long series of increasingly more powerful microprocessors is a testament to our success at (and the value of) improving uniprocessor performance. Yet, with each successive generation (the 4040, the 8080, all the way up to the present) we've seen exactly the same sort of proposal for ganging together lots of the latest-generation uniprocessors.
So the question for these panelists is, why is today's generation different from any of the past generations? Clearly, it is not just that we cannot achieve gains in instructions per cycle (IPC) in direct proportion to the number of transistors used, because that's never been true. If it had been true, we would have IPCs several orders of magnitude larger than those we have today. Our recent rule of thumb has been that processor performance improves as the square root of the number of transistors, and cache-miss rates likewise improve as the square root of the cache size. Yet, despite these sublinear architectural improvements, until recently, uniprocessors have been the preferred trajectory.

Why is the uniprocessor trajectory insufficient for today? Are there no ideas that will bring even that sublinear architectural performance gain? Why aren't the order-of-magnitude gains being promised by single-instruction, multiple-data (SIMD), vector, and streaming processors of interest? Is the complexity of current architectures a factor in the inability to push across the next architectural performance step? Or is there a feeling that, irrespective of architecture, we've crossed a complexity threshold beyond which we can't build better processors in a timely fashion?

Looking at multiprocessors, is using full (but simpler) processors as the building blocks the right granularity? Are multiprocessors really such a panacea of simplicity? How much complexity is going to be introduced in the interprocessor interconnect, in the shared cache hierarchy, and in processor support for mechanisms like transactional memory? Much of the challenge of past generations has been coping with the increasing disparity between processor speed and memory speed, or the limits of die-to-memory bandwidth. Does having a multiprocessor with its multiple contexts make this problem worse instead of better? And, even if multiprocessors really are a simpler alternative, what is the application domain over which they will provide a benefit? Will enough people be able to program them?

As I hope is clear, both approaches have advantages and disadvantages. It is, however, unclear which approach will be more heavily used in the future, and which should be the major focus of our research. The goal of this panel was to discuss the merits of each approach and the trade-offs between the two.
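To make the square-root rules of thumb concrete, here is a back-of-the-envelope sketch; the ratios below are illustrative assumptions, not figures from the panel.

```python
import math

# Back-of-the-envelope illustration of the square-root rules of thumb quoted above:
# performance ~ sqrt(transistor count), miss rate ~ 1 / sqrt(cache size).
# The ratios below are hypothetical, not measurements.

def perf_gain(transistor_ratio):
    """Relative single-thread performance when the transistor budget grows by this ratio."""
    return math.sqrt(transistor_ratio)

def miss_rate(cache_size_ratio):
    """Relative cache-miss rate when the cache grows by this ratio."""
    return 1.0 / math.sqrt(cache_size_ratio)

print(f"2x transistors -> {perf_gain(2):.2f}x single-thread performance")
print(f"4x cache       -> {miss_rate(4):.2f}x the original miss rate")
```

Doubling a core's transistor budget buys roughly 1.4x, whereas spending the same transistors on a second core can buy up to 2x on perfectly parallel work; that gap is the crux of the question posed to the panelists.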
parallelism at the instruction level to significantly improve processor performance, whereas there may be significantly more parallelism at the thread level that can be exploited instead. However, difficult problems still need to be solved, and the computer architecture community will not solve them if the best and brightest are steered away from those problems. Creativity and ingenuity can solve those problems. For example, in the RISC vs. CISC debate, CISC did not win merely because of legacy code. Rather, CISC won the debate because computer architects had the ingenuity to develop solutions such as out-of-order execution combined with in-order retirement, wide issue, better branch prediction, and so on. For example, Figure 1 shows the headroom remaining for a 16-wide-issue processor that uses the best branch predictor and prefetcher available at the time the measurements were made [1].

Take, for example, the creativity of computer architects with respect to branch prediction. Conventional wisdom in 1990 pretty well accepted as fact that instruction-level parallelism on integer codes would never be greater than 1.85 IPC. However, even simple two-level branch predictors could improve the processor's performance beyond 1.85 IPC. And, today, computer architects such as Andre Seznec and Daniel Jimenez have proposed more sophisticated branch predictors that are better than the two-level branch predictor. There is still quite a bit more unrealized performance to be gotten from better branch prediction and better memory hierarchies (see Figure 1). The key point is that the computer architecture community should not avoid the difficult problems, but rather should harness its members' creativity and ingenuity to propose solutions and solve those problems.

Another potential objection to the single-threaded focus is power consumption. Power consumption in a 10-billion-transistor processor is a significant problem. However, a processor with 10 billion transistors could have several large functional units that remain powered off when not in use, but that are powered up via compiled code when necessary to carry out a needed piece of work specified by the algorithm. I use my "Refrigerator" analogy to explain this type of microarchitecture. William "Refrigerator" Perry was a huge defensive tackle for the 1985 Chicago Bears. While the Bears were on offense, Perry sat on the bench until they were on the one-yard line. Then, they would power him up to carry the ball for a touchdown. To me, the key question is, "How much functionality can we put into a processor that remains powered off until it is needed?"
I think the processor of the future will be what I call "Niagara X, Pentium Y." The Niagara X part of the processor consists of many very lightweight cores for processing the embarrassingly parallel part of an algorithm. The Pentium Y part consists of very few heavyweight cores (or perhaps only one) with serious hybrid branch prediction, out-of-order execution, and so on, to handle the serial part of an algorithm and mitigate the effects of Amdahl's law. To ensure that the processor's performance does not suffer because of intrachip communication, a high-performance interconnect will need to connect the Pentium Y core to the Niagara X cores. Additionally, computer architects need to determine how to use transistors to minimize the performance implications of off-chip communication.
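The appeal to Amdahl's law is easy to quantify. The sketch below is illustrative only: the 90 percent parallel fraction, the 64 lightweight cores, and the 2x serial speedup are assumed numbers, not figures from the panel.

```python
def amdahl_speedup(parallel_fraction, parallel_speedup, serial_speedup=1.0):
    """Overall speedup when the parallel part runs parallel_speedup times faster
    and the serial remainder runs serial_speedup times faster (Amdahl's law)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction / serial_speedup + parallel_fraction / parallel_speedup)

f = 0.9  # assumed: 90 percent of the work is embarrassingly parallel
# Lightweight "Niagara X" cores alone: parallel part 64x faster, serial part untouched.
print(f"64 light cores only:         {amdahl_speedup(f, 64):.1f}x")
# Add one heavyweight "Pentium Y" core that also doubles serial performance.
print(f"64 light cores + heavy core: {amdahl_speedup(f, 64, serial_speedup=2.0):.1f}x")
```

Even 64 lightweight cores are capped near 9x by the serial 10 percent; doubling the speed of that serial slice lifts the total to roughly 16x, which is why the heavyweight core matters.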
must adopt a different approach to designing our processors.

The second wall is complexity. The basic issue with the complexity wall is that it is becoming too difficult, and taking too long, to design and verify next-generation processors, and to manufacture them at sufficiently high yield rates. Although some computer architects think that this is a significant wall, I do not, since I believe that creative people like Joel Emer and Yale Patt can create abstractions to manage the complexity.

The third, and final, wall is memory, which I think is a very significant problem. Currently, accessing main memory requires possibly hundreds of core cycles, which can be hundreds of times slower than a floating-point multiply. Consequently, accessing main memory wastes hundreds, or even thousands, of instruction execution opportunities. How much instruction-level or thread-level parallelism is available for the processor to exploit while waiting for a memory access to complete? And, even if there is a large amount of parallelism, the power and complexity walls limit how the computer architect can design a processor to exploit that parallelism.

Given these three walls, the computer architecture community needs to continue to design processors that exploit parallelism if we want to continue to improve processor performance. However, I do not believe that we can continue to exploit instruction-level parallelism with identical operations, that is, SIMD vectors. Rather, we need to exploit a higher-level parallelism in which we can do slightly different things along each parallel path. Note that this higher-level parallelism could be like the thread-level parallelism that has been spectacularly successful in narrow domains such as database management systems, Web servers, and scientific computing, but less successful for other application spaces. More specifically, instead of exploiting bit-level and instruction-level parallelism with wider, more powerful single-threaded processors, we need to use multicore processors to exploit higher-level parallelism.
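To put the memory wall described above in concrete terms, here is a back-of-the-envelope sketch; the clock rate, DRAM latency, and issue width are assumptions chosen only for illustration.

```python
# Rough cost of one main-memory access under assumed (not measured) parameters.
CLOCK_GHZ = 3.0          # assumed core clock frequency
DRAM_LATENCY_NS = 100.0  # assumed round-trip latency to main memory
ISSUE_WIDTH = 4          # assumed peak instructions issued per cycle

stall_cycles = DRAM_LATENCY_NS * CLOCK_GHZ  # ns times cycles-per-ns
lost_slots = stall_cycles * ISSUE_WIDTH     # instruction-issue opportunities idled

print(f"One miss costs about {stall_cycles:.0f} cycles, "
      f"or up to {lost_slots:.0f} issue slots if no independent work is available.")
```

Whether enough independent instruction-level or thread-level work exists to fill those hundreds of cycles and roughly a thousand issue slots is exactly the question raised above.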
My approach to this problem is to advocate the development of new hardware and software models (Figure 2). The architecture community needs to raise the level of abstraction at the software layer. Although most people naturally do not think in parallel, there are tremendous success stories, like the SQL language for relational databases and Google's MapReduce. In the former case, most people can write declarative queries and let the underlying system optimize and create great parallelism because of the strong properties of the relational algebra. In the latter case, a limited set of programs allows the user to program even clusters relatively easily. However, these two examples are point solutions only; we need to develop these types of solutions more broadly.

The key at the hardware level is to support heterogeneous operations, and not just strict SIMD. Additionally, the computer architecture community needs to add mechanisms that can improve the performance of software models; examples include transactional memory and speculative parallelism (for example, thread-level speculation). Adding these hardware features helps support the software model, which allows the programmer to continue to think sequentially.
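A minimal sketch of what raising the level of abstraction can look like in practice: the programmer writes sequential-looking map and reduce functions, and a runtime decides how to spread them across cores. Python's multiprocessing.Pool stands in here purely for illustration; it is not the SQL or MapReduce systems named above.

```python
from multiprocessing import Pool
from collections import Counter
from functools import reduce

def map_words(line):
    """Sequential-looking 'map' step: count the words in one line of text."""
    return Counter(line.split())

def merge_counts(a, b):
    """Sequential-looking 'reduce' step: combine two partial word counts."""
    a.update(b)
    return a

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the fox jumps"]
    with Pool() as pool:                       # the runtime, not the programmer, picks the parallelism
        partials = pool.map(map_words, lines)  # map phase fans out across worker processes
    totals = reduce(merge_counts, partials, Counter())
    print(totals.most_common(3))
```

The appeal is the one made above: the word-count logic itself reads sequentially, while the parallelism (and, in a real system, the locality and fault handling) lives below the abstraction.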
Multicore architecture
Hill: [removing faux white beard he has been wearing in imitation of Patt; see Figure 3] Let me take this beard off. It's making me think way too sequentially. So, Yale, is it correct that you think that we should have chips with lots of Niagaras and one beefier core? Are you saying that I've won the debate?
Figure 3. Panelists Yale Patt and Mark Hill and moderator Joel Emer (left to right). Hill removed his beard during the discussion, claiming it caused him to think too sequentially. Patt and Emer declined to remove theirs.
Patt: No.

Hill: You're trying to say that's the uniprocessor argument?

Patt: I'm saying that you need one or more fast uniprocessors on the chip. Without that, Amdahl's law gets in the way. That is, yes, your Niagara thing will run the part of the problem that is embarrassingly parallel. But what about the thing that really needs a serious branch predictor or maybe a trace cache? I have no idea what people are going to come up with in the future. What I don't want is for people to get into the don't-worry-about-it mind-set.

Hill: Okay, I agree that we need to have multiple cores and that one of those cores should be much more powerful, or alternatively several of those cores should be much more powerful.

Emer: So now Yale has won the debate.

Hill: No, that's multiple cores.

Patt: I'm not suggesting that having multiple cores is not important. What I am suggesting is that we shouldn't forget the part of the problem that requires the heavy-duty single core. You know, you hear a lot of people say, "The time for that has passed." There are so many opportunities to further improve single-core performance. Take the stuff they did at Universitat Politècnica de Catalunya (UPC) with virtual allocation, where, at rename time, they allocate a reorder buffer register but don't actually use it until it is really needed. They save the power from the time they rename it until the time they actually use it: doing things virtually until you really need them, turning off hardware until you really need it. There's a whole lot out there that, in fact, students in this room will work on, maybe, if they're not told, "Forget it; the real action is in multiple cores." The fact of the matter is, [smiling] the real action is in Web design.

Hill: So, in your view, what is the rate of performance improvement that we can expect to get out of a single core?
Patt: The rate is zero, if we think negatively. Sammy Davis Jr. wrote an autobiography entitled Yes I Can. That's what I would like to encourage people to think: "Yes I can," as opposed to just giving up and saying, "You know, when it's 10 billion transistors, I guess it won't be 20 Pentium 4s; it will be 200 Pentium 4s."

Hill: I guess I should remind you that I absolutely think that we can push uniprocessors forward. We can get improvements, but we're not going to see the improvements that we've seen in the past. And there are markets out there that have grown used to this opium of doubling in performance. That is what we are going to lose going forward.

Emer: How much architectural improvement have we seen?

Patt: We've seen quite a bit. In fact, there was a 2000 Asia and South Pacific Design Automation Conference presentation by a guy at Compaq/DEC [Digital Equipment Corporation] named Bill Herrick, who said that between the [Alpha] EV-4 and EV-8 there was a factor of 55 improvement in performance [2]. That represented a timeframe of about 10 years. The contribution from technology, essentially in cycle time, was a factor of 7. Thus, the contribution from microarchitecture (roughly 55/7, or about a factor of 8) was more than the contribution from technology. EV-4 was the first 21064, and EV-8, had it been delivered, would have been the 21464.

Take branch predictors. The Alpha 21064 used a last-taken branch predictor. The Alpha 21264 used a hybrid two-level predictor. Or, in-order versus out-of-order execution: the first Alpha was in-order execution; the third one was out-of-order execution. Issue width: you know, the first Alpha was two-wide issue; the second one was four-wide issue.

I am not saying that we should tell these students, "No multiple cores." I agree, we need to teach thinking and programming in parallel. But we also need to expose them to algorithms and to all of the other levels down to the circuit level.

Audience member: [interjecting] Because it's a bad idea. Abstraction is a good thing.
Patt: Abstraction's a good thing? Abstraction is a good thing if you don't care about the performance of the underlying entities. You know, many schools teach freshman programming in Java. So what's a data structure? Who cares? My hero is Donald Knuth, who teaches data structures by showing you how the data is stored in memory. Knuth says that unless the programmer understands how the data structure is actually represented in memory, and how the algorithm actually processes that data structure, the programmer will write inefficient algorithms. I agree. Ask a current graduate in computer science, "Does it matter whether or not all of the data to be sorted can be in memory at the same time?" How many will say yes and pick their sorting algorithm accordingly?
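The sorting question has a concrete answer, sketched below: if the data fits in memory, a library sort is the right choice; if not, an external merge sort is. The memory limit is an assumed toy parameter so the example stays self-contained.

```python
import heapq

MEMORY_LIMIT = 4  # assumed and artificially tiny, so the external path is exercised

def external_sort(keys):
    """Sort data 'too big' for memory: sort bounded runs, then k-way merge them.
    (A real external sort would spill each run to disk between the two phases.)"""
    runs = [sorted(keys[i:i + MEMORY_LIMIT])
            for i in range(0, len(keys), MEMORY_LIMIT)]
    return list(heapq.merge(*runs))  # streams the merge one element at a time

def sort_keys(keys):
    # The answer to "does it all fit in memory?" picks the algorithm.
    return sorted(keys) if len(keys) <= MEMORY_LIMIT else external_sort(keys)

print(sort_keys([9, 1, 7, 3, 8, 2, 6, 4, 5]))  # -> [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

A graduate who answers yes to the question would reach for the second branch (or a database) rather than trying to sort everything in one pass.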
tougher problem is parallelizing the work and coming up with the abstractions somewhere in the tool chain so that the work will be parallelized. I think that locality is also important, but it is not the hardest thing to do.

Patt: Yes, the developer needs tools to break down the work at large levels, but if you want performance, you're going to need knowledge of what's underneath.
Checkpointed processors
Audience member: Any thoughts on checkpointed processors and how they change the trade-off of processor complexity and performance, and perhaps the functionality for multithreading?

Hill: I think checkpointed processors are a very interesting and viable technique that is going to make uniprocessors better, but there are also arguments that they can help multiprocessing. For example, you have a paper accepted to ISCA on bulk sequential consistency (SC), where checkpointed processors help multiprocessing. I don't see checkpointed processors as fundamentally tipping this debate, but they are a good idea.

Emer: So, you're saying that there's research that applies to both uniprocessor and multiprocessor domains, potentially.

Patt: That's right.
to work behind an architectural boundary to make processors go faster, and software people can be largely oblivious to what we're doing.

Hill: A very concrete example of this is that Microsoft had no interest in architecture research until recently. And suddenly it occurred to them that being ignorant of what's happening on the other side of the interface is no longer a viable strategy.

Emer: They probably did get a lot of performance by being oblivious. That's all the legacy code that we do have.

Audience member: Actually, a good analogy is the memory consistency model issue. The weaker models were exposed to the software community for the longest time, and they chose to ignore them. But in the last five or six years, there's been this huge effort from the software community to get things fixed there. So, I do see a hope that the other people will band together with architects, but I think that something needs to be done proactively to enable that synergy to actually happen.

Patt: Yeah, so what [the audience member] has opened up is whether this should be a sociology debate rather than a hardware one, which I think is right. I think we are uniquely positioned where we are as architects because we're the center. There are the software people up here, and the circuit designers down here, and we, if we're doing our job, see both sides. We are uniquely positioned to engage those people. Historically, you said that software people are oblivious, and yet they still get performance because we did our job so well. I think we can continue to do our job so well. I don't think we're going to run out of performance if we address Amdahl's bottleneck. For a certain amount of time, I think [the audience member] is right, that we need to be engaging people and problems on a number of levels. In fact, I would say the number one problem today has nothing to do with performance, and that is security.

Hill: I just want to add one more comment. I think what's really tough is to parallelize
code. You may notice that at Illinois too a lot of software people did some parallel work in the 1980s and early 1990s. But the efforts petered out, and they left! And now there are very few parallel software experts around.
Data parallelism?

Audience member: Mark [Hill] has seemed to downplay the potential for data parallelism on the hardware level. It is much simpler to exploit data parallelism from a workload than thread-level parallelism, especially when dealing with a large number of cores (about 1,000) on a chip.

Emer: This kind of parallelism is much more power-efficient than multiple-core parallelism, because one can save all the bookkeeping overhead.

Hill: We need to have data parallelism (not SIMD), but I do not believe that data parallelism is as simple as doing identical operations in the same place, due to conditionals, encryption/decryption, and so on. It's not the same at the lowest level. I'm not convinced that data parallelism is sufficient at the lowest operation level, but it is needed in the programming model. Vectors are a cool solution, but they haven't seemed to generalize beyond a few classes of applications these past 35 years. The difference between vector processing and multicore processing is that we may have no alternative with the latter. The reason for this is that the uniprocessors will get faster through creative efforts on the part of people like Yale [Patt], but not at a rate of 50 percent a year. We need not only additional cores, but also additional mechanisms to use them, and that is what computer architects have to invent. Simply stamping out the cores may be okay for server land, but not for clients.

Patt: I'm not interested in server land, although that's what made SMT (simultaneous multithreading). If you have lots of different jobs, why not just have simpler chips and let this job run here and that job run there?

Emer: What does SMT have to do with that?

Patt: SMT was a solution looking for a problem, and the problem it found was servers.

Hill: I completely disagree with Yale. Multiple threads are a way to hide memory latency.

Patt: Using SMT to hide memory latency is application dependent, and not always a solution to the memory wall problem. Thus, we come back to Amdahl's law, which we need to continue to address.

Emer: That's what SMT allowed you to do!
they did a good job except when parallelism and locality were in conflict, resulting in too much communication.

Hill: Multicore does change things. For example, multicore has much better on-chip, thread-level communication than was previously possible. But multicore chips do have a different type of locality, in that the union of the cores should not thrash the chip's last-level cache. I would like programmers to manage locality, but it seems very difficult to do so.

Audience member: I agree that it's hard, but if you don't, it doesn't work.

Emer: Yale, [the audience member] is supporting you.

Patt: Then I should just sit quietly.
2. B. Herrick, presentation at the Asia South Pacific Design Automation Conf. (ASP-DAC '00); http://www.aspdac.com/2000/eng/ap/herrick2.pdf.
Joel Emer is an Intel Fellow and director of microarchitecture research at Intel, where he leads the VSSAD group. He also teaches part-time at the Massachusetts Institute of Technology. His current research interests include performance-modeling frameworks, parallel and multithreaded processors, cache organization, processor pipeline organization, and processor reliability. Emer has a PhD in electrical engineering from the University of Illinois. He is an ACM Fellow and a Fellow of the IEEE.

Mark D. Hill is a professor in both the Computer Sciences and Electrical and Computer Engineering Departments at the University of Wisconsin-Madison, where he also co-leads the Wisconsin Multifacet project with David Wood. His research interests include parallel computer system design, memory system design, computer simulation, and transactional memory. Hill has a PhD in computer science from the University of California, Berkeley. He is a Fellow of the IEEE and the ACM.

Yale N. Patt is the Ernest Cockrell Jr. Centennial Chair in Engineering at the University of Texas at Austin. His research interests focus on harnessing the expected benefits of future process technology to create more effective microarchitectures for future microprocessors. He is a Fellow of the IEEE and the ACM.

Joshua J. Yi is a performance analyst at Freescale Semiconductor in Austin, Texas. His research interests include high-performance computer architecture, simulation, low-power design, and reliable computing. Yi has a PhD in electrical engineering from the University of Minnesota, Minneapolis. He is a member of the IEEE and the IEEE Computer Society.

Derek Chiou is an assistant professor in the Electrical and Computer Engineering Department at the University of Texas at Austin. His research interests include
computer system simulation, computer architecture, parallel computer architecture, and Internet router architecture. Chiou has a PhD in electrical engineering and computer science from the Massachusetts Institute of Technology. He is a senior member of the IEEE and a member of the ACM.

Resit Sendag is an assistant professor in the Department of Electrical and Computer Engineering at the University of Rhode Island, Kingston. His research interests include high-performance computer
architecture, memory systems performance issues, and parallel computing. Sendag has a PhD in electrical and computer engineering from the University of Minnesota, Minneapolis. He is a member of the IEEE and the IEEE Computer Society.

Direct questions and comments about this article to Joel Emer, [email protected].