2.4.1 Lithography and wire scaling limitations – implications for power:
In an attempt to keep pace with Moore’s law and to improve the performance of CMOS devices, the industry has continued to push lithographic scaling. Scaling, however, has hit limits on several fronts. Devices have become more difficult to control as threshold voltages have been lowered, and several short-channel effects have become unmanageable. At nodes of 22nm or below, one faces additional problems with severe leakage currents. Performance has also hit a bottleneck because the wires do not scale effectively: RC delays across the wires have increased, and as devices shrink, the distances between logic blocks have grown, requiring longer global wires. The solution has been to insert repeaters to reduce the delay, or to add more layers of wiring. However, when a higher clock rate is sought, the number of repeaters required has been increasing roughly tenfold from one technology node to the next. Evidence for this trend was presented in a slide (by Ruchir Puri of IBM) in September 2007 at a SEMATECH workshop in Albany, New York, documenting a minimum of a 10X increase in the number of repeaters per 33% lithographic shrink. The count began at a modest 2,000 repeaters at 130nm, but by 45nm it had risen to 2.5 million. At 33nm the number would be expected to grow to 25 million if clock increases are sought; at 22nm one can expect 250 million; and at 15nm, 2.5 billion (there are a number of other reasons for skepticism about scaling beyond 15nm). These repeaters employ the leakiest transistors, since they are directed toward enhancing speed. This trend, along with additional leakage mechanisms, has led to steadily increasing power. This has forced a change from clock-rate enhancement in CMOS to multiple cores: initially simpler cores with less instruction-level parallelism, in favor of independent parallel processors.
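The tenfold-per-shrink repeater trend is simple to project forward; the minimal Python sketch below uses only the node names and the 45nm count quoted above and reproduces the figures cited for 33nm, 22nm, and 15nm:

```python
# Repeater-count projection: roughly a 10x increase per ~33%
# lithographic shrink, per the SEMATECH figures quoted in the text.
nodes = [45, 33, 22, 15]            # nm, the nodes named in the text
repeaters = {45: 2.5e6}             # reported count at 45nm
for prev, node in zip(nodes, nodes[1:]):
    repeaters[node] = repeaters[prev] * 10  # one decade per shrink

for node in nodes:
    print(f"{node}nm: {repeaters[node]:.1e} repeaters")
```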
Wire resistance continues to climb, primarily because smaller wire cross sections and electron surface scattering adversely affect resistivity, and is ultimately one of the limiting factors on continued scaling, perhaps the definitive one. The net effect was projected by the ITRS in its 2008 power projection for microprocessors. Implicit in this plot are the continued increase in the number of cores and the reliance on CMOS multi-cores alone for future improvements. As partial validation, the new IBM z196 8-core 8 GHz microprocessor dissipates 250 watts, essentially already at the 2015 projection.
Figure 2.4.1-2. ITRS 2008 Power Roadmap for multicore microprocessors demonstrating dynamic switching power as the largest contributor to power, though CMOS leakage is emerging.
Of particular interest is the evolution of non-CMOS devices, particularly in silicon, to try to obtain better speed-power performance. A frequently overlooked candidate is the humble bipolar transistor. The popular view of this device is that it is intrinsically a power hog. However, the device has also been shrinking, and in its modern form as a SiGe HBT a distinctly new situation has arisen, one in which this device can indeed make a difference.
Figure 2.4.1-3. Cross section of a SiGe HBT (red arrows point to the emitter, graded Ge alloy base, collector, and subcollector), device fT vs. Jc curve up to 8XP/9HP, and a typical mixed BiCMOS CML circuit.
A key plot revealing the true power-speed situation was presented this October at a DARPA OSD meeting at the IBM facility in Burlington, VT. It is shown in the following figure.
Figure 2.4.1-4. Comparison of Power vs. switching speed, for HBT and CMOS CML, for 76fF of wire loading. The comparison changes as a function of wire loading. Clearly there are regions of performance where the HBT is superior.
2.4.2 Multicore approach:
In order to keep up with the demand for increased processor throughput, there were two options: increase the clock rate, or use multi-core processors. Increasing the clock rate has been abandoned over the past few years because of limitations in CMOS devices and problems with wire delays and the associated power dissipation. The alternative has been multicore processors. The industry went from single core to dual core at 65nm, to quad core at 45nm, and 8 cores are planned at 33nm. This approach provided some initial benefit, improving performance by providing more cores to execute the parallelizable threads of a given program. There are, however, limitations: not all code is completely parallelizable, as discussed in the next section on Amdahl’s law. There is also a bottleneck when multiple cores attempt to access memory at the same time; limited bus width imposes a barrier and thus limits the performance gain achievable from simply increasing the number of cores. Mitigating this effect with 3D is one of the Technical Merits of the proposed research.
2.4.3 Amdahl’s law:
Amdahl’s law originally described the performance improvement achievable in a system by speeding up one of its components. The law was later extended to evaluate the performance benefit one can achieve from parallelization of code. As discussed in the Project Summary, microprocessor systems have now fully embraced the multicore approach to continuing processor throughput growth, in an era where the scaling limits for wiring, repeater quantities, and the associated power dissipation have forced this paradigm change. The premise of the multi-core approach is that code that can be parallelized can be sped up by introducing more processor cores to execute it. However, most code has both a serial component and a parallel component. Forty years ago Amdahl created a figure of merit (or FOM) given by

FOM = 1 / (S + P/n)
where P is the fraction of the code that can be parallelized, S is the fraction that cannot be parallelized, and n is the number of cores. As n goes to infinity, the figure of merit approaches 1/S = 1 + P/S (since S + P = 1). So if S is 33%, as Amdahl claimed, the FOM is limited to about 3.00. However, there are notable applications where S is 4% (the FOM rises to 25.00) or even lower. The plot of the FOM versus P reveals just how challenging the problem can be.
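The FOM is easy to evaluate numerically; the short Python sketch below, using the document's own serial fractions of 33% and 4%, confirms the limiting speedups quoted above:

```python
def amdahl_fom(P, n):
    """Amdahl figure of merit: FOM = 1 / (S + P/n),
    where S = 1 - P is the serial fraction and n is the core count."""
    S = 1.0 - P
    return 1.0 / (S + P / n)

# As n -> infinity the FOM saturates at 1/S = 1 + P/S:
print(round(amdahl_fom(2/3, 10**9), 2))   # S = 33%: limit ~3.0
print(round(amdahl_fom(0.96, 10**9), 2))  # S = 4%:  limit ~25.0
```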
Figure 2.4-1. Plot of Amdahl’s Figure of Merit versus P for n = 16, 32, 64, 128 and 512.
Essentially, S must be smaller than about 4% to show large improvements in run time. More significant, though, is the steep slope of the curve near 4%, which reveals how sensitive the FOM is to various assumptions. Of particular interest in this proposal is the fact that the Amdahl FOM assumes ideal parallelism, in which there is no penalty paid in actually implementing parallelism.
In reality, when devising hardware (and software) to implement parallel code execution, there are inevitably lost cycles, L, due to memory latency, multithread and multitask code management, inter-processor communication, and memory management, among others. Comparing the total code cycles without parallelization against the code cycles, including lost cycles, with parallelization, we obtain a modified Amdahl figure of merit:

FOM = 1 / (S + P/n + L/n)
If we divide L by n and define this ratio as an average “burden” per core for parallelization,

B = L/n,

then if B tends to be a constant independent of n for a given problem, B masquerades as serial code:

FOM = 1 / ((S + B) + P/n)
If B is even moderately large, one can see from Figure 2.4-1 that these lost cycles can easily push the performance from its ideal down the steep curve. Mitigation of this effect is one of the Technical Merits and Broader Impacts of the proposal.
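The burden effect can be made concrete with a small sketch. The function below follows the modified FOM as defined above; the example values of S = 4% and B = 4% are illustrative choices, not figures from the text:

```python
def modified_fom(P, n, B):
    """Modified Amdahl FOM with average per-core burden B = L/n.
    Because B is constant in n, it adds directly to the serial
    fraction: FOM = 1 / ((S + B) + P/n)."""
    S = 1.0 - P
    return 1.0 / ((S + B) + P / n)

# With S = 4%, a burden of just B = 4% halves the large-n limit:
print(round(modified_fom(0.96, 10**9, 0.00), 1))  # ~25.0 (ideal)
print(round(modified_fom(0.96, 10**9, 0.04), 1))  # ~12.5
```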
2.4.4 Amdahl’s Law is Only the Upper Bound – or – “Is that all there is?”
Amdahl’s FOM is only an upper bound on performance. In fact, parallelization is extremely sensitive to assumptions about architecture. Multicore processors have now been available since the 65nm node. Torrellas at UIUC has developed a very powerful multicore simulator and made it available to the general community in source form, so that various assumptions about architecture can be explored. We have used this simulator to explore the degree to which multiple cores can execute the benchmarks provided by the Stanford SPLASH-2 suite, and with a few notable exceptions we find that the performance projections vs. the number of cores do not look good.
Figure 2.4-2. Sample performance in wall-clock time (relative to one core) of multi-core execution of the Stanford SPLASH-2 benchmarks for FFT (left) and LU decomposition (right). Performance on more than about 4 cores is marginal, and actually worsens beyond 8-16 cores as the serial code increases.
To be sure, there are notable exceptions to this observation, including graphics and other applications that do not provoke contention over scarce resources. But it is clear that in many of these applications, fewer, faster cores with improved memory-wall mitigation and faster I/O would help when the degree of parallelism is low. For example, in Figure 2.4-2, if a single core could execute code 10 times faster through some breakthrough in clock rate, while operating at the same (or lower) power level, then the execution time on one core would be lower than any of the multi-core solutions shown. But this would in effect require a return to the clock race, and a complete re-examination of all the scaling strategies that have led us to this point. When this situation arises, a heterogeneous collection of cores in which one or two excel in clock rate would make sense, assuming that contention over shared resources can be mitigated. A similar idea appears in Intel’s Nehalem, which raises the clock rate when parallelism is unproductive, but in this proposal we take it to a higher clock rate still.
2.4.5 Proposed 3D memory wall (and ultimately “disk wall”) latency mitigation:
In the recent past, supercomputers would employ single-core microprocessors with the bulk of main memory exterior to the chip package. While there is a memory wall problem even for these processors, the problem compounds when clock rates climb or the number of on-chip cores increases, due to contention and limited bandwidth to the main memory. For example, in a recent benchmark of 4.7 GHz PowerPC computers, the CPI was measured to be as high as 4.47. Similarly, in a recent benchmark comparison of Mac Pro multicore processors, a 12-core processor exhibited only 30% additional performance (260 speedmarks vs. 200 speedmarks) for three times as many cores relative to 4 cores at comparable clock rates.
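The scaling efficiency implied by the Mac Pro comparison is easy to back out, using the speedmark numbers as reported above:

```python
# Parallel scaling efficiency implied by the Mac Pro benchmark
# figures quoted in the text.
cores_4, score_4 = 4, 200     # speedmarks at 4 cores
cores_12, score_12 = 12, 260  # speedmarks at 12 cores

speedup = score_12 / score_4                 # 1.3x for 3x the cores
efficiency = speedup / (cores_12 / cores_4)  # fraction of ideal scaling
print(f"speedup {speedup:.2f}x, scaling efficiency {efficiency:.0%}")
```

That is, the extra eight cores deliver well under half of their ideal contribution, consistent with the memory-contention argument above.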
Figure 2.4.4-1. Picture of a conventional single-core parallel processor board in an early RPI IBM Blue Gene supercomputer.
One of the strategies we propose to address this is the mitigation of memory latency using 3D memory-over-processor technology. Ultimately, with the introduction of MRAM-over-processor, this work can extend to disk replacement, mitigating disk latency in the same way, eliminating rotational and head-seek time entirely, and providing an ultra-wide bus for high-speed data transport. Figure 2.4.4-2 shows a hypothetical 16-core processor (left) packaged separately from its companion memory (right). At certain points in time, many of these cores will require access to memory, but the limited package pin-out and bandwidth-per-pin will prevent much of this attempted access from succeeding. The result is memory starvation of many of the cores as they attempt access to the common main memory.
Figure 2.4.4-2. Memory Throttling Effect Due to Package Pin Out Limitations for 16 cores.
As the number of cores increases, memory starvation will also increase. Simulating this effect is relatively easy using readily available tools such as Dinero, which provides a gross metric in clocks per instruction (CPI).
Figure 2.4.4-3. Left: CPI for a multiple-core processor, showing the increase with the number of cores and the decrease with the width of the bus. Right: Ultra-wide bit-path bus enabled by “intimate” 3D memory-over-processor chip stacking.
When cache memory transfers occur, package limitations force transfers of one to perhaps as many as four words in parallel, at transfer rates consistent with ESD protection and package parasitics. The challenge in terms of bus width for these transfers is doubled if LVDS (Low Voltage Differential Signaling) pad driver/receivers are used. Most package pins are devoted to power and ground connections, which severely limits the number of pins available for I/O. In addition, each pad driver consumes perhaps 30-40 milliwatts, depending on the amount of external wiring capacitance that must be driven. If the bus width were increased to, say, 2000 bits, the total I/O power would already be 80 watts for the pad drivers alone! Furthermore, a massive amount of switching noise is implied when using pad drivers to link to memory via external high-capacitance connections. Pad drivers also carry ESD protection, with associated large-area diode clamp parasitics that limit their speed. One of the advantages of 3D chip-stacking technology is that the bus between memory and the processor may be made extremely wide, much wider than is permissible with conventional packaging. It is possible in Dinero to vary the number of bits transferred simultaneously per cache-line transfer, and the result is as shown in Figure 2.4.4-3 (left). The main mechanism by which 3D chip stacking mitigates the multicore memory wall is by increasing the transfer rate. In 3D chip stacking, a large number of very short vertical via connections, called Through-Silicon Vias (TSVs) or, for SOI, Through-Oxide Vias (TOVs), becomes possible. Figure 2.4.5-1 shows a cross section of the Lincoln Labs face-to-face wafer-to-wafer (W2W) bonded 3D chip stack. The vertical TSVs are currently 1.7 microns by 1.7 microns on a 7.0 micron pitch, and only a few microns in length. The Rensselaer team designed a fully functional 3-tier 3D SRAM using that Lincoln Labs 3D process, realized in 180nm FDSOI CMOS.
This is very nearly what is desired here, except that 32nm SOI can now be used for that 3D memory; work by Mark Bohr at Intel has shown a memory access speed of 4 GHz at that node. Hence thousands of bits can fly back and forth to memory in 0.25 ns, without the power dissipation and switching noise of pad drivers.
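The power and latency figures above follow from simple arithmetic. In the sketch below, the 40 mW per-driver figure is the text's upper estimate, and the 2000-bit width and 4 GHz access rate are the values quoted:

```python
# Conventional packaging: pad-driver power for a wide external bus.
bus_bits = 2000
pad_driver_w = 0.040                  # 40 mW per driver (upper estimate)
io_power_w = bus_bits * pad_driver_w  # power for pad drivers alone

# 3D stacking: one wide parallel transfer at the 32nm SOI access rate.
clock_hz = 4e9
cycle_ns = 1e9 / clock_hz             # time to move bus_bits at once

print(f"pad-driver power for {bus_bits}-bit external bus: {io_power_w:.0f} W")
print(f"one 3D transfer of {bus_bits} bits: {cycle_ns} ns")
```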