SECTION 2. – Technical Details
2.1. PowerPoint Summary
Figure 2.1-1. Penta-Chart showing key elements of the proposed research.
The proposed research seeks to create a quantum leap in computing by introducing ultra-high-clock-rate cores into multicore computing, where they can augment lower-clock-rate cores when parallelism is low. The idea is not so strange, as the Intel Nehalem (Core 5) multicore microprocessor chip uses this strategy. However, the clock speed-up possible here will be dramatically larger than is possible in CMOS alone. As such, the proposal fits the “heterogeneous cores” area of the OHPC BAA, but it also represents “heterogeneous technology integration” using 3D IC technology, because two entirely different processes are married in one IC. These are to be heterogeneously combined using 3D integration techniques, along with mass memory and (ultimately) a mass disk memory replacement, using ultra-wide data bus structures. Power is saved by eliminating a large number of pad drivers, while vastly increasing the data flow rate previously limited by ESD protection and package boundaries. This project builds upon newly emerging 90nm Silicon Germanium BiCMOS, along with recently released 32nm SOI CMOS technology developed at IBM, and experience gained over many years in DARPA HPC, DARPA NGI, DARPA TEAM, DARPA HPCS, DARPA 3DIC and a recent DARPA seedling sponsored by Mike Fritze.
2.2. Innovative Claims for the Proposed Research.
This BAA solicits methods for addressing several problems associated with using existing microprocessor technology in High Performance Computing. We propose to make a breakthrough in this area by combining two very new IBM technologies using Heterogeneous 3D integration. One of these is so new we were only able to obtain MOSIS price quotes from IBM this week.
Of paramount importance is the escalating power consumption found in modern microprocessors. Equally perplexing is the growing complexity of dealing with the trend towards multiple-core microprocessors and of extracting performance improvements from them. While appealing, and reminiscent of earlier generations of single-core parallel computing nodes, multiple-core microprocessors are significantly different. One issue is that multiple cores face increased contention for access to resources that lie outside the package boundary of the multiple-core microprocessor chip. These resources include large blocks of external memory and disk or network I/O. Conventional packages have limited pin count and limited bandwidth per pin. In addition, the pad driver dissipation and switching noise associated with conventional packaging impose their own power dissipation and regulation burden. Another significant difference is the clash between the on-chip network or crossbar mechanisms for inter-processor data communication and external chip-to-chip networks. In creating multiple-core microprocessors, a conscious decision was made throughout the industry not to use the increasing transistor counts associated with the advances of Moore’s Law primarily for larger on-chip microprocessor memory. Instead, a vision of multiple central processing units integrated onto one die made its appearance for commercial and personal computers. This was in part born of the saturation of the clock race, which itself was driven by the Dennard scaling rules for CMOS, the implication of this scaling strategy for wire resistance, and ultimately the resulting power dissipation, which derives from escalating numbers of repeater circuits. The potential for continuing performance improvements through parallelism led to the replacement of the “clock race” by the “core race.” Unfortunately, there are obstacles in this approach.
Power dissipation under this multiple-core integration strategy still has consequences, particularly for algorithms that parallelize only marginally. In some cases clock rates for multiple cores have regressed to lower frequencies. Intel, in its Nehalem Turbo strategy, de-powers cores operating at this lower clock rate when they are idle and redirects part of this power to speeding up perhaps one or two of the cores to execute serial code or manage I/O. This can help improve the uniformity of performance over time, especially during execution of sections of code that parallelize poorly, are essentially serial, or involve thread management. But the turbo increase in clock rate possible for CMOS is limited to perhaps a factor of 2, with the highest reported clock rate near 5GHz without cryogenic cooling. Nevertheless, this strategy provides a hint of how the present proposal will attempt to advance the state of the art, for much higher clock rates are possible due to impending advances in technology. The first of these is a new IBM 90nm SiGe HBT BiCMOS process offering, which is due to make its appearance in the time frame of the OHPC BAA’s period of performance, starting in mid-2011 and running through 2013. IBM designates this process 9HP. Its main offerings are a higher transit-time frequency for the SiGe HBT and Low Power options for the companion co-integrated CMOS. The higher transit-time frequency (300GHz vs 210GHz for IBM’s 130nm 8HP SiGe HBT BiCMOS node) could be used to reach new, higher-frequency clock regimes, or the new speed could be traded off for lower power. Used in this way the HBT could offer a 4x to 10x reduction in power between two successive technology nodes. IBM’s plans for access to this technology, and price quotations for that access, have only just become available.
As if to anticipate this BAA, another newly emerging process offering provides a unique low-power opportunity for dense CMOS memory. This is the IBM 32nm SOI CMOS technology (32SOI). It can be used to make extremely dense circuitry. However, while the 32nm 32SOI CMOS and 90nm 9HP BiCMOS processes are both IBM offerings, the two technologies are fabricated in entirely different foundries, the former at East Fishkill and the latter at Burlington, VT. Furthermore, they are fabricated on wafers of different diameters, the former at 300mm and the latter at 200mm. The proposed research explores a way ultimately to bring these two IBM technologies together through 3D wafer-to-wafer alignment and bonding, with subsequent dicing into singulated 3D dies.
Figure 1.2-1 Main Concept Slide for 3D Heterogeneous integration of “Best of Breed” SiGe HBT BiCMOS technology (9HP) and 32nm SOI CMOS.
Since these two exciting technologies (9HP and 32SOI) are on wafers of different diameters, at present any approach to full wafer bonding would require coring the 12-inch wafers down to 8 inches, as SiGe HBT technology is on 8-inch wafers. The expected impact would be to demonstrate how a High Clock Rate Unit (HCRU) can be fabricated with 3D memory wall mitigation at clock rates of 32GHz, at attractive power levels suitable for incorporation into a multiple-core processor for use when parallelism is absent. Although Figure 1.2-1 shows the optimum 3D manufacturing setting, as a practical matter full wafers for all tiers in the 3D stack might be too costly for this demonstration. Hence an alternative, Chip to Wafer (C2W), is proposed for this research, whereby singulated SOI dies are re-mounted or inlaid into a template with sufficient precision to align and bond these 32nm SOI dies onto a base 90nm SiGe HBT BiCMOS wafer.
Figure 1.2-2 Chip to wafer (C2W) bonding using an Inlay Template.
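The cost of coring can be illustrated with simple geometry. The following is a minimal sketch, assuming ideal circular wafers with no flats, notches, or edge exclusion; the function names are hypothetical and the result is a rough upper bound on usable area, not a yield estimate.

```python
import math


def wafer_area_mm2(diameter_mm: float) -> float:
    """Area of an ideal circular wafer (ignores flats, notches, edge exclusion)."""
    return math.pi * (diameter_mm / 2.0) ** 2


def retained_area_fraction(original_mm: float, cored_mm: float) -> float:
    """Fraction of the original wafer area that survives coring to a smaller diameter."""
    if cored_mm > original_mm:
        raise ValueError("cored diameter cannot exceed the original diameter")
    return wafer_area_mm2(cored_mm) / wafer_area_mm2(original_mm)


if __name__ == "__main__":
    # 300 mm (12-inch) SOI wafer cored down to 200 mm (8-inch), as in the text.
    frac = retained_area_fraction(300.0, 200.0)
    print(f"Area retained after coring 300 mm to 200 mm: {frac:.1%}")
```

Under these assumptions fewer than half of the 300mm wafer's dies survive coring, which is one reason the C2W route, where individual dies are placed on the smaller wafer, is attractive for a demonstration.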
Our proposal makes some assumptions about access to IBM’s 32SOI process through DARPA LEAP to provide the dies from that process. Because there is a risk of this not happening, a backup plan involves flipping the 9HP wafers and providing 90nm 3D memory that is co-inserted into the same lithography reticle plate set as the high clock rate processor. This would provide a 3D flow closer to that in Figure 1.2-1. Additionally, various methods of lowering the price of the dedicated full-wafer 9HP fabrication (such as stepper blading of MPW shared reticles, or surface laser ablation to protect the proprietary IP of other “ride sharers”) are being discussed, which could end up lowering the cost of the proposal. Because the OHPC proposal window is closing, however, we have opted to price the worst-case cost scenario, which still fits the $3M cap. If further information becomes available and an award is deemed possible, some renegotiation of the budget is possible. One other item related to cost is worth mentioning. As this is a university effort, the “burn rate” of most funding is fixed per month due to the non-severability of graduate student support packages. The only major increase would come in the third year (2013), when the 9HP fabrication takes place and the bulk of the IBM 9HP fabrication costs fall due.
2.3. Proposal Roadmap
As specified in the BAA for OHPC, a single phase of three years is anticipated. Nevertheless it is useful to think of three one-year objectives:
1) Design of basic HCRU building blocks. Continue the development of the C2W approach.
2) Fabrication of small test structures, design of the 3D demonstrator, assembly of blocks to form a 32 GHz processor.
3) Fabrication of 3D base wafers using MPW, inlay of C2W dies into the template for the upper tiers of the stack, 3D 8-inch wafer bonding, testing.
The goals of the program are to demonstrate that one or two cores of a multi-core processor could operate at ultra-high clock rates (32GHz at room temperature), and to use 3D memory wall mitigation to sustain a low CPI at modest power dissipation (estimated to be 40 watts). In addition, the proposed work would begin to exploit heterogeneous device integration to support a heterogeneous multicore architecture in which one or two cores operate at this extraordinary clock frequency.
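To make the relationship between clock rate, CPI (cycles per instruction), and power concrete, the sketch below computes instruction throughput and energy per instruction. The 32GHz clock and 40 watt figures are the stated targets; the CPI values are hypothetical, chosen only to show why a lower CPI matters, and the function names are illustrative.

```python
def instructions_per_second(clock_hz: float, cpi: float) -> float:
    """Sustained instruction rate for a core with the given clock and cycles per instruction."""
    if cpi <= 0:
        raise ValueError("CPI must be positive")
    return clock_hz / cpi


def energy_per_instruction_joules(power_w: float, clock_hz: float, cpi: float) -> float:
    """Average energy spent per instruction at a given power, clock, and CPI."""
    return power_w / instructions_per_second(clock_hz, cpi)


if __name__ == "__main__":
    clock_hz = 32e9   # target clock rate from the proposal
    power_w = 40.0    # estimated power dissipation from the proposal
    for cpi in (1.0, 2.0, 4.0):  # hypothetical CPI values, not measured results
        ips = instructions_per_second(clock_hz, cpi)
        epi = energy_per_instruction_joules(power_w, clock_hz, cpi)
        print(f"CPI={cpi:.0f}: {ips / 1e9:.1f} G instr/s, {epi * 1e9:.2f} nJ/instr")
```

At a fixed clock and power, every doubling of CPI halves the throughput and doubles the energy per instruction, which is why memory stalls would erode the benefit of the high clock rate.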
To accomplish this, two brand-new technologies, IBM 32SOI and IBM 9HP, will be combined using 3D technology. Each separately is a “best of breed,” and presumably no other combination would be better, since 3D heterogeneous integration can exploit whichever technology is appropriate in each section. 3D integration by full wafer bonding permits intimate stitching between the technologies using fine-dimension vertical vias between the tiers in the 3D stack.
The 32nm SOI is particularly appropriate for 3D stacked mass memory, and as an SOI technology it lends itself to face-to-face, wafer-to-wafer 3D integration, because silicon below the buried oxide layer can be completely removed with selective etching techniques, leading to the thinnest 3D wafer tiers possible by this approach. 3D stacked memory is essential to provide the ultra-wide data paths between the memory and processor with minimal CPI. Adapting this concept for the purpose of demonstrating the idea would involve continued development of the C2W approach. As a backup plan, intact SOI wafers could possibly be obtained directly from LEAP using a variety of laser ablation or stepper blading approaches and then cored down to 8 inches for bonding to 9HP wafers.
Another advantage of combining 9HP and 32SOI is that both will have cell libraries and macros for their CMOS components. Hence porting of systems into this environment will have excellent CAD tool support. However, 9HP and 32SOI are new processes, and access is extremely expensive right now. Throughout the period of performance these processes will become more and more cost effective. For this reason the first year of the proposed phase involves only design of a high clock rate processor. One vehicle to provide access to the 32SOI process would be the DARPA LEAP program. This will be explained in the subsequent text. If this proposal is funded, one of its side benefits will be establishing a pathway to explore 3D integration relatively inexpensively for other heterogeneous technology combinations. Later efforts could involve integrating FBC DRAM tiers, and even an MRAM disk replacement tier, permitting ultra-wide buses between all processors and memory, disk and I/O.
2.4. Technical Approach
Moore’s “Law” was an empirical observation enunciated in 1965 by Gordon Moore that the number of FET transistors implemented on an integrated circuit doubled roughly every few years, essentially through a 33% lithographic scaling per generation. Moore’s Law is not a law of physics. Rather, it is a statement of economic imperative. Initially this trend towards increasing numbers of transistors was accompanied by increasing clock rates, which were enabled by scaling. The progress was codified in the Dennard rules, which linked device and interconnection scaling strategies. However, even at the outset, wire parasitics were a challenge to the continuation of this clock race. Additional improvements beyond scaling were needed through the years, one of which was instruction-level parallelism. Notably, the introduction of additional layers of interconnections and of interlayer dielectrics with lower dielectric constants helped keep the clock race alive. These additional layers and dielectrics were forced by the fact that wire does not scale well. Wires and contacts increased in resistance. To combat this, repeaters were introduced to help reshape resistance-capacitance limited charging events. However, the number of these repeaters is now increasing at an alarming rate, as shown in Figure 2.4-1, a slide presented by IBM’s Ruchir Puri at SEMATECH. At least part of this problem is due to the TaN-Ru clad barrier against Cu diffusion from Cu wires, which has to be no less than 4nm thick to be effective. Because the barrier lines both sides of the wire, at an outer wire width of 8nm there would be NO copper left. Consequently, unless there is a breakthrough in carbon nanotubes, graphene, or high-temperature superconductors, this becomes one of the ultimate defining limitations of scaling.
Figure 2.4-1 Famous Ruchir Puri Slide showing growth of repeaters in CMOS if clock race were to continue.
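The barrier arithmetic behind the 8nm limit can be sketched as follows. This is a simplified one-dimensional model that assumes the barrier lines both sidewalls at its minimum thickness and ignores the trench bottom and wire height; the 4nm barrier and the 8nm width come from the text, while the other widths and the function names are illustrative.

```python
def copper_core_width_nm(wire_width_nm: float, barrier_nm: float) -> float:
    """Width left for copper after a barrier of the given thickness lines both sidewalls."""
    if wire_width_nm <= 0 or barrier_nm < 0:
        raise ValueError("wire width must be positive and barrier thickness non-negative")
    return max(0.0, wire_width_nm - 2.0 * barrier_nm)


def copper_width_fraction(wire_width_nm: float, barrier_nm: float) -> float:
    """Fraction of the outer wire width that is actually copper."""
    return copper_core_width_nm(wire_width_nm, barrier_nm) / wire_width_nm


if __name__ == "__main__":
    barrier_nm = 4.0  # minimum effective barrier thickness quoted in the text
    for width_nm in (32.0, 16.0, 8.0):  # 32 and 16 are illustrative; 8 is from the text
        core = copper_core_width_nm(width_nm, barrier_nm)
        frac = copper_width_fraction(width_nm, barrier_nm)
        print(f"outer width {width_nm:4.0f} nm -> copper {core:4.0f} nm ({frac:.0%})")
```

Because the barrier thickness does not shrink with the wire, the copper fraction falls faster than the wire width itself, so resistance per unit length rises more steeply than simple scaling would suggest.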
This repeater explosion has resulted in the introduction of multiple cores at modest clock rates. While appealing, this approach is not without its own problems. In this proposal we will focus on a few of these issues, one of which is the so-called Memory Wall problem. Whether continued progress in computing results from higher clock rates, multiple cores, or both, memory starvation of the processors is becoming more evident. One anomaly of the multiple-core evolution has been that the additional transistors provided by continued Moore’s Law improvements have not been used to implement mass 2D memory within the processor chip, but rather for more cores. The success of this approach then depends greatly on the footprint of data and instructions in memory. When this footprint exceeds internal cache sizes, there is competition for access to memory through the limited pin-out and per-pin bandwidth of the microprocessor package, and through the large external capacitance that must be driven to reach additional memory. In addition, memory itself is a product of scaling, and wire resistance is even more of a problem for this type of circuit. Continued DRAM scaling is in doubt. Use of 3D memory over the processor is the only way to mitigate memory starvation and obtain a kind of Moore’s Law for transistors in the third, or vertical, direction. Another problem is the implication of Amdahl’s Law: the serial fraction of a program limits the speedup available from adding cores. One Technical Merit of this proposal is the creation and exploration of a range of 3D-memory multicore chip designs that ultimately break through the Memory Wall problem, leading to a quantum advancement in computing performance by enabling very much higher clock rates through heterogeneous integration of advanced CMOS and BiCMOS. Indeed, the 9HP technology node is just emerging.
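The interaction between Amdahl’s Law and a single high clock rate core can be sketched with a simple model. The following is an illustrative extension of Amdahl’s Law, not a result from this proposal: it assumes the serial fraction runs on one core that is `serial_boost` times faster than a baseline core, while the parallel fraction is spread evenly over `n_cores` baseline cores. All numeric values are hypothetical.

```python
def heterogeneous_speedup(serial_fraction: float, n_cores: int, serial_boost: float) -> float:
    """Speedup over one baseline core when serial work runs on a boosted core.

    serial_fraction: share of baseline execution time that cannot be parallelized.
    n_cores:         number of baseline cores sharing the parallel work.
    serial_boost:    speed of the fast core relative to a baseline core
                     (1.0 reproduces the classic Amdahl's Law).
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    if n_cores < 1 or serial_boost <= 0:
        raise ValueError("n_cores must be >= 1 and serial_boost positive")
    serial_time = serial_fraction / serial_boost
    parallel_time = (1.0 - serial_fraction) / n_cores
    return 1.0 / (serial_time + parallel_time)


if __name__ == "__main__":
    # Hypothetical workload: 10% serial code on a 16-core chip.
    for boost in (1.0, 2.0, 8.0):
        s = heterogeneous_speedup(0.10, 16, boost)
        print(f"serial core {boost:.0f}x faster -> overall speedup {s:.2f}x")
```

In this model, with 10% serial code, sixteen identical cores yield only a 6.4x speedup, while accelerating just the serial portion recovers much of the lost parallel efficiency. This is the motivation for pairing a small number of very fast cores with many slower ones.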