Download Advanced Computer Architecture: Parallelism, Scalability, Programmability PDF

Title: Advanced Computer Architecture: Parallelism, Scalability, Programmability
Author: Kai Hwang, Naresh Jotwani
Tags: Computer Studies, Computer Architecture, Advanced Computer Architecture
Language: English
File Size: 82.2 MB
Total Pages: 749
Table of Contents
	About the Authors
	Contents
	Foreword to the First Edition
	Preface to the Second Edition
	Preface to the First Edition
Theory of Parallelism
	Parallel Computer Models
		THE STATE OF COMPUTING
		MULTIPROCESSORS AND MULTICOMPUTERS
		MULTIVECTOR AND SIMD COMPUTERS
		PRAM AND VLSI MODELS
		ARCHITECTURAL DEVELOPMENT TRACKS
	Program and Network Properties
		CONDITIONS OF PARALLELISM
		PROGRAM FLOW MECHANISMS
		SYSTEM INTERCONNECT ARCHITECTURES
	Principles of Scalable Performance
		PERFORMANCE METRICS AND MEASURES
		PARALLEL PROCESSING APPLICATIONS
		SPEEDUP PERFORMANCE LAWS
Hardware Technologies
	Processors and Memory Hierarchy
		ADVANCED PROCESSOR TECHNOLOGY
		SUPERSCALAR AND VECTOR PROCESSORS
		MEMORY HIERARCHY TECHNOLOGY
		VIRTUAL MEMORY TECHNOLOGY
	Bus, Cache, and Shared Memory
		BUS SYSTEMS
	Pipelining and Superscalar Techniques
		LINEAR PIPELINE PROCESSORS
		NONLINEAR PIPELINE PROCESSORS
		INSTRUCTION PIPELINE DESIGN
		ARITHMETIC PIPELINE DESIGN
		SUPERSCALAR PIPELINE DESIGN
Parallel and Scalable Architectures
	Multiprocessors and Multicomputers
		CACHE COHERENCE AND SYNCHRONIZATION MECHANISMS
		THREE GENERATIONS OF MULTICOMPUTERS
		MESSAGE-PASSING MECHANISMS
	Multivector and SIMD Computers
		MULTIVECTOR MULTIPROCESSORS
		COMPOUND VECTOR PROCESSING
		SIMD COMPUTER ORGANIZATIONS
		THE CONNECTION MACHINE CM-5
	Scalable, Multithreaded, and Dataflow Architectures
Software for Parallel Programming
	Parallel Models, Languages, and Compilers
		PARALLEL PROGRAMMING MODELS
		PARALLEL LANGUAGES AND COMPILERS
		DEPENDENCE ANALYSIS OF DATA ARRAYS
		CODE OPTIMIZATION AND SCHEDULING
		LOOP PARALLELIZATION AND PIPELINING
	Parallel Program Development and Environments
		PARALLEL PROGRAMMING ENVIRONMENTS
		SYNCHRONIZATION AND MULTIPROCESSING MODES
		SHARED-VARIABLE PROGRAM STRUCTURES
		MESSAGE-PASSING PROGRAM DEVELOPMENT
		MAPPING PROGRAMS ONTO MULTICOMPUTERS
Instruction and System Level Parallelism
	Instruction Level Parallelism
		INTRODUCTION
		BASIC DESIGN ISSUES
		PROBLEM DEFINITION
		MODEL OF A TYPICAL PROCESSOR
		COMPILER-DETECTED INSTRUCTION LEVEL PARALLELISM
		OPERAND FORWARDING
		REORDER BUFFER
		REGISTER RENAMING
		TOMASULO’S ALGORITHM
		BRANCH PREDICTION
		LIMITATIONS IN EXPLOITING INSTRUCTION LEVEL PARALLELISM
		THREAD LEVEL PARALLELISM
	Trends in Parallel Systems
		BRIEF OVERVIEW OF TECHNOLOGY
		FORMS OF PARALLELISM
		CASE STUDIES
		PARALLEL PROGRAMMING MODELS AND LANGUAGES
	Answers to Selected Exercises
	Bibliography
	Index
                        
Document Text Contents
Page 1

ADVANCED COMPUTER ARCHITECTURE
Parallelism, Scalability, Programmability

Second Edition

Page 2

About the Authors
Kai Hwang is a Professor of Electrical Engineering and Computer Science at the University of Southern
California. Prior to joining USC, he was a faculty member at Purdue University for 10 years. He received his
undergraduate education at the National Taiwan University in China and earned his PhD degree from
the University of California at Berkeley. Dr Hwang has been engaged in research and teaching on
computer architecture, parallel processing and network-based computing for well over 30 years. He
has authored or coauthored five books and 120 journal and conference papers in the Computer Science
and Engineering areas. He is the founding Co-editor-in-Chief of the Journal of Parallel and Distributed
Computing.

He has served as the founding Director of the USC Computer Research Institute and a Distinguished
Visitor of the IEEE Computer Society. He has chaired several international computer conferences and
lectured worldwide on advanced computer topics. His research has been supported by NSF, IBM, AT&T,
AFOSR, ONR, DOT, Alliant, and Intel. He has been a consultant for IBM, JPL, Fujitsu, Japan's ETL, GMD
in Germany, and ITRI and Academia Sinica in China. He is also a member of the advisory boards of several
international journals and research organizations.

The Institute of Electrical and Electronics Engineers elected him as an IEEE Fellow in 1986 for his
contributions in computer architectures, digital arithmetic, and parallel processing. He was the holder of the
Distinguished CDC Visiting Chair Professorship in Computer Science at the University of Minnesota during
the spring quarter of 1989. He has guided over a dozen PhD students at Purdue and USC. At present, he
heads a sponsored research project on Grid Security at USC. His current research interests are in the areas of
network-based computing, Internet security, and clustered systems. Over the years, he has received numerous
awards for outstanding teaching and research, and delivered invited and keynote lectures in many countries.

Naresh Jotwani is presently serving as Director, School of Solar Energy, Pandit Deendayal Petroleum
University, Gandhinagar. Earlier, he has served as Professor and Dean (R&D) at DA-IICT, Gandhinagar, and
as Principal at G H Patel College of Engineering and Technology, Vallabh Vidyanagar.

Dr Jotwani obtained his BTech degree in Electrical Engineering from IIT Bombay, and his Doctorate in
Computer Science from Rice University, Houston. His teaching career has spanned over twenty-five years,
in India, Singapore and the US. He has also worked in the IT industry for about five years, in India and
Singapore, with brief stints in the US. In the early 1980s, he worked on system software development for a
64-bit multiprocessor system with microcoded instructions for inter-process communication.

Dr Jotwani has carried out several consultancy assignments, written four books and several research
publications, and delivered numerous invited lectures. His textbook Computer System Organisation was
published by Tata McGraw-Hill. His current research interests are in the field of solar photovoltaic devices.

Page 374


Table 8.1 Summary of Early Supercomputers

System model | Maximum configuration, clock rate, OS/compiler | Unique features and remarks
Cray 1S | Uniprocessor with 10 pipelines, 12.5 ns, COS/CFT77 2.1 | First ECL-based supercomputer, introduced in 1976.
Cray 2S/4-256 | 4 processors with 256M-word memory, 4.1 ns, COS or UNIX/CFT77 3.0 | 16K-word local memory; ported UNIX V, introduced in 1985.
Cray X-MP 416 | 4 processors with 16M-word memory and 128M-word SSD, 8.5 ns, COS/CFT77 5.0 | Shared register clusters for IPC, introduced in 1983.
Cray Y-MP 832 | 8 processors with 32M-word memory, 6 ns, CFT77 5.0 | Enhanced from the X-MP, introduced in 1988.
Cray Y-MP C-90 | 16 processors with 2 vector pipes per processor, 4.2 ns, UNICOS/CFT77 5.0 | The largest Cray machine, introduced in 1991.
CDC Cyber 205 | Uniprocessor with 4 pipelines, 20 ns, virtual OS/FTN 200 | Memory-to-memory architecture, introduced in 1982.
ETA 10E | Uniprocessor, 10.5 ns, ETAV/FTN 200 | Successor to the Cyber 205, introduced in 1985.
NEC SX-X 44 | 4 processors with 4 sets of pipelines per processor, 2.9 ns, F77SX | Succeeded the SX-2 Series, introduced in 1991.
Fujitsu VP-2600/10 | Uniprocessor with 5 vector pipes and dual scalar processors, 3.2 ns, MSP-EX, F77 EX/VP | Reconfigurable vector registers and masking, introduced in 1991.
Hitachi S-820/80 | 13 functional pipelines in a uniprocessor with 512 Mbytes memory, 4 ns, FORT77/HAP V23-OC | Introduced in 1987 with 64 I/O channels providing a maximum of 288 Mbytes/s transfer.

Ten functional pipelines could run simultaneously in the Cray 1S to achieve a computing power equivalent
to that of 10 IBM 3033's or CDC 7600's. Only batch processing with a single user was allowed when
the Cray 1 was initially introduced, using the Cray Operating System (COS) with a Fortran 77 compiler
(CFT77 Version 2.1).
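
As a quick cross-check (a back-of-the-envelope sketch by the editor, not a figure from the text): the 12.5-ns cycle corresponds to a clock rate of 1/(12.5 × 10^-9 s) = 80 MHz, so each pipeline could deliver at most about 80 million results per second once filled; chaining an add pipeline with a multiply pipeline gives the commonly quoted 160-Mflops peak for the Cray-1.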

The Cray X-MP Series introduced multiprocessor configurations in 1983. Steve Chen led the effort at Cray
Research in developing this series using one to four Cray 1-equivalent CPUs with shared memory. A unique
feature introduced with the X-MP models was shared register clusters for fast interprocessor communications
without going through the shared memory.

Besides 128 Mbytes of shared memory, the X-MP system had 1 Gbyte of solid-state storage (SSD) as
extended shared memory. The clock rate was also reduced to 8.5 ns. The peak performance of the X-MP-
416 was 840 Mflops when eight vector pipelines for add and multiply were used simultaneously across four
processors.
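
The 840-Mflops figure can be sanity-checked (an editorial sketch assuming one result per pipeline per clock, and the 9.5-ns clock of the earliest X-MP models rather than the later 8.5-ns parts): 8 pipelines × (1/9.5 ns) ≈ 8 × 105 MHz = 840 Mflops.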

Page 375

The successor to the Cray X-MP was the Cray Y-MP, introduced in 1988 with up to eight processors in a
single system using a 6-ns clock rate and 256 Mbytes of shared memory.

The Cray Y-MP C-90 was introduced in 1990 to offer an integrated system with 16 processors using a
4.2-ns clock. We will study models Y-MP 816 and C-90 in detail in the next section.

Another product line was the Cray 2S, introduced in 1985. The system allowed up to four processors with
2 Gbytes of shared memory and a 4.1-ns clock. A major contribution of the Cray 2 was to switch from the
batch-processing COS to multiuser UNIX System V on a supercomputer. This led to the UNICOS operating
system, derived from UNIX/V and Berkeley 4.3 BSD, variants of which are currently in use in some Cray
computer systems.

The Cyber/ETA Series Control Data Corporation (CDC) introduced its first supercomputer, the STAR-100,
in 1973. The Cyber 205 was its successor, produced in 1982. The Cyber 205 ran at a 20-ns clock rate, using up to
four vector pipelines in a uniprocessor configuration.

Different from the register-to-register architecture used in Cray and other supercomputers, the Cyber
205 and its successor, the ETA 10, had a memory-to-memory architecture with longer vector instructions
containing memory addresses.

The largest ETA 10 consisted of 8 CPUs sharing memory and 18 I/O processors. The peak performance
of the ETA 10 was targeted for 10 Gflops. Both the Cyber and the ETA Series are no longer in production but
were in use for many years at several supercomputer centers.

Japanese Supercomputers NEC produced the SX-X Series with a claimed peak performance of 22 Gflops
in 1991. Fujitsu produced the VP-2000 Series with a 5-Gflops peak performance at the same time. These two
machines used 2.9- and 3.2-ns clocks, respectively.

Shared communication registers and reconfigurable vector registers were special features in these
machines. Hitachi offered the 820 Series providing a 3-Gflops peak performance. Japanese supercomputers
were at one time strong in high-speed hardware and interactive vectorizing compilers.

The NEC SX-X 44 NEC claimed that this machine was the fastest vector supercomputer (22 Gflops peak)
ever built up to 1992. The architecture is shown in Fig. 8.5. One of the major contributions to this performance
was the use of a 2.9-ns clock cycle based on VLSI and high-density packaging.

There were four arithmetic processors communicating through either the shared registers or via the shared
memory of 2 Gbytes. There were four sets of vector pipelines per processor, each set consisting of two add/
shift and two multiply/logical pipelines. Therefore, 64-way parallelism was obtained with four processors,
similar to that in the C-90.
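
The 64-way figure follows directly from the configuration: 4 processors × 4 pipeline sets × (2 add/shift + 2 multiply/logical pipelines) = 64 pipelines. As a rough cross-check against the claimed peak (an editorial sketch assuming one result per pipeline per clock): 64 / (2.9 × 10^-9 s) ≈ 22.1 × 10^9 flops, matching the 22-Gflops claim.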

Besides the vector unit, a high-speed scalar unit employed RISC architecture with 128 scalar registers.
Instruction reordering was supported to exploit higher parallelism. The main memory was 1024-way
interleaved. The extended memory of up to 16 Gbytes provided a maximum transfer rate of 2.75 Gbytes/s.

A maximum of four I/O processors could be configured to accommodate a 1-Gbyte/s data transfer rate per
I/O processor. The system could provide a maximum of 256 channels for high-speed network, graphics, and
peripheral operations. The support included 100-Mbytes/s channels.

Page 748

Index
synchronization latency 53
synchronization overhead 123, 434, 458
synchronized protocol 190
synchronous message passing 427, 564
synchronous model 223
synchronous program graph 380
synchronous timing 185
system area network 635
system attributes 13
system clock driver 184
system deadlock 557
system efficiency 95
system interconnect 17, 281
system performance 535
system software support 7
system throughput 14
system-on-a-chip 351, 624, 630, 660
systolic arrays 10, 72
systolic program graph 350

T
tag unit 244
tagged-token architecture 62
Tandem multiprocessor 82
target buffers 243
task parallelism 663
temporal locality 164
Tera computer system 40, 102, 169
Tera design goals 452
Tera multiprocessor 452
Tera pipeline structure 455
Tera sparse 3-D torus 453
Tera thread state & management 456
test&set 555
thread context 623
thread management 623
thread-level parallelism 623, 661
three-address instruction format 539
throughput 229, 231
TI-ASC 4, 263
TI-ASC arithmetic processor 264
tightly coupled systems 17
TILE-64 system-on-a-chip 659
Tilera 658
Tilera iMesh interconnect 659
tiling 526
tiling for locality 528
time complexity 30, 55
timing protocols 185
TLB, translation lookaside buffer 140, 169, 192
TMC CM-2 (see CM-2)
TMC CM-5 (see CM-5)
TMC, Thinking Machines Corporation
token-matching 62
Tomasulo's algorithm 248, 263, 610
torus network 72
tournament predictor 617
trace scheduling compilation 518
trace-driven simulation 202
transaction 186, 191, 636
transaction modes 189
transaction processing 77, 641
transactions per second 99
transfer bandwidth 160
transformation matrices 522
transistor count 588
Transputer 77, 315, 565
traveling salesperson problem 580
TSO weak consistency model 219
turnaround time 12
two-bit predictor 616

U
UMA multiprocessor 17
unit of transfer 160
Univac LARC 4
UNIX 195, 212
UNIX BSD 4.0 212
UNIX System V 212
unknown dependence 45
USC/OMP 431
user partitions 393
utilization 95

V
vacuum tubes 4

tt 120, 392

Page 749

vector 341
vector access memory schemes 345
vector add 323
vector balance point 352
vector instructions 156, 341
vector load 372
vector loops and chaining 324
vector memory instructions 343
vector multiply 322
vector operand specifications 345
vector pipelines 157
vector processing 341
vector processor/computer 10, 11, 25, 156, 341
vector recurrence 377
vector reduction 343, 510
vector store 323
vector supercomputers 134
vector units 398
vector/scalar ratio 352
vectorization 49, 341
vectorization inhibitors 511
vectorization methods 508
vectorizing compiler 54
vector-scalar instructions 343
vector-vector instructions 342
very large scale integration (see VLSI)
viewing angle 632
virtual address cache 193
virtual address space 167
virtual channels 322
virtual computing environments 642
virtual cut-through 325, 655
virtual interrupt 189
virtual memory 167
virtual networks 332
virtual processing element 673
virtual systems 661
visualization support 544
VLIW architecture 154
VLIW, very long instruction word 134, 601
VLSI complexity model 33
VLSI design 598
VLSI technology 587, 630
VLSI, very large scale integration 4
VMEbus 182, 189

W
wait protocols 545

WAR, write after read dependence 45, 590, 609, 612
wave fronting 523
WAW, write after write dependence 45, 590
weak consistency 124, 218, 419
weighted arithmetic mean execution rate 92
weighted arithmetic mean execution time 93
weighted harmonic mean execution rate 93
weighted harmonic mean speedup 93
Whetstone 92
wide area network 635
window size 213, 621
wire length 67
wired barrier synchronization 309
Wisconsin Multicube 431
work performed 90, 646
work-efficient 647
working set 165, 213
workload 105
wormhole routing 314, 320
write buffers 409
write-back cache 192, 300
writeback stage 240
write-invalidate protocol 299
write-once protocol 301
write-through cache 192, 299
write-update protocol 299

X
Xerox 637
X-Net mesh interconnect 391
X-Y routing 326

Y
Yale Linda 411

Similar Documents