CSC
Cray T3E User’s Guide
Juha Haataja and Ville Savolainen (eds.)
Center for Scientific Computing, Finland
Besides the portable MPI and PVM message-passing systems, the high-performance SHMEM library is available. This is a Cray-spec
program must be compiled with the option -g, which disables optimization and thus increases the run time.
More information ab
The source code includes the VAMPIRtrace API calls by the preprocessor macro USE_VT. It is recommended to include the d
Figure 9.4: A sample of a VAMPIR session. On the lower right corner is the global timeline display showing the com
interval by selected processors.
• Process view shows the portion of time spent in a given activity class.
All the displa
Chapter 10
Miscellaneous notes
This chapter discusses some additional topics, such as timing of programs and defining the scala
  /* Wall clock time in seconds */
  return (double) _rtc() * cpcycle * 1.0e-12;
}
This routine can be called either in C/
... computation ...
after = cpused();
utime = after - before;
printf("CPU time in user space = %ld clock ticks\n",utime)
after = cpused();
utime = after - before;
printf("ct[10][10] is %.6f\n", ct[10][10]);
printf("CPU time
This equation can be normalized by setting W_1 + W_p = 1. Here W_1 = α (the sequential portion) and W_p = 1 − α (the parallel portion)
You can derive the following connection between the parameters α and α′ in Amdahl’s and Gustafson’s laws:
α = α′ / (p − α′(
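The connection between the two parameters can be checked with a short derivation. The sketch below assumes the standard textbook forms of the two laws, with $\alpha$ the sequential fraction in Amdahl's law, $\alpha'$ the sequential fraction of the scaled run in Gustafson's law, and $p$ the number of processors. The notation in the elided part of the text may differ.

```latex
\begin{align*}
S_{\mathrm{Amdahl}}(p) &= \frac{1}{\alpha + (1-\alpha)/p}, &
S_{\mathrm{Gustafson}}(p) &= \alpha' + (1-\alpha')\,p = p - \alpha'(p-1). \\
\intertext{Requiring both laws to predict the same speedup gives}
\alpha + \frac{1-\alpha}{p} &= \frac{1}{p - \alpha'(p-1)}
\quad\Longrightarrow\quad
\alpha = \frac{\alpha'}{p - \alpha'(p-1)} .
\end{align*}
```

For example, with p = 64 and α′ = 0.5 this gives α = 0.5/32.5 ≈ 0.015, and both laws then predict a speedup of 32.5.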
Here the prompt and response of the machine have been typeset with the teletype font, and the user commands are shown in bold
10.3 Scalability criteria at CSC
CSC imposes the following scalability criteria for Cray T3E applications:
The speed of the app
Appendix A
About CSC
Center for Scientific Computing, or simply CSC, is a national service center that specializes in scientific
hours you can leave a message. The Help Desk registers the call, writes down the problem and tries to solve the problem immedi
Appendix B
Glossary
ANSI American National Standards Institute, organization deciding on the U.S. computer science standards.
Band
HPF High Performance Fortran, a data-parallel language extension to Fortran 90.
HTML Hypertext Markup Language, a language for
Non-malleable Non-malleable programs are fixed at compile time to run on a specific number of processors.
NQE Network Queueing Env
Appendix C
Metacomputer Environment
• help topic (CSC help system)
• ls (list directory)
• less (print a file to the screen)
• cp
• pine (start the e-mail program)
• Reading: choose a message with arrow keys and press return
• i (index of
Bibliography
[Craa] Cray Research, Inc. CF90 Commands and Directives Reference Manual. SR-3901. 2.6, 5.10
[Crab] Cray Research
[KR97] Tiina Kupila-Rantala, editor. CSC User’s Guide. CSC – Tieteellinen laskenta Oy, 1997. URL http://www.csc.fi/oppaat/cscuser/. 1.7
[
The basics of parallel programming are discussed in the textbook Designing and Building Parallel Programs [Fos95]. Another go
Chapter 2
Using the Cray T3E at CSC
This chapter helps you to start using the Cray T3E at CSC: how to log in, wher
because it uses a secure way to authenticate oneself to the host machine.
If you are using an X terminal or an equivalent (a wo
all files before running a job from the home directory tree to the local T3E disk described below. The home direc
2.3 Editing files
You can use the Emacs or vi editors on the T3E. To start Emacs, give the command
emacs [options][filename]...
Here
Interactive jobs can use at most 16 processors and 30 minutes of parallel CPU time.
2.5 Executing in batch mode
The ba
environment. After this, the commands in the script are executed using /bin/sh. This can be overridden using the option -s shel
for some information in English.
Cray has published several manuals, which help in using the T3E. On-line version
All rights reserved. The PDF version of this book or parts of it can be used in Finnish universities as course material, provided that this copyright no
Chapter 3
The Cray T3E system
This chapter reviews the Cray T3E hardware and operating system.
3.1 Hardware overview
The Cray T3E
This configuration may change in the future. Use the command grmview to find out the current situation.
3.2 Distributed m
Figure 3.1: The components of Cray T3E node. (Diagram labels: microprocessor, local memory, support circuitry, network router with links in the ±X, ±Y and ±Z directions.)
This is a pro
Attribute              Value
Processor type         DEC Alpha 21164
Physical address base  40 bits
Virtual address base   43 bits
Clock rate on t
3.5 Local memory hierarchy
The local four-level memory hierarchy of the processing elements is shown in Figure 3.3. Nearest to t
two words in each cp. An SCACHE line is 64 bytes. Therefore, data is moved in consecutive blocks of 64 bytes from the
Figure 3.4: A routing example through the 3D torus network of the T3E. (Diagram labels: source node, destination node, steps 1, 2 and 3, axes ±X, ±Y and ±Z.)
Addressing of
3.7 External I/O
The T3E system has four processing elements per one I/O controller, while one out of every two I/O con
3.8 The UNICOS/mk operating system
The Cray T3E has a distributed microkernel-based operating system. This provides a single sys
The T3E file systems at CSC are located on striped FiberChannel disks residing in one GigaRing, which is attached to
Preface
This is the second edition of a user’s guide to the Cray T3E massively parallel supercomputer installed at the Center for
+ APP 0xdd 2 192 0 1 2 0 0 7 375 118 118
+ APP 0xde 2 192 0 1 2 1 0 7 375 118 118
+ APP 0xdf 2 192 0 1 2 0 1 7 375 118 118
+ OS 0
You can also use the command
ps -PeMf
to see what parallel processes are running. Here is an extract from the output:
F S
Chapter 4
Program development
This chapter shows how to compile and run your programs on the Cray T3E at CSC. Fortran programming
The option -Xn or -X n is used to indicate how many processors you want for your application. If you do not provide th
PBLAS, ScaLAPACK, BLACS, and FFT. The most straightforward way to obtain more information on these libraries is through the man
4.3.3 BLACS
Both BLAS and LAPACK are developed for single processor computations. In order to solve problems of linea
Routines          Explanation
PSGETRF PCGETRF   LU factorization and solution of linear general
PSGETRS PCGETRS   distributed systems of line
4.4 The NAG subroutine library
The NAG library is a comprehensive mathematical subroutine library that has become a de
You can also use the NAG on-line documentation on Cypress by the command
naghelp
4.5 The IMSL subroutine library
The IMSL library
The manual Introducing CrayLibs [Crad] contains a summary of Cray scientific library routines.
You can use help to get s
Contents
Preface
1 Introduction
1.1 How to use this guide
1.2 Usage policy
Chapter 5
Fortran programming
The Cray T3E offers a Fortran 90 compiler which can be used to compile standard-conforming FORTRAN
5.2 Basic usage
The CF90 compiler is invoked using the command f90 followed by optional compiler options and the filen
File extension  Type                            Notes
.f              Fixed source form (72 columns)  No preprocessing
.f90            Free source form (132 columns)  No preprocessin
Option  Explanation
-c      Compile only, do not attempt to link
-r2     Request for standard listing file (.lst)
-r6     Request for f
Option    Explanation
-dn, -en  Report nonstandard code
-dp, -ep  Use double precision
-er, -dr  Round multiplication results
-du, -eu  R
You can improve the performance by padding the arrays so that the corresponding elements do not map to the same cache
!dir$ split
DO i = 1, 1000
  a(i) = b(i) * c(i)
  t = d(i) + a(i)
  e(i) = f(i) + t * g(i)
  h(i) = h(i) + e(i)
END DO
The directive is marked with the
    a(j,i) = b(j,i) + 1
    a(j,i+1) = b(j,i+1) + 1
  END DO
END DO
Here we used the inverse operation of loop splitting to decreas
The symmetric directive is useful when using the SHMEM communications library.
!dir$ symmetric [var [, var]...]
This directive
F90= f90
iterate: $(OBJS)
	$(F90) -o $@ $(OBJS)
iterate.o: myprec.o cg.o matrix.o
cg.o: myprec.o matrix.o
matrix.o: myprec.
5 Fortran programming
5.1 The Fortran 90 compiler
5.2 Basic usage
5.3 Fixed and free f
As an example, consider a simple code that computes and prints a root of a polynomial using IMSL routines ZREAL and WRRRN. The
and four iterations were performed.
5.10 More information
CSC has published a textbook on Fortran 90 [HRR96]. A general
Chapter 6
C and C++ programming
This chapter discusses C and C++ programming on the Cray T3E. Parallel programming is described
The compilation process, if successful, creates an absolute object file, named a.out by default. This binary file, a.o
6.3 Calling Fortran from C
Sometimes you need to call Fortran routines from C programs. In the following, we calculate a matrix
The fact that Fortran stores arrays in column-major order, the reverse of C’s row-major order, needs to be taken into account. Therefore, the ar
6.5 C compiler directives (#pragma)
The #pragma directives are used within the source program to request certain kinds of specia
In the previous format, var_list represents a list of variable names separated by commas. In C, the cache_align di
The split directive merely asserts that the loop can profit by splitting. It will not cause incorrect code.
The compiler splits t
symmetric
The symmetric directive declares that an auto or register variable has the same local address on all proces
B Glossary
C Metacomputer Environment
Bibliography
Index
The compiler can be directed to attempt to unroll all loops generated for the program with the command-line option -hunroll.
The
6.6 The C++ compiler
The Cray C++ compiler conforms to the ISO/ANSI Draft Proposed International Standard. A revi
Chapter 7
Interprocess communication
This chapter describes how to use the MPI or PVM message-passing libraries on the Cray T3E
and MPI. Latency and bandwidth are not equally transparent to the HPF user, but in general HPF programs are slo
Correspondingly, for C/C++ programs the format is:
#include <mpi.h>
void sub(...)
{
  int return_code;
  ...
  return_code = MPI_Rou
Fortran syntax  Meaning
MPI_INIT(rc)    Initialize the MPI session. This should be the very first call.
MPI_FINALIZE(r
CALL MPI_COMM_RANK(MPI_COMM_WORLD, id, rc)
data = id
CALL MPI_REDUCE(data, s, 1, MPI_INTEGER, &
     MPI_SUM, 0, MPI_COMM_WORLD, r
t3e% cc -o collect.x collect.c
The program may be executed as in the Fortran 90 case above.
7.2.4 Reducing commu
Some examples of MPI programs are available in the WWW system, see the address
http://www.csc.fi/programming/examples/mpi/
7.3 Pa
    CALL PVMFpack(INTEGER8, j, msglen, stride, rc)
    to = j
    CALL PVMFsend(to, tag, rc)
  END DO
END IF
from = 0
CALL PVMFrecv(
Chapter 1
Introduction
This chapter gives a short introduction to the Cray T3E system. We also describe the policies imposed on
PE#2: tid=393218 nproc=3
PE#0: tid=393216 nproc=3
PE#1: tid=393217 nproc=3
PE#0: message=0
PE#2: message=2
PE#1: message=1
7.3.3 Fur
Routine    Description
num_pes    Returns the total number of PEs.
shmem_add  Performs an atomic add operation on a remo
7.4.1 Using the SHMEM routines
SHMEM routines can be divided into a few basic categories according to their respective tasks. Th
7.4.3 Point-to-point communication
Point-to-point communication is the most widely occurring form of data transfe
Other point-to-point routines
The routines shmem_put and shmem_get operate correctly when the data being transferred consists of
responding call would be
CALL SHMEM_type_op_TO_ALL(target, source, nreduce, &
     pe_start, logpe_stride, pe_siz
7.4.5 Other important routines
There are two very important routines which a parallel program on T3E will almost certainly call,
INCLUDE 'mpp/shmem.fh'
INTEGER, PARAMETER :: n = 4
INTEGER, DIMENSION(n), SAVE :: &
     source_pe = (/1,2,3,4/), &
0:c= 10121416
1:c= 2468
The output of the C program is similar.
7.5 High Performance Fortran (HPF)
The Cray T3E system at CSC has
• The directive SHARED has been changed to DISTRIBUTE.
• The distribution specification : for a degenerate distr
1.2 Usage policy
As the Cray T3E is a high-performance computational resource, CSC enforces a usage policy in order to guarantee
t3e% module load pghpf
The compiler is invoked with the command pghpf. It accepts files ending with .hpf, .f, .F, .for or .f90. Fi
Chapter 8
Batch queuing system
The batch queuing system ensures an optimum load on the computer and a fair distribution
# QSUB -l p_mpp_t=7000
# QSUB
cd $HOME/sn6309
mpprun -n $NPES ./mmloop 6000 6000
First, the given command shell (here /bin/ksh) is
The batch job is submitted with the command
qsub [options] jobfile
The output from the command looks like this:
nqs-181
Figure 8.1: An example of a cqstat session.
---------------------------------
NQS 3.3.0.4 BATCH REQUEST SUMMARY
-----------------
qstat option  Meaning
-a            Display summary information for all jobs.
-b            Display summary information for batch jobs.
-r            Dis
8.4 Deleting an NQE batch job
Sometimes it is necessary to delete a job before it is finished. For example, the input may be er
sn6309 5 0 0 0 0 0 0
----------------------- --- --- --- --- --- --- --- --- --- ------------
8.6 More information
More
Chapter 9
Programming tools
9.1 The make system
The make utility executes commands in a makefile to update one or more targets, wh
make
compiles the source codes func1.c and func2.c, and links them with the NAG library, producing an executable file mypr
1.3 Overview of the system
The Cray T3E system at CSC currently has 224 RISC processors for parallel applications. In additio
changed. The browser may act upon a routine, a file, or an entire program, which is composed of one or more distinct files, but
• The source code pane, located in the middle of the Xbrowse window, is the largest area of the window. This pane disp
level, position the cursor on the box and press the left mouse button. To close the tree one level, position the cursor on the no
The user can move to the source of a subroutine by clicking on the name of the subroutine with the right mouse button.
To
Figure 9.2: An example of a Cray TotalView session.
error, and it can be used to determine the cause of the problem.
The TotalView debugger can be used to examine core files
when the program is executed. The files are passed to the Apprentice tool for graphical examination.
MPP Apprentice is used by th
Figure 9.3: An example of an MPP Apprentice session. The upper pane shows timing statistics and the lower pa
9.4.2 The appview command
In addition to MPP Apprentice, the appview command gives a quick summary of the profiling data. Its ou
performance information and instruction counts.
PAT is able to analyze programs written in Fortran 90, C, C++ and HPF.
Th