7. Optimization
7.1 General overview:
Performance:
High expectations based on quoted claims and promises of high performance
"540 megaflops for POWER2 superchip"
"Memory capacity of 57 GB for CTC SP"
"High bandwidth/low latency switch"
Different measures of "desired results"
Highest megaflop rate
Best possible turnaround for next run
Shortest time to publication for the project
Most effective use of resources (computer and human) for the lifetime of the application
Performance ultimately depends on the characteristics of the application:
Language
Datasets
Libraries
Algorithms
Code Implementation
7.2 Optimization Procedures:
A. Languages
Fortran 77
Pros
Very efficient compilers
Lots of library support
Cons
Language at dead end
Encourages some poor programming practices
Fortran 90
Pros
Significant language improvements over Fortran 77
Better code re-use
Cons
Optimizing compilers on the market, but effectiveness not yet fully understood
Less library support (no full MPI binding yet)
C
Pros
Better software engineering features than Fortran 77
Excellent library support
Cons
Injudicious use of some constructs (pointers, for example) will mask potential optimizations
C++, a superset of C, has better code re-use features
C++ (superset of C)
Pros
Excellent code re-use and extendibility
Class libraries for parallelizing are available
Cons
Compilers not yet fully optimized
Language still evolving
Will it be superseded by a different OOL (Java, Smalltalk)?
High Performance Fortran (HPF)
Pros
Broadly-based, vendor-supported standard
High-level directives simplify parallelization
Same code can run across many different parallel architectures
Cons
Compilers still fairly new
May not be suited for irregular data communication patterns
B. Libraries
Math/Stat Subroutine Libraries -- ESSL, LAPACK, IMSL, NAG
Excellent CPU performance (roughly in order listed)
On-going efforts to expand and improve
SCALAPACK and PESSL are parallel versions of LAPACK and ESSL
Message-passing Libraries
Message-passing Interface (MPI)
The standard, driven by broadly-based committee
IBM has high-performance implementation
Standard continues to develop
Parallel Virtual Machine (PVM)
Of receding importance
High-performance version (SP2MPI architecture) uses MPI
Continued development, but no widespread standards committee
Parallel I/O File System (IBM's PIOFS)
Library of routines for parallel I/O and file management
Available by special request at CTC
Very high performance (rates over 50 MB/sec are possible)
C. Algorithms
An algorithm is a precise method of solving a problem. A good algorithm:
Gives good performance
Can often be far more effective than a highly optimized poor algorithm.
Example: solving a convection-diffusion problem with a local Jacobi iterative method versus a multigrid method.
The Jacobi method's megaflop rate is 5.6 times greater than the multigrid method's.
However, the Jacobi method needs so many iterations to converge that it takes 317 times longer to solve the problem! (In other words, it performs roughly 5.6 x 317, or about 1800, times as many floating-point operations.)
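The convection-diffusion code behind these numbers is not shown here; the sketch below is only an illustration (a single Jacobi sweep for a 2D Poisson-type model problem in C, with an arbitrary grid size N) of why the method can post a high megaflop rate per sweep: the update is a simple, regular stencil that compilers optimize well, even though a very large number of sweeps may be needed to converge.

/* Minimal sketch (not the code measured above): one Jacobi sweep for a
   2D Poisson-type model problem on an N x N interior grid.  The update
   is a simple, regular stencil that compilers optimize well (hence the
   high megaflop rate), but many such sweeps may be needed to converge. */
#include <math.h>

#define N 256   /* interior grid points per dimension (arbitrary) */

/* One sweep: unew[i][j] is the average of u's four neighbors plus a
   source-term contribution.  Returns the largest pointwise change,
   which a driver would test against a tolerance to decide when to stop. */
double jacobi_sweep(double u[N+2][N+2], double unew[N+2][N+2],
                    double f[N+2][N+2], double h)
{
    double maxdiff = 0.0;
    for (int i = 1; i <= N; i++) {
        for (int j = 1; j <= N; j++) {
            unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                 u[i][j-1] + u[i][j+1] + h * h * f[i][j]);
            double d = fabs(unew[i][j] - u[i][j]);
            if (d > maxdiff)
                maxdiff = d;
        }
    }
    return maxdiff;
}

A driver would call jacobi_sweep repeatedly, swapping u and unew between calls, until the returned maximum change falls below a tolerance; the point of the comparison above is that the number of such sweeps, not the speed of each one, dominates the total solution time.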
D. Code Implementation
Code implementation refers to those practices and strategies used to translate algorithms and code designs into functioning software. Some important considerations:
Arrange data arrays so that they are accessed contiguously in memory (by columns in Fortran, by rows in C); see the sketch after this list
Clarity and effective annotation make the code usable and intelligible for other users (and for yourself months later)
Make use of version control utilities (RCS, SCCS, etc.) to control the software development process
To as great a degree as possible, write machine-independent code
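As an illustration of the first point in the list above, here is a minimal C sketch (the array name and size are arbitrary) contrasting the two loop orders. In C, the elements of a row are adjacent in memory, so the inner loop should run over the rightmost index; in Fortran the storage order is reversed, so the leftmost index should vary fastest.

/* Minimal sketch: the same summation written with two loop orders.
   In C, a[i][j] and a[i][j+1] are adjacent in memory (row-major), so the
   first version walks memory contiguously and uses the cache far better. */
#define N 1000

double sum_rowwise(double a[N][N])      /* good in C: inner loop over j  */
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];               /* stride-1 (contiguous) access  */
    return s;
}

double sum_columnwise(double a[N][N])   /* poor in C: inner loop over i  */
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];               /* stride-N access, cache-hostile */
    return s;
}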
E. Datasets
Size of dataset
Nature of dataset
Datasets used by parallel applications
F. System Factors
The key factors of the computer system that affect performance are:
Chip architecture
Memory Hierarchy
I/O Configuration
Node interconnect (switch)
Compiler
Profiling
Profiling is important in parallel code optimization because the performance of a message-passing code is closely related to its granularity, defined here as the ratio of the time between communication events to the duration of an event.
To minimize time spent communicating, you should maximize your code's granularity by parallelizing at the highest feasible level. A profiling run can save you the trouble of implementing a parallelization strategy that would result in too fine a granularity: if it shows that the time T between message-passing events is quite short, the strategy is not worth the effort. A good rule of thumb is that T should be greater than ten times the latency for sending a message.
For HPF codes, profilers should also be used in conjunction with compiler parallelization reports to determine the effect of adding HPF directives to your code. You need to determine whether you've coded correctly, and whether the compiler is handling the directives as you expected.
Example C code
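The example code referred to here is not included in this section; the sketch below is an assumption, not the original example. It shows one simple way to hand-instrument a message-passing C code with MPI_Wtime so that a run reports the time spent computing between communication events and the time spent in the communication itself.

/* Minimal sketch (not the original example): hand-instrumented timing of
   the compute and communication phases of a simple ring exchange, using
   MPI_Wtime.  The ratio of compute time to communication time per
   iteration is a rough measure of the code's granularity. */
#include <stdio.h>
#include <mpi.h>

#define NITER 100

int main(int argc, char *argv[])
{
    int rank, nprocs, tag = 0, iter, i;
    double work = 0.0, incoming = 0.0;
    double t0, t_comp = 0.0, t_comm = 0.0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (iter = 0; iter < NITER; iter++) {
        /* Computation phase (placeholder work stands in for real code). */
        t0 = MPI_Wtime();
        for (i = 0; i < 100000; i++)
            work += 1.0e-6 * (double) i;
        t_comp += MPI_Wtime() - t0;

        /* Communication phase: exchange a value around a ring of tasks. */
        t0 = MPI_Wtime();
        MPI_Sendrecv(&work, 1, MPI_DOUBLE, (rank + 1) % nprocs, tag,
                     &incoming, 1, MPI_DOUBLE, (rank + nprocs - 1) % nprocs,
                     tag, MPI_COMM_WORLD, &status);
        t_comm += MPI_Wtime() - t0;
    }

    if (rank == 0)
        printf("per iteration: compute %g s, communicate %g s, ratio %g\n",
               t_comp / NITER, t_comm / NITER, t_comp / t_comm);

    MPI_Finalize();
    return 0;
}

If the compute time per iteration is not comfortably larger than the message latency (the rule of thumb above suggests at least a factor of ten), the decomposition is too fine-grained and a higher-level parallelization should be considered.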
To debug a message-passing code, you can use IBM's parallel debugger or a serial debugger (if you are not running very many processes). All of these provide the standard serial debugging actions, and all require the -g flag at compile time so that the compiled code can be associated with the source code you wrote.
You cannot use these debuggers directly on HPF code since, at some stage, the compiler translates the code to message-passing. If your compiler allows the intermediate message-passing code to be saved, and you are desperate, it is possible to use a debugger on this code. The generated code is usually not very readable, variables are often renamed, and loop indices re-formulated.
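The mismatch.c file compiled in the example below is not reproduced in this section, so the sketch that follows is only a guess at its content: a small MPI C program with a deliberate send/receive mismatch (here, a tag mismatch) that leaves one task blocked in MPI_Recv, which is the kind of hang you would then examine with a debugger after compiling with -g.

/* Hypothetical sketch of a "mismatch"-style example (the original
   mismatch.c is not reproduced here).  Run with two tasks: task 0 sends
   with tag 1 while task 1 waits for tag 2, so its MPI_Recv blocks
   forever.  Compiling with -g (as shown below) lets a debugger show
   exactly where each task is stuck. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, value = 42;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&value, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);   /* tag 1 */
    } else if (rank == 1) {
        /* Bug: expects tag 2, which never arrives -- this task hangs here. */
        MPI_Recv(&value, 1, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
        printf("task 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}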
Example: mpcc -g mismatch.c -o mismatch
Tracing
It is from a trace that you usually get the most information about how well your parallel job is doing. It is here that you can find out exactly when tasks stall because they are waiting for messages.
Tracing involves (at least) two steps: trace data must be generated while the job runs, and the resulting trace must then be examined. A separate piece of software is needed to view and analyze the trace; a number of such tools exist, and NTV is one of them.