Part II. USING MPI
 
 

7. Optimization

 

 

7.1 General overview:

Performance:

High expectations based on quoted claims and promises of high performance

    "540 megaflops for POWER2 superchip"
    "Memory capacity of 57 GB for CTC SP"
    "High bandwidth/low latency switch"

Different measures of "desired results"

    Highest megaflop rate
    Best possible turnaround for next run
    Shortest time to publication for the project
    Most effective use of resources (computer and human) for the lifetime of the application

Performance ultimately depends on the characteristics of two key components, the application and the system.


Application:

               Language
               Datasets
               Libraries
               Algorithms
               Code Implementation


System:

               Chip Architecture
               Memory Hierarchy
               I/O Configuration
               Node Interconnect (switch)
               Compiler
               Operating system

 
 
A. Languages
 
Fortran 77

    Pros
        Very efficient compilers
        Lots of library support
    Cons
        Language at dead end
        Encourages some poor programming practices

Fortran 90

    Pros
        Significant language improvements over Fortran 77
        Better code re-use
    Cons
        Optimizing compilers on the market, but effectiveness not yet fully understood
        Less library support (no full MPI binding yet)

C

    Pros
        Better software engineering features than Fortran 77
        Excellent library support
    Cons
        Injudicious use of some constructs (pointers, for example) will mask potential optimizations
        C++, a superset of C, has better code re-use features

C++ (superset of C)

    Pros
        Excellent code re-use and extensibility
        Class libraries for parallelizing are available
    Cons
        Optimizing compilers not yet mature
        Language still evolving
        Will it be superseded by a different OOL (Java, Smalltalk)?

High Performance Fortran (HPF)

    Pros
        Broadly-based, vendor-supported standard
        High-level directives simplify parallelization
        Same code can run across many different parallel architectures
    Cons
        Compilers still fairly new
        May not be suited for irregular data communication patterns
 

 

B. Libraries
 

Math/Stat Subroutine Libraries -- ESSL, LAPACK, IMSL, NAG

    Excellent CPU performance (roughly in order listed)
    On-going efforts to expand and improve
    ScaLAPACK and Parallel ESSL (PESSL) are parallel versions of LAPACK and ESSL, respectively

Message-passing Libraries

    Message-passing Interface (MPI)
        The standard, driven by broadly-based committee
        IBM has high-performance implementation
        Standard continues to develop

    Parallel Virtual Machine (PVM)
        Declining in importance
        High-performance version (SP2MPI architecture) uses MPI
        Development continues, but without a broadly-based standards committee

Parallel I/O File System (IBM's PIOFS)

    Library of routines for parallel I/O and file management
    Available by special request at CTC
    Very high performance (over 50 MB/sec is possible)

 
 

C. Algorithms
 

An algorithm is a precise method of solving a problem. A good algorithm:

    Gives good performance

    Can often be far more effective than a highly optimized poor algorithm.

Example: Solution of a convection-diffusion problem via a local Jacobi iterative method or a multigrid method.

    The Jacobi method's megaflop rate is 5.6 times greater than the multigrid method's

    However, the Jacobi method needs so many iterations to converge that it takes 317 times longer to solve the problem! (Equivalently, it performs roughly 5.6 × 317 ≈ 1800 times as many floating-point operations.)

 
 

D. Code Implementation
 

Code implementation refers to those practices and strategies used to translate algorithms and code designs into functioning software. Some important considerations:

    Arrange data arrays so that they are accessed contiguously in memory (by columns in Fortran, by rows in C); a minimal sketch appears after this list

    Clarity and effective annotation make the code usable and intelligible to other users (and to yourself months later)

    Make use of version control utilities (RCS, SCCS, etc.) to control the software development process

    To as great a degree as possible, write machine-independent code
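
To make the first point concrete, here is a minimal sketch (written for these notes as an illustration, not code from the original course material) of good versus poor array access order in C, where arrays are stored by rows:

/* loop_order.c -- a sketch of contiguous array access in C.
 * C stores a[i][j] row by row, so the inner loop should vary j. */
#include <stdio.h>

#define N 1000

static double a[N][N];

int main(void)
{
    int i, j;
    double sum = 0.0;

    /* Good: the inner loop walks along a row, touching adjacent
     * memory locations, so the memory hierarchy is used effectively. */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];

    /* Poor: swapping the loops makes the inner loop stride down a
     * column, jumping N*sizeof(double) bytes between accesses:
     *
     *     for (j = 0; j < N; j++)
     *         for (i = 0; i < N; i++)
     *             sum += a[i][j];
     */

    printf("sum = %f\n", sum);
    return 0;
}

In Fortran the storage order is reversed, so the corresponding inner loop should run over the leftmost (row) index instead.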

 
 

E. Datasets

    Size of the dataset
    Nature of the dataset
    How datasets are laid out and accessed by parallel applications

 
 

F. System Factors
 

The key factors of the computer system that affect performance are:

    Chip architecture

    Memory Hierarchy

    I/O Configuration

    Node interconnect (switch)

    Compiler
 
 


7.2 Optimization Procedures:
 

Profiling

Profiling is important in parallel code optimization because the performance of a message-passing code is closely related to its granularity, defined here as the ratio of the time between communication events to the duration of a communication event.

To minimize time spent communicating, you should maximize your code's granularity by parallelizing at the highest feasible level. A profiling run that shows the time T between message-passing events to be quite short can save you the trouble of implementing a parallelization strategy that would result in too fine a granularity. A good rule of thumb is that T should be greater than ten times the latency for sending a message.

For HPF codes, profilers should also be used in conjunction with compiler parallelization reports to
determine the effect of adding HPF directives to your code. You need to determine whether you've coded correctly, and whether the compiler is handling the directives as you expected.
 

Example C code
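
The course's own example code is not reproduced here. The following is a minimal sketch, written for illustration, of how you might use MPI_Wtime to measure the time T between two communication events and check it against the rule of thumb above; the work loop and message size are placeholders:

/* granularity.c -- a hypothetical sketch, not the original example:
 * time the computation between two message-passing events so that T
 * can be compared with roughly ten times the message latency. */
#include <stdio.h>
#include <mpi.h>

#define WORK 1000000

int main(int argc, char *argv[])
{
    int rank, size, i;
    double buf = 0.0, t0, t_work;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2 && rank < 2) {
        /* Stand-in for the application's real computation. */
        t0 = MPI_Wtime();
        for (i = 0; i < WORK; i++)
            buf += (double)i * 0.5;
        t_work = MPI_Wtime() - t0;

        /* A single communication event between tasks 0 and 1. */
        if (rank == 0)
            MPI_Send(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else
            MPI_Recv(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);

        printf("task %d: T between communication events = %g seconds\n",
               rank, t_work);
    }

    MPI_Finalize();
    return 0;
}

It can be compiled in the same way as the other examples, for instance with mpcc granularity.c -o granularity.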
 
Debugging

To debug a message-passing code, you can use IBM's parallel debugger or a serial debugger (if you are not running very many processes). Both provide the standard serial debugging actions, and both require the -g flag at compile time so that the debugger can associate the source code you wrote with the generated machine code.

You cannot use these debuggers directly on HPF code since, at some stage, the compiler translates the code into message-passing code. If your compiler allows this intermediate message-passing code to be saved, and you are desperate, it is possible to use a debugger on it. The generated code is usually not very readable: variables are often renamed and loop indices reformulated.
 

Example:

mpcc -g mismatch.c -o mismatch
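
The contents of mismatch.c are not reproduced in these notes. Purely as an illustration of the kind of bug such a session might target, here is a hypothetical program with a send/receive datatype mismatch between two tasks:

/* A hypothetical mismatch example (not the course's mismatch.c):
 * task 0 sends an int, but task 1 receives it as a double. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, ival = 42;
    double dval = 0.0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&ival, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* BUG: the type (and size) do not match the sender's MPI_INT,
         * so dval ends up meaningless and the program is erroneous
         * under the MPI standard. */
        MPI_Recv(&dval, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        printf("task 1 received %f\n", dval);
    }

    MPI_Finalize();
    return 0;
}

Stepping through the receive under a debugger and inspecting dval and the MPI_Status fields makes this kind of mismatch much easier to spot than staring at the wrong output alone.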
 
 

Tracing

It is from a trace that you usually get the most information about how well your parallel job is doing. Here you can find out exactly when tasks stall because they are waiting for messages.

Tracing involves (at least) two steps:

    At runtime, trace records are written when message-passing library routines are called or at set time intervals. The trace records contain the type of event and a timestamp.

    After the run completes, you can use a tool to graphically display and summarize this data.

Separate software is needed to do tracing. A number of such packages exist; NTV is one of them.
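
For illustration only, here is a minimal hand-rolled sketch of the trace-record idea described above (an event name plus an MPI_Wtime timestamp, written to one trace file per task). This is a toy written for these notes; it is not how NTV or IBM's tracing tools work internally, since those instrument the message-passing library for you:

/* trace_sketch.c -- toy trace records: event type plus timestamp. */
#include <stdio.h>
#include <mpi.h>

static FILE *tracefile;                  /* one trace file per task */

static void trace_event(const char *event)
{
    fprintf(tracefile, "%f %s\n", MPI_Wtime(), event);
}

int main(int argc, char *argv[])
{
    int rank, size;
    double buf = 1.0;
    char fname[64];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sprintf(fname, "trace.%d", rank);
    tracefile = fopen(fname, "w");
    if (tracefile == NULL)
        MPI_Abort(MPI_COMM_WORLD, 1);

    if (size >= 2) {
        if (rank == 0) {
            trace_event("send_begin");
            MPI_Send(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            trace_event("send_end");
        } else if (rank == 1) {
            trace_event("recv_begin");
            MPI_Recv(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
            trace_event("recv_end");
        }
    }

    fclose(tracefile);
    MPI_Finalize();
    return 0;
}

A long gap between recv_begin and recv_end in such a trace is exactly the kind of stall, waiting for a message, that the graphical display tools let you see at a glance.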