7. Optimization
7.1 General overview:
Performance:
High expectations based on quoted claims and promises of high performance
"540 megaflops for POWER2 superchip"
"Memory capacity of 57 GB for CTC SP"
"High bandwidth/low latency switch"
Different measures of "desired results"
Highest megaflop rate
Best possible turnaround for next run
Shortest time to publication for the project
Most effective use of resources (computer and human) for the lifetime of the application
Performance ultimately depends on the characteristics of the application:
Language
Datasets
Libraries
Algorithms
Code Implementation
7.2 Optimization Procedures:
A. Languages
Fortran 77
Pros
Very efficient compilers
Lots of library support
Cons
Language at dead end
Encourages some poor programming practices
Fortran 90
Pros
Significant language improvements over Fortran 77
Better code re-use
Cons
Optimizing compilers on the market, but effectiveness not yet fully understood
Less library support (no full MPI binding yet)
C
Pros
Better software engineering features than Fortran 77
Excellent library support
Cons
Injudicious use of some constructs (pointers, for example) will mask potential optimizations
C++, a superset of C, has better code re-use features
C++ (superset of C)
Pros
Excellent code re-use and extendibility
Class libraries for parallelizing are available
Cons
Compilers not yet fully optimized
Language still evolving
Will it be superseded by a different OOL (Java, Smalltalk)?
High Performance Fortran (HPF)
Pros
Broadly-based, vendor-supported standard
High-level directives simplify parallelization
Same code can run across many different parallel architectures
Cons
Compilers still fairly new
May not be suited for irregular data communication patterns
B. Libraries
Math/Stat Subroutine Libraries -- ESSL, LAPACK, IMSL, NAG
Excellent CPU performance (roughly in order listed)
On-going efforts to expand and improve
SCALAPACK and PESSL are parallel versions of LAPACK and ESSL
Message-passing Libraries
Message-passing Interface (MPI)
The standard, driven by broadly-based committee
IBM has high-performance implementation
Standard continues to develop
Parallel Virtual Machine (PVM)
Of receding importance
High-performance version (SP2MPI architecture) uses MPI
Continued development, but no widespread standards committee
Parallel I/O File System (IBM's PIOFS)
Library of routines for parallel I/O and file management
Available by special request at CTC
Very high performance (rates over 50 MB/sec are possible)
C. Algorithms
An algorithm is a precise method of solving a problem. A good algorithm:
Gives good performance
Can often be far more effective than a highly optimized poor algorithm.
Example: solving a convection-diffusion problem with a local Jacobi iterative method versus a multigrid method.
The Jacobi method's megaflop rate is 5.6 times greater than the multigrid method's.
However, the Jacobi method needs so many iterations to converge that it takes 317 times longer to solve the problem! (In other words, it performs roughly 5.6 x 317, or about 1800, times as many floating-point operations.)
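The convection-diffusion code behind these numbers is not shown here; the sketch below is only an illustration (a single Jacobi sweep for a 2D Poisson-type model problem in C, with an arbitrary grid size N) of why the method can post a high megaflop rate per sweep: the update is a simple, regular stencil that compilers optimize well, even though a very large number of sweeps may be needed to converge.

/* Minimal sketch (not the code measured above): one Jacobi sweep for a
   2D Poisson-type model problem on an N x N interior grid.  The update
   is a simple, regular stencil that compilers optimize well (hence the
   high megaflop rate), but many such sweeps may be needed to converge. */
#include <math.h>

#define N 256   /* interior grid points per dimension (arbitrary) */

/* One sweep: unew[i][j] is the average of u's four neighbors plus a
   source-term contribution.  Returns the largest pointwise change,
   which a driver would test against a tolerance to decide when to stop. */
double jacobi_sweep(double u[N+2][N+2], double unew[N+2][N+2],
                    double f[N+2][N+2], double h)
{
    double maxdiff = 0.0;
    for (int i = 1; i <= N; i++) {
        for (int j = 1; j <= N; j++) {
            unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                 u[i][j-1] + u[i][j+1] + h * h * f[i][j]);
            double d = fabs(unew[i][j] - u[i][j]);
            if (d > maxdiff)
                maxdiff = d;
        }
    }
    return maxdiff;
}

A driver would call jacobi_sweep repeatedly, swapping u and unew between calls, until the returned maximum change falls below a tolerance; the point of the comparison above is that the number of such sweeps, not the speed of each one, dominates the total solution time.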
D. Code Implementation
Code implementation refers to those practices and strategies used to translate algorithms and code designs into functioning software. Some important considerations:
Arrange data arrays so that they are accessed contiguously in memory (by columns in Fortran, by rows in C); see the sketch after this list
Clarity and effective annotation make the code usable and intelligible for other users (and for yourself months later)
Make use of version control utilities (RCS, SCCS, etc.) to control the software development process
To as great a degree as possible, write machine-independent code
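As an illustration of the first point in the list above, here is a minimal C sketch (the array name and size are arbitrary) contrasting the two loop orders. In C, the elements of a row are adjacent in memory, so the inner loop should run over the rightmost index; in Fortran the storage order is reversed, so the leftmost index should vary fastest.

/* Minimal sketch: the same summation written with two loop orders.
   In C, a[i][j] and a[i][j+1] are adjacent in memory (row-major), so the
   first version walks memory contiguously and uses the cache far better. */
#define N 1000

double sum_rowwise(double a[N][N])      /* good in C: inner loop over j  */
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];               /* stride-1 (contiguous) access  */
    return s;
}

double sum_columnwise(double a[N][N])   /* poor in C: inner loop over i  */
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];               /* stride-N access, cache-hostile */
    return s;
}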
E. Datasets
Size of dataset
Nature of dataset
Datasets used by parallel applications
F. System Factors
The key factors of the computer system that affect performance are:
Chip architecture
Memory Hierarchy
I/O Configuration
Node interconnect (switch)
Compiler
Profiling
Profiling is important in parallel code optimization because the performance of a message-passing code is closely related to its granularity, defined here as the ratio of the time between communication events to the duration of an event.
To minimize time spent communicating, you should maximize your code's granularity by parallelizing at the highest feasible level. A profiling run can save you the trouble of implementing a parallelization strategy that would result in too fine a granularity: if it shows that the time T between message-passing events is quite short, the strategy is not worth the effort. A good rule of thumb is that T should be greater than ten times the latency for sending a message.
For HPF codes, profilers should also be used in conjunction with compiler parallelization reports to determine the effect of adding HPF directives to your code. You need to determine whether you've coded correctly, and whether the compiler is handling the directives as you expected.
Example C code
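The example code referred to here is not included in this section; the sketch below is an assumption, not the original example. It shows one simple way to hand-instrument a message-passing C code with MPI_Wtime so that a run reports the time spent computing between communication events and the time spent in the communication itself.

/* Minimal sketch (not the original example): hand-instrumented timing of
   the compute and communication phases of a simple ring exchange, using
   MPI_Wtime.  The ratio of compute time to communication time per
   iteration is a rough measure of the code's granularity. */
#include <stdio.h>
#include <mpi.h>

#define NITER 100

int main(int argc, char *argv[])
{
    int rank, nprocs, tag = 0, iter, i;
    double work = 0.0, incoming = 0.0;
    double t0, t_comp = 0.0, t_comm = 0.0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (iter = 0; iter < NITER; iter++) {
        /* Computation phase (placeholder work stands in for real code). */
        t0 = MPI_Wtime();
        for (i = 0; i < 100000; i++)
            work += 1.0e-6 * (double) i;
        t_comp += MPI_Wtime() - t0;

        /* Communication phase: exchange a value around a ring of tasks. */
        t0 = MPI_Wtime();
        MPI_Sendrecv(&work, 1, MPI_DOUBLE, (rank + 1) % nprocs, tag,
                     &incoming, 1, MPI_DOUBLE, (rank + nprocs - 1) % nprocs,
                     tag, MPI_COMM_WORLD, &status);
        t_comm += MPI_Wtime() - t0;
    }

    if (rank == 0)
        printf("per iteration: compute %g s, communicate %g s, ratio %g\n",
               t_comp / NITER, t_comm / NITER, t_comp / t_comm);

    MPI_Finalize();
    return 0;
}

If the compute time per iteration is not comfortably larger than the message latency (the rule of thumb above suggests at least a factor of ten), the decomposition is too fine-grained and a higher-level parallelization should be considered.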
To debug a message-passing code, you can use IBM's parallel debugger or a serial debugger (if you are not running very many processes). All of these provide the standard serial debugging actions, and all require the -g flag at compile time so that the compiled code can be associated with the source code you wrote.
You cannot use these debuggers directly on HPF code since, at some stage, the compiler translates the code to message-passing. If your compiler allows the intermediate message-passing code to be saved, and you are desperate, it is possible to use a debugger on this code. The generated code is usually not very readable, variables are often renamed, and loop indices re-formulated.
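The mismatch.c file compiled in the example below is not reproduced in this section, so the sketch that follows is only a guess at its content: a small MPI C program with a deliberate send/receive mismatch (here, a tag mismatch) that leaves one task blocked in MPI_Recv, which is the kind of hang you would then examine with a debugger after compiling with -g.

/* Hypothetical sketch of a "mismatch"-style example (the original
   mismatch.c is not reproduced here).  Run with two tasks: task 0 sends
   with tag 1 while task 1 waits for tag 2, so its MPI_Recv blocks
   forever.  Compiling with -g (as shown below) lets a debugger show
   exactly where each task is stuck. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, value = 42;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&value, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);   /* tag 1 */
    } else if (rank == 1) {
        /* Bug: expects tag 2, which never arrives -- this task hangs here. */
        MPI_Recv(&value, 1, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
        printf("task 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}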
Example: mpcc -g mismatch.c -o mismatch
Tracing
It is from a trace that you usually get the most information about how well your parallel job is doing. It is here that you can find out exactly when tasks stall because they are waiting for messages.
Tracing involves (at least) two steps: trace data must be generated while the job runs, and the resulting trace must then be examined. A separate piece of software is needed to view and analyze the trace; a number of such tools exist, and NTV is one of them.