| CSCI Projects
About SCEC Projects
SCEC
Work Areas

|
|
Implementation and Evaluation of Parallel Finite-Difference Wave Propagation Code Using a Nonuniform Mesh
|
|
Project Overview:
We propose to convert an existing serial finite-difference wave propagation code to an MPI-based parallel implementation and to produce a version of the wave propagation code that can utilize a nonuniform mesh. We will evaluate the computational improvement, memory requirements, and the accuracy of the results of the converted code by comparing these attributes of the new code against the equivalent attributes of the original serial code.
Researchers:
- Karthikeyan Chockalingam
- Philip Maechling
- Bilal Shaw
Goals of Proposed Work:
We can summarize the goals of this project as follows:
- Create an MPI code that has better performance characteristics (uses less memory, shorter time to solution) than the existing MPI code.
- Demonstrate that the new code produces equivalent results to the existing code.
Details of Proposed Work: See Project Proposal section below.
|
|
Progress Reports:
|
|
Versions of Code:
Our overall goal is to create an MPI-based version of a code that can use a non-uniform mesh. We have two source code distributions to work with:
Initial Codes:
- uniform code - parallel V1.14 - emod3d-mpi v1.14
- non-uniform code - serial - V2.0 - emod3d v2.0
Target Code:
- non-uniform - parallel - V3.0 - emod3d-mpi v3.0
The two initial codes are variations of the same original code. We believe it may be possible to utilize the parallel implementation as a prototype on which we can base our parallel implementation of the non-uniform version.
Modification Plan :
We recognize that we have selected a large scale activity for our project and that the scope of this activity may be larger than the time we have to work on this project. However, we believe that project has several useful features that make it worthwhile even if all phases are not completed by May 9th, 2007. For example, in this project we will be working with a a parallel, high performance, scientific research code currently in use. This provides insights into the real complexity of research codes. As a second potential benefit, we understand that many scientific organizations need to convert serial codes to parallel. On this project, we are performing a task that is in great demand and this project provides us with an opportunity to better understand the practical difficulties in such work.
Our plan is to outline a reasonable approach this project, and to work through as many steps in the process as possible in the time that we have. During our group's internal discussions, and discussions with Professor Nakano and Dr. Robert Graves, we have identified the following approach to modifying the codes.
- Assemble and Build Codes:
- Collect Metrics on code:
- Assemble example input data sets
- Measure performance of MPI code on Reference Problem
- Measure performance of Serial code on Reference Problem
- Generate a reference output dataset for MPI code
- Generate a reference output dataset for serial code
- Convert Serial Non-uniform code to MPI
- Use new MPI code to generate an output dataset for reference problem and Verify Results
- Measure performance of new MPI code
- Compare performance measurements of new MPI code to Previous Codes
- Collect source code metrics for non-uni MPI code
|
|
1) Assemble and Build Codes
|
|
Building emod3dv1.14
We built the code on USC HPCC system using the default PGI compilers:
- /usr/usc/mpich/1.2.6..14b/gm-pgi/bin/mpicc
Building emod3d v2.0
We built the code on the USC HPCC system using the default gcc compilers:
- gcc version 3.4.6 20060404 (Red Hat 3.4.6-
|
|
2) Collect Metrics on Code
|
|
Source Code Size of Programs:
We began by obtaining source code distributions of both programs. We then performed a couple reviews and measurements in order to scope the difficulty of the conversion we have proposed:
We used the USC CodeCount tools to count SLOC. We downloaded this software from the USC web site that distributes it and we built the tool on USC HPCC system. Then, we created the configuration files for each of the two code distributions we are studying. We counted only the .c files and present the logical, not the physical number of lines of code for each distribution. A logical SLOC will count a statements or expression as a single line even if it is split over two or more physical lines.
Code Metrics
| Code Name |
Source Files In Executable (count *.c files) |
Logical SLOC |
Ratio of SLOC to Comments |
| emod3d-mpi V1.14 |
19 |
10,887 |
25.1% |
| emod3d V2.0 |
17 |
7217 |
29.0% |
|
|
3) Assemble Example Input Data Sets
|
|
In order to run the codes, we needed to assemble an initial set of input files for a reasonable sized problem. We were not able to develop an example input that worked for both codes, so our example input data sets were somewhat different.
emod3d-mpi v1.14
The inputs required for this code include the following files:
- e3d.par - parameter files that describes things like the number of timesteps, other input files, source parameters and other initial inputs
- emod3d_vp,emod3d_vs,emod3d_rho - three binary files representing the geology of the simulation region
- emod3d_Q_tbl.txt - text file containing specification of attenuation values to be used in the simulation
- graves_run.csh - shell called by the pbs script to run the executable with appropriate command line on each processor
- graves.pbs - pbs script that submit jobs requesting number of nodes, number of processors per node, wall clock time, etc
emod3d v2.0
in addition to the files used by emod3d-mpi, this code uses the following files:
- station coordinates - lists stations at which the seismograms should be produced
- gridfile - lists the resolution of the grid as it varies with depth. This helps to define the non-uniform grid spacing supported by the code.
Use of SCEC Earthworks Science Gateway to Generate emod3d Configuration Files:
We initially tried to create valid emod3d input configuration files by hand, based on discussions and our understanding about reasonable values for each parameter. However, all of the configuration files we created by hand produced errors when we used them with the emod3d-mpi code.
We turned to a software system developed at SCEC, called the SCEC Earthworks Science Gateway, in order to create valid inputs. This code accepts a somewhat simplified description of an earthquake simulation, and it generates valid emod3d-mpi input files. A paper describing the SCEC Earthworks Science Gateway (Maechling et al) was recently accepted by the TeraGrid 2007 Conference.
We used Earthworks to generate a single set of input files that were used for all the strong scaling tests. We used Earthworks to generate four separate sets of input files for the weak scaling tests.
The input files for the emod3d simulations include velocity mesh files. The velocity mesh files are large binary files used in our tests and these also generated by the Earthworks system.
|
|
4) Measure performance of MPI Code on Reference Problem
|
|
In order to establish that we have met our first goal (improve on the performance of the existing MPI code), we must measure performance charactersitics of the MPI code. We will measure the scaling and memory usage characteristics of the code by running one or more example problems and collecting performance measures from the code while it runs.
Instrument code:
We modified the emod3d-mpi code to collect information about run time and system resource usage.
Memory measurements were made by internal calculation of the velocity media stored on each node and is reported in MB.
The time information is collected after each "span" timesteps using the times() method in the gnu c library. This function will return multiple types of information on the system time used by the program.
We use sprintf() to report this information at the end of each "span" timestep for each processor.
We compared the information we got from the internal monitoring in the code to the results from the PBS epilogue report. The epilogoue report provides information such as CPUT used, wall clock time, and physical memory used. At this time, there are aspects of these two reports that we cannot reconcile. For example, on a 26 processor run, the memory report by the code is 14.2MB x 25 = 365MB while the Epilogoue report is 1402GB
Initial Strong Scaling Tests:
Then we ran initial test of the scalability of the existing code. We ran each simulation three times and took the lowest number from each set of three runs. We used the USC HPCC cluster and the main queue for all jobs. The simulation size in this study was:
- Mesh Points: 2099601
- Time steps: 1000
The emod3d-mpi code has internal checks that limit the scaling tests. When the number of mesh points per processor drops below a certain number, the code will exit with a warning to the user. This limited the size of the problem we could run in our scaling test with the selected reference problem. For example, if we tried to run the reference problem on 30 processors, the code exited with a warning so we have no measurements for this size problem.
The following measures are strong scaling tests in which we divide the same calculation across more and more processors.
emod3d-mpi Strong Scaling Measurements
| Mesh Size (Points) |
Processor Count |
Mesh Points per Processor |
Physical Memory Usage (MB) |
CPU Time (Seconds) |
Wall Clock Time (Seconds) |
| 2,099,601 |
3 |
699,867 |
379 |
823 |
278 |
| 2,099,601 |
5 |
419,920 |
478 |
897 |
183 |
| 2,099,601 |
10 |
209,960 |
726 |
755 |
78 |
| 2,099,601 |
15 |
139,974 |
973 |
868 |
63 |
| 2,099,601 |
20 |
104,980 |
1221 |
940 |
51 |
| 2,099,601 |
25 |
83,984 |
1468 |
1075 |
46 |

Initial Weak Scaling Tests:
In order to test the weak scaling of the code, we have constructed a set of 4 different input simulation data sets. Each data set contains twice as many mesh points as the next smaller size. We submit these jobs to twice as many processors so that the number of mesh points per processor remains constant.
The emod3d code also has internal checks that limited the weak scaling tests. The code performed some internal checking on how the domain decomposition was done. We found that it would exit with a warning when we double these weak scaling tests to 8M mesh points and 80 processors. We need to study the code more to understand specifcally why the code did not consider the next level configuation as valide.
emod3d-mpi Weak Scaling Measurements
| Mesh Size (Points) |
Processor Count |
Mesh Points per Processor |
Physical Memory Usage (MB) |
CPU Time (Seconds) |
Wall Clock Time (Seconds) |
| 534,681 |
5 |
106,936 |
289 |
220 |
47 |
| 1,062,761 |
10 |
106,276 |
585 |
480 |
52 |
| 2,099,601 |
20 |
104,980 |
1220 |
1239 |
67 |
| 4,186,161 |
40 |
104,645 |
2439 |
2789 |
74 |
|
|
5) Measure performance of Serial Code on Reference Problem
|
|
Our initial input data sets for the serial code serial code initialize then exit with a memory allocation error. This error appears to be due to a configuration mismatch, and not an actual lack of memory available for the malloc. We will perform performance tests on this code when we have obtained an working input data set.
|
|
6) Generate a Reference Output Dataset for MPI Code
|
|
As we convert the serial code to MPI, we need to ensure two things that the new code produces equivalent results for a given problem. In order to establish a reference solution, we have run a reference problem and generated a reference solution. In this case, we will use seismograms produced by the simulation as our reference solution. The outputs of the code are seismograms at each surface mesh point. However, for our purposes, we will select a small number (currently small number n=1) of surface mesh points and plot the seismograms produced by the MPI code. These seismograms will then function as our reference solution and our converted code must produce equivalent seismograms when we run the same simulation.
The conversion from native output data format to a "seismogram" or time series format was a mutli-step process. The emod3d-mpi writes an output file for each processor. In order to plot seismograms, the following processing steps were used:
1) We first merge the multiple output files into a single output file. We used the following script to do this: sh.merge. This file requires a configuration file listing all the existing output files, along with information about the number of processors and information about the simulation spacing from the e3d.par file. The output of sh.merge is a single large binary file (filename:testout.binary) with all three components of motion in it. Also, the file has a header describing the contents of the file. The format is this file contains the data in fast xyt format.
2) We verified that our single output binary file using a utility that just reads the header of the file. The program and results for our reference solution are as follows:
~maechlin/cme/p2v21/post/emod3d_out_fileheader in=testout.binary
Starting x grid location for output: 0
Starting y grid location for output: 0
Starting z grid location for output: 1
Starting time step for output : 0
Number of x points : 161
Number of y points : 161
Number of z points : 1
Number of time points : 1001
Grid spacing in X : 0.150
Grid spacing in Y : 0.150
Grid spacing in Z : 0.150
Time spacing in sec : 0.009
Rot of y-axis from south (clockwise positive): 0.000
Latitude of model origin : 34.216358
Longitude of model origin : -118.604073
3) The next processing step split this file into three separate component files. Each component file will be a binary file that contains only the data values with not header. The resulting files can be verified by their size. Because they contain only data, and no header, the file size should equal nx x ny x nt x 4 bytes. (161 x 161 x 1001 x 4 = 103,787,684).
~maechlin/cme/p2v21/post/emod3d_out_parsecomps in=testout.binary outX=seisx outY=siesy outZ=seisz
-rw-r--r-- 1 maechlin users 103787684 May 9 2007 seisx
-rw-r--r-- 1 maechlin users 103787684 May 9 2007 seisy
-rw-r--r-- 1 maechlin users 103787684 May 9 2007 seisz
These files contain the seismograms, but the data is in xyt order. This makes is somewhat difficult to read a single time series.
4) Next, we re-order the data to change the order of the indicies. We convert to a format which in which the indicies are order txy. We do this re-format step for each of the component f/home/scec-
~maechlin/cme/p2v21/post/3dMapview2Surfseis in=seisx out=seisxss nx=161 ny=161 nt=1001
~maechlin/cme/p2v21/post/3dMapview2Surfseis in=seisy out=seisyss nx=161 ny=161 nt=1001
~maechlin/cme/p2v21/post/3dMapview2Surfseis in=seisz out=seiszss nx=161 ny=161 nt=1001
5) Now we have binary time series files. We now use a simple python script to read data out and to create ASCII files with a list of floating point values in them. These data values represent ground velocities in cm/sec. The parameters to this script are the x and y mesh positions of the seismogram that will be retrieved. It assumes all three components inputs files and it produces three separate output files.
$ get_ts_project.py 0 0
filepos: 0 - Input files: ./seisxss - Output files: ./0000_0000.X.txt
-rw-r--r-- 1 maechlin users 9296 May 9 2007 0000_0000.X.txt
-rw-r--r-- 1 maechlin users 9368 May 9 2007 0000_0000.Y.txt
-rw-r--r-- 1 maechlin users 9510 May 9 2007 0000_0000.Z.txt
6) In order to view the ASCII files, we import them into an Excel spreadsheet. We could convert them to a seismological seismogram data format (e.g. SEED or SAC) and then use a more specialized plotting tools. But for now, we import the data as colums of numbers and plot the columns using Excel charts.
7) For our reference solution, we read the data from position x=0, y=0. This gives us the seismogram from x=0, y=0. We consider these three seismograms as our reference solutions for this project. Our converted code should produced equivalent seismograms. It is necessary for the new codes to produce equivalent seismograms at the same point, for the same problem. Our new code must be able to produce these seismograms. This is not sufficent verification however. Additional verifications must also be performed which we will investigate once this necessary verification is successful.
The reference seismograms are below:


|
|
7) Generate a Reference Output Dataset for Serial Code
|
|
We do not yet have a reference output dataset for the serial non-uniform mesh code.
|
|
8) Convert Serial, Non-Uniform Code to MPI
|
|
Conversion of the serial non-uniform code to MPI is under development.
|
|
9) Use New MPI-code to Generate an Output Reference Dataset and Verify Results
|
|
TBD
|
|
10) Measure Performance Of New MPI Code
|
|
TBD
|
|
11) Compare Performance of new MPI code to Previous Codes
|
|
TBD
|
|
12) Collect Source Code Metrics for New MPI Code
|
|
TBD
|
|
Project Proposal
|
|
Problem Statement:
Numerical modeling codes that can simulate seismic motions in 3D elastic media typically utilize a uniform mesh to represent the material properties of the geological volume through which the earthquake waves are propagated. A regular mesh has an important advantage that it is easy to create. However, it has a significant disadvantage in that the finest resolution must be used throughout the full volume. In the case of seismic wave simulations, geological information near the surface is known with greatest accuracy and a high resolution mesh is useful. At significant depths (>10KM) only very coarse information about geological structures is known, so a lower resolution mesh could capture our knowledge of the geology with reasonable accuracy. The resolution of the mesh impacts both the computational time (because calculations must be performed at each mesh point) as well as memory requirements (because each mesh point must be read into RAM as an initial condition).
Proposed Approach:
There is an existing serial finite-difference code that utilizes a nonuniform mesh (R.W. Graves 2002). We will begin with this code and evaluate its memory requirements and performance on a simple “reference” problem. The results of this reference problem will form our reference solution. We will then convert the code to use MPI. To help us with this conversion, we may refer to a related code (which does not support nonuniform meshes) that has been converted to run in parallel using MPI. Once we have a converted code, we will re-run our reference problem and measure the memory and performance characteristics of the new code. We will also compare the results of the new code against our reference solution to verify proper operation of our code.
Impact:
The availability of a verified wave propagation code that can use a nonuniform mesh would enable SCEC to run large scale simulations on low-memory system such as Blue Gene SDSC. If our conversion is successful, we believe this new implementation would have significant potential use for SCEC research.
|
|