CMSSL RELEASE NOTES
Version 3.2, April 1994

Copyright (c) 1994 by Thinking Machines Corporation.


1 INTRODUCTION
***************

These release notes summarize the changes and new features in the CM
Fortran interface to Version 3.2 of the CM Scientific Software Library
(CMSSL). One section is devoted to each of the following topics:

   o  Hardware and software requirements
   o  New features
   o  Changes from previous versions
   o  Limitations and restrictions
   o  Documentation for this release
   o  Acknowledgments

CMSSL Version 3.2 includes all the functionality of Version 3.1, as
well as the new functionality described in Section 3.


2 HARDWARE AND SOFTWARE REQUIREMENTS
*************************************

2.1 HARDWARE REQUIRED
----------------------

CMSSL Version 3.2 supports CM-5 systems with or without vector units,
and supports the nodal CMSSL library. You can also call this version
of CMSSL from global/local CM Fortran programs. The nodal and
global/local execution models require a CM-5 with vector units.

2.2 SOFTWARE REQUIRED
----------------------

CMSSL Version 3.2 for CM Fortran requires prior installation of the
following software:

   o  CMOST Version 7.2 (or higher)
   o  CM Fortran Version 2.1 (or higher)

In addition, for CM Fortran programs that run in the single-node
execution model (-node), you must install CMMD Version 3.0 (or
higher). If the default CMMD is not an appropriate version, you may
need to add

   -cmmd_root /usr/cmmd/3.0

to your link line.

2.3 LINKING WITH CMSSL VERSION 3.2
-----------------------------------

After writing a CM Fortran program that calls CMSSL routines, compile
it and link it with the library. Compiling a CM Fortran CMSSL program
is the same as compiling other CM Fortran programs: use the cmf
command. To compile the program program.fcm on a CM-5 and link it with
CMSSL, use either of the alternatives described in this section.
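
The three basic command lines for the -lcmssl alternative in this
section differ only in their execution-model switches. As a minimal
sketch of that pattern, the shell function below prints the matching
command line for a given model. The function name cmssl_link_line and
the model keywords sparc, vu, and node are hypothetical and are not
part of CMSSL or CM Fortran; the function only echoes the command and
does not invoke cmf. It does not cover the global/local case or the
-cmmd_root switch.

```shell
# Hypothetical helper (not part of CMSSL): print the cmf command line
# for the -lcmssl alternative. Arguments: execution model, program name.
cmssl_link_line() {
    model=$1
    prog=$2
    case "$model" in
        sparc) flags="-cm5 -sparc" ;;        # CM Fortran (SPARC) nodes model
        vu)    flags="-cm5 -vu" ;;           # vector-units model
        node)  flags="-cm5 -vu -node" ;;     # single-node model
        *)     echo "unknown execution model: $model" >&2
               return 1 ;;
    esac
    # Echo only; the caller decides whether to run the command.
    echo "cmf $flags -o $prog $prog.fcm -lcmssl"
}

cmssl_link_line vu program
```

Printing the command rather than running it keeps the sketch usable on
systems where cmf is not installed and lets you inspect the line first.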

Linking with CMSSL Using -lcmssl

If the CM Fortran program program.fcm calls CMSSL routines, you can
compile it and link it with CMSSL using the -lcmssl switch. This switch
links with the appropriate CMSSL library based on the execution model
you specify in the cmf command line. To use this alternative, issue
one of the following command lines at the UNIX prompt:

   o  In the CM Fortran (SPARC) nodes model:

         cmf -cm5 -sparc -o program program.fcm -lcmssl

   o  In the CM Fortran vector-units model:

         cmf -cm5 -vu -o program program.fcm -lcmssl

   o  In the CM Fortran single-node model:

         cmf -cm5 -vu -node -o program program.fcm -lcmssl

      In the single-node model, add -cmmd_root /usr/cmmd/3.0 before
      -lcmssl if CMMD Version 3.0 (or higher) is not the default on
      your system.

   o  To compile and link a CM Fortran global/local program that makes
      calls to CMSSL routines in both the global and the local program
      units:

         cmf -cm5 -vu -o program program.fcm -lcmssl -local local.fcm program.proto -local -lcmssl

Linking with a Selected CMSSL Library Explicitly

Alternatively, if you want to link with the appropriate CMSSL library
explicitly, issue one of the following command lines at the UNIX
prompt:

   o  In the CM Fortran (SPARC) nodes model:

         cmf -cm5 -sparc -o program program.fcm -lcmsslcm5

   o  In the CM Fortran vector-units model:

         cmf -cm5 -vu -o program program.fcm -lcmsslcm5vu

   o  In the CM Fortran single-node model:

         cmf -cm5 -vu -node -o program program.fcm -lcmsslcm5vu-node

      In the single-node model, add -cmmd_root /usr/cmmd/3.0 before
      -lcmsslcm5vu-node if CMMD Version 3.0 (or higher) is not the
      default on your system.

   o  To compile and link a CM Fortran global/local program that makes
      calls to CMSSL routines in both the global and the local program
      units:

         cmf -cm5 -vu -o program program.fcm -lcmsslcm5vu -local local.fcm program.proto -local -lcmsslcm5vu-node


3 NEW FEATURES
****************

The CM Fortran operations listed below are new since Version 3.1.
All chapter numbers refer to CMSSL for CM Fortran, Version 3.2.

   o  Gaussian elimination with external storage

      The external LU factorization and solver routines have been
      completely rewritten and optimized in Version 3.2; they have a
      new interface that involves reverse communication. (Chapter 5)

   o  Fixed-machine-size save and restore routines for LU and QR
      solvers

      The LU and QR routines are joined by new save and restore
      routines: save_gen_lu_fms, restore_gen_lu_fms, save_gen_qr_fms,
      and restore_gen_qr_fms. These new routines have the same calling
      sequences and perform the same functions as the corresponding
      older routines (save_gen_lu, restore_gen_lu, save_gen_qr, and
      restore_gen_qr, respectively), except that the new routines use
      fixed-machine-size I/O while the older ones use serial-order
      I/O. (Chapter 5)

   o  Other new Gaussian elimination routines

      The Gaussian elimination routines described in Chapter 5 of
      CMSSL for CM Fortran are joined by three new routines in Version
      3.2: gen_lu_apply_l, gen_lu_apply_l_tra, and gen_lu_zero_rows.
      (Chapter 5)

   o  Bidiagonalization

      Version 3.2 introduces the gen_bidiag routine, which transforms
      dense real matrices to bidiagonal form. Auxiliary routines
      transform arbitrary vectors between the basis of the original
      dense matrix and the basis of the bidiagonal matrix. (Chapter 9)

   o  Singular values of a bidiagonal matrix

      The bidiag_svd_singular_values routine computes all the singular
      values of one or more real bidiagonal matrices. (Chapter 9)

   o  Singular vectors of a bidiagonal matrix

      The bidiag_svd_singular_vectors routine computes the singular
      vectors corresponding to a set of singular values of one or more
      real bidiagonal matrices of the same order. (Chapter 9)

   o  Singular value decomposition of dense real matrices

      The gen_bidiag_singular_system routine performs a singular value
      decomposition of one or more dense real matrices.
      (Chapter 9)

   o  Range histogram with CM array of bins

      Version 3.2 introduces a new histogram routine,
      histogram_range_cm, that performs the same function as the
      histogram_range routine, but stores the bins in a CM array
      rather than a front-end array. (Chapter 14)

   o  Combination of permutations

      The new combine_fe_perms routine returns the combination of two
      permutations supplied to it in front-end arrays. (Chapter 15)

   o  Zeroing routine

      This release introduces a new routine, zero_elements, that
      zeroes a CM array. This routine provides faster performance than
      the equivalent CM Fortran code, especially for single-precision
      real or complex data. (Chapter 15)


4 CHANGES FROM PREVIOUS VERSIONS
**********************************

The following CM Fortran routines have changed since Version 3.1:

   o  Arbitrary elementwise sparse matrix operations

      The irandom, itrace, and trace arguments to the arbitrary
      elementwise sparse matrix operations have been renamed to
      mapping, motion, and setup, respectively. The mapping and motion
      arguments provide new options for permutation method and
      communication algorithm, respectively. The mapping argument now
      makes available source and destination array permutations
      generated internally using the partition_mesh routine, for
      problems with symmetric sparsity. The motion argument provides
      an option for communication that uses the part_gather and
      part_scatter routines, as an alternative to the
      sparse_util_gather and sparse_util_scatter routines.

      The values accepted by these arguments are as follows:

         Value of mapping   Permutation returned
         ----------------   --------------------
         0                  identity (old irandom = 0)
         1                  random (old irandom = 1)
         2                  based on partitioning

         Value of motion    Operations used by communication algorithm
         ---------------    ------------------------------------------
         0                  get, send, scan-add (old itrace = 0)
         1                  part_gather, part_scatter
         2                  sparse_util_gather, sparse_util_scatter
                            (old itrace = 1)

      Additionally, the Version 3.1 man page for these routines listed
      the where_is_x and where_is_y arguments in the wrong order; the
      order has been corrected in the Version 3.2 man page.
      (Chapter 4)

   o  Arbitrary block sparse matrix operations

      The irandom, itrace, trace, trace_mask, and setup arguments of
      the arbitrary block sparse matrix routines have been changed to
      mapping, motion, setup1, setup2, and setup3, respectively.
      Permutations based on partitioning are not yet available with
      these routines, but the motion values are the same as above
      (except that motion = 0 uses gets and send-adds). (Chapter 4)

   o  QR and LU Routines

      The QR routines now allow nblock values greater than 1 with
      pivoting. The LU routines now allow you to specify m > n in the
      no-pivoting case. (However, you may not use the gen_lu_get_l
      routine when m > n.) The LU and QR factors are now clearly
      defined in the manual for the non-square case. (Chapter 5)

   o  Range histogram

      Beginning with Version 3.2, the histogram_range and
      histogram_range_cm routines require the min and range arguments
      to have the same data type (integer or real) as the destination
      array, A. (Chapter 14)

   o  Banded System Solvers

      The nblock argument of the gen_banded_factor,
      block_tridiag_factor, block_tridiag_solve,
      block_pentadiag_factor, and block_pentadiag_solve routines has a
      new name and meaning. The argument, now called group_size,
      allows you to specify (or ask the routine to select) the number
      of problem instances per processing element that are treated
      together in one step of Gaussian elimination.
      This feature can significantly improve performance in comparison
      to CMSSL Version 3.1. In addition, the work argument is now a
      scalar integer rather than a front-end array. (Chapter 6)

   o  Iterative Solvers

      The iterative solvers now support complex data for all
      algorithms except CMSSL_cg (which is used for symmetric positive
      definite systems) and CMSSL_bicgstab2. Before Version 3.2, all
      algorithms required real data. (Chapter 7)

   o  Generalized Eigensystem Analysis

      The sym_tred_gen_eigensystem routine now operates on complex
      Hermitian matrices; previously it operated only on real
      symmetric matrices. (Chapter 8)

   o  Simplex routine

      The gen_simplex routine now checks the value of ier on input. If
      ier is set to 2 (reinvert) or 16 (degenerate), gen_simplex
      assumes you are reinverting; otherwise, it assumes you are
      passing it a new problem.

      Two new performance guidelines take effect with Version 3.2.
      These new guidelines, together with the two conditions already
      listed in the gen_simplex man page, ensure that gen_simplex does
      not copy the CM array A:

      .  Lay out A so that the subgrid size of the first axis is
         not 1.

      .  Compile the main program with -axisreorder.

      (Chapter 12)

   o  Matrix transpose performance enhancements

      The axis-length restriction that was imposed by the
      gen_matrix_transpose routine in order to achieve enhanced
      performance has been lifted in Version 3.2.

      In addition, Version 3.2 adds another enhancement for
      three-dimensional and higher-rank arrays. The
      gen_matrix_transpose routine yields superior performance when a
      local axis is exchanged with a non-local axis that is
      distributed across the vector units on the same processing node.
      By far the best performance is obtained when the non-local axis
      spans only two vector units (that is, only the lowest off-chip
      bit is set). The next best performance results when the
      non-local axis spans four vector units on the same processing
      node (the two lowest off-chip bits are set).
      For the general case of an axis that spans multiple processing
      nodes, the transpose performance does not depend significantly
      on the number of contiguous off-chip bits (except if the axis
      spans only two nodes, in which case the performance is slightly
      better). You can use this fact to improve the performance of
      transposes involving arrays of rank greater than or equal to
      three. With nodal CMSSL, you can also exploit this fact with
      two-dimensional arrays. (Chapter 15)

   o  Fast Fourier Transform

      The fft_setup and fft routines now perform error checking; see
      Chapter 10 of CMSSL for CM Fortran for details. In addition, the
      enhancements to gen_matrix_transpose mentioned above may help
      users who perform transposes explicitly when performing FFTs on
      multidimensional arrays. (Chapter 10)

   o  Sparse gather, scatter, vector gather, and vector scatter
      utilities

      In the sparse vector gather and scatter utilities, the vectors
      are no longer required to lie along the left-most axis; you
      choose the vector axes in the source and destination arrays
      using two new arguments, x_vector_axis and y_vector_axis. In
      addition, the calling sequences of these routines now more
      closely resemble those of the partitioned gather and scatter
      utilities. The following changes have been made in Version 3.2:

      .  The trace and trace_mask arguments of sparse_util_gather and
         its associated routines were combined into a single setup
         argument.

      .  The y_template and x_mask arguments exchanged places in
         sparse_util_scatter_setup and sparse_util_vec_scatter_setup.

      .  The error code argument ier was added to
         sparse_util_scatter_setup and sparse_util_vec_scatter_setup.

      .  The pointers argument p was removed from sparse_util_scatter.

      (Chapter 15)

   o  Partitioning

      The following new features were introduced into the
      partition_mesh routine in Version 3.2:

      .  You can now divide a mesh into partitions that contain
         multiple processing elements.
         One use for this feature would be to divide a very large
         mesh (which cannot be handled as one piece by your
         application) into several smaller pieces that can be
         processed sequentially. The numproc argument is now an input
         as well as output argument; it helps determine the number of
         processing elements in each partition.

      .  A new argument, storage_option, provides a low-storage option
         that is slower than the default operation, but uses less
         storage for working arrays.

      .  A new argument, verbose, prints statistics to standard
         output.

      In addition, in Version 3.2, you can choose the axis along which
      to reorder pointers with the reorder_pointers routine.
      Previously, this routine always reordered a pointers array along
      its last axis. You can also choose which axis counts the mesh
      elements in the ien and idual arrays. Moreover, a new element
      type (segment) has been added. (Chapter 15)

   o  Partitioned gather and scatter utilities

      The part_gather, part_scatter, part_vector_gather,
      part_vector_scatter, and associated routines changed in the
      following ways in Version 3.2:

      .  The trace argument was renamed setup.

      .  The routines now take advantage of data locality along any
         axis; the reordered axis need not be the last axis. For best
         performance, the reordered axis should be non-local and all
         other axes should be local. (A local axis is either :serial
         or laid out with :procs = 1.)

      (Chapter 15)

   o  Optimization Hints for All-to-All Broadcast and Reduction

      The performance guidelines for the all-to-all broadcast and
      reduction routines have been updated to take into account the
      CM Fortran -noaxisreorder switch. (Chapter 15)

   o  Vector move (extract and deposit)

      The vector move routine has a new error code. Additionally, the
      description of the vector move operation in CMSSL for CM
      Fortran, CM-5 Edition, Version 3.1, was incorrect. The vector
      move routine moves one vector from each subgrid of the source CM
      array into a subgrid of the destination CM array.
      (Chapter 15)


5 LIMITATIONS AND RESTRICTIONS
********************************

The routines listed below cannot be used with the nodal CMSSL library
(the library used when you compile your program in the CM Fortran
single-node execution model). Also, if you are using the CM Fortran
global/local execution model, do not call these routines from the
local program unit.

   save_gen_lu          gen_lu_solve_ext     gen_qr_solve_ext
   restore_gen_lu       save_gen_qr          gen_matrix_mult_ext
   save_gen_lu_fms      restore_gen_qr       save_fast_rng_temps
   restore_gen_lu_fms   save_gen_qr_fms      restore_fast_rng_temps
   gen_lu_setup_ext     restore_gen_qr_fms   save_vp_rng_temps
   gen_lu_factor_ext    gen_qr_factor_ext    restore_vp_rng_temps


6 DOCUMENTATION FOR THIS RELEASE
**********************************

The CM Fortran interface to CMSSL Version 3.2 is documented in CMSSL
for CM Fortran, Version 3.2. The information that is summarized in
these release notes is presented in more detail in the manual.

Your software tape includes ASCII and PostScript versions of these
release notes and of CMSSL for CM Fortran. The default location for
the release notes is

   /usr/doc/cmssl-3.2-releasenotes

The default location for CMSSL for CM Fortran, Version 3.2 is

   /usr/doc/cmssl/cmssl-for-cmf-v1-3.2   (Volume I)
   /usr/doc/cmssl/cmssl-for-cmf-v2-3.2   (Volume II)

Within each volume directory you will find a PostScript file for each
chapter. The file called README contains more information.

If you do not find the documents in the default locations, check with
your system administrator or the person who installs the CMSSL at your
site.

6.1 ON-LINE SAMPLE CODE AND MAN PAGES
--------------------------------------

Included with CMSSL are sample on-line programs that demonstrate how
to call each CMSSL routine. You are encouraged to experiment with
these sample programs. Also included with CMSSL are on-line man pages
for all routines.

The on-line sample programs are located in subdirectories of the CMSSL
examples directory.
The default location for the examples directory is
/usr/examples/cmssl. Examples for a given operation are included in
the subdirectory operation/cmf (or operation/sub-operation/cmf) or
operation/cstar (or operation/sub-operation/cstar) of the examples
directory, where operation and sub-operation stand for the names of
the operation and sub-operation. For example, CM Fortran sample code
for the routine that performs eigenvector analysis using the Jacobi
method is located in the subdirectory eigen/jacobi/cmf of the examples
directory.

If you do not find the on-line examples in /usr/examples/cmssl, check
with your system administrator (or the person who installs the CMSSL
at your site) to find out where they were installed.

To read the on-line man page for a routine, enter the command

   man routine_name

at the UNIX prompt.


7 ACKNOWLEDGMENTS
*******************

The bidiagonalization and singular value decomposition routines
introduced in this release are the result of collaborative development
between Thinking Machines Corporation and the Danish Computing Center
for Research and Education (UNI-C).