## NERSC, Cori, Knights Landing, And Other Matters

Jack Deslippe

February, 2016

## Agenda

- 1. Overview of NERSC
- 2. Things I wished I knew back when I took CS267
- 3. Optimizing Applications For Cori Phase 2
- 4. Example Case Study

# **NERSC**

# NERSC is the production HPC center for the DOE Office of Science

- 5000 users, 600 projects
- From 48 states; 65% from universities
- Hundreds of users each day
- 1500 publications per year

Systems designed for science

## The NERSC-8 System: Cori

- Cori will support the broad Office of Science research community and begin to transition the workload to more energy efficient architectures
- Cray XC system with over 9,300 Intel Knights Landing compute nodes, mid 2016
  - Self-hosted (not an accelerator) manycore processor with up to 72 cores per node
  - On-package high-bandwidth memory
- Data Intensive Science Support
  - 10 Haswell processor cabinets (Phase 1) to support data intensive applications, Summer 2015
  - NVRAM Burst Buffer to accelerate data intensive applications
  - 28 PB of disk, >700 GB/sec I/O bandwidth
- Robust Application Readiness Plan
  - Outreach and training for user community
  - Application deep dives with Intel and Cray

## **NERSC's Current Big System is Edison**

- Edison is the HPCS\* demo system (serial #1)
- First Cray Petascale system with Intel processors (Ivy Bridge), Aries interconnect topology
- Very high memory bandwidth (100 GB/s per node)
- 5,576 nodes, 133K cores, 64 GB/node
- Exceptional application performance

## **NERSC** moved into Wang Hall late 2015

- Four story, 140,000 GSF, 300 offices, 20Ksf HPC floor, 12.5->40 MW
- Located for collaboration
  - LBNL, CRD, ESnet, UCB
- Exceptional energy efficiency
  - Natural air and water cooling
  - Heat recovery
  - PUE < 1.1

# Things I Wish I Knew When I was In CS267

## **Debugging Parallel Programming Bugs**

#### This code hangs because both Task 0 and Task N-1 are blocking on MPI\_Recv

```
if(task_no==0) {
    ret = MPI_Recv(&herBuffer, 50, MPI_DOUBLE, totTasks-1,
                   0, MPI_COMM_WORLD, &status);
    ret = MPI_Send(&myBuffer, 50, MPI_DOUBLE, totTasks-1, 0, MPI_COMM_WORLD);
}
else if (task_no==(totTasks-1)) {
    ret = MPI_Recv(&herBuffer, 50, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
    ret = MPI_Send(&myBuffer, 50, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
}
```

#### **NERSC NX - Accelerate Your X Connection**

#### Introduction

NX is a computer program that handles remote X Window System connections, and it provides three benefits for NERSC users:

- SPEED: NX can greatly improve the performance of X Windows, allowing users with slow, high latency connections (e.g. on cell phone network, traveling in Africa) to use complex X Windows programs (such as rotating a plot in Matlab).
- SESSION: NX provides sessions that allow a user to disconnect from the session and reconnect to it at a later time while keeping the state of all running applications inside the session.
- DESKTOP: NX gives users a virtual desktop that's running at NERSC. You can customize the desktop according to your work requirement.
## **Compile & Start DDT**

## Compile for debugging

```
edison% make
cc -c -g hello.c
cc -o hello -g hello.o
```

## Set up the parallel run environment

```
edison% qsub -I -V -lmppwidth=24
edison% cd $PBS_O_WORKDIR
```

## Start the DDT debugger

```
edison% module load ddt
edison% ddt ./hello
```

### **DDT Screen Shot**

## **Vendors are starting to listen (DDT)**

## **Debuggers on NERSC machines**

- Parallel debuggers with a graphical user interface
  - DDT (Distributed Debugging Tool)
  - TotalView
- Specialized debuggers on Hopper and Edison
  - STAT (Stack Trace Analysis Tool)
    - Collect stack backtraces from all (MPI) tasks
  - ATP (Abnormal Termination Processing)
    - Collect stack backtraces from all (MPI) tasks when an application fails
  - CCDB (Cray Comparative Debugger)
    - Comparative debugging

## **Profile Your Application (VTune / CrayPat)**

#### Thread Activity

# Cori Phase 2

#### What is different about Cori?

- Cori will begin to transition the workload to more energy efficient architectures
- Cray XC system with over 9300 Intel Knights Landing (Xeon-Phi) compute nodes
  - Self-hosted (not an accelerator) manycore processor with 72 cores per node
  - On-package high-bandwidth memory

#### What is different about Cori?

## Edison (Ivy-Bridge):

- 12 Cores Per CPU
- 24 Virtual Cores Per CPU
- 2.4-3.2 GHz
- Can do 4 Double Precision Operations per Cycle (+ multiply/add)
- 2.5 GB of Memory Per Core
- ~100 GB/s Memory Bandwidth

## Cori (Knights-Landing):

- Up to 72 Physical Cores Per CPU
- Up to 288 Virtual Cores Per CPU
- Much lower clock speed (GHz)
- Can do 8 Double Precision Operations per Cycle (+ multiply/add)
- < 0.3 GB of Fast Memory Per Core
- < 2 GB of Slow Memory Per Core
- Fast Memory has ~4-5x DDR4 Bandwidth

## **NESAP**

#### Breakdown of Application Hours on Hopper and Edison 2013

# Basic Optimization Concepts

#### MPI + X?
Need to explicitly consider both inter-node and on-node parallelism in the application. Existing applications may suffer from:

- Memory overhead due to duplicated data in traditional MPI tasks
- Lack of SIMD/vectorization expressiveness in the app
- Potential MPI latency in all-to-all communication patterns

#### Possible Solutions:

MPI+MPI, MPI+OpenMP, PGAS (MPI+PGAS), Task Based Programming

#### PARATEC Use Case For OpenMP

PARATEC computes parallel FFTs across all processors. This involves MPI all-to-all communication (small messages, latency bound).

Reducing the number of MPI tasks in favor of OpenMP threads makes a large improvement in overall runtime.

Figure Courtesy of Andrew Canning

#### Vectorization

There is another important form of on-node parallelism:

    do i = 1, n
      a(i) = b(i) + c(i)
    enddo

$$\begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix} = \begin{pmatrix} b_1 \\ \vdots \\ b_n \end{pmatrix} + \begin{pmatrix} c_1 \\ \vdots \\ c_n \end{pmatrix}$$

Vectorization: the CPU does identical operations on different data; e.g., multiple iterations of the above loop can be done concurrently. Works best with long/aligned vectors.

- Intel Xeon Sandy-Bridge/Ivy-Bridge: 4 Double Precision Ops Concurrently
- Intel Xeon Phi: 8 Double Precision Ops Concurrently

Compilers want to "vectorize" your loops whenever possible. But sometimes they get stumped. Here are a few things that prevent your code from vectorizing:

#### Loop dependency:

#### Task forking:

```
do i = 1, n
  if (a(i) < x) cycle
  if (a(i) > x) ...
enddo
```

#### Example From NERSC User Group Hackathon (Astrophysics Transport Code)

```
for (many iterations) {
    ... many flops ...
    et = exp(outcome1)
    tt = pow(outcome2,3)
    IN = IN * et + tt
}
```

Splitting the loop separates the expensive, independent work (which can vectorize) from the short recurrence on IN (which carries a dependency from one iteration to the next):

```
for (many iterations) {
    ... many flops ...
    et(i) = exp(outcome1)
    tt(i) = pow(outcome2,3)
}
for (many iterations) {
    IN = IN * et(i) + tt(i)
}
```

30% speed up for entire application!

#### Example From Cray COE Work on XGC1

Original:

```
real(8), dimension(5, (col_f_nvr-1)*(col_f_nvz-1), (col_f_nvr-1)*(col_f_nvz-1)) :: Ms

do index_ip = 1, mesh_Nzm1
  do index_jp = 1, mesh_Nrm1
    index_2dp = index_jp+mesh_Nrm1*(index_ip-1)
    tmp_vol = cs2%local_center_volume(index_jp)
    tmp_f_half_v = f_half(index_jp, index_ip) * tmp_vol
    tmp_dfdr_v = dfdr(index_jp, index_ip) * tmp_vol
    tmp_dfdz_v = dfdz(index_jp, index_ip) * tmp_vol
    tmpr(1:3) = tmpr(1:3) + Ms(1:3,index_2dp,index_2D)*tmp_f_half_v
    tmpr(5) = tmpr(5) + Ms(4,index_2dp,index_2D)*tmp_dfdr_v +
```

Optimized:

```
real(8), dimension((col_f_nvr-1), 5, (col_f_nvz-1), (col_f_nvr-1)*(col_f_nvz-1)) :: Ms

do index_ip = 1, mesh_Nzm1
  do index_jp = 1, mesh_Nrm1
    index_2dp = index_jp+mesh_Nrm1*(index_ip-1)
    tmp_vol = cs2%local_center_volume(index_jp)
    tmp_f_half_v = f_half(index_jp, index_ip) * tmp_vol
    tmp_dfdr_v = dfdr(index_jp, index_ip) * tmp_vol
    tmp_dfdz_v = dfdz(index_jp, index_ip) * tmp_vol
    tmpr(index_jp,1) = tmpr(index_jp,1) + Ms(index_jp,1,index_ip,index_2D)*tmp_f_half_v
    tmpr(index_jp,2) = tmpr(index_jp,2) + Ms(index_jp,2,index_ip,index_2D)*tmp_f_half_v
    tmpr(index_jp,3) = tmpr(index_jp,3) + Ms(index_jp,3,index_ip,index_2D)*tmp_f_half_v
    tmpr(index_jp,5) = &
        tmpr(index_jp,5) + Ms(index_jp,4,index_ip,index_2D)*tmp_dfdr_v + &
        Ms(index_ip,2,index_ip,index_2D)*tmp_dfdz_v
```

~40% speed up for kernel

### Memory Bandwidth

Consider the following loop:

    do i = 1, n
      do j = 1, m
        c = c + a(i) * b(j)
      enddo
    enddo

Assume n and m are very large, such that a and b don't fit into cache.
Then, during execution, the **number of loads from DRAM** is n\*m + n.

This requires 8 bytes loaded from DRAM per FMA (if supported). Assuming 100 GB/s bandwidth on Edison, we can sustain at most 100 / 8 = 12.5 billion FMAs per second, i.e. 25 GFlops/second (2 Flops per FMA).

That is much lower than the 460 GFlops/second peak on an Edison node. The loop is memory bandwidth bound.

#### Roofline Model For Edison

### Improving Memory Locality

Improving memory locality reduces the bandwidth required.

(Figure: two versions of the loop, each annotated with its loads from DRAM.)

## Improving Memory Locality Moves you to the Right on the Roofline

# **Optimization Strategy**

The Ant Farm! (flowchart)

Observations:

- OpenMP scales only to 4 threads
- Large cache miss rate
- Communication dominates beyond 100 nodes
- Code shows no improvements when turning on vectorization
- 50% Walltime

Diagnoses and actions:

- MPI/OpenMP scaling issue: use Edison to test/add OpenMP and improve scalability. Help from NERSC/Cray COE available.
- Compute intensive, doesn't vectorize: can you use a library?
- Memory bandwidth bound kernel: increase memory locality.
- IO bottlenecks: utilize high-level IO libraries. Consult with NERSC about use of Burst Buffer.
- Create micro-kernels or examples to examine thread level performance, vectorization, cache use, locality.
- The Dungeon: simulate kernels on KNL. Plan use of on-package memory, vector instructions.
- Utilize performant / portable libraries.

### Measuring Your Memory Bandwidth Usage (VTune)

Measure memory bandwidth usage in VTune. (Next Talk)

Compare to STREAM GB/s. If you reach 90% of STREAM, you are memory bandwidth bound. If less, more tests need to be done.

### Run Example in "Half Packed" Mode

If you run on only half of the cores on a node, each core you do run has access to more bandwidth:

    aprun -n 24 -N 12 -S 6 ...

vs.

    aprun -n 24 -N 24 -S 12 ...
If your performance changes, you are at least partially memory bandwidth bound.

### Run Example at "Half Clock" Speed

Reducing the CPU speed slows down computation, but doesn't reduce the memory bandwidth available. Compare your usual aprun launch with:

    aprun --p-state=1900000 ...

If your performance changes, you are at least partially compute bound.

### So, you are Memory Bandwidth Bound?

### What to do?

1. Try to improve memory locality, cache reuse.
2. Identify the key arrays leading to high memory bandwidth usage and make sure they are/will-be allocated in HBM on Cori.

Profit by getting ~5x more bandwidth GB/s.

### So, you are Compute Bound?

### What to do?

1. Make sure you have good OpenMP scalability. Look at VTune to see thread activity for major OpenMP regions.
2. Make sure your code is vectorizing. Look at Cycles per Instruction (CPI) and VPU utilization in VTune. See whether the Intel compiler vectorized a loop using the compiler flag: -qopt-report=5

### Complex-Division (without -fp-model fast=2)

### So, you are neither compute nor memory bandwidth bound?

You may be memory latency bound (or you may be spending all your time in IO and communication).

If running with hyper-threading on Edison improves performance, you \*might\* be latency bound.

If you can, try to reduce the number of memory requests per flop by accessing contiguous and predictable segments of memory and reusing variables in cache as much as possible.

On Cori, each core will support up to 4 threads. Use them all.

# **NESAP Case Study**

### BerkeleyGW Use Case

- ★ Big systems require more memory. Cost scales as N<sub>atoms</sub><sup>2</sup> to store the data.
- ★ In an MPI GW implementation, in practice, to avoid communication, data is duplicated and each MPI task has a memory overhead.
- ★ Users are sometimes forced to use 1 of 24 available cores, in order to provide MPI tasks with enough memory.
More than 90% of the computing capability is lost.

(Figure: MPI Tasks 1, 2 and 3, each holding its share of the distributed data plus its own copy of the overhead data.)

# Computational Bottlenecks

In-house code (I'm one of the main developers). Used as a "prototype" for App Readiness.

A significant bottleneck is large matrix-reduction-like operations: turning arrays into numbers.

$$\begin{aligned}
\langle n\mathbf{k} | \Sigma_{\text{CH}}(E) | n'\mathbf{k} \rangle = {} & \frac{1}{2} \sum_{n''} \sum_{\mathbf{q}\mathbf{G}\mathbf{G}'} M_{n''n}^*(\mathbf{k}, -\mathbf{q}, -\mathbf{G}) M_{n''n'}(\mathbf{k}, -\mathbf{q}, -\mathbf{G}') \\
& \times \frac{\Omega_{\mathbf{G}\mathbf{G}'}^2(\mathbf{q}) \left(1 - i \tan \phi_{\mathbf{G}\mathbf{G}'}(\mathbf{q})\right)}{\tilde{\omega}_{\mathbf{G}\mathbf{G}'}(\mathbf{q}) \left(E - E_{n''\mathbf{k} - \mathbf{q}} - \tilde{\omega}_{\mathbf{G}\mathbf{G}'}(\mathbf{q})\right)} v(\mathbf{q} + \mathbf{G}')
\end{aligned}$$

### Early Optimization Work

- 1. Target more on-node parallelism. (MPI model already failing users)
- 2. Ensure key loops/kernels can be vectorized.

**Example: Optimization steps for Xeon Phi Coprocessor**

### Final Loop Structure

```
!$OMP DO reduction(+:achtemp)
do my_igp = 1, ngpown
  do iw=1,3
    scht=0D0
    wxt = wx_array(iw)
    do ig = 1, ncouls
      !if (abs(wtilde_array(ig,my_igp) * eps(ig,my_igp)) .lt. TOL) cycle
      wdiff = wxt - wtilde_array(ig,my_igp)
      delw = wtilde_array(ig,my_igp) / wdiff
      scha(ig) = mygpvar1 * aqsntemp(ig) * delw * eps(ig,my_igp)
      scht = scht + scha(ig)
    enddo ! loop over g
    sch_array(iw) = sch_array(iw) + 0.5D0*scht
  enddo
  achtemp(:) = achtemp(:) + sch_array(:) * vcoul(my_igp)
enddo
```

- Outer `my_igp` loop: ngpown is typically in the 100s to 1000s. Good for many threads.
- `iw` loop: the original inner loop. Too small to vectorize!
- `ig` loop: ncouls is typically in the 1000s to 10,000s. Good for vectorization.
- Commented-out `cycle` test: the attempt to save work breaks vectorization and makes the code slower.

### Remaining Questions.
We've had two dungeon sessions with Intel. What kinds of questions can they help with?

- 1. Why is KNC slower than Haswell for this problem?
  - Known to be bandwidth bound on KNC, but not on Haswell.
  - How is it bound on Haswell?
- 2. How much gain can we get by allocating just a few arrays in HBM?

• 2S Haswell 27.9s, KNC 39.9s (bandwidth bound on KNC, but not on Haswell)

```
do my_igp = 1, ngpown                    (OpenMP)
  do iw = 1, 3
    do ig = 1, igmax
      load wtilde_array(ig,my_igp)       819 MB, 512KB per row
      load aqsntemp(ig,n1)               256 MB, 512KB per row
      load I_eps_array(ig,my_igp)        819 MB, 512KB per row
      do work (including complex divide) depends on ig, iw ...
```

- Required cache size to reuse 3 times: 1536 KB (three rows of 512 KB)
- L2 on KNC is 256 KB per hardware thread
- L2 on Haswell is 256 KB per core
- L3 on Haswell is 3800 KB per core

Without blocking we spill out of L2 on KNC and Haswell. But Haswell has L3 to catch us.
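The cache argument above can be sketched in C. This is an illustrative stand-in, not the BerkeleyGW kernel: the arrays are real `double` rather than complex, the function names (`working_set_bytes`, `sweep_unblocked`, `sweep_blocked`) are hypothetical, and the arithmetic in the inner loop only mimics the load pattern of the three arrays. The block size of 2048 used in the note below is the `igblk` value that appears in the blocked version of this loop in the talk.

```c
#include <assert.h>
#include <stddef.h>

enum { N_IW = 3 };  /* the iw loop revisits each row three times */

/* Bytes that must stay in cache for n_arrays rows of row_bytes each
   to be reused across the iw passes. */
static size_t working_set_bytes(size_t row_bytes, size_t n_arrays) {
    return row_bytes * n_arrays;
}

/* Unblocked: every iw pass streams the whole row of all three arrays,
   so a row that does not fit in cache is reloaded on each pass. */
static double sweep_unblocked(const double *w, const double *a, const double *e,
                              const double *wx, int n) {
    double sum = 0.0;
    for (int iw = 0; iw < N_IW; iw++)
        for (int ig = 0; ig < n; ig++)
            sum += a[ig] * e[ig] * w[ig] / (wx[iw] - w[ig]);
    return sum;
}

/* Blocked: the three iw passes run over one block of blk elements at a
   time, so the block is still cached on the second and third pass. */
static double sweep_blocked(const double *w, const double *a, const double *e,
                            const double *wx, int n, int blk) {
    assert(blk > 0);
    double sum = 0.0;
    for (int beg = 0; beg < n; beg += blk) {
        int end = (beg + blk < n) ? beg + blk : n;  /* half-open [beg, end) */
        for (int iw = 0; iw < N_IW; iw++)
            for (int ig = beg; ig < end; ig++)
                sum += a[ig] * e[ig] * w[ig] / (wx[iw] - w[ig]);
    }
    return sum;
}
```

With 512 KB rows, `working_set_bytes(512 * 1024, 3)` gives the 1536 KB quoted on the slide, which exceeds a 256 KB L2. A 2048-element block of `double` needs 48 KB for three arrays (96 KB if the elements were 16-byte complex values), which fits. The two sweeps add the same terms in a different order, so their results agree only up to floating-point rounding.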
Blocking the `ig` loop with igblk = 2048 enables reuse of the L2 cache on KNC:

```
igblk = 2048
do my_igp = 1, ngpown                    (OpenMP)
  do igbeg = 1, igmax, igblk
    do iw = 1, 3
      do ig = igbeg, min(igbeg + igblk - 1, igmax)
        load wtilde_array(ig,my_igp)     819 MB, 512KB per row
        load aqsntemp(ig,n1)             256 MB, 512KB per row
        load I_eps_array(ig,my_igp)      819 MB, 512KB per row
        do work (including divide)
```

- Morning: 2S Haswell 27.9s, KNC 39.9s
- Afternoon: 2S Haswell 27.5s, KNC 29.7s

The loss of L3 on MIC makes locality more important.

### How much performance can we get from 3 arrays in Fast Memory?

#### Identify the candidate (key arrays) for HBM

- VTune Memory Access tool can help to find key arrays
- Using NUMA affinity to simulate HBM on a dual socket system
- Use FASTMEM directives and link with jemalloc/memkind libraries

```
On Edison (NERSC Cray XC30):
real, allocatable :: a(:,:), b(:,:), c(:)
!DIR$ ATTRIBUTES FASTMEM :: a, b, c
% module load memkind jemalloc
% ftn -dynamic -g -O3 -openmp mycode.f90
% export MEMKIND_HBW_NODES=0
% aprun -n 1 -cc numa_node numactl --membind=1 --cpunodebind=0 ./myexecutable

On Haswell:
Link with '-ljemalloc -lmemkind -lpthread -lnuma'
% numactl --membind=1 --cpunodebind=0 ./myexecutable
```

| Application | All memory on far memory | All memory on near memory | Key arrays on near memory |
|-------------|--------------------------|---------------------------|---------------------------|
| BerkeleyGW  | baseline                 | 52% faster                | 52.4% faster              |
| EmGeo       | baseline                 | 40% faster                | 32% faster                |
| XGC1        | baseline                 |                           | 24% faster                |

Bandwidth collection on Haswell. Now \*mostly\* not bandwidth bound.
#### General Exploration of two OpenMP regions

(Screenshot: Intel VTune Amplifier XE 2015, General Exploration viewpoint, grouped by OpenMP Region / Function / Call Stack. It lists clockticks, instructions retired, CPI and filled vs. unfilled pipeline slots (front-end bound, bad speculation, memory bound, core bound) for three `MAIN_$omp$parallel` regions and the serial code outside any region.)

The dynamic loop is now core bound, not memory bound. Removing the divide shows it to be the culprit!

# Conclusions

### **High Level Lessons**

1. Optimizing code for Cori is not always straightforward. It is a continual discovery process that involves many sequential and coupled changes.
### 2013 - Poor locality, loop ordering issues

### 2014 - Refactored loops, improved locality

### 2014 - Vectorized Code

### 2015 - Cache Blocking

### High Level Lessons

1. Optimizing code for Cori is not always straightforward. It is a continual discovery process that involves many sequential and coupled changes.
2. Use profiling tools like VTune and CrayPat on Edison to find and characterize hotspots.
3. Understanding the bandwidth and compute limitations of hotspots is key to deciding how to improve code.

# The End (Extra Slides)

### Why Are Complex Divides so Slow?

Code performance is now limited by complex divides. Why?

For the complex division in the performance-critical loop, I had already removed the explicit complex divide, but which is faster?

a)

    c = 1 / c

vs. b)

    r = c * conjg(c)
    r = 1 / r
    c = conjg(c) * r

c/d) Compiling with/without -fp-model fast=2

### Real-Division (with or without -fp-model fast=2)

### Complex-Division (with -fp-model fast=2)

### Approximation and Wall Time:

- a. Real Division: 6.37 seconds
- b. Complex Division: 4.99 seconds
- c. Complex Division + -fp-model fast=2: 5.30 seconds
- d.
Complex Division + -fp-model fast=2 + !dir\$ nounroll: 4.89 seconds

### Early NESAP (Advances with Cray and Intel)

(Figures: Thread Scaling in BerkeleyGW GPP Kernel on Xeon-Phi; BerkeleyGW FF Kernel Runtimes on Xeon and Xeon-Phi (Nathan).)

| Kernel         | Overall Improvement | Notes |
|----------------|---------------------|-------|
| BGW GPP Kernel | 0-10%               | Pretty optimized to begin with. Thread scalability improved by fixing ifort allocation performance. |
| BGW FF Kernel  | 2x-4x               | Unoptimized to begin with. Cache reuse improvements. |
| BGW Chi Kernel | 10-30%              | Moved threaded region outward in code. |
| BGW BSE Kernel | 10-50%              | Created custom vector matmuls. |

### **Early Lessons Learned**

- Cray and Intel were very helpful in profiling/optimizing the code. See following slides for using Intel resources effectively.
- Generating small tangible kernels is important for success.
- Targeting Many-Core greatly helps performance back on Xeon.
- Complex division is slow (particularly on KNC).

(Figure: BGW 1.0 vs 1.1 Sigma Performance)
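The two reciprocal formulations compared in the "Why Complex Divides so Slow?" slide can be written out in C99 as a minimal sketch. The function names `recip_direct` and `recip_via_real` are hypothetical, and the original kernel is Fortran, so this only illustrates the algebra (1/c = conj(c) / |c|^2). Which variant is faster depends on the compiler, flags and hardware, as the wall times above suggest.

```c
#include <assert.h>
#include <complex.h>

/* (a) Direct complex reciprocal: the compiler emits a full complex divide. */
static double complex recip_direct(double complex c) {
    return 1.0 / c;
}

/* (b) Reciprocal through a single real division.
   c * conj(c) is the real number |c|^2, so 1/c = conj(c) * (1 / |c|^2). */
static double complex recip_via_real(double complex c) {
    double r = creal(c) * creal(c) + cimag(c) * cimag(c);
    assert(r > 0.0);   /* c must be non-zero */
    r = 1.0 / r;       /* the only division */
    return conj(c) * r;
}
```

Variant (b) skips the overflow and underflow safeguards that a library-grade complex divide may include, so it should only be used where the magnitude of `c` is known to be well behaved.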