Scalability Bottleneck Analysis of High Performance Applications
Obtaining high performance and scalability for high performance applications are challenging. There are various bottlenecks including, higher rate of memory access, complex algorithm, high rate of communication, big messages over communication, write-write con ict etc. that ham- per scalability. Consequently, there is a critical need for performance prediction and scalability bottleneck analysis to enable scientists to understand impediments to performance on emerging systems. Performance predictions for large problem sizes and processors using limited small scale runs are useful for a variety of purposes including scalability projections, and help in min- imizing the time taken for constructing training data for building performance models. Many parallel applications often suffer from latent performance limitations that may prevent them from scaling to larger machine sizes. Often, such scalability bugs manifest themselves only when an attempt to scale the code is actually being made, where remediation can be substantially difficult. Hence, performance predictions for such applications are challenging. One method is to build analytical models based on the knowledge of the application charac- teristics. This requires laborious constructions of the models, and the models are difficult to build for large-scale parallel applications. Such models are also error-prone due to manual in- tervention. Hence, the model generation in recent years is mainly based on empirical methods. The model is generated using a base equation with some parameters and those parameters are learned from some training runs on the cores which are already available. The trained model is then used to predict the performance for higher cores. Thus, most of the existing methods rely heavily on the basis the subtleties of extrapolations for achieving good prediction accuracies. Techniques that can predict performance with reasonable accuracy may fail to capture the scalability trends of the applications at higher scales. In general, if a scalability bottleneck is hidden and can only manifest at larger core counts, it is hard for extrapolation techniques to correctly capture the trend from only the performance of the training runs. Static code analysis is needed to capture the application characteristics at different scales. In this thesis, we develop strategies for both performance predictions and scalability projections including for applications whose scalability trends can change at scale due to the presence of scalability bottlenecks. For performance predictions, two strategies are explored for building performance models using executions on smaller number of cores. In the rst strategy, the orig- inal application is executed on smaller number of cores, and a performance model is developed using these executions. In the second strategy, common benchmark kernels are executed on smaller number of cores and a benchmark kernel that matches with the application in terms of execution characteristics like instruction counts is chosen. The performance model of the chosen benchmark kernel is then used to predict performance of the application for larger scales. For application specifi c extrapolation the resuls show that the error percentage stays around 15% but the prediction function fails to capture the trend shown by the applications. The correla- tion coeffecient between actual and predicted behaviour of AMG (Adaptive Multigrid) is around 75%. For Benchmark based prediction, the error percentage is better that simple extrapolation method and stays below 10% for SMG (Semicoarsening Multigrid) but this method also fails to capture the trend shown by the application at high cores. The correlation coefficient between actual and predicted behaviour of SMG is around 90%. For scalability projections, the different phases of the application are instrumented to collect various parameters including time stamps and number of invocations of the phase executions. These phases could be function calls or loops. The phase and parameter values of a phase that follows the scalability trend of the application for different number of cores are identi fied. For example, the number of invocations of a particular function call may follow the scalability trend of the overall application for different number of cores. Static analysis is performed to map the identfii ed parameter values in terms of the input parameters, namely, problem size and number of cores. A performance model is then built for the parameter value, e.g., number of invocations of a function call, in terms of the input parameters. The performance model is then used to project the scalability trend for larger number of cores. Since this performance model is built for the parameter that follows the scalability trend of the application, the model can be reliably used for scalability projections and identifying scalability bottlenecks. We demonstrated the techniques with three scienti c applications, namely, SMG (Semi-Coarsening MultiGrid), AMG (Adaptive MultiGrid) and GTC (Gyrokinetic Toroidal Code). In application speci c metric based prediction, the results show prediction errors below 10% for all the above applications considered and also show that that the the scalability trends of the applications are captured well by the predictions. The correlation coefficient of the actual and predicted trends at high core is at least 97% while for other methods also considered this value is well less than 90% for the applications.