Scalability Bottleneck Analysis of High Performance Applications
Abstract
Obtaining high performance and scalability for high performance applications are challenging.
There are various bottlenecks including, higher rate of memory access, complex algorithm, high
rate of communication, big messages over communication, write-write con
ict etc. that ham-
per scalability. Consequently, there is a critical need for performance prediction and scalability
bottleneck analysis to enable scientists to understand impediments to performance on emerging
systems. Performance predictions for large problem sizes and processors using limited small
scale runs are useful for a variety of purposes including scalability projections, and help in min-
imizing the time taken for constructing training data for building performance models. Many
parallel applications often suffer from latent performance limitations that may prevent them
from scaling to larger machine sizes. Often, such scalability bugs manifest themselves only when
an attempt to scale the code is actually being made, where remediation can be substantially
difficult. Hence, performance predictions for such applications are challenging.
One method is to build analytical models based on the knowledge of the application charac-
teristics. This requires laborious constructions of the models, and the models are difficult to
build for large-scale parallel applications. Such models are also error-prone due to manual in-
tervention. Hence, the model generation in recent years is mainly based on empirical methods.
The model is generated using a base equation with some parameters and those parameters are
learned from some training runs on the cores which are already available. The trained model is
then used to predict the performance for higher cores. Thus, most of the existing methods rely
heavily on the basis the subtleties of extrapolations for achieving good prediction accuracies.
Techniques that can predict performance with reasonable accuracy may fail to capture the
scalability trends of the applications at higher scales. In general, if a scalability bottleneck is
hidden and can only manifest at larger core counts, it is hard for extrapolation techniques to
correctly capture the trend from only the performance of the training runs. Static code analysis
is needed to capture the application characteristics at different scales.
In this thesis, we develop strategies for both performance predictions and scalability projections
including for applications whose scalability trends can change at scale due to the presence of
scalability bottlenecks. For performance predictions, two strategies are explored for building
performance models using executions on smaller number of cores. In the rst strategy, the orig-
inal application is executed on smaller number of cores, and a performance model is developed
using these executions. In the second strategy, common benchmark kernels are executed on
smaller number of cores and a benchmark kernel that matches with the application in terms of
execution characteristics like instruction counts is chosen. The performance model of the chosen
benchmark kernel is then used to predict performance of the application for larger scales. For
application specifi c extrapolation the resuls show that the error percentage stays around 15%
but the prediction function fails to capture the trend shown by the applications. The correla-
tion coeffecient between actual and predicted behaviour of AMG (Adaptive Multigrid) is around
75%. For Benchmark based prediction, the error percentage is better that simple extrapolation
method and stays below 10% for SMG (Semicoarsening Multigrid) but this method also fails to
capture the trend shown by the application at high cores. The correlation coefficient between
actual and predicted behaviour of SMG is around 90%.
For scalability projections, the different phases of the application are instrumented to collect
various parameters including time stamps and number of invocations of the phase executions.
These phases could be function calls or loops. The phase and parameter values of a phase that
follows the scalability trend of the application for different number of cores are identi fied. For
example, the number of invocations of a particular function call may follow the scalability trend
of the overall application for different number of cores. Static analysis is performed to map the
identfii ed parameter values in terms of the input parameters, namely, problem size and number
of cores. A performance model is then built for the parameter value, e.g., number of invocations
of a function call, in terms of the input parameters. The performance model is then used to
project the scalability trend for larger number of cores. Since this performance model is built
for the parameter that follows the scalability trend of the application, the model can be reliably
used for scalability projections and identifying scalability bottlenecks. We demonstrated the
techniques with three scienti c applications, namely, SMG (Semi-Coarsening MultiGrid), AMG
(Adaptive MultiGrid) and GTC (Gyrokinetic Toroidal Code).
In application speci c metric based prediction, the results show prediction errors below 10%
for all the above applications considered and also show that that the the scalability trends of
the applications are captured well by the predictions. The correlation coefficient of the actual
and predicted trends at high core is at least 97% while for other methods also considered this
value is well less than 90% for the applications.