High Performance GPU Tensor Core Code Generation for Matmul using MLIR
Katel, Navdeep Kumar
State of the art in high-performance deep learning is primarily driven by highly tuned libraries. These libraries are hand-optimized by expert programmers using low-level abstractions, with significant effort, and much of that effort may have to be repeated for similar and future hardware. The process is thus not modular or reusable to the extent that compiler infrastructures like LLVM are. Manual optimization does not typically use a standard intermediate representation (IR) or transformations and passes on such IRs, although the optimizations performed can be encoded as a sequence of transformation steps and customized passes on an IR. We believe that until the recent introduction of MLIR (Multi-Level Intermediate Representation), IR infrastructure was not geared to tackle the problem of automatic generation of libraries in an effective manner. In particular, it was hard to represent and transform compute abstractions at high, middle, and low levels within a single IR. Multiple levels of abstraction in a single IR permit the user to apply transformations and optimizations at the most suitable level, and perhaps even to reuse them for different targets or front-ends. Some previous works have optimized matrix-matrix multiplication (matmul) for different GPU microarchitectures. All of these works exploit very low-level details of the hardware: some are written directly in assembly, while others combine CUDA C++ with inline assembly. While the set of high-level optimizations is the same, this dependence on low-level hardware details limits their reusability. Going against this trend, we show that, using a set of simple optimizations, suitable abstractions, and lowering passes on those abstractions in MLIR, we can obtain performance competitive with hand-written libraries.
To achieve this, we put together a lowering pipeline that can automatically generate code for matmul on NVIDIA GPUs (without hand-writing any code) while utilizing their tensor cores. We used and extended some existing utilities in MLIR, such as tiling, loop unrolling, loop permutation, and generation of fast memory buffers for input operands. Additional utilities, types, and operations necessary for optimal code generation were implemented from scratch. These include adding WMMA operations and types to provide fundamental support for programming tensor cores, adding loop normalization support, adding multi-level tiling support in the affine dialect, creating WMMA operations to load, compute, and store matrix products in a given matmul nest, detection and hoisting of invariant WMMA load-store pairs, hiding the latency of global-to-shared data movement, and support for mapping and converting parallel loops to warps. On the set of problem sizes we evaluated, we attain 95-119% and 80-160% of cuBLAS performance for FP32 and FP16 accumulate, respectively, on NVIDIA's Ampere-based GeForce RTX 3090. A similar evaluation on NVIDIA's Turing-based RTX 2080 Ti showed that we achieve 86-111% and 72-89% of cuBLAS for FP32 and FP16 accumulate, respectively. We take our approach further by fusing common pointwise operations with matrix-matrix multiplication. This is the first work to demonstrate fusion of operations for tensor core matmul using a systematic IR-based approach. Fusion is done with the support of additional WMMA operations that perform warp-level matrix operations such as ReLU and constant addition. We see significant gains on small to medium problem sizes when evaluating our fused kernels against a combination of library kernels and custom kernels. On Ampere, consumer fusion performance ranges from 95% to 167% of the respective reference implementations; the corresponding range on Turing is 84% to 150%.
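To give a concrete picture of the loop structure that multi-level tiling produces, the following is a minimal Python sketch of a two-level tiled matmul. The tile sizes and the block-tile/warp-tile hierarchy here are illustrative assumptions only; they are not the actual parameters or MLIR affine-dialect passes used in this work.

```python
import numpy as np

def tiled_matmul(A, B, TB=64, TW=32):
    # Two-level tiling sketch: an outer "block" tile (TB) and an inner
    # "warp" tile (TW). In the real pipeline these levels map to GPU
    # thread blocks and warps; here they are plain Python loops.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, TB):                      # block-level tiles
        for j0 in range(0, N, TB):
            for i1 in range(i0, min(i0 + TB, M), TW):   # warp-level tiles
                for j1 in range(j0, min(j0 + TB, N), TW):
                    for k0 in range(0, K, TW):          # k-dimension tiles
                        C[i1:i1+TW, j1:j1+TW] += (
                            A[i1:i1+TW, k0:k0+TW] @ B[k0:k0+TW, j1:j1+TW]
                        )
    return C
```

The innermost tile-by-tile product is the granularity at which a WMMA-style load/compute/store sequence would operate in the generated GPU code.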
We also present preliminary results, as a proof of concept, for producer fusion, i.e., fusion of pointwise operations on the inputs with matmul. The performance of ReLU on the C input fused with matmul, measured against a custom ReLU kernel followed by cuBLAS matmul, ranges from 98% to 138% on Ampere and from 91% to 133% on Turing. We believe these results can serve as a foundation and motivation for further research and development on automatic code and library generation using IR infrastructure for similar specialized accelerators.
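The producer-fusion idea can be sketched in plain Python, assuming the usual matmul form D = A·B + C with ReLU applied to the C operand. The tiling and function name below are hypothetical and purely illustrative; the point is that the pointwise operation is applied per tile inside the matmul loop nest, rather than by a separate elementwise kernel that first writes ReLU(C) back to memory.

```python
import numpy as np

def matmul_relu_on_c(A, B, C, tile=32):
    # Producer fusion sketch: ReLU is applied to each tile of C as it is
    # read inside the matmul nest, avoiding an extra kernel launch and a
    # full round trip of ReLU(C) through global memory.
    M, K = A.shape
    _, N = B.shape
    D = np.empty((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            acc = np.maximum(C[i:i+tile, j:j+tile], 0)   # fused producer op
            acc = acc + A[i:i+tile, :] @ B[:, j:j+tile]  # per-tile matmul
            D[i:i+tile, j:j+tile] = acc
    return D
```

In the unfused baseline described above, the same computation takes two kernels (a ReLU kernel, then cuBLAS matmul); fusion collapses it into one pass over the output tiles.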