Listed below are selected contributions, in reverse chronological order. For a complete CV/resume, see here (pdf, docx)
- Improved the performance of the HotQCD (Lattice Quantum Chromodynamics) framework by 1.4x on Intel Sapphire Rapids Processors with High Bandwidth Memory (HBM) by optimizing the lattice memory layout and adding software prefetching. (technical article)
- Improved the performance of MPAS-A (Model for Prediction Across Scales - Atmosphere) by 1.25x on Intel Sapphire Rapids Processors with High Bandwidth Memory (HBM) by applying code tuning techniques.
- Optimized STREAM benchmark kernels with the Intel Data Streaming Accelerator (DSA). (technical article)
- Delivered a 1.3x performance improvement to a weather application from the European Centre for Medium-Range Weather Forecasts (ECMWF) on Intel Ice Lake Processors through explicit AVX512 vectorization of prefix-sum/scan operations. A write-up of the prefix-sum optimization was published in Intel's Parallel Universe magazine.
- Directly contributed to an Intel silicon design win (valued at $M's) with a leading content recommendation provider (Taboola) by delivering targeted code optimizations to the Eigen/TensorFlow framework (white paper). The win was highlighted by Intel's CEO in a quarterly earnings call and covered by top media outlets (Reuters, VentureBeat).
- Implemented 8-bit integer matrix-matrix multiplication kernels using Vector Neural Network Instructions (VNNI) targeting Intel Cascade Lake processors. Improved the performance of a leading language translation model (Transformer-LT) by accelerating integer GEMM in TensorFlow. (Paper)
- Developed TACKLE (Thread Affinity Advisor, Checker, and Launcher), a tool that recommends optimal thread affinity, NUMA binding, and taskset settings for multi-instance deep learning inference deployments.
- Proposed and developed run-time profiling capabilities for the Intel MKL-DNN library. This feature has been upstreamed and is widely used by the community.
- Ported MKL FFTs to the TensorFlow C++ backend, delivering performance gains of 6x and addressing a competitive threat to Intel silicon for a leading medical imaging provider.
- Optimized Intel Math Kernel Library (MKL) matrix-matrix, matrix-vector, and vector-vector operations spanning multiple generations of Intel CPU microarchitectures (AVX: 256-bit SIMD; AVX2: 256-bit SIMD + FMA; AVX512: 512-bit SIMD + FMA).
- Designed and developed a compiler SIMD-intrinsics framework for MKL BLAS optimizations.
- Played a key role in implementing bitwise numerical reproducibility of floating-point operations in MKL BLAS under variable data-alignment conditions. Developed a test suite that validates bitwise-accurate results across all BLAS routines for multiple precisions (FP32, FP64, Complex64, Complex128), arbitrary input sizes, and memory-alignment offsets.
Open-source software/contributions:
- Using Intel Data Streaming Accelerator (DSA) to accelerate memory bandwidth-bound/STREAM kernels - https://github.com/vamsi-sripathi/stream-dsa
- Benchmarks to measure memory bandwidth performance on CPUs and GPUs - https://github.com/intel/memory-bandwidth-benchmarks
- TACKLE (Thread Affinity Advisor, Checker, and Launcher) - https://github.com/vamsi-sripathi/tackle
- Optimized AVX512 SIMD implementations of prefix-sum and argmax - https://github.com/vamsi-sripathi/simd
- Improved the performance of N-dimensional tensor broadcast operations by 4-30x using SIMD techniques in Eigen framework.
- Improved the performance of RNN workloads by enabling TensorFlow to use Intel MKL as the backend engine for accelerating linear algebra functions; ported Intel MKL matrix-matrix multiplication APIs (GEMM, batched GEMM) to the TensorFlow framework.
- Delivered performance gains of 1.75x (AlexNet topology) for Intel Caffe framework on Intel Xeon Phi architecture (KNL) through SIMD optimization of data transformation operations.
- Improved the performance of the k-means clustering algorithm used in a geospatial data application by 2.7x on the Intel Xeon Phi (KNL) architecture through OpenMP parallelization and vectorization.
Tech Talks:
- Optimization of ACRANEB2 radiation kernel on Intel Xeon Processors. Delivered at ECMWF's 19th Workshop on High Performance Computing in Meteorology, Fall 2021. (video, slides)
- Optimization of TensorFlow-Serving Application on Intel Xeon Scalable Processors. Delivered at Intel Extreme Performance Users Group (IXPUG), Fall 2018. (slides)
- TensorFlow Performance Optimizations on Intel Architectures. Delivered at an Argonne Leadership Computing Facility (ALCF) Developer Session, July 2018. (slides)
- Scalable Algorithms for Clustering Large Geospatiotemporal Data Sets on Manycore Architectures. Invited talk at 7th Workshop on Data Mining in Earth System Science, part of IEEE International Conference on Data Mining 2017. (slides)
Select Publications/Articles:
- Vamsi Sripathi, Accelerating Memory-Bandwidth-Bound Kernels Using the Intel Data Streaming Accelerator. Published in Parallel Universe Magazine, Issue 57, July 2024.
- Vamsi Sripathi, Optimize QCD Performance on Intel Processors with HBM. Published in Parallel Universe Magazine, Issue 53, July 2023.
- Vamsi Sripathi, Optimizing the Maxloc Operation Using Intel AVX512 Instructions. Published in Parallel Universe Magazine, Issue 46, October 2021.
- Vamsi Sripathi, Ruchira Sasanka, Optimization of Scan Operations Using Explicit Vectorization. Published in Parallel Universe Magazine, Issue 44, April 2021.
- Vikram A Saletore, Deepthi Karkada, Vamsi Sripathi, Kushal Datta, and Ananth Sankaranarayanan, Boosting Deep Learning Training & Inference Performance on Intel Xeon and Intel Xeon Phi Processors (link)
- Sarat Sreepathi, Jitendra Kumar, Forrest M. Hoffman, Richard T. Mills, Vamsi Sripathi, William W. Hargrove, Parallel Multivariate Spatio-Temporal Clustering of Large Ecological Datasets on Hybrid Supercomputers. IEEE Cluster 2017. (link)
- Murat Guney, Sarah Knepper, Kazushige Goto, Vamsi Sripathi, Greg Henry, and Shane Story. Batched Matrix-Matrix Multiplication Operations for Intel Xeon Processor and Intel Xeon Phi Co-Processor, 2015 SIAM Conference on Applied Linear Algebra. (link)
- Sarat Sreepathi, Vamsi Sripathi, Richard Mills, Glenn Hammond, G. Kumar Mahinthakumar, SCORPIO: A Scalable Two-Phase Parallel I/O Library With Application to a Large Scale Subsurface Simulator, IEEE Conference on High Performance Computing (HiPC) 2013, Bengaluru, India. (link)
- R. T. Mills, G. E. Hammond, P. C. Lichtner, V. Sripathi, G. Mahinthakumar and B. F. Smith, Modeling subsurface reactive flows using leadership-class computing, Journal of Physics, 2009. (link)
- R. T. Mills, V. Sripathi, G. Mahinthakumar, G. E. Hammond, P. C. Lichtner, B. F. Smith, Engineering PFLOTRAN for Scalable Performance on Cray XT and IBM BlueGene Architectures, Proceedings of SciDAC 2010, July 11-15, 2010, Chattanooga, TN, USA. Invited paper.
Masters Thesis:
Performance Analysis and Optimization of Parallel I/O in a Large Scale Groundwater Application on Petascale Architectures (link)