James Ding
Mar 03, 2026 20:24
NVIDIA releases cuTile.jl, enabling Julia developers to write high-performance GPU kernels using tile-based programming, at near parity with Python performance.
NVIDIA has extended its tile-based GPU programming model to Julia developers with the release of cuTile.jl, an open-source package that matches the performance of its Python counterpart on compute-intensive workloads, reaching full parity on several core operations.
The package, developed in collaboration with JuliaGPU, represents the latest expansion of CUDA Tile—what NVIDIA has called the most significant addition to CUDA programming since the platform launched in 2006. While Python developers gained access to the tile-based model earlier this year, Julia’s scientific computing community can now tap into the same automatic hardware optimization.
Why Tile-Based Programming Matters
Traditional CUDA development forces programmers to manually manage threads, warps, and memory hierarchies. Tile-based programming flips this: developers describe operations on chunks of data, and the compiler handles hardware mapping automatically. This includes automatic access to Tensor Cores and Tensor Memory Accelerators—specialized hardware that previously required expert-level optimization.
The practical difference shows up in code complexity. A vector addition kernel in traditional CUDA.jl requires explicit thread indexing, bounds checking, and block configuration. The cuTile.jl equivalent reads more like standard array operations, with the compiler handling the low-level details.
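To make that contrast concrete, here is what the traditional side looks like: a vector-addition kernel written with CUDA.jl's documented kernel API, with the explicit thread indexing, bounds check, and launch configuration the article describes. This is a minimal sketch and requires an NVIDIA GPU to run.

```julia
using CUDA

# Traditional CUDA.jl kernel: the programmer computes a global index,
# guards against out-of-bounds threads, and returns nothing.
function vadd_kernel!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x  # 1-based global index
    if i <= length(c)                                      # manual bounds check
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

n = 1_000_000
a = CUDA.rand(Float32, n)
b = CUDA.rand(Float32, n)
c = CUDA.zeros(Float32, n)

# Manual launch configuration: the programmer picks the block size
# and computes how many blocks cover the array.
threads = 256
blocks = cld(n, threads)
@cuda threads=threads blocks=blocks vadd_kernel!(c, a, b)
```

The cuTile.jl equivalent would express the same addition as an operation on whole tiles, with the compiler choosing the hardware mapping; because that API is still experimental, it is not reproduced here.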
Benchmark Results on Blackwell Hardware
In testing on an NVIDIA GeForce RTX 5080 (Blackwell architecture), cuTile.jl matched Python performance across core operations:
Vector addition hit 838 GB/s versus Python’s 843 GB/s (99% parity). Matrix multiplication reached 50.9 TFLOPS against Python’s 50.5 TFLOPS—actually slightly faster. Matrix transpose achieved 98% parity at 797 GB/s.
Batch matrix multiply showed the largest gap at 91% (43.0 vs 47.5 TFLOPS), while complex control-flow kernels like layer normalization and FFT still need optimization work.
Technical Implementation
cuTile.jl uses a custom Julia compiler that intercepts standard library calls—operations like sum, reshape, and basic arithmetic—and routes them to Tile IR operations. This produces the same bytecode format as cuTile Python, feeding into NVIDIA’s tileiras compiler for final GPU machine code generation.
The design deliberately mirrors Python’s API structure, making documentation and code examples portable between languages. But it embraces Julia conventions where appropriate: 1-based indexing, broadcast syntax with dots (.^, .-, ./), and native integration with CUDA.jl for array management.
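The Julia conventions mentioned above are ordinary language features, not cuTile-specific syntax. A small CPU-only illustration of the indexing and broadcast style that, per the article, cuTile.jl kernels adopt:

```julia
# Plain Julia, no GPU required: 1-based indexing and fused
# elementwise broadcasting with dot syntax.
x = [1.0, 2.0, 3.0]
first_element = x[1]      # 1-based: this is 1.0, not x[0]
y = x .^ 2 .- x ./ 2      # elementwise square, minus elementwise half
# y == [0.5, 3.0, 7.5]
```

The dotted operators fuse into a single elementwise loop, which is the same mental model tile-based kernels encourage: describe the operation on the whole chunk of data and let the compiler handle the iteration.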
Current Limitations
This remains experimental software. Not all cuTile features work yet. Iterator-based for loops either fail or generate inefficient code. APIs may change without warning. The package requires Blackwell GPUs (compute capability 12.0+) and CUDA 13 drivers—hardware that most developers don’t have access to yet.
For Julia shops already invested in GPU computing through CUDA.jl, cuTile.jl offers a path toward simpler kernel development as Blackwell hardware becomes available. The package is available now through Julia’s package manager at github.com/JuliaGPU/cuTile.jl.
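Installation follows the usual Pkg workflow. A sketch, assuming the package is installable directly from the repository named in the article (whether it is also registered under the name "cuTile" is an assumption):

```julia
using Pkg

# Install straight from the GitHub repository cited in the article.
Pkg.add(url="https://github.com/JuliaGPU/cuTile.jl")

# If the package is registered, the shorter form would also work:
# Pkg.add("cuTile")
```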