Honey, I Tiled the Tensors

Shapes, Strides, Swizzles and Suffering! - An intro to Layout Algebra


Layouts are a powerful abstraction introduced in NVIDIA’s CuTe library for making operations on complicated tensor configurations a little easier to understand. My goal here is to give a good taste of how operations on these layouts work, along with an example matrix-matrix multiplication kernel that shows the value of these abstractions and their drawbacks. This is not a full mathematical analysis of layouts; for that, a good reference is “A note on the algebra of CuTe Layouts”.

GPUs don’t know about tensors and their structures; operations are only performed on linear memory, so tensor structure has to be maintained by the kernels themselves. For example, for a row-major matrix the matrix indices $(i, j)$ become the flat index $i \cdot C + j$, where $C$ is the number of columns. These kinds of mappings are pivotal in building GPU kernels, since the shapes and structures of tensors can get very complicated.

A layout is nothing but the combination of the shape and the stride (the number of jumps to go from one element to the next in a dimension) of a tensor. It defines a mapping from the tensor coordinate space to flat array indices. Here shape and stride are tuples of matching dimensions. Calling the shape $S$ and the stride $D$, we write

$$L = S : D$$

For example, let’s start with matrices: a row-major $n \times m$ matrix is represented as $(n, m) : (m, 1)$. Here we can see that increasing the first (row) coordinate by $1$ increases the flat index by $m$, while increasing the second (column) coordinate increases it by only $1$. The flat index for $(i, j)$ is $i \times m + j = i \times m + j \times 1$. More generally, we can see from how stride is defined that the flat index $\text{offset}$ given indices $i = \{ i_0, i_1, ..., i_d\}$ and stride $s = \{s_0, s_1, ..., s_d\}$ is

$$I = \sum_k i_k s_k = i \cdot s$$
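As a toy illustration (plain Python, not CuTe), the offset computation is just a dot product of the coordinate with the stride:

```python
# Toy sketch (plain Python, not CuTe): the flat offset of a coordinate
# is the dot product of the coordinate tuple with the stride tuple.
def flat_offset(coord, stride):
    return sum(i * d for i, d in zip(coord, stride))

# Row-major 3x4 matrix: layout (3, 4) : (4, 1).
assert flat_offset((1, 2), (4, 1)) == 6  # 1*4 + 2*1
# Column-major 3x4 matrix: layout (3, 4) : (1, 3).
assert flat_offset((1, 2), (1, 3)) == 7  # 1*1 + 2*3
```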

Similarly, a column-major matrix is represented as $(n, m) : (1, n)$. Layouts can themselves be made out of other layouts. Let’s analyze an example: $((2,2), (2,2)) : ((1,4), (2,8))$. Okay! Let’s try some coordinates.

| Coordinate $((i_0,i_1),(j_0,j_1))$ | Calculation | Physical Offset |
|---|---|---|
| $((0,0),(0,0))$ | $0(1) + 0(4) + 0(2) + 0(8)$ | 0 |
| $((0,1),(0,0))$ | $0(1) + 1(4) + 0(2) + 0(8)$ | 4 |
| $((1,0),(0,0))$ | $1(1) + 0(4) + 0(2) + 0(8)$ | 1 |
| $((1,1),(0,0))$ | $1(1) + 1(4) + 0(2) + 0(8)$ | 5 |
| $((0,0),(1,0))$ | $0(1) + 0(4) + 1(2) + 0(8)$ | 2 |
| $((0,1),(1,0))$ | $0(1) + 1(4) + 1(2) + 0(8)$ | 6 |
| $((1,0),(1,0))$ | $1(1) + 0(4) + 1(2) + 0(8)$ | 3 |
| $((1,1),(1,0))$ | $1(1) + 1(4) + 1(2) + 0(8)$ | 7 |

As we can see, this is an interleaved (or swizzled) layout. These kinds of layouts are commonly used in GPU kernels to prevent memory bank conflicts.

Visualization of the swizzled layout

This works, but doing offset calculations on a particularly complex layout is inconvenient. Another way of visualizing this layout is to look at the groups individually: the first $(2, 2)$ group has stride $(1, 4)$, creating the traversal $0 \rightarrow 4 \rightarrow 1 \rightarrow 5$, and this pattern is repeated in a $(2, 2)$ arrangement of blocks, with the blocks themselves ordered $0 \rightarrow 2 \rightarrow 1 \rightarrow 3$. This can be seen by factorizing the expanded $\text{offset}$.

$$\text{offset} = i_0 \cdot 1 + i_1 \cdot 4 + j_0 \cdot 2 + j_1 \cdot 8$$

$$\text{offset} = \underbrace{(i_0 + 2 j_0)}_{r} \cdot 1 + \underbrace{(i_1 + 2 j_1)}_{c} \cdot 4$$

This reduces a 4-D layout into the 2-D layout $(4, 4) : (1, 4)$, which is easy to visualize. But this is not possible for all layouts.
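We can sanity-check this equivalence by brute force (a plain-Python sketch, not CuTe):

```python
# Brute-force check that ((2,2),(2,2)) : ((1,4),(2,8)) and (4,4) : (1,4)
# describe the same mapping, under r = i0 + 2*j0 and c = i1 + 2*j1.
def offset_4d(i0, i1, j0, j1):
    return i0 * 1 + i1 * 4 + j0 * 2 + j1 * 8

def offset_2d(r, c):
    return r * 1 + c * 4

assert all(
    offset_4d(i0, i1, j0, j1) == offset_2d(i0 + 2 * j0, i1 + 2 * j1)
    for i0 in range(2) for i1 in range(2)
    for j0 in range(2) for j1 in range(2)
)
```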

Layouts can also be operated on. This is the algebra part of layout algebra. We can have functions that map from one layout to another. Such a function is equivalent to a mapping from one flat index space to another. Hence

$$f : L_1 \rightarrow L_2 \implies f(i) : I(L_1) \rightarrow I(L_2)$$

Let’s go through a few of these operations.

Coalescing

Coalescing simplifies a layout. More formally, defining the rank of a layout as the number of modes it has, the coalesce operation reduces the number of modes, i.e. reduces the rank. For example, the layout $(2, 4) : (1, 2)$ is the same as $8 : 1$ for 1-D coordinates. So how do we get a coalesced layout? Say we have a layout $L = (s_0, s_1, ...) : (d_0, d_1, ...)$ with no nesting. When can a pair of adjacent modes be reduced into one? When the two modes are contiguous, i.e. when the next mode’s stride equals the total span of flat indices of the former mode, so that the two line up perfectly. So $s_0 : d_0$ and $s_1 : d_1$ can be reduced iff

$$d_1 = s_0 \times d_0$$

and the two modes merge into $s_0 \times s_1 : d_0$. We can also see that a shape of $1$ is an identity in this reduction, i.e. $1 : d'$ and $s : d$ reduce to $s : d$. This is useful when we need to simplify some particularly gnarly tensor layouts. Coalescing can also be done by-mode, i.e. we can reduce only some of the ranks, e.g. going from rank $4 \rightarrow 2$.
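A minimal coalesce for flat (non-nested) layouts might look like this (a plain-Python sketch, not the CuTe implementation):

```python
# Coalesce a flat layout: drop shape-1 modes, then merge adjacent modes
# (s0:d0) and (s1:d1) into (s0*s1:d0) whenever d1 == s0 * d0.
def coalesce(shape, stride):
    modes = [(s, d) for s, d in zip(shape, stride) if s != 1]
    out = []
    for s, d in modes:
        if out and d == out[-1][0] * out[-1][1]:
            prev_s, prev_d = out.pop()
            out.append((prev_s * s, prev_d))
        else:
            out.append((s, d))
    if not out:                 # everything was shape-1
        out = [(1, 0)]
    return tuple(s for s, _ in out), tuple(d for _, d in out)

assert coalesce((2, 4), (1, 2)) == ((8,), (1,))        # (2,4):(1,2) -> 8:1
assert coalesce((2, 1, 4), (1, 7, 2)) == ((8,), (1,))  # shape-1 mode drops out
assert coalesce((2, 4), (1, 4)) == ((2, 4), (1, 4))    # not contiguous: unchanged
```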

Composition

Composition chains two layouts together. Given a layout $A$ and a layout $B$, the composition $B \circ A$ creates a new layout that first maps coordinates through $A$ and then uses those resulting indices as coordinates into $B$. In other words, the output indices of $A$ become the input coordinates of $B$.

$$(B \circ A)(i) = B(A(i))$$

Why is this useful? Composition is fundamental to the concept of tiling, where we partition large data into smaller chunks that can be loaded into the GPU’s shared memory, or further into registers. We can do this with the composition operation, as we’ll see in later sections.

Let’s work through an example. Take $A = (3, 2) : (1, 3)$ and $B = 6 : 2$. Layout $A$ maps 2-D coordinates to 1-D indices in a $3 \times 2$ arrangement. Layout $B$ maps a 1-D coordinate to the offsets $\{0, 2, 4, 6, 8, 10\}$. The composition $B \circ A$ should give us a layout that maps 2-D coordinates directly to the physical offsets.

| Coordinate $(i, j)$ | $A(i,j)$ | $B(A(i,j))$ |
|---|---|---|
| $(0, 0)$ | 0 | 0 |
| $(1, 0)$ | 1 | 2 |
| $(2, 0)$ | 2 | 4 |
| $(0, 1)$ | 3 | 6 |
| $(1, 1)$ | 4 | 8 |
| $(2, 1)$ | 5 | 10 |

We can verify this is the layout $(3, 2) : (2, 6)$. Now let’s build up the general composition rules.

Single-mode B

In the simple case where $B = s_B : d_B$ is a single mode, composition just scales the strides of $A$ by $d_B$:

$$(s_B : d_B) \circ (S_A : D_A) = S_A : (d_B \cdot D_A)$$

This works because $B$ is linear ($B(k) = k \cdot d_B$), so $B(A(i)) = d_B \cdot A(i)$. This is the case we saw in the example above: $(6:2) \circ ((3,2):(1,3)) = (3,2):(2,6)$.

Single-mode A

When $B$ has multiple modes, the output of $A$ needs to be decomposed into coordinates for $B$. Given $B = (s_0, s_1, ...) : (d_0, d_1, ...)$ and $A = s : d$, the flat indices from $A$ (which are $\{0, d, 2d, ..., (s-1)d\}$) are split across $B$’s modes using its shapes. This can only be done if either $d \mid s_0$ or $s_0 \mid d$. Without this, $A$’s indices don’t cleanly align with $B$’s mode boundaries and the composition is undefined.

There are two cases:

Case 1: $s_0 \mid d$ (stride skips past the first mode). Since $d$ is a multiple of $s_0$, every index $k \cdot d$ has $k \cdot d \bmod s_0 = 0$, so $B$’s first-mode coordinate is always zero. We skip it entirely and recurse with a reduced stride:

$$((s_0, s_1, ...) : (d_0, d_1, ...)) \circ (s : d) = ((s_1, ...) : (d_1, ...)) \circ (s : d / s_0)$$

Case 2: $d \mid s_0$ (stride fits within the first mode). Let $q = s_0 / d$. The first $q$ indices from $A$ cycle through $B$’s first mode before overflowing into the next.

  • If $s \leq q$: all indices fit in the first mode. Result: $s : (d_0 \cdot d)$
  • If $s > q$ and $q \mid s$: the first mode fills completely, and the rest recurses: $B \circ (s : d) = \big(q : d_0 \cdot d,\ ((s_1, ...) : (d_1, ...)) \circ (s / q : 1)\big)$ (see footnote 1)

Let’s trace through a concrete example. Take $B = (4, 3) : (1, 8)$ and $A = 6 : 2$.

We have $d = 2$ and $s_0 = 4$. Since $d \mid s_0$, we’re in Case 2 with $q = 4 / 2 = 2$. Since $s = 6 > q = 2$ and $q \mid s$:

$$((4, 3) : (1, 8)) \circ (6 : 2) = \big(2 : 1 \cdot 2,\ ((3,) : (8,)) \circ (3 : 1)\big)$$

For the recursive step, $d = 1$ and $s_0 = 3$, so $q = 3$ and $s = 3 \leq q$:

$$((3,) : (8,)) \circ (3 : 1) = 3 : 8$$

Combining: $B \circ A = (2, 3) : (2, 8)$.
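Rather than implementing the recursion, we can sanity-check both worked examples by evaluating the layouts as functions on their flat (colexicographic) index space (a plain-Python sketch, not the library itself):

```python
# A layout as a function: map a 1-D colexicographic index to an offset.
def layout_fn(shape, stride):
    def f(k):
        off = 0
        for s, d in zip(shape, stride):
            off += (k % s) * d
            k //= s
        return off
    return f

# First example: (6:2) o ((3,2):(1,3)) = (3,2):(2,6)
A1, B1 = layout_fn((3, 2), (1, 3)), layout_fn((6,), (2,))
C1 = layout_fn((3, 2), (2, 6))
assert [B1(A1(k)) for k in range(6)] == [C1(k) for k in range(6)]

# Second example: ((4,3):(1,8)) o (6:2) = (2,3):(2,8)
A2, B2 = layout_fn((6,), (2,)), layout_fn((4, 3), (1, 8))
C2 = layout_fn((2, 3), (2, 8))
assert [B2(A2(k)) for k in range(6)] == [C2(k) for k in range(6)]
```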

General composition: left-associative reduction

For the fully general case where both $A$ and $B$ are multi-mode, we reduce it to the cases above by processing $A$’s modes left to right. Given $A = (s_0^A, s_1^A, ...) : (d_0^A, d_1^A, ...)$, we compose each mode of $A$ one at a time (see footnote 1):

$$B \circ A = \big(B \circ (s_0^A : d_0^A),\ B_\text{rest} \circ (s_1^A : d_1^A),\ ...\big)$$

Each single-mode composition consumes some of $B$’s leading modes (via the cases above), and the unconsumed remainder $B_\text{rest}$ carries forward for the next mode of $A$.

Composition of a layout with a smaller tile layout gives us a single first tile from the original layout; we need some operation that can also cover the rest of the layout.

Complement

Composition always selects a subset of indices from the index space of $A$; what about the leftovers? This is where the complement operation comes into the picture. Complement is taken with respect to a size $M$: given a layout $A$, the complement $A^*_M$ is the layout that fills up the rest of the $M$ indices not covered by $A$. Complement only makes sense if the gaps left in the $M$ indices can be filled up perfectly by a layout. For a layout $A = (s_0, s_1, ..., s_n) : (d_0, d_1, ..., d_n)$, this means $s_i \cdot d_i \mid d_{i+1}$ and $s_n \cdot d_n \mid M$, i.e. the inner strides fit into the outer strides. We can then calculate the complement of $A$ as

$$A^*_M = \left(d_0,\ \frac{d_1}{s_0 d_0},\ \frac{d_2}{s_1 d_1},\ ...,\ \frac{M}{s_n d_n}\right) : (1,\ s_0 d_0,\ s_1 d_1,\ ...,\ s_n d_n)$$
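The formula translates directly into code (a plain-Python sketch; it assumes the layout’s modes satisfy the divisibility conditions above):

```python
# Complement of A = S:D with respect to size M, per the formula above.
# Assumes s_i*d_i divides d_{i+1} and s_n*d_n divides M.
def complement(shape, stride, M):
    out_shape, out_stride = [stride[0]], [1]
    for i in range(1, len(shape)):
        out_shape.append(stride[i] // (shape[i - 1] * stride[i - 1]))
        out_stride.append(shape[i - 1] * stride[i - 1])
    out_shape.append(M // (shape[-1] * stride[-1]))
    out_stride.append(shape[-1] * stride[-1])
    return tuple(out_shape), tuple(out_stride)

# Complement of 4:1 in 12 elements: (1,3):(1,4), which coalesces to 3:4.
assert complement((4,), (1,), 12) == ((1, 3), (1, 4))
# Complement of 2:4 in 16 elements: (4,2):(1,8).
assert complement((2,), (4,), 16) == ((4, 2), (1, 8))
```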

Here is an example of a complement. Note that we can coalesce away the $s_0 : d_0 = 1 : 1$ mode in this example. The gray part is the original layout and the colored indices are its complement.

Visualization of complement of a layout (from NVIDIA)

Division

The division operation splits a layout $B$ into equal-sized tiles defined by a layout $A$. As we have seen previously, composing a layout with a tiler gives the first tile; to get all of them, we also compose with the complement with respect to a size $M$.

$$B \oslash A := B \circ (A, A^*_M)$$

This splits each mode of $B$ in two: the first mode indexes within a tile (from the composition) and the second indexes across tiles (from the complement). For a mode of $B$ with shape $s_B$ and stride $d_B$, and tiler shape $t$, we get

  • Within-tile: shape $t$, stride $d_B$
  • Across-tiles: shape $s_B / t$, stride $t \cdot d_B$

The original mode becomes a pair: $(t, s_B / t) : (d_B, t \cdot d_B)$. This operation doubles the rank of the layout. Let’s work through a few examples, first a simple 1-D case: $B = 12 : 1$ and $A = 4 : 1$, so $A$ should divide $B$ into 3 pieces. Hence:

Complement: $A^*_{12} = 3 : 4$ (the 3 tile offsets $\{0, 4, 8\}$).

Concatenating: $(A, A^*) = (4, 3) : (1, 4)$.

Composing: $B \circ ((4, 3) : (1, 4)) = (4, 3) : (1, 4)$.

The first mode (size 4, stride 1) gives the within-tile indices $\{0, 1, 2, 3\}$. The second mode (size 3, stride 4) gives the tile offsets $\{0, 4, 8\}$. Three tiles of four elements each, covering all 12.

1D Example
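A quick enumeration confirms the tiling (a plain-Python sketch):

```python
# Enumerate every offset produced by the divided layout (4,3):(1,4):
# mode 0 walks within a tile, mode 1 jumps between tile origins.
tiles = [[j * 1 + k * 4 for j in range(4)] for k in range(3)]

assert tiles == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
# Together, the 3 tiles of 4 elements cover all 12 offsets exactly once.
assert sorted(x for t in tiles for x in t) == list(range(12))
```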

Let’s also work through a 2-D example: $B = (4, 6) : (1, 4)$ (a 4×6 column-major matrix) and tiler $A = (2, 3)$ (2×3 tiles). Mode by mode:

  • Mode 0: shape 4, tiler 2 → $(2, 2) : (1, 2)$ — 2 rows in-tile, 2 tile-rows
  • Mode 1: shape 6, tiler 3 → $(3, 2) : (4, 12)$ — 3 cols in-tile, 2 tile-cols

Logical divide result: $((2, 2), (3, 2)) : ((1, 2), (4, 12))$
2D Example

This is rank-4. Each original mode became a nested $(\text{intra}, \text{inter})$ pair.

This layout is difficult to work with due to its nesting, so we can reduce it to easier forms.

Zipped, Tiled and Flat Divides

These are just convenience representations of the logical divide operation. Say a layout $L$ has shape $(m, n, l, ...)$ and a tiler $T$ has shape $(t_m, t_n)$. Writing $r_m = m / t_m$ and $r_n = n / t_n$ for the rest modes, the convenience forms are as below:

  • Logical: $((t_m, r_m), (t_n, r_n), ...)$
  • Zipped: $((t_m, t_n), (r_m, r_n, l, ...))$
  • Tiled: $((t_m, t_n), r_m, r_n, l, ...)$
  • Flat: $(t_m, t_n, r_m, r_n, l, ...)$

Tiled divide is especially interesting: it regroups the modes by role — all intra-tile modes together, all inter-tile modes together. The result is a clean two-level hierarchy — first group is “what’s inside a tile,” second group is “which tile.” This is the workhorse form that local_tile and partition_* use under the hood. It’s what you want when different parts of the kernel need to reason about tiles independently — the CTA picks its tile via the inter-tile mode, then threads work within it via the intra-tile mode.

Product

Where division breaks a layout into tiles, product does the opposite — it replicates a layout to fill a larger space. Given a layout AA (the atom to replicate) and a layout BB (the replication pattern), the product is

$$A \otimes B := (A, A^* \circ B)$$

The result is a two-mode layout: mode 0 is the atom $A$ itself and mode 1 is $A^* \circ B$, which describes how the copies of $A$ are arranged. The complement $A^*$ finds the gaps between elements of $A$, and composing with $B$ maps the replication pattern into those gaps.

Let’s work through a 1-D example. Take $A = 4 : 1$ (a contiguous atom of 4 elements) and $B = 3 : 1$ (replicate 3 times).

First we need the complement of $A$ with respect to $M = \text{size}(A) \times \text{size}(B) = 12$. We get $A^*_{12} = 3 : 4$ — the three offsets $\{0, 4, 8\}$ where copies begin.

$$A^* \circ B = (3 : 4) \circ (3 : 1) = 3 : 4$$

So the product is $A \otimes B = (4, 3) : (1, 4)$. Notice anything? This is the same layout we got from dividing $12 : 1$ by $4 : 1$. That’s the duality — division splits a layout into tiles, product builds one up from tiles. Two sides of the same coin.
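The duality is easy to verify numerically (a plain-Python sketch):

```python
# Product side: atom A = 4:1 covers offsets {0,1,2,3}; A*_12 o B = 3:4
# gives the copy origins {0,4,8}. The product (4,3):(1,4) lays the
# copies end to end.
atom = [a * 1 for a in range(4)]
origins = [r * 4 for r in range(3)]
covered = sorted(o + a for o in origins for a in atom)

# Same 12 offsets as dividing 12:1 by 4:1 into three 4-element tiles.
assert covered == list(range(12))
```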

Where product really shines is building thread-value layouts for distributing work across GPU threads. Say we have 4 threads each handling 2 values. With thread layout $T = 4 : 1$ and value layout $V = 2 : 1$, the product $T \otimes V$ tells us which elements each thread owns.

$T^*_8 = 2 : 4$, so $T^* \circ V = (2 : 4) \circ (2 : 1) = 2 : 4$. The product is $(4, 2) : (1, 4)$.

Thread 0 handles elements $\{0, 4\}$, thread 1 handles $\{1, 5\}$, and so on — each thread’s values are spaced apart by the number of threads. This cyclic distribution is exactly what CuTe’s make_tiled_copy_tv builds internally from a thread layout and a value layout.
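Enumerating the ownership (a plain-Python sketch) makes the cyclic pattern explicit:

```python
# Thread-value layout (4,2):(1,4): thread t, value v lives at offset t + 4*v.
owned = {t: [t * 1 + v * 4 for v in range(2)] for t in range(4)}

assert owned[0] == [0, 4]
assert owned[1] == [1, 5]
assert owned[2] == [2, 6]
assert owned[3] == [3, 7]
```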

Blocked and Raked Products

For multi-dimensional layouts, the logical product works mode by mode just like division. Given an atom $A$ with shape $(a_m, a_n)$ and a replication pattern $B$ with shape $(b_m, b_n)$, the result has structure $((a_m, r_m), (a_n, r_n))$, where $r$ denotes the replica modes. Blocked and raked products are two ways of reassociating these modes.

Blocked product groups like modes with the atom inside: $((a_m, r_m), (a_n, r_n))$. Each atom occupies a contiguous block and replicas tile these blocks across the space. Think of it as “each thread gets a contiguous rectangle of elements.”

Raked product reverses the nesting: $((r_m, a_m), (r_n, a_n))$. The replicas interleave within the atom dimensions, creating a cyclic distribution: thread 0 gets element 0, thread 1 gets element 1, wrapping around. This is the classic GPU pattern for coalesced memory access — adjacent threads access adjacent memory locations.
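A 1-D analogue shows the difference (a plain-Python sketch; 2 threads with 2 values each over 4 elements):

```python
# Blocked: atom inside -> thread t owns a contiguous block {2t, 2t+1}.
blocked = {t: [v * 1 + t * 2 for v in range(2)] for t in range(2)}
# Raked: replicas inside -> thread t owns the cyclic set {t, t+2}.
raked = {t: [t * 1 + v * 2 for v in range(2)] for t in range(2)}

assert blocked == {0: [0, 1], 1: [2, 3]}
assert raked == {0: [0, 2], 1: [1, 3]}
```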

Zipped and Tiled Products

Same convenience regrouping as with division. Given a layout $L$ with shape $(m, n, l, ...)$ and a tiler $T$ with shape $(t_m, t_n)$:

  • Logical: $((m, t_m), (n, t_n), l, ...)$
  • Zipped: $((m, n), (t_m, t_n, l, ...))$
  • Tiled: $((m, n), t_m, t_n, l, ...)$
  • Flat: $(m, n, t_m, t_n, l, ...)$

Putting it all together

Now let’s see how all these layout operations come together in a real tiled GEMM kernel using CuTe’s Python DSL. We compute $C = A \times B$ where $A$ is $M \times K$, $B$ is $K \times N$ (stored as $N \times K$, i.e. row-major transposed), and $C$ is $M \times N$. The kernel tiles across all three dimensions with block tile sizes $(b_M, b_N, b_K)$.

Setup and Tiled Copies

First we define shared memory layouts and tiled copy operations for moving data from global memory (GMEM) to shared memory (SMEM):

```python
# Shared memory layouts - column-major for vectorized MMA access
sA_layout = cute.make_layout((b_m, b_k))  # (b_m, b_k) : (1, b_m)
sB_layout = cute.make_layout((b_n, b_k))  # (b_n, b_k) : (1, b_n)

# Thread layout for copy A: threads distributed across k-major order
tA = cute.make_layout(
    (num_threads // b_k, b_k), stride=(b_k, 1)
)
copy_atom_A = cute.make_copy_atom(
    cute.nvgpu.cpasync.CopyG2SOp(), dtype,
    num_bits_per_copy=dtype.width,
)
tiled_copy_A = cute.make_tiled_copy_tv(
    copy_atom_A, thr_layout=tA, val_layout=cute.make_layout((1, 1))
)

# Thread layout for copy B: with vectorized loads along n-major
num_vectorized = 4  # elements per vectorized load
copy_atom_B = cute.make_copy_atom(
    cute.nvgpu.cpasync.CopyG2SOp(), dtype,
    num_bits_per_copy=dtype.width * num_vectorized,
)
major_mode_size = b_n // num_vectorized
tB = cute.make_layout(
    (major_mode_size, num_threads // major_mode_size),
    stride=(1, major_mode_size),
)
tiled_copy_B = cute.make_tiled_copy_tv(
    copy_atom_B, thr_layout=tB,
    val_layout=cute.make_layout((num_vectorized, 1))
)
```

The make_tiled_copy_tv function takes a copy atom (the hardware copy instruction), a thread layout (how threads are mapped to tile elements), and a value layout (how many elements each thread copies per invocation). This is where layout composition shines — CuTe composes these layouts to determine exactly which elements each thread is responsible for copying.

MMA Setup

The tiled MMA is set up similarly. We distribute threads in a $(T/16, 16, 1)$ layout across the M, N, K modes:

```python
atoms_layout_mnk = cute.make_layout(
    (num_threads // 16, 16, 1), stride=(16, 1, 0)
)
tiled_mma = cute.make_tiled_mma(
    cute.nvgpu.MmaUniversalOp(acc_dtype),
    atom_layout_mnk=atoms_layout_mnk,
)
```

The Kernel

Inside the kernel, we first use local_tile to carve out each thread block’s portion of the global tensors. The proj argument selects which modes the block tiler applies to:

```python
bidx, bidy, _ = cute.arch.block_idx()
cta_coords = (bidx, bidy, None)

# gA: (b_m, b_k, k_tiles), gB: (b_n, b_k, k_tiles), gC: (b_m, b_n)
gA = cute.local_tile(mA, block_tiler, cta_coords, proj=(1, None, 1))
gB = cute.local_tile(mB, block_tiler, cta_coords, proj=(None, 1, 1))
gC = cute.local_tile(mC, block_tiler, cta_coords, proj=(1, 1, None))
```

Then we partition the tiles across threads for both copying and computation:

```python
tidx, _, _ = cute.arch.thread_idx()
thr_mma = tiled_mma.get_slice(tidx)

# Partition copy source (GMEM) and destination (SMEM) for each thread
thr_copy_A = tiled_copy_A.get_slice(tidx)
tAgA = thr_copy_A.partition_S(gA)  # (cpy, cpy_m, cpy_k, k_tiles)
tAsA = thr_copy_A.partition_D(sA)  # (cpy, cpy_m, cpy_k)

thr_copy_B = tiled_copy_B.get_slice(tidx)
tBgB = thr_copy_B.partition_S(gB)  # (cpy, cpy_n, cpy_k, k_tiles)
tBsB = thr_copy_B.partition_D(sB)  # (cpy, cpy_n, cpy_k)

# Partition SMEM and output for MMA
tCsA = thr_mma.partition_A(sA)
tCsB = thr_mma.partition_B(sB)
tCgC = thr_mma.partition_C(gC)
tCrC = tiled_mma.make_fragment_C(tCgC)
tCrC.fill(0.0)
```

Each partition_* call internally uses layout composition and division to split the tile into per-thread pieces. The cpy mode in the copy partitions captures both the vectorization width and the number of copy operations each thread performs.

The Main Loop

The k-tile loop copies each $(b_M, b_K)$ and $(b_N, b_K)$ tile from GMEM to SMEM using async copies, then performs the MMA:

```python
k_tiles = cute.size(tAgA, mode=[3])

for k in range(k_tiles):
    cute.copy(tiled_copy_A, tAgA[None, None, None, k], tAsA, pred=tApA)
    cute.copy(tiled_copy_B, tBgB[None, None, None, k], tBsB, pred=tBpB)

    cute.arch.cp_async_commit_group()
    cute.arch.cp_async_wait_group(0)
    cute.arch.sync_threads()

    cute.gemm(tiled_mma, tCrC, tCsA, tCsB, tCrC)
    cute.arch.sync_threads()
```

The pred arguments handle boundary conditions — when M, N, or K aren’t exact multiples of the tile sizes, predicate tensors mask out-of-bounds accesses. These predicates are themselves built using layout operations on identity tensors.

Epilogue

Finally the accumulated results are written back to global memory, again with predication for bounds checking:

```python
cC = cute.make_identity_tensor(gC.shape)
tCpC = thr_mma.partition_C(cC)
predC = cute.make_rmem_tensor(tCrC.layout, cutlass.Boolean)
residue_m = mC.shape[0] - b_m * bidx
residue_n = mC.shape[1] - b_n * bidy
for i in range(cute.size(tCrC.shape)):
    predC[i] = cute.elem_less(tCpC[i], (residue_m, residue_n))
atom = cute.make_copy_atom(cute.nvgpu.CopyUniversalOp(), mC.element_type)
cute.copy(atom, tCrC, tCgC, pred=predC)
```

The full kernel is available in the companion repo.

Notice how the layout algebra permeates the entire kernel — from defining how threads map to data (make_tiled_copy_tv), to carving tiles (local_tile), to splitting work across threads (partition_*), to bounds checking (make_identity_tensor + elem_less). Without these abstractions, we’d be manually computing thread-to-element mappings with error-prone index arithmetic. The layout algebra replaces all of that with composable, type-safe operations.

Closing thoughts

CuTe’s layout algebra turns what would be pages of error-prone index arithmetic into a handful of composable operations — composition, complement, and division — that are easier to reason about. But I want to be honest about the tradeoff here.

GPU memory layouts are inherently complex. Swizzled shared memory, bank conflict avoidance, mixed-radix thread-to-data mappings — this complexity is intrinsic to the hardware. CuTe doesn’t eliminate it; it repackages it into a different formalism. You’re trading one kind of complexity (raw index math) for another (an algebra with its own rules, edge cases, and debugging challenges). As one Reddit commenter put it, CUTLASS is “a typical example of a library with so much abstractions that makes complicated things simple and simple things complicated.”

The learning curve is steep. You need to internalize layouts, composition, complement, division, cosize, cotarget, tiled copies, MMA atoms — a whole vocabulary before you can write your first kernel. For someone who already understands raw CUDA well, the value proposition is debatable: you’re investing significant effort to learn abstractions that, at the end of the day, generate the same PTX. The payoff comes when you need to support multiple GPU architectures, swap out MMA instructions, or restructure tiling strategies without rewriting everything — that’s where the composability genuinely shines. But for a one-off kernel on a single architecture, you might be better off with raw CUDA and a good understanding of your hardware.

If you want to dig deeper, the CuTe documentation is thorough, and the CUTLASS Python DSL makes it easier to experiment with these ideas interactively — the Python interface significantly reduces the compile-time pain that plagues the C++ templates.

Footnotes

  1. Writing a layout as $(s_0, s_1, ...) : (d_0, d_1, ...)$ and listing its modes elementwise as $(s_0 : d_0), (s_1 : d_1), ...$ means the same thing.