
Tensors are recursive cubes all the way down

WTF is a tensor?

Tensor sounds like abstract math. It is. But to understand AI concepts like LLMs, Transformers, and Stable Diffusion, you do not need a PhD in quantum neutrino fields. Tensors are just grids of numbers with named axes. The dimension count is how many indices you need to locate one value.

Axes Common name Example
0D Scalar 27
1D Vector [1, 2, 3]
2D Matrix [[1, 2], [3, 4]]

Read the table as geometry. A scalar is a point, a vector is a line, and a matrix is a plane. Add a third axis and you get a cube, which is a volume in our universe's 3D space. Add time and batch and you are in 4D and 5D. Add a few more dimensions and maybe you are studying superstring theory. These are still just coordinate lists, but our human spatial intuition starts to fade.
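
A quick sanity check in NumPy (the values are arbitrary): the number of axes is exactly the number of indices you need to pull out one value.

```python
import numpy as np

scalar = np.array(27)                  # 0 axes: no index needed
vector = np.array([1, 2, 3])           # 1 axis: one index
matrix = np.array([[1, 2], [3, 4]])    # 2 axes: two indices

print(scalar.ndim, vector.ndim, matrix.ndim)   # 0 1 2
print(vector[2])      # 3: one index locates a value on the line
print(matrix[1, 0])   # 3: two indices locate a value in the grid
```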

Note

Some libraries call this count a tensor's rank (side-eyes TensorFlow), which here means the same as dimensions: the number of axes. This is confusingly different from matrix rank, which counts linearly independent rows or columns.

See the structure

Step through the dimensions one at a time. 0D is a single number, 1D a line, 2D a grid, 3D a cube. At 4D the cube repeats in a line, at 5D in a grid, at 6D in a cube of cubes. Picture a warehouse of crates, where each crate holds a grid of boxes. A 6D index [i, j, k, x, y, z] works in two stages: [i, j, k] locates the crate in the warehouse, [x, y, z] locates a box inside it. Past 9D the inner detail compresses into a single frozen unit so the recursion can keep going.
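
A minimal sketch of that two-stage lookup, with made-up sizes (a 2×2×2 warehouse of 4×4×4 crates):

```python
import numpy as np

# Hypothetical 6D tensor: a 2x2x2 warehouse of crates,
# each crate holding a 4x4x4 grid of boxes.
warehouse = np.arange(2 * 2 * 2 * 4 * 4 * 4).reshape(2, 2, 2, 4, 4, 4)

i, j, k = 1, 0, 1    # stage one: locate the crate in the warehouse
x, y, z = 3, 2, 0    # stage two: locate the box inside that crate

crate = warehouse[i, j, k]   # a 3D cube, shape (4, 4, 4)
box = crate[x, y, z]         # a single number

# Same value whether you index in two stages or all at once.
assert box == warehouse[i, j, k, x, y, z]
```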

Tesseract projections and hypercube wireframes are the standard visualizations for higher dimensions. They are mathematically accurate but rarely useful when you are learning tensor indexing. Space is 3D, and projecting more axes onto a flat screen loses structure. A different approach is to visualize indexing instead of space.

The cube-of-cubes model

The idea comes from Feiner and Beshers' 1990 paper on "Worlds within Worlds." Instead of adding new orthogonal axes, treat the 3D cube as a single unit and use higher dimensions to organize those units. Going from 3D to 6D is sort of like that giant alien playing marbles at the end of the first MIB movie.

Here is the pattern. Dimensions 0-3 build a cube: point, line, plane, volume. At dimension 3, freeze the cube. Now treat it as a single block. Dimensions 4-6 arrange those blocks: a line of cubes (4D), a grid of cubes (5D), a cube of cubes (6D). At dimension 6, freeze again. Dimensions 7-9 arrange those structures. The recursion continues every three dimensions.

Tensor dim Structure What you are building
0D Scalar A single number
1D Vector A line of scalars
2D Matrix A grid of scalars (rows × cols)
3D Cube A volume of scalars, freeze here
4D Line of Cubes Cubes arranged along one axis
5D Grid of Cubes Cubes arranged in a plane
6D Cube-of-Cubes Cubes arranged in a volume, freeze here
7D Line of Cube-of-Cubes 6D blocks arranged along one axis
8D Grid of Cube-of-Cubes 6D blocks arranged in a plane
9D Cube of Cube-of-Cubes 6D blocks arranged in a volume, freeze here
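
The "freeze here" rows correspond to grouping axes. A small NumPy illustration (sizes made up): collapsing the inner three axes turns each cube into a single frozen unit, and the same move on a 9D tensor treats entire 6D blocks as units.

```python
import numpy as np

# Hypothetical 6D block: a 2x2x2 arrangement of 4x4x4 cubes.
t6 = np.zeros((2, 2, 2, 4, 4, 4))

# Freeze the inner cube: collapse its three axes into one flat unit.
print(t6.reshape(2, 2, 2, -1).shape)   # (2, 2, 2, 64)

# A 9D tensor is the same pattern applied once more: each unit is now
# an entire 6D cube-of-cubes (here 2 * 2 * 2 * 4 * 4 * 4 = 512 numbers).
t9 = np.zeros((2, 2, 2, 2, 2, 2, 4, 4, 4))
print(t9.reshape(2, 2, 2, -1).shape)   # (2, 2, 2, 512)
```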

Practical slicing

This model aligns with how we actually slice data in PyTorch or NumPy. Slicing means fixing one or more indices to extract a sub-tensor. Consider a 5D tensor with shape (batch, time, channel, height, width). Common operations map cleanly:

Operation Slice Mental Model
Single video from batch tensor[0] Pick one 4D block from the batch
All channels at one pixel, across batch and time tensor[:, :, :, h, w] Drill through every frame to one spatial point
First frame of each video tensor[:, 0, :, :, :] Take the t=0 slice from each batch element
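
The same slices in NumPy, with made-up sizes, just to confirm the shapes that come back:

```python
import numpy as np

# Hypothetical batch of videos: (batch, time, channel, height, width).
videos = np.zeros((8, 16, 3, 64, 64))
h, w = 10, 20

print(videos[0].shape)               # (16, 3, 64, 64): one 4D block from the batch
print(videos[:, :, :, h, w].shape)   # (8, 16, 3): every channel at one pixel, all frames
print(videos[:, 0].shape)            # (8, 3, 64, 64): the t=0 slice of each video
```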

The hierarchical framing makes reshape errors easier to debug. The total element count is the product of the sizes of all dimensions. If tensor.reshape(32, 8, -1) fails, you can reason about it as "I have 32 outer blocks, each containing 8 sub-blocks, and I want to flatten everything inside each sub-block." The numbers either multiply correctly or they do not.
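
A sketch of that reasoning with arbitrary sizes: the -1 asks NumPy to infer whatever is left inside each sub-block, which only works when the arithmetic does.

```python
import numpy as np

t = np.zeros((32, 8, 4, 4))           # 32 * 8 * 4 * 4 = 4096 elements
print(t.reshape(32, 8, -1).shape)     # (32, 8, 16): 4096 / (32 * 8) = 16

bad = np.zeros((32, 10, 10))          # 3200 elements, not a multiple of 32 * 8
try:
    bad.reshape(32, 8, -1)
except ValueError as e:
    print(e)                          # cannot reshape array of size 3200 ...
```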

Inference kernels: Tensors in the loop

An inference kernel is a small function that runs across many elements in parallel on a GPU. The mental model is not full 6D but tiles, like block matrix multiplication. A launch grid picks an outer block, and threads walk an inner block. The cube-of-cubes lens matches that: grid indices select the outer cube, thread indices select the inner voxel (a single 3D cell).
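
A toy Python sketch of that tiling pattern (not a real GPU kernel; the tile size is arbitrary): the two outer loops stand in for the launch grid picking an output tile, and the slice arithmetic inside stands in for the threads filling it.

```python
import numpy as np

def tiled_matmul(A, B, tile=4):
    """Block matmul: outer loops pick an output tile (the "grid"),
    the work inside fills that tile (the "threads")."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for bi in range(0, M, tile):            # grid index: which row block
        for bj in range(0, N, tile):        # grid index: which column block
            for bk in range(0, K, tile):    # walk the shared axis tile by tile
                C[bi:bi+tile, bj:bj+tile] += (
                    A[bi:bi+tile, bk:bk+tile] @ B[bk:bk+tile, bj:bj+tile]
                )
    return C

A = np.random.rand(8, 8).astype(np.float32)
B = np.random.rand(8, 8).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-4)
```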

A stride is the step size in memory between consecutive elements along an axis. If the kernel strides along the wrong axis, it pulls the wrong inner cube. If the block shape is mismatched, it over- or under-covers the outer grid. The tensor becomes an address space to traverse deliberately.
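
Strides are easy to inspect directly. A small NumPy example: a transpose moves no data, it only swaps the strides, which is exactly what a kernel that hard-codes a row-major walk will get wrong.

```python
import numpy as np

a = np.arange(12, dtype=np.float32).reshape(3, 4)
print(a.strides)     # (16, 4): bytes to step one row, one column (float32 = 4 bytes)

# Transposing swaps the strides without touching memory; code assuming
# row-major steps now walks the wrong axis and pulls the wrong "cube".
print(a.T.strides)   # (4, 16)
```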

Final thoughts

This "trick" (if we can even call it that) won't teach transformer inference. For that see Attention thrashing or Forgetting is a feature. It's a simple lookup story: each index is a step into a box inside a box. Three numbers pick the outer cube, three numbers pick the inner cube. Recurse on the inner cube if necessary. Hopefully it helps demystify tensors a bit and keeps the cubes spinning.

For better or for worse, we did not evolve to visualize hypercubes, but to stack containers.

References