Appendix A. Operation-by-device matrix

This appendix is the full per-operation, per-family status reference behind chapter 4. Read down a family column to see what compiles and runs on that chip; read across a row's status marks and note to find the gate and the route.

Each row is one intermediate-language operation, grouped by operation class, with its status on each Mac engine family from the M1 through the M5. Chapter 4 summarizes this table; the cells here are the reference.

The status marks are fixed:

  • Native: the operation compiles and runs on that family on the direct engine path.
  • Family-gated: no path on the listed family, native from the family named in the note.
  • Bridge: reachable only through a decompose, software fallback, or compiler-internal route, never as a standalone code-generated operation.
  • No path: rejected on every family from the M1 through the M5, computed off-engine.
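The four marks can be modelled as a small enumeration with one status per family column. The sketch below is a hypothetical encoding, not the compiler's own data structure: the names `Status`, `FAMILIES`, `ROWS`, and `first_native_family` are illustrative, and the three sample rows are transcribed from Tables A.4, A.9, and A.13.

```python
from enum import Enum


class Status(Enum):
    NATIVE = "Native"
    FAMILY_GATED = "Family-gated"
    BRIDGE = "Bridge"
    NO_PATH = "No path"


# Column order used throughout this appendix (M4 and M5 share one column).
FAMILIES = ("M1", "M2", "M3", "M4/M5")

# Three rows copied from the tables, for illustration only.
ROWS = {
    "crop_resize": (Status.FAMILY_GATED, Status.NATIVE, Status.NATIVE, Status.NATIVE),
    "sin": (Status.FAMILY_GATED, Status.FAMILY_GATED, Status.NATIVE, Status.NATIVE),
    "mod": (Status.NO_PATH,) * 4,
}


def first_native_family(op):
    """Return the earliest family column where `op` is Native, or None."""
    for family, status in zip(FAMILIES, ROWS[op]):
        if status is Status.NATIVE:
            return family
    return None
```

Under this encoding a Family-gated row resolves to the family named in its note, and a No path row resolves to no family at all.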

The family columns are M1 (H13, A13), M2 (H14, A14), M3 (H15, A15), and M4 and M5 (H16 and H17s, A16 and A17). The A11 and A12 engines are below the floor that runs any of this vocabulary and are out of scope for the table.

The M1, M2, and M5 columns are measured on physical silicon. The M3 column and the M4 part of the merged M4 and M5 column are decompile-derived predictions from the per-chip tables, so a per-cell status there is a predicted capability rather than a measured one.

The table covers the 187 intermediate-language operations the compiler exposes. Of these, about 108 are native on the M1: the full elementwise, compare, activation, convolution, pooling, structural, and quantization vocabulary, plus the reduction, normalization, softmax, square-root family, fused attention, tile, and space-channel set. Nine need the M2 or later: the texture-engine operations (crop-resize, resample, affine, hardware gather) and the rank and sort bridge (top-k, sort, dynamic slice). Four need the M3 or later: native sin and cos, the hardware random generator, and the whole-tensor argument reductions on the intermediate-language route. Thirty-seven are rejected on every family and decompose on the host. About twenty-four are compiler-internal: mapped but with no observed standalone code generation, reachable only inside a wrapping construct.

Per-chip numeric limits

The status of an operation is one axis; the numeric envelope it runs in is the other. Table A.1 gives that envelope across the five capability tiers, measured from the live compiler by calling every per-architecture parameter constructor on a single M1. A dash marks an unsupported value. The older column covers the pre-A13 legacy targets for which the compiler still carries parameter tables; they sit below the floor of the operation-status tables.

Limit | older | M1, A13 | A14 | A15 | A16, M5
max kernel W (default format, large) | 29 | 29 | 32 | 32 | 32
max kernel W (fp16, large) | 13 | 13 | 16 | 16 | 16
max kernel W (default, small) | 15 | 15 | 16 | 16 | 16
max kernel W (fp16, small) | 7 | 7 | 8 | 8 | 8
min kernel W (default / fp16, large) | 16 / 8 | 16 / 8 | 1 / 1 | 1 / 1 | 1 / 1
max kernel H (large / small) | 29 / 15 | 29 / 15 | 32 / 16 | 32 / 16 | 32 / 16
max kernel D (large / small) | 1 / 1, no 3D | 16 / 8 | 16 / 8 | 16 / 8 | 16 / 8
max patch W / H / D | 15 / 15 / 0 | 28 / 28 / 15 | 31 / 31 / 15 | 31 / 31 / 15 | 31 / 31 / 15
max tensor W / H | 16384 | 16384 | 16384 | 16384 | 65536
max tensor D | 1 | 16384 | 16384 | 16384 | 65536
max tensor C | 65536 | 65536 | 65536 | 65536 | 65536
max tensor N (batch) | 4096 | 65536 | 65536 | 65536 | 65536
max transpose W / H | 0, always split | 16384 | 16384 | 16384 | 65536
reduction-to-transpose threshold | none | 192 | 192 | 384 | 384
group-conv decompose limit (Cin·kW·kH) | 64 | 2048 | 2048 | 2048 | 2048
stride factor list | [2,3,4,8] | [2,3,4,8] | [2,3,4,8] | [2,3,4,8] | [2,3,4,8]
matmul SRAM working set | 2 MB, M9 1 MB | 2 MB | 2 MB | 2 MB | 2 MB
DMA width granule | 16 B | 16 B | 16 B | 16 B | 16 B
patch-width floor / max | none | 16 / 512 px | 16 / 512 | 16 / 512 | 16 / 512
instruction alignment | 256 B | 256 B | 16 B | 16 B | 16 B
has texture engine | no | no | yes | yes | yes
kernel-memory budget | 64 KB | 64 KB | 64 KB | 64 KB | 64 KB
activation-LUT budget | 150 B | 86 B | 86 B | 86 B | 86 B
context-switch live-tensor limit22

Table A.1. The per-chip numeric limits of the engine, measured from the live compiler across the five capability tiers.

The four generational dividing lines are visible in this table. The M1 adds the depth and three-dimensional axis and every reduction-class operation. The A14 adds the texture engine. The A15 raises the reduction-to-transpose threshold from 192 to 384 and adds native trigonometry. The A16 quadruples the maximum tensor and transpose dimensions from 16384 to 65536.
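A few rows of Table A.1 are enough to show how the envelope could be consulted before a graph is submitted. The sketch below is illustrative only: the dictionary keys, the tier labels, and the function `conv_within_upper_bounds` are hypothetical names, and the check covers only the maximum default-format large kernel width and the maximum tensor width and height, not the minimum-width or patch rows.

```python
# Selected rows of Table A.1, keyed by capability tier (hypothetical layout).
LIMITS = {
    "older":  {"max_kernel_w": 29, "max_tensor_wh": 16384, "texture_engine": False},
    "M1/A13": {"max_kernel_w": 29, "max_tensor_wh": 16384, "texture_engine": False},
    "A14":    {"max_kernel_w": 32, "max_tensor_wh": 16384, "texture_engine": True},
    "A15":    {"max_kernel_w": 32, "max_tensor_wh": 16384, "texture_engine": True},
    "A16/M5": {"max_kernel_w": 32, "max_tensor_wh": 65536, "texture_engine": True},
}


def conv_within_upper_bounds(tier, kernel_w, tensor_w, tensor_h):
    """Check a convolution against the upper bounds recorded for `tier`."""
    lim = LIMITS[tier]
    return (
        kernel_w <= lim["max_kernel_w"]
        and tensor_w <= lim["max_tensor_wh"]
        and tensor_h <= lim["max_tensor_wh"]
    )
```

For example, a width-32 kernel passes on the A14 tier but not on the M1 tier, and a 20000-wide tensor passes only on the A16 and M5 tier.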

Convolution, matrix multiply, and pooling

Table A.2 lists the convolution, matrix-multiply, and pooling operations with their per-family status and the lowering note for each.

Operation | M1 (A13) | M2 (A14) | M3 (A15) | M4, M5 (A16, A17) | Note
conv | Native | Native | Native | Native | M1 kernels up to 29x29, M5 up to 32x32; Winograd auto-selected for eligible 3x3 stride-1 convs
conv_transpose | Native | Native | Native | Native | Deconvolution; strided axes use the small-kernel caps
linear | Native | Native | Native | Native | Folds to convolution when the right operand fits the on-chip working set
linear_activation | Native | Native | Native | Native | Fused linear and activation
matmul | Native | Native | Native | Native | Engine lane or convolution fold; same tensor caps as convolution
ne_matmul | Native | Native | Native | Native | Private engine-lane matrix-multiply unit
einsum | Native | Native | Native | Native | Lowers to a matmul and transpose chain
ne_conv | Native | Native | Native | Native | Private engine-lane convolution unit
avg_pool | Native | Native | Native | Native | Window up to 29 on the M1, up to 31 from the M2
max_pool | Native | Native | Native | Native |
l2_pool | Native | Native | Native | Native | Lookup-table pool
ne_pool | Native | Native | Native | Native | Private engine-lane pooling unit
pe_pool | Native | Native | Native | Native | Private planar-engine pooling unit
pe_elementwise | Native | Native | Native | Native | Private planar-engine elementwise unit
pe_goc | Bridge | Bridge | Bridge | Bridge | Private planar-engine gain-offset unit, compiler-internal
ne_bypass | Bridge | Bridge | Bridge | Bridge | Private engine-lane bypass unit, compiler-internal
scaled_dot_product_attention | Native | Native | Native | Native | Runs on the matmul and softmax path, not texture-gated

Table A.2. Convolution, matrix-multiply, and pooling operations by device family.

The ne_ and pe_ rows are private engine-lane and planar-engine unit selections of the same convolution, matrix-multiply, pooling, and elementwise atoms, not separate operations.

Normalization

Table A.3 gives the normalization operations, native on every family from the M1.

Operation | M1 (A13) | M2 (A14) | M3 (A15) | M4, M5 (A16, A17) | Note
batch_norm | Native | Native | Native | Native | Inference fold-to-affine; native statistics form from the M1
layer_norm | Native | Native | Native | Native |
instance_norm | Native | Native | Native | Native |
l2_norm | Native | Native | Native | Native |
local_response_norm | Native | Native | Native | Native | Measured on the M1

Table A.3. Normalization operations by device family.

Elementwise arithmetic

Table A.4 gives the elementwise arithmetic operations, where only mod takes no engine path.

Operation | M1 (A13) | M2 (A14) | M3 (A15) | M4, M5 (A16, A17) | Note
abs | Native | Native | Native | Native |
add | Native | Native | Native | Native | Constant and tensor forms
sub | Native | Native | Native | Native | Lowered to add of a negated constant
mul | Native | Native | Native | Native | Constant and tensor forms
real_div | Native | Native | Native | Native | General divide
floor_div | Native | Native | Native | Native | Lookup-table assisted
pow | Native | Native | Native | Native |
square | Native | Native | Native | Native |
sqrt | Native | Native | Native | Native | Lookup-table activation
rsqrt | Native | Native | Native | Native | Lookup-table
inverse | Native | Native | Native | Native | Reciprocal lookup-table
maximum | Native | Native | Native | Native |
minimum | Native | Native | Native | Native |
mod | No path | No path | No path | No path | Decompose on host
cumsum | Native | Native | Native | Native | Native through a curated runtime path, not the standard compile path; M1 measured

Table A.4. Elementwise arithmetic operations by device family.
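The table records only that mod decomposes on the host, not how. One common identity, assumed here rather than taken from the compiler, rebuilds it from operations the table lists as native: mod(a, b) = a - b * floor_div(a, b). A minimal sketch over plain Python lists, with hypothetical helper names:

```python
import math


def floor_div(a, b):
    """Elementwise floor division, standing in for the floor_div operation."""
    return [math.floor(x / y) for x, y in zip(a, b)]


def mod_via_floor_div(a, b):
    """Elementwise mod rebuilt as a - b * floor_div(a, b).

    This is an assumed decomposition; the result takes the sign of the
    divisor, which is the floor-mod convention.
    """
    q = floor_div(a, b)
    return [x - y * k for x, y, k in zip(a, b, q)]
```

Whether the host path follows the floor convention or the truncating one is not stated in the table, so the sign behaviour for negative operands should be checked against the framework in use.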

Comparison and logical

Table A.5 gives the comparison and logical operations; the binary logical operations (and, or, xor) take no engine path and decompose on the host.

Operation | M1 (A13) | M2 (A14) | M3 (A15) | M4, M5 (A16, A17) | Note
equal | Native | Native | Native | Native |
not_equal | Native | Native | Native | Native |
greater | Native | Native | Native | Native |
greater_equal | Native | Native | Native | Native |
less | Native | Native | Native | Native |
less_equal | Native | Native | Native | Native |
logical_not | Native | Native | Native | Native |
select | Native | Native | Native | Native | The where operation
logical_and | No path | No path | No path | No path | Decompose through minimum or multiply on host
logical_or | No path | No path | No path | No path | Decompose through maximum on host
logical_xor | No path | No path | No path | No path | Decompose through not-equal on host

Table A.5. Comparison and logical operations by device family.
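The three host decompositions named in the note column can be written out directly. The sketch below assumes the operands are already encoded as 0 and 1 values, which the table does not state, and uses plain Python lists with illustrative function names.

```python
def logical_and_via_minimum(a, b):
    # The minimum of two 0/1 values is 1 only when both are 1.
    return [min(x, y) for x, y in zip(a, b)]


def logical_and_via_multiply(a, b):
    # The product of two 0/1 values is 1 only when both are 1.
    return [x * y for x, y in zip(a, b)]


def logical_or_via_maximum(a, b):
    # The maximum of two 0/1 values is 1 when either is 1.
    return [max(x, y) for x, y in zip(a, b)]


def logical_xor_via_not_equal(a, b):
    # Two 0/1 values differ exactly when one of them is 1.
    return [int(x != y) for x, y in zip(a, b)]
```

Each replacement uses only operations that Tables A.4 and A.5 list as native (minimum, mul, maximum, not_equal), which is why the rewrite is enough to bring these operations back onto the engine.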

Activations

Table A.6 gives the activation operations, native on every family and most lookup-table backed.

Operation | M1 (A13) | M2 (A14) | M3 (A15) | M4, M5 (A16, A17) | Note
relu | Native | Native | Native | Native |
relu6 | Native | Native | Native | Native | Lookup-table
leaky_relu | Native | Native | Native | Native | Lookup-table
prelu | Native | Native | Native | Native | Per-channel slope; native at rank 3 or above
clamped_relu | Native | Native | Native | Native | Lookup-table
thresholded_relu | Native | Native | Native | Native | Lookup-table
threshold | Native | Native | Native | Native | Lookup-table
clip | Native | Native | Native | Native | The clamp operation
elu | Native | Native | Native | Native | Lookup-table
sigmoid | Native | Native | Native | Native | Includes the hard variant
sigmoid_hard | Native | Native | Native | Native | Lookup-table
tanh | Native | Native | Native | Native | Lookup-table
scaled_tanh | Native | Native | Native | Native | Lookup-table
gelu | Native | Native | Native | Native | Lookup-table approximation
silu | Native | Native | Native | Native | Also named swish; lookup-table
softmax | Native | Native | Native | Native | Lookup-table
softplus | Native | Native | Native | Native | Lookup-table
softplus_parametric | Native | Native | Native | Native | Lookup-table
softsign | Native | Native | Native | Native | Lookup-table
erf | Native | Native | Native | Native | Lookup-table
exp | Native | Native | Native | Native | Lookup-table
exp2 | Native | Native | Native | Native | Lookup-table
log | Native | Native | Native | Native | Lookup-table
sign | Native | Native | Native | Native | Lookup-table
ceil | Native | Native | Native | Native | Lookup-table
floor | Native | Native | Native | Native | Lookup-table
round | Native | Native | Native | Native | Round-to-nearest lookup-table

Table A.6. Activation operations by device family.

Reduction

Table A.7 gives the reduction operations, where reduce_argmin is gated and reduce_prod takes no path.

Operation | M1 (A13) | M2 (A14) | M3 (A15) | M4, M5 (A16, A17) | Note
reduce_sum | Native | Native | Native | Native | Reduced axis at or above 192 takes the transpose route, at or above 384 from the M3
reduce_mean | Native | Native | Native | Native |
reduce_max | Native | Native | Native | Native |
reduce_min | Native | Native | Native | Native |
reduce_sum_square | Native | Native | Native | Native | The reduce-then-square fusion is M2 onward; the M1 emits an extra fp16 round
reduce_l1_norm | Native | Native | Native | Native |
reduce_l2_norm | Native | Native | Native | Native |
reduce_log_sum | Native | Native | Native | Native | Lookup-table assisted
reduce_log_sum_exp | Native | Native | Native | Native | Lookup-table assisted
reduce_argmax | Native | Native | Native | Native | Per-axis argmax on all families
reduce_argmin | Bridge | Bridge | Native | Native | Per-axis argmin; the intermediate-language route is gated to the M3, the bridge route works on the M1 and M2
reduce_prod | No path | No path | No path | No path | Decompose through log-sum-exp on host

Table A.7. Reduction operations by device family.

The whole-tensor argument reductions global_argmax and global_argmin follow the same gate as reduce_argmin: native on the intermediate-language route from the M3, reachable through the bridge on the M1 and M2.
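The reduce_sum note and the reduction-to-transpose row of Table A.1 together define a simple per-family rule. The sketch below encodes it; the label "direct" for the route taken below the threshold is a placeholder of this example, since the table names only the transpose route, and `THRESHOLD` and `reduction_route` are hypothetical names.

```python
# Reduction-to-transpose thresholds from Table A.1, mapped to family columns.
THRESHOLD = {"M1": 192, "M2": 192, "M3": 384, "M4/M5": 384}


def reduction_route(family, axis_len):
    """Return which route a reduction over `axis_len` elements takes.

    "transpose" is the route named in the table; "direct" is a placeholder
    label for everything below the threshold.
    """
    return "transpose" if axis_len >= THRESHOLD[family] else "direct"
```

A reduced axis of 192 therefore changes route between the M2 and the M3, which is worth keeping in mind when comparing numerics across families.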

Data movement and structural

Table A.8 gives the data-movement and structural operations, the largest class, spanning reshape, slice, gather, scatter, and the space-channel set.

Operation | M1 (A13) | M2 (A14) | M3 (A15) | M4, M5 (A16, A17) | Note
reshape | Native | Native | Native | Native | Metadata edit
reshape_like | Native | Native | Native | Native |
expand_dims | Native | Native | Native | Native |
squeeze | Native | Native | Native | Native |
flatten2d | Native | Native | Native | Native |
transpose | Native | Native | Native | Native | Capped by the maximum transpose extent, 16384 through the M3, 65536 on the M5
concat | Native | Native | Native | Native | DMA
split | Native | Native | Native | Native |
stack | Native | Native | Native | Native |
pad | Native | Native | Native | Native | Constant pad is native everywhere; symmetric and reflect pad are texture-gated, software on the M1 and native from the M2
slice_by_size | Native | Native | Native | Native | M1 and M2 nonzero width-offset routes through a fixed-point crop-DMA that saturates a magnitude above 4094 to infinity; clean from the M3
slice_by_index | Bridge | Bridge | Bridge | Bridge | Static-offset slice folds into the descriptor inside a graph
slice_update | Native | Native | Native | Native |
reverse | Native | Native | Native | Native | Measured on the M1
reverse_sequence | No path | No path | No path | No path | Decompose on host
tile | Native | Native | Native | Native | Factors of 2, 3, 4, and 8
gather | Native | Native | Native | Native | M1 software path valid only for a batch of one and a depth of one; the hardware path is M2 onward
gather_along_axis | Native | Native | Native | Native | Same M1 envelope caveat
gather_nd | Bridge | Native | Native | Native | M1 software envelope only (batch one, depth one, three-element index channel); native texture path from the M2
scatter | No path | No path | No path | No path | Decompose on host
scatter_along_axis | No path | No path | No path | No path | Decompose on host
scatter_nd | No path | No path | No path | No path | Decompose on host
depth_to_space | Native | Native | Native | Native | The pixel-shuffle operation
space_to_depth | Native | Native | Native | Native | The pixel-unshuffle operation
pixel_shuffle | Native | Native | Native | Native | Engine-lane reorganization, factors of 2, 3, 4, and 8; z-factor must be 1
pixel_unshuffle | Native | Native | Native | Native | Engine-lane reorganization; input dimension divisible by the factor
space_to_batch | Native | Native | Native | Native | Factor in 2, 3, 4, 8; batch cap 4096 on older families, 65536 on the newer
batch_to_space | Native | Native | Native | Native | Inverse of the above
identity | Native | Native | Native | Native | Aliases a cast or no-op
fill | Native | Native | Native | Native | Constant tensor producer
fill_like | Native | Native | Native | Native | Constant tensor producer
range_1d | Bridge | Bridge | Bridge | Bridge | M1 code generation rejects it; host-precompute the constant
crop | Native | Native | Native | Native | Slice and crop, distinct from the texture crop-resize
band_part | No path | No path | No path | No path | Mask on host
non_zero | No path | No path | No path | No path | Data-dependent shape
one_hot | No path | No path | No path | No path | Decompose through an identity gather on host
shape | No path | No path | No path | No path | Static-shape graphs only
sliding_windows | No path | No path | No path | No path | Decompose on host

Table A.8. Data-movement and structural operations by device family.
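The one_hot note, "decompose through an identity gather", amounts to gathering rows of an identity matrix by index. The sketch below shows that reading on plain Python lists; the helper names are illustrative, and whether the host path materializes the identity matrix this way is an assumption.

```python
def identity_matrix(depth):
    """Build a depth x depth identity matrix as nested lists."""
    return [[1 if r == c else 0 for c in range(depth)] for r in range(depth)]


def gather_rows(table, indices):
    """Pick one row of `table` per index, in order."""
    return [table[i] for i in indices]


def one_hot_via_identity_gather(indices, depth):
    """One-hot encode `indices` by gathering rows of the identity matrix."""
    return gather_rows(identity_matrix(depth), indices)
```

Because the identity matrix is a constant and gather has an engine path, the rewritten graph contains no data-dependent shape, unlike non_zero in the same table.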

Image, resize, and texture

Table A.9 gives the image, resize, and texture operations, gated to the texture engine from the A14 with software fallbacks on the M1.

Operation | M1 (A13) | M2 (A14) | M3 (A15) | M4, M5 (A16, A17) | Note
resize | Bridge | Native | Native | Native | Texture-gated; M1 takes a software transpose fallback with different rounding, native from the M2
resize_bilinear | Bridge | Native | Native | Native | Software fallback on the M1
resize_nearest_neighbor | Bridge | Native | Native | Native | Software fallback on the M1
upsample_bilinear | Bridge | Native | Native | Native | Software fallback on the M1
upsample_nearest_neighbor | Bridge | Native | Native | Native | Software fallback on the M1
crop_resize | Family-gated | Native | Native | Native | Texture engine, M2 onward; no host substitution wired
resample | Family-gated | Native | Native | Native | Texture engine, M2 onward
affine | Family-gated | Native | Native | Native | Texture engine, M2 onward
pixel_buffer_to_tensor | Bridge | Bridge | Bridge | Bridge | Four-character-code image input; an entitlement gate, not a chip gate
tensor_to_pixel_buffer | Bridge | Bridge | Bridge | Bridge | Compiler-internal
gamma | Bridge | Bridge | Bridge | Bridge | Image-signal operation, compiler-internal
degamma | Bridge | Bridge | Bridge | Bridge | Image-signal operation, compiler-internal

Table A.9. Image, resize, and texture operations by device family.

Quantization and dtype

Table A.10 gives the quantization and dtype operations, with the per-family streaming gates carried in the note column.

Operation | M1 (A13) | M2 (A14) | M3 (A15) | M4, M5 (A16, A17) | Note
cast | Native | Native | Native | Native | fp16 to fp32 and bool native on the M1; cast to int32 is rejected on the M1
quantize | Native | Native | Native | Native | Not texture-gated
dequantize | Native | Native | Native | Native |
const | Bridge | Bridge | Bridge | Bridge | Folded at compile, not a standalone code-generated operation
constexpr_affine_dequantize | Bridge | Bridge | Bridge | Bridge | int4 lookup-table streams from the M1; int8 and affine fold to fp16 below the M2, and stream from the A14 and M2
constexpr_lut_to_dense | Native | Native | Native | Native | Palette and lookup-table stream; int4 lookup-table streams natively from the M1
constexpr_lut_to_sparse | Bridge | Bridge | Bridge | Bridge | Folded constant; sparse stream from the M3
constexpr_blockwise_shift_scale | Bridge | Bridge | Native | Native | Blockwise stream from the M3; folds to fp16 on the M1 and M2
constexpr_sparse_blockwise_shift_scale | Bridge | Bridge | Native | Native | Sparse and blockwise stream from the M3
constexpr_sparse_to_dense | Native | Native | Native | Native | Sparse streams natively from the M1
constexpr_cast | No path | No path | No path | No path | Rejected on every family

Table A.10. Quantization and dtype operations by device family.

Attention, control flow, and state

Table A.11 gives the attention, control-flow, and state operations, where the state pair is native and the control-flow operations are compiler-internal.

Operation | M1 (A13) | M2 (A14) | M3 (A15) | M4, M5 (A16, A17) | Note
read_state | Native | Native | Native | Native | Stateful; needs the inout tensor-descriptor plumbing for a key-value cache
write_state | Native | Native | Native | Native | Stateful
tensor_buffer_to_tensor | Bridge | Bridge | Bridge | Bridge | Ring and streaming buffer mover, reachable inside a stateful graph
tensor_to_tensor_buffer | Bridge | Bridge | Bridge | Bridge | Compiler-internal
circular_buffer_to_tensor | Bridge | Bridge | Bridge | Bridge | Ring-buffer reader
tensor_to_circular_buffer | Bridge | Bridge | Bridge | Bridge | Ring-buffer writer
cond | Bridge | Bridge | Bridge | Bridge | No standalone code generation; flatten on host
while_loop | Bridge | Bridge | Bridge | Bridge | No standalone code generation; unroll on host
call | Bridge | Bridge | Bridge | Bridge | Inlined

Table A.11. Attention, control-flow, and state operations by device family.

Recurrent cells

Table A.12 gives the recurrent-cell operations, none of which take an engine path; each unrolls on the host.

Operation | M1 (A13) | M2 (A14) | M3 (A15) | M4, M5 (A16, A17) | Note
gru | No path | No path | No path | No path | Unroll to a convolution, matmul, and activation graph on host
lstm | No path | No path | No path | No path | Unroll on host
rnn | No path | No path | No path | No path | Unroll on host

Table A.12. Recurrent-cell operations by device family.

Trigonometric, special, and math

Table A.13 gives the trigonometric, special, and math operations, where sin and cos go native from the M3 and atan is the one trigonometric primitive native on the M1.

Operation | M1 (A13) | M2 (A14) | M3 (A15) | M4, M5 (A16, A17) | Note
sin | Family-gated | Family-gated | Native | Native | Native from the M3; the M1 and M2 use a host polynomial
cos | Family-gated | Family-gated | Native | Native | Native from the M3; the M1 and M2 use a host polynomial
atan | Native | Native | Native | Native | The one trigonometric primitive native on the M1
tan | No path | No path | No path | No path | Decompose through a sin and cos identity on host
asin | No path | No path | No path | No path | Host decomposition
acos | No path | No path | No path | No path | Host decomposition
atanh | No path | No path | No path | No path | Host decomposition
asinh | No path | No path | No path | No path | Host decomposition
acosh | No path | No path | No path | No path | Host decomposition
sinh | No path | No path | No path | No path | Host decomposition
cosh | No path | No path | No path | No path | Host decomposition
cross_product | Bridge | Bridge | Bridge | Bridge | Reachable through the bridge route, measured on the M1
cost_volume | Bridge | Bridge | Bridge | Bridge | Reachable through the bridge route, measured on the M1
matrix_decomposition | Bridge | Bridge | Bridge | Bridge | No observed code generation

Table A.13. Trigonometric, special, and math operations by device family.
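The tan row names its decomposition: a sin and cos identity, tan(x) = sin(x) / cos(x). A minimal sketch follows, using Python's `math` module as a stand-in for whichever sin and cos the target family provides (native from the M3, a host polynomial on the M1 and M2); the function name is illustrative.

```python
import math


def tan_via_sin_cos(values):
    """Elementwise tan rebuilt as sin(x) / cos(x).

    The quotient is undefined where cos(x) is zero; this sketch does not
    guard against that case.
    """
    return [math.sin(v) / math.cos(v) for v in values]
```

On families where sin and cos are themselves host polynomials, the whole chain stays off-engine, which is consistent with the No path mark in every column.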

Detection and sampling

Table A.14 gives the detection and sampling operations, the rank and sort bridge gated to the M2 and the random and tensor-list operations off-engine.

Operation | M1 (A13) | M2 (A14) | M3 (A15) | M4, M5 (A16, A17) | Note
non_maximum_suppression | Bridge | Bridge | Bridge | Bridge | Reachable only with a CPU or GPU backend in the mask; the engine-only mask reports not supported on any backend, so it offloads to the CPU or GPU rather than the engine
topk | Family-gated | Native | Native | Native | Rank and sort bridge, M2 onward; the validator is callable on the M1 but code generation rejects it
argsort | Family-gated | Native | Native | Native | Sort family, M2 onward; code-generation-rejected on the M1
random_uniform | Bridge | Bridge | Native | Native | Hardware generator from the M3; host random below it
random_bernoulli | No path | No path | No path | No path | Host random
random_categorical | No path | No path | No path | No path | Host random
random_normal | No path | No path | No path | No path | Host random
list_gather | No path | No path | No path | No path | Tensor-list operation
list_length | No path | No path | No path | No path | Tensor-list operation
list_read | No path | No path | No path | No path | Tensor-list operation
list_scatter | No path | No path | No path | No path | Tensor-list operation
list_write | No path | No path | No path | No path | Tensor-list operation
make_list | No path | No path | No path | No path | Tensor-list operation

Table A.14. Detection and sampling operations by device family.