Appendix A. Operation-by-device matrix

This appendix is the full per-operation, per-family status reference behind chapter 4. Read down a family column to see what compiles and runs on that chip; read across a row, through its status marks and note, to find the gate and the route.

Each row is one intermediate-language operation, grouped by operation class, with its status on each Mac engine family from the M1 through the M5. Chapter 4 summarizes this table; the cells here are the reference.

The status marks are fixed:

Native: the operation compiles and runs on that family on the direct engine path.
Family-gated: no path on the listed family, native from the family named in the note.
Bridge: reachable only through a decompose, software fallback, or compiler-internal route, never as a standalone code-generated operation.
No path: rejected on every family from the M1 through the M5, computed off-engine.
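The four marks can be treated as a small closed vocabulary. The sketch below is illustrative only: the `Status` enum, the `ROWS` dictionary, and both helper functions are hypothetical names, and the handful of rows is copied by hand from Tables A.4, A.9, and A.13 rather than read from the compiler.

```python
from enum import Enum


class Status(Enum):
    NATIVE = "Native"
    FAMILY_GATED = "Family-gated"
    BRIDGE = "Bridge"
    NO_PATH = "No path"


# Column order used by the operation tables in this appendix.
FAMILIES = ("M1", "M2", "M3", "M4, M5")

# A few rows copied from Tables A.4, A.9, and A.13 (hypothetical container).
ROWS = {
    "add": (Status.NATIVE,) * 4,
    "mod": (Status.NO_PATH,) * 4,
    "resize": (Status.BRIDGE, Status.NATIVE, Status.NATIVE, Status.NATIVE),
    "crop_resize": (Status.FAMILY_GATED, Status.NATIVE, Status.NATIVE, Status.NATIVE),
    "sin": (Status.FAMILY_GATED, Status.FAMILY_GATED, Status.NATIVE, Status.NATIVE),
}


def status_on(op, family):
    """Return the status mark of `op` in the given family column."""
    return ROWS[op][FAMILIES.index(family)]


def first_native_family(op):
    """Return the first column where `op` is Native, or None if there is none."""
    for family, status in zip(FAMILIES, ROWS[op]):
        if status is Status.NATIVE:
            return family
    return None
```

For example, `first_native_family("sin")` returns `"M3"`, which is the family a Family-gated note would name for that row.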

The family columns are M1 (H13, A13), M2 (H14, A14), M3 (H15, A15), and M4 and M5 (H16 and H17s, A16 and A17). The A11 and A12 engines are below the floor that runs any of this vocabulary and are out of scope for the table.

The M1, M2, and M5 columns are measured on physical silicon. The M3 column and the M4 part of the merged M4 and M5 column are decompile-derived predictions from the per-chip tables, so a per-cell status there is a predicted capability rather than a measured one.
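As a reading aid, the column layout and the measured-versus-predicted split can be written down as two lookups. This is a minimal sketch: the dictionary and function names are hypothetical, and the contents restate only the two paragraphs above.

```python
# Engine and chip codes grouped by table column, as listed above.
COLUMN_CODES = {
    "M1": ("M1", "H13", "A13"),
    "M2": ("M2", "H14", "A14"),
    "M3": ("M3", "H15", "A15"),
    "M4, M5": ("M4", "M5", "H16", "H17s", "A16", "A17"),
}

# Engines below the floor of the table; they have no column.
BELOW_FLOOR = ("A11", "A12")

# True where the column was measured on physical silicon,
# False where it is a decompile-derived prediction.
MEASURED = {"M1": True, "M2": True, "M3": False, "M4": False, "M5": True}


def column_for(code):
    """Return the table column for a chip or engine code, or None if out of scope."""
    if code in BELOW_FLOOR:
        return None
    for column, codes in COLUMN_CODES.items():
        if code in codes:
            return column
    raise KeyError(f"unknown engine code: {code}")


def is_measured(family):
    """Return True if cells for this family are measured rather than predicted."""
    return MEASURED[family]
```

Note that the merged M4 and M5 column mixes both kinds of evidence, so `is_measured` is keyed by family rather than by column.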

The table covers the 187 intermediate-language operations the compiler exposes. Of these, about 108 are native on the M1: the full elementwise, compare, activation, convolution, pooling, structural, and quantization vocabulary, plus the reduction, normalization, softmax, square-root family, fused attention, tile, and space-channel set. Nine need the M2 or later: the texture-engine operations (crop-resize, resample, affine, hardware gather) and the rank and sort bridge (top-k, sort, dynamic slice). Four need the M3 or later: native sin and cos, the hardware random generator, and the whole-tensor argument reductions on the intermediate-language route. Thirty-seven are rejected on every family and decompose on the host. About twenty-four are compiler-internal: mapped but with no observed standalone code generation, reachable only inside a wrapping construct.

Per-chip numeric limits

The status of an operation is one axis; the numeric envelope it runs in is the other. Table A.1 gives that envelope across the five capability tiers. The values were measured from the live compiler by calling every per-architecture parameter constructor on a single M1. A dash marks an unsupported value. The older column covers the pre-A13 legacy targets the compiler still has parameter tables for, which sit below the floor of the operation-status tables.

Limit	older	M1, A13	A14	A15	A16, M5
max kernel W (default format, large)	29	29	32	32	32
max kernel W (fp16, large)	13	13	16	16	16
max kernel W (default, small)	15	15	16	16	16
max kernel W (fp16, small)	7	7	8	8	8
min kernel W (default / fp16, large)	16 / 8	16 / 8	1 / 1	1 / 1	1 / 1
max kernel H (large / small)	29 / 15	29 / 15	32 / 16	32 / 16	32 / 16
max kernel D (large / small)	1 / 1, no 3D	16 / 8	16 / 8	16 / 8	16 / 8
max patch W / H / D	15 / 15 / 0	28 / 28 / 15	31 / 31 / 15	31 / 31 / 15	31 / 31 / 15
max tensor W / H	16384	16384	16384	16384	65536
max tensor D	1	16384	16384	16384	65536
max tensor C	65536	65536	65536	65536	65536
max tensor N (batch)	4096	65536	65536	65536	65536
max transpose W / H	0, always split	16384	16384	16384	65536
reduction-to-transpose threshold	none	192	192	384	384
group-conv decompose limit (Cin·kW·kH)	64	2048	2048	2048	2048
stride factor list	[2,3,4,8]	[2,3,4,8]	[2,3,4,8]	[2,3,4,8]	[2,3,4,8]
matmul SRAM working set	2 MB, M9 1 MB	2 MB	2 MB	2 MB	2 MB
DMA width granule	16 B	16 B	16 B	16 B	16 B
patch-width floor / max	none	16 / 512 px	16 / 512	16 / 512	16 / 512
instruction alignment	256 B	256 B	16 B	16 B	16 B
has texture engine	no	no	yes	yes	yes
kernel-memory budget	64 KB	64 KB	64 KB	64 KB	64 KB
activation-LUT budget	150 B	86 B	86 B	86 B	86 B
context-switch live-tensor limit	2	2	∞	∞	∞

Table A.1. The per-chip numeric limits of the engine, measured from the live compiler across the five capability tiers.
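One use of Table A.1 is a pre-flight check of a convolution shape against a tier's envelope. The sketch below assumes the large, default-format kernel caps apply and checks only the maxima; it ignores the minimum kernel width, the fp16 and small-kernel caps, and every other row. The `Envelope` class and `conv_violations` function are hypothetical names, not compiler API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    max_kernel_w: int   # "max kernel W (default format, large)"
    max_kernel_h: int   # "max kernel H (large / small)", large value
    max_tensor_wh: int  # "max tensor W / H"


# Values copied from Table A.1; tier names follow its column headers.
ENVELOPES = {
    "older": Envelope(29, 29, 16384),
    "M1, A13": Envelope(29, 29, 16384),
    "A14": Envelope(32, 32, 16384),
    "A15": Envelope(32, 32, 16384),
    "A16, M5": Envelope(32, 32, 65536),
}


def conv_violations(tier, kernel_w, kernel_h, tensor_w, tensor_h):
    """Return a list of limit violations for the tier (empty if none)."""
    env = ENVELOPES[tier]
    problems = []
    if kernel_w > env.max_kernel_w:
        problems.append(f"kernel W {kernel_w} > {env.max_kernel_w}")
    if kernel_h > env.max_kernel_h:
        problems.append(f"kernel H {kernel_h} > {env.max_kernel_h}")
    largest = max(tensor_w, tensor_h)
    if largest > env.max_tensor_wh:
        problems.append(f"tensor W / H {largest} > {env.max_tensor_wh}")
    return problems
```

A 31x31 kernel, for instance, is reported as two violations on the M1 tier and none on the A14 tier.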

The four generational dividing lines are visible in this table. The M1 adds the depth and three-dimensional axis and every reduction-class operation. The A14 adds the texture engine. The A15 raises the reduction-to-transpose threshold from 192 to 384 and adds native trigonometry. The A16 quadruples the maximum tensor and transpose dimensions from 16384 to 65536.
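The dividing lines can be recovered mechanically by comparing each tier column with the one to its left. The sketch below does that over a hand-copied subset of Table A.1 rows; the names `TIERS`, `LIMITS`, and `changes_at` are hypothetical, `None` stands in for the table's "none", and the full table has further rows (kernel caps, instruction alignment, and others) that also change at some tiers.

```python
TIERS = ("older", "M1, A13", "A14", "A15", "A16, M5")

# Subset of Table A.1, one tuple per row in TIERS order.
LIMITS = {
    "max kernel D (large)": (1, 16, 16, 16, 16),
    "max tensor W / H": (16384, 16384, 16384, 16384, 65536),
    "max tensor D": (1, 16384, 16384, 16384, 65536),
    "max tensor C": (65536, 65536, 65536, 65536, 65536),
    "max transpose W / H": (0, 16384, 16384, 16384, 65536),
    "reduction-to-transpose threshold": (None, 192, 192, 384, 384),
    "has texture engine": (False, False, True, True, True),
}


def changes_at(tier):
    """Return {row: (previous, current)} for rows that differ from the prior tier."""
    i = TIERS.index(tier)
    if i == 0:
        raise ValueError("the first tier has no predecessor to compare with")
    return {
        name: (values[i - 1], values[i])
        for name, values in LIMITS.items()
        if values[i - 1] != values[i]
    }
```

Within this subset, `changes_at("A15")` yields only the threshold row, `(192, 384)`, and `changes_at("A16, M5")` yields the three tensor and transpose extents.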

Convolution, matrix multiply, and pooling

Table A.2 lists the convolution, matrix-multiply, and pooling operations with their per-family status and the lowering note for each.

Operation	M1 (A13)	M2 (A14)	M3 (A15)	M4, M5 (A16, A17)	Note
`conv`	Native	Native	Native	Native	Kernels up to 29x29 on the M1 and up to 32x32 from the M2; Winograd auto-selected for eligible 3x3 stride-1 convs
`conv_transpose`	Native	Native	Native	Native	Deconvolution; strided axes use the small-kernel caps
`linear`	Native	Native	Native	Native	Folds to convolution when the right operand fits the on-chip working set
`linear_activation`	Native	Native	Native	Native	Fused linear and activation
`matmul`	Native	Native	Native	Native	Engine lane or convolution fold; same tensor caps as convolution
`ne_matmul`	Native	Native	Native	Native	Private engine-lane matrix-multiply unit
`einsum`	Native	Native	Native	Native	Lowers to a matmul and transpose chain
`ne_conv`	Native	Native	Native	Native	Private engine-lane convolution unit
`avg_pool`	Native	Native	Native	Native	Window up to 29 on the M1, up to 31 from the M2
`max_pool`	Native	Native	Native	Native
`l2_pool`	Native	Native	Native	Native	Lookup-table pool
`ne_pool`	Native	Native	Native	Native	Private engine-lane pooling unit
`pe_pool`	Native	Native	Native	Native	Private planar-engine pooling unit
`pe_elementwise`	Native	Native	Native	Native	Private planar-engine elementwise unit
`pe_goc`	Bridge	Bridge	Bridge	Bridge	Private planar-engine gain-offset unit, compiler-internal
`ne_bypass`	Bridge	Bridge	Bridge	Bridge	Private engine-lane bypass unit, compiler-internal
`scaled_dot_product_attention`	Native	Native	Native	Native	Runs on the matmul and softmax path, not texture-gated

Table A.2. Convolution, matrix-multiply, and pooling operations by device family.

The ne_ and pe_ rows are private engine-lane and planar-engine unit selections of the same convolution, matrix-multiply, pooling, and elementwise atoms, not separate operations.

Normalization

Table A.3 gives the normalization operations, native on every family from the M1.

Operation	M1 (A13)	M2 (A14)	M3 (A15)	M4, M5 (A16, A17)	Note
`batch_norm`	Native	Native	Native	Native	Inference fold-to-affine; native statistics form from the M1
`layer_norm`	Native	Native	Native	Native
`instance_norm`	Native	Native	Native	Native
`l2_norm`	Native	Native	Native	Native
`local_response_norm`	Native	Native	Native	Native	Measured on the M1

Table A.3. Normalization operations by device family.

Elementwise arithmetic

Table A.4 gives the elementwise arithmetic operations, where only mod takes no engine path.

Operation	M1 (A13)	M2 (A14)	M3 (A15)	M4, M5 (A16, A17)	Note
`abs`	Native	Native	Native	Native
`add`	Native	Native	Native	Native	Constant and tensor forms
`sub`	Native	Native	Native	Native	Lowered to add of a negated constant
`mul`	Native	Native	Native	Native	Constant and tensor forms
`real_div`	Native	Native	Native	Native	General divide
`floor_div`	Native	Native	Native	Native	Lookup-table assisted
`pow`	Native	Native	Native	Native
`square`	Native	Native	Native	Native
`sqrt`	Native	Native	Native	Native	Lookup-table activation
`rsqrt`	Native	Native	Native	Native	Lookup-table
`inverse`	Native	Native	Native	Native	Reciprocal lookup-table
`maximum`	Native	Native	Native	Native
`minimum`	Native	Native	Native	Native
`mod`	No path	No path	No path	No path	Decompose on host
`cumsum`	Native	Native	Native	Native	Native through a curated runtime path, not the standard compile path; M1 measured

Table A.4. Elementwise arithmetic operations by device family.

Comparison and logical

Table A.5 gives the comparison and logical operations. The two-operand logical operations (logical_and, logical_or, logical_xor) take no engine path and decompose on the host.

Operation	M1 (A13)	M2 (A14)	M3 (A15)	M4, M5 (A16, A17)	Note
`equal`	Native	Native	Native	Native
`not_equal`	Native	Native	Native	Native
`greater`	Native	Native	Native	Native
`greater_equal`	Native	Native	Native	Native
`less`	Native	Native	Native	Native
`less_equal`	Native	Native	Native	Native
`logical_not`	Native	Native	Native	Native
`select`	Native	Native	Native	Native	The where operation
`logical_and`	No path	No path	No path	No path	Decompose through minimum or multiply on host
`logical_or`	No path	No path	No path	No path	Decompose through maximum on host
`logical_xor`	No path	No path	No path	No path	Decompose through not-equal on host

Table A.5. Comparison and logical operations by device family.
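The decompositions named in the notes for the three No path rows can be checked on 0/1 tensors. The sketch below uses plain Python lists in place of tensors and assumes booleans are encoded as 0 and 1; the function names are hypothetical, and each body mirrors one note in Table A.5.

```python
def and_via_minimum(a, b):
    """logical_and as an elementwise minimum."""
    return [min(x, y) for x, y in zip(a, b)]


def and_via_multiply(a, b):
    """logical_and as an elementwise multiply."""
    return [x * y for x, y in zip(a, b)]


def or_via_maximum(a, b):
    """logical_or as an elementwise maximum."""
    return [max(x, y) for x, y in zip(a, b)]


def xor_via_not_equal(a, b):
    """logical_xor as an elementwise not-equal, re-encoded as 0 or 1."""
    return [1 if x != y else 0 for x, y in zip(a, b)]


# All four input combinations, as 0/1 values.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
```

Each replacement is built from operations that Tables A.4 and A.5 mark as Native (minimum, maximum, mul, not_equal), which is why the rewrite is worth doing before the graph reaches the compiler.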

Activations

Table A.6 gives the activation operations, native on every family and most lookup-table backed.

Operation	M1 (A13)	M2 (A14)	M3 (A15)	M4, M5 (A16, A17)	Note
`relu`	Native	Native	Native	Native
`relu6`	Native	Native	Native	Native	Lookup-table
`leaky_relu`	Native	Native	Native	Native	Lookup-table
`prelu`	Native	Native	Native	Native	Per-channel slope; native at rank 3 or above
`clamped_relu`	Native	Native	Native	Native	Lookup-table
`thresholded_relu`	Native	Native	Native	Native	Lookup-table
`threshold`	Native	Native	Native	Native	Lookup-table
`clip`	Native	Native	Native	Native	The clamp operation
`elu`	Native	Native	Native	Native	Lookup-table
`sigmoid`	Native	Native	Native	Native	Includes the hard variant
`sigmoid_hard`	Native	Native	Native	Native	Lookup-table
`tanh`	Native	Native	Native	Native	Lookup-table
`scaled_tanh`	Native	Native	Native	Native	Lookup-table
`gelu`	Native	Native	Native	Native	Lookup-table approximation
`silu`	Native	Native	Native	Native	Also named swish; lookup-table
`softmax`	Native	Native	Native	Native	Lookup-table
`softplus`	Native	Native	Native	Native	Lookup-table
`softplus_parametric`	Native	Native	Native	Native	Lookup-table
`softsign`	Native	Native	Native	Native	Lookup-table
`erf`	Native	Native	Native	Native	Lookup-table
`exp`	Native	Native	Native	Native	Lookup-table
`exp2`	Native	Native	Native	Native	Lookup-table
`log`	Native	Native	Native	Native	Lookup-table
`sign`	Native	Native	Native	Native	Lookup-table
`ceil`	Native	Native	Native	Native	Lookup-table
`floor`	Native	Native	Native	Native	Lookup-table
`round`	Native	Native	Native	Native	Round-to-nearest lookup-table

Table A.6. Activation operations by device family.

Reduction

Table A.7 gives the reduction operations, where reduce_argmin is gated and reduce_prod takes no path.

Operation	M1 (A13)	M2 (A14)	M3 (A15)	M4, M5 (A16, A17)	Note
`reduce_sum`	Native	Native	Native	Native	Reduced axis at or above 192 takes the transpose route, at or above 384 from the M3
`reduce_mean`	Native	Native	Native	Native
`reduce_max`	Native	Native	Native	Native
`reduce_min`	Native	Native	Native	Native
`reduce_sum_square`	Native	Native	Native	Native	The fused square-and-reduce form is M2 onward; the M1 emits an extra fp16 round
`reduce_l1_norm`	Native	Native	Native	Native
`reduce_l2_norm`	Native	Native	Native	Native
`reduce_log_sum`	Native	Native	Native	Native	Lookup-table assisted
`reduce_log_sum_exp`	Native	Native	Native	Native	Lookup-table assisted
`reduce_argmax`	Native	Native	Native	Native	Per-axis argmax on all families
`reduce_argmin`	Bridge	Bridge	Native	Native	Per-axis argmin; the intermediate-language route is gated to the M3, the bridge route works on the M1 and M2
`reduce_prod`	No path	No path	No path	No path	Decompose through log-sum-exp on host

Table A.7. Reduction operations by device family.

The whole-tensor argument reductions global_argmax and global_argmin follow the same gate as reduce_argmin: native on the intermediate-language route from the M3, reachable through the bridge on the M1 and M2.
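The two family-dependent rules in Table A.7 reduce to a threshold comparison and a gate. A minimal sketch follows, with thresholds taken from Table A.1 and the M4 value assumed to match the A16 column. The function names and the route labels `"direct"`, `"transpose"`, `"bridge"`, and `"native"` are illustrative, not compiler terms.

```python
# Reduction-to-transpose threshold per family, from Table A.1.
THRESHOLD = {"M1": 192, "M2": 192, "M3": 384, "M4": 384, "M5": 384}

# Families where reduce_argmin is Native on the intermediate-language route.
ARGMIN_NATIVE = ("M3", "M4", "M5")


def reduce_sum_route(family, axis_length):
    """Pick the reduce_sum route for a reduced axis of the given length."""
    if axis_length >= THRESHOLD[family]:
        return "transpose"
    return "direct"


def reduce_argmin_route(family):
    """Pick the reduce_argmin route for the family."""
    return "native" if family in ARGMIN_NATIVE else "bridge"
```

A reduced axis of length 192 therefore takes the transpose route on the M1 and M2 but stays on the direct route from the M3.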

Data movement and structural

Table A.8 gives the data-movement and structural operations, the largest class, spanning reshape, slice, gather, scatter, and the space-channel set.

Operation	M1 (A13)	M2 (A14)	M3 (A15)	M4, M5 (A16, A17)	Note
`reshape`	Native	Native	Native	Native	Metadata edit
`reshape_like`	Native	Native	Native	Native
`expand_dims`	Native	Native	Native	Native
`squeeze`	Native	Native	Native	Native
`flatten2d`	Native	Native	Native	Native
`transpose`	Native	Native	Native	Native	Capped by the maximum transpose extent, 16384 through the M3, 65536 on the M4 and M5
`concat`	Native	Native	Native	Native	DMA
`split`	Native	Native	Native	Native
`stack`	Native	Native	Native	Native
`pad`	Native	Native	Native	Native	Constant pad is native everywhere; symmetric and reflect pad are texture-gated, software on the M1 and native from the M2
`slice_by_size`	Native	Native	Native	Native	M1 and M2 nonzero width-offset routes through a fixed-point crop-DMA that saturates a magnitude above 4094 to infinity; clean from the M3
`slice_by_index`	Bridge	Bridge	Bridge	Bridge	Static-offset slice folds into the descriptor inside a graph
`slice_update`	Native	Native	Native	Native
`reverse`	Native	Native	Native	Native	Measured on the M1
`reverse_sequence`	No path	No path	No path	No path	Decompose on host
`tile`	Native	Native	Native	Native	Factors of 2, 3, 4, and 8
`gather`	Native	Native	Native	Native	M1 software path valid only for a batch of one and a depth of one; the hardware path is M2 onward
`gather_along_axis`	Native	Native	Native	Native	Same M1 envelope caveat
`gather_nd`	Bridge	Native	Native	Native	M1 software envelope only (batch one, depth one, three-element index channel); native texture path from the M2
`scatter`	No path	No path	No path	No path	Decompose on host
`scatter_along_axis`	No path	No path	No path	No path	Decompose on host
`scatter_nd`	No path	No path	No path	No path	Decompose on host
`depth_to_space`	Native	Native	Native	Native	The pixel-shuffle operation
`space_to_depth`	Native	Native	Native	Native	The pixel-unshuffle operation
`pixel_shuffle`	Native	Native	Native	Native	Engine-lane reorganization, factors of 2, 3, 4, and 8; z-factor must be 1
`pixel_unshuffle`	Native	Native	Native	Native	Engine-lane reorganization; input dimension divisible by the factor
`space_to_batch`	Native	Native	Native	Native	Factor in 2, 3, 4, 8; batch cap 4096 on pre-M1 targets, 65536 from the M1
`batch_to_space`	Native	Native	Native	Native	Inverse of the above
`identity`	Native	Native	Native	Native	Aliases a cast or no-op
`fill`	Native	Native	Native	Native	Constant tensor producer
`fill_like`	Native	Native	Native	Native	Constant tensor producer
`range_1d`	Bridge	Bridge	Bridge	Bridge	M1 code generation rejects it; host-precompute the constant
`crop`	Native	Native	Native	Native	Slice and crop, distinct from the texture crop-resize
`band_part`	No path	No path	No path	No path	Mask on host
`non_zero`	No path	No path	No path	No path	Data-dependent shape
`one_hot`	No path	No path	No path	No path	Decompose through an identity gather on host
`shape`	No path	No path	No path	No path	Static-shape graphs only
`sliding_windows`	No path	No path	No path	No path	Decompose on host

Table A.8. Data-movement and structural operations by device family.
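The note on the one_hot row, decomposing through an identity gather, can be illustrated with nested lists standing in for tensors. This is a sketch of the general technique under that assumption, not the host code the compiler toolchain uses; all three function names are hypothetical.

```python
def identity(depth):
    """Build a depth x depth identity matrix as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(depth)] for i in range(depth)]


def gather_rows(table, indices):
    """Gather along axis 0: pick one row of `table` per index."""
    return [table[i] for i in indices]


def one_hot(indices, depth):
    """one_hot expressed as a gather from an identity matrix."""
    return gather_rows(identity(depth), indices)
```

Here `one_hot([2, 0], 3)` gives `[[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]`. The identity matrix is a constant, so only the gather remains at run time, subject to the gather caveats in Table A.8.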

Image, resize, and texture

Table A.9 gives the image, resize, and texture operations, gated to the texture engine from the A14 with software fallbacks on the M1.

Operation	M1 (A13)	M2 (A14)	M3 (A15)	M4, M5 (A16, A17)	Note
`resize`	Bridge	Native	Native	Native	Texture-gated; M1 takes a software transpose fallback with different rounding, native from the M2
`resize_bilinear`	Bridge	Native	Native	Native	Software fallback on the M1
`resize_nearest_neighbor`	Bridge	Native	Native	Native	Software fallback on the M1
`upsample_bilinear`	Bridge	Native	Native	Native	Software fallback on the M1
`upsample_nearest_neighbor`	Bridge	Native	Native	Native	Software fallback on the M1
`crop_resize`	Family-gated	Native	Native	Native	Texture engine, M2 onward; no host substitution wired
`resample`	Family-gated	Native	Native	Native	Texture engine, M2 onward
`affine`	Family-gated	Native	Native	Native	Texture engine, M2 onward
`pixel_buffer_to_tensor`	Bridge	Bridge	Bridge	Bridge	Four-character-code image input; an entitlement gate, not a chip gate
`tensor_to_pixel_buffer`	Bridge	Bridge	Bridge	Bridge	Compiler-internal
`gamma`	Bridge	Bridge	Bridge	Bridge	Image-signal operation, compiler-internal
`degamma`	Bridge	Bridge	Bridge	Bridge	Image-signal operation, compiler-internal

Table A.9. Image, resize, and texture operations by device family.

Quantization and dtype

Table A.10 gives the quantization and dtype operations, with the per-family streaming gates carried in the note column.

Operation	M1 (A13)	M2 (A14)	M3 (A15)	M4, M5 (A16, A17)	Note
`cast`	Native	Native	Native	Native	fp16 to fp32 and bool native on the M1; cast to int32 is rejected on the M1
`quantize`	Native	Native	Native	Native	Not texture-gated
`dequantize`	Native	Native	Native	Native
`const`	Bridge	Bridge	Bridge	Bridge	Folded at compile, not a standalone code-generated operation
`constexpr_affine_dequantize`	Bridge	Bridge	Bridge	Bridge	int4 lookup-table streams from the M1; int8 and affine fold to fp16 on the M1 and stream from the M2 (A14)
`constexpr_lut_to_dense`	Native	Native	Native	Native	Palette and lookup-table stream; int4 lookup-table streams natively from the M1
`constexpr_lut_to_sparse`	Bridge	Bridge	Bridge	Bridge	Folded constant; sparse stream from the M3
`constexpr_blockwise_shift_scale`	Bridge	Bridge	Native	Native	Blockwise stream from the M3; folds to fp16 on the M1 and M2
`constexpr_sparse_blockwise_shift_scale`	Bridge	Bridge	Native	Native	Sparse and blockwise stream from the M3
`constexpr_sparse_to_dense`	Native	Native	Native	Native	Sparse streams natively from the M1
`constexpr_cast`	No path	No path	No path	No path	Rejected on every family

Table A.10. Quantization and dtype operations by device family.
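Where a note says a quantized constant folds to fp16, the weights are expanded ahead of time instead of being streamed in compressed form. The sketch below shows what such a fold could look like, assuming the usual affine rule `scale * (q - zero_point)` and using the standard library's half-precision `struct` format to model the fp16 rounding. The function names are hypothetical and the rule is an assumption about the operation, not something this appendix states.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def fold_affine_to_fp16(quantized, scale, zero_point):
    """Expand quantized integers to fp16 values (assumed affine rule)."""
    return [to_fp16(scale * (q - zero_point)) for q in quantized]


# uint8 weights with a mid-range zero point (illustrative values).
folded = fold_affine_to_fp16([0, 128, 255], scale=0.5, zero_point=128)
```

The folded list here is `[-64.0, 0.0, 63.5]`, all of which are exactly representable in fp16. Values such as 0.1 are not, which is the rounding cost a fold accepts in exchange for running on a family that cannot stream the compressed form.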

Attention, control flow, and state

Table A.11 gives the control-flow and state operations, where the state pair is native and the control-flow operations are compiler-internal. The fused attention operation itself, scaled_dot_product_attention, is listed in Table A.2.

Operation	M1 (A13)	M2 (A14)	M3 (A15)	M4, M5 (A16, A17)	Note
`read_state`	Native	Native	Native	Native	Stateful; needs the inout tensor-descriptor plumbing for a key-value cache
`write_state`	Native	Native	Native	Native	Stateful
`tensor_buffer_to_tensor`	Bridge	Bridge	Bridge	Bridge	Ring and streaming buffer mover, reachable inside a stateful graph
`tensor_to_tensor_buffer`	Bridge	Bridge	Bridge	Bridge	Compiler-internal
`circular_buffer_to_tensor`	Bridge	Bridge	Bridge	Bridge	Ring-buffer reader
`tensor_to_circular_buffer`	Bridge	Bridge	Bridge	Bridge	Ring-buffer writer
`cond`	Bridge	Bridge	Bridge	Bridge	No standalone code generation; flatten on host
`while_loop`	Bridge	Bridge	Bridge	Bridge	No standalone code generation; unroll on host
`call`	Bridge	Bridge	Bridge	Bridge	Inlined

Table A.11. Attention, control-flow, and state operations by device family.

Recurrent cells

Table A.12 gives the recurrent-cell operations. None takes an engine path; each unrolls on the host.

Operation	M1 (A13)	M2 (A14)	M3 (A15)	M4, M5 (A16, A17)	Note
`gru`	No path	No path	No path	No path	Unroll to a convolution, matmul, and activation graph on host
`lstm`	No path	No path	No path	No path	Unroll on host
`rnn`	No path	No path	No path	No path	Unroll on host

Table A.12. Recurrent-cell operations by device family.
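Unrolling replaces the recurrent cell with one copy of its body per time step, so the graph that reaches the compiler contains only operations the earlier tables mark as Native. The sketch below emits such a flat operation list for a simple tanh cell, assuming the textbook update h_t = tanh(linear(x_t) + matmul(h_{t-1}, W_h)). The function, the tuple layout, and the tensor names are hypothetical.

```python
def unroll_rnn(steps):
    """Emit (op, inputs, output) tuples for a tanh RNN unrolled over `steps`."""
    ops = []
    for t in range(steps):
        # Input projection with bias; `linear` is Native in Table A.2.
        ops.append(("linear", (f"x{t}", "W_x", "b"), f"a{t}"))
        # Recurrent projection of the previous hidden state (h0 is the initial state).
        ops.append(("matmul", (f"h{t}", "W_h"), f"r{t}"))
        # Sum of both projections; `add` is Native in Table A.4.
        ops.append(("add", (f"a{t}", f"r{t}"), f"s{t}"))
        # Nonlinearity producing the next hidden state; `tanh` is Native in Table A.6.
        ops.append(("tanh", (f"s{t}",), f"h{t + 1}"))
    return ops
```

The unrolled graph grows linearly with sequence length: four operations per step in this sketch, so `unroll_rnn(3)` emits twelve.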

Trigonometric, special, and math

Table A.13 gives the trigonometric, special, and math operations, where sin and cos go native from the M3 and atan is the one M1-native primitive.

Operation	M1 (A13)	M2 (A14)	M3 (A15)	M4, M5 (A16, A17)	Note
`sin`	Family-gated	Family-gated	Native	Native	Native from the M3; the M1 and M2 use a host polynomial
`cos`	Family-gated	Family-gated	Native	Native	Native from the M3; the M1 and M2 use a host polynomial
`atan`	Native	Native	Native	Native	The one trigonometric primitive native on the M1
`tan`	No path	No path	No path	No path	Decompose through a sin and cos identity on host
`asin`	No path	No path	No path	No path	Host decomposition
`acos`	No path	No path	No path	No path	Host decomposition
`atanh`	No path	No path	No path	No path	Host decomposition
`asinh`	No path	No path	No path	No path	Host decomposition
`acosh`	No path	No path	No path	No path	Host decomposition
`sinh`	No path	No path	No path	No path	Host decomposition
`cosh`	No path	No path	No path	No path	Host decomposition
`cross_product`	Bridge	Bridge	Bridge	Bridge	Reachable through the bridge route, measured on the M1
`cost_volume`	Bridge	Bridge	Bridge	Bridge	Reachable through the bridge route, measured on the M1
`matrix_decomposition`	Bridge	Bridge	Bridge	Bridge	No observed code generation

Table A.13. Trigonometric, special, and math operations by device family.
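The tan row decomposes through the identity tan(x) = sin(x) / cos(x), and where the two factors run depends on the family gate on sin and cos. A minimal sketch, with `math.sin` and `math.cos` standing in for whichever sin and cos implementation the family provides; the function names and route labels are hypothetical.

```python
import math

# Families where sin and cos are Native, from Table A.13.
NATIVE_SIN_COS = ("M3", "M4", "M5")


def sin_cos_route(family):
    """Say where sin and cos run for the family."""
    return "native" if family in NATIVE_SIN_COS else "host polynomial"


def tan_decomposed(x):
    """tan through the sin and cos identity; undefined where cos(x) is zero."""
    return math.sin(x) / math.cos(x)
```

Away from the poles of tan the decomposition agrees with a direct evaluation to floating-point rounding, for example at x = 0.5.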

Detection and sampling

Table A.14 gives the detection and sampling operations: the rank and sort bridge is gated to the M2, random_uniform goes native from the M3, and the remaining random and tensor-list operations run off-engine.

Operation	M1 (A13)	M2 (A14)	M3 (A15)	M4, M5 (A16, A17)	Note
`non_maximum_suppression`	Bridge	Bridge	Bridge	Bridge	Reachable only with a CPU or GPU backend in the mask, where it offloads to that backend rather than the engine; an engine-only mask reports not supported on any backend
`topk`	Family-gated	Native	Native	Native	Rank and sort bridge, M2 onward; the validator is callable on the M1 but code generation rejects it
`argsort`	Family-gated	Native	Native	Native	Sort family, M2 onward; code-generation-rejected on the M1
`random_uniform`	Bridge	Bridge	Native	Native	Hardware generator from the M3; host random below it
`random_bernoulli`	No path	No path	No path	No path	Host random
`random_categorical`	No path	No path	No path	No path	Host random
`random_normal`	No path	No path	No path	No path	Host random
`list_gather`	No path	No path	No path	No path	Tensor-list operation
`list_length`	No path	No path	No path	No path	Tensor-list operation
`list_read`	No path	No path	No path	No path	Tensor-list operation
`list_scatter`	No path	No path	No path	No path	Tensor-list operation
`list_write`	No path	No path	No path	No path	Tensor-list operation
`make_list`	No path	No path	No path	No path	Tensor-list operation

Table A.14. Detection and sampling operations by device family.

Apple Neural Engine: A Complete Guide