12. Across the chip family

An M(n) chip has the H(n+12) ANE architecture, so M1 is H13 and M5 is H17. A network that compiles and runs on one generation compiles and runs on every newer one, because one compiler binary builds every target and only a per-target data table changes. The single property that varies from chip to chip is fp16 numerics, and apart from one magnitude-gated slice saturation it varies by at most one unit in the last place. Choose the target generation from the network's operation set, then verify per chip only the cancellation-sensitive reductions and the width-axis slices.

Naming rule

The M-series engine and the contemporaneous A-series engine are the same architecture under two product names, offset by a fixed amount. An M(n) chip has the H(n+12) ANE architecture, written compactly as $M(n) \to H(n+12)$. M1 is H13, M2 is H14, M3 is H15, M4 is H16, and M5 is H17. The A-series anchor carries the same number as the architecture: the A13 and the M1 share H13, and the A17 and the M5 share H17.
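The naming rule is a fixed offset, so it can be written as a pair of one-line helpers. The following is a minimal sketch; the function names `ane_arch_from_m` and `m_from_ane_arch` are hypothetical, and the bounds 1 through 5 are simply the generations this chapter covers.

```c
#include <assert.h>

/* Naming rule from the text: an M(n) chip carries the H(n+12) engine.
 * The range 1..5 is the set of generations covered in this chapter. */
int ane_arch_from_m(int m_generation)
{
    assert(m_generation >= 1 && m_generation <= 5);
    return m_generation + 12;
}

/* Inverse mapping: the M generation that carries a given H number. */
int m_from_ane_arch(int h_number)
{
    assert(h_number >= 13 && h_number <= 17);
    return h_number - 12;
}
```

Calling `ane_arch_from_m(1)` gives 13 and `ane_arch_from_m(5)` gives 17, the two measured ends of the family. Variant suffixes such as the `g` in h13g or the `s` in H17s are not modelled here.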

Table 12.1 maps the family by the core count and clock that scale across the generations. The unmeasured M3 and M4 rows are decompile-derived from the device tables.

Chip	A-series anchor	ANE architecture	NE cores	Clock
M1	A13	H13	4 (base), 8 (Pro and Max)	~1.14 GHz
M2	A14	H14g	4 (base), 8 (Pro and Max), 32 (Max-class)	measured
M3	A15	H15	4 (base), predicted	predicted
M4	A16	H16	4 (base), predicted	predicted
M5	A17	H17s	16	~1.89 GHz

Table 12.1. The M-series chips with their A-series anchor, engine architecture string, core count, and clock.

Three rows are measured on physical silicon: M1 (H13), M2 (a Pro reporting H14g), and M5 (H17s); the M3 and M4 rows are decompile-derived from the per-family device tables and not individually measured, with the A15/M3 generation the one rail that remains unmeasured. The M2 measurement closes the middle of the sequence: a seeded classifier trains to the M1 number to the digit, the four fp16 axes match the M1 bit for bit, and a watt-complete device map reproduces the M1 and M5 shape. The A14 is thus the A13's numerical twin on every axis measured. The fuller silicon-to-target table, the board-type sequence, and the full set of 28 compiler targets are the subject of a later chapter [AppleANE].

The rule reads directly off the device tables and is confirmed on the measured parts. The live M1 reports the architecture string h13g, the Pro and Max variant of H13, and the M5 compiles to the H17 target and resolves the H17s variant on disk. The system frameworks corroborate the sequence: the on-device video upscaler includes exactly the five targets H13 through H17, which are the five Mac generations M1 through M5. The target a network needs can thus be named by generation, and the mapping holds without probing the silicon.

What holds across the family

A network that compiles and runs on one generation compiles and runs on every newer one, and on the older ones as well when it avoids the gated operations. The compiler is a single binary that constructs any target on demand, so the program format, operation legality, and datapath are shared, and only a per-target data table changes underneath them.

The operation limits are properties of the family, not of one chip. The operations with no hardware path on any current part, such as the product reduction, scatter family, and recurrent cells, are absent on every generation from the M1 through the M5. Gated operations arrive at a known generation and stay: the texture-engine sampler operations turn on at the A14, and native sin and cos turn on at the A15. A network that avoids them runs everywhere, and a network that uses them runs from its unlock generation forward. No operation the M1 can compute is missing on the M5, because each newer engine is an operation superset of the one before it.

Each generation adds one group of capabilities over the one before it, then stops. Table 12.2 names that group for each engine generation, read off the per-target capability bytes.

Generation	Capability added
A13 (M1)	three-dimensional convolution, the sixteen-deep kernel, native softmax, layer norm, all reductions, and fused attention
A14 (M2)	the texture-engine samplers (resize, crop-resize, resample, affine, hardware gather) and cross-die addressing
A15 (M3)	native sin and cos, the dropout and random path, and global argument-min and argument-max
A16 (M4)	the tensor dimension limit rises from 16384 to 65536, and the fp16 kernel-width ceiling rises from 13 to 15
A17 (M5)	no operation over A16; the NE-core count scales

Table 12.2. The single capability each engine generation adds, read off the per-target capability bytes.
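Because gated operations arrive at a known generation and stay, the floor generation of a network is the latest unlock generation among its operations. The sketch below illustrates that reduction under stated assumptions: the `op_class` enum and both function names are hypothetical and are not compiler identifiers, and only the classes named in the text are modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operation classes, named after the capabilities in Table 12.2. */
typedef enum {
    OP_BASELINE,         /* anything the A13 (M1) engine already computes */
    OP_TEXTURE_SAMPLER,  /* resize, crop-resize, resample, affine, hardware gather */
    OP_SIN_COS,          /* native sin and cos */
    OP_WIDE_KERNEL,      /* fp16 kernel width above 13, up to 15 */
    OP_NO_HARDWARE_PATH  /* product reduction, scatter family, recurrent cells */
} op_class;

/* A-series generation at which a class unlocks, or -1 if no part runs it. */
int unlock_generation(op_class c)
{
    switch (c) {
    case OP_BASELINE:         return 13;
    case OP_TEXTURE_SAMPLER:  return 14;
    case OP_SIN_COS:          return 15;
    case OP_WIDE_KERNEL:      return 16;
    case OP_NO_HARDWARE_PATH: return -1;
    }
    return -1;
}

/* Floor generation of a network: the maximum unlock generation over its
 * operations, or -1 as soon as one operation has no hardware path. */
int floor_generation(const op_class *ops, size_t count)
{
    int floor = 13;
    assert(ops != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        int g = unlock_generation(ops[i]);
        if (g < 0) {
            return -1;
        }
        if (g > floor) {
            floor = g;
        }
    }
    return floor;
}
```

A network that mixes baseline operations, a texture-engine sampler, and native sin and cos gets a floor of 15, so it targets the A15 (M3) generation and runs from there forward.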

The A17 and A18 add no operation over the A16: identical dimension limits, the same texture engine, the same legal operation set, differing only in NE-core count, which scales throughput rather than legality. The dimension limit is not a single number per chip: on the M2 the spatial and contraction extents cap at 16384 while the channel axis caps at 65536, exactly four times the spatial cap. The limit thus belongs to the axis an operation uses rather than to the tensor.
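Since the limit belongs to the axis, a legality check takes the axis as an argument. The following sketch encodes only the two M2 caps quoted above; the `axis_kind` enum and function names are hypothetical, and treating the cap as inclusive is an assumption the text does not settle.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical axis roles, matching the three the text distinguishes. */
typedef enum { AXIS_SPATIAL, AXIS_CONTRACTION, AXIS_CHANNEL } axis_kind;

/* Per-axis extent caps quoted in the text for the M2. */
long m2_axis_cap(axis_kind axis)
{
    return (axis == AXIS_CHANNEL) ? 65536L : 16384L;
}

/* Assumption: an extent equal to the cap is still legal. */
bool m2_extent_legal(axis_kind axis, long extent)
{
    assert(extent > 0);
    return extent <= m2_axis_cap(axis);
}
```

With these caps a 65536-wide channel axis passes on the M2 while a 16385-wide spatial axis does not, which is the distinction a single per-chip number would hide.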

The newer parts scale the core count and the clock; they do not change the programming model. The core count runs 4 on the M1, 8 on its Pro and Max variant, and 16 on the M5, and the operating clock rises from roughly 1.14 GHz to roughly 1.89 GHz across that span. The fp16 datapath, the wide accumulator, and the form of the roofline hold across the family unchanged. The measured M5 confirms the scaling: about 19.6 fp16 TFLOP/s on the matmul slope and about 14.3 fp16 TFLOP/s on the convolution peak, both from the fused-chain probe on the same network with no source change, against the roofline saturation peak of 18.8 TFLOP/s in Chapter 9. A single large matmul above the dispatch floor runs at about 9.5 fp16 TFLOP/s, the M5 analogue of the M1's 4.8, and the engine streams weights at about 145 GB/s over two DRAM read channels, near three times the M1's 51 GB/s. Those peaks are set by the larger core count and the higher clock, not by any change to the datapath, so core count and clock are the terms on which the generations compare. The working-set threshold moves with the silicon, from near 2 MB on the M1 to a measured 4.72 MB on the M5, scaling with the larger on-chip memory of the 16-core part.
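As a worked check, the M5-to-M1 ratios follow from the figures quoted in this section alone. The M1 working-set figure is the approximate 2 MB, so that ratio is approximate as well.

$$\frac{145\ \text{GB/s}}{51\ \text{GB/s}} \approx 2.84, \qquad \frac{9.5\ \text{TFLOP/s}}{4.8\ \text{TFLOP/s}} \approx 1.98, \qquad \frac{4.72\ \text{MB}}{2\ \text{MB}} \approx 2.36, \qquad \frac{1.89\ \text{GHz}}{1.14\ \text{GHz}} \approx 1.66$$

The first ratio is the "near three times" weight-stream figure, the second is the single-matmul speedup, the third is the working-set growth, and the last is the clock ratio on its own.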

One thing that varies

The single property that differs from chip to chip is fp16 numerics. The accumulator width is uniform across every engine and the compiler text is identical, so a cross-chip value difference can only come from a data-selected codegen route that changes the order in which fp16 operations combine. That surface is limited: most route changes are a numerical no-op, because the wide accumulator absorbs the reordering, and the rest, apart from the slice saturation described below, amount to at most one unit in the last place, set by tiling-boundary alignment.

The cross-generation measurement fixes the scale. The same seeded convolutional classifier trained on the M1, M2, and M5 reaches 0.9080, 0.9080, and 0.9070 test accuracy, each deterministic across repeated runs, a difference of one test sample in a thousand between the ends. The M2 is exactly on the M1 number, which puts the entire fp16 training drift after the A14 generation rather than spread across the family: the M1 and the M2 are numerical twins, and the gap opens only at the M5. That gap is the drift of unit-in-the-last-place fp16 differences compounding over a few hundred training steps, real but negligible. The cross-silicon predictions extracted from the device tables were confirmed on the M5: all ten, covering throughput, the working-set threshold, operation limits, texture engine, and fp16 slice behavior, held on the real part. The one finite-to-infinity axis, a slice saturation that occurs on the M1, takes the non-saturating route on the M5, as predicted.

fp16 divergence axes

That data-selected codegen surface reduces to four axes. Three of them are at most a unit in the last place, and one is a finite-to-infinity saturation that occurs on the older parts. Table 12.3 names the four axes, the codegen route each selects, and the bounded magnitude of each.

Axis	Mechanism	Effect	Magnitude
Slice saturation	A width-axis slice with a nonzero offset routes through a fixed-point crop that multiplies by sixteen	A source value above 4094 saturates to plus or minus infinity, since 4094 times sixteen is the 65504 fp16 ceiling	finite to infinity
Reduction then square fusion	A reduction immediately followed by a square or multiply can fuse, removing one intermediate rounding step	Drops one fp16 rounding step	at most one unit in the last place
Reduction route	A reduction selects a transpose route or a reshape route by an extent threshold, 192 on the older parts and 384 from the M3	Reorders partial sums	numerical no-op, the wide accumulator absorbs it
Tiling granularity	The partial-sum tile alignment is set by the patch and core-count fields, granularity near 128	A sum off a tile boundary loses one rounding increment	at most one unit in the last place

Table 12.3. The four fp16 cross-chip divergence axes, the codegen route each selects, and the bounded magnitude of each.
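The reduction-route row can be read as a predicate: two parts take different routes only when an extent falls between their thresholds. The sketch below illustrates this under two assumptions that the text does not state: that the route switches when the extent is greater than or equal to the threshold, and that "older parts" means the M1 and M2. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Extent threshold from Table 12.3: 192 on the older parts, 384 from the M3. */
int reduction_route_threshold(int m_generation)
{
    assert(m_generation >= 1 && m_generation <= 5);
    return (m_generation >= 3) ? 384 : 192;
}

/* Assumption: the route switches when extent >= threshold.
 * Returns true when the two generations pick different routes. */
bool reduction_routes_differ(int extent, int m_a, int m_b)
{
    bool above_a = extent >= reduction_route_threshold(m_a);
    bool above_b = extent >= reduction_route_threshold(m_b);
    return above_a != above_b;
}
```

Under these assumptions an extent of 256 routes differently on the M1 and the M5, while extents of 100 or 512 route the same way on both. Per the table, even a differing route is a numerical no-op, so the predicate identifies where to look, not where a difference must appear.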

The saturation axis is the only one that changes a finite value into an infinity, and it is magnitude-gated: it triggers only when a width-offset slice holds a value above 4094. Measured on the M1 the threshold is exact: a width-offset slice is finite at 4094 and goes to plus or minus infinity at 4100, while the zero-offset control stays finite even at 60000. The M2 saturates bit-identically to the M1, which corrects the earlier reading that the non-saturating route arrives at the A14: the saturation persists through the A14 and the M5 is the part that takes the non-saturating route. The reduction-then-square fusion never manifests on silicon: the M1, M2, and M5 all measure the unfused result, so that axis is uniform across the family. A divergence is thus predictable from a small set of fields without running every chip: a reduction or normalization denominator can differ by at most one unit in the last place where the tiling or route fields differ. A width-offset slice has saturation risk only when its values can exceed 4094.
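The saturation condition is small enough to check statically. The following sketch derives the 4094 bound from the fp16 ceiling and the factor of sixteen, then flags a slice only when it has a nonzero width offset and can hold a larger magnitude. The function and macro names are hypothetical, and the check is conservative: it does not model which parts take the non-saturating route.

```c
#include <assert.h>
#include <stdbool.h>

#define FP16_MAX         65504.0f                 /* largest finite fp16 value */
#define CROP_SCALE       16.0f                    /* fixed-point crop multiplier */
#define SATURATION_BOUND (FP16_MAX / CROP_SCALE)  /* 4094, exact in float */

/* True when a width-axis slice could saturate on a part that takes the
 * fixed-point crop route: nonzero offset and a magnitude above the bound. */
bool slice_saturation_risk(int width_offset, float max_abs_value)
{
    assert(max_abs_value >= 0.0f);
    return width_offset != 0 && max_abs_value > SATURATION_BOUND;
}
```

This reproduces the measured M1 cases: a width-offset slice is clear at 4094 and flagged at 4100, and the zero-offset control is clear even at 60000.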

Developer policy

Choose the generation a network requires from its operation set, then let the program run across every part at and above that generation. Verify per chip only the numerics that can move: the cancellation-sensitive reductions, the variance and normalization denominators, and any width-axis slice whose values can exceed the saturation bound. Everything else is portable by construction: the same source, the same operation legality, faster on the newer silicon by the core and clock scaling.
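A per-chip check of the numerics that can move comes down to comparing fp16 outputs in units in the last place. The sketch below uses the standard ordered-integer mapping of IEEE 754 binary16 bit patterns; the function names are hypothetical, and how the raw 16-bit outputs are captured from each chip is outside its scope.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

bool fp16_is_nan(uint16_t bits) { return (bits & 0x7FFF) > 0x7C00; }
bool fp16_is_inf(uint16_t bits) { return (bits & 0x7FFF) == 0x7C00; }

/* Map a binary16 bit pattern to a signed integer ordered like the values.
 * Both +0 and -0 map to 0. */
int32_t fp16_ordered_key(uint16_t bits)
{
    int32_t magnitude = bits & 0x7FFF;
    return (bits & 0x8000) ? -magnitude : magnitude;
}

/* Distance in representable fp16 steps between two non-NaN values. */
int32_t fp16_ulp_distance(uint16_t a, uint16_t b)
{
    assert(!fp16_is_nan(a) && !fp16_is_nan(b));
    int32_t d = fp16_ordered_key(a) - fp16_ordered_key(b);
    return d < 0 ? -d : d;
}

/* True when two per-chip outputs agree within one unit in the last place.
 * A finite value against an infinity never passes, so a slice saturation
 * is reported as a failure rather than as a one-step difference. */
bool fp16_within_one_ulp(uint16_t a, uint16_t b)
{
    if (fp16_is_inf(a) != fp16_is_inf(b)) {
        return false;
    }
    return fp16_ulp_distance(a, b) <= 1;
}
```

Applied to the reduction and normalization outputs of the same network on two parts, `fp16_within_one_ulp` passes anything inside the bound this chapter describes and fails a finite-to-infinity change such as 0x7BFF against 0x7C00.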

Compiling for a target generation

The naming rule lets a network name its target by generation rather than by probing the silicon. The compiler constructs any target from its per-chip table, so a developer compiles a program for the oldest generation it must support, then runs it unchanged on every part at and above that generation. A static estimate against a target reads back that target's core-and-clock scaling without the part in hand.

The procedure compiles for the floor generation a network requires, then estimates against the newer targets to read the core-and-clock speedup.

/* The target generation is held in the compiler options as a TargetArchitecture string, */
/* so the same source compiles for whatever floor the network must support.            */
e5rt_e5_compiler_options_create(&options);
e5rt_e5_compiler_options_set_custom_ane_compiler_options(options, "TargetArchitecture=h13");
e5rt_e5_compiler_compile(compiler, model_path, options, &library);  /* M1 floor, runs M1..M5 */

/* Then the same drive: retain the function, build the op, and dispatch on a stream. */
e5rt_program_library_retain_program_function(library, fn_name, &function);
e5rt_precompiled_compute_op_create_options_create_with_program_function(function, &op_opts);
e5rt_execution_stream_operation_create_precompiled_compute_operation_with_options(op_opts, &op);
e5rt_execution_stream_encode_operation(stream, op);
e5rt_execution_stream_execute_sync(stream);

A network compiled to h13 runs on every generation from the M1 through the M5; estimating the same graph against h17s reads back the M5 scaling without the M5 in hand.
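The option string in the listing above can be built from the floor generation with the naming rule. This is a minimal sketch with a hypothetical helper name. It assumes the base architecture string, such as h13, is what the option expects, as in the listing, and it does not append variant suffixes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Write "TargetArchitecture=h<n+12>" for an M-series floor generation.
 * Returns 0 on success, -1 if the buffer is too small. */
int format_target_option(char *buf, size_t size, int m_floor)
{
    assert(buf != NULL);
    assert(m_floor >= 1 && m_floor <= 5);
    int written = snprintf(buf, size, "TargetArchitecture=h%d", m_floor + 12);
    return (written >= 0 && (size_t)written < size) ? 0 : -1;
}
```

For an M1 floor the buffer holds `TargetArchitecture=h13`, which is the string the listing passes to `e5rt_e5_compiler_options_set_custom_ane_compiler_options`.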

Reference: per-family scaling constants

Table 12.4 collects the per-family scaling constants and the figures that fix cross-generation behavior, with the silicon each was measured on.

Quantity	Value	Silicon
Naming rule	$M(n) \to H(n+12)$	family-wide
NE cores	4 (base), 8 (Pro and Max)	M1/H13
NE cores	16	M5/H17s
Operating clock	~1.14 GHz	M1/H13
Operating clock	~1.89 GHz	M5/H17s
Matmul-slope peak	about 19.6 fp16 TFLOP/s	M5/H17s
Single-program matmul peak	about 9.5 fp16 TFLOP/s	M5/H17s
Convolution peak	about 14.3 fp16 TFLOP/s	M5/H17s
M5 convolution peak versus M1 projected fp16 peak	near 5x	M5/H17s
Working-set threshold	near 2 MB	M1/H13
Working-set threshold	4.72 MB	M5/H17s
DRAM weight-stream bandwidth	about 145 GB/s, two read channels	M5/H17s
Texture-engine sampler unlock	A14 generation	family-wide
Native sin and cos unlock	A15 generation	family-wide
Tensor dimension limit raised, 16384 to 65536	A16 generation	family-wide
Kernel-width ceiling raised, 13 to 15	A16 generation	family-wide
M2 spatial and contraction extent cap	16384	M2/H14
M2 channel-axis extent cap	65536	M2/H14
Cross-generation training parity	0.9080, 0.9080, 0.9070	M1/H13, M2/H14, M5/H17s
fp16 cross-chip divergence bound	one unit in the last place	family-wide
fp16 slice-saturation threshold	source above 4094 to infinity	M1/H13, M2/H14
Number of fp16 divergence axes	four	family-wide
Cross-silicon prediction pass	ten of ten	M5/H17s

Table 12.4. The per-family scaling constants and the figures that fix cross-generation behavior, with the silicon each was measured on.

Part IV turns from the engine in isolation to the workloads that run on it.

Apple Neural Engine: A Complete Guide