34. Cross-silicon targets

The compiler builds 28 architecture targets, one per silicon profile, under the fixed relation of chapter 12. A suffix letter selects the NE-core count, and the operation surface stops expanding at A15, so the surface measured on the M5 is the surface for everything above it. A device's runtime architecture string is a separate identifier from the compiler target name. A resolver-derived board-type sequence maps every shipping chip onto its generation.

The compiler that builds for the Apple Neural Engine has 28 architecture targets, one per silicon profile it knows how to construct. Each target is a named hardware-abstraction-layer table the compiler builds by calling one per-architecture constructor, ZinIrHal<T>::GetParams(), and calling every constructor on a single host recovers the full set regardless of which chip runs it.

Full set

Table 34.1 gives all 28 targets, each with its silicon class and decoded NE-core count.

Target            Silicon and class          NE cores
H11, H12, M9, T0  pre-A13 legacy             1 to 4
H13               A13, M1 base               4
H13g              M1 Pro, Max, Ultra         8
T1                A13 reference              4
H14               A14, M2 base               4
H14g              M2 Pro, Max                8
H14c              A14 Max-class              32
H15               A15, M3 base               4
H15g              M3 Pro, Max                8
H15c              A15 Max-class              32
H16               A16, M4 base               4
H16g              M4 Pro, Max                8
H16s              A16 Pro-class              16
H16c              A16 Max-class              32
H17               A17, M5 base               4
H17a              A17 variant                4
H17g              M5 Pro, Max                8
H17s              A17 Pro-class, the M5      16
H17c              A17 Max-class              32
H17d              A17 Ultra-class            64
H18               A18 base                   4
M11               small embedded ANE         1
U1, U2, U3        reference, not silicon     4

Table 34.1. The 28 compiler targets, each with its silicon class and decoded NE-core count.

The names fall into four groups: the H-architecture targets that stand for shipping A-series and M-series silicon, the pre-A13 legacy targets, a single small embedded profile, and three reference targets that are not silicon at all. A suffix letter selects the NE-core count within a generation, which the compiler decodes from the core-count field at hardware-abstraction-layer offset 0x238. The base name is 4 cores, the suffix g is 8, s is 16, c is 32, and d is 64, while M9 and M11 are single-core. H17s is thus the 16-core Pro-class part that is the M5, and H17d is the 64-core Ultra-class die, the largest in the table. These decoded num_nes values are the compiler's per-die core field, not Apple's marketing Neural Engine count; on the base M1 the decoded four stands against the published sixteen [AppleANE].
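The suffix rule above can be sketched as a small lookup. This is an illustration of the naming convention only, not the compiler's method: the compiler reads the core count from the field at offset 0x238, and the function name `decoded_ne_cores` is hypothetical. The entry for the `a` suffix is taken from the H17a row of Table 34.1 rather than from the suffix rule, and the legacy and reference names are deliberately left undecoded because the text gives no per-name rule for them.

```python
import re

# Suffix letter -> decoded NE-core count, as stated in the text.
# The "a" entry comes from the H17a row of Table 34.1 (assumption: it
# generalises to any "a" variant).
SUFFIX_CORES = {"": 4, "a": 4, "g": 8, "s": 16, "c": 32, "d": 64}

# Targets the text names as single-core.
SINGLE_CORE = {"M9", "M11"}


def decoded_ne_cores(target):
    """Return the NE-core count implied by a target name, or None.

    Only the H13 to H18 families and the two single-core names are covered.
    The pattern decodes the naming convention; it does not check that the
    name is one of the 28 real targets.
    """
    if target in SINGLE_CORE:
        return 1
    match = re.fullmatch(r"H(1[3-8])([agscd]?)", target)
    if match is None:
        return None  # legacy, reference, or unknown name
    return SUFFIX_CORES[match.group(2)]


if __name__ == "__main__":
    for name in ("H13", "H17s", "H17d", "M11", "H11"):
        print(name, decoded_ne_cores(name))
```

Returning None for names outside the rule keeps the sketch from inventing a count for H11, H12, and T0, which the table gives only as the range 1 to 4.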

The reference targets hold placeholder limits that no part has: a maximum tensor depth of 1, a kernel-width limit of 1023, and no interchange-format support. They are unconstrained validation profiles the compiler builds for its own checking, not addressable silicon. The small embedded profile M11, by contrast, is addressable silicon: an efficiency-class engine with the A16-class feature flags but the A13-class 16384-dimension limit, a single NE core, and an unusual kernel-width ceiling of 15, which falls between the A13 value of 13 and the A14 value of 16.

Capability tiers

Table 34.2 groups the targets into capability tiers, giving each tier its dimension limit and the four gated capabilities that separate the generations.

Tier       Targets                           Max dimension      3D conv  Texture engine  sin, cos  Dropout
pre-A13    H11, H12, M9, T0                  16384, depth 1     no       no              no        no
A13        H13, H13g, T1                     16384              yes      no              no        no
A14        H14, H14g, H14c                   16384              yes      yes             no        no
A15        H15, H15g, H15c                   16384              yes      yes             yes       yes
A16        H16, H16g, H16s, H16c             65536              yes      yes             yes       yes
A17        H17, H17a, H17g, H17s, H17c, H17d 65536              yes      yes             yes       yes
A18        H18                               65536              yes      yes             yes       yes
small      M11                               16384              yes      yes             yes       yes
reference  U1, U2, U3                        65535 placeholder  no       no              no        no

Table 34.2. The capability tier of each target, with the dimension limit and the four gated capabilities that separate the generations.

A17 and A18 add no operation over A16: they have identical dimension limits, identical kernel-width and kernel-depth ceilings, the same texture engine, the same dropout and global-argmax flags, and the same legal operation set. They differ from A16 only in NE-core count, which scales throughput rather than legality. Because the decoded capability tables are identical, the operation behavior measured on the M5, an H17 part, is the operation behavior of every target at or above A16. The cross-silicon performance measurements of chapter 12 are predicted to carry to the unshipped generations on the same basis, with the per-chip rates confirmed only on the two measured silicon points.
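The tier comparison can be made concrete with a small table in code. The sketch below transcribes only the columns of Table 34.2, so it says nothing about kernel-width or kernel-depth ceilings, and the names `Tier`, `TIERS`, and `tiers_matching` are hypothetical. The pre-A13 row records the 16384 limit and omits the depth-1 note.

```python
from typing import NamedTuple


class Tier(NamedTuple):
    max_dimension: int
    conv3d: bool
    texture_engine: bool
    sin_cos: bool
    dropout: bool


# Rows transcribed from Table 34.2.
TIERS = {
    "pre-A13": Tier(16384, False, False, False, False),
    "A13": Tier(16384, True, False, False, False),
    "A14": Tier(16384, True, True, False, False),
    "A15": Tier(16384, True, True, True, True),
    "A16": Tier(65536, True, True, True, True),
    "A17": Tier(65536, True, True, True, True),
    "A18": Tier(65536, True, True, True, True),
    "small": Tier(16384, True, True, True, True),
    "reference": Tier(65535, False, False, False, False),  # placeholder limit
}


def tiers_matching(reference):
    """List the tiers whose table row is identical to the reference tier."""
    row = TIERS[reference]
    return [name for name, tier in TIERS.items() if tier == row]


if __name__ == "__main__":
    # The A16, A17, and A18 rows are identical, so legality carries across them.
    print(tiers_matching("A16"))
    # A15 has the same four capabilities but the smaller dimension limit.
    print(TIERS["A15"].max_dimension, TIERS["A16"].max_dimension)
```

Comparing whole rows rather than single flags shows where the table alone separates tiers: A15 and the small profile share the four capabilities with A16 but not its dimension limit.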

Silicon to target

The map from a shipping chip to its architecture comes from a resolver that Apple distributes and that decompiles cleanly. The method aneArchitectureType on the private device-info class builds the architecture string from a board-type value read from the platform configuration store, switching on a strictly increasing board-type sequence. The live anchor on an M1 Max reads board type 96, which resolves to h13g with a 16-core count, matching the registry exactly. Table 34.3 gives the resolver-derived map from system-on-chip to runtime architecture and compiler target across the M1 through M5 generations.

Chip   Product             Runtime arch  Compiler target
T8103  M1 base             h13           H13
T600x  M1 Pro, Max, Ultra  h13g          H13g
T8112  M2 base             h14           H14
T602x  M2 Pro, Max         h14g          H14g
T8122  M3 base             h15           H15
T603x  M3 Pro, Max         h15g          H15g
T8132  M4 base             h16           H16
T604x  M4 Pro, Max         h16g          H16g
T8142  M5 base             h17           H17
T605x  M5 Pro, Max         h17s          H17s

Table 34.3. The resolver-derived map from system-on-chip to runtime architecture and compiler target, M1 through M5.

The map follows the fixed relation of chapter 12. The sequence is anchored at both ends, the live M1 Max at h13g and the measured M5 Pro at h17s, and the intervening steps are corroborated independently. A single shipping vision filter has exactly the five tables H13, H14, H15, H16, H17, the five Mac engine generations. The board-type kext for an absent chip cannot be read on a different host, since only the running chip's table is resident, so each middle step rests on the anchored monotone sequence.
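The chip-to-target map of Table 34.3 can be held as a plain lookup. This is a minimal sketch, not the resolver itself: the real resolver switches on board-type values, of which the text gives only one (96 on the M1 Max), so the sketch keys on the chip name instead. Reading the trailing `x` in a family name such as T600x as a one-character wildcard is an assumption, compiler target names are written in the Table 34.1 spelling, and the names `CHIP_TO_ARCH` and `resolve` are hypothetical.

```python
# Chip name -> (runtime architecture string, compiler target), from Table 34.3.
CHIP_TO_ARCH = {
    "T8103": ("h13", "H13"),
    "T600x": ("h13g", "H13g"),
    "T8112": ("h14", "H14"),
    "T602x": ("h14g", "H14g"),
    "T8122": ("h15", "H15"),
    "T603x": ("h15g", "H15g"),
    "T8132": ("h16", "H16"),
    "T604x": ("h16g", "H16g"),
    "T8142": ("h17", "H17"),
    "T605x": ("h17s", "H17s"),
}


def resolve(chip):
    """Return (runtime arch, compiler target) for a chip name, or None.

    An exact name is tried first. Otherwise the last character is replaced
    by "x" to match a family row (assumed wildcard convention).
    """
    if chip in CHIP_TO_ARCH:
        return CHIP_TO_ARCH[chip]
    return CHIP_TO_ARCH.get(chip[:-1] + "x")


if __name__ == "__main__":
    print(resolve("T8103"))
    print(resolve("T6001"))
    print(resolve("T9999"))
```

The function returns both identifiers because the text treats the runtime string and the compiler target as separate names, even where they differ only in case.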

Runtime string and compiler target

The architecture name a device reports at runtime is not the compiler target name. The runtime string is the coarse form, h1N for a base part and h1Ng for a Pro, Max, or Ultra part, the only two variants the runtime emits on the desktop platform. The compiler target is the finer set, the full H17, H17s, H17c, H17d, H17g family, of which the runtime collapses several onto one string. A developer names the target by its compiler form and treats the runtime string as a separate identifier.

The direct compile entry point accepts any of the 28 target names and rejects an unknown name. The dispatch library, in contrast, falls back silently when handed an unknown architecture, so a developer gates a cross-target compile against the known-name set before dispatching it.
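The gate described above can be sketched as a membership check against the 28 names of Table 34.1. The function name `require_known_target` is hypothetical, and the dispatch call itself is left out because its interface is not described here.

```python
# The 28 compiler target names listed in Table 34.1.
KNOWN_TARGETS = frozenset({
    "H11", "H12", "M9", "T0",
    "H13", "H13g", "T1",
    "H14", "H14g", "H14c",
    "H15", "H15g", "H15c",
    "H16", "H16g", "H16s", "H16c",
    "H17", "H17a", "H17g", "H17s", "H17c", "H17d",
    "H18",
    "M11",
    "U1", "U2", "U3",
})


def require_known_target(name):
    """Return the name unchanged if it is a known compiler target.

    Raises ValueError otherwise, so an unknown name fails loudly before it
    reaches a dispatch path that would fall back silently.
    """
    if name not in KNOWN_TARGETS:
        raise ValueError(f"unknown compiler target: {name!r}")
    return name


if __name__ == "__main__":
    print(require_known_target("H17s"))
    try:
        require_known_target("h17s")  # runtime string, not a compiler target
    except ValueError as exc:
        print(exc)
```

The check is case-sensitive on purpose: a lowercase runtime string such as h17s is rejected, which keeps the two identifiers from being confused at the call site.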

Interchange formats across the set

Each target has a per-chip table of accepted image-input formats, the interchange-format map at hardware-abstraction-layer offset 0x658, keyed by a four-byte ASCII format tag. Table 34.4 gives the accepted image-input format count by generation tier, with the format set each tier adds.

Tier              Chips                         Format count  Set
older, reference  H11, H12, M9, T0, U1, U2, U3  0             none
A13, M1           H13, H13g, T1                 3             &BGA, &L0h, &L16
A14               H14, H14g, H14c               13            A13 set, RGBA-half, three compression variants
A15 and small     H15, H15g, H15c, M11          16            A14 set, YUV 4:2:0, luma-half
A16, A17, A18     H16, H17, H18 families        14            A15 set minus YUV 4:2:0

Table 34.4. The accepted image-input format count by generation tier, with the format set each tier adds.

The tag is a one-byte compression-variant prefix on a three-byte base pixel format. The compiler does not parse the prefix character by character: it validates the whole four-byte tag against a 34-entry allow-list, and the prefix's meaning is the third byte of the format's packed-integer value, a packing-mode index on a uniform stride. Table 34.5 gives the compression-variant prefix on an interchange tag and the packing-mode index it selects.

Prefix  Mode index  Meaning
&       0           uncompressed, default raster surface
-       1           lossless compression, 32 by 32 macroblock
/       2           lossless compression, 16 by 16 macroblock
|       3           lossless compression, mode 3
*       0           compound prefix that sets the dynamic-channel flag

Table 34.5. The compression-variant prefix on an interchange tag and the packing-mode index it selects.
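For reading tags by hand, the prefix table can be applied with a simple split. This is a reader's aid under stated assumptions, not the compiler's procedure, which validates the whole tag against its allow-list instead of parsing it. The names `TagParts` and `split_tag` are hypothetical, and treating `*` as a plain one-byte prefix with mode 0 and the dynamic-channel flag set follows the table row literally; how the compiler stores that flag is not described.

```python
from typing import NamedTuple

# Prefix character -> (packing-mode index, dynamic-channel flag), per Table 34.5.
PREFIX_MODE = {
    "&": (0, False),  # uncompressed, default raster surface
    "-": (1, False),  # lossless compression, 32 by 32 macroblock
    "/": (2, False),  # lossless compression, 16 by 16 macroblock
    "|": (3, False),  # lossless compression, mode 3
    "*": (0, True),   # compound prefix, sets the dynamic-channel flag
}


class TagParts(NamedTuple):
    mode: int
    dynamic_channel: bool
    base: str


def split_tag(tag):
    """Split a four-character tag into its prefix meaning and base format."""
    if len(tag) != 4:
        raise ValueError(f"tag must be four characters: {tag!r}")
    prefix, base = tag[0], tag[1:]
    if prefix not in PREFIX_MODE:
        raise ValueError(f"unknown prefix {prefix!r} in tag {tag!r}")
    mode, dynamic = PREFIX_MODE[prefix]
    return TagParts(mode, dynamic, base)


if __name__ == "__main__":
    print(split_tag("&BGA"))
    print(split_tag("-L16"))
```

A tag that splits cleanly here is not thereby accepted by any target; acceptance is decided per chip by the interchange-format map.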

The packed integer that names each format is three bytes: a pixel class, a base-format code, and the packing-mode index. The base-format codes are BGRA8 (BGA, code 0x11), RGBA-half (RhA, code 0x13), 8-bit luma (L0h, code 0x07), 16-bit luma (L16, code 0x08), and YUV 4:2:0 (8f0 and 8v0, code 0x09). A base format routes to a vector of 20-byte plane descriptors, each a tuple of width divisor, height divisor, element type, channel count, and depth. BGRA8 is thus one four-channel uint8 plane, and YUV 4:2:0 is a luma plane with a half-resolution two-channel chroma plane. The binary string that reads "Architecture only supports lossless compression" confirms that & is the uncompressed variant and that -, /, and | are the lossless-compressed packing families.
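The packed format value and the plane descriptors can be sketched as follows. Several details are assumptions and are marked as such in the code: the byte order of the three-byte value (taken here in the order the text lists the fields, with the packing mode third), the descriptor layout as five little-endian 32-bit fields (which matches the 20-byte size), the numeric element-type code, the depth value of 1, and the element type of the YUV planes. The pixel-class values are not given in the text, so the caller supplies one. All names in the block are hypothetical.

```python
import struct
from typing import NamedTuple

# Base-format codes given in the text.
BASE_CODE = {
    "BGA": 0x11, "RhA": 0x13, "L0h": 0x07, "L16": 0x08,
    "8f0": 0x09, "8v0": 0x09,
}


def pack_format(pixel_class, base, mode):
    """Build the three-byte value: pixel class, base code, packing mode.

    Assumption: bytes appear in the order listed in the text, so the
    packing-mode index is the third byte.
    """
    return bytes([pixel_class, BASE_CODE[base], mode])


class Plane(NamedTuple):
    width_div: int
    height_div: int
    element_type: int
    channels: int
    depth: int


# Assumed layout: five little-endian uint32 fields, 20 bytes in total.
PLANE_STRUCT = struct.Struct("<5I")

UINT8 = 0  # hypothetical element-type code; the real value is not given

# BGRA8: one full-resolution four-channel uint8 plane.
BGRA8_PLANES = [Plane(1, 1, UINT8, 4, 1)]

# YUV 4:2:0: a luma plane plus a half-resolution two-channel chroma plane.
# The uint8 element type here is an assumption.
YUV420_PLANES = [Plane(1, 1, UINT8, 1, 1), Plane(2, 2, UINT8, 2, 1)]


def encode_planes(planes):
    """Serialise plane descriptors back to back, 20 bytes each."""
    return b"".join(PLANE_STRUCT.pack(*plane) for plane in planes)


if __name__ == "__main__":
    print(pack_format(0, "BGA", 1).hex())
    print(len(encode_planes(YUV420_PLANES)))
```

The width and height divisors carry the subsampling: a divisor of 2 in both dimensions on the chroma plane is what makes the layout 4:2:0.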

The A15 generation and the small embedded profile are the only targets in the set that accept YUV 4:2:0 input, in both full-range (8f0) and video-range (8v0) form. The M5 and every A16-and-later part keep luma-half but drop the two YUV 4:2:0 formats. The full per-target format records, the wider 10-bit and packed YUV family, and the plane-layout structures are appendix material.