24. HAL and capability gates

The compiler is one binary that builds any chip in the line from a per-chip data table, the hardware abstraction layer, read at compile time. The table holds scalar fields indexed by byte offset for the numeric limits and a dense capability-byte region at offsets 0x48f through 0x8cc, each byte read as hal[offset] & 1 to gate one operation or format. Operation legality is declared on the operation as a MinimumFamily<N> trait: native only when the target family index is N or greater, and decomposed below the floor, with no compute operation floored above A15. A capability in the table attests support at the layer that reads it and does not prove the operation runs: three-dimensional convolution has its kernel-depth attestation at 0x70 and fails backend lowering on every device mask.

The compiler that targets the Apple Neural Engine is one binary that builds any chip in the line on demand. What separates one target from the next is a per-chip data table, read at compile time, that records every size limit and every per-operation switch for that silicon.

HAL property table

The hardware-abstraction-layer table is the compiler's profile of a target, one packed structure that holds both the numeric limits and the feature gates for a chip. Table 24.1 gives representative scalar fields of the hardware-abstraction table, each with its offset, meaning, M1 value, and the generation at which it changes.

| Offset | Field | Meaning | M1 (H13) | Changes at |
| --- | --- | --- | --- | --- |
| 0x1b8 | max_operand_bytes | on-chip SRAM working set | 2 MB | constant across the line (1 MB on M9) |
| 0x1c0 | dram_alignment | DMA width granule in bytes | 16 | constant (1 only on the small profiles) |
| 0x1c8 | l2_bank_align | DMA bank-conflict modulo | 64 | constant |
| 0x1f0 | L2-resident buffer threshold | dedicated-buffer trip | 0 | 32768 at A15, 262144 at A16 |
| 0x200 | dense kernel-memory cap | non-streamed weight ceiling | 64 KB | the fold-path budget |
| 0x210 | streamed kernel-memory cap | streamed weight ceiling | 16 MB | the stream-path budget |
| 0x218 | instruction or segment alignment | record packing granule | 256 | 16 at A14 |
| 0x228 | ne_perf_cycle_divisor | cost-model per-cycle divisor | 64 | 32 on H11, 16 on M9 |
| 0x238 | num_nes | NE-core count | 4 (base) | die-keyed: 4, 8, 16, 32, 64 |
| 0x288 | extended dual-kernel-memory mode | 16 MB versus 64 KB select | 0 | 0 on all 28 targets |
| 0x70 | max_large_conv_kernel_dim_z | 3D-conv kernel depth | 16 | capability attested at A13 (1 below) |
| 0x138 | max_tensor_width | maximum tensor width | 16384 | 65536 at A16 |
| 0x158 | max_tensor_depth | maximum tensor depth | 16384 | 1 below A13, 65536 at A16 |
| 0x3f0 | reduction-via-transpose extent | reduction route threshold | 192 | 384 at A15 |
| 0x400 | pe_min_patch_width_log2 | the pixel tiling floor, as a log2 | 4 | constant M1 and M5 |
| 0x580 | cost-model policy name | roofline anchor string | Simple | None on older and small profiles |
| 0x668 | interchange-format map size | count of accepted image formats | 3 | 13 at A14, 16 at A15, 14 at A16 |

Table 24.1. Representative scalar fields of the hardware-abstraction table; the full decoded register map is in Appendix C.

The hardware abstraction layer is a single packed structure the compiler constructs for its target, holding two kinds of entry. The first is a block of scalar fields, indexed by byte offset, that record numeric limits: maximum kernel sizes, maximum tensor dimensions per axis, the on-chip working-set size, data-movement alignment granule, and cost-model curve. The second is a dense region of single-byte boolean flags, the capability bytes, each gating one operation or one format on or off for the target. A family of constructors, one per architecture, builds the structure per target inside the compiler. The scalar region runs from offset 0x18 to roughly 0x348 as plain data, with non-scalar members such as the format map and the cost-model curve extending past it. The capability-byte region occupies offsets 0x48f through 0x8cc and holds on the order of 165 single-byte flags, of which 24 have recovered field names and the rest are enumerated by offset and classified by the family that enables them. The compiler reads a scalar as a value at its offset and a capability byte as hal[offset] & 1, so every limit and every gate is one indexed read into this one table. The same structure holds the cost model: a policy-name string at 0x580, frequency-to-efficiency curve at 0x7a8, and per-cycle divisor at 0x228, which the roofline of chapter 18 reads from this table rather than from a separate file.

Listing 24.1 gives a partial C view of the structure, with each selected scalar field and the capability-byte region at its recovered offset.

/* ZinIrHalParameters, selected fields at their byte offsets (M1/H13 values) */
struct ZinIrHalParameters {
    /* ... */
    uint64_t max_large_conv_kernel_dim_z;  /* 0x70:  3D-conv kernel depth   = 16  */
    /* ... */
    uint64_t max_tensor_width;             /* 0x138: max tensor width       = 16384 */
    /* ... */
    uint64_t max_operand_bytes;            /* 0x1b8: SRAM working set       = 2 MB  */
    uint64_t dram_alignment;               /* 0x1c0: DMA width granule      = 16    */
    uint64_t l2_bank_align;                /* 0x1c8: DMA bank-conflict modulo = 64  */
    /* ... */
    uint64_t num_nes;                      /* 0x238: NE-core count          = 4     */
    /* ... */
    uint8_t  cap_bytes[0x8cd - 0x48f];     /* 0x48f..0x8cc: per-op capability flags */
};

Listing 24.1. A partial C view of the per-target hardware-abstraction structure, with selected scalar fields and the capability-byte region at their recovered offsets.

The capability bytes are read one at a time as hal[offset] & 1, for example the texture engine at 0x81d and the kernel-streaming master at 0x48f.
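As a minimal sketch of the two read forms, the Python below treats the table as a flat byte buffer. The buffer is synthetic, filled with a few M1 values from Table 24.1. The little-endian 64-bit scalar width follows the uint64_t fields of Listing 24.1 and is an assumption for any field not shown there, and the helper names read_scalar, read_cap, and make_synthetic_m1 are illustrative, not compiler symbols.

```python
import struct

HAL_SIZE = 0x938  # struct size as reported later in this chapter


def read_scalar(hal: bytes, offset: int) -> int:
    """Read one scalar limit as a little-endian 64-bit value (assumed width)."""
    return struct.unpack_from("<Q", hal, offset)[0]


def read_cap(hal: bytes, offset: int) -> bool:
    """Read one capability byte the way the text describes: hal[offset] & 1."""
    return bool(hal[offset] & 1)


def make_synthetic_m1() -> bytearray:
    """Build a zeroed buffer and write a few M1 values from Table 24.1 into it."""
    hal = bytearray(HAL_SIZE)
    scalars = {
        0x70: 16,              # max_large_conv_kernel_dim_z
        0x138: 16384,          # max_tensor_width
        0x1b8: 2 * 1024 ** 2,  # max_operand_bytes, 2 MB
        0x238: 4,              # num_nes
    }
    for offset, value in scalars.items():
        struct.pack_into("<Q", hal, offset, value)
    hal[0x48f] = 1  # kernel-streaming master
    hal[0x81d] = 0  # texture engine, absent on the M1
    return hal


synthetic_m1 = make_synthetic_m1()
print(read_scalar(synthetic_m1, 0x138),
      read_cap(synthetic_m1, 0x48f),
      read_cap(synthetic_m1, 0x81d))
```

Both helpers are a single indexed read into one buffer, which is the property the text relies on: a limit and a gate differ only in width and mask.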

Operation gate

Operation legality is declared not in the HAL table but on the operation itself, as a trait the compiler attaches to every backend operation. The trait is a minimum-family index: an operation marked MinimumFamily<N> is natively legal only in a compilation whose target family index is N or greater, and below that floor the compiler decomposes it into legal operations. The family index orders the generations: A11Legacy is 0, A12 is 1, A13 is 2, A14 is 3, A15 is 4, and so on, with the M1 at A13 and the M5 at A17.

On each backend operation the compiler compares the trait floor against the target family index; Listing 24.2 gives this native-or-decompose decision.

# mlir::OpTrait::anec::MinimumFamily<N>: native iff target family >= N
def op_is_native(op, target_family):
    return target_family >= op.minimum_family   # e.g. softmax N=2 (A13), sin N=4 (A15)

def lower_op(op, target_family):
    if op_is_native(op, target_family):
        emit_native(op)                          # one anec op
    else:
        decompose(op)                            # rewrite into ops legal below the floor

Listing 24.2. The minimum-family gate, where an operation is emitted natively when the target family meets its floor and decomposed otherwise.

The M1 has family index two and the M5 has family index six, so an operation with floor four, such as sin, is native on the M5 and decomposed on the M1.

The floors fall into a small number of tiers. The base tier, family 0, holds the operations every engine runs: convolution, matrix multiply, pooling, the elementwise and activation set, reshape, transpose, and concat. At A13 come softmax, the normalizations, the reductions, fused attention, and the square-root and error functions. A14 brings the texture-engine samplers, crop-resize, and resample; A15 brings native sin and cos. No compute operation floors above A15, so the newest generations add core count and clock rather than new operations.

The two gate mechanisms work together. A capability byte read as hal[offset] & 1 decides a route inside a single operation, for example whether the texture engine at byte 0x81d is present, which on the M1 reads 0 and forces resize to a decomposition. The minimum-family trait decides whether the operation is native at all. When either gate is closed, the compiler either emits a decomposition into legal operations or rejects the operation with a message naming the architecture, depending on whether a legal decomposition exists.
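The interaction of the two gates can be sketched as one routing function. This is an illustration under stated assumptions, not the compiler's code: the names FAMILY, Route, and route_op are hypothetical, the A16 index of 5 is inferred from the "and so on" in the family ordering, and the floor of 0 used for the capability-gated example is a placeholder because the text does not state the floor of resize.

```python
from enum import Enum

# Family indices from the text; A16 = 5 is inferred, the others are stated.
FAMILY = {"A11Legacy": 0, "A12": 1, "A13": 2, "A14": 3,
          "A15": 4, "A16": 5, "A17": 6}


class Route(Enum):
    NATIVE = "native"
    DECOMPOSE = "decompose"
    REJECT = "reject"


def route_op(min_family, target_family, cap_byte=1, has_decomposition=True):
    """Combine the trait floor with an optional capability byte.

    Both gates must be open for a native emit. When either is closed the
    result is a decomposition if one exists, otherwise a rejection.
    """
    gates_open = target_family >= min_family and (cap_byte & 1) == 1
    if gates_open:
        return Route.NATIVE
    return Route.DECOMPOSE if has_decomposition else Route.REJECT


family_m1, family_m5 = FAMILY["A13"], FAMILY["A17"]
print(route_op(4, family_m1))              # sin (floor 4) on the M1
print(route_op(4, family_m5))              # sin (floor 4) on the M5
print(route_op(0, family_m1, cap_byte=0))  # texture-gated route, byte 0x81d = 0
```

Keeping the capability byte as a plain argument mirrors the point of the section: the routing logic is fixed, and only the values fed to it change per chip.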

Table 24.2 gives the minimum-family floors a developer reaches, each with the families it is native on and its representative operations.

| Floor | Native on | Representative operations |
| --- | --- | --- |
| F0 | all families | convolution, matmul, pooling, elementwise, reshape, transpose, concat |
| F2 (A13+) | A13 onward | softmax, layer and instance and batch norm, reductions, attention, erf, sqrt |
| F3 (A14+) | A14 onward | crop-resize, resample |
| F4 (A15+) | A15 onward | sin, cos, global argmin and argmax |

Table 24.2. The minimum-family floors a developer reaches, with the families each is native on and representative operations.

Because the floor is an attribute of the operation and the limits are a table keyed to the chip, the per-chip difference is data, not code. The compiler text that rewrites an operation is identical across the family, and the chip selects a different limit, gate, or decomposition strategy from the table beneath it.

Capability-byte gates across the line

A capability byte is a single-byte switch in that table that turns one operation or feature on or off for a target. Table 24.3 gives the named capability bytes, each with its gate and its value across the M1 and the later generations.

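| Byte | Gate | M1 (H13) | A14 | A15 | A16 | A18 |
| --- | --- | --- | --- | --- | --- | --- |
| 0x48f | kernel-streaming master, the 64 KB to 16 MB select | 1 | 1 | 1 | 1 | 1 |
| 0x494 | square-after-reduction fusion | 0 | 1 | 1 | 1 | 1 |
| 0x4a9 | dropout and random | 0 | 0 | 1 | 1 | 1 |
| 0x4f2 | global argmin and argmax | 1 | 1 | 1 | 1 | 1 |
| 0x529 | per-format kernel-stride enable, the palette stream | 1 | 1 | 1 | 1 | 1 |
| 0x52d | fp8 E4M3 kernel format | 0 | 0 | 0 | 0 | 1 |
| 0x563 | FIFO-mode direct memory access | 0 | 0 | 0 | 0 | 1 |
| 0x815 | softmax, native | 1 | 1 | 1 | 1 | 1 |
| 0x816 | instance normalization, native | 1 | 1 | 1 | 1 | 1 |
| 0x81a | local-response normalization, native | 1 | 1 | 1 | 1 | 1 |
| 0x81d | texture engine | 0 | 1 | 1 | 1 | 1 |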

Table 24.3. The named capability bytes.

The texture engine at byte 0x81d is the largest M1 functional gap: it reads 0 on the M1 and 1 from A14 onward. It gates resize, crop-resize, resample, affine transform, hardware gather, and symmetric padding all together, so each of those routes through a software decomposition on the M1. The fp8 byte 0x52d is set on the A18 generation alone of the 28 targets, so the M5, an A17 part, does not have it. The streaming master at byte 0x48f and the palette-stream byte 0x529 both read 1 on the M1, which is why the int4 palette and the sparse form stream on the M1, while int8 and blockwise fold, a mechanism chapter 25 develops. The compiler builds the table for a target by calling that target's constructor, so a single host recovers the table for every chip in the line whether or not it is the chip that is running.
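Because the gates are data, the per-generation differences can be queried rather than read off by eye. The sketch below copies a subset of rows from Table 24.3 into a dictionary; the names GENERATIONS, CAP_TABLE, first_enabled, and gaps are hypothetical helpers for illustration only.

```python
GENERATIONS = ("M1", "A14", "A15", "A16", "A18")

# Subset of Table 24.3: offset -> (gate name, values in GENERATIONS order).
CAP_TABLE = {
    0x48f: ("kernel-streaming master", (1, 1, 1, 1, 1)),
    0x494: ("square-after-reduction fusion", (0, 1, 1, 1, 1)),
    0x4a9: ("dropout and random", (0, 0, 1, 1, 1)),
    0x52d: ("fp8 E4M3 kernel format", (0, 0, 0, 0, 1)),
    0x563: ("FIFO-mode direct memory access", (0, 0, 0, 0, 1)),
    0x81d: ("texture engine", (0, 1, 1, 1, 1)),
}


def first_enabled(offset):
    """Return the first generation column in which the byte reads 1, or None."""
    _, values = CAP_TABLE[offset]
    for generation, value in zip(GENERATIONS, values):
        if value & 1:
            return generation
    return None


def gaps(target, reference):
    """List gates that are closed on `target` but open on `reference`."""
    t = GENERATIONS.index(target)
    r = GENERATIONS.index(reference)
    return [name for name, values in CAP_TABLE.values()
            if values[r] & 1 and not values[t] & 1]


print(first_enabled(0x81d))
print(gaps("M1", "A14"))
```

The same query run against a full table, one column per target, would reproduce the functional-gap discussion above for any pair of chips.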

Per-family scalar matrix

The scalar parameters show the same pattern across the generation anchors: a value typically holds for a span of generations and then steps, often only once, as Table 24.4 shows.

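| Field (offset) | M1 (H13) | A14 | A15 | A16 (M4) | A17 (M5) |
| --- | --- | --- | --- | --- | --- |
| num_nes (0x238) | 4 | 4 | 4 | 4 | 16 |
| max_operand_bytes (0x1b8) | 2 MB | 2 MB | 2 MB | 2 MB | 2 MB |
| max_tensor_width (0x138) | 16384 | 16384 | 16384 | 65536 | 65536 |
| max_tensor_depth (0x158) | 16384 | 16384 | 16384 | 65536 | 65536 |
| max_large_conv_kernel_dim_z (0x70) | 16 | 16 | 16 | 16 | 16 |
| L2-resident threshold (0x1f0) | 0 | 0 | 32768 | 262144 | 262144 |
| instruction alignment (0x218) | 256 | 16 | 16 | 16 | 16 |
| reduction-transpose extent (0x3f0) | 192 | 192 | 384 | 384 | 384 |
| interchange-format count (0x668) | 3 | 13 | 16 | 14 | 14 |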

Table 24.4. The scalar parameters at the generation anchors.

The A17 (M5) column reads num_nes of 16 because that column is the 16-core Pro-class profile; the base A17 profile has 4. The per-die sequence runs 4 for the base name, 8 for the g suffix, 16 for s and the legacy 16-core profile, 32 for c, and 64 for the d Ultra-class die.
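The hold-then-step pattern can be checked mechanically. The sketch below copies a subset of rows from Table 24.4 and reports the anchors at which each value changes; ANCHORS, SCALARS, and steps are illustrative names, not compiler symbols.

```python
ANCHORS = ("M1", "A14", "A15", "A16", "A17")

# Subset of Table 24.4, values in ANCHORS order.
SCALARS = {
    "max_operand_bytes": (2 << 20,) * 5,
    "max_tensor_width": (16384, 16384, 16384, 65536, 65536),
    "L2-resident threshold": (0, 0, 32768, 262144, 262144),
    "instruction alignment": (256, 16, 16, 16, 16),
    "reduction-transpose extent": (192, 192, 384, 384, 384),
    "interchange-format count": (3, 13, 16, 14, 14),
}


def steps(values):
    """Return the anchors at which a value differs from the previous anchor."""
    return [ANCHORS[i] for i in range(1, len(values))
            if values[i] != values[i - 1]]


for name, values in SCALARS.items():
    changed = steps(values)
    print(f"{name}: {', '.join(changed) if changed else 'constant'}")
```

Run on these rows, the check shows single steps for the tensor width, instruction alignment, and reduction extent, while the L2-resident threshold and the interchange-format count change at more than one anchor.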

Kernel-memory split

The streaming master byte does more than gate the compressed-weight stream: it selects which of two kernel-memory caps a layer's weights are sized against. The legalization check is two lines of logic: it reads a streamable flag and the master byte to pick the offset of the cap, then compares the demand against that cap, as Listing 24.3 gives.

# ExceedKmemSizeLimit: split-legalize a layer's weights when they exceed the cap
def exceeds_kmem(hal, demand, is_streamable):
    cap = hal[0x210] if (is_streamable and hal[0x48f] & 1) else hal[0x200]
    return cap < demand   # 0x200 = 64 KB dense, 0x210 = 16 MB streamed

Listing 24.3. The kernel-memory split, where a streamable weight under the streaming master is sized against the 16 MB cap and a dense weight against the 64 KB cap.

On the M1, an ordinary non-streamed weight over 64 KB, or any weight over 16 MB, is thus split into multiple sub-layers, which raises the dispatch count and the compile time. A streamed compressed weight is sized against the 16 MB cap and so fits far more weight in a single layer. This is the weight path only; it does not bound the activations, which stay within the maximum-tensor-dimension caps, so a layer with a tiny weight and a large activation is bounded by tiling cost in the partition passes rather than by this discrete limit.
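To give a feel for the cost of the split, the sketch below estimates a lower bound on the number of sub-layers for a given weight size. It assumes an even split by byte count, which the text does not state; the real splitter may produce more pieces. The names kmem_cap and min_sublayers are hypothetical, and the caps are the M1 values at 0x200 and 0x210.

```python
import math

KB, MB = 1024, 1024 * 1024
DENSE_CAP = 64 * KB    # M1 value at 0x200
STREAM_CAP = 16 * MB   # M1 value at 0x210


def kmem_cap(is_streamable, streaming_master=1):
    """Pick the cap the way Listing 24.3 does: streamed only if both are set."""
    return STREAM_CAP if (is_streamable and streaming_master & 1) else DENSE_CAP


def min_sublayers(weight_bytes, is_streamable, streaming_master=1):
    """Lower bound on sub-layers, assuming an even split by byte count."""
    cap = kmem_cap(is_streamable, streaming_master)
    return max(1, math.ceil(weight_bytes / cap))


print(min_sublayers(1 * MB, is_streamable=False))  # dense 1 MB weight
print(min_sublayers(1 * MB, is_streamable=True))   # streamed 1 MB weight
```

Under this assumption a 1 MB dense weight needs at least 16 sub-layers while the same weight streamed stays in one, which is the dispatch-count difference the paragraph above describes.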

Dead and family-gated fields

Not every per-target value in the table is a live gate. A byte-granular re-diff of all 28 target blobs leaves zero undecoded scalar fields, but several offsets that vary by family are populated by the per-chip builder and never read back through the table pointer, so their value is a write-only mirror. Five scalar offsets are dead as table fields in this fashion: the global element cap at 0x18, kernel-depth constant at 0x80, legacy tiling granule at 0x260, offset at 0x320, and die-class flag at 0x29c. Each varies meaningfully by family, but the value a reader consumes is read off a different object that shares the byte displacement, a tensor-dimensions, compiler-parameters, or memory-pools structure, not the table. The distinguishing test is whether the base register at the access holds the table pointer, since the same displacement aliases dozens of other by-reference structures, so a raw displacement match inside a table-typed function is not proof of a table read.

The one offset that looks dead on the M1 but is not is the FIFO-mode byte at 0x563: it reads 0 on the M1 and is read through the table pointer under a branch that is taken only when the byte is set, which happens on the A18 generation. A per-family value pattern alone does not establish a live gate; only a traced reader off the table base does.

Naming the remaining capability flags

The capability bytes and the scalar limits are both fields of one struct, ZinIrHalParameters, the per-family blob the compiler builds for its target. The struct has no per-field getter method, so the compiler reads a field through an inlined ldrb or ldr off the table pointer at a fixed offset, which is why a first pass recovers offsets and values but not names. The names survive in one place only. The compiler retains full mangled C++ symbols, and a reader function whose signature has ZinIrHalParameters const& reads each field, so the reader's name labels the field it reads. A read is attributed to the table only when the load's base register is the function's ZinIrHalParameters const& argument, since the same byte displacement aliases dozens of other by-reference structures.

Cross-referencing the unnamed offsets against these reader functions names 95 more of them, of which roughly 30 resolve to a precise individual meaning with the base register verified against the table argument. Table 24.5 gives recovered names from that precise set, each attributed to its reader function.

| Offset | Field | Reader function |
| --- | --- | --- |
| 0x4a8 | PE work-unit-shape supported | PERasterization::ComputeWUShape |
| 0x4ac | small-source-mode compression supported | ZinANELayer::AllowCompressionBasedOnSmallSourceMode |
| 0x4b0 | non-power-of-2 work-unit width supported | NERasterization::CanUseNonPowerOf2WUs |
| 0x4f0 | preferred kernel layout format | ZinIrKernel::GetPreferredKernelLayoutFormat |
| 0x500 | transpose and multicast configuration | ZinNELayer::FindValidMirInfoForTransposeCore |
| 0x520 | secure-mode cache-hint DSID gate | GetDSIDFromPriorityHalAndSecureMode |
| 0x52c | tensor-format support flag, pairs with the named 0x52d fp8 byte | ZinLayerValidationUtils::ValidateFormat |
| 0x54c | cache-prefetch kernel-task-interval limit | ZinValidateTd<17>::ValidateCachePrefetchKernelTaskInterval |
| 0x5a8 | cache-hint DSID value | GetDSIDFromPriorityHalAndSecureMode |
| 0x708 | reflective-padding maximum extent | ZinValidateTd<20>::ValidateReflectivePaddingMode |
| 0x748 | gather and texture-engine descriptor pointer | ZinGatherLayer::CreateTELayer |
| 0x8b4 | tile-height-errata threshold | ZinTileHeightErrata::Workaround |
| 0x8bc | chaining enabled | ZinIrRegAllocUtil::IsChainable |
| 0x8e0 | kernel-caching enabled | ZinIrTdValidationUtil::ValidateKernelCaching<N> |

Table 24.5. Precise capability-flag names recovered this round, each attributed to its reader function.

The remaining additions, roughly 65 of the 95, are class-named: the reader identifies the subsystem the field gates without the exact semantics. Examples are the per-axis DMA range bounds read by ZinValidateTd<N>::CheckInRangeDmaAccess and the texture-engine plane-equation coefficients in the 0x820 to 0x8f8 block gated by the named 0x81d texture-engine byte on A14 and later. Two candidates were rejected as table fields despite matching a displacement inside a table-typed function: 0xcf8 loads off an adrp-formed read-only constant rather than the table, and 0x678 loads off a nested object two pointers deep. The same base-register test that found the five dead fields above also rules these out.
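The base-register test can be sketched as a filter over disassembled loads. Everything about the record format here is hypothetical: the Load class, the base tags such as "hal_arg", and the placeholder function names candidate_a and candidate_b stand in for whatever a real disassembly pass would produce. Only the offsets, the struct size, and the IsChainable reader come from the text.

```python
from dataclasses import dataclass

HAL_SIZE = 0x938        # struct size given in this chapter
TABLE_ARG = "hal_arg"   # placeholder tag for the ZinIrHalParameters const& argument


@dataclass(frozen=True)
class Load:
    """One disassembled load; a hypothetical record, not a real tool's output."""
    function: str
    base: str           # tag describing where the base register's value comes from
    displacement: int


def is_table_read(load):
    """Attribute a load to the table only if its base is the table argument
    and its displacement falls inside the struct."""
    return load.base == TABLE_ARG and 0 <= load.displacement < HAL_SIZE


loads = [
    Load("ZinIrRegAllocUtil::IsChainable", TABLE_ARG, 0x8bc),
    Load("candidate_a", "adrp_constant", 0xcf8),   # read-only constant: rejected
    Load("candidate_b", "nested_object", 0x678),   # two pointers deep: rejected
]

named = {hex(l.displacement): l.function for l in loads if is_table_read(l)}
print(named)
```

A displacement match alone would have accepted all three loads; requiring the base to be the table argument leaves only the one genuine field.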

With this round the silicon-capability subset of the packed bitfield, the part the compiler reads to gate a feature per family, is fully named. The struct is 0x938 bytes, and the few entries inside it that are not capability flags are the cost-model coefficient block at offsets 0x580 through 0x7f0. This block holds the frequency-to-efficiency curve, rate indices, performance multiplier that the roofline of chapter 18 reads, and about two soft fp64 coefficients. These are all performance coefficients rather than legality caps. A small number of capability fields are also true holdouts, read only off aliased bases. The offsets past 0x938 hold no table: an earlier reading that located an A12 operation-emulation catalog at 0xa30 through 0xe84 was a read into the adjacent zeroed memory beyond the struct, not a real field.

Attested is not reachable

A capability recorded in the HAL table attests support at the layer that reads the table. It does not by itself prove that the operation lowers to a task descriptor and runs on the silicon. These are distinct layers, and a capability present at the first can fail at the second.

The case that fixes the rule is three-dimensional convolution. The HAL scalar at offset 0x70 records a 3D-conv kernel depth of 16 on the M1, attesting that the kernel geometry is permitted, and the compiler frontend recognizes the operation. It still fails backend lowering on every device mask, returning the message that it is not supported on any backend. The capability is in the table and the operation does not run.

The gap appears in the other direction as well, where a checker accepts an operation the code generator rejects. On the M1 the top-k, sort, and dynamic-slice validators are all callable, and all three operations are refused at code generation. A bit in the table, a frontend that recognizes an operation, and a validator that passes are each a claim about one layer; only a compile-and-run on the target confirms the operation at the layer that executes it. This is why the reachable surface of chapter 4 is smaller than the surface the table advertises, and why each native entry there was compiled and run on the M1 rather than inferred from a capability byte.
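One way to keep the layers apart in tooling is to record each claim separately and count only the last one. The sketch below is an illustration with hypothetical names (Evidence, is_reachable); a field left as None means this chapter does not state the result at that layer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Evidence:
    """Per-layer claims for one operation; None means the text does not say."""
    op: str
    table_attests: Optional[bool] = None
    frontend_recognizes: Optional[bool] = None
    validator_passes: Optional[bool] = None
    ran_on_target: Optional[bool] = None


def is_reachable(evidence: Evidence) -> bool:
    """Count an operation as reachable only when a compile-and-run succeeded."""
    return evidence.ran_on_target is True


# The two M1 cases described above.
conv3d = Evidence("conv3d", table_attests=True, frontend_recognizes=True,
                  ran_on_target=False)
top_k = Evidence("top-k", validator_passes=True, ran_on_target=False)

for evidence in (conv3d, top_k):
    print(evidence.op, is_reachable(evidence))
```

Neither case is reachable despite a passing claim at an earlier layer, which is the distinction the section draws.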