27. Kernel driver and IOKit ABI

The engine is reached through a single kernel driver and a flat user-client selector ABI, with 17 control-client selectors and 9 direct-path selectors, each pinned to one exact size tuple. Opening either user client is gated on the kernel entitlement com.apple.ane.iokit-user-access, which exactly two system binaries hold, so every application reaches the engine through a privileged broker daemon. Selector 2 alone reaches a hardware doorbell from user space, through a four-layer call path ending in an MMIO mailbox write. The size tuples match across the M1 and M2-class kernel cache, so the ABI at this layer is family-invariant.

This chapter covers the driver class hierarchy, the two user-client selector spaces with their exact struct sizes, the dispatch-array record format the kernel validates against, the path from a user-space call down to the hardware doorbell, and the broker model that fronts the device with one privileged daemon.

Driver stack and class hierarchy

The three cooperating kexts, their bundles, principal classes, and provider matches appear in Table 27.1.

| Kext role | Bundle | Principal class | Provider match |
| --- | --- | --- | --- |
| interface and hardware driver | AppleH11ANEInterface | ANEHWDevice (registered as H11ANEIn) | RTBuddyService, role ANE |
| per-die abstraction layer | AppleT8132ANEHAL | AppleT8132ANEHAL | IOResources |
| multi-engine arbiter | AppleANELoadBalancer | ANEDriver (H1xANELoadBalancer) | IOResources, IOKit |

Table 27.1. Bundle, principal class, and provider match for each of the three kernel extensions.

Three version-locked kexts cooperate, all at build 9.511.3 on the M1 generation. The interface and hardware driver registers the device, vends the user clients, and rings the firmware. A per-die hardware abstraction layer supplies the clock, power, and topology constants. The load balancer owns the program-to-engine residency map and arbitrates across physical engines on parts that have more than one. The engine attaches as an Apple RTKit coprocessor endpoint, so the interface driver is the host side of a real-time-operating-system mailbox client. Three kernel families handle the work: the address-translation family that drives the device IOMMU, the surface family that backs zero-copy tensor buffers, and the real-time mailbox family that carries firmware commands.

ANEHWDevice::newUserClient(task*, void*, uint type, IOUserClient**) vends the three client types of Listing 27.1: two functional user clients and one hint-only client.

/* ANEHWDevice::newUserClient vends, by requested type: */
H11ANEInUserClient        /* control client: program lifecycle + the inference hot path */
H11ANEInDirectPathClient  /* direct path: enqueue, memory-map, session-hint            */
ANEClientHints            /* scheduling-hint client (setClientHint)                    */

Listing 27.1. The three user-client types, by requested type.

The two functional clients hold separate connections and separate selector spaces that both start at zero, so one small selector integer names different methods on each. Each stores its connection at object offset +0x40, and every selector call loads that offset before issuing the kernel call.

On the user-space side an IOServiceOpen whose return type is 0xe00002c5 opens the device, and the connection object is at offset 0x40 in the device handle. Two front-ends open that type: ANEServicesDevice for inference and ANEHWDevice for administration. The user-space selector immediates read on the M1 are selector 0 for the device open, selector 2 for the send-request, selector 3 for create, selector 4 for prepare, selector 6 for destroy, selector 10 for the version query, and selector 16 for the firmware load. The lifecycle indices differ from the kernel-registered table because the kernel has two user-client classes with independent selector tables. Two selectors agree on both sides: selector 0 opens the device with a 104-byte structure, and selector 2 sends a request with a 2376-byte input and a 40-byte output.

Every selector shim reads its arguments out of the framework IOExternalMethodArguments block at the fixed set of offsets in Listing 27.2, identical across all 26 selectors.

/* IOExternalMethodArguments field offsets, used by every selector shim: */
+0x08  asyncWakePort       /* mach completion port (async selectors only)        */
+0x10  scalarInput[]       /* also holds the 0x20-byte asyncReference block       */
+0x20  scalarInputCount
+0x30  structureInput      /* pointer to the typed argument struct                */
+0x38  structureInputSize
+0x48  scalarOutput[]
+0x50  scalarOutputCount
+0x58  structureOutput
+0x60  structureOutputSize

Listing 27.2. The IOExternalMethodArguments field offsets every selector shim reads.

Three return constants recur across the shims: 0xe00002c2 is kIOReturnBadArgument, returned on a size or null-pointer check failure; 0xe00002c7 is kIOReturnUnsupported, returned on a disabled or stub path; and 0xe00002c5, which the IOKit headers name kIOReturnExclusiveAccess, is returned for the closed-or-closing client state.

User-client dispatch-array format

Each user client routes a selector through the standard IOKit 2022 dispatch pattern. H11ANEInUserClient::externalMethod loads the dispatch-array pointer and the method count, then tail-calls the framework dispatcher, which validates the declared scalar and structure sizes against the array entry before it calls the named handler. Both dispatch arrays were read byte for byte out of the read-only data section of the kernel cache, and Listing 27.3 gives the disassembled selector dispatch for each user client with its array pointer and selector count.

/* H11ANEInUserClient::externalMethod, disassembled (M1, T6000): */
add x3, x3, #0xc08    /* x3 = &_sANEDriverClientMethods           */
mov w4, #0x11         /* count = 17 selectors (0..16)             */
bl  IOUserClient2022::dispatchExternalMethod

/* H11ANEInDirectPathClient::externalMethod: */
add x3, x3, #0xeb0    /* x3 = &_sANEDriverDirectPathClientMethods */
mov w4, #0x9          /* count = 9 selectors (0..8)               */
bl  IOUserClient2022::dispatchExternalMethod

Listing 27.3. The disassembled selector dispatch for each user client, with its dispatch-array pointer and selector count.

Each array element is one IOExternalMethodDispatch2022 record at a 40-byte stride, packing the authenticated handler pointer and the four size checks the dispatcher enforces, shown in Listing 27.4.

/* IOExternalMethodDispatch2022 element, 40-byte stride, field offsets: */
struct IOExternalMethodDispatch2022 {
  void *function;                    /* +0x00  pointer-authenticated handler   */
  uint32_t checkScalarInputCount;    /* +0x08  exact scalar-input count        */
  uint32_t checkStructureInputSize;  /* +0x0c  exact struct-input bytes        */
  uint32_t checkScalarOutputCount;   /* +0x10  exact scalar-output count       */
  uint32_t checkStructureOutputSize; /* +0x14  exact struct-output bytes       */
  uint8_t  reserved[0x10];           /* +0x18  reserved (debug-WP group flag)  */
};

Listing 27.4. The dispatch-array element layout, packing the handler pointer and the four size checks the dispatcher enforces.

No entry on either client uses the sentinel 0xffffffff that means "do not check". Every selector pins an exact scalar count and an exact struct size, and the dispatcher rejects any other size with kIOReturnBadArgument. Size-based overloading does not exist here: each selector index has exactly one record with one fixed size tuple.
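The exact-size rule can be modelled in user space with a short sketch. The record type mirrors Listing 27.4 and the three sample rows are the first three entries of the control-client table; the names dispatch_record, call_sizes, sample_methods, and check_sizes are hypothetical, and rejecting an out-of-range selector with the same code is an assumption. This illustrates the check only and is not the kernel's dispatcher.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define kIOReturnSuccess     0x00000000u
#define kIOReturnBadArgument 0xe00002c2u

/* User-space model of one 40-byte record from Listing 27.4. */
typedef struct {
    uint64_t function;                  /* stand-in for the authenticated pointer */
    uint32_t checkScalarInputCount;
    uint32_t checkStructureInputSize;
    uint32_t checkScalarOutputCount;
    uint32_t checkStructureOutputSize;
    uint8_t  reserved[0x10];
} dispatch_record;

_Static_assert(sizeof(dispatch_record) == 40, "records sit at a 40-byte stride");

/* The sizes a caller actually presents (hypothetical helper type). */
typedef struct {
    uint32_t scalar_in, struct_in, scalar_out, struct_out;
} call_sizes;

/* Selectors 0..2 of the control client: open, close, send-request. */
const dispatch_record sample_methods[3] = {
    { 0, 0,  104, 0, 104, {0} },
    { 0, 0,    0, 0,   0, {0} },
    { 0, 1, 2376, 0,  40, {0} },
};

/* Accept a call only when all four sizes match the record exactly. */
uint32_t check_sizes(const dispatch_record *table, uint32_t count,
                     uint32_t selector, call_sizes got)
{
    if (selector >= count)               /* assumption: same error code */
        return kIOReturnBadArgument;
    const dispatch_record *r = &table[selector];
    if (got.scalar_in  != r->checkScalarInputCount   ||
        got.struct_in  != r->checkStructureInputSize ||
        got.scalar_out != r->checkScalarOutputCount  ||
        got.struct_out != r->checkStructureOutputSize)
        return kIOReturnBadArgument;
    return kIOReturnSuccess;
}
```

Because no record carries the 0xffffffff wildcard, the model needs no "skip this check" branch: a request that is one byte short is rejected just like one that is wildly wrong.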

Control-client selector table

The control client has 17 selectors covering the open handshake, the program lifecycle, status and version reads, and the firmware-driven debug work-processor channel. Table 27.2 lists every control-client selector with its handler and kernel-authoritative sizes read from the dispatch array.

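| Sel | Handler | Scalar in | Struct in | Scalar out | Struct out |
| --- | --- | --- | --- | --- | --- |
| 0 | ANE_DeviceOpen | 0 | 104 | 0 | 104 |
| 1 | ANE_DeviceClose | 0 | 0 | 0 | 0 |
| 2 | ANE_ProgramSendRequest | 1 | 2376 | 0 | 40 |
| 3 | ANE_ProgramCreate | 0 | 32 | 0 | 0 |
| 4 | ANE_ProgramPrepare | 0 | 56 | 0 | 56 |
| 5 | ANE_ProgramUnprepare | 0 | 56 | 0 | 0 |
| 6 | ANE_ProgramDestroy | 0 | 16 | 0 | 0 |
| 7 | ANE_GetStatus | 0 | 0 | 0 | 32 |
| 8 | ANE_ProgramCreateInstance | 0 | 32 | 0 | 0 |
| 9 | ANE_ProgramChainingPrepare | 0 | 16 | 0 | 24 |
| 10 | ANE_GetVersion | 0 | 0 | 1 | 0 |
| 11 | ANE_RegisterDebugWorkProcessor | 0 | 24 | 0 | 0 |
| 12 | ANE_UnregisterDebugWorkProcessor | 0 | 0 | 0 | 0 |
| 13 | ANE_GetDebugWorkProcessorItem | 2 | 0 | 0 | 0 |
| 14 | ANE_CompleteDebugWorkProcessorItem | 2 | 0 | 0 | 0 |
| 15 | ANE_ReleaseDebugWorkProcessorBuffers | 0 | 0 | 0 | 0 |
| 16 | ANE_LoadFirmware | 3 | 0 | 0 | 0 |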

Table 27.2. The seventeen control-client selectors, with handler name and kernel-authoritative scalar and structure sizes.

Selector 0 is the open handshake. It passes a 104-byte device-info structure as both the input and the output buffer, echoes the caller header back, fills the output half with the device descriptor, and returns the session token at offset +0x00 that every later request holds, with the handshake fields given in Table 27.3.

| Offset | Input (client to kernel) | Output (kernel to client) |
| --- | --- | --- |
| +0x00 | usage type byte (1 standard, 2 unsupported) | program / session token (u64) |
| +0x08 | callback function pointer; board id 0x1111222233334444 | echoed |
| +0x10 | receiver context pointer | echoed |
| +0x18 | timeout 0x2710 = 10000 | echoed |
| +0x48 | (output only) | ANE version 0x20 = 32, 256 |
| +0x50 | (output only) | number of engines = 1 |
| +0x60 | (output only) | CPU subtype = 4 |
Table 27.3. The ANEDeviceInfo handshake structure passed in and echoed back by selector 0.
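The handshake layout can be written down as a C structure whose offsets are checked at compile time. This is a sketch under stated assumptions: only the offsets and the 104-byte total come from Table 27.3, while the type name ane_device_info, the field names, the 8-byte field widths, and the padding fields are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of the 104-byte handshake buffer of Table 27.3.
   Offsets are from the table; names and widths are assumptions. */
typedef struct {
    uint64_t usage_or_token;    /* +0x00 in: usage type byte, out: session token */
    uint64_t callback;          /* +0x08 callback function pointer               */
    uint64_t receiver_context;  /* +0x10 receiver context pointer                */
    uint64_t timeout;           /* +0x18 0x2710 = 10000                          */
    uint8_t  unknown_20[0x28];  /* +0x20 not described in the table              */
    uint64_t ane_version;       /* +0x48 output only                             */
    uint64_t num_engines;       /* +0x50 output only                             */
    uint8_t  unknown_58[0x08];  /* +0x58 not described in the table              */
    uint64_t cpu_subtype;       /* +0x60 output only                             */
} ane_device_info;

_Static_assert(sizeof(ane_device_info) == 104, "selector 0 pins 104 bytes");
_Static_assert(offsetof(ane_device_info, ane_version) == 0x48, "version offset");
_Static_assert(offsetof(ane_device_info, num_engines) == 0x50, "engines offset");
_Static_assert(offsetof(ane_device_info, cpu_subtype) == 0x60, "subtype offset");

/* Fill the input half the way the table describes a standard open. */
void ane_device_info_init(ane_device_info *info, uint8_t usage_type)
{
    memset(info, 0, sizeof *info);
    ((uint8_t *)info)[0] = usage_type;  /* usage type byte at +0x00 */
    info->timeout = 0x2710;             /* timeout value from Table 27.3 */
}
```

The same buffer is passed as both input and output, so a caller would read the session token back from +0x00 after the call overwrites the usage byte.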

The usage-type byte selects the standard client profile: usage 1 opens, usage 2 returns the unsupported code 24. Selector 16 is inactive on this build: its shim returns kIOReturnUnsupported unconditionally, and the real firmware load runs internally at driver start.

A compiled program reaches the kernel in one of two representations: ANEProgramLegacyResource, a loader for the program-image executable, and ANEProgramRTResource, a runtime op-graph variant.

Direct-path selector table

The direct-path client has 9 selectors, listed in Table 27.4 with their handlers and sizes. Selectors 0, 1, and 2 reuse the control client's handler functions; the remaining six are the enqueue, memory-map, and session-hint methods of the low-latency submission model.

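| Sel | Handler | Scalar in | Struct in | Scalar out | Struct out |
| --- | --- | --- | --- | --- | --- |
| 0 | ANE_DeviceOpen | 0 | 104 | 0 | 104 |
| 1 | ANE_DeviceClose | 0 | 0 | 0 | 0 |
| 2 | ANE_ProgramSendRequest | 1 | 2376 | 0 | 40 |
| 3 | ANE_ProgramOutputSetEnqueue | 0 | 40 | 0 | 0 |
| 4 | ANE_ProgramInputsReady | 0 | 3104 | 0 | 0 |
| 5 | ANE_MemoryMapRequest | 1 | 2080 | 1 | 0 |
| 6 | ANE_MemoryUnMapRequest | 0 | 2080 | 0 | 0 |
| 7 | ANE_SessionHintRequest | 0 | 16 | 0 | 24 |
| 8 | ANE_ProgramChainingSetActiveProcedure | 0 | 32 | 0 | 0 |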

Table 27.4. The nine direct-path selectors with their handler names and scalar and structure sizes; the full reference is in Appendix C.
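Because both selector spaces start at zero, a selector integer is meaningful only together with the client it is sent to. The sketch below makes that concrete with two name tables transcribed from Tables 27.2 and 27.4; the array names and the handler_name function are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Handler names per selector, transcribed from Table 27.2. */
const char *const control_handlers[17] = {
    "ANE_DeviceOpen", "ANE_DeviceClose", "ANE_ProgramSendRequest",
    "ANE_ProgramCreate", "ANE_ProgramPrepare", "ANE_ProgramUnprepare",
    "ANE_ProgramDestroy", "ANE_GetStatus", "ANE_ProgramCreateInstance",
    "ANE_ProgramChainingPrepare", "ANE_GetVersion",
    "ANE_RegisterDebugWorkProcessor", "ANE_UnregisterDebugWorkProcessor",
    "ANE_GetDebugWorkProcessorItem", "ANE_CompleteDebugWorkProcessorItem",
    "ANE_ReleaseDebugWorkProcessorBuffers", "ANE_LoadFirmware",
};

/* Handler names per selector, transcribed from Table 27.4. */
const char *const direct_handlers[9] = {
    "ANE_DeviceOpen", "ANE_DeviceClose", "ANE_ProgramSendRequest",
    "ANE_ProgramOutputSetEnqueue", "ANE_ProgramInputsReady",
    "ANE_MemoryMapRequest", "ANE_MemoryUnMapRequest",
    "ANE_SessionHintRequest", "ANE_ProgramChainingSetActiveProcedure",
};

/* Resolve a selector against one client's table; NULL when out of range. */
const char *handler_name(int direct_path, unsigned selector)
{
    if (direct_path)
        return selector < 9 ? direct_handlers[selector] : NULL;
    return selector < 17 ? control_handlers[selector] : NULL;
}
```

With these tables, selector 3 resolves to ANE_ProgramCreate on the control client and to ANE_ProgramOutputSetEnqueue on the direct path, while selectors 0 through 2 resolve to the same names on both.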

Selector 5 is the device-IOMMU map. Its 2080-byte parameter structure describes a host buffer, and on success the handler writes the resulting engine-visible device address back into the single scalar output slot. Selectors 3 and 4 are the pre-post and trigger of the resident submission model: an output buffer set is enqueued, the inputs-ready signal fires, and the same doorbell path as selector 2 rings the engine.

Register and exclave method catalog

Beyond the nine kernel selectors, the direct-path client exports a wider register, power, firmware, and secure-world method surface. These are not distinct kernel selector indices: each routes through one of the nine kernel selectors or through a separate entry point, and Table 27.5 names the surface by role.

| Method | Role |
| --- | --- |
| ANE_PowerOn / ANE_PowerOff / ANE_IsPowered | power-domain control |
| ANE_LoadFirmware / ANE_ForgetFirmware | firmware image lifecycle |
| ANE_SendCommand | raw firmware command injection |
| ANE_SetPowerManagement / ANE_SetDynamicPowerGating / ANE_SetPowerGatingHysteresisTime | power policy |
| ANE_SetThrottlingPercentage | thermal throttle |
| ANE_SetDARTCacheTTL / ANE_FlushInactiveDARTMappings / ANE_UnmapDartBuffers | address-translation controls |
| ANE_ReadANERegister / ANE_WriteANERegister | raw memory-mapped register read and write |
| ANE_FWSharedEventDoorbellRing | ring the firmware shared-event doorbell |
| ANE_AddPersistentClient / ANE_RemovePersistentClient | keep the device resident |
| ANE_MPMMemoryMapRequest / ANE_MPMMemoryUnmapRequest | the multi-process managed-memory region |
| ANE_ExclaveCycle / ANE_ExclaveLoad / ANE_ExclaveEvaluate / ANE_ExclaveUnload | secure-world load and evaluate |
| ANE_ExclaveReadPropertyValue / ANE_ExclaveWritePropertyValue | secure-world property access |
| ANE_GetClientsInfo / ANE_ShowSharedMemoryAllocations / ANE_ShowModelMemoryStatus | diagnostics |

Table 27.5. The register, power, firmware, and exclave method surface exported by the direct-path client beyond its nine kernel selectors.

A second access check beyond the kernel entitlement gates the raw register read and write, command injection, and exclave methods: a privileged-virtual-machine-access property probed at client open, distinct from the device-open entitlement.

From a user-space call to the doorbell

A submit on selector 2 crosses four layers from the user-space call down to the hardware doorbell write, traced in Listing 27.5.

/* The submit path for selector 2 / direct-path selector 4: */
H11ANEInUserClient::externalMethod(sel=2, args)
  -> dispatchExternalMethod                 /* validates 2376-in / 40-out */
  -> ANE_ProgramSendRequest(client, ref, args)            /* arg shim     */
  -> ANEClientDevice::programSendRequest(ANEProgramRequestArgs*, ...)
  -> ANEDriver::ANE_ProgramSendRequest(...)               /* gated        */
  -> ANEHWDevice::doorBellRing(db)
  -> ANERegisterControl::write32(reg, 1 << idx)           /* MMIO mailbox */

Listing 27.5. The four-layer call path from a user-space submit selector down to the hardware doorbell write.

Below the dispatcher, the thin shim re-checks the argument sizes and unmarshals the typed argument structure, the client object method builds the memory descriptors and retains the shared-event fences, and the gated driver method runs on the command-gate workloop. The 2376-byte request structure holds the program handle minted at create time, a sequence number, the quality-of-service and execution-priority pair, and the array of surface identifiers for the input, output, and intermediate buffers, with the measured field layout in Table 27.6.

| Offset | Field | Observed |
| --- | --- | --- |
| +0x000 | program / instance token (u64) | the handle minted by program-create |
| +0x008 | sequence number | 0, then 1 on the next submit |
| +0x010 | priority / quality-of-service pair | (5, 21), qos class and execution priority |
| +0x01c | io category | 2 |
| +0x020 | surface identifier array | input, output, intermediate surfaces |

Table 27.6. The measured field layout of the 2376-byte request structure submitted on selector 2.
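The measured layout can be captured as a C structure with compile-time offset checks. This is a sketch with several assumptions: the offsets and the 2376-byte total come from the text, but the type and field names, the 8-byte sequence width, the split of the pair into two 32-bit words in (qos class, execution priority) order, and the treatment of the surface array as raw bytes are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ANE_REQUEST_SIZE 2376u

/* Hypothetical view of the selector-2 request of Table 27.6. */
typedef struct {
    uint64_t token;          /* +0x000 handle minted by program-create   */
    uint64_t sequence;       /* +0x008 width is an assumption            */
    uint32_t qos_class;      /* +0x010 first word of the observed pair   */
    uint32_t exec_priority;  /* +0x014 second word, order assumed        */
    uint32_t unknown_18;     /* +0x018 not described in the table        */
    uint32_t io_category;    /* +0x01c observed value 2                  */
    uint8_t  surfaces[ANE_REQUEST_SIZE - 0x20]; /* +0x020 raw array bytes */
} ane_send_request;

_Static_assert(sizeof(ane_send_request) == ANE_REQUEST_SIZE, "2376-byte input");
_Static_assert(offsetof(ane_send_request, io_category) == 0x1c, "io category");
_Static_assert(offsetof(ane_send_request, surfaces) == 0x20, "surface array");

/* Populate the header with the values observed in Table 27.6. */
void ane_send_request_init(ane_send_request *req, uint64_t token,
                           uint64_t sequence)
{
    memset(req, 0, sizeof *req);
    req->token = token;
    req->sequence = sequence;
    req->qos_class = 5;       /* observed pair (5, 21) */
    req->exec_priority = 21;
    req->io_category = 2;     /* observed value */
}
```

Keeping the surface array as untyped bytes avoids guessing the width of one surface identifier, which the table does not state.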

The 40-byte output returns the sequence and the echoed token: +0x00 is the sequence and result, +0x08 is the echoed token, and +0x20 is a status flag. Selector 2 alone uses the asynchronous machinery, and it is the only path that reaches a hardware doorbell from user space. Completion arrives at a mach wake port as a callback, not by a shared-memory poll. The doorbell write itself reads the doorbell index from the request, requires it below 32, computes the mask 1 << index, and stores that mask into the engine register aperture, the mailbox signal that triggers the firmware.
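The doorbell arithmetic described above is small enough to sketch directly. The function names doorbell_mask and doorbell_ring are hypothetical, and the register is modelled as an ordinary volatile word supplied by the caller rather than a mapped aperture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Compute the doorbell mask for an index; fails when index is not below 32. */
bool doorbell_mask(uint32_t index, uint32_t *mask_out)
{
    if (index >= 32)
        return false;
    *mask_out = (uint32_t)1 << index;
    return true;
}

/* Store the mask into a register word. In the driver this would be an MMIO
   write into the engine aperture; here it is a plain volatile store. */
bool doorbell_ring(volatile uint32_t *reg, uint32_t index)
{
    uint32_t mask;
    if (!doorbell_mask(index, &mask))
        return false;
    *reg = mask;
    return true;
}
```

The bound check comes first because shifting a 32-bit value by 32 or more is undefined in C, so the index limit doubles as the guard that keeps the shift well defined.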

The reverse signal, the engine telling the host a job is complete, travels the same windowed store mechanism in firmware. The engine rings a host interrupt with an interrupt-atomic memory-mapped store to the host-supplied target register, bracketed by clearing and then setting bit 39 of the implementation-defined AArch64 system register S3_3_C15_C8_0. Clearing bit 39 opens the posted-write window, the firmware stores the doorbell value into the host aperture, and a barrier separates the store from the status sample. The firmware then reads back the uncorrectable-cache-error bit (bit 1) and the transaction-reject bit (bit 7) to confirm the store has committed before setting bit 39 to close the window. The whole sequence runs with interrupts disabled so a nested handler cannot corrupt the status read.
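The ordering of that firmware sequence can be sketched as a host-side simulation. Nothing here touches a real system register: the fw_model structure, its field names, and fw_ring_host are hypothetical, the status bits are modelled as a separate word because the text does not say which register holds them, and treating both bits clear as success is an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define WINDOW_BIT (UINT64_C(1) << 39)  /* bit 39 of the modelled register */
#define ERR_UNCORR (UINT64_C(1) << 1)   /* uncorrectable-cache-error bit   */
#define ERR_REJECT (UINT64_C(1) << 7)   /* transaction-reject bit          */

/* Simulated firmware state; stands in for hardware, not a real layout. */
typedef struct {
    uint64_t sysreg;       /* models S3_3_C15_C8_0                   */
    uint64_t status;       /* hypothetical home of the status bits   */
    uint32_t host_target;  /* models the host-supplied register      */
    bool     irq_enabled;  /* models the interrupt mask              */
} fw_model;

/* Ring the host: open the window, store, fence, sample status, close. */
bool fw_ring_host(fw_model *fw, uint32_t value)
{
    bool saved_irq = fw->irq_enabled;
    fw->irq_enabled = false;                    /* run with interrupts off  */

    fw->sysreg &= ~WINDOW_BIT;                  /* clear bit 39: open       */
    fw->host_target = value;                    /* the doorbell store       */
    atomic_thread_fence(memory_order_seq_cst);  /* barrier before sampling  */
    bool ok = (fw->status & (ERR_UNCORR | ERR_REJECT)) == 0;
    fw->sysreg |= WINDOW_BIT;                   /* set bit 39: close        */

    fw->irq_enabled = saved_irq;
    return ok;
}
```

The window is closed and the interrupt state restored on both the success and the failure path, so a rejected store never leaves the window open.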

Entitlement gate and broker model

The client open checks the two kernel entitlements of Listing 27.6, the hard device-open gate and the resident data-chaining gate.

/* checked at H11ANEInUserClient::init via copyClientEntitlement: */
"com.apple.ane.iokit-user-access"            /* the hard device-open gate   */
"com.apple.ane.allow-dataChaining-access"    /* resident data-chaining gate */

Listing 27.6. The two kernel entitlements checked when a user client is opened.

A single kernel entitlement gates opening either user client. The check runs once at client construction and is a boolean on the client object, not re-checked per selector. Across the whole system, exactly two binaries hold com.apple.ane.iokit-user-access: the system broker daemon and its per-user sibling. No application process opens the device. Every other consumer reaches the engine through the broker over a cross-process call, proving itself with an entitlement from the broker's own private family rather than the kernel gate.

The driver stamps the capability at client creation in ANEClientInfo::create, which reads each entitlement through copyClientEntitlement and records isPrivileged and allowDataChaining as bits on the client. Beyond the two open gates, the driver enforces six further com.apple.ane and com.apple.private.ane entitlements covering scheduling priority, memory and data access, and client and coalition hints, as Table 27.7 lists.

| Entitlement | Capability |
| --- | --- |
| com.apple.ane.realtime-priority-client | the real-time-priority client grant |
| com.apple.ane.allow-system-reserved-priorities | use of the system-reserved scheduling priorities |
| com.apple.ane.memory | a memory-access grant |
| com.apple.ane.allow-data | a data-access grant |
| com.apple.private.ane.allow-set-client-hints | set per-client hints |
| com.apple.private.ane.allow-share-coalition-hints | share hints across a coalition |

Table 27.7. The kernel-driver entitlements beyond the two open gates, with the capability each grants.
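The stamp-once model, in which entitlements are read at client creation and recorded as bits, can be sketched as follows. The bit names, holds_entitlement, and client_capability_bits are hypothetical, and mapping com.apple.ane.iokit-user-access to the isPrivileged bit is an assumption drawn from the order in which the text lists the two gates and the two bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bit assignments for the two recorded capabilities. */
enum {
    CLIENT_IS_PRIVILEGED       = 1 << 0,
    CLIENT_ALLOW_DATA_CHAINING = 1 << 1
};

/* True when key appears in the caller's list of held entitlements. */
bool holds_entitlement(const char *const *held, size_t count, const char *key)
{
    for (size_t i = 0; i < count; i++)
        if (strcmp(held[i], key) == 0)
            return true;
    return false;
}

/* Evaluate both gates once and return the bits to store on the client. */
uint32_t client_capability_bits(const char *const *held, size_t count)
{
    uint32_t bits = 0;
    if (holds_entitlement(held, count, "com.apple.ane.iokit-user-access"))
        bits |= CLIENT_IS_PRIVILEGED;
    if (holds_entitlement(held, count,
                          "com.apple.ane.allow-dataChaining-access"))
        bits |= CLIENT_ALLOW_DATA_CHAINING;
    return bits;
}
```

Later selector handlers would then test the stored bits rather than query the entitlement again, which matches the observation that the check is not repeated per selector.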

The broker keys are a private entitlement family, checked per connection and per method over a cross-process call, with each key, its capability, and its holder count given in Table 27.8.

| Entitlement | Capability | Holders |
| --- | --- | --- |
| com.apple.ane.iokit-user-access | the hard kernel gate: open the user client, privileged device open | 2: the broker and its per-user sibling |
| com.apple.ane.allow-dataChaining-access | resident data-chaining on the direct-path client | kernel-checked at client init |
| com.apple.aned.private.allow | baseline: compile, load, instantiate through the broker | 18 |
| com.apple.aned.private.ANEAccess.allow | inference-client access-grant variant | 14 |
| com.apple.aned.private.adapterWeight.allow | stream adapter weights onto a shared resident base model | 5 |
| com.apple.aned.private.processModelShare.allow | share one resident model across processes | 4 |
| com.apple.aned.private.secondaryANECompilerServiceAccess.allow | the longer-duration compiler service for large models | 1 |
| com.apple.aned.private.aggressivePowerSaving | the aggressive-power-saving execution mode | gate helper only |
| com.apple.aned.private.modelPurgeInAllPartitions | purge models across all cache partitions | gate helper only |
| com.apple.security.temporary-exception.iokit-user-client-class | open the direct-path user client for own submission | 27 |
| com.apple.security.ts.ane-client | trust-cache blessed-client slot for latency-critical consumers | 5 |

Table 27.8. The entitlement family that gates the engine, from a static scan of the M1 system binaries, independent of boot-security state.

The broker is a listener that enforces per-connection and per-method entitlement checks. It sorts admitted clients into a restricted tier, an unrestricted tier, and a per-user tier, and it threads a quality-of-service argument through every compile, load, and instantiate method. The restricted tier admits the adapter-weight, model-share, and aggressive-power-saving requests through a per-method admission helper, and the per-user tier serves the per-user broker.

The adapter-weight path is the mechanism behind swappable model weights without recompilation. A base model is loaded once, and each adapter is a new instance bound to a named base-model identifier and holding only its per-adapter weight files. The instance is made through a create-instance-with-weights method that names the base-model identifier and the weight-file count. Residency and power are explicit per-instance arguments on the create-instance method: an enable-power-saving flag, a more-aggressive variant gated by the restricted tier, an opt-out-of-model-memory-unwiring flag that keeps a hot client's weights resident at the cost of footprint, and a queue depth that the broker adjusts downward under contention. A queue-index function and a program-priority function map the quality-of-service argument to hardware scheduling.

A privileged subset of system daemons also holds a sandbox exception that opens the direct-path user client and drives per-inference submission on its own connection, skipping the per-call round trip through the broker. That exception is a latency optimization, not a capability grant: the privileged device open still happens in the broker, which hands the client a program handle and an intermediate-buffer handle.

The kernel binds residency to code-signing identity. A second client may attach to an already-resident program only when its team identifier and code-directory hash match the owner, so a shared resident model or key-value cache cannot leak across tenants. The kernel resolves the caller's team identifier and code-directory hash, and the attach path tests them against the resident owner before a sibling instance reuses the shared intermediate-buffer handle. With one physical engine on this generation, cross-client arbitration is time-division multiplexing on a single gated request queue, biased by the per-stream quality of service the clients declare.
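The attach rule reduces to an equality test on two identity fields, sketched below. The signing_identity type, its field sizes, and may_attach_resident are hypothetical; the 20-byte hash length and the 16-byte team-identifier buffer are assumptions chosen for the example, not values read from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TEAM_ID_MAX 16  /* assumed buffer size for the team identifier */
#define CDHASH_LEN  20  /* assumed code-directory hash length          */

/* Hypothetical record of a client's code-signing identity. */
typedef struct {
    char    team_id[TEAM_ID_MAX];
    uint8_t cdhash[CDHASH_LEN];
} signing_identity;

/* A second client may attach only when both fields match the owner. */
bool may_attach_resident(const signing_identity *owner,
                         const signing_identity *caller)
{
    if (strncmp(owner->team_id, caller->team_id, TEAM_ID_MAX) != 0)
        return false;
    return memcmp(owner->cdhash, caller->cdhash, CDHASH_LEN) == 0;
}
```

Requiring both fields to match means that a different build from the same team, which shares the team identifier but not the hash, is still refused.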

Live device properties

The driver publishes its topology and version constants into the registry, read live on the M1 host and decoded in Table 27.9.

| Property | Value | Meaning |
| --- | --- | --- |
| architecture type string | h13g | microarchitecture family, drives per-family codegen |
| version | 96 = 0x60 | major hardware version |
| minor version | 17 | minor revision |
| board type | 96 | board and system-on-chip type id |
| board subtype | 0 | board sub-variant |
| number of cores | 16 | compute cores in this engine |
| number of engines | 1 | distinct engine units, load balancer is a pass-through |
| CPU subtype | 4 | program-ABI gate |
| internal build | No | release build, gates the debug surfaces |

Table 27.9. The device properties the driver publishes into the registry on the M1 host.

The number-of-cores value is a topology count and not the throughput-relevant multiply-array width, so a floating-point rate is taken from the measured cost-model anchor rather than inferred from the core count. At rest the registry shows the device, the load-balancer instance, and the standing hints client present, with zero control clients and zero direct-path clients open, which confirms the brokered, lazily opened model. The driver opens the user clients on demand per active client and tears them down when idle.