Android kernel ABI monitoring

You can use application binary interface (ABI) monitoring tooling, available in Android 11 and higher, to stabilize the in-kernel ABI of Android kernels. The tooling collects and compares ABI representations from existing kernel binaries (vmlinux + GKI modules). These ABI representations are the .stg files and the symbol lists. The interface that a representation describes is called the Kernel Module Interface (KMI). You can use the tooling to track and mitigate changes to the KMI.

The ABI monitoring tooling is developed in AOSP and uses STG (or libabigail in Android 13 and lower) to generate and compare representations.

This page describes the tooling, the process of collecting and analyzing ABI representations, and the usage of such representations to provide stability to the in-kernel ABI. This page also provides information for contributing changes to the Android kernels.

Process

Analyzing the kernel's ABI takes multiple steps, most of which can be automated:

  1. Build the kernel and its ABI representation.
  2. Analyze ABI differences between the build and a reference.
  3. Update the ABI representation (if required).
  4. Work with symbol lists.

The following instructions work for any kernel that you can build using a supported toolchain (such as the prebuilt Clang toolchain). Repo manifests are available for all Android common kernel branches and for several device-specific kernels; they ensure that the correct toolchain is used when you build a kernel distribution for analysis.

Symbol lists

The KMI doesn't include all symbols in the kernel or even all of the 30,000+ exported symbols. Instead, the symbols that can be used by vendor modules are explicitly listed in a set of symbol list files maintained publicly in the root of the kernel tree. The union of all the symbols in all of the symbol list files defines the set of KMI symbols maintained as stable. An example symbol list file is abi_gki_aarch64_db845c, which declares the symbols required for the DragonBoard 845c.

Only the symbols listed in a symbol list and their related structures and definitions are considered part of the KMI. You can post changes to your symbol lists if the symbols you need aren't present. After new interfaces are in a symbol list, and are part of the KMI description, they're maintained as stable and must not be removed from the symbol list or modified after the branch is frozen.

Each Android Common Kernel (ACK) KMI kernel branch has its own set of symbol lists. No attempt is made to provide ABI stability between different KMI kernel branches. For example, the KMI for android12-5.10 is completely independent of the KMI for android13-5.10.

ABI tools use KMI symbol lists to limit which interfaces must be monitored for stability. The main symbol list contains the symbols that are required by the GKI kernel modules. Vendors are expected to submit and update additional symbol lists to ensure that the interfaces they rely on maintain ABI compatibility. For example, to see the symbol lists for the android13-5.15 branch, refer to https://android.googlesource.com/kernel/common/+/refs/heads/android13-5.15/android

A symbol list contains the symbols reported to be needed for the particular vendor or device. The complete list used by the tools is the union of all of the KMI symbol list files. ABI tools determine the details of each symbol, including function signature and nested data structures.
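
The union described above can be sketched in a few lines of Python. This is an illustration only, not the tooling's implementation. It assumes that a symbol list file holds one symbol per line, with optional "[section]" header lines and "#" comments; the function names and the sample contents are hypothetical.

```python
from typing import Iterable, Set


def parse_symbol_list(text: str) -> Set[str]:
    """Return the symbols named in the contents of one symbol list file.

    Assumed format: one symbol per line, optional "[section]" headers,
    and "#" comments.
    """
    symbols = set()
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and padding
        if not line or line.startswith("["):      # skip blanks and headers
            continue
        symbols.add(line)
    return symbols


def kmi_symbols(symbol_lists: Iterable[str]) -> Set[str]:
    """Union the symbols of every symbol list file."""
    result: Set[str] = set()
    for text in symbol_lists:
        result |= parse_symbol_list(text)
    return result


if __name__ == "__main__":
    # Hypothetical file contents, for illustration only.
    main_list = "[abi_symbol_list]\n# required by GKI modules\n  module_layout\n  kfree\n"
    vendor_list = "[abi_symbol_list]\n  kfree\n  simple_strtoull\n"
    print(sorted(kmi_symbols([main_list, vendor_list])))
```

A symbol that appears in several lists is counted once, which is why adding a vendor list can only grow the monitored set.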

When the KMI is frozen, no changes are allowed to the existing KMI interfaces; they're stable. However, vendors are free to add symbols to the KMI at any time as long as additions don't affect the stability of the existing ABI. Newly added symbols are maintained as stable as soon as they're cited by a KMI symbol list. Symbols shouldn't be removed from a list for a kernel unless it can be confirmed that no device has ever shipped with a dependency on that symbol.

You can generate a KMI symbol list for a device using the instructions from How to work with symbol lists. Many partners submit one symbol list per ACK, but this isn't a hard requirement. If it helps with maintenance, you can submit multiple symbol lists.

Extend the KMI

While KMI symbols and related structures are maintained as stable (meaning changes that break stable interfaces in a kernel with a frozen KMI can't be accepted), the GKI kernel remains open to extensions so that devices shipping later in the year don't need to define all their dependencies before the KMI is frozen. To extend the KMI, you can add new symbols to the KMI for new or existing exported kernel functions, even if the KMI is frozen. New kernel patches might also be accepted if they don't break the KMI.

About KMI breakages

Kernel binaries are built from the kernel sources. ABI-monitored kernel branches include an ABI representation of the current GKI ABI (in the form of a .stg file). After the binaries (vmlinux, Image, and any GKI modules) are built, an ABI representation can be extracted from them. Any change made to a kernel source file can affect the binaries and, in turn, the extracted .stg file. The AbiAnalyzer compares the committed .stg file with the one extracted from the build artifacts and sets a Lint-1 label on the change in Gerrit if it finds a semantic difference.

Handle ABI breakages

As an example, the following patch introduces a very obvious ABI breakage:

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 42786e6364ef..e15f1d0f137b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -657,6 +657,7 @@ struct mm_struct {
                ANDROID_KABI_RESERVE(1);
        } __randomize_layout;

+       int tickle_count;
        /*
         * The mm_cpumask needs to be at the end of mm_struct, because it
         * is dynamically sized based on nr_cpu_ids.

When you run build ABI with this patch applied, the tooling exits with a non-zero error code and reports an ABI difference similar to this:

function symbol 'struct block_device* I_BDEV(struct inode*)' changed
  CRC changed from 0x8d400dbd to 0xabfc92ad

function symbol 'void* PDE_DATA(const struct inode*)' changed
  CRC changed from 0xc3c38b5c to 0x7ad96c0d

function symbol 'void __ClearPageMovable(struct page*)' changed
  CRC changed from 0xf489e5e8 to 0x92bd005e

... 4492 omitted; 4495 symbols have only CRC changes

type 'struct mm_struct' changed
  byte size changed from 992 to 1000
  member 'int tickle_count' was added
  member 'unsigned long cpu_bitmap[0]' changed
    offset changed by 64
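
The size and offset numbers in this report follow from alignment padding. The following sketch uses Python's ctypes module to model only the tail of the structure; it is a simplified stand-in, not the real struct mm_struct. It assumes a platform where 64-bit integers are 8-byte aligned, and it assumes that the reported offset change of 64 is in bits, which is consistent with the size growing from 992 to 1000 bytes.

```python
import ctypes


class Before(ctypes.Structure):
    # Simplified stand-in for the end of a struct: one 8-byte member
    # followed by a zero-length array of 8-byte elements.
    _fields_ = [
        ("reserved", ctypes.c_uint64),
        ("cpu_bitmap", ctypes.c_uint64 * 0),
    ]


class After(ctypes.Structure):
    # Same layout with a 4-byte int inserted before the trailing array.
    _fields_ = [
        ("reserved", ctypes.c_uint64),
        ("tickle_count", ctypes.c_int32),
        ("cpu_bitmap", ctypes.c_uint64 * 0),
    ]


def layout_delta():
    """Return (size growth in bytes, cpu_bitmap offset shift in bits)."""
    size_growth = ctypes.sizeof(After) - ctypes.sizeof(Before)
    offset_shift = After.cpu_bitmap.offset - Before.cpu_bitmap.offset
    return size_growth, offset_shift * 8


if __name__ == "__main__":
    growth, shift_bits = layout_delta()
    print(f"size grew by {growth} bytes; cpu_bitmap moved by {shift_bits} bits")
```

Although the new member is only 4 bytes wide, the trailing array must start on an 8-byte boundary, so 4 bytes of padding are added and everything after the new member moves by 8 bytes.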

ABI differences detected at build time

The most common reason for errors is when a driver uses a new symbol from the kernel that isn't in any of the symbol lists.

If the symbol isn't included in the symbol list (android/abi_gki_aarch64), first verify that it's exported with EXPORT_SYMBOL_GPL(symbol_name), and then update the ABI representation and the symbol list. For example, the following changes add the new Incremental FS feature to the android12-5.10 branch, which includes updating the symbol list and the ABI representation.

If the symbol is exported (either by you or by an earlier change) but no other driver is using it, you might get a build error similar to the following:

Comparing the KMI and the symbol lists:
+ build/abi/compare_to_symbol_list out/$BRANCH/common/Module.symvers out/$BRANCH/common/abi_symbollist.raw
ERROR: Differences between ksymtab and symbol list detected!
Symbols missing from ksymtab:
Symbols missing from symbol list:
 - simple_strtoull

To resolve, update the KMI symbol list in both your kernel and the ACK (see Update the ABI representation). For an example of updating the ABI XML and symbol list in the ACK, refer to aosp/1367601.
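
The check that produces this error is, in essence, a two-way set difference between the exported symbols and the symbol list. The following Python sketch illustrates the idea; it isn't the build/abi/compare_to_symbol_list script, and the function names are hypothetical. It assumes only that each Module.symvers line is tab-separated with the symbol name in the second field.

```python
def exported_symbols(module_symvers_text):
    """Collect symbol names (second tab-separated field) from Module.symvers text."""
    names = set()
    for line in module_symvers_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2:
            names.add(fields[1])
    return names


def diff_ksymtab_and_symbol_list(ksymtab, symbol_list):
    """Return the symbols present on one side but not the other."""
    ksymtab, symbol_list = set(ksymtab), set(symbol_list)
    return {
        "missing_from_ksymtab": sorted(symbol_list - ksymtab),
        "missing_from_symbol_list": sorted(ksymtab - symbol_list),
    }


if __name__ == "__main__":
    # Hypothetical inputs; CRC values and trailing fields are placeholders.
    symvers = "0x00000001\tkfree\tvmlinux\n0x00000002\tsimple_strtoull\tvmlinux\n"
    report = diff_ksymtab_and_symbol_list(exported_symbols(symvers), {"kfree"})
    for side, names in report.items():
        print(side, names)
```

With these inputs, simple_strtoull is reported as missing from the symbol list, which mirrors the error shown above.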

Resolve kernel ABI breakages

You can handle kernel ABI breakages by refactoring the code to not change the ABI or updating the ABI representation. Use the following chart to determine the best approach for your situation.

Figure 1. ABI breakage resolution flow chart

Refactor code to avoid ABI changes

Make every effort to avoid modifying the existing ABI. In many cases, you can refactor your code to remove changes that affect the ABI.

  • Refactoring struct field changes. If a change modifies the ABI for a debug feature, add an #ifdef around the fields (in the structs and source references) and make sure the CONFIG used for the #ifdef is disabled for the production defconfig and gki_defconfig. For an example of how a debug config can be added to a struct without breaking the ABI, refer to this patchset.

  • Refactoring features to not change the core kernel. If new features need to be added to ACK to support the partner modules, try to refactor the ABI part of the change to avoid modifying the kernel ABI. For an example of using the existing kernel ABI to add additional capabilities without changing the kernel ABI, refer to aosp/1312213.

Fix a broken ABI on Android Gerrit

If you didn't intentionally break the kernel ABI, investigate using the guidance provided by the ABI monitoring tooling. The most common causes of breakages are changed data structures and the associated symbol CRC changes, or config option changes that lead to either of these. Begin by addressing the issues found by the tool.

To reproduce the ABI findings locally, see Build the kernel and its ABI representation.

About Lint-1 labels

If you upload changes to a branch containing a frozen or finalized KMI, the changes must pass the AbiAnalyzer to ensure that they don't affect the stable ABI in an incompatible way. During this process, the AbiAnalyzer looks for the ABI report that's created during the build (an extended build that performs the normal build and then some ABI extraction and comparison steps).

If the AbiAnalyzer finds a non-empty report, it sets the Lint-1 label and the change is blocked from submission until the issue is resolved, that is, until the patchset receives a Lint+1 label.

Update the kernel ABI

If modifying the ABI is unavoidable, then you must apply your code changes, the ABI representation, and symbol list to the ACK. To get Lint to remove the -1 and not break GKI compatibility, follow these steps:

  1. Upload code changes to the ACK.

  2. Wait to receive a Code-Review +2 for the patchset.

  3. Update the reference ABI representation.

  4. Merge your code changes and the ABI update change.

Upload ABI code changes to the ACK

Updating the ACK ABI depends on the type of change being made.

  • If an ABI change is related to a feature that affects CTS or VTS tests, the change can usually be cherry-picked to ACK as is. For example:

  • If an ABI change is for a feature that can be shared with the ACK, that change can be cherry-picked to ACK as is. For example, the following changes aren't needed for CTS or VTS test but are OK to be shared with ACK:

  • If an ABI change introduces a new feature that doesn't need to be included in the ACK, you can introduce the symbols to ACK using a stub as described in the following section.

Use stubs for ACK

Stubs should be needed only for core kernel changes that don't benefit the ACK, such as performance and power changes. The following list details examples of stubs and partial cherry-picks in ACK for GKI.

  • Core-isolate feature stub (aosp/1284493). The capability isn't necessary in ACK, but the symbols need to be present in ACK for your modules to use them.

  • Placeholder symbol for vendor module (aosp/1288860).

  • ABI-only cherry-pick of per-process mm event tracking feature (aosp/1288454). The original patch was cherry-picked to ACK and then trimmed to include only the necessary changes to resolve the ABI diff for task_struct and mm_event_count. This patch also updates the mm_event_type enum to contain the final members.

  • Partial cherry-pick of thermal struct ABI changes that required more than just adding the new ABI fields.

    • Patch aosp/1255544 resolved ABI differences between the partner kernel and ACK.

    • Patch aosp/1291018 fixed the functional issues found during GKI testing of the previous patch. The fix included initializing the sensor parameter struct to register multiple thermal zones to a single sensor.

  • CONFIG_NL80211_TESTMODE ABI changes (aosp/1344321). This patch added the necessary struct changes for ABI and made sure the additional fields didn't cause functional differences, enabling partners to include CONFIG_NL80211_TESTMODE in their production kernels and still maintain GKI compliance.

Enforce the KMI at runtime

The GKI kernels use the TRIM_UNUSED_KSYMS=y and UNUSED_KSYMS_WHITELIST=<union of all symbol lists> configuration options, which limit the exported symbols (such as symbols exported using EXPORT_SYMBOL_GPL()) to those listed on a symbol list. All other symbols are unexported, and loading a module requiring an unexported symbol is denied. This restriction is enforced at build time and missing entries are flagged.
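
To see ahead of time which symbols a module would be refused, you can compare the module's undefined symbols against the KMI symbols. The following Python sketch assumes that the input is text in the style of `nm --undefined-only` output (lines ending in "U name"); the function names are hypothetical. As a simplification, it doesn't account for symbols that other modules provide, so such symbols are also flagged.

```python
def undefined_symbols(nm_output):
    """Extract symbol names from `nm --undefined-only` style lines ("U name")."""
    names = set()
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[-2] == "U":
            names.add(parts[-1])
    return names


def denied_symbols(nm_output, kmi_symbols):
    """Symbols the module needs that a trimmed kernel would not export."""
    return sorted(undefined_symbols(nm_output) - set(kmi_symbols))


if __name__ == "__main__":
    # Hypothetical nm output for a vendor module.
    nm_text = "                 U kfree\n                 U simple_strtoull\n"
    print(denied_symbols(nm_text, {"kfree", "module_layout"}))
```

An empty result means that every symbol the module imports is on a symbol list.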

For development purposes, you can use a GKI kernel build that doesn't include symbol trimming (meaning all usually exported symbols can be used). To locate these builds, look for the kernel_debug_aarch64 builds on ci.android.com.

Enforce the KMI using module versioning

The Generic Kernel Image (GKI) kernels use module versioning (CONFIG_MODVERSIONS) as an additional measure to enforce KMI compliance at runtime. Module versioning can cause cyclic redundancy check (CRC) mismatch failures at module load time if the expected KMI of a module doesn't match the vmlinux KMI. For example, the following is a typical failure that occurs at module load time due to a CRC mismatch for the symbol module_layout():

init: Loading module /lib/modules/kernel/.../XXX.ko with args ""
XXX: disagrees about version of symbol module_layout
init: Failed to insmod '/lib/modules/kernel/.../XXX.ko' with args ''

Uses of module versioning

Module versioning is useful for the following reasons:

  • Module versioning catches changes in data structure visibility. If modules change opaque data structures, that is, data structures that aren't part of the KMI, they break after future changes to the structure.

    As an example, consider the fwnode field in struct device. This field MUST be opaque to modules so that they can't make changes to fields of device->fwnode or make assumptions about its size.

    However, if a module includes <linux/fwnode.h> (directly or indirectly), then the fwnode field in the struct device is no longer opaque to it. The module can then make changes to device->fwnode->dev or device->fwnode->ops. This scenario is problematic for several reasons, stated as follows:

    • It can break assumptions the core kernel code is making about its internal data structures.

    • If a future kernel update changes the struct fwnode_handle (the data type of fwnode), then the module no longer works with the new kernel. Moreover, stgdiff won't show any differences because the module is breaking the KMI by directly manipulating internal data structures in ways that can't be captured by only inspecting the binary representation.

  • A current module is deemed KMI-incompatible when it is loaded at a later date by a new kernel that's incompatible. Module versioning adds a run-time check to avoid accidentally loading a module that isn't KMI-compatible with the kernel. This check prevents hard-to-debug runtime issues and kernel crashes that might result from an undetected incompatibility in the KMI.

Enabling module versioning prevents all these issues.

Check for CRC mismatches without booting the device

stgdiff compares and reports CRC mismatches between kernels along with other ABI differences.

In addition, a full kernel build with CONFIG_MODVERSIONS enabled generates a Module.symvers file as part of the normal build process. This file has one line for every symbol exported by the kernel (vmlinux) and the modules. Each line consists of the CRC value, symbol name, symbol namespace, the vmlinux or module name that's exporting the symbol, and the export type (for example, EXPORT_SYMBOL versus EXPORT_SYMBOL_GPL).

You can compare the Module.symvers files between the GKI build and your build to check for any CRC differences in the symbols exported by vmlinux. If there is a CRC value difference in any symbol exported by vmlinux and that symbol is used by one of the modules you load in your device, the module doesn't load.
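
The comparison can be sketched in Python as follows. This is a minimal illustration with hypothetical function names. It reads only the first two tab-separated fields of each line (the CRC value and the symbol name) and ignores the remaining columns. The sample CRC values are taken from the report earlier on this page; the other fields are placeholders.

```python
def read_crcs(module_symvers_text):
    """Map symbol name to CRC using the first two tab-separated fields."""
    crcs = {}
    for line in module_symvers_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2:
            crcs[fields[1]] = int(fields[0], 16)  # accepts an "0x" prefix
    return crcs


def crc_mismatches(gki_crcs, device_crcs):
    """Return {symbol: (gki_crc, device_crc)} for symbols whose CRCs differ."""
    common = gki_crcs.keys() & device_crcs.keys()
    return {
        name: (gki_crcs[name], device_crcs[name])
        for name in sorted(common)
        if gki_crcs[name] != device_crcs[name]
    }


if __name__ == "__main__":
    gki = "0x8d400dbd\tI_BDEV\tvmlinux\n0xc3c38b5c\tPDE_DATA\tvmlinux\n"
    device = "0xabfc92ad\tI_BDEV\tvmlinux\n0xc3c38b5c\tPDE_DATA\tvmlinux\n"
    for name, (old, new) in crc_mismatches(read_crcs(gki), read_crcs(device)).items():
        print(f"{name}: {old:#010x} -> {new:#010x}")
```

Symbols that exist in only one of the two files are skipped here; they indicate an added or removed export rather than a CRC mismatch.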

If you don't have all the build artifacts, but do have the vmlinux files of the GKI kernel and your kernel, you can compare the CRC values for a specific symbol by running the following command on both the kernels and comparing the output:

nm <path to vmlinux>/vmlinux | grep __crc_<symbol name>

For example, the following command checks the CRC value for the module_layout symbol:

nm vmlinux | grep __crc_module_layout
0000000008663742 A __crc_module_layout

Resolve CRC mismatches

Use the following steps to resolve a CRC mismatch when loading a module:

  1. Build the GKI kernel and your device kernel using the --kbuild_symtypes option as shown in the following command:

    tools/bazel run --kbuild_symtypes //common:kernel_aarch64_dist
    

    This command generates a .symtypes file for each .o file. See KBUILD_SYMTYPES in Kleaf for details.

    For Android 13 and lower, build the GKI kernel and your device kernel by prepending KBUILD_SYMTYPES=1 to the command you use to build the kernel, as shown in the following command:

    KBUILD_SYMTYPES=1 BUILD_CONFIG=common/build.config.gki.aarch64 build/build.sh
    

    When using build_abi.sh, the KBUILD_SYMTYPES=1 flag is implicitly set already.

  2. Find the .c file in which the symbol with CRC mismatch is exported, using the following command:

    cd common && git grep EXPORT_SYMBOL.*module_layout
    kernel/module.c:EXPORT_SYMBOL(module_layout);
    
  3. The .c file has a corresponding .symtypes file in both the GKI and your device kernel build artifacts. Locate the .symtypes file using the following commands:

    cd out/$BRANCH/common && ls -1 kernel/module.*
    kernel/module.o
    kernel/module.o.symversions
    kernel/module.symtypes
    

    The following are the characteristics of the .symtypes file:

    • The format of the .symtypes file is one (potentially very long) line per symbol.

    • [s|u|e|etc]# at the start of the line means the symbol is of data type [struct|union|enum|etc]. For example:

      t#bool typedef _Bool bool
      
    • A missing # prefix at the start of the line indicates that the symbol is a function. For example:

      find_module s#module * find_module ( const char * )
      
  4. Compare the two files and fix all the differences.
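
Because each line of a .symtypes file describes one symbol, the comparison in step 4 can be narrowed to the entries that actually differ. The following Python sketch treats the first space-separated token of each line as the entry key and the rest of the line as its definition. The function names are hypothetical, and the expanded fwnode_handle definition in the sample is invented for illustration.

```python
def parse_symtypes(text):
    """Map each entry key (first token) to its definition (rest of the line)."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, definition = line.strip().partition(" ")
        entries[key] = definition.strip()
    return entries


def changed_entries(gki, device):
    """Return {key: (gki_definition, device_definition)} where the two differ.

    A definition of None means the entry is absent from that file.
    """
    keys = sorted(gki.keys() | device.keys())
    return {k: (gki.get(k), device.get(k)) for k in keys if gki.get(k) != device.get(k)}


if __name__ == "__main__":
    gki_text = (
        "t#bool typedef _Bool bool\n"
        "s#fwnode_handle struct fwnode_handle { UNKNOWN }\n"
        "find_module s#module * find_module ( const char * )\n"
    )
    # The expanded definition below is a made-up example.
    device_text = gki_text.replace("{ UNKNOWN }", "{ s#fwnode_handle * secondary ; }")
    for key, (old, new) in changed_entries(parse_symtypes(gki_text), parse_symtypes(device_text)).items():
        print(key, "\n  GKI:   ", old, "\n  device:", new)
```

Listing only the changed keys avoids reading through very long lines that are identical in both builds.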

Case 1: Differences due to data type visibility

If one kernel keeps a symbol or data type opaque to the modules and the other kernel doesn't, that difference appears between the .symtypes files of the two kernels. The .symtypes file from one of the kernels has UNKNOWN for a symbol and the .symtypes file from the other kernel has an expanded view of the symbol or data type.

For example, adding the following line to the include/linux/device.h file in your kernel causes CRC mismatches, one of which is for module_layout():

 #include <linux/fwnode.h>

Comparing the module.symtypes files for that symbol exposes the following differences:

 $ diff -u <GKI>/kernel/module.symtypes <your kernel>/kernel/module.symtypes
  --- <GKI>/kernel/module.symtypes
  +++ <your kernel>/kernel/module.symtypes
  @@ -334,12 +334,15 @@
  ...
  -s#fwnode_handle struct fwnode_handle { UNKNOWN }
  +s#fwnode_reference_args struct fwnode_reference_args { s#fwnode_handle * fwnode ; unsigned int nargs ; t#u64 args [ 8 ] ; }
  ...
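
Whether a differing entry is a visibility difference or a change to the type itself can be decided by looking for the UNKNOWN placeholder on exactly one side. The following Python sketch shows one way to do that; the function name and return strings are hypothetical, and the expanded definition in the usage example is invented for illustration.

```python
OPAQUE_MARKER = "{ UNKNOWN }"


def classify_difference(gki_definition, device_definition):
    """Classify how two .symtypes definitions of the same entry differ."""
    if gki_definition == device_definition:
        return "same"
    gki_opaque = OPAQUE_MARKER in gki_definition
    device_opaque = OPAQUE_MARKER in device_definition
    if gki_opaque and not device_opaque:
        return "visibility: device kernel exposes the type"
    if device_opaque and not gki_opaque:
        return "visibility: GKI kernel exposes the type"
    return "type change"


if __name__ == "__main__":
    print(classify_difference(
        "struct fwnode_handle { UNKNOWN }",
        "struct fwnode_handle { s#fwnode_handle * secondary ; }",  # made-up expansion
    ))
```

A "visibility" result points to an extra #include in one of the kernels, while "type change" points to a modified definition.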

If your kernel has a value of UNKNOWN and the GKI kernel has the expanded view of the symbol (very unlikely), then merge the latest Android Common Kernel into your kernel so that you are using the latest GKI kernel base.

In most cases, the GKI kernel has a value of UNKNOWN, but your kernel has the internal details of the symbol because of changes made to your kernel. This is because one of the files in your kernel added a #include that isn't present in the GKI kernel.

Often, the fix is just hiding the new #include from genksyms:

#ifndef __GENKSYMS__
#include <linux/fwnode.h>
#endif

Otherwise, to identify the #include that causes the difference, follow these steps:

  1. Open the header file that defines the symbol or data type having this difference. For example, edit include/linux/fwnode.h for the struct fwnode_handle.

  2. Add the following code at the top of the header file:

    #ifdef CRC_CATCH
    #error "Included from here"
    #endif
    
  3. In the module's .c file that has a CRC mismatch, add the following as the first line before any of the #include lines.

    #define CRC_CATCH 1
    
  4. Compile your module. The resulting build-time error shows the chain of header file #include that led to this CRC mismatch. For example:

    In file included from .../drivers/clk/XXX.c:16:
    In file included from .../include/linux/of_device.h:5:
    In file included from .../include/linux/cpu.h:17:
    In file included from .../include/linux/node.h:18:
    .../include/linux/device.h:16:2: error: "Included from here"
    #error "Included from here"
    

    One of the links in this chain of #include is due to a change made in your kernel that's missing in the GKI kernel.

  5. Identify the change, then either revert it in your kernel or upload it to ACK and get it merged.

Case 2: Differences due to data type changes

If the CRC mismatch for a symbol or data type isn't due to a difference in visibility, then it's due to actual changes (additions, removals, or changes) in the data type itself.

For example, making the following change in your kernel causes several CRC mismatches as many symbols are indirectly affected by this type of change:

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
  --- a/include/linux/iommu.h
  +++ b/include/linux/iommu.h
  @@ -259,7 +259,7 @@ struct iommu_ops {
     void (*iotlb_sync)(struct iommu_domain *domain);
     phys_addr_t (*iova_to_phys)(struct iommu_domain *domain, dma_addr_t iova);
     phys_addr_t (*iova_to_phys_hard)(struct iommu_domain *domain,
  -        dma_addr_t iova);
  +        dma_addr_t iova, unsigned long trans_flag);
     int (*add_device)(struct device *dev);
     void (*remove_device)(struct device *dev);
     struct iommu_group *(*device_group)(struct device *dev);

One CRC mismatch is for devm_of_platform_populate().

If you compare the .symtypes files for that symbol, it might look like this:

 $ diff -u <GKI>/drivers/of/platform.symtypes <your kernel>/drivers/of/platform.symtypes
  --- <GKI>/drivers/of/platform.symtypes
  +++ <your kernel>/drivers/of/platform.symtypes
  @@ -399,7 +399,7 @@
  ...
  -s#iommu_ops struct iommu_ops { ... ; t#phys_addr_t ( * iova_to_phys_hard ) ( s#iommu_domain * , t#dma_addr_t ) ; int ( * add_device ) ( s#device * ) ; ...
  +s#iommu_ops struct iommu_ops { ... ; t#phys_addr_t ( * iova_to_phys_hard ) ( s#iommu_domain * , t#dma_addr_t , unsigned long ) ; int ( * add_device ) ( s#device * ) ; ...

To identify the changed type, follow these steps:

  1. Find the definition of the symbol in the source code (usually in .h files).

    • For symbol differences between your kernel and the GKI kernel, find the commit by running the following command:
    git blame
    
    • For deleted symbols (where a symbol is deleted in a tree and you also want to delete it in the other tree), you need to find the change that deleted the line. Use the following command on the tree where the line was deleted:
    git log -S "copy paste of deleted line/word" -- <file where it was deleted>
    
  2. Review the returned list of commits to locate the change or deletion. The first commit is probably the one you are searching for. If it isn't, go through the list until you find the commit.

  3. After you identify the change, either revert it in your kernel or upload it to ACK and get it merged.