Implementing Virtual A/B - Patches

Cherry-pick the following patches to address known issues.

Check allocatable space correctly when sideloading

Sideloading a full OTA package on a Virtual A/B device whose super partition is smaller than 2 * sum(size of update groups) may fail with the following error in the recovery log, /tmp/recovery.log:

The maximum size of all groups with suffix _b (...) has exceeded half of allocatable space for dynamic partitions ...

Here is an example of the log:

[INFO:dynamic_partition_control_android.cc(1020)] Will overwrite existing partitions. Slot A may be unbootable until update finishes!
[...]
[ERROR:dynamic_partition_control_android.cc(803)] The maximum size of all groups with suffix _b (2147483648) has exceeded half of allocatable space for dynamic partitions 1073741824.

If you encounter this issue, cherry-pick CL 1399393, rebuild, and flash the boot partition, or the recovery partition if the device doesn't use recovery as boot.
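The size condition described in this section can be sketched as a quick check. This is only an illustration of the arithmetic, not code from update_engine; the function name and the sizes below are hypothetical, chosen to echo the magnitudes in the example log.

```python
def sideload_may_hit_space_check(super_size, update_group_sizes):
    """Return True if the super partition is smaller than
    2 * sum(size of update groups), the condition under which
    sideloading may fail without the fix."""
    return super_size < 2 * sum(update_group_sizes)


GIB = 1024 ** 3

# Hypothetical device: one update group of 2 GiB on a 2 GiB super partition.
small_super = sideload_may_hit_space_check(2 * GIB, [2 * GIB])

# Hypothetical device: the same group on a 4 GiB super partition.
large_super = sideload_may_hit_space_check(4 * GIB, [2 * GIB])

print(small_super, large_super)
```

A device for which this check returns True is a candidate for the CL named above.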

Fix segmentation fault during merge

After applying an OTA update, during the VAB merge process, a call to update_engine_client --cancel causes CleanupPreviousUpdateAction to crash. A potential wild pointer error also exists when markSlotSuccessful comes late.

This was resolved by adding the StopActionInternal function. CleanupPreviousUpdateAction now cancels pending tasks on destroy: it maintains a variable that tracks the task ID of the pending task in the message loop, and on destroy it cancels that task to avoid the segmentation fault.
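The cancel-on-destroy pattern can be modeled in a few lines. The sketch below is a simplified stand-in, not the update_engine implementation: MessageLoop, CleanupAction, and all method names are hypothetical, and the real code is C++ running on the update_engine message loop.

```python
NO_TASK = 0  # Sentinel meaning "no task is pending".


class MessageLoop:
    """Toy message loop: tasks are queued and run by run_pending()."""

    def __init__(self):
        self._next_id = 1
        self._tasks = {}

    def post_task(self, callback):
        task_id = self._next_id
        self._next_id += 1
        self._tasks[task_id] = callback
        return task_id

    def cancel_task(self, task_id):
        # Returns True if a queued task was removed.
        return self._tasks.pop(task_id, None) is not None

    def run_pending(self):
        # Iterate over a snapshot of ids so the dict can change safely.
        for task_id in sorted(self._tasks):
            callback = self._tasks.pop(task_id, None)
            if callback is not None:
                callback()


class CleanupAction:
    """Tracks the id of its pending task and cancels it when stopped."""

    def __init__(self, loop):
        self._loop = loop
        self._pending_task_id = NO_TASK
        self.alive = True
        self.ran_after_stop = False

    def schedule_check(self):
        self._pending_task_id = self._loop.post_task(self._on_check)

    def _on_check(self):
        self._pending_task_id = NO_TASK
        if not self.alive:
            # In native code this would be a use-after-free.
            self.ran_after_stop = True

    def stop(self):
        if self._pending_task_id != NO_TASK:
            self._loop.cancel_task(self._pending_task_id)
            self._pending_task_id = NO_TASK
        self.alive = False


loop = MessageLoop()
action = CleanupAction(loop)
action.schedule_check()
action.stop()        # Cancels the queued callback.
loop.run_pending()   # Nothing left to run for this action.
print(action.ran_after_stop)
```

Without the cancel in stop(), the queued callback would still fire after the action is gone, which is the situation the patches are meant to prevent.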

Ensure the following changes are in your Android 11 source tree to fix SIGSEGV crashes in update_engine during merge:

  • CL 1439792 (A prerequisite to CL 1439372)
  • CL 1439372 (CleanupPreviousUpdateAction: cancel pending tasks on destroy)
  • CL 1663460 (Fix the potential wild pointer error when markSlotSuccessful comes late)

Fix VAB incorrect slot-switching, post OTA update

In Android 11 and higher, failure to synchronize a slot-switch in a device after an OTA update can put a device into an unusable state. If your IBootControl HAL's slot-switching implementation performs writes, you must flush those writes immediately. If the writes aren't flushed, and the device reboots after the merge starts, but before the hardware can flush the slot-switch write, the device may revert to the previous slot and fail to boot.

For an example solution, see CL 1535570.
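The idea of flushing a slot-switch write before returning can be illustrated as follows. This is a minimal sketch under stated assumptions: a real IBootControl HAL is native code that writes to a bootloader-defined location, whereas the file path, payload, and function name here are hypothetical.

```python
import os
import tempfile


def write_slot_metadata(path, payload):
    """Write payload to path and ask the OS to flush it to storage
    before returning, so a reboot right afterwards does not lose it."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        written = 0
        while written < len(payload):
            written += os.write(fd, payload[written:])
        os.fsync(fd)  # Flush the file's data to the storage device.
    finally:
        os.close(fd)


# Usage with a temporary file standing in for the real metadata location.
with tempfile.TemporaryDirectory() as tmp:
    slot_file = os.path.join(tmp, "active_slot")
    write_slot_metadata(slot_file, b"slot=b\n")
    with open(slot_file, "rb") as f:
        stored = f.read()
    print(stored)
```

The key point is that the flush happens inside the write path rather than being left to a later, unspecified moment.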

Prevent update_engine premature merge

In Android 11 and higher, when a device finishes booting, update_engine calls ScheduleWaitMarkBootSuccessful() and WaitForMergeOrSchedule(), which starts the merge process. If the device then reboots to the old slot, it fails to boot and becomes inoperable because the merge has already started.

Add the following changes to your source tree. Note that CL 1664859 is optional.

  • CL 1439792 (A prerequisite to CL 1439372)
  • CL 1439372 (CleanupPreviousUpdateAction: cancel pending tasks on destroy)
  • CL 1663460 (Fix the potential wild pointer error when markSlotSuccessful comes late)
  • CL 1664859 (Optional - add unittest for CleanupPreviousUpdateAction)

Prevent data loss or corruption due to skipped metadata

In Android 11 and higher, if a storage device has a volatile write-back cache, under certain conditions, the metadata of a completed merge gets skipped, resulting in data loss or corruption.

Conditions:

  1. After finishing a merge operation of one set of exceptions, merge_callback() was invoked.
  2. The metadata was updated in the COW device that tracks the merge completion. (This update to COW device is flushed cleanly.)

Result: The system crashed due to the storage device's cache of the recent merge not getting flushed.

See the following to implement a resolution:

Ensure the correct dm-verity configuration

In Android 11 and higher, devices can be inadvertently configured with the following dm-verity options:

  • CONFIG_DM_VERITY_AVB=y in the kernel
  • The bootloader configured to use any verity mode (such as AVB_HASHTREE_ERROR_MODE_RESTART_AND_INVALIDATE) other than AVB_HASHTREE_ERROR_MODE_MANAGED_RESTART_AND_EIO

With this device configuration, any verity error causes the vbmeta partition to become corrupted, and renders non-A/B devices inoperable. Similarly, if a merge has started, A/B devices might also become inoperable. Only use the AVB_HASHTREE_ERROR_MODE_MANAGED_RESTART_AND_EIO verity mode.

  1. Set CONFIG_DM_VERITY_AVB=n in the kernel.
  2. Configure devices to use the AVB_HASHTREE_ERROR_MODE_MANAGED_RESTART_AND_EIO mode instead.

For more information, and as a matter of practice, refer to the verity documentation: Handling dm-verity Errors.

Skip verity work in response to an I/O error during emergency system shutdown

In Android 11 and higher, if an emergency system shutdown is triggered (for example, a thermal shutdown), a dm device can still be alive while the underlying block device can no longer process I/O requests. In this state, I/O errors hit by new dm I/O requests, or by requests already in flight, can be misjudged as verity corruption.

To skip verity work in response to an I/O error when the system is shutting down, use the following:

CL 1847875 (Skips verity work in response to I/O error during shutdown)

Ensure DM_ANDROID_VERITY_AT_MOST_ONCE_DEFAULT_ENABLED is off

Android Go devices running the 4.19 kernel or earlier may have DM_ANDROID_VERITY_AT_MOST_ONCE_DEFAULT_ENABLED=y in their kernel configuration. This setting isn't compatible with Virtual A/B, and is known to cause rare page corruption issues when both are enabled together.

For kernels 4.19 and earlier, disable it by setting CONFIG_DM_ANDROID_VERITY_AT_MOST_ONCE_DEFAULT_ENABLED=n in the kernel config.

For kernels 5.4 and later, the code has been removed and the configuration option isn't available.
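A kernel configuration can be checked for this option with a small script. The sketch below is an assumption-laden helper, not an official tool: the function names are hypothetical, the kernel version is passed in as a (major, minor) tuple, and the sample config text is made up for illustration.

```python
import re

OPTION = "CONFIG_DM_ANDROID_VERITY_AT_MOST_ONCE_DEFAULT_ENABLED"


def option_enabled(config_text, option=OPTION):
    """Return True if the kernel config text contains `option=y`."""
    pattern = re.compile(
        r"^\s*" + re.escape(option) + r"=y\s*$", re.MULTILINE
    )
    return pattern.search(config_text) is not None


def needs_fix(kernel_version, config_text):
    """Kernels 4.19 and earlier must have the option off.
    Later kernels no longer carry the option at all."""
    return kernel_version <= (4, 19) and option_enabled(config_text)


# Hypothetical config fragment for a 4.19 kernel.
sample_config = """\
CONFIG_DM_VERITY=y
CONFIG_DM_ANDROID_VERITY_AT_MOST_ONCE_DEFAULT_ENABLED=y
"""

print(needs_fix((4, 19), sample_config))
```

When the check reports that a fix is needed, set the option to n in the kernel config as described above.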