Cherry-pick the patches described below to address the following known issues.
Check allocatable space correctly when sideloading
Sideloading a full OTA package on a Virtual A/B device whose super partition
is smaller than *2 * sum(size of update groups)* may fail with the following
error in the recovery log (/tmp/recovery.log):
The maximum size of all groups with suffix _b (...) has exceeded half of allocatable space for dynamic partitions ...
Here is an example of the log:
[INFO:dynamic_partition_control_android.cc(1020)] Will overwrite existing partitions. Slot A may be unbootable until update finishes!
[...]
[ERROR:dynamic_partition_control_android.cc(803)] The maximum size of all groups with suffix _b (2147483648) has exceeded half of allocatable space for dynamic partitions 1073741824.
If you encounter this issue, cherry-pick CL 1399393, rebuild, and flash the boot partition, or the recovery partition if the device doesn't use recovery as boot.
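The failing check compares the combined size of the target slot's update groups against half of the allocatable space for dynamic partitions. In the example log, half of the allocatable space is 1073741824 bytes. That makes the allocatable space 2147483648 bytes (2 GiB), which equals the group size, so the check fails.

The C++ sketch below reproduces only that arithmetic and treats allocatable space as a single number. The function name and its parameters are hypothetical and are not taken from dynamic_partition_control_android.cc. The sketch also does not show what CL 1399393 changes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper reproducing the condition reported in the recovery log:
// the combined size of the update groups for the target slot ("_b") must not
// exceed half of the space allocatable for dynamic partitions.
bool GroupsExceedHalfOfAllocatable(uint64_t groups_size,
                                   uint64_t allocatable_space) {
  const uint64_t half = allocatable_space / 2;
  return groups_size > half;
}
```

With 2 GiB of update groups, this check passes only when the allocatable space is at least 4 GiB, which is the *2 * sum(size of update groups)* threshold described above.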
Fix segmentation fault during merge
After an OTA update is applied, calling update_engine_client --cancel during the
Virtual A/B (VAB) merge causes CleanupPreviousUpdateAction to crash. A potential
wild pointer error also exists when markSlotSuccessful comes late.
This was resolved by adding the StopActionInternal function. With the fix,
CleanupPreviousUpdateAction cancels pending tasks on destroy: it keeps a variable
that tracks the task ID of its pending task in the message loop, and on destroy
it cancels that task to avoid a segmentation fault.
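The cancel-on-destroy pattern described above can be sketched in C++ as follows. FakeMessageLoop and CleanupActionSketch are hypothetical stand-ins written for this example, as are all of their methods. They are not the brillo message loop or the update_engine classes. The CancelPendingTask() method in particular is not the actual StopActionInternal implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>

// Hypothetical, minimal message loop: tasks are stored by ID until
// RunPending() executes them, and a pending task can be cancelled by ID.
class FakeMessageLoop {
 public:
  using TaskId = uint64_t;
  static constexpr TaskId kTaskIdNull = 0;

  TaskId PostTask(std::function<void()> task) {
    const TaskId id = next_id_++;
    tasks_[id] = std::move(task);
    return id;
  }
  bool CancelTask(TaskId id) { return tasks_.erase(id) > 0; }
  // Runs every task that is still pending at the time of the call.
  void RunPending() {
    std::map<TaskId, std::function<void()>> pending;
    pending.swap(tasks_);
    for (auto& entry : pending) entry.second();
  }
  std::size_t PendingCount() const { return tasks_.size(); }

 private:
  TaskId next_id_ = 1;
  std::map<TaskId, std::function<void()>> tasks_;
};

// Hypothetical action that posts a callback capturing `this`. It records the
// pending task's ID and cancels the task when destroyed, so the loop never
// runs the callback on a deleted object.
class CleanupActionSketch {
 public:
  explicit CleanupActionSketch(FakeMessageLoop* loop) : loop_(loop) {}
  ~CleanupActionSketch() { CancelPendingTask(); }

  void ScheduleNextStep() {
    CancelPendingTask();  // keep at most one pending task
    pending_task_ = loop_->PostTask([this] {
      pending_task_ = FakeMessageLoop::kTaskIdNull;  // no longer pending
      ++steps_run_;
    });
  }

  void CancelPendingTask() {
    if (pending_task_ != FakeMessageLoop::kTaskIdNull) {
      loop_->CancelTask(pending_task_);
      pending_task_ = FakeMessageLoop::kTaskIdNull;
    }
  }

  int steps_run() const { return steps_run_; }

 private:
  FakeMessageLoop* loop_;
  FakeMessageLoop::TaskId pending_task_ = FakeMessageLoop::kTaskIdNull;
  int steps_run_ = 0;
};
```

The destructor cancels the task the action posted. As a result, destroying the action before the loop runs leaves nothing in the loop that refers to the freed object.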
Ensure the following changes are in your Android 11 source tree to fix SIGSEGV
crashes in update_engine during merge:
- CL 1439792 (a prerequisite to CL 1439372)
- CL 1439372 (CleanupPreviousUpdateAction: cancel pending tasks on destroy)
- CL 1663460 (Fix the potential wild pointer error when markSlotSuccessful comes late)
Fix incorrect VAB slot-switching after an OTA update
In Android 11 and higher, failing to synchronize a slot switch after an OTA
update can leave a device in an unusable state. If your IBootControl HAL's
slot-switching implementation performs writes, you must flush those writes
immediately. If the writes aren't flushed, and the device reboots after the
merge starts but before the hardware flushes the slot-switch write, the
device may revert to the previous slot and fail to boot.
For an example solution, see CL 1535570.
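As a rough illustration of "write, then flush immediately", the following C++ sketch uses POSIX pwrite() and fsync(). The function name, target path, offset, and payload are assumptions for illustration only. A real IBootControl HAL writes its own on-disk format; CL 1535570 shows an actual example.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <string>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: writes slot-switch data at `offset` in the file or
// block device at `path`, then calls fsync() so the write is flushed before
// the function reports success.
bool WriteSlotSwitchDurably(const std::string& path, off_t offset,
                            const void* data, size_t len) {
  int fd = open(path.c_str(), O_WRONLY | O_CLOEXEC);
  if (fd < 0) return false;
  const char* p = static_cast<const char*>(data);
  size_t done = 0;
  while (done < len) {
    ssize_t n = pwrite(fd, p + done, len - done,
                       offset + static_cast<off_t>(done));
    if (n < 0) {
      if (errno == EINTR) continue;  // retry interrupted writes
      close(fd);
      return false;
    }
    done += static_cast<size_t>(n);
  }
  // fsync() asks the kernel to push the data to stable storage; on typical
  // Linux storage stacks this includes flushing the device's write cache.
  bool ok = fsync(fd) == 0;
  ok = (close(fd) == 0) && ok;  // always close, keep the first failure
  return ok;
}
```

The function reports success only after fsync() returns. The caller therefore never treats the slot switch as done while the write may still be sitting in a volatile cache.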
Prevent update_engine premature merge
When a device running Android 11 or higher finishes booting, update_engine
calls ScheduleWaitMarkBootSuccessful() and WaitForMergeOrSchedule(), which
starts the merge process. If the device then reboots to the old slot, it fails
to boot and becomes inoperable, because the merge has already started.
Add the following changes to your source tree. Note that CL 1664859 is optional.
- CL 1439792 (a prerequisite to CL 1439372)
- CL 1439372 (CleanupPreviousUpdateAction: cancel pending tasks on destroy)
- CL 1663460 (Fix the potential wild pointer error when markSlotSuccessful comes late)
- CL 1664859 (Optional: add unittest for CleanupPreviousUpdateAction)
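The failure above comes down to an ordering rule: a merge should not begin until the device is running from the updated slot and that boot has been marked successful. Once the merge starts, the old slot can no longer boot. The C++ sketch below states that rule as a predicate. BootState and ShouldStartMerge are hypothetical, and they do not reflect how update_engine or the listed CLs implement the check.

```cpp
#include <cassert>

// Hypothetical snapshot of the state a merge scheduler would consult.
struct BootState {
  bool booted_into_target_slot;  // running from the slot the OTA wrote
  bool boot_marked_successful;   // the new slot's boot has been confirmed
};

// Sketch of the invariant: start the merge only after the device has booted
// the new slot and that boot has been marked successful. After the merge
// begins, falling back to the old slot is no longer safe.
bool ShouldStartMerge(const BootState& s) {
  return s.booted_into_target_slot && s.boot_marked_successful;
}
```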
Prevent data loss or corruption due to skipped metadata
In Android 11 and higher, if a storage device has a volatile write-back cache, under certain conditions, the metadata of a completed merge gets skipped, resulting in data loss or corruption.
Conditions:
- After a merge operation for one set of exceptions finished, merge_callback() was invoked.
- The metadata that tracks merge completion was updated in the COW device. (This update to the COW device is flushed cleanly.)
Result: The system crashed while the storage device's cache of the recent merge had not yet been flushed.
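The write-ordering problem can be modeled with two caching devices. The base device receives the merged data, and the COW device records merge completion. In the C++ sketch below, CachedDisk is a hypothetical in-memory model:

- Write() lands in a volatile cache.
- Flush() makes cached writes durable.
- Crash() drops anything unflushed.

It is not kernel code and does not show the actual fix. It only illustrates why the merged data must be flushed before the completion metadata is committed.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of a device with a volatile write-back cache.
class CachedDisk {
 public:
  void Write(const std::string& what) { cache_.push_back(what); }
  void Flush() {
    durable_.insert(durable_.end(), cache_.begin(), cache_.end());
    cache_.clear();
  }
  void Crash() { cache_.clear(); }  // unflushed writes are lost
  bool IsDurable(const std::string& what) const {
    return std::find(durable_.begin(), durable_.end(), what) != durable_.end();
  }

 private:
  std::vector<std::string> cache_;
  std::vector<std::string> durable_;
};

// One merge step: copy merged data to the base device, then record completion
// in the COW metadata and flush that metadata. Without flushing the base
// device first, the metadata can become durable while the data it describes
// is still only in the base device's cache.
void CompleteMergeStep(CachedDisk& base, CachedDisk& cow,
                       bool flush_base_first) {
  base.Write("merged-chunk");
  if (flush_base_first) base.Flush();
  cow.Write("merge-complete");
  cow.Flush();
}
```

If both devices crash right after the step, the unsafe ordering leaves "merge-complete" durable and "merged-chunk" lost. The flushed ordering keeps both.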
See the following to implement a resolution:
Ensure the correct dm-verity configuration
In Android 11 and higher, devices can be inadvertently configured with the following dm-verity options:
- CONFIG_DM_VERITY_AVB=y in the kernel
- The bootloader configured to use any verity mode (such as AVB_HASHTREE_ERROR_MODE_RESTART_AND_INVALIDATE) other than AVB_HASHTREE_ERROR_MODE_MANAGED_RESTART_AND_EIO
With this device configuration, any verity error causes the vbmeta partition
to become corrupted and renders non-A/B devices inoperable. A/B devices might
also become inoperable if a merge has started. Use only the
AVB_HASHTREE_ERROR_MODE_MANAGED_RESTART_AND_EIO verity mode.
To correct the configuration:
- Set CONFIG_DM_VERITY_AVB=n in the kernel.
- Configure devices to use the AVB_HASHTREE_ERROR_MODE_MANAGED_RESTART_AND_EIO mode instead.
For more information, see the verity documentation: Handling dm-verity Errors.
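The two requirements above can be expressed as a single sanity check. In this C++ sketch, the HashtreeErrorMode enum and VerityConfigIsSafe() are hypothetical. The enumerators only mirror the mode names mentioned in this section and do not reproduce libavb's AvbHashtreeErrorMode values.

```cpp
#include <cassert>

// Illustrative enum: the two modes named in this section plus a catch-all.
enum class HashtreeErrorMode {
  kRestartAndInvalidate,  // AVB_HASHTREE_ERROR_MODE_RESTART_AND_INVALIDATE
  kManagedRestartAndEio,  // AVB_HASHTREE_ERROR_MODE_MANAGED_RESTART_AND_EIO
  kOther,                 // any other verity mode
};

// Hypothetical check: the kernel must not enable CONFIG_DM_VERITY_AVB, and the
// bootloader must select the managed-restart-and-EIO mode.
bool VerityConfigIsSafe(bool kernel_dm_verity_avb, HashtreeErrorMode mode) {
  return !kernel_dm_verity_avb &&
         mode == HashtreeErrorMode::kManagedRestartAndEio;
}
```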
Skip verity work in response to an I/O error during emergency system shutdown
In Android 11 and higher, if an emergency system shutdown is triggered (for example, a thermal shutdown), a dm device can still be alive while the underlying block device can no longer process I/O requests. In this state, I/O errors on new or in-flight dm I/O requests can wrongly put verity into a corruption state.
To skip verity work in response to an I/O error when the system is shutting down, use the following:
CL 1847875 (Skips verity work in response to I/O error during shutdown)
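The decision this CL targets can be modeled as a small function. When a read under dm-verity fails and the system is shutting down, the request completes with the I/O error instead of running verity work. The C++ sketch below uses hypothetical names (SystemState, OnReadIoError) and is a simplified model, not the kernel patch.

```cpp
#include <cassert>

// Hypothetical system states; any state other than kRunning is a shutdown.
enum class SystemState { kRunning, kHalt, kPowerOff, kRestart };

bool IsShuttingDown(SystemState s) {
  return s == SystemState::kHalt || s == SystemState::kPowerOff ||
         s == SystemState::kRestart;
}

enum class VerityIoErrorAction {
  kFailIoOnly,       // complete the request with the I/O error, no verity work
  kQueueVerityWork,  // let verity's error handling examine the failed read
};

// Sketch of the decision made when a read under a dm-verity target returns an
// I/O error: during shutdown the error is not treated as possible corruption.
VerityIoErrorAction OnReadIoError(SystemState state) {
  return IsShuttingDown(state) ? VerityIoErrorAction::kFailIoOnly
                               : VerityIoErrorAction::kQueueVerityWork;
}
```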
Ensure DM_ANDROID_VERITY_AT_MOST_ONCE_DEFAULT_ENABLED is off
Android Go devices running kernel 4.19 or earlier may have
CONFIG_DM_ANDROID_VERITY_AT_MOST_ONCE_DEFAULT_ENABLED=y in their kernel
configuration.
This setting isn't compatible with Virtual A/B, and is known to cause rare page
corruption issues when both are enabled together.
For kernels 4.19 and earlier, disable it by setting
CONFIG_DM_ANDROID_VERITY_AT_MOST_ONCE_DEFAULT_ENABLED=n in the kernel config.
For kernels 5.4 and later, the code has been removed and the configuration option isn't available.
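To catch this setting before flashing, a build check can scan the generated kernel .config. The C++ sketch below is a hypothetical helper, not part of any Android build tool. A generated .config usually records a disabled option as a "# CONFIG_... is not set" comment rather than "=n". The helper treats both forms, and anything other than an exact "=y" line, as disabled.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: returns true only if `option` appears as an exact
// "OPTION=y" line in the text of a kernel .config file.
bool KernelOptionEnabled(const std::string& config_text,
                         const std::string& option) {
  std::istringstream in(config_text);
  const std::string wanted = option + "=y";
  std::string line;
  while (std::getline(in, line)) {
    if (!line.empty() && line.back() == '\r') line.pop_back();  // CRLF files
    if (line == wanted) return true;
  }
  return false;
}
```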
Confirm the merged file is correctly configured
If you build system and vendor images separately and then merge them using
merge_target_files, Virtual A/B configurations might be incorrectly dropped
during the merge process. To ensure that Virtual A/B configurations are
preserved in the merged target files, apply the following patch:
CL 2084183 (Merge identical key/val pairs in dynamic partition info)
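The idea in the CL title, keeping key/value pairs that both inputs define identically, can be sketched as follows. MergeIdenticalPairs() and the example keys are hypothetical. merge_target_files itself is written in Python and handles many keys specially. This C++ sketch models only the identical-pair rule and is not the CL's actual code.

```cpp
#include <cassert>
#include <map>
#include <string>

using KeyValues = std::map<std::string, std::string>;

// Hypothetical sketch: a key that both the framework and the vendor dynamic
// partition info define with the same value is carried into the merged
// result instead of being dropped. Keys with conflicting values, or present on
// only one side, are left out here because they need their own handling.
KeyValues MergeIdenticalPairs(const KeyValues& framework,
                              const KeyValues& vendor) {
  KeyValues merged;
  for (const auto& kv : framework) {
    auto it = vendor.find(kv.first);
    if (it != vendor.end() && it->second == kv.second) {
      merged[kv.first] = kv.second;
    }
  }
  return merged;
}
```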
Update necessary components
As of Android 13, snapuserd has been moved from the vendor ramdisk to the
generic ramdisk. If your device is upgrading to Android 13, both the vendor
ramdisk and the generic ramdisk might contain a copy of snapuserd. In this
situation, Virtual A/B requires the system copy of snapuserd. To ensure that
the correct copy of snapuserd is in place, apply CL 2031243 (Copy snapuserd
to first_stage_ramdisk).