DocDB Core Dump During Backup Restore: Root Cause Analysis
This article dives into the investigation and root cause analysis of a core dump issue encountered in YugabyteDB's DocDB, specifically during backup and restore operations involving Data Definition Language (DDL) changes. The issue, tracked under Jira Link DB-19271, manifests as core dumps during the restore process, potentially leading to service disruptions. We will explore the steps to reproduce the issue, the observed error messages, and the eventual diagnosis of the underlying cause.
Understanding the Problem: DocDB Core Dumps
Core dumps are snapshots of a process's memory at the time of a crash. They are invaluable for debugging, allowing developers to examine the program's state and pinpoint the source of the error. In the context of a database like YugabyteDB, a core dump during a critical operation like backup or restore can indicate a serious problem, potentially leading to data inconsistency or service unavailability.
The specific issue we are addressing involves core dumps occurring during the restore phase of a backup, particularly when DDL operations (like adding, deleting, or modifying columns) are performed concurrently with the backup process. This scenario highlights the complexities of managing database state during schema changes and the importance of ensuring transactional consistency.
Steps to Reproduce the Core Dump Issue
The following steps outline the process used to reproduce the core dump issue in YugabyteDB:
- Set up a YugabyteDB Universe: The first step involves creating a YugabyteDB cluster with a replication factor (RF) of 3. This means that each piece of data is replicated across three nodes, ensuring high availability and fault tolerance.
- Create a Database and Load Data: Next, a database is created within the YugabyteDB universe. Then, data is loaded into four tables within this database over a 30-minute period. This simulates a realistic workload and provides a substantial amount of data for the backup and restore operations.
- Initiate a Loop of Concurrent Operations: This is the crucial step that triggers the issue. A loop is started that performs the following actions:
- Start Workload with DDL Operations: A workload is initiated that includes various DDL operations such as
INSERT,DELETE,UPDATE,DROP COLUMN,ADD COLUMN, andCHANGE TYPE. These operations are executed in a single thread but are designed to run on different nodes. This intentional distribution aims to trigger concurrent catalog rewrites. - Start Backup Creation: While the DDL workload is running, a backup process is initiated.
- Stop Workload: The DDL workload is stopped after a certain period.
- Restore on a Different Database: Finally, the backup is restored to a different database within the same YugabyteDB universe.
- Start Workload with DDL Operations: A workload is initiated that includes various DDL operations such as
By running these steps in a loop, the likelihood of encountering the core dump issue is significantly increased. The concurrency between DDL operations and the backup process creates a race condition that can lead to inconsistencies in the metadata, ultimately triggering the crash.
Observed Core Dump Signatures
During the reproduction process, two distinct core dump signatures were observed. These signatures provide valuable clues about the location and nature of the error within the YugabyteDB codebase.
Core Dump Signature 1
The first core dump signature points to an issue within the TabletSnapshots::RestoreCheckpoint function, specifically related to Google Protocol Buffers (protobuf) and memory management:
(lldb) target create "/home/yugabyte/tserver/bin/yb-tserver" --core "/home/yugabyte/cores/core_yb.1764021179.TabletServer-hi.12393.55129"
Core file '/home/yugabyte/cores/core_yb.1764021179.TabletServer-hi.12393.55129' (x86_64) was loaded.
(lldb) bt all
error: yb-tserver 0x053667ba: DW_TAG_variable has an invalid location: DW_OP_lit0, DW_OP_lit1, DW_OP_shr, DW_OP_convert 0x34, DW_OP_convert 0x3a, DW_OP_stack_value
* thread #1, name = 'yb-tserver', stop reason = signal SIGSEGV
* frame #0: 0x000055b151e5f932 yb-tserver`google::protobuf::UnknownFieldSet::ClearFallback() + 82
frame #1: 0x000055b1534b27ed yb-tserver`yb::tablet::TabletSnapshots::RestoreCheckpoint(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, yb::HybridTime, yb::tablet::TabletSnapshots::RestoreMetadata const&, yb::docdb::ConsensusFrontier const&, bool, yb::OpId const&) [inlined] void google::protobuf::internal::InternalMetadata::Clear<google::protobuf::UnknownFieldSet>(this=<unavailable>) at metadata_lite.h:186:7
frame #2: 0x000055b1534b27db yb-tserver`yb::tablet::TabletSnapshots::RestoreCheckpoint(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, yb::HybridTime, yb::tablet::TabletSnapshots::RestoreMetadata const&, yb::docdb::ConsensusFrontier const&, bool, yb::OpId const&) [inlined] yb::QLValuePB::Clear(this=<unavailable>) at value.pb.cc:1134:23
frame #3: 0x000055b1534b27d3 yb-tserver`yb::tablet::TabletSnapshots::RestoreCheckpoint(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, yb::HybridTime, yb::tablet::TabletSnapshots::RestoreMetadata const&, yb::docdb::ConsensusFrontier const&, bool, yb::OpId const&) [inlined] yb::QLValuePB::CopyFrom(this=<unavailable>, from=0x000012ef30fdc9e0) at value.pb.cc:1905:3
frame #4: 0x000055b1534b27c6 yb-tserver`yb::tablet::TabletSnapshots::RestoreCheckpoint(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, yb::HybridTime, yb::tablet::TabletSnapshots::RestoreMetadata const&, yb::docdb::ConsensusFrontier const&, bool, yb::OpId const&) [inlined] yb::QLValuePB::operator=(this=<unavailable>, from=0x000012ef30fdc9e0) at value.pb.h:560:5
frame #5: 0x000055b1534b27c6 yb-tserver`yb::tablet::TabletSnapshots::RestoreCheckpoint(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, yb::HybridTime, yb::tablet::TabletSnapshots::RestoreMetadata const&, yb::docdb::ConsensusFrontier const&, bool, yb::OpId const&) [inlined] yb::ColumnSchema::set_missing_value(this=<unavailable>, missing_value=0x000012ef30fdc9e0) at schema.h:262:20
frame #6: 0x000055b1534b27c6 yb-tserver`yb::tablet::TabletSnapshots::RestoreCheckpoint(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, yb::HybridTime, yb::tablet::TabletSnapshots::RestoreMetadata const&, yb::docdb::ConsensusFrontier const&, bool, yb::OpId const&) at schema.cc:635:16
frame #7: 0x000055b1534b2770 yb-tserver`yb::tablet::TabletSnapshots::RestoreCheckpoint(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, yb::HybridTime, yb::tablet::TabletSnapshots::RestoreMetadata const&, yb::docdb::ConsensusFrontier const&, bool, yb::OpId const&) at tablet_metadata.cc:572:15
frame #8: 0x000055b1534b24d0 yb-tserver`yb::tablet::TabletSnapshots::RestoreCheckpoint(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, yb::HybridTime, yb::tablet::TabletSnapshots::RestoreMetadata const&, yb::docdb::ConsensusFrontier const&, bool, yb::OpId const&) [inlined] yb::tablet::KvStoreInfo::MergeWithRestored(this=<unavailable>, snapshot_kvstoreinfo=<unavailable>, primary_table_id=<unavailable>, colocated=<unavailable>, overwrite=<unavailable>) at tablet_metadata.cc:538:10
frame #9: 0x000055b1534b2358 yb-tserver`yb::tablet::TabletSnapshots::RestoreCheckpoint(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, yb::HybridTime, yb::tablet::TabletSnapshots::RestoreMetadata const&, yb::docdb::ConsensusFrontier const&, bool, yb::OpId const&) [inlined] yb::tablet::RaftGroupMetadata::MergeWithRestored(this=<unavailable>, path="/mnt/d0/yb-data/tserver/data/rocksdb/table-0000420000003000800000000000401f/tablet-a34b1d297baf4858b970dfbe72333149.snapshots/d73e1e9a-930d-46b3-a73c-b93108859ba6/tablet.metadata", overwrite=<unavailable>) at tablet_metadata.cc:1214:20
frame #10: 0x000055b1534b2358 yb-tserver`yb::tablet::TabletSnapshots::RestoreCheckpoint(this=0x000012eff8166bd0, snapshot_dir="/mnt/d0/yb-data/tserver/data/rocksdb/table-0000420000003000800000000000401f/tablet-a34b1d297baf4858b970dfbe72333149.snapshots/d73e1e9a-930d-46b3-a73c-b93108859ba6", restore_at=<unavailable>, restore_metadata=0x00007f0d8dc50940, frontier=0x00007f0d8dc4f900, is_pitr_restore=<unavailable>, op_id=0x00007f0d8dc4fef0) at tablet_snapshots.cc:591:7
frame #11: 0x000055b1534ae200 yb-tserver`yb::tablet::TabletSnapshots::Restore(this=0x000012eff8166bd0, operation=0x000012ef3a3f3700) at tablet_snapshots.cc:394:14
frame #12: 0x000055b1533e851a yb-tserver`yb::tablet::SnapshotOperation::DoReplicated(long, yb::Status*) [inlined] yb::tablet::SnapshotOperation::Apply(this=<unavailable>, leader_term=<unavailable>, complete_status=<unavailable>) at snapshot_operation.cc:132:56
frame #13: 0x000055b1533e84e4 yb-tserver`yb::tablet::SnapshotOperation::DoReplicated(this=0x000012ef3a3f3700, leader_term=<unavailable>, complete_status=0x00007f0d8dc51288) at snapshot_operation.cc:216:3
frame #14: 0x000055b1533de44e yb-tserver`yb::tablet::Operation::Replicated(this=<unavailable>, leader_term=<unavailable>, was_pending=(value_ = true)) at operation.cc:82:3
frame #15: 0x000055b1533e0851 yb-tserver`yb::tablet::OperationDriver::ReplicationFinished(yb::Status const&, long, std::__1::vector<yb::OpId, std::__1::allocator<yb::OpId>>*) at operation_driver.cc:421:31
frame #16: 0x000055b1533e067c yb-tserver`yb::tablet::OperationDriver::ReplicationFinished(this=0x000012ef3e875400, status=<unavailable>, leader_term=-1, applied_op_ids=0x00007f0d8dc51530) at operation_driver.cc:376:5
This stack trace indicates a segmentation fault (signal SIGSEGV) within the google::protobuf::UnknownFieldSet::ClearFallback() function. This suggests a potential issue with how protobuf messages are being handled during the restore process, possibly related to memory corruption or an invalid state.
Core Dump Signature 2
The second core dump signature also points to the TabletSnapshots::RestoreCheckpoint function, but this time the error occurs during the merging of QLValuePB objects, which are used to represent query language values:
(lldb) target create "/home/yugabyte/tserver/bin/yb-tserver" --core "/home/yugabyte/cores/core_yb.1764021179.raft_notificati.12411.51308"
Core file '/home/yugabyte/cores/core_yb.1764021179.raft_notificati.12411.51308' (x86_64) was loaded.
(lldb) bt all
error: yb-tserver GetDIE for DIE 0x40 is outside of its CU 0x53020ec
error: yb-tserver GetDIE for DIE 0x40 is outside of its CU 0x53020ec
* thread #1, name = 'yb-tserver', stop reason = signal SIGSEGV
* frame #0: 0x000055738f7d57a9 yb-tserver`google::protobuf::internal::ArenaStringPtr::Set(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, google::protobuf::Arena*) + 553
frame #1: 0x000055738fbff711 yb-tserver`yb::QLValuePB::MergeImpl(to_msg=0x000030ea6a6290e0, from_msg=0x000030ea7e8c2fa0) at value.pb.cc:0
frame #2: 0x0000557390e0d77b yb-tserver`yb::tablet::TabletSnapshots::RestoreCheckpoint(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, yb::HybridTime, yb::tablet::TabletSnapshots::RestoreMetadata const&, yb::docdb::ConsensusFrontier const&, bool, yb::OpId const&) [inlined] yb::QLValuePB::MergeFrom(this=<unavailable>, from=0x000030ea7e8c2fa0) at value.pb.h:666:5
frame #3: 0x0000557390e0d770 yb-tserver`yb::tablet::TabletSnapshots::RestoreCheckpoint(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, yb::HybridTime, yb::tablet::TabletSnapshots::RestoreMetadata const&, yb::docdb::ConsensusFrontier const&, bool, yb::OpId const&) [inlined] yb::QLValuePB::CopyFrom(this=<unavailable>, from=0x000030ea7e8c2fa0) at value.pb.cc:1906:3
frame #4: 0x0000557390e0d770 yb-tserver`yb::tablet::TabletSnapshots::RestoreCheckpoint(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, yb::HybridTime, yb::tablet::TabletSnapshots::RestoreMetadata const&, yb::docdb::ConsensusFrontier const&, bool, yb::OpId const&) [inlined] yb::QLValuePB::operator=(this=<unavailable>, from=0x000030ea7e8c2fa0) at value.pb.h:560:5
frame #5: 0x0000557390e0d770 yb-tserver`yb::tablet::TabletSnapshots::RestoreCheckpoint(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, yb::HybridTime, yb::tablet::TabletSnapshots::RestoreMetadata const&, yb::docdb::ConsensusFrontier const&, bool, yb::OpId const&) [inlined] yb::ColumnSchema::set_missing_value(this=<unavailable>, missing_value=0x000030ea7e8c2fa0) at schema.h:262:20
frame #6: 0x0000557390e0d770 yb-tserver`yb::tablet::TabletSnapshots::RestoreCheckpoint(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, yb::HybridTime, yb::tablet::TabletSnapshots::RestoreMetadata const&, yb::docdb::ConsensusFrontier const&, bool, yb::OpId const&) at schema.cc:635:16
frame #7: 0x0000557390e0d770 yb-tserver`yb::tablet::TabletSnapshots::RestoreCheckpoint(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, yb::HybridTime, yb::tablet::TabletSnapshots::RestoreMetadata const&, yb::docdb::ConsensusFrontier const&, bool, yb::OpId const&) at tablet_metadata.cc:572:15
frame #8: 0x0000557390e0d4d0 yb-tserver`yb::tablet::TabletSnapshots::RestoreCheckpoint(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, yb::HybridTime, yb::tablet::TabletSnapshots::RestoreMetadata const&, yb::docdb::ConsensusFrontier const&, bool, yb::OpId const&) [inlined] yb::tablet::KvStoreInfo::MergeWithRestored(this=<unavailable>, snapshot_kvstoreinfo=<unavailable>, primary_table_id=<unavailable>, colocated=<unavailable>, overwrite=<unavailable>) at tablet_metadata.cc:538:10
frame #9: 0x0000557390e0d358 yb-tserver`yb::tablet::TabletSnapshots::RestoreCheckpoint(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, yb::HybridTime, yb::tablet::TabletSnapshots::RestoreMetadata const&, yb::docdb::ConsensusFrontier const&, bool, yb::OpId const&) [inlined] yb::tablet::RaftGroupMetadata::MergeWithRestored(this=<unavailable>, path="/mnt/d0/yb-data/tserver/data/rocksdb/table-0000420000003000800000000000401f/tablet-dd2538e3b12e4a8bbf197b798fbf8311.snapshots/d73e1e9a-930d-46b3-a73c-b93108859ba6/tablet.metadata", overwrite=<unavailable>) at tablet_metadata.cc:1214:20
frame #10: 0x0000557390e0d358 yb-tserver`yb::tablet::TabletSnapshots::RestoreCheckpoint(this=0x000030eaa8b8d3b0, snapshot_dir="/mnt/d0/yb-data/tserver/data/rocksdb/table-0000420000003000800000000000401f/tablet-dd2538e3b12e4a8bbf197b798fbf8311.snapshots/d73e1e9a-930d-46b3-a73c-b93108859ba6", restore_at=<unavailable>, restore_metadata=0x00007fcd37d36ea0, frontier=0x00007fcd37d35e60, is_pitr_restore=<unavailable>, op_id=0x00007fcd37d36450) at tablet_snapshots.cc:591:7
frame #11: 0x0000557390e09200 yb-tserver`yb::tablet::TabletSnapshots::Restore(this=0x000030eaa8b8d3b0, operation=0x000030eaa0987500) at tablet_snapshots.cc:394:14
frame #12: 0x0000557390d4351a yb-tserver`yb::tablet::SnapshotOperation::DoReplicated(long, yb::Status*) [inlined] yb::tablet::SnapshotOperation::Apply(this=<unavailable>, leader_term=<unavailable>, complete_status=<unavailable>) at snapshot_operation.cc:132:56
frame #13: 0x0000557390d434e4 yb-tserver`yb::tablet::SnapshotOperation::DoReplicated(this=0x000030eaa0987500, leader_term=<unavailable>, complete_status=0x00007fcd37d377e8) at snapshot_operation.cc:216:3
frame #14: 0x0000557390d3944e yb-tserver`yb::tablet::Operation::Replicated(this=<unavailable>, leader_term=<unavailable>, was_pending=(value_ = true)) at operation.cc:82:3
frame #15: 0x0000557390d3b851 yb-tserver`yb::tablet::OperationDriver::ReplicationFinished(yb::Status const&, long, std::__1::vector<yb::OpId, std::__1::allocator<yb::OpId>>*) at operation_driver.cc:421:31
frame #16: 0x0000557390d3b67c yb-tserver`yb::tablet::OperationDriver::ReplicationFinished(this=0x000030ebaa15bb80, status=<unavailable>, leader_term=1, applied_op_ids=0x00007fcd37d37a90) at operation_driver.cc:376:5
frame #17: 0x000055738fc31893 yb-tserver`yb::consensus::ConsensusRound::NotifyReplicationFinished(this=0x000030eaa09d9680, status=<unavailable>, leader_term=<unavailable>, applied_op_ids=<unavailable>) at consensus_round.cc:50:14
This stack trace shows the crash occurring within google::protobuf::internal::ArenaStringPtr::Set, which is called by yb::QLValuePB::MergeImpl. This further strengthens the hypothesis that the core dump is related to the handling of QLValuePB objects during the restore process, specifically when merging data from the snapshot with the current database state. The crash happens when trying to set a string value within a protobuf arena, which is a memory management technique used by protobuf to optimize memory allocation and deallocation.
Root Cause Analysis
Based on the observed core dump signatures and the reproduction steps, the root cause of the issue can be attributed to a race condition during the restore process when DDL operations are concurrently modifying the database schema. Here's a breakdown of the problem:
- Concurrent DDL Operations: When DDL operations such as
ADD COLUMNorCHANGE TYPEare executed while a backup is being created, the database schema can be modified mid-backup. This means the backup might contain a snapshot of the schema that is inconsistent with the actual data on disk. - Inconsistent Metadata: The backup process captures metadata about the tables, including column schemas. If a DDL operation changes a column's type or adds a new column after the metadata is captured but before the data is backed up, the metadata in the backup will be out of sync with the data.
- QLValuePB Mismatch: During the restore process, the
TabletSnapshots::RestoreCheckpointfunction attempts to merge the data from the snapshot with the current state of the database. This involves deserializingQLValuePBobjects, which represent the values of columns in the tables. If the schema in the backup doesn't match the schema in the database being restored into, theMergeFromorCopyFromoperations onQLValuePBobjects can lead to memory corruption and crashes. - Protobuf Arena Issues: The use of protobuf arenas for memory management adds another layer of complexity. If a
QLValuePBobject is allocated in an arena associated with an older schema and then an attempt is made to merge data into it based on a newer schema, the arena might not have enough space or the memory layout might be incompatible, leading to a crash within theArenaStringPtr::Setfunction.
In essence, the core dump is a consequence of trying to restore data based on an inconsistent snapshot of the database schema. The concurrent DDL operations introduce a race condition that corrupts the metadata captured in the backup, leading to errors during the restore process.
Conclusion
The investigation into the DocDB core dumps during backup restore operations highlights the challenges of ensuring data consistency in a distributed database system, especially when dealing with concurrent schema changes. The root cause analysis points to a race condition between DDL operations and the backup process, leading to inconsistent metadata and eventual crashes during restore. Understanding these types of issues is crucial for maintaining the reliability and integrity of YugabyteDB deployments.
For more information on backup and restore strategies in distributed databases, you can visit the official YugabyteDB documentation.