After re-inventing data migrations for several years, I decided to create a helper for myself.
This script runs batch updates on Django model records. By processing records in small batches it keeps memory usage and lock times down, and it includes progress tracking and optional validation after updates.
N.B. Both the script and this documentation made use of generative AI for validating, restructuring, debating, and more. I have not yet tested this and would appreciate error reports.
This code is MIT-licensed. Please attribute accordingly when using.
```python
from your_module import BatchUpdate
from your_app.models import YourModel


def your_update_function(record):
    # Apply your update logic
    record.field1 = 'some_new_value'
    record.field2 = record.another_field
    return record


queryset = YourModel.objects.filter(your_conditions)
fields_to_update = ['field1', 'field2']

batch_update = BatchUpdate(
    queryset=queryset,
    record_updater=your_update_function,
    fields_to_update=fields_to_update,
    batch_size=2000,
    validation_query=queryset.filter(validation_conditions),
    write_database='default',
    read_database='replica'
)
batch_update()
```
BatchUpdate accepts the following parameters:

- `queryset: QuerySet`
  - The initial queryset from which records will be fetched for updating.
- `record_updater: typing.Callable[[Model], Model]`
  - A function that takes a record as input and returns an updated record.
- `fields_to_update: list[str]`
  - A list of model field names that should be updated.
- `batch_size: int = 2000`
  - The number of records to process in each batch. Default is 2000.
- `validation_query: typing.Optional[QuerySet] = None`
  - An optional queryset used to validate that all records were updated correctly after the process completes.
- `write_database: typing.Optional[str] = None`
  - The alias of the database where the updates should be written. Default is `None`, meaning the default database is used.
- `read_database: typing.Optional[str] = None`
  - The alias of the database from which records should be read. Default is `None`, meaning the default database is used.
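The `read_database` and `write_database` aliases refer to entries in Django's `DATABASES` setting. As a point of reference, a setup matching the `'default'` / `'replica'` aliases used in the example above might look like the following sketch; the engine, database names, and hosts here are placeholders, not part of the script.

```python
# settings.py sketch; the alias names match the usage example above,
# everything else is a placeholder.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "app",
        "HOST": "primary.db.example.internal",
    },
    "replica": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "app",
        "HOST": "replica.db.example.internal",
    },
}
```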
Running data migrations, especially on large datasets, can present several challenges:
- Database Locking: Large updates can lock tables for extended periods, leading to downtime or delayed processing.
- Memory Usage: Processing vast numbers of records in one go can lead to high memory consumption, potentially causing the application to crash.
- Transaction Size: Large transactions can overwhelm the database, leading to timeouts or rollback failures.
- Progress Tracking: Monitoring the progress of large migrations is difficult, creating uncertainty about the process.
- Validation: Ensuring all records are correctly updated after the migration is challenging.
To address these challenges effectively:
- Batch Processing: Break down large updates into smaller batches to avoid locking issues and reduce memory usage; a sketch of this pattern follows this list.
- Error Handling: Implement robust error handling to manage partial updates and avoid inconsistent states.
- Progress Monitoring: Use tools like `tqdm` for progress tracking to maintain visibility over the migration process.
- Validation: After completing updates, validate the changes to ensure that all records have been updated as expected.
- Database Management: Use separate read and write databases to distribute the load and minimize impact on live systems.
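To make the batch-processing point concrete, here is a minimal sketch of the underlying pattern: page through primary keys, apply the update logic, and commit each chunk in its own small transaction, with `tqdm` used when available. The model and field names are placeholders, and this illustrates the general technique rather than the BatchUpdate internals.

```python
# General pattern sketch: chunked reads, one transaction per chunk,
# optional tqdm progress. YourModel and field1 are placeholders.
from django.db import transaction
from your_app.models import YourModel

try:
    from tqdm import tqdm  # optional progress bar
except ImportError:
    tqdm = None

BATCH_SIZE = 2000

pks = list(
    YourModel.objects.filter(field1="old_value").values_list("pk", flat=True)
)
starts = range(0, len(pks), BATCH_SIZE)
iterator = tqdm(starts) if tqdm else starts

for start in iterator:
    chunk = pks[start:start + BATCH_SIZE]
    with transaction.atomic():  # keep each transaction small
        records = list(YourModel.objects.filter(pk__in=chunk))
        for record in records:
            record.field1 = "new_value"  # custom update logic goes here
        YourModel.objects.bulk_update(records, ["field1"])
```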
The `BatchUpdate` class is designed to manage these challenges effectively:
- Batch Size Control: The batch size is adjustable, allowing you to control how many records are processed at a time, reducing the risk of database locks and high memory usage.
- Custom Record Updater: The `record_updater` function offers flexibility to apply custom logic to each record before saving it back to the database.
- Progress Reporting: The script utilizes `tqdm` for a visual progress bar. If `tqdm` is unavailable, it logs progress manually.
- Validation Option: After the batch update, you can provide a validation query to ensure that all records have been correctly updated. If validation fails, the process raises an error.
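To show how these pieces could fit together, here is a skeleton along the lines described above. It is a sketch under stated assumptions rather than the published implementation: the method bodies, the use of `bulk_update`, and the assumption that the validation query should match zero records on success are all illustrative.

```python
# Skeleton sketch only; the real BatchUpdate internals may differ.
import typing

from django.db.models import Model, QuerySet

try:
    from tqdm import tqdm
except ImportError:
    tqdm = None


class BatchUpdate:
    def __init__(
        self,
        queryset: QuerySet,
        record_updater: typing.Callable[[Model], Model],
        fields_to_update: list[str],
        batch_size: int = 2000,
        validation_query: typing.Optional[QuerySet] = None,
        write_database: typing.Optional[str] = None,
        read_database: typing.Optional[str] = None,
    ):
        self.queryset = queryset
        self.record_updater = record_updater
        self.fields_to_update = fields_to_update
        self.batch_size = batch_size
        self.validation_query = validation_query
        self.write_database = write_database or "default"
        self.read_database = read_database or "default"

    def __call__(self) -> None:
        model = self.queryset.model
        pks = list(
            self.queryset.using(self.read_database).values_list("pk", flat=True)
        )
        starts = range(0, len(pks), self.batch_size)
        iterator = tqdm(starts) if tqdm else starts

        for start in iterator:
            chunk = pks[start:start + self.batch_size]
            records = list(
                model.objects.using(self.read_database).filter(pk__in=chunk)
            )
            updated = [self.record_updater(record) for record in records]
            model.objects.using(self.write_database).bulk_update(
                updated, self.fields_to_update
            )

        # Assumed semantics: the validation query selects records that were
        # NOT updated correctly, so it should be empty once the run succeeds.
        if self.validation_query is not None:
            remaining = self.validation_query.using(self.read_database).count()
            if remaining:
                raise RuntimeError(f"Validation failed for {remaining} records")
```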
This script simplifies large-scale data migrations in Django applications by processing records in batches, providing detailed progress reporting, and including error handling and validation. It offers a flexible, efficient solution with minimal disruption to your application, particularly in environments that require distributed read and write operations.