I needed to back up a large S3-compatible bucket on a small VPS without running out of memory.
While tools like s5cmd sync work very well in many cases, they can struggle with extremely large buckets when memory is limited. I wanted something that:
- stays within a predictable memory limit
- makes its decisions on disk rather than in RAM
- can be inspected and resumed at any stage
This script is the result of that work.
It is intended for:
- large S3-compatible buckets
- low-resource servers (VPS, small instances)
- people who prefer transparent, step-by-step tooling
At a high level, the script:
- Lists every file in the bucket
- Writes that listing to disk
- Builds a clean list of remote files and sizes
- Builds a clean list of local files and sizes
- Compares the two lists
- Decides which files need copying
- Creates only the required directories
- Downloads only missing or changed files
All comparison work happens on disk using plain text files.
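As an illustration of that disk-based approach, the comparison step can be expressed with standard coreutils. The snippet below is a minimal sketch, not the script's actual code; the file names and the tab-separated path-and-size manifest format are assumptions for illustration.

```bash
# Sketch only: compare two tab-separated "path<TAB>size" manifests on disk
# and emit the paths that are missing locally or differ in size.
export LC_ALL=C    # byte-wise collation so sort and join agree on ordering

sort -t $'\t' -k1,1 remote.manifest -o remote.sorted
sort -t $'\t' -k1,1 local.manifest  -o local.sorted

# -a 1 keeps remote-only entries; -e MISSING fills in the absent local size.
join -t $'\t' -a 1 -e MISSING -o 1.1,1.2,2.2 remote.sorted local.sorted \
  | awk -F '\t' '$2 != $3 { print $1 }' > to_copy.txt
```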
s5cmd sync is excellent and should be your first choice when it works.
However, on very large buckets and small machines, it can consume large amounts of memory while building and holding object state. In my case, that made it unusable.
In one reported case, a user observed s5cmd sync growing to over 76 GB of RAM after running for several hours on a very large bucket. On a small VPS, that is not survivable.
This script avoids that problem by:
- never holding the full object list in memory
- streaming listings to disk
- using disk-backed sort and join operations
The tradeoff is some speed and extra complexity, but memory usage stays bounded.
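For the listing step, that streaming looks roughly like the sketch below. It assumes s5cmd's --json output and jq field names (key, size, type) that may need adjusting; treat it as an illustration rather than the script's exact commands.

```bash
# Sketch: stream every object record to disk, then reduce the listing to a
# "path<TAB>size" manifest without holding the full object list in RAM.
s5cmd --json ls "s3://my-bucket/*" > my-bucket.listing.json

jq -r 'select(.type == "file") | [.key, .size] | @tsv' \
  my-bucket.listing.json > remote.manifest
```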
- It does not delete local files when they are removed from S3
- It does not perform checksum verification
- It does not attempt to be a perfect mirror
It is intentionally conservative and copy-only.
This script has been tested on:
- 2-core VPS
- 2 GB RAM
Against a bucket with:
- approximately 1.5 TB of data
- approximately 2.3 million files
Observed behaviour:
- peak memory usage of roughly 530 MB
- around 6 minutes to build listings and copy decisions before downloads began (steps 1 to 7)
- Linux
- bash
- s5cmd (with JSON support)
- jq
- sufficient disk space for intermediate files
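Before a long run, it is worth checking that these tools are actually on the PATH; a small sketch, not part of the script itself:

```bash
# Fail early if any required tool is missing.
for tool in s5cmd jq sort join; do
  command -v "$tool" >/dev/null 2>&1 || { echo "missing dependency: $tool" >&2; exit 1; }
done
```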
Edit the file first and set your AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_REGION.
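The edit is just a matter of filling in the standard AWS variables near the top of the script; something like the following, with placeholder values only (the exact layout in the script may differ):

```bash
# Placeholder credentials; replace with your own.
# s5cmd reads these standard AWS environment variables.
export AWS_ACCESS_KEY_ID="AKIAXXXXXXXXXXXXXXXX"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_REGION="us-east-1"
```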
Run the script and pass one or more bucket names. Multiple buckets can be provided in a single run.
./backup.sh my-bucket
./backup.sh bucket-one bucket-two bucket-three
Each bucket is processed independently and backed up into its own directory.
By default, files are downloaded into:
$HOME/<bucket-name>/
You can override this by setting DEST_ROOT:
DEST_ROOT=/mnt/backups ./backup.sh my-bucket
All buckets will be downloaded under this directory, each in its own subdirectory.
All other configuration is done via environment variables.
Beyond the credentials, the script itself does not need to be edited.
These flags control which steps of the pipeline are regenerated.
All flags default to 1 (enabled).
| Variable | Description |
|---|---|
| REGEN_LISTING | Re-download the bucket listing from S3 (JSON format) |
| REGEN_REMOTE_MANIFEST | Rebuild the remote file list (path and size) |
| REGEN_LOCAL_MANIFEST | Re-scan the local filesystem |
| REGEN_JOINED | Rebuild the remote vs local comparison |
| REGEN_OUTPUTS | Regenerate the copy commands and directory list |
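A flag that defaults to 1 but honours an environment override is conventionally written with bash parameter expansion; the script presumably does something along these lines (a sketch, not its exact code):

```bash
# Each flag keeps its environment value if set, otherwise defaults to 1.
REGEN_LISTING="${REGEN_LISTING:-1}"
REGEN_REMOTE_MANIFEST="${REGEN_REMOTE_MANIFEST:-1}"
REGEN_LOCAL_MANIFEST="${REGEN_LOCAL_MANIFEST:-1}"
REGEN_JOINED="${REGEN_JOINED:-1}"
REGEN_OUTPUTS="${REGEN_OUTPUTS:-1}"
```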
Reuse an existing bucket listing:
REGEN_LISTING=0 ./backup.sh my-bucket
Reuse listing and remote manifest, but re-scan local files:
REGEN_LISTING=0 REGEN_REMOTE_MANIFEST=0 ./backup.sh my-bucket
Only rerun the download step using existing command files:
REGEN_LISTING=0 REGEN_REMOTE_MANIFEST=0 REGEN_LOCAL_MANIFEST=0 REGEN_JOINED=0 REGEN_OUTPUTS=0 ./backup.sh my-bucket
Control how many parallel downloads s5cmd performs:
S5_NUMWORKERS=2 ./backup.sh my-bucket
Lower values reduce memory usage and system pressure.
Higher values increase download speed but use more resources.
The default is set to 6.
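This value presumably maps onto s5cmd's --numworkers flag when the generated commands are executed; a hedged sketch of what that invocation could look like (the script's real command may differ):

```bash
# Run the pre-generated copy commands with a bounded worker pool.
s5cmd --numworkers "${S5_NUMWORKERS:-6}" run my-bucket.commands.txt
```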
Before any files are downloaded, the script generates:
- <bucket>.commands.txt – exact copy commands that will be executed
- <bucket>.dirs.txt – directories that will be created
These files make the process predictable and auditable.
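As a purely hypothetical excerpt, assuming the commands file holds s5cmd cp commands, a few lines of <bucket>.commands.txt might look like this (paths are illustrative only):

```
cp s3://my-bucket/photos/2021/img-0001.jpg /home/user/my-bucket/photos/2021/img-0001.jpg
cp s3://my-bucket/db/dump-2024-01-01.sql.gz /home/user/my-bucket/db/dump-2024-01-01.sql.gz
```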
The script is safe to rerun.
- It will not re-download files that already exist with the correct size
- You can reuse previous data using regeneration flags
- Local files are never deleted
- Comparison is size-based only (checking modification times proved far too expensive)
- Deleted or renamed remote files are ignored on subsequent runs
The goal is safe, low-memory, copy-only backups with full visibility.