s5cmd sync script for very large buckets with little to no RAM

Low-Memory S3 Backup Script

Why I wrote this

I needed to back up a large S3-compatible bucket on a small VPS without running out of memory.

While tools like s5cmd sync work very well in many cases, they can struggle with extremely large buckets when memory is limited. I wanted something that:

  • stays within a predictable memory limit
  • makes its decisions on disk rather than in RAM
  • can be inspected and resumed at any stage

This script is the result of that work.

Who this is for

This script is intended for:

  • large S3-compatible buckets
  • low-resource servers (VPS, small instances)
  • people who prefer transparent, step-by-step tooling

What it does (high level)

At a high level, the script:

  1. Lists every file in the bucket
  2. Writes that listing to disk
  3. Builds a clean list of remote files and sizes
  4. Builds a clean list of local files and sizes
  5. Compares the two lists
  6. Decides which files need copying
  7. Creates only the required directories
  8. Downloads only missing or changed files

All comparison work happens on disk using plain text files.
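
The core of steps 1 to 7, reduced to standard tools, is a sort and a join over two tab-separated manifests. Here is a minimal sketch, assuming a bucket called my-bucket backed up under $HOME; the full script below adds progress output, regeneration flags and error handling:

# 1-2. List the bucket and stream the raw JSON listing to disk
s5cmd --json ls "s3://my-bucket/*" > listing.json

# 3. Remote manifest: relative key and size, sorted for join
jq -r 'select(.type=="file") | (.key | sub("^s3://my-bucket/"; "")) + "\t" + (.size|tostring)' listing.json \
  | LC_ALL=C sort -t $'\t' -k1,1 > remote.tsv

# 4. Local manifest: relative path and size, sorted for join
(cd "$HOME/my-bucket" && find . -type f -printf '%P\t%s\n') | LC_ALL=C sort -t $'\t' -k1,1 > local.tsv

# 5. Left join: every remote file, with its local size if present
join -t $'\t' -a 1 -e '' -o 1.1,1.2,2.2 remote.tsv local.tsv > joined.tsv

Every remote object ends up in joined.tsv with its remote size and, where the file already exists, its local size; anything missing or mismatched becomes a copy command.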

Why not just use s5cmd sync

s5cmd sync is excellent and should be your first choice when it works.

However, on very large buckets and small machines, it can consume large amounts of memory while building and holding object state. In my case, that made it unusable.

In one reported case, a user observed s5cmd sync growing to over 76 GB of RAM after running for several hours on a very large bucket. On a small VPS, that is not survivable.

This script avoids that problem by:

  • never holding the full object list in memory
  • streaming listings to disk
  • using disk-backed sort and join operations

The tradeoff is speed and complexity, but the memory usage remains bounded.
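
The heavy lifting is delegated to GNU sort and join, which operate on files rather than in-memory structures. If the intermediate files get very large, sort's buffer size and temporary directory can also be pinned down explicitly; this is an optional tweak and not something the script itself does:

# Cap sort's memory at 256 MB and spill to /var/tmp (hypothetical values)
LC_ALL=C sort -S 256M -T /var/tmp -t $'\t' -k1,1 my-bucket.remote.tsv -o my-bucket.remote.tsv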

What this script does not do

  • It does not delete local files when they are removed from S3
  • It does not perform checksum verification
  • It does not attempt to be a perfect mirror

It is intentionally conservative and copy-only.

Tested environment

This script has been tested on:

  • 2 core VPS
  • 2 GB RAM

Against a bucket with:

  • approximately 1.5 TB of data
  • approximately 2.3 million files

Observed behaviour:

  • peak memory usage of roughly 530 MB
  • around 6 minutes to build listings and copy decisions before downloads began (steps 1 to 7)

Requirements

  • Linux
  • bash
  • s5cmd (with JSON output support)
  • jq
  • sufficient disk space for intermediate files
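
A quick way to confirm the tools are available before the first run (a convenience check, not part of the script):

# Report any missing tool by name
for tool in s5cmd jq awk sort join; do
  command -v "$tool" >/dev/null || echo "missing: $tool"
done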

How to use this script

Basic usage

Edit the script first and set your AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_REGION (for S3-compatible providers such as DigitalOcean Spaces, AWS_REGION holds the endpoint host used to build the endpoint URL).

Run the script and pass one or more bucket names. Multiple buckets can be provided in a single run.

./backup.sh my-bucket
./backup.sh bucket-one bucket-two bucket-three

Each bucket is processed independently and backed up into its own directory.


Destination directory

By default, files are downloaded into:

$HOME/<bucket-name>/

You can override this by setting DEST_ROOT:

DEST_ROOT=/mnt/backups ./backup.sh my-bucket

All buckets will be downloaded under this directory, each in its own subdirectory.
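
For example, with DEST_ROOT=/mnt/backups and two buckets, the resulting layout is:

/mnt/backups/bucket-one/...
/mnt/backups/bucket-two/...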


Configuration method

All runtime options (destination, regeneration flags, concurrency) are controlled via environment variables.
Apart from the credentials block at the top of the file, the script itself does not need to be edited.


Regeneration flags

These flags control which steps of the pipeline are regenerated.
All flags default to 1 (enabled).

Variable               Description
REGEN_LISTING          Re-download the bucket listing from S3 (JSON format)
REGEN_REMOTE_MANIFEST  Rebuild the remote file list (path and size)
REGEN_LOCAL_MANIFEST   Re-scan the local filesystem
REGEN_JOINED           Rebuild the remote vs local comparison
REGEN_OUTPUTS          Regenerate the copy commands and directory list

Examples

Reuse an existing bucket listing:

REGEN_LISTING=0 ./backup.sh my-bucket

Reuse listing and remote manifest, but re-scan local files:

REGEN_LISTING=0 REGEN_REMOTE_MANIFEST=0 ./backup.sh my-bucket

Only rerun the download step using existing command files:

REGEN_LISTING=0 REGEN_REMOTE_MANIFEST=0 REGEN_LOCAL_MANIFEST=0 REGEN_JOINED=0 REGEN_OUTPUTS=0 ./backup.sh my-bucket

Download concurrency

Control how many parallel downloads s5cmd performs:

S5_NUMWORKERS=2 ./backup.sh my-bucket

Lower values reduce memory usage and system pressure.
Higher values increase download speed but use more resources.

The default is set to 6.
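
Internally the value is passed straight to s5cmd's --numworkers flag when the generated command file is executed:

s5cmd --numworkers "$S5_NUMWORKERS" run my-bucket.commands.txt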


Inspecting what will happen

Before any files are downloaded, the script generates:

  • <bucket>.commands.txt – exact copy commands that will be executed
  • <bucket>.dirs.txt – directories that will be created

These files make the process predictable and auditable.
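
Each line in <bucket>.commands.txt is a plain s5cmd cp command and each line in <bucket>.dirs.txt is a destination directory. The entries look roughly like this (bucket name and paths are illustrative):

head -n 1 my-bucket.commands.txt
cp "s3://my-bucket/photos/2024/img_0001.jpg" "/home/user/my-bucket/photos/2024/img_0001.jpg"

head -n 1 my-bucket.dirs.txt
/home/user/my-bucket/photos/2024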


Resuming and retrying

The script is safe to rerun.

  • It will not re-download files that already exist with the correct size
  • You can reuse previous data using regeneration flags
  • Local files are never deleted

Notes

  • The script never deletes local files
  • Comparison is size-based only (comparing modification times was far too taxing)
  • Deleted or renamed remote files are ignored on subsequent runs

The goal is safe, low-memory, copy-only backups with full visibility.

#!/usr/bin/env bash
set -euo pipefail
# Credentials and endpoint. For S3-compatible providers, AWS_REGION holds the
# endpoint host and is used to build the endpoint URLs below.
export AWS_ACCESS_KEY_ID=''
export AWS_SECRET_ACCESS_KEY=''
export AWS_PROFILE='default'
export AWS_REGION='lon1.digitaloceanspaces.com'
export AWS_ENDPOINT_URL="https://${AWS_REGION}"
export S3_ENDPOINT_URL="$AWS_ENDPOINT_URL"
DEST_ROOT="${DEST_ROOT:-$HOME}"
mkdir -p "$DEST_ROOT"
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
# Regeneration switches (set to 0 to reuse existing files)
REGEN_LISTING="${REGEN_LISTING:-1}"
REGEN_REMOTE_MANIFEST="${REGEN_REMOTE_MANIFEST:-1}"
REGEN_LOCAL_MANIFEST="${REGEN_LOCAL_MANIFEST:-1}"
REGEN_JOINED="${REGEN_JOINED:-1}"
REGEN_OUTPUTS="${REGEN_OUTPUTS:-1}"
if [ "$#" -eq 0 ]; then
echo "Usage: $0 bucket1 bucket2 ..."
exit 1
fi
buckets=()
for b in "$@"; do
buckets+=("s3://$b")
done
S5_NUMWORKERS="${S5_NUMWORKERS:-6}"
failed=()
for bucket in "${buckets[@]}"; do
  bucket_name="${bucket#s3://}"
  bucket_dir="$DEST_ROOT/$bucket_name"
  mkdir -p "$bucket_dir"
  listing_json="$SCRIPT_DIR/${bucket_name}.listing.json"
  remote_tsv="$SCRIPT_DIR/${bucket_name}.remote.tsv"
  local_tsv="$SCRIPT_DIR/${bucket_name}.local.tsv"
  joined_tsv="$SCRIPT_DIR/${bucket_name}.joined.tsv"
  dirs_file="$SCRIPT_DIR/${bucket_name}.dirs.txt"
  commands_file="$SCRIPT_DIR/${bucket_name}.commands.txt"
  # ---- listing (JSON) with live counter ----
  if [ "$REGEN_LISTING" -eq 1 ]; then
    echo "Fetching JSON listing for $bucket"
    : > "$listing_json"
    s5cmd --json ls "$bucket/*" | awk '
      BEGIN { start = systime(); count = 0 }
      {
        count++
        print $0
        if (count % 1000 == 0) {
          now = systime()
          elapsed = now - start
          if (elapsed > 0) {
            rate = count / elapsed
            printf "\rDownloaded %d JSON lines (%.0f/sec)", count, rate > "/dev/stderr"
          } else {
            printf "\rDownloaded %d JSON lines", count > "/dev/stderr"
          }
        }
      }
      END {
        if (count > 0) {
          printf "\rDownloaded %d JSON lines\n", count > "/dev/stderr"
        }
      }
    ' >> "$listing_json"
  else
    echo "Reusing existing JSON listing: $listing_json"
    if [ ! -s "$listing_json" ]; then
      echo "Listing JSON missing or empty: $listing_json"
      failed+=("$bucket")
      continue
    fi
  fi
  # ---- remote manifest (exact key, size) ----
  if [ "$REGEN_REMOTE_MANIFEST" -eq 1 ]; then
    echo "Building remote manifest"
    : > "$remote_tsv"
    jq -r --arg b "$bucket_name" '
      select(.type=="file") |
      (.key
        | sub("^s3://"; "")
        | sub("^" + $b + "/"; "")
      ) + "\t" + (.size|tostring)
    ' "$listing_json" > "$remote_tsv"
  else
    echo "Reusing remote manifest: $remote_tsv"
    if [ ! -s "$remote_tsv" ]; then
      echo "Remote manifest missing or empty: $remote_tsv"
      failed+=("$bucket")
      continue
    fi
  fi
  # ---- local manifest ----
  if [ "$REGEN_LOCAL_MANIFEST" -eq 1 ]; then
    echo "Building local manifest"
    : > "$local_tsv"
    if [ -d "$bucket_dir" ]; then
      (cd "$bucket_dir" && find . -type f -printf '%P\t%s\n') > "$local_tsv" || :
    fi
  else
    echo "Reusing local manifest: $local_tsv"
    if [ ! -f "$local_tsv" ]; then
      echo "Local manifest missing: $local_tsv"
      failed+=("$bucket")
      continue
    fi
  fi
  # ---- sort manifests (required for join) ----
  echo "Sorting manifests"
  LC_ALL=C sort -t $'\t' -k1,1 "$remote_tsv" -o "$remote_tsv"
  LC_ALL=C sort -t $'\t' -k1,1 "$local_tsv" -o "$local_tsv"
  # ---- join manifests ----
  if [ "$REGEN_JOINED" -eq 1 ]; then
    echo "Joining manifests"
    : > "$joined_tsv"
    join -t $'\t' -a 1 -e '' -o 1.1,1.2,2.2 "$remote_tsv" "$local_tsv" > "$joined_tsv"
  else
    echo "Reusing joined file: $joined_tsv"
    if [ ! -s "$joined_tsv" ]; then
      echo "Joined file missing or empty: $joined_tsv"
      failed+=("$bucket")
      continue
    fi
  fi
  # ---- build copy list (size-only); reuse existing outputs if requested ----
  if [ "$REGEN_OUTPUTS" -eq 1 ]; then
    : > "$dirs_file"
    : > "$commands_file"
    total_files=$(wc -l < "$joined_tsv" | tr -d ' ')
    echo "Total objects to scan: $total_files"
    echo "Building copy list for $bucket (size-only)"
    count=0
    start_ts=$(date +%s)
    while IFS=$'\t' read -r rel remote_size local_size; do
      count=$((count + 1))
      if (( count % 1000 == 0 )); then
        now=$(date +%s)
        elapsed=$((now - start_ts))
        if (( elapsed > 0 )); then
          rate=$((count / elapsed))
          if (( rate > 0 )); then
            remaining=$((total_files - count))
            eta=$((remaining / rate))
            printf '\rProcessed %d / %d | ETA %02d:%02d:%02d' \
              "$count" "$total_files" \
              $((eta/3600)) $(((eta%3600)/60)) $((eta%60))
          else
            printf '\rProcessed %d / %d | ETA --:--:--' "$count" "$total_files"
          fi
        else
          printf '\rProcessed %d / %d | ETA --:--:--' "$count" "$total_files"
        fi
      fi
      # Copy if missing locally or size differs
      if [ -z "${local_size:-}" ] || [ "$local_size" -ne "$remote_size" ]; then
        dest="$bucket_dir/$rel"
        printf '%s\n' "$(dirname "$dest")" >> "$dirs_file"
        printf 'cp "%s/%s" "%s"\n' "$bucket" "$rel" "$dest" >> "$commands_file"
      fi
    done < "$joined_tsv"
    echo
  else
    # Skip the rebuild so existing command and directory files are not duplicated
    echo "Reusing outputs: $dirs_file and $commands_file"
  fi
  files_to_copy=$(wc -l < "$commands_file" | tr -d ' ')
  echo "Files to copy: $files_to_copy"
  dirs_to_create=$(sort -u "$dirs_file" | wc -l | tr -d ' ')
  echo "Directories to create: $dirs_to_create"
  if [ "$files_to_copy" -eq 0 ]; then
    echo "Nothing to copy for $bucket"
    continue
  fi
  echo "Creating directories for $bucket"
  # -d '\n' keeps directory paths containing spaces intact
  sort -u "$dirs_file" | xargs -r -d '\n' mkdir -p
  echo "Copying files for $bucket (workers: $S5_NUMWORKERS)"
  if ! s5cmd --numworkers "$S5_NUMWORKERS" run "$commands_file"; then
    echo "FAILED: $bucket"
    failed+=("$bucket")
  else
    echo "Done: $bucket"
  fi
done
if [ "${#failed[@]}" -ne 0 ]; then
echo "Some buckets failed:"
printf ' - %s\n' "${failed[@]}"
exit 2
fi
echo "All buckets have been downloaded to $DEST_ROOT"