Use AWS multipart upload commands to upload large files to S3

Description

AWS has two ways of performing S3 uploads: the low-level "aws s3api" set of commands and the high-level "aws s3 cp" command. This page outlines how to use the low-level "aws s3api" commands, which allow us to upload very large files.

We have run into a common scenario where S3 uploads of very large files fail because they exceed the 1-hour security token expiration window. Other workarounds, such as uploading from a Mesos slave, may run into disk space limitations. In these cases, we cannot easily use the typical high-level "aws s3 cp" command. Although that command also performs multipart uploads automatically behind the scenes, any timeout cancels the upload entirely, with no way of resuming where it left off.

On the other hand, although the low-level "aws s3api" set of commands can be fairly tedious, any failed part upload can be retried without interrupting or canceling the parts that have already succeeded.

Step-by-step guide

  1. Split the original file into smaller files. Here is an example splitting into 2GB files:

$ split -b 2048m SOME_LARGE_FILE.CSV SPLIT_FILES.CSV
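
By default, split names the output parts by appending two-letter suffixes to the prefix you give it. Assuming the prefix above, listing the results should show something like:

$ ls SPLIT_FILES.CSV*
SPLIT_FILES.CSVaa  SPLIT_FILES.CSVab  SPLIT_FILES.CSVac  ...

These part files are what you will upload individually in step 4.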

  2. Generate an MD5 hash of the original file to supply for later integrity checks:

$ openssl md5 -binary SOME_LARGE_FILE.CSV | base64

Output: eL39z5x4rbcNxpoWCMB77w==

Note: As an optional step, you may also calculate MD5 values for each individual part file. See the AWS documentation on multipart uploads for more details.
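
For example, here is a minimal sketch of that optional per-part check, assuming a part file named SPLIT_FILES.CSVaa:

$ openssl md5 -binary SPLIT_FILES.CSVaa | base64

The resulting value can then be supplied to that part's upload-part call (step 4) via its optional --content-md5 parameter so S3 verifies the part as it is received.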

  3. Create the multipart upload and get an upload ID. Specify your bucket and key along with the previously generated MD5 hash. The command returns an "UploadId" that you will use for each individual part upload:

$ aws s3api create-multipart-upload --bucket my-s3-bucket-name --key some/s3/folder/SOME_LARGE_FILE.CSV --metadata md5=eL39z5x4rbcNxpoWCMB77w==

Output:

{
    "Bucket": "my-s3-bucket-name",
    "Key": "some/s3/folder/SOME_LARGE_FILE.CSV",
    "UploadId": "some-generated-unique-upload_id"
}
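
If you are scripting these steps, a small sketch like the following (using the CLI's global --query and --output options with the same example bucket and key) captures the UploadId directly into a shell variable for reuse in the remaining commands:

$ UPLOAD_ID=$(aws s3api create-multipart-upload --bucket my-s3-bucket-name --key some/s3/folder/SOME_LARGE_FILE.CSV --metadata md5=eL39z5x4rbcNxpoWCMB77w== --query UploadId --output text)
$ echo "$UPLOAD_ID"
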
  4. Upload the individual file parts. Note that --part-number must be incremented for each part. The --body should be the name of the actual part file. Use the UploadId from above for --upload-id. Save each returned ETag for later.

$ aws s3api upload-part --bucket my-s3-bucket-name --key some/s3/folder/SOME_LARGE_FILE.CSV --part-number 1 --body SPLIT_FILES.CSVaa --upload-id some-generated-unique-upload_id

Output:

{
    "ETag": "\"aec5e3c5dc78e17945647a23a05917c7\""
}
$ aws s3api upload-part --bucket my-s3-bucket-name --key some/s3/folder/SOME_LARGE_FILE.CSV --part-number 2 --body SPLIT_FILES.CSVab --upload-id some-generated-unique-upload_id

Output:

{
    "ETag": "\"e68f43f2ffd58799bf071008d0ea8da3\""
}
$ aws s3api upload-part --bucket my-s3-bucket-name --key some/s3/folder/SOME_LARGE_FILE.CSV --part-number 3 --body SPLIT_FILES.CSVac --upload-id some-generated-unique-upload_id

Output:

{
    "ETag": "\"f624e91916df77ad9ac2f7468da7da94\""
}

... (repeat for each remaining part file, incrementing --part-number each time) ...
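
Uploading many parts by hand is error-prone, so a loop can help. The following is a minimal sketch, assuming the SPLIT_FILES.CSV* naming from step 1 and the example bucket, key, and upload ID from step 3; it uploads each part in order and prints the returned ETag so you can record it for step 6:

PART=1
for f in SPLIT_FILES.CSV*; do
  echo "Uploading part $PART: $f"
  # --query ETag --output text prints only the ETag returned for this part
  aws s3api upload-part \
    --bucket my-s3-bucket-name \
    --key some/s3/folder/SOME_LARGE_FILE.CSV \
    --part-number "$PART" \
    --body "$f" \
    --upload-id some-generated-unique-upload_id \
    --query ETag --output text
  PART=$((PART + 1))
done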

  5. Once all parts are uploaded, list the uploaded parts and confirm the upload is complete:

$ aws s3api list-parts --bucket my-s3-bucket-name --key some/s3/folder/SOME_LARGE_FILE.CSV --upload-id some-generated-unique-upload_id

Output:

{
    "Parts": [
        {
            "PartNumber": 1,
            "LastModified": "2019-01-18T05:18:14.000Z",
            "ETag": "\"aec5e3c5dc78e17945647a23a05917c7\"",
            "Size": 2147483648
        },
        {
            "PartNumber": 2,
            "LastModified": "2019-01-18T05:19:27.000Z",
            "ETag": "\"e68f43f2ffd58799bf071008d0ea8da3\"",
            "Size": 2147483648
        },
        {
            "PartNumber": 3,
            "LastModified": "2019-01-18T05:21:31.000Z",
            "ETag": "\"f624e91916df77ad9ac2f7468da7da94\"",
            "Size": 2147483648
        },
        ...
    ],
    "Initiator": {
        "ID": "arn:aws:sts::755865716437:assumed-role/christine.le",
        "DisplayName": "christine.le"
    },
    "Owner": {
        "DisplayName": "aws-755865716437",
        "ID": "1b70b9aeaf418b727a7036f46fcab451989d591023f5ba8cb8b8d1277d64d1cc"
    },
    "StorageClass": "STANDARD"
}
  6. Create a file called fileparts.json that contains all the part numbers along with their associated ETags:
{
    "Parts": [
        {
            "PartNumber": 1,
            "ETag": "aec5e3c5dc78e17945647a23a05917c7"
        },
        {
            "PartNumber": 2,
            "ETag": "e68f43f2ffd58799bf071008d0ea8da3"
        },
        {
            "PartNumber": 3,
            "ETag": "f624e91916df77ad9ac2f7468da7da94"
        },
        ...
    ]
}
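
Writing this file by hand gets tedious with many parts. If jq is available, a sketch like the following (reusing the list-parts call from step 5) should produce an equivalent fileparts.json automatically; the ETags it emits keep their surrounding quotes, which complete-multipart-upload also accepts:

$ aws s3api list-parts --bucket my-s3-bucket-name --key some/s3/folder/SOME_LARGE_FILE.CSV --upload-id some-generated-unique-upload_id | jq '{Parts: [.Parts[] | {PartNumber, ETag}]}' > fileparts.json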

  7. Finally, complete the multipart upload. The following command assembles the uploaded parts based on the fileparts.json you created. The final object can be found at the bucket and key you specified:

$ aws s3api complete-multipart-upload --multipart-upload file://fileparts.json --bucket my-s3-bucket-name --key some/s3/folder/SOME_LARGE_FILE.CSV --upload-id some-generated-unique-upload_id

Output:

{
    "VersionId": "lNUkhVLzgOK0lSNDZORTDPYZvnwA3Y1k",
    "Location": "https://my-s3-bucket-name.s3.amazonaws.com/some/s3/folder/SOME_LARGE_FILE.CSV",
    "Bucket": "my-s3-bucket-name",
    "Key": "some/s3/folder/SOME_LARGE_FILE.CSV",
    "ETag": "\"eadbb336f01b1ab44b58ddf6f019c585-9\""
}
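
As an optional sanity check, you can fetch the finished object's metadata and compare the md5 value attached in step 3 against the hash you generated in step 2:

$ aws s3api head-object --bucket my-s3-bucket-name --key some/s3/folder/SOME_LARGE_FILE.CSV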

Other helpful notes/tips:

For any failed or incomplete uploads, you must manually abort that specific incomplete upload before re-attempting it.

  1. List all incomplete multipart uploads for the bucket:

$ aws s3api list-multipart-uploads --bucket my-s3-bucket-name

  2. Find the failed upload and abort it:

$ aws s3api abort-multipart-upload --bucket my-s3-bucket-name --key some/s3/folder/SPLIT_FILES.CSVAA --upload-id _zlIMB0qtvqGKiTrgcW9iyF8tAI_hjxJjeA_AWoeopSxCmltyWLFotGMZzURkxzC0ShHc_4F2QEqGoN634U2Yy9pAll0IGNPMx9AfOgTtjE66mNAl.1Dj6b78MvQ40mekIoJbGw4luropMFcLu9bIoDlQcSduVbwNAAZfihpVHc-
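
If there are many uploads in progress, the listing call in the first step above can be narrowed with the CLI's --query option to show just the key and upload ID of each incomplete upload, for example:

$ aws s3api list-multipart-uploads --bucket my-s3-bucket-name --query 'Uploads[].[Key,UploadId]' --output text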
