@JacobFV
Last active April 21, 2024
huggingface_to_s3

Hugging Face Repository to S3 Transfer Script

This Python script allows you to transfer files from a Hugging Face repository to an Amazon S3 bucket. It iterates over all the files in the specified repository, downloads them one at a time, and uploads them to the designated S3 bucket.
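The listing the script iterates over is a JSON array of entries, each with a `path` and a `type`. A minimal sketch of separating files from subdirectories, using a hard-coded sample response (the field names here follow the Hugging Face Hub tree API; verify them against the live API for your repository):

```python
import json

# Sample of the JSON shape returned by the Hub tree API (illustrative values).
sample = json.loads("""
[
  {"type": "file", "path": "README.md", "size": 1024},
  {"type": "directory", "path": "data"}
]
""")

files = [entry["path"] for entry in sample if entry["type"] == "file"]
dirs = [entry["path"] for entry in sample if entry["type"] == "directory"]
print(files)  # ['README.md']
print(dirs)   # ['data']
```

Files are downloaded and uploaded directly; directories are listed again and recursed into, which is exactly what `process_files` in the script below does.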

Prerequisites

Before running the script, ensure that you have the following:

  • An EC2 instance launched from the Amazon Linux AMI with io2 storage provisioned at 10,000 IOPS; make sure the instance can make outbound HTTPS requests and accept inbound SSH connections

OR

  • Python 3.x installed on your system
  • AWS account with access to S3
  • Hugging Face repository details (owner, repository name, branch)
  • S3 bucket name for storing the transferred files

Setup

  1. SSH into your instance, either by clicking Connect in the top-right corner of the EC2 console or by using a traditional SSH client

  2. Open nano and paste the script below into huggingface_to_s3.py

  3. Install the required Python packages by running the following command:

    pip install boto3 requests
    
  4. Configure your AWS credentials using one of the following methods:

    • Assign an IAM role with S3 put permissions to your ec2 instance

    OR

    • Set up environment variables: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
    • Use an AWS credentials file (~/.aws/credentials)
    • Configure the AWS CLI using aws configure
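For the environment-variable method, a minimal sketch (the key values below are placeholders, not real credentials; substitute your own):

```shell
export AWS_ACCESS_KEY_ID="AKIAEXAMPLEKEY"        # placeholder, not a real key
export AWS_SECRET_ACCESS_KEY="exampleSecretKey"  # placeholder, not a real secret
export AWS_DEFAULT_REGION="us-east-1"            # pick the region of your bucket
```

boto3 resolves credentials in a fixed order (explicit parameters, environment variables, credentials file, then instance role), so any one of the methods above is sufficient.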

Usage

To run the script, use the following command:

python huggingface_to_s3.py --repo-owner <repo_owner> --repo-name <repo_name> --branch <branch> --s3-bucket <s3_bucket_name>

Replace the placeholders with the appropriate values:

  • <repo_owner>: The owner of the Hugging Face repository.
  • <repo_name>: The name of the Hugging Face repository.
  • <branch>: The branch of the repository to transfer files from (default: "main").
  • <s3_bucket_name>: The name of the S3 bucket to store the transferred files.
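These placeholders map directly onto the download URL the script builds for each file. A quick sketch of how the pieces combine (the owner, repo, branch, and path values here are illustrative):

```python
# Illustrative values, standing in for the command-line arguments.
repo_owner = "some-owner"
repo_name = "my-repo"
branch = "main"
file_path = "data/train.json"

# Hugging Face serves raw file content at /resolve/<branch>/<path>.
file_url = (
    f"https://huggingface.co/{repo_owner}/{repo_name}"
    f"/resolve/{branch}/{file_path}"
)
print(file_url)  # https://huggingface.co/some-owner/my-repo/resolve/main/data/train.json
```

The same `file_path` is reused as the S3 object key, so the bucket mirrors the repository's directory layout.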

The script will create the S3 bucket if it doesn't already exist.
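The create-if-missing check boils down to: try `head_bucket`, create on a 404, and re-raise anything else (botocore's `ClientError` carries the status in `response["Error"]["Code"]`). The sketch below factors that decision into a function and exercises it against a stub client, so no AWS calls are made:

```python
class FakeClientError(Exception):
    """Offline stand-in for botocore.exceptions.ClientError."""
    def __init__(self, code):
        self.response = {"Error": {"Code": code}}

def ensure_bucket(client, name, error_cls=FakeClientError):
    """Return 'exists' or 'created'; re-raise any non-404 error."""
    try:
        client.head_bucket(Bucket=name)
        return "exists"
    except error_cls as e:
        if e.response["Error"]["Code"] == "404":
            client.create_bucket(Bucket=name)
            return "created"
        raise

class FakeS3:
    """Stub client: head_bucket raises a 404 for unknown buckets."""
    def __init__(self):
        self.buckets = set()
    def head_bucket(self, Bucket):
        if Bucket not in self.buckets:
            raise FakeClientError("404")
    def create_bucket(self, Bucket):
        self.buckets.add(Bucket)

s3 = FakeS3()
print(ensure_bucket(s3, "demo-bucket"))  # created
print(ensure_bucket(s3, "demo-bucket"))  # exists
```

With the real client, pass `boto3.client("s3")` and `botocore.exceptions.ClientError` instead of the fakes. Note that in regions other than us-east-1, `create_bucket` also needs a `CreateBucketConfiguration` with a `LocationConstraint`.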

Advice

  • Ensure that you have the necessary permissions to access the Hugging Face repository and the S3 bucket.
  • Be cautious when transferring large repositories: the transfer can take considerable time and consume significant network bandwidth.
  • Monitor the script's output for any error messages or warnings during the transfer process.
  • Regularly review and clean up the S3 bucket to avoid unnecessary storage costs.
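On the bandwidth point: the script loads each file fully into memory before uploading, which is fine for small files but risky for multi-gigabyte ones. A chunked read loop bounds memory use; sketched here against an in-memory buffer (with requests, the equivalent is `response.iter_content(chunk_size=...)` on a request made with `stream=True`, and boto3's `upload_fileobj` accepts a streaming file object directly):

```python
import io

def read_in_chunks(fileobj, chunk_size=8192):
    """Yield fixed-size chunks until the stream is exhausted."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk

# Simulate a 20 KiB download with an in-memory buffer.
source = io.BytesIO(b"x" * 20480)
chunks = list(read_in_chunks(source, chunk_size=8192))
print([len(c) for c in chunks])  # [8192, 8192, 4096]
```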

Troubleshooting

  • If you encounter any issues related to AWS credentials or permissions, double-check your AWS configuration and ensure that you have the required permissions to access S3.
  • If the script fails to retrieve files from the Hugging Face repository, verify that the repository details (owner, name, branch) are correct and that you have the necessary access rights.
  • If you experience network-related issues, check your internet connection and ensure that you can access the Hugging Face API and AWS S3 endpoints.

For more detailed information and advanced usage, please refer to the script's source code and the documentation of the respective libraries (boto3 and requests):

#!/usr/bin/env python3
import argparse

import boto3
import requests
from botocore.exceptions import ClientError

# Parse command-line arguments
parser = argparse.ArgumentParser(description="Transfer files from a Hugging Face repository to S3")
parser.add_argument("--repo-owner", required=True, help="Owner of the Hugging Face repository")
parser.add_argument("--repo-name", required=True, help="Name of the Hugging Face repository")
parser.add_argument("--branch", default="main", help="Branch of the Hugging Face repository (default: main)")
parser.add_argument("--s3-bucket", required=True, help="Name of the S3 bucket")
args = parser.parse_args()

# Hugging Face repository details
repo_owner = args.repo_owner
repo_name = args.repo_name
branch = args.branch

# S3 bucket details
s3_bucket_name = args.s3_bucket

# Create an S3 client
s3_client = boto3.client("s3")

# Create the S3 bucket if it doesn't exist
try:
    s3_client.head_bucket(Bucket=s3_bucket_name)
    print(f"S3 bucket '{s3_bucket_name}' already exists")
except ClientError as e:
    if e.response["Error"]["Code"] == "404":
        s3_client.create_bucket(Bucket=s3_bucket_name)
        print(f"S3 bucket '{s3_bucket_name}' created successfully")
    else:
        raise

# Hugging Face Hub tree API endpoint (lists the files on a branch)
api_url = f"https://huggingface.co/api/models/{repo_owner}/{repo_name}/tree/{branch}"

def download_file(file_path):
    """Download one file's raw content from the Hub; return None on failure."""
    file_url = f"https://huggingface.co/{repo_owner}/{repo_name}/resolve/{branch}/{file_path}"
    response = requests.get(file_url)
    if response.status_code == 200:
        return response.content
    print(f"Failed to download file: {file_path}")
    return None

def upload_to_s3(file_path, file_content):
    """Upload file content to S3 under the same key as its repository path."""
    try:
        s3_client.put_object(Body=file_content, Bucket=s3_bucket_name, Key=file_path)
        print(f"Uploaded file to S3: {file_path}")
    except Exception as e:
        print(f"Failed to upload file to S3: {file_path}")
        print(f"Error: {e}")

def process_files(files):
    """Transfer files and recurse into directories."""
    for file in files:
        file_path = file["path"]
        if file["type"] == "file":
            file_content = download_file(file_path)
            if file_content is not None:
                upload_to_s3(file_path, file_content)
        elif file["type"] == "directory":
            response = requests.get(f"{api_url}/{file_path}")
            if response.status_code == 200:
                process_files(response.json())
            else:
                print(f"Failed to retrieve subfolder: {file_path}")

def main():
    response = requests.get(api_url)
    if response.status_code == 200:
        process_files(response.json())
    else:
        print("Failed to retrieve repository files")

if __name__ == "__main__":
    main()
JacobFV commented Apr 21, 2024:

e.g., python huggingface_to_s3.py --repo-owner HuggingFaceFW --repo-name finewe --s3-bucket my-finewe-dataset