This Python script allows you to transfer files from a Hugging Face repository to an Amazon S3 bucket. It iterates over all the files in the specified repository, downloads them one at a time, and uploads them to the designated S3 bucket.
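The loop described above can be sketched roughly as follows. This is not the script's actual source: the function name transfer_repo, the repo_type parameter, and the Hub tree-listing and resolve URL patterns used here are assumptions for illustration, and pagination of very large listings is ignored. The session and s3_client arguments are expected to behave like requests.Session() and boto3.client("s3").

```python
HF_BASE = "https://huggingface.co"


def transfer_repo(session, s3_client, owner, name, bucket,
                  branch="main", repo_type="datasets"):
    """Copy every file in a Hub repository to an S3 bucket, one at a time.

    Hypothetical helper: `session` is a requests.Session-like object and
    `s3_client` is a boto3 S3 client-like object.
    """
    repo_id = f"{owner}/{name}"
    # Assumed listing endpoint; entries are dicts with "type" and "path".
    tree_url = f"{HF_BASE}/api/{repo_type}/{repo_id}/tree/{branch}"
    listing = session.get(tree_url, params={"recursive": "true"}, timeout=30)
    listing.raise_for_status()

    # Model repos live at the site root; datasets and spaces get a prefix.
    prefix = "" if repo_type == "models" else f"{repo_type}/"
    uploaded = []
    for entry in listing.json():
        if entry.get("type") != "file":
            continue  # skip directory entries
        path = entry["path"]
        file_url = f"{HF_BASE}/{prefix}{repo_id}/resolve/{branch}/{path}"
        # stream=True avoids loading the whole file into memory.
        with session.get(file_url, stream=True, timeout=30) as resp:
            resp.raise_for_status()
            # Reuse the repository path as the S3 object key.
            s3_client.upload_fileobj(resp.raw, bucket, path)
        uploaded.append(path)
    return uploaded
```

Called as transfer_repo(requests.Session(), boto3.client("s3"), owner, name, bucket), the sketch returns the list of paths it uploaded. Streaming each response straight into upload_fileobj keeps local disk and memory use low, which matters for large repositories.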
Before running the script, ensure that you have the following:
- An EC2 instance launched from the Amazon Linux AMI with io2 storage provisioned at 10,000 IOPS, allowing outbound HTTPS and inbound SSH
OR
- Python 3.x installed on your system
- AWS account with access to S3
- Hugging Face repository details (owner, repository name, branch)
- S3 bucket name for storing the transferred files
- SSH into your instance by clicking Connect in the top right corner of the console, or by using a traditional SSH client.
- Open nano and paste the script into a file named huggingface_to_s3.py
- Install the required Python packages by running the following command:
pip install boto3 requests
- Configure your AWS credentials using one of the following methods:
  - Assign an IAM role with S3 put permissions to your EC2 instance
  OR
  - Set up the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
  - Use an AWS credentials file (~/.aws/credentials)
  - Configure the AWS CLI using aws configure
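Before a long transfer it can help to confirm that at least one of these credential sources is actually present. The following is a minimal sketch using only the standard library; the function name find_credential_sources is hypothetical, and it does not detect an IAM role, which boto3 resolves through the instance metadata service at request time.

```python
import os
from pathlib import Path


def find_credential_sources(env=None, home=None):
    """Return the locally visible AWS credential sources (hypothetical helper)."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    sources = []
    # Both variables must be set for the environment source to be usable.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        sources.append("environment variables")
    # `aws configure` writes to this same file.
    if (home / ".aws" / "credentials").is_file():
        sources.append("credentials file")
    return sources
```

An empty list on an EC2 instance is not necessarily a problem, since an attached IAM role supplies credentials without either of these sources.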
To run the script, use the following command:
python huggingface_to_s3.py --repo-owner <repo_owner> --repo-name <repo_name> --branch <branch> --s3-bucket <s3_bucket_name>
Replace the placeholders with the appropriate values:
- <repo_owner>: The owner of the Hugging Face repository.
- <repo_name>: The name of the Hugging Face repository.
- <branch>: The branch of the repository to transfer files from (default: "main").
- <s3_bucket_name>: The name of the S3 bucket to store the transferred files.
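The command-line interface described above could be declared with argparse along these lines. The flag names and the "main" default come from the usage shown here; the function name build_parser and the choice to mark the other three flags as required are assumptions, so the script's own parser may differ.

```python
import argparse


def build_parser():
    """Build a parser for the four documented flags (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        description="Transfer files from a Hugging Face repository to S3."
    )
    parser.add_argument("--repo-owner", required=True,
                        help="Owner of the Hugging Face repository")
    parser.add_argument("--repo-name", required=True,
                        help="Name of the Hugging Face repository")
    parser.add_argument("--branch", default="main",
                        help='Branch to transfer files from (default: "main")')
    parser.add_argument("--s3-bucket", required=True,
                        help="S3 bucket that receives the files")
    return parser
```

argparse converts the dashes in each flag to underscores, so the parsed values are available as args.repo_owner, args.repo_name, args.branch, and args.s3_bucket.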
The script will create the S3 bucket if it doesn't already exist.
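One common way to implement create-if-missing with boto3 is to probe the bucket with head_bucket and create it only when S3 reports it as not found. The sketch below assumes that approach; the function name ensure_bucket and the region parameter are hypothetical, and the script itself may do this differently.

```python
def ensure_bucket(s3_client, bucket, region="us-east-1"):
    """Create `bucket` if it does not exist. Returns True if it was created.

    Hypothetical helper: `s3_client` is expected to be boto3.client("s3").
    """
    try:
        s3_client.head_bucket(Bucket=bucket)
        return False  # bucket already exists and is accessible
    except s3_client.exceptions.ClientError as err:
        code = err.response.get("Error", {}).get("Code")
        if code not in ("404", "NoSuchBucket"):
            # For example 403: the bucket exists but is not accessible to us.
            raise

    if region == "us-east-1":
        # us-east-1 is the default and must not be passed as a constraint.
        s3_client.create_bucket(Bucket=bucket)
    else:
        s3_client.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": region},
        )
    return True
```

Note that creating a bucket needs the s3:CreateBucket permission in addition to permission to put objects, so credentials limited to uploads would only work with a bucket that already exists.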
- Ensure that you have the necessary permissions to access the Hugging Face repository and the S3 bucket.
- Be cautious when transferring large repositories, as it may take a considerable amount of time and consume significant network bandwidth.
- Monitor the script's output for any error messages or warnings during the transfer process.
- Regularly review and clean up the S3 bucket to avoid unnecessary storage costs.
- If you encounter any issues related to AWS credentials or permissions, double-check your AWS configuration and ensure that you have the required permissions to access S3.
- If the script fails to retrieve files from the Hugging Face repository, verify that the repository details (owner, name, branch) are correct and that you have the necessary access rights.
- If you experience network-related issues, check your internet connection and ensure that you can access the Hugging Face API and AWS S3 endpoints.
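For the network check in the last item, a quick probe of outbound port 443 can narrow things down. This is a minimal sketch with a hypothetical function name, can_connect; it only confirms that a TCP connection opens, not that TLS, the Hugging Face API, or S3 authorization will succeed.

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, can_connect("huggingface.co") and can_connect("s3.amazonaws.com") should both return True on an instance whose security group allows outbound HTTPS.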
For more detailed information and advanced usage, refer to the script's source code and the documentation of the boto3 and requests libraries.
Example invocation, using the default branch:
python huggingface_to_s3.py --repo-owner HuggingFaceFW --repo-name finewe --s3-bucket my-finewe-dataset