@thomwolf
Last active May 23, 2025 06:32
Download and load persona-chat json dataset
import json
from pytorch_pretrained_bert import cached_path

url = "https://s3.amazonaws.com/datasets.huggingface.co/personachat/personachat_self_original.json"

# Download (or reuse the cached copy of) the JSON dataset and load it
personachat_file = cached_path(url)
with open(personachat_file, "r", encoding="utf-8") as f:
    dataset = json.load(f)

# Tokenize and encode the dataset using our loaded GPT tokenizer.
# Note: `tokenizer` is not defined in this snippet; it must be created beforehand.
def tokenize(obj):
    # Strings are the leaves: convert them to lists of token ids
    if isinstance(obj, str):
        return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(obj))
    # Dicts keep their keys; values are tokenized recursively
    if isinstance(obj, dict):
        return dict((n, tokenize(o)) for n, o in obj.items())
    # Anything else is treated as an iterable of nested objects
    return list(tokenize(o) for o in obj)

dataset = tokenize(dataset)
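To see what the recursive `tokenize` helper does without downloading anything, here is a minimal sketch that runs it on a tiny nested object. `ToyTokenizer` is a hypothetical stand-in (whitespace split, ids assigned in order of first appearance), not the GPT tokenizer, and the shape of `sample` is only an assumption about what a dataset entry might look like.

```python
class ToyTokenizer:
    """Hypothetical stand-in for the real tokenizer, for illustration only."""

    def __init__(self):
        self.vocab = {}

    def tokenize(self, text):
        # Whitespace split instead of real subword tokenization
        return text.split()

    def convert_tokens_to_ids(self, tokens):
        # Assign the next free id the first time a token is seen
        return [self.vocab.setdefault(tok, len(self.vocab)) for tok in tokens]


tokenizer = ToyTokenizer()


def tokenize(obj):
    # Same recursion as in the gist: strings become id lists,
    # dicts keep their keys, other iterables become lists.
    if isinstance(obj, str):
        return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(obj))
    if isinstance(obj, dict):
        return {name: tokenize(value) for name, value in obj.items()}
    return [tokenize(item) for item in obj]


# Assumed, simplified shape of one entry
sample = {
    "personality": ["i like cats ."],
    "utterances": [{"history": ["hi", "hi there"]}],
}
encoded = tokenize(sample)
print(encoded)
```

The nesting of the input is preserved; only the string leaves are replaced by lists of integer ids.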
@oltip

oltip commented May 16, 2019

Hi, I am trying to download the file from the S3 bucket indicated in the link, but it raises an error:
NoCredentialsError: Unable to locate credentials
This happens in the function s3_etag(url).

It seems that some kind of credentials is needed. Any help would be welcome.

@mandar1010

getting the same error

@Pranav-Goel

same error here too

@thomwolf
Author

Should be fixed now

@sashank06

@thomwolf the error still persists. Unable to download the json dataset due to that issue.

@sashank06

> @thomwolf the error still persists. Unable to download the json dataset due to that issue.

I fixed the error. It was an error on my end. I had to reconfigure the AWS credentials.

@ShivaShanmuganathan

ShivaShanmuganathan commented Jul 31, 2019

> Should be fixed now
>
> @thomwolf the error still persists. Unable to download the json dataset due to that issue.
>
> I fixed the error. It was an error on my end. I had to reconfigure the AWS credentials.

I am still getting the same error. Please help.

@naveentvelu

> @thomwolf the error still persists. Unable to download the json dataset due to that issue.
>
> I fixed the error. It was an error on my end. I had to reconfigure the AWS credentials.

@sashank06 I am still getting the error. Can you please share how you fixed it?
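For readers who hit the same NoCredentialsError and do not want to configure AWS credentials, one possible workaround is to skip `cached_path` and fetch the file over plain HTTPS with the standard library. This is only a sketch: `download_json` and the destination filename are hypothetical names, and it assumes the HTTPS URL from the gist is publicly readable, which the thread does not confirm.

```python
import json
import shutil
import urllib.request
from pathlib import Path


def download_json(url, dest):
    """Download `url` to `dest` (unless already present) and return the parsed JSON.

    Hypothetical helper, not part of pytorch_pretrained_bert.
    """
    dest = Path(dest)
    if not dest.exists():
        # Write to a temporary name first so a failed download
        # does not leave a truncated file at `dest`.
        partial = dest.with_name(dest.name + ".part")
        with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
            shutil.copyfileobj(response, out)
        partial.replace(dest)
    with open(dest, "r", encoding="utf-8") as f:
        return json.load(f)
```

Usage would look like `dataset = download_json(url, "personachat_self_original.json")`, with `url` being the HTTPS address from the gist. No AWS client code is involved, so no credentials are looked up.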

@hppy139

hppy139 commented Mar 27, 2025

Hi, could you explain the following data format?
train_self_original.txt file:
1 your persona: i like to remodel homes.
2 your persona: i like to go hunting.
3 your persona: i like to shoot a bow.
4 your persona: my favorite holiday is halloween.
5 hi , how are you doing ? i am getting ready to do some cheetah chasing to stay in shape . \t you must be very fast . hunting is one of my favorite hobbies . \t my mom was single with 3 boys , so we never left the projects .|i try to wear all black every day . it makes me feel comfortable .|well nursing stresses you out so i wish luck with sister|yeah just want to pick up nba nfl getting old|i really like celine dion . what about you ?|no . i live near farms .|i wish i had a daughter , i am a boy mom . they are beautiful boys though still lucky|yeah when i get bored i play gone with the wind my favorite movie .|hi how are you ? i am eating dinner with my hubby and 2 kids .|were you married to your high school sweetheart ? i was .|that is great to hear ! are you a competitive rider ?|hi , i am doing ok . i am a banker . how about you ?|i am 5 years old|hi there . how are you today ?|i totally understand how stressful that can be .|yeah sometimes you do not know what you are actually watching|mother taught me to cook ! we are looking for an exterminator .|i enjoy romantic movie . what is your favorite season ? mine is summer .|editing photos takes a lot of work .|you must be very fast . hunting is one of my favorite hobbies .
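One way to read the sample above, offered as an interpretation rather than a confirmed specification: each line starts with a running number that restarts at 1 for a new dialogue, lines beginning with "your persona:" list persona sentences, and every other line holds tab-separated fields with the incoming utterance first, the reply second, and a "|"-separated list of candidate replies last (in the sample, the last candidate is identical to the reply). The sketch below parses lines under those assumptions; `parse_personachat` is a hypothetical helper, and it assumes the file contains real tab characters where the sample shows `\t`.

```python
PERSONA_PREFIX = "your persona: "


def parse_personachat(lines):
    """Group numbered lines into dialogues with persona sentences and turns."""
    episodes = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        # Split off the leading line number
        number, _, text = line.partition(" ")
        # Assumption: numbering restarts at 1 for each new dialogue
        if current is None or int(number) == 1:
            current = {"persona": [], "turns": []}
            episodes.append(current)
        if text.startswith(PERSONA_PREFIX):
            current["persona"].append(text[len(PERSONA_PREFIX):].strip())
            continue
        # Dialogue line: utterance, reply, (possibly empty fields), candidates
        fields = [field.strip() for field in text.split("\t")]
        candidates = fields[-1].split("|") if len(fields) > 2 else []
        current["turns"].append({
            "utterance": fields[0],
            "response": fields[1] if len(fields) > 1 else "",
            "candidates": candidates,
        })
    return episodes


lines = [
    "1 your persona: i like to go hunting.",
    "2 your persona: i like to shoot a bow.",
    "3 hi , how are you doing ?\tyou must be very fast .\t\t"
    "i am 5 years old|you must be very fast .",
]
episodes = parse_personachat(lines)
print(episodes[0]["persona"])
print(episodes[0]["turns"][0]["candidates"])
```

Taking the candidates from the last field (rather than a fixed position) keeps the sketch working whether or not an empty field sits between the reply and the candidate list.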
