| Model | AGIEval | GPT4All | TruthfulQA | Bigbench |
|---|---:|---:|---:|---:|
| Llama3-8B-function-calling-dpo-slerp | 39.52 | n/a (eval failed: results file missing) | 56.01 | 42.8 |
| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| agieval_aqua_rat | 0 | acc | 25.98 | ± | 2.76 |
| | | acc_norm | 23.62 | ± | 2.67 |
| agieval_logiqa_en | 0 | acc | 38.25 | ± | 1.91 |
| | | acc_norm | 38.86 | ± | 1.91 |
| agieval_lsat_ar | 0 | acc | 19.13 | ± | 2.60 |
| | | acc_norm | 19.13 | ± | 2.60 |
| agieval_lsat_lr | 0 | acc | 44.51 | ± | 2.20 |
| | | acc_norm | 41.18 | ± | 2.18 |
| agieval_lsat_rc | 0 | acc | 55.39 | ± | 3.04 |
| | | acc_norm | 50.93 | ± | 3.05 |
| agieval_sat_en | 0 | acc | 68.45 | ± | 3.25 |
| | | acc_norm | 65.05 | ± | 3.33 |
| agieval_sat_en_without_passage | 0 | acc | 44.17 | ± | 3.47 |
| | | acc_norm | 44.17 | ± | 3.47 |
| agieval_sat_math | 0 | acc | 36.36 | ± | 3.25 |
| | | acc_norm | 33.18 | ± | 3.18 |
AGIEval average: 39.52%
GPT4All average: n/a (evaluation failed; results file not found)
| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| truthfulqa_mc | 1 | mc1 | 39.17 | ± | 1.71 |
| | | mc2 | 56.01 | ± | 1.56 |
TruthfulQA average: 56.01%
| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 62.11 | ± | 3.53 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 70.73 | ± | 2.37 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 30.23 | ± | 2.86 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 35.93 | ± | 2.54 |
| | | exact_str_match | 0.00 | ± | 0.00 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 30.60 | ± | 2.06 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 23.86 | ± | 1.61 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 45.33 | ± | 2.88 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 37.20 | ± | 2.16 |
| bigbench_navigate | 0 | multiple_choice_grade | 55.40 | ± | 1.57 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 68.60 | ± | 1.04 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 41.52 | ± | 2.33 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 33.77 | ± | 1.50 |
| bigbench_snarks | 0 | multiple_choice_grade | 65.19 | ± | 3.55 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 50.61 | ± | 1.59 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 36.00 | ± | 1.52 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 21.60 | ± | 1.16 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 16.34 | ± | 0.88 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 45.33 | ± | 2.88 |
Bigbench average: 42.8%
Overall average: not available (the GPT4All evaluation failed, so a mean across all four benchmarks cannot be computed)
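
For anyone checking the summary numbers, the short sketch below recomputes them from the per-task tables. The averaging convention is inferred from the numbers themselves rather than stated anywhere in this dump: the AGIEval score appears to be the mean of the `acc_norm` values, TruthfulQA reports the `mc2` value directly, and the Bigbench score appears to be the mean of the `multiple_choice_grade` values.

```python
# Recompute the benchmark averages from the per-task tables above.
# Assumed convention (inferred from the numbers, not stated in the dump):
#   AGIEval    = mean of the acc_norm values
#   TruthfulQA = the mc2 score, reported directly
#   Bigbench   = mean of the multiple_choice_grade values

agieval_acc_norm = [23.62, 38.86, 19.13, 41.18, 50.93, 65.05, 44.17, 33.18]
bigbench_grades = [
    62.11, 70.73, 30.23, 35.93, 30.60, 23.86, 45.33, 37.20, 55.40,
    68.60, 41.52, 33.77, 65.19, 50.61, 36.00, 21.60, 16.34, 45.33,
]

agieval_avg = sum(agieval_acc_norm) / len(agieval_acc_norm)
bigbench_avg = sum(bigbench_grades) / len(bigbench_grades)

print(f"AGIEval:    {agieval_avg:.3f}")   # 39.515, reported as 39.52
print(f"TruthfulQA: 56.01")               # the mc2 score itself
print(f"Bigbench:   {bigbench_avg:.2f}")  # 42.80, reported as 42.8

# An overall score would be the mean of the four benchmark averages, but the
# GPT4All run produced no results file, so it cannot be computed here.
```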
Elapsed time: 02:28:08
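
The Task/Version/Metric/Stderr layout above matches the table output of EleutherAI's lm-evaluation-harness (legacy branch). Below is a minimal sketch of re-running part of the suite under two stated assumptions: the legacy `lm_eval.evaluator` API is available, and a harness fork that registers the `agieval_*` and `bigbench_*` task names is installed (the exact fork and commit behind this run are not recorded here). `<namespace>` is a placeholder, since the model's full Hugging Face id is not shown above.

```python
# Reproduction sketch, assuming the legacy lm-evaluation-harness API and a
# community fork that registers the agieval_* / bigbench_* task names
# (these tasks are not in the upstream registry).
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    # Placeholder id: replace <namespace> with the model's real HF namespace.
    model_args="pretrained=<namespace>/Llama3-8B-function-calling-dpo-slerp",
    tasks=["agieval_aqua_rat", "agieval_logiqa_en"],  # subset, for brevity
    num_fewshot=0,   # assumption; the shot count is not recorded in the dump
    batch_size=8,
    device="cuda:0",
)

# make_table renders the same Task/Version/Metric/Value/Stderr layout
# used in the tables above.
print(evaluator.make_table(results))
```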