@jrosell
Created July 2, 2025 16:36
Did you know that you can evaluate AI models and compare their performance? Here is an example using the {vitals} package from @posit_pbc by @simonpcouch. Evals, evals, evals
rlang::check_installed(c("vitals", "ellmer", "dplyr", "ggplot2"))
library(vitals)
library(ellmer)
library(dplyr)
library(ggplot2)
# Toy evaluation set: each input is a prompt, each target the expected answer.
eval_df <- tibble(
  input = c("What's 2+2?", "What's 2+3?", "What's 2+4?"),
  target = c("4", "5", "6")
)
# Define the task once; the chat model is supplied later, at eval time.
tsk <- Task$new(
  dataset = eval_df,
  solver = generate(),
  scorer = model_graded_qa()
)
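# Evaluate a separate clone of the task per model.
# chat_ollama() needs a running local Ollama server with both models pulled.
tsk_llama <- tsk$clone()$eval(solver_chat = chat_ollama(model = "llama3.2:3b"))
tsk_qwen <- tsk$clone()$eval(solver_chat = chat_ollama(model = "qwen3:4b"))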
# Combine the per-model results into one data frame, labelled by model.
tsk_eval <-
  vitals_bind(
    llama = tsk_llama,
    qwen = tsk_qwen
  )
# Recode the grades into readable labels and plot score counts per model.
tsk_eval |>
  rename(model = task) |>
  mutate(
    score = factor(
      case_when(
        score == "I" ~ "Incorrect",
        score == "P" ~ "Partially correct",
        score == "C" ~ "Correct"
      ),
      levels = c("Incorrect", "Partially correct", "Correct"),
      ordered = TRUE
    )
  ) |>
  ggplot(aes(y = model, fill = score)) +
  geom_bar() +
  # Reverse the legend order so "Correct" is listed first.
  scale_fill_brewer(breaks = rev, palette = "RdYlGn")
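
Besides the plot, the same comparison can be summarised numerically. The following is only a sketch: `mock_eval` is a hypothetical stand-in with invented scores, shaped like the `task` and `score` columns the script above relies on, so it runs without Ollama. With real results, `tsk_eval` would presumably take its place.

```r
library(dplyr)

# Hypothetical stand-in for the combined results: one row per model and
# sample, scored "I", "P", or "C". The values are invented for
# illustration and are NOT real evaluation results.
mock_eval <- tibble(
  task = rep(c("llama", "qwen"), each = 3),
  id = rep(1:3, times = 2),
  score = factor(
    c("C", "I", "C", "C", "C", "P"),
    levels = c("I", "P", "C"),
    ordered = TRUE
  )
)

# Count and share of fully correct answers per model.
accuracy <- mock_eval |>
  rename(model = task) |>
  group_by(model) |>
  summarise(
    n = n(),
    n_correct = sum(score == "C"),
    prop_correct = mean(score == "C")
  )

print(accuracy)
```

Counting only "C" as correct is one possible choice; partially correct answers could instead be given partial weight, depending on what the comparison should reward.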