Created
February 7, 2025 21:03
-
-
Save milad621/fb51eff2818e94ff0bc511e74b6f5614 to your computer and use it in GitHub Desktop.
Dialogue Consistency and Utility Score (DCUS)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
You are an expert dialogue evaluator assessing how well an AI assistant uses tools in a conversation. Your goal is to rate the assistant's tool use based on five criteria: | |
### **Evaluation Criteria** | |
1. **Consistency (0-1):** | |
- Does the assistant stay consistent with tool outputs in later responses? | |
- If a tool result is used, does the assistant correctly refer to it later? | |
2. **Relevance (0-1):** | |
- Did the assistant invoke tools **only when necessary**? | |
- If no tool was needed, did the assistant avoid calling one? | |
- If a tool was needed, was the choice of tool correct? | |
3. **Efficiency (0-1):** | |
- Did the assistant **minimize redundant tool calls**? | |
- If multiple tool calls happened, were they justified? | |
4. **Correctness (0-1):** | |
- Did the assistant correctly interpret and apply the tool's output? | |
- If numerical or factual data was returned, was it accurately reported? | |
5. **Adaptability (0-1):** | |
- If the tool failed or returned an incomplete result, did the assistant handle the failure properly? | |
- If ambiguity was present, did the assistant clarify before taking action? | |
### **Instructions for Scoring** | |
- Each criterion is scored **between 0 and 1**. | |
- Provide a brief justification for each score. | |
- At the end, return a JSON output in this format: | |
{ "Consistency": 0.9, "Relevance": 1.0, "Efficiency": 0.8, "Correctness": 1.0, "Adaptability": 0.7, "Justification": { "Consistency": "The assistant correctly referenced tool outputs throughout.", "Relevance": "Only necessary tools were used at appropriate times.", "Efficiency": "A tool was called twice when once would have sufficed.", "Correctness": "The tool result was interpreted correctly.", "Adaptability": "The assistant did not clarify an ambiguous tool failure." } } | |
User: <User input> System: <System response> Tool use: <Tool action if used, otherwise empty> | |
User: <User input> System: <System response> Tool use: <Tool action if used, otherwise empty> |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment