Model | Date | Action Justification | Expected Action | Linguistic Habits | Persona Consistency | Toxicity Control | PersonaScore |
---|---|---|---|---|---|---|---|
Claude 3.5 Sonnet | 2024-07-10 | 4.52 | 4.37 | 3.98 | 4.81 | 4.88 | 4.51 |
LLaMA-3 (8b) | 2024-07-10 | 4.55 | 4.43 | 3.97 | 4.77 | 4.74 | 4.49 |
LLaMA-2 (70b) | 2024-07-10 | 4.44 | 4.32 | 3.85 | 4.67 | 4.68 | 4.39 |
GPT-3.5 | 2024-07-10 | 4.31 | 4.28 | 3.63 | 4.70 | 4.96 | 4.38 |
LLaMA-2 (13b) | 2024-07-10 | 3.96 | 3.87 | 3.77 | 4.12 | 4.18 | 3.98 |
Claude 3 Haiku | 2024-07-10 | 2.47 | 4.28 | 3.04 | 4.47 | 4.94 | 3.64 |
Click any column header to sort the leaderboard by that column. The default ranking is by PersonaScore.
If you are interested in submitting your model to the PersonaGym Leaderboard, please do the following:
- `scores.json`: The JSON file containing the average score for all tasks in PersonaGym. This file is generated automatically by our code and saved under `scores/{save_name}`, where `save_name` is a flag passed to our run script.
- `README.md`: (Recommended) Include any information you would like to share about your model in this README.

The leaderboard will be updated every Monday.
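
Before submitting, it may help to sanity-check the generated `scores.json`. The sketch below rests on two assumptions that this page does not confirm: that the file is a flat JSON object keyed by the task names used in the leaderboard columns, and that PersonaScore is the unweighted mean of the five task scores (which matches, for example, the Claude 3.5 Sonnet row). The helper name `summarize_scores` is hypothetical and is not part of the PersonaGym code.

```python
import json
from pathlib import Path
from statistics import mean

# Task names copied from the leaderboard columns. The keys actually written
# by the PersonaGym run script may differ (assumption).
TASKS = [
    "Action Justification",
    "Expected Action",
    "Linguistic Habits",
    "Persona Consistency",
    "Toxicity Control",
]


def summarize_scores(path):
    """Load a scores file and return (per-task scores, unweighted mean).

    Assumes a flat JSON object mapping each task name to its average score.
    """
    scores = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [task for task in TASKS if task not in scores]
    if missing:
        raise KeyError(f"missing task scores: {missing}")
    per_task = {task: float(scores[task]) for task in TASKS}
    # Rounded to two decimals to match the precision shown in the table.
    return per_task, round(mean(per_task.values()), 2)
```

Calling `summarize_scores("scores/{save_name}/scores.json")` with your own `save_name` substituted would then give a single number to compare against the PersonaScore column. If the real file uses different keys or nesting, adjust `TASKS` and the lookup accordingly.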