PersonaGym: Evaluating Persona Agents and LLMs

1Carnegie Mellon University 2University of Illinois Chicago 3University of Massachusetts Amherst 4Independent Researcher 5Georgia Tech 6Princeton University

PersonaGym is the first evaluation framework for persona agents in LLMs. It assesses performance along different dimensions of agent abilities by dynamically seeding agents in relevant environments.


Persona Agents


The emergence of persona agents in large language models (LLMs) has introduced a novel approach in which the LLM assumes a specified persona and generates responses that align with the experiences and traits of that persona. These persona agents have the potential to deliver highly personalized and contextually relevant interactions, enhancing user engagement and satisfaction by simulating human-like communication and decision-making across a multitude of domains.

What is PersonaGym?

PersonaGym is the first dynamic evaluation framework for persona agents. As part of PersonaGym, we evaluate persona agents on the tasks of Action Justification, Expected Action, Linguistic Habits, Persona Consistency, and Toxicity Control. By dynamic, we mean that for every given persona agent, PersonaGym selects relevant environments from a list of 150 diverse environments and generates task-specific questions tailored to the given persona and the selected environments. The agent's responses to these questions are then automatically evaluated using an ensemble of strong LLM evaluators. Additionally, we propose PersonaScore, the first automatic human-aligned metric for quantifying the overall capability of persona agents. Our framework provides comprehensive scores across all tasks along with PersonaScore to enable multidimensional persona agent evaluation and advancement.
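The sketch below illustrates how such a dynamic pipeline could be wired together: select environments for a persona, generate task-specific questions, query the agent, and score answers with an ensemble of LLM judges. The helper names (query_llm, select_environments, etc.), prompt wording, and 1-5 rubric are illustrative assumptions for exposition, not the released PersonaGym API.

```python
# Illustrative sketch of a dynamic persona-agent evaluation loop.
# All helper names and prompts are assumptions, not the released PersonaGym code.
from statistics import mean

TASKS = [
    "Action Justification", "Expected Action", "Linguistic Habits",
    "Persona Consistency", "Toxicity Control",
]

def query_llm(model: str, prompt: str) -> str:
    """Placeholder: route the prompt to your LLM provider of choice."""
    raise NotImplementedError

def select_environments(persona: str, environments: list[str], k: int = 10) -> list[str]:
    """Ask an LLM to pick the k environments most relevant to this persona."""
    prompt = (f"Persona: {persona}\nFrom the list below, return the {k} most "
              "relevant environments, one per line:\n" + "\n".join(environments))
    return query_llm("selector-model", prompt).splitlines()[:k]

def generate_questions(persona: str, task: str, env: str, n: int = 2) -> list[str]:
    """Generate task-specific questions grounded in a selected environment."""
    prompt = f"Write {n} '{task}' questions for the persona '{persona}' set in '{env}'."
    return query_llm("question-model", prompt).splitlines()[:n]

def judge(persona: str, task: str, question: str, answer: str,
          evaluators: list[str]) -> float:
    """Score one answer with an ensemble of LLM judges and average the scores."""
    prompt = (f"Persona: {persona}\nTask: {task}\nQuestion: {question}\n"
              f"Answer: {answer}\nGive a 1-5 score for persona fidelity on this task:")
    return mean(float(query_llm(m, prompt)) for m in evaluators)

def evaluate_agent(persona: str, agent_model: str, environments: list[str],
                   evaluators: list[str]) -> dict[str, float]:
    """Return one average score per task for a single persona agent."""
    scores: dict[str, list[float]] = {task: [] for task in TASKS}
    for env in select_environments(persona, environments):
        for task in TASKS:
            for q in generate_questions(persona, task, env):
                answer = query_llm(agent_model, f"You are {persona}. {q}")
                scores[task].append(judge(persona, task, q, answer, evaluators))
    return {task: mean(vals) for task, vals in scores.items()}
```

The ensemble averaging in judge mirrors the "ensemble of strong LLM evaluators" described above; swap in whichever judge models you prefer via the evaluators argument.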

Key Highlights

  • 📊 Multidimensional Evaluation for Persona Agents: PersonaGym lets users measure the capabilities of arbitrary persona agents on 5 evaluation tasks, as well as their overall capability via PersonaScore (see the sketch after this list). These 5 tasks encompass the different dimensions along which persona agents can interact.
  • 🧠 High Levels of Alignment with Human Judgment: Our experiments show that PersonaGym is highly correlated with human judgment.
  • 📚 Comprehensive Benchmark: We introduce a diverse benchmark of 200 personas to facilitate research into LLMs' ability to act as persona agents.
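As a rough illustration of how the per-task scores could roll up into a single PersonaScore, the snippet below averages the five task scores. Treating PersonaScore as an unweighted mean of task scores, and the numbers shown, are simplifying assumptions for illustration; see the paper for the exact aggregation.

```python
# Hedged sketch: aggregate per-task scores into one PersonaScore.
# Unweighted averaging and the example values are assumptions for illustration.
from statistics import mean

task_scores = {
    "Action Justification": 4.2,   # example values on a 1-5 judge rubric
    "Expected Action": 4.5,
    "Linguistic Habits": 3.9,
    "Persona Consistency": 4.4,
    "Toxicity Control": 4.8,
}

persona_score = mean(task_scores.values())
print(f"PersonaScore: {persona_score:.2f}")
```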

Results Summary

We introduce a benchmark of 200 diverse personas and evaluate 3 open-source and 3 closed-source LLMs, including the SOTA Claude 3.5 Sonnet model, on these personas using PersonaGym.

Results Summary Chart
Benchmarked results of 6 LLMs on 200 persona descriptions with 10 questions per task (200 personas × 5 tasks × 10 questions = 10,000 questions in total). As part of PersonaGym, we propose 5 evaluation tasks, all grounded in decision theory, to properly evaluate persona agents along different axes of interaction with environments. Bolded results indicate the best-scoring model for each task. Standard deviations for each task and model are included in parentheses. The final row reports, for each task, the variance of the average scores across all 6 models.
Correlations
Average correlation scores, across 100 randomly sampled personas, between PersonaGym scores for the GPT 3.5, Llama2 (13b), and Llama2 (70b) models and human evaluation scores. Each entry is formatted as Spearman / Kendall-Tau. Our results show that PersonaScore is highly correlated with human judgment on the evaluation tasks, providing evidence for the effectiveness of our proposed framework in evaluating LLM persona agents.
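For reference, rank correlations of this kind can be computed as below. The arrays are made-up examples, and scipy's spearmanr and kendalltau are standard choices rather than necessarily the exact functions used in our evaluation.

```python
# Hedged sketch: Spearman and Kendall-Tau correlation between framework
# scores and human scores. The arrays are made-up example values.
from scipy.stats import spearmanr, kendalltau

framework_scores = [4.1, 3.5, 4.8, 2.9, 4.4]  # PersonaGym scores per persona
human_scores     = [4.0, 3.0, 5.0, 3.0, 4.5]  # human ratings for the same personas

rho, _ = spearmanr(framework_scores, human_scores)
tau, _ = kendalltau(framework_scores, human_scores)
print(f"Spearman: {rho:.2f} / Kendall-Tau: {tau:.2f}")
```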

We make the following observations:

  • There exists significant opportunity for advancement of persona agent capabilities: We observe that SOTA models do not outperform less capable models to the degree they do in other domains. For example, Claude 3.5 Sonnet shows only a 2.97% relative improvement in PersonaScore over GPT 3.5 despite being a much more advanced model.
  • Size and capability of models are not a direct indication of persona agent capabilities: We show that Claude 3 Haiku is highly resistant to taking on personas and acting as a persona agent despite being a SOTA model. This finding highlights the pressing need for algorithmic and architectural innovation toward faithful and performant persona agents.
  • PersonaGym is highly aligned with human judgment: We show that the scores for all tasks, as well as PersonaScore, are highly correlated with human judgment, demonstrating the efficacy of our framework.

BibTeX

@article{samuel2024personagym,
  title={PersonaGym: Evaluating Persona Agents and LLMs},
  author={Samuel, Vinay and Zou, Henry Peng and Zhou, Yue and Chaudhari, Shreyas and Kalyan, Ashwin and Rajpurohit, Tanmay and Deshpande, Ameet and Narasimhan, Karthik and Murahari, Vishvak},
  journal={arXiv preprint arXiv:2407.18416},
  year={2024}
}