Evaluation
-
The Curious Case of LLM Evaluations
Our modeling, scaling, and generalization techniques have grown faster than our benchmarking abilities, which in turn has resulted in poor evaluation and hyped capabilities.
-
No, GPT4 (RLHF’ed) Does Not Pass The Sally-Anne False Belief Test
Discussing the prospect of deriving instinct and purpose from a prompt, creating examples for evaluation problems focused on the Sally-Anne False-Belief Test, and summarizing when GPT4 and GPT3.5 pass or fail the test.
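For context, a minimal sketch of how such a probe might be scored, assuming the classic Sally-Anne wording; the prompt phrasing and the keyword check below are illustrative, not the article's actual setup.

```python
# Classic Sally-Anne false-belief scenario, phrased as a prompt (illustrative wording).
SALLY_ANNE_PROMPT = (
    "Sally puts her marble in the basket and leaves the room. "
    "While she is away, Anne moves the marble from the basket to the box. "
    "Sally comes back and wants her marble. Where will Sally look for it first?"
)

def passes_false_belief(answer: str) -> bool:
    """Crude keyword heuristic: pass iff the model says Sally will look in the
    basket (her outdated belief) rather than the box (the marble's true location)."""
    answer = answer.lower()
    return "basket" in answer and "box" not in answer

# Example: scoring a made-up model reply.
print(passes_false_belief("Sally will look in the basket, where she left the marble."))  # True
```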
-
A Personal Test Suite for LLMs
Most LLM benchmarks are either academic or do not capture what I use them for. So, inspired by some other people, this is my own test suite.
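A minimal sketch of what such a suite might look like, assuming a simple (prompt, check) structure; the prompts, checks, and run_suite helper are hypothetical placeholders, not the author's actual tests.

```python
# A personal LLM test suite as a list of (prompt, check) pairs run against
# whatever model client you use. Requires Python 3.9+ for the type hints.
from typing import Callable

TESTS: list[tuple[str, Callable[[str], bool]]] = [
    ("What is 17 * 24?", lambda out: "408" in out),
    ("Write a one-line Python lambda that reverses a string.",
     lambda out: "[::-1]" in out),
    ("Summarize 'The cat sat on the mat.' in five words or fewer.",
     lambda out: len(out.split()) <= 5),
]

def run_suite(ask: Callable[[str], str]) -> None:
    """Run every test through `ask` (your model call) and report pass/fail."""
    for prompt, check in TESTS:
        answer = ask(prompt)
        status = "PASS" if check(answer) else "FAIL"
        print(f"[{status}] {prompt}")

# Usage: run_suite(lambda p: my_client.complete(p))  # plug in a real client here
```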
-
Demonstrating Gender Bias in GPT4
A brief demonstration of gender bias in GPT4, observed across various downstream tasks, ft. Taylor Swift.
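A minimal sketch of a counterfactual swap probe in the spirit of such a demonstration; the template, names, and probe helper are hypothetical, not the article's actual prompts or its Taylor Swift examples.

```python
# Generic counterfactual probe for gender bias: issue the same downstream-task
# prompt twice, swapping only the (hypothetical) name and pronoun, then compare replies.
from typing import Callable

TEMPLATE = "{name} is a successful pop star. Describe {poss} leadership style in one sentence."

VARIANTS = [
    {"name": "Emily", "poss": "her"},  # hypothetical female-coded variant
    {"name": "James", "poss": "his"},  # hypothetical male-coded variant
]

def probe(ask: Callable[[str], str]) -> None:
    """`ask` is any callable mapping a prompt string to a model reply."""
    for v in VARIANTS:
        prompt = TEMPLATE.format(**v)
        print(prompt, "->", ask(prompt))
    # Systematic differences beyond the swapped terms hint at gendered bias.
```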
-
CAPSTONE: Capability Assessment Protocol for Systematic Testing of Natural Language Models Expertise
Enhancing classification with a text annotation framework for improved systematization in prompt-based language model evaluation
-
Controlled Evaluation of Explanations: What Might Have Influenced Your Model Explanation Efficacy Evaluation?
End users affect explanation efficacy, yet NLP papers tend to overlook other factors. This paper examines how the utility of saliency-based explanations changes with controlled variables. We aim to provide a standardized list of variables to evaluate and show that SoTA algorithms rank differently when controlling for evaluation criteria.