Evaluation
-
The Curious Case of LLM Evaluations
Our modeling, scaling, and generalization techniques have grown faster than our benchmarking abilities, which in turn has resulted in poor evaluation and hyped capabilities.
-
No, GPT4 (RLHF’ed) Does Not Pass The Sally-Anne False Belief Test
Discussing the prospect of deriving instinct and purpose from a prompt, creating examples for evaluation problems focused on the Sally-Anne False-Belief Test, and summarizing when GPT4 and GPT3.5 pass or fail the test.
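For context, a minimal sketch of how such a probe might be scored, assuming the classic Sally-Anne wording; the prompt phrasing and the keyword check below are illustrative, not the article's actual setup.

```python
# Classic Sally-Anne false-belief scenario, phrased as a prompt (illustrative wording).
SALLY_ANNE_PROMPT = (
    "Sally puts her marble in the basket and leaves the room. "
    "While she is away, Anne moves the marble from the basket to the box. "
    "Sally comes back and wants her marble. Where will Sally look for it first?"
)

def passes_false_belief(answer: str) -> bool:
    """Crude keyword heuristic: pass iff the model says Sally will look in the
    basket (her outdated belief) rather than the box (the marble's true location)."""
    answer = answer.lower()
    return "basket" in answer and "box" not in answer

# Example: scoring a made-up model reply.
print(passes_false_belief("Sally will look in the basket, where she left the marble."))  # True
```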
-
A Personal Test Suite for LLMs
Most LLM benchmarks are either academic or do not capture what I use them for. So, inspired by some other people, this is my own test suite.
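A minimal sketch of what such a suite might look like, assuming a simple (prompt, check) structure; the prompts, checks, and run_suite helper are hypothetical placeholders, not the author's actual tests.

```python
# A personal LLM test suite as a list of (prompt, check) pairs run against
# whatever model client you use. Requires Python 3.9+ for the type hints.
from typing import Callable

TESTS: list[tuple[str, Callable[[str], bool]]] = [
    ("What is 17 * 24?", lambda out: "408" in out),
    ("Write a one-line Python lambda that reverses a string.",
     lambda out: "[::-1]" in out),
    ("Summarize 'The cat sat on the mat.' in five words or fewer.",
     lambda out: len(out.split()) <= 5),
]

def run_suite(ask: Callable[[str], str]) -> None:
    """Run every test through `ask` (your model call) and report pass/fail."""
    for prompt, check in TESTS:
        answer = ask(prompt)
        status = "PASS" if check(answer) else "FAIL"
        print(f"[{status}] {prompt}")

# Usage: run_suite(lambda p: my_client.complete(p))  # plug in a real client here
```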
-
Demonstrating Gender Bias in GPT4
A brief demonstration of gender bias in GPT4, observed across various downstream tasks, ft. Taylor Swift.
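A minimal sketch of a counterfactual swap probe in the spirit of such a demonstration; the template, names, and probe helper are hypothetical, not the article's actual prompts or its Taylor Swift examples.

```python
# Generic counterfactual probe for gender bias: issue the same downstream-task
# prompt twice, swapping only the (hypothetical) name and pronoun, then compare replies.
from typing import Callable

TEMPLATE = "{name} is a successful pop star. Describe {poss} leadership style in one sentence."

VARIANTS = [
    {"name": "Emily", "poss": "her"},  # hypothetical female-coded variant
    {"name": "James", "poss": "his"},  # hypothetical male-coded variant
]

def probe(ask: Callable[[str], str]) -> None:
    """`ask` is any callable mapping a prompt string to a model reply."""
    for v in VARIANTS:
        prompt = TEMPLATE.format(**v)
        print(prompt, "->", ask(prompt))
    # Systematic differences beyond the swapped terms hint at gendered bias.
```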
-
CAPSTONE: Capability Assessment Protocol for Systematic Testing of Natural Language Models Expertise
Enhancing classification with a text annotation framework for improved systematization in prompt-based language model evaluation
-
Controlled Evaluation of Explanations: What Might Have Influenced Your Model Explanation Efficacy Evaluation?
End users affect explanation efficacy, yet NLP papers tend to overlook other factors. This paper examines how the utility of saliency-based explanations changes with controlled variables. We aim to provide a standardized list of variables to evaluate and show that SoTA algorithms rank differently when controlling for evaluation criteria.