Research
-
The Curious Case of LLM Evaluations
Our modeling, scaling, and generalization techniques have grown faster than our benchmarking abilities, which in turn has resulted in poor evaluations and hyped capabilities.
-
No, GPT4 (RLHF’ed) Does Not Pass The Sally-Anne False Belief Test
Discussing the prospect of deriving instinct and purpose from a prompt, creating examples of evaluation problems focusing on the Sally-Anne False-Belief Test, and providing a summary of when GPT4 and GPT3.5 pass or fail the test.
-
Collated Tips on Reviewing
This is a collection of posts by people, for people, about reviewing at ML conferences, interspersed with some of my own comments.
-
A Personal Test Suite for LLMs
Most LLM benchmarks are either academic or fail to capture what I use LLMs for. So, inspired by some other people, this is my own test suite.
-
Random Research Ideas On Social Media That I Liked
Sometimes I come across random research ideas in the Twitter and wider social media universe that really resonate with me in the moment. They often get lost in doomscrolling, so I am compiling them into a running log.
-
Demonstrating Gender Bias in GPT4
A brief demonstration of gender bias in GPT4, as observed from various downstream task perspectives, ft. Taylor Swift.
-
CAPSTONE: Capability Assessment Protocol for Systematic Testing of Natural Language Models Expertise
Enhancing classification with a text annotation framework for improved systematization in prompt-based language model evaluation.
-
Controlled Evaluation of Explanations: What Might Have Influenced Your Model Explanation Efficacy Evaluation?
End users affect explanation efficacy, yet NLP papers overlook other factors. This paper examines how the utility of saliency-based explanations changes with controlled variables. We aim to provide a standardized list of variables to evaluate and show how SoTA algorithms rank differently when controlling for evaluation criteria.