Research Notes

The Curious Case of LLM Evaluations

Text

Evaluation

Metric Design

Opinion

Foundation Models

Our modeling, scaling, and generalization techniques have grown faster than our benchmarking abilities, which in turn has resulted in poor evaluation and hyped capabilities. Every ability looks amazing if we do not have the tools to figure out what that ability is, or how good the model is at it. We might always believe the model will win every race if all we do is hold the race on paved roads, with yellow trees at every right turn and green trees at every left turn.

06-25-2023

No, GPT4 (RLHF’ed) Does Not Pass The Sally-Anne False Belief Test

Foundation Models

Evaluation

Prompting

Theory of Mind

Discussing the prospect of deriving the instinct and purpose behind a prompt and of creating examples for evaluation problems, focusing on the Sally-Anne False-Belief Test, and providing a summary of when GPT4 and GPT3.5 pass or fail the test.

04-10-2023

CAPSTONE: Capability Assessment Protocol for Systematic Testing of Natural Language Models Expertise

Text

Evaluation

Metric Design

Schema

Interpretation

Data Annotation

Foundation Models

Prompt-based language models have limitations in classification and often require users to test multiple prompts with varying temperatures to identify the best fit. A text annotation framework addresses this by introducing explicit prompt definition and validation, and can improve performance in labeling or retrieval tasks at scale.

03-01-2023

Designing Interfaces for Delivering and Obtaining Generation Explanation Annotations

Text

Data Annotation

Design

Designing a user interface where human annotators can provide explanations for text data. Such annotations can help improve the transparency and interpretability of machine learning models, as well as their performance.

02-01-2023

Controlled Evaluation of Explanations: What Might Have Influenced Your Model Explanation Efficacy Evaluation?

Text

Evaluation

Metric Design

Schema

Interpretation

Data Annotation

End users affect explanation efficacy, but NLP papers overlook other factors. This paper examines how the utility of saliency-based explanations changes with controlled variables. We aim to provide a standardized list of variables to evaluate and to show how SoTA algorithms rank differently when controlling for evaluation criteria.

03-01-2022
