Research Notes
The Curious Case of LLM Evaluations
Our modeling, scaling, and generalization techniques have grown faster than our benchmarking abilities, which in turn has resulted in poor evaluation and hyped capabilities. Every ability looks amazing and great if we do not have the tools to figure out what that ability is, or how good the model is at it. We might always believe the model will win every race if all we do is hold the race on paved roads, with yellow trees on every right turn and green trees on every left turn.
No, GPT4 (RLHF’ed) Does Not Pass The Sally-Anne False Belief Test
Discusses how to derive the instinct and purpose behind a prompt and how to create examples for evaluation problems, focusing on the Sally-Anne False-Belief Test, and summarizes when GPT4 and GPT3.5 pass or fail the test.
CAPSTONE: Capability Assessment Protocol for Systematic Testing of Natural Language Models Expertise
Prompt-based language models have limitations in classification and often require users to test multiple prompts with varying temperatures to identify the best fit. A text annotation framework addresses this by introducing explicit prompt definition and validation, and can improve performance in labeling or retrieval tasks at scale.
Designing Interfaces for Delivering and Obtaining Generation Explanation Annotations
Describes the design of a user interface in which human annotators can provide explanations for text data. Such explanations can help improve the transparency and interpretability of machine learning models, as well as their performance.
Controlled Evaluation of Explanations: What Might Have Influenced Your Model Explanation Efficacy Evaluation?
End users affect explanation efficacy, but NLP papers overlook other factors. This paper examines how the utility of saliency-based explanations changes with controlled variables. We aim to provide a standardized list of variables to evaluate and show how SoTA algorithms rank differently when controlling for evaluation criteria.