LLMs
-
The Curious Case of LLM Evaluations
Our modeling, scaling, and generalization techniques have grown faster than our benchmarking abilities, which in turn has resulted in poor evaluations and hyped capabilities.
-
No, GPT4 (RLHF’ed) Does Not Pass The Sally-Anne False Belief Test
Discussing the prospect of deriving intent and purpose from a prompt, creating example evaluation problems focused on the Sally-Anne False-Belief Test, and summarizing when GPT4 and GPT3.5 pass or fail the test.
-
I am Stuck in a Loop of Datasets ↔ Techniques
I keep jumping between "I do not trust the evaluation, the data is poor" and "the dataset I created only has 100 samples."
-
Add packages to ChatGPT code interpreter environment
-
A Personal Test Suite for LLMs
Most LLM benchmarks are either academic or do not capture what I actually use LLMs for. So, inspired by a few other people, this is my own test suite.
-
Random Research Ideas On Social Media That I Liked
Sometimes I come across random research ideas in the Twitter and broader social media universe that really resonate with me in the moment. They often get lost in doomscrolling, so I am compiling them into a running log.
-
Biases in ML
-
LLM Powered Literature Reviews
-
Demonstrating Gender Bias in GPT4
A brief demonstration of gender bias in GPT4, as observed across various downstream tasks, ft. Taylor Swift.
-
CAPSTONE: Capability Assessment Protocol for Systematic Testing of Natural Language Models' Expertise
Enhancing classification with a text annotation framework for improved systematization of prompt-based language model evaluation.