skip to content
Site header image Mimansa Jaiswal

Random Research Ideas On Social Media That I Liked

Sometimes I come across random research ideas across the twitter and social media universe that really resonate with me at that moment. Often they get lost in doom scrolling, so I am considering compiling those into a running log.

Last Updated:

Sometimes I come across random research ideas across the twitter and social media universe that really resonate with me at that moment. Often they get lost in doom scrolling, so I am considering compiling those into a running log.

2024

April 2024

I'm skeptical that Chatbot Arena is really as informative as people make it out to be, but I'd be glad to learn that I am wrong: 1. Different chatbots have really distinct talking styles. Isn't it easy to tell whether something comes from GPT-4 or Grok? Then it's not really…

lmarena.ai (formerly lmsys.org)
lmarena.ai (formerly lmsys.org)
@lmarena_ai

Exciting news - the latest Arena result are out! @cohere's Command R+ has climbed to the 6th spot, matching GPT-4-0314 level by 13K+ human votes! It's undoubtedly the **best** open model on the leaderboard now🔥 Big congrats to @cohere's incredible work & valuable contribution…

March 2024

The only two numbers worth looking at here are GPQA and HumanEval On GPQA the result is very impressive. On HumanEval, they compare to GPT-4's perf at launch. GPT-4 is now much better- see the EvalPlus leaderboard, where it gets 88.4 I bet OpenAI will respond with GPT-4.5 soon

Anthropic
Anthropic
@AnthropicAI

Today, we're announcing Claude 3, our next generation of AI models. The three state-of-the-art models—Claude 3 Opus, Claude 3 Sonnet, and Claude 3 Haiku—set new industry benchmarks across reasoning, math, coding, multilingual understanding, and vision.

Feb 2024

is anyone doing research on out-of-distribution/unnatural prompts and how aligned models respond to them? Something clearly went wrong in Gemini training, but no one should be ashamed! would be super cool if they wrote a post-mortem that researches how this behavior arises 🙏

Frantastic — e/acc
Frantastic — e/acc
@Frantastic_7

every single person who worked on this should take a long hard look in the mirror. absolutely appalling.

Jan 2024

Exchangeable random variables for training process
Align two text embedding spaces in an unsupervised way.
LM needs to generate a sentence describing a common scenario covering all given concepts. Can you get it closer to human performance?
Paste and match writing style as an option for “AI agents”
How do we modify text visually such that it accounts for ripples?
Do a serious study of prompt extraction attacks by writing prompts for publicly released models and then checking how reliably they can be stolen in a blackbox setting.
Related paper: Prompts Should not be Seen as Secrets: Systematically Measuring Prompt Extraction Attack Success
How to efficiently adapt embeddings without requiring continued pretraining?
Effect of term frequency on fact learning
Model difference between unlearning a datapoint vs excluding a datapoint while learning
Can you tell the difference between SFT DPO and PPO models that had the same base model and are identical up to the algorithm?
Use multi-translated book for instruction style translation dataset
how much pretraining is required to make a LLM fall into a particular loss basin? In particular, until its path-independent?

2023

Dec 2023

Sparse autoencoder-based feature decomposition for text embeddings
Foundational questions in LLMs that we still do not have answers to
Questions about consistency in LLMs
Current LLMs reflect a Western (pop) culture. On asking ChatGPT "Who are Jack and Rose," it says that they're movie characters from Titanic. On asking ChatGPT "Who are Rahul and Anjali," it says that they're generic South Asian names. It does not reference that they're movie characters from the popular Indian film Kuch Kuch Hota Hai.
Taxonomies for everything
And ideally, taxonomy of errors

November 2023

PhD students: can you please solve the problem of long text evaluation? It is one of the biggest bottlenecks in the quality iteration of LLMs. Which response is more creative? safer? more factual?

Naomi Saphra 🧈🪰=🦋
Naomi Saphra 🧈🪰=🦋
@nsaphra

It's not the first time! A dream team of @enfleisig (human eval expert), Adam Lopez (remembers the Stat MT era), @kchonyc (helped end it), and me (pun in title) are here to teach you the history of scale crises and what lessons we can take from them. 🧵arxiv.org/abs/2311.05020

Generation and ambiguity is always a confounder