
Blurbs

  • You still need to verify your LLM calculator results

    Chat, do I still need to check my pocket LLM’s calculations? Unfortunately, yes. Yes, you do.
    Foreground

For some context, I love Taylor Swift. I grew up with her songs; she was the first artist whose songs in English I listened to, back in 2007 on my first school trip out of state. Sure, I had periods where I was embarrassed that she was my favorite artist, and then my roommate during my PhD brought me back to being a “Swiftie”, as they say. But I never had disposable income when she toured (and unfortunately, last year, visa issues meant I couldn’t attend the last leg of her Eras Tour). So I have never been to her concerts, ever. And then she announced her new album last month.

Today, I saw this post on Twitter showing the tracks and their lengths. And you might think: no big deal, what’s the thing here? The thing is, the reason I wanted to attend the Eras Tour so much was that it was a 3.5-hour performance. So, of course, I wanted to know how long this album would be, if she ever ended up touring it.

    🔗 Taylor Swift Edits · @TSwiftEdits_13 · 01:20 PM · Sep 26, 2025 UTC

    ‼️| ‘The Life of a Showgirl’ tracks length!


And so, I did what any person does on seeing that tweet: I used Gemini on-screen, copied the text (it missed the name of the first song, but got everything else), and pasted it into ChatGPT to ask for the total length of the album.

And here is how that went (scroll through the code blocks to read the actual content).

    I pasted in the Gemini extracted text
    Add these numbers and give me album length 3:46
    
    Elizabeth Taylor
    
    3:28
    
    Opalite
    
    3:55
    
    Father Figure
    
    3:32
    
    Eldest Daughter
    
    4:06
    
    Ruin The Friendship
    
    3:40
    
    Actually Romantic
    
    2:43
    
    Wish List
    
    3:27
    
    Wood
    
    2:30
    
    3:31
    
    CANCELLED!
    
    Honey
    
    The Life of a Showgirl (w/ Sabrina Carpenter)
    
    3:01
    
    4:01
    GPT5 very confidently tells me that the album length is 43 minutes and 40 seconds
    Got it — let’s carefully add these track times step by step to get the full album length.
    
    Here are the times you gave me:
    	•	Elizabeth Taylor → 3:46
    	•	Opalite → 3:28
    	•	Father Figure → 3:55
    	•	Eldest Daughter → 3:32
    	•	Ruin The Friendship → 4:06
    	•	Actually Romantic → 3:40
    	•	Wish List → 2:43
    	•	Wood → 3:27
    	•	CANCELLED! → 2:30
    	•	Honey → 3:31
    	•	The Life of a Showgirl (w/ Sabrina Carpenter) → 3:01
    	•	(final track) → 4:01
    
    
    
    Step 1: Separate minutes and seconds
    
    Minutes = 3 + 3 + 3 + 3 + 4 + 3 + 2 + 3 + 2 + 3 + 3 + 4 = 38 minutes
    Seconds = 46 + 28 + 55 + 32 + 6 + 40 + 43 + 27 + 30 + 31 + 1 + 1 = 340 seconds
    
    
    
    Step 2: Convert seconds
    
    340 seconds ÷ 60 = 5 minutes, remainder 40 seconds
    So, add 5:40
    
    
    
    Step 3: Add minutes
    
    38 minutes + 5 minutes = 43 minutes
    
    
    
    ✅ Final Album Length = 43:40
    
    
    
    Do you want me to also give you the average song length across these 12 tracks?
    I acknowledge the response
    cool
    GPT5 very confidently uses the previous calculation, doesn't double-check, and proactively provides me with the average track length
    Great! Just for fun, let’s quickly calculate the average track length.
    
    Total album length = 43 minutes 40 seconds → convert to seconds:
    
    43 × 60 + 40 = 2580 + 40 = 2620 seconds
    
    Number of tracks = 12
    
    Average = 2620 ÷ 12 ≈ 218.33 seconds
    
    Convert back to minutes:
    218 ÷ 60 = 3 minutes, remainder 38.33 seconds → roughly 3:38 per track.
    
    So the average track length is about 3 minutes 38 seconds.
    
    If you want, I can also make a nice table showing all tracks with cumulative time so it’s easier to visualize the album flow. Do you want me to do that?

    I don't know if you've spotted the problem yet, because I hadn't until I saw this tweet.

    🔗 Taylor Swift Updates+ · @chartstswifty · 01:06 PM · Sep 26, 2025 UTC

    🚨 “The Life of a Showgirl” is officially 41 minutes long.

    “The Fate of Ophelia” — 3:46
    “Elizabeth Taylor” — 3:28
    “Opalite” — 3:55
    “Father Figure” — 3:32
    “Eldest Daughter” — 4:06
    “Ruin The Friendship” — 3:40
    “Actually Romantic” — 2:43
    “Wi$h Li$t” — 3:27
    “Wood” — 2:30
    “CANCELLED!” — 3:31
    “Honey” — 3:01
    “The Life of a Showgirl” — 4:01


    Huh? How?

Because I thought it was 43 minutes 40 seconds long, and even if they had skipped out on the seconds (which, spoiler alert, they did), it should still say 43 minutes.

At this point, I’ve used GPT5 (yes, the free version) for simple math enough that I trust its results. So my first thought wasn’t even that the calculation was wrong; I initially thought that Gemini’s text extraction had messed something up (beyond omitting the name of the first song). Apparently not: all the track times were correctly extracted, and the messed-up formatting didn’t matter either, since GPT5 did arrange them correctly (even if the titles were incorrectly paired because of Gemini).

    So, I added them on my own — and yeah, it is 41 minutes 40 seconds.

    So, what gives?

Well, the separate minutes and seconds idea was great, but this part: Minutes = 3 + 3 + 3 + 3 + 4 + 3 + 2 + 3 + 2 + 3 + 3 + 4 = 38 minutes

Except 3 + 3 + 3 + 3 + 4 + 3 + 2 + 3 + 2 + 3 + 3 + 4 is actually 36, not 38. That’s where the extra 2 minutes came from. And I wouldn’t have even cared to check or verify had I not come across this tweet.
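For what it’s worth, the check that would have caught this takes seconds. A minimal Python sketch (not what I did at the time; the track times are the ones from the tweet above):

    # Sum mm:ss track lengths and report the total as mm:ss
    track_times = ["3:46", "3:28", "3:55", "3:32", "4:06", "3:40",
                   "2:43", "3:27", "2:30", "3:31", "3:01", "4:01"]

    total_seconds = sum(
        int(m) * 60 + int(s)
        for m, s in (t.split(":") for t in track_times)
    )

    print(f"{total_seconds // 60}:{total_seconds % 60:02d}")  # prints 41:40

Three lines of actual logic, and the disagreement with GPT5’s 43:40 shows up immediately.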

    So chat, what did we learn? That you still need to verify your pocket LLM calculator's results, sometimes even for basic math. Color me surprised.

    Note: I am not an LLM skeptic, and I'm not advocating for using LLMs for calculator-like math. But the fact is, very often, numbers are entangled in text and you want an LLM to extract corresponding numbers and perform operations on them.

I've been doing that pretty confidently this year: pasting in a menu and asking what it would cost with Seattle tax to order these 3 items; pasting a price list with complicated rules for times/weekdays/number of people and asking for the total cost in my situation; or pasting multiple Jupyter cell outputs for benchmarks across 5 models and asking the model to create a table with a column for the performance difference between the worst and best model and the percentage difference.

    This situation just reminded me that I still need to verify the basic math output — I honestly didn't expect that to be the case.



Recraft AI for Presentation Graphics

I am having so much fun playing around with the recraft.ai Line & Fill generator (I get credits if you use this link on desktop and create your first image). Somehow, it fits my vibe perfectly.

All of these were created using the Line & Fill style, and then I used the background removal option. I didn’t need to do that, though, because Line & Fill is a vectorized style.

Prompt: a meme about a girl looking at a calculator or abacus and saying "why do we manually tune inference parameters" with a questioning expression.
Prompt: a meme about a human and robot looking at each other and the robot says "what is difficult for you is not difficult for me”
Prompt: a meme about a robot sitting on the desk in the office looking at something on the laptop and wondering "have i already seen something like this?"
Prompt: a meme about a human telling the robot "no, this is not what i consider to be a good output, figure out what I like”.

Background removal removed “I” in “This” though, so I had to draw it back.

  • Trying Bluesky

    Like everyone else disappointed with the political stronghold over Twitter, I wanted to try Bluesky too. So, here is my account:


    But also, here are some tools that you might find helpful:


  • Configure citation style and bibliography style separately

    On Nov 18, 2024, Yoav asks:

    latex q: i need all bibliography items to be numbered, but i want the text cites to still be author-year. how?

And I wanted to document Jackson’s solution here, just in case anyone else needs it:

    \documentclass{article}
    
    \usepackage[
        citestyle=authoryear,
        bibstyle=numeric,
        natbib=true
    ]{biblatex}
    
    \addbibresource{references.bib}
    
    \title{Mixed Citations}
    \author{Jackson Petty}
    \date{November 2024}
    
    \begin{document}
    
    \maketitle
    
    As argued in \citet{macdonald-2008-syntactic}, inner aspect occupies a functional AspP projection intermediate between the \emph{v}P and VP phrases.
    
    \printbibliography
    
    \end{document}

    with the comment that:

    natbib=true isn't necessary, it's just to provide natbib-style commands like \citet and \citep which people tend to be more familiar with over the biblatex-native commands

and this gives us author-year citations in the text with a numbered bibliography:


  • Show Only Appendix ToC

If you have ever wanted to create a Table of Contents section in the appendix without the contents from the main paper, here is how you can do it. (This uses \startcontents and \printcontents, so it assumes the titletoc package is loaded in the preamble.)

    \appendix
    \clearpage % Start on a new page
    
    % Local command to print the header without affecting ToC or bookmarks
    \newcommand{\localheader}[1]{%
      \vspace*{-\topskip}\vspace*{\dimexpr\topskip-\baselineskip\relax}
      {\large\bfseries #1\par}
      \vspace{\baselineskip}
    }
    
    % Print the local header
    \localheader{Appendix Contents}
    
    \begin{center}
    \begin{minipage}{0.9\textwidth}
        \startcontents[appendix] % Start a new table of contents for the appendix
        \printcontents[appendix]{}{0}{% Print the appendix ToC
            \renewcommand{\contentsname}{}% Remove default "Contents" heading
        }
    \end{minipage}
    \end{center}
Only appendix as table of contents

The good part is that it still maintains proper bookmarks in the exported PDF — and you can still use \tableofcontents while drafting to get the complete Table of Contents.

Complete table of contents

  • Saving Money by Rounding Expenses Up

I honestly am not a person who is really on TikTok or Instagram. I never even installed them; I don't have an account on either. But YouTube is something that I use often, and YouTube Shorts are hard to avoid. One of the most interesting things I learned on YouTube was saving money by rounding things up.

I've always been an extremely meticulous budgeter. I record every single dollar that goes in and out, and I divide it up. Automatic apps don't work for me because I often use Venmo and Splitwise and things like those. I've always been the person who would ignore money coming in from, say, research studies, just so I have this nice net of savings that doesn't remain visible to me. I've always been this way, even with my allowance as a kid.

But one of the newest ways I've found to save money is that I round my transactions up. So if I spend $5, I mark it as $10 in my budgeting database in Notion. Basically, it still stores $5, but what I see is the ceiled amount, and that ceiled amount is $10. Or instead of $45, it is $50; instead of $60, it is $100. It's kind of the same idea as saving change in pennies, just in whole dollars. I really like that, and I just wanted to share it.


  • I am Stuck in a Loop of Datasets ↔ Techniques

I keep jumping between “I do not trust the evaluation, the data is poor” and “the dataset I created only has 100 samples.”

Over the past three years, starting in 2021, I've had multiple ideas about the best ways to evaluate certain processes we currently do with LLMs. Whether it be general evaluation on one single task, multi-run evaluations, theory-of-mind evaluations, or something like RAG where you're retrieving over a set of documents, I've always had those ideas and really wanted to implement them.

I think the problem I keep getting stuck on is that datasets don't actually make sense to me. Sure, there's a lot of conversation about how many of these datasets are useless and companies need private datasets, etc. But you can't publish a paper on companies' private datasets; you publish on public datasets, and so many public datasets right now are very much just generated by large models. That makes no sense to me, because you're using them to evaluate those same large models.

Every single time I “look at the data” (the thing that everyone asks you to do, and the thing that I've been doing throughout my PhD), I find it disappointing. Then I go into this realm of "oh, so I should create it myself," and that is something that I really like to do. I have done it before, but I think the thing that is holding me back is the question of how many samples a dataset needs to have for it to be viable to publish, and for the research done using that dataset to be published and be useful.

I think I have been inflating the number in my head, whereas some of the datasets that I like are barely 500 samples, and those that I have liked recently are 100 to 150 samples. I think I need to change my mindset about creating a dataset and putting it out there. If I'm running experiments on it and putting the subsequent publication out there, it doesn't need to be a dataset that has thousands of samples.

I do have an eye for good data and good data curation. I should be able to create a good dataset that is just 150 samples and is still viable in its usage, and that tells us something useful about the models we run it on.

LiveBench has 150 questions in each category
CiteMe has a total of 130 samples

  • Please Don’t Have 10 Textareas in Review Forms

Another summer, another round of paper reviewing issues. I kinda just wanted to record this Twitter conversation somewhere, so that I have someplace to point people to when they ask me what I want changed in reviewing.

    My hot take is reviews should be a single text box. Asking people to fill in 10 text boxes for a paper increases the chances that no one wants to do it.
    Especially looking at NeurIPS Datasets & Benchmarks, why does it have NINE text areas other than "Summary And Contributions" and "Review".
All fields in Neurips Datasets and Benchmarking review: Title, Summary And Conclusions, Review, Biography, Rating, Opportunities For Improvement, Confidence, Limitations, Conclusions, Clarity, Relation To Prior Work, Reproducibility, Ethics, Flag For Ethics Review.

    ACs have said that it allows inexperienced reviewers to decompose the review.

I guess it is the way I review that makes this format really hard. I feel guilty about writing a single line in the text boxes there, but I review line by line on my iPad (similar to Overleaf comments, just in PDF form) and would very much prefer to be able to extract those comments and just paste them in. For example, I often write “typo: fix this word” or “cite X here, which negates the results you are obtaining; talk about why that might be the case” inline. Redistributing this into separate content areas (especially ones with overlapping purposes) feels really overwhelming.

Yanai here makes a point about highlighting the global, i.e., main, strengths/weaknesses of the paper: eventually, these are the things the AC should focus on, and reviewers should make it easier for the AC to detect them.

And I do not disagree, but with a review load of more than 4 in a month, I end up dreading the second-pass requirement instead of being able to actually provide the review. I have always wondered if there's an opportunity to run reviewing experiments in newer conferences like @COLM_conf, where people can review the paper inline (like @hypothes_is/@askalphaxiv) and tag comments with their kind (suggestion, missing, etc.). Would that improve the reviewing culture? It would still be easy for ACs to filter out the main global points based on tags, but it doesn't require a second pass. And as a cherry on top, it is also hard to fake, unlike the GPT-based summary reviews of the paper.

Something I did not mention in that conversation: I often want to extract these highlights and ask GPT/Claude/the LLM of the month to separate them out into categories to put into those text areas, but the outcome never sounds natural, and there are always slight hallucinations that are difficult to catch. So, if that doesn't work either, maybe we should try to change the review UX itself?



  • Add packages to ChatGPT code interpreter environment

TIL you can add unavailable packages to the ChatGPT code interpreter environment (for example, if you want your output to use seaborn). You can use the pip install custom GPT, which calls remote actions in the background, to download these libraries.

From Simon Willison’s blog post:

    If there's a package that you want to use you can grab the .whl files for it from PyPI, upload them to ChatGPT and it will install them into its own environment for the duration of that chat session.

    If a library has dependencies you have to upload those, too.
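A rough sketch of what the install step can look like inside the session (the wheel filename below is hypothetical, and /mnt/data is where ChatGPT places uploaded files):

    # Install an uploaded wheel inside the sandboxed session.
    # --no-index stops pip from trying to reach the network,
    # which the code interpreter environment doesn't have.
    import subprocess, sys

    subprocess.run(
        [sys.executable, "-m", "pip", "install", "--no-index",
         "/mnt/data/seaborn-0.13.2-py3-none-any.whl"],
        check=True,
    )

    import seaborn as sns  # importable for the rest of the session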


  • Some people have a knack for interesting math problems

I have read this and calculated it up to n=10, but this is still so unintuitive for me.

    Daniel Litt on Twitter asks:

    🔗 Daniel Litt · @littmath · 04:55 PM · Mar 16, 2024 UTC

    Flip a fair coin 100 times—it gives a sequence of heads (H) and tails (T). For each HH in the sequence of flips, Alice gets a point; for each HT, Bob does, so e.g. for the sequence THHHT Alice gets 2 points and Bob gets 1 point. Who is most likely to win?


    And the answer is:

    🔗 Daniel Litt · @littmath · 05:00 PM · Mar 17, 2024 UTC

    The correct answer is “Bob.” Congrats to the 10% who got it right — those few brave dreamers. pic.x.com/menJ81BrKJ


But how? Why? This is so beyond my mathematical intuition.
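The standard intuition: Alice and Bob have the same expected number of points (each adjacent pair of flips is HH or HT with probability 1/4), but HH points cluster (a run like HHHH gives Alice 3 points at once), so her score has higher variance: she wins by bigger margins when she wins, and loses more often. A quick Monte Carlo sketch to see the asymmetry:

    # Monte Carlo: Alice scores a point for each HH pair, Bob for each HT pair
    import random

    def play(n_flips=100):
        flips = [random.random() < 0.5 for _ in range(n_flips)]  # True = heads
        alice = sum(flips[i] and flips[i + 1] for i in range(n_flips - 1))    # HH
        bob = sum(flips[i] and not flips[i + 1] for i in range(n_flips - 1))  # HT
        return alice, bob

    wins = {"Alice": 0, "Bob": 0, "Tie": 0}
    for _ in range(100_000):
        a, b = play()
        wins["Alice" if a > b else "Bob" if b > a else "Tie"] += 1

    print(wins)  # Bob wins the most games, even though the expected scores are equal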