
Blurbs

  • You still need to verify your LLM calculator results

    Chat, do I still need to check my pocket LLM’s calculations? Unfortunately, yes. Yes, you do.
    Foreground

For some context, I love Taylor Swift. I grew up with her songs; she was the first artist whose songs in English I listened to, back in 2007 on my first school trip out of state. Sure, I had periods where I was embarrassed that she was my favorite artist, and then my roommate during my PhD brought me back to being a “Swiftie”, as they say. But I never had disposable income when she toured (and unfortunately, last year, visa issues meant I couldn’t attend the last leg of her Eras Tour). So I have never been to her concerts, ever. And then she announced her new album last month.

Today, I saw this post on Twitter showing the tracks and their lengths. And you might think: no big deal, what’s the thing here? The thing is, the reason I wanted to attend the Eras Tour so much was that it was a 3.5-hour performance. So, of course, I wanted to know how long this album would be, if she ever ended up touring it.

    🔗 Taylor Swift Edits · @TSwiftEdits_13 · 01:20 PM · Sep 26, 2025 UTC

    ‼️| ‘The Life of a Showgirl’ tracks length!


And so, I did what any person does on seeing that tweet: I used Gemini on-screen, copied the text (it missed the name of the first song, but got everything else), and pasted it into ChatGPT to ask for the total length of the album.

And here is how that went (scroll through the code blocks to read the actual content).

    I pasted in the Gemini extracted text
    Add these numbers and give me album length 3:46
    
    Elizabeth Taylor
    
    3:28
    
    Opalite
    
    3:55
    
    Father Figure
    
    3:32
    
    Eldest Daughter
    
    4:06
    
    Ruin The Friendship
    
    3:40
    
    Actually Romantic
    
    2:43
    
    Wish List
    
    3:27
    
    Wood
    
    2:30
    
    3:31
    
    CANCELLED!
    
    Honey
    
    The Life of a Showgirl (w/ Sabrina Carpenter)
    
    3:01
    
    4:01
    GPT5 very confidently tells me that the album length is 43 minutes and 40 seconds
    Got it — let’s carefully add these track times step by step to get the full album length.
    
    Here are the times you gave me:
    	•	Elizabeth Taylor → 3:46
    	•	Opalite → 3:28
    	•	Father Figure → 3:55
    	•	Eldest Daughter → 3:32
    	•	Ruin The Friendship → 4:06
    	•	Actually Romantic → 3:40
    	•	Wish List → 2:43
    	•	Wood → 3:27
    	•	CANCELLED! → 2:30
    	•	Honey → 3:31
    	•	The Life of a Showgirl (w/ Sabrina Carpenter) → 3:01
    	•	(final track) → 4:01
    
    
    
    Step 1: Separate minutes and seconds
    
    Minutes = 3 + 3 + 3 + 3 + 4 + 3 + 2 + 3 + 2 + 3 + 3 + 4 = 38 minutes
    Seconds = 46 + 28 + 55 + 32 + 6 + 40 + 43 + 27 + 30 + 31 + 1 + 1 = 340 seconds
    
    
    
    Step 2: Convert seconds
    
    340 seconds ÷ 60 = 5 minutes, remainder 40 seconds
    So, add 5:40
    
    
    
    Step 3: Add minutes
    
    38 minutes + 5 minutes = 43 minutes
    
    
    
    ✅ Final Album Length = 43:40
    
    
    
    Do you want me to also give you the average song length across these 12 tracks?
    I acknowledge the response
    cool
    GPT5 very confidently uses the previous calculation, doesn't double-check, and proactively provides me with the average track length
    Great! Just for fun, let’s quickly calculate the average track length.
    
    Total album length = 43 minutes 40 seconds → convert to seconds:
    
    43 × 60 + 40 = 2580 + 40 = 2620 seconds
    
    Number of tracks = 12
    
    Average = 2620 ÷ 12 ≈ 218.33 seconds
    
    Convert back to minutes:
    218 ÷ 60 = 3 minutes, remainder 38.33 seconds → roughly 3:38 per track.
    
    So the average track length is about 3 minutes 38 seconds.
    
    If you want, I can also make a nice table showing all tracks with cumulative time so it’s easier to visualize the album flow. Do you want me to do that?

    I don't know if you've spotted the problem yet, because I hadn't until I saw this tweet.

    🔗 Taylor Swift Updates+ · @chartstswifty · 01:06 PM · Sep 26, 2025 UTC

    🚨 “The Life of a Showgirl” is officially 41 minutes long.

    “The Fate of Ophelia” — 3:46
    “Elizabeth Taylor” — 3:28
    “Opalite” — 3:55
    “Father Figure” — 3:32
    “Eldest Daughter” — 4:06
    “Ruin The Friendship” — 3:40
    “Actually Romantic” — 2:43
    “Wi$h Li$t” — 3:27
    “Wood” — 2:30
    “CANCELLED!” — 3:31
    “Honey” — 3:01
    “The Life of a Showgirl” — 4:01


    Huh? How?

Because I thought it was 43 minutes 40 seconds long, and even if they had skipped out on the seconds (which, spoiler alert, they did), it should still say 43 minutes.

At this point, I’ve used GPT5 (yes, the free version) for simple math enough that I trust its results. So my first thought wasn’t even that the calculation was wrong; I initially thought that Gemini’s text extraction had messed something up (beyond omitting the name of the first song). Apparently not: all the track times were correctly extracted, and the messed-up formatting didn’t matter either, since GPT5 did arrange them correctly (even if the titles were incorrectly paired because of Gemini).

    So, I added them on my own — and yeah, it is 41 minutes 40 seconds.

    So, what gives?

Well, the separate minutes and seconds idea was great, but this part: Minutes = 3 + 3 + 3 + 3 + 4 + 3 + 2 + 3 + 2 + 3 + 3 + 4 = 38 minutes

Except 3 + 3 + 3 + 3 + 4 + 3 + 2 + 3 + 2 + 3 + 3 + 4 is actually 36, not 38. That’s where the extra 2 minutes came from. And I wouldn’t have even cared to check or verify had I not come across this tweet.
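For what it’s worth, the check that would have caught this takes seconds. A minimal Python sketch (not what I did at the time; the track times are the ones from the tweet above):

    # Sum mm:ss track lengths and report the total as mm:ss
    track_times = ["3:46", "3:28", "3:55", "3:32", "4:06", "3:40",
                   "2:43", "3:27", "2:30", "3:31", "3:01", "4:01"]

    total_seconds = sum(
        int(m) * 60 + int(s)
        for m, s in (t.split(":") for t in track_times)
    )

    print(f"{total_seconds // 60}:{total_seconds % 60:02d}")  # prints 41:40

Three lines of actual logic, and the disagreement with GPT5’s 43:40 shows up immediately.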

    So chat, what did we learn? That you still need to verify your pocket LLM calculator's results, sometimes even for basic math. Color me surprised.

    Note: I am not an LLM skeptic, and I'm not advocating for using LLMs for calculator-like math. But the fact is, very often, numbers are entangled in text and you want an LLM to extract corresponding numbers and perform operations on them.

I've been doing that pretty confidently this year: pasting in a menu and asking what it would cost with Seattle tax to order these 3 items; pasting a price list with complicated rules for times/weekdays/number of people and asking for the total cost in my situation; or pasting multiple Jupyter cell outputs for benchmarks across 5 models and asking the model to create a table with a column for the performance difference between the worst and best model and the percentage difference.

    This situation just reminded me that I still need to verify the basic math output — I honestly didn't expect that to be the case.



Recraft AI for Presentation Graphics

I am having so much fun playing around with the recraft.ai Line & Fill generator (I get credits if you use this link on desktop and create your first image). Somehow, it fits my vibe perfectly.

All of these were created using the Line & Fill style, and then I used the background removal option. I didn’t need to do that, though, because Line & Fill is a vectorized style.

Prompt: a meme about a girl looking at a calculator or abacus and saying "why do we manually tune inference parameters" with a questioning expression.
Prompt: a meme about a human and robot looking at each other and the robot says "what is difficult for you is not difficult for me”
Prompt: a meme about a robot sitting on the desk in the office looking at something on the laptop and wondering "have i already seen something like this?"
Prompt: a meme about a human telling the robot "no, this is not what i consider to be a good output, figure out what I like”.

Background removal removed “I” in “This” though, so I had to draw it back.

  • Trying Bluesky

    Like everyone else disappointed with the political stronghold over Twitter, I wanted to try Bluesky too. So, here is my account:


    But also, here are some tools that you might find helpful:


  • Configure citation style and bibliography style separately

    On Nov 18, 2024, Yoav asks:

    latex q: i need all bibliography items to be numbered, but i want the text cites to still be author-year. how?

And I wanted to document Jackson’s solution here, just in case anyone else needs it:

    \documentclass{article}
    
    \usepackage[
        citestyle=authoryear,
        bibstyle=numeric,
        natbib=true
    ]{biblatex}
    
    \addbibresource{references.bib}
    
    \title{Mixed Citations}
    \author{Jackson Petty}
    \date{November 2024}
    
    \begin{document}
    
    \maketitle
    
    As argued in \citet{macdonald-2008-syntactic}, inner aspect occupies a functional AspP projection intermediate between the \emph{v}P and VP phrases.
    
    \printbibliography
    
    \end{document}

    with the comment that:

    natbib=true isn't necessary, it's just to provide natbib-style commands like \citet and \citep which people tend to be more familiar with over the biblatex-native commands

and this gives us author-year citations in the text with a numbered bibliography:


  • Show Only Appendix ToC

If you have ever wanted to create a Table of Contents section in the appendix without the contents from the main paper, here is how you can do it. (This uses \startcontents and \printcontents, so it assumes the titletoc package is loaded in the preamble.)

    \appendix
    \clearpage % Start on a new page
    
    % Local command to print the header without affecting ToC or bookmarks
    \newcommand{\localheader}[1]{%
      \vspace*{-\topskip}\vspace*{\dimexpr\topskip-\baselineskip\relax}
      {\large\bfseries #1\par}
      \vspace{\baselineskip}
    }
    
    % Print the local header
    \localheader{Appendix Contents}
    
    \begin{center}
    \begin{minipage}{0.9\textwidth}
        \startcontents[appendix] % Start a new table of contents for the appendix
        \printcontents[appendix]{}{0}{% Print the appendix ToC
            \renewcommand{\contentsname}{}% Remove default "Contents" heading
        }
    \end{minipage}
    \end{center}
Only appendix as table of contents

The good part is that it still maintains proper bookmarks in the exported PDF — and you can still use \tableofcontents while drafting to get the complete Table of Contents.

Complete table of contents

  • Saving Money by Rounding Expenses Up

I honestly am not a person who is really on TikTok or Instagram. I never even installed them; I don't have an account on either. But YouTube is something that I use often, and YouTube Shorts are hard to avoid. One of the most interesting things I learned on YouTube was saving money by rounding things up.

I've always been an extremely meticulous budgeter. I record every single dollar that goes in and out, and I divide it up. Automatic apps don't work for me because I often use Venmo and Splitwise and things like those. I've always been the person who would ignore money coming in from, say, research studies, just so I have this nice net of savings that doesn't remain visible to me. I've always been this way, even with my allowance as a kid.

But one of the newest ways I've found to save money is that I round my transactions up. So if I spend $5, I mark it as $10 in my budgeting database in Notion. Basically, it still stores $5, but what I see is the ceiled amount, and that ceiled amount is $10. Or instead of $45, it is $50; instead of $60, it is $100. It's kind of the same idea as saving change in pennies, just in whole dollars. I really like that, and I just wanted to share it.


  • I am Stuck in a Loop of Datasets ↔ Techniques

I keep jumping between “I do not trust the evaluation, the data is poor” and “the dataset I created only has 100 samples.”

Over the past three years, starting in 2021, I've had multiple ideas about the best ways to evaluate certain processes we currently do with LLMs. Whether it be general evaluation on one single task, multi-run evaluations, theory-of-mind evaluations, or something like RAG where you're retrieving over a set of documents, I've always had those ideas and really wanted to implement them.

I think the problem I keep getting stuck on is that datasets don't actually make sense to me. Sure, there's a lot of conversation about how many of these datasets are useless and companies need private datasets, etc. But you can't publish a paper on companies' private datasets; you publish on public datasets, and so many public datasets right now are very much just generated by large models. That makes no sense to me, because you're using them to evaluate those same large models.

Every single time I “look at the data” (the thing that everyone asks you to do, and the thing that I've been doing throughout my PhD), I find it disappointing. Then I go into this realm of "oh, so I should create it myself," and that is something that I really like to do. I have done it before, but I think the thing that is holding me back is the question of how many samples a dataset needs to have for it to be viable to publish, and for the research done using that dataset to be published and be useful.

I think I have been inflating the number in my head, whereas some of the datasets that I like are barely 500 samples, and those that I have liked recently are 100 to 150 samples. I think I need to change my mindset about creating a dataset and putting it out there. If I'm running experiments on it and putting the subsequent publication out there, it doesn't need to be a dataset that has thousands of samples.

I do have an eye for good data and good data curation. I should be able to create a good dataset that is just 150 samples and is still viable in its usage, and that tells us something useful about the models we run it on.

LiveBench has 150 questions in each category
CiteMe has a total of 130 samples

  • Please Don’t Have 10 Textareas in Review Forms

Another summer, another round of paper reviewing issues. I kinda just wanted to record this Twitter conversation somewhere, so that I have someplace to point people to when they ask me what I want changed in reviewing.

    My hot take is reviews should be a single text box. Asking people to fill in 10 text boxes for a paper increases the chances that no one wants to do it.
    Especially looking at NeurIPS Datasets & Benchmarks, why does it have NINE text areas other than "Summary And Contributions" and "Review".
All fields in Neurips Datasets and Benchmarking review: Title, Summary And Conclusions, Review, Biography, Rating, Opportunities For Improvement, Confidence, Limitations, Conclusions, Clarity, Relation To Prior Work, Reproducibility, Ethics, Flag For Ethics Review.

    ACs have said that it allows inexperienced reviewers to decompose the review.

I guess it is the way I review that makes this format really hard. I feel guilty about writing a single line in the text boxes there, but I review line by line on my iPad (similar to Overleaf comments, just in PDF form) and would very much prefer to be able to extract those comments and just paste them in. For example, I often write “typo: fix this word” or “cite X here, which negates the results you are obtaining; talk about why that might be the case” inline. Redistributing this into separate content areas (especially ones with overlapping purposes) feels really overwhelming.

Yanai here makes a point about highlighting the global, i.e., main, strengths/weaknesses of the paper: eventually, these are the things the AC should focus on, and reviewers should make it easier for the AC to detect them.

And I do not disagree, but with a review load of more than 4 in a month, I end up dreading the second-pass requirement instead of being able to actually provide the review. I have always wondered if there's an opportunity to run reviewing experiments in newer conferences like @COLM_conf, where people can review the paper inline (like @hypothes_is/@askalphaxiv) and tag comments with their kind (suggestion, missing, etc.). Would that improve the reviewing culture? It would still be easy for ACs to filter out the main global points based on tags, but it doesn't require a second pass. And as a cherry on top, it is also hard to fake, unlike the GPT-based summary reviews of the paper.

Something I did not mention in that conversation: I often want to extract these highlights and ask GPT/Claude/the LLM of the month to separate them out into categories to put into those text areas, but the outcome never sounds natural, and there are always slight hallucinations that are difficult to catch. So, if that doesn't work either, maybe we should try to change the review UX itself?



  • Add packages to ChatGPT code interpreter environment

TIL you can add unavailable packages to the ChatGPT code interpreter environment (for example, if you want your output to use seaborn). You can use the pip install custom GPT, which calls remote actions in the background, to download these libraries.

From Simon Willison’s blog post:

    If there's a package that you want to use you can grab the .whl files for it from PyPI, upload them to ChatGPT and it will install them into its own environment for the duration of that chat session.

    If a library has dependencies you have to upload those, too.
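A rough sketch of what the install step can look like inside the session (the wheel filename below is hypothetical, and /mnt/data is where ChatGPT places uploaded files):

    # Install an uploaded wheel inside the sandboxed session.
    # --no-index stops pip from trying to reach the network,
    # which the code interpreter environment doesn't have.
    import subprocess, sys

    subprocess.run(
        [sys.executable, "-m", "pip", "install", "--no-index",
         "/mnt/data/seaborn-0.13.2-py3-none-any.whl"],
        check=True,
    )

    import seaborn as sns  # importable for the rest of the session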


  • Some people have a knack for interesting math problems

I have read this and calculated it up to n=10, but this is still so unintuitive for me.

    Daniel Litt on Twitter asks:

    🔗 Daniel Litt · @littmath · 04:55 PM · Mar 16, 2024 UTC

    Flip a fair coin 100 times—it gives a sequence of heads (H) and tails (T). For each HH in the sequence of flips, Alice gets a point; for each HT, Bob does, so e.g. for the sequence THHHT Alice gets 2 points and Bob gets 1 point. Who is most likely to win?


    And the answer is:

    🔗 Daniel Litt · @littmath · 05:00 PM · Mar 17, 2024 UTC

    The correct answer is “Bob.” Congrats to the 10% who got it right — those few brave dreamers. pic.x.com/menJ81BrKJ


But how? Why? This is so beyond my mathematical intuition.
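The standard intuition: Alice and Bob have the same expected number of points (each adjacent pair of flips is HH or HT with probability 1/4), but HH points cluster (a run like HHHH gives Alice 3 points at once), so her score has higher variance: she wins by bigger margins when she wins, and loses more often. A quick Monte Carlo sketch to see the asymmetry:

    # Monte Carlo: Alice scores a point for each HH pair, Bob for each HT pair
    import random

    def play(n_flips=100):
        flips = [random.random() < 0.5 for _ in range(n_flips)]  # True = heads
        alice = sum(flips[i] and flips[i + 1] for i in range(n_flips - 1))    # HH
        bob = sum(flips[i] and not flips[i + 1] for i in range(n_flips - 1))  # HT
        return alice, bob

    wins = {"Alice": 0, "Bob": 0, "Tie": 0}
    for _ in range(100_000):
        a, b = play()
        wins["Alice" if a > b else "Bob" if b > a else "Tie"] += 1

    print(wins)  # Bob wins the most games, even though the expected scores are equal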