---
title: "LLM (ML) Job Interviews - Resources"
slug: llm-ml-job-interviews-resources
canonical_url: https://mimansajaiswal.github.io/posts/llm-ml-job-interviews-resources/
collection: Fieldnotes
published_at: 2024-12-24T00:00:00.000Z
updated_at: 2025-09-11T00:00:00.000Z
tags: 
  - "🌲 Evergreen"
  - LLMs
  - NLP
  - Research
  - Guide
  - List
excerpt: "A collection of all the resources I used to prepare for my ML/LLM research science/engineering focused interviews in Fall 2024."
author: "Mimansa Jaiswal"
---



**Navigation**

This post has two parts:

1.  Job Search Mechanics (including context, applying, and industry information), which you can read at [![](https://mimansajaiswal.github.io/_astro/briefcase_yellow.CAPP3IXy_Z1EYdWV.svg)LLM (ML) Job Interviews (Fall 2024) - Process](https://mimansajaiswal.github.io/posts/llm-ml-job-interviews-fall-2024-process/), and,
2.  Preparation Material and Overview of Questions, which you can continue reading below.


**Disclaimer**

Last Updated: Feb 24, 2025

This is my personal process, which may work differently for others depending on their circumstances. I'm writing this in December 2024, based on my experience during Fall 2024. While the field of LLMs evolves rapidly and specific details may become outdated, the core principles should remain useful.

I want to be clear: this isn't a definitive guide or preparation manual, nor is it a guaranteed path to success. It's simply a record of what I did, without any claims about its effectiveness. If you find it helpful, wonderful—if not, I'd love to hear what worked for you instead. I'm sharing my experience, not prescribing a universal approach.

_I typically avoid rewriting my personal posts using LLMs, but this content has been heavily **edited**, not **generated** (primarily using Notion AI, Claude 3.5 Sonnet, and GPT-4o) to keep it concise and professional while maintaining my authentic voice—more or less._

## Database View and Contributions

Some readers just want a simple list without the detailed explanations and context—and that's perfectly fine.

I've included two toggles below. The first shows these same resources in a Notion database view that you can freely explore—expand it, open it in a new page, or browse as needed.

The second toggle contains a contribution form for a separate database. I personally review all submissions to filter out spam, and I'll move valuable contributions to the main database. Contributing is entirely optional—while I don't expect anyone to do so, I welcome any worthwhile additions.

The collection currently lacks resources in certain areas, particularly multimodal and speech-based content. Though I can't write extensively about these topics myself, I'm happy to serve as a curator. If you have relevant resources to share, please use the form—I'll review submissions and incorporate valuable ones into the main database.

**To view this content as a searchable database, expand this toggle. You can explore the embedded database by expanding it or opening it in a new page. (Last Updated: Feb 6, 2025)**

**To contribute to this collection, expand this toggle to find the submission form. I review all entries to filter out spam and add valuable contributions to the main database.**

## Content Storage

I use [Notion](https://www.notion.so/) **pages** for content storage. Though [Capacities](https://capacities.io/) would have been ideal, its lack of a reliable output API meant I'd need to copy content into Notion anyway for this post. Additionally, Capacities doesn't support multi-user collaboration—a feature I occasionally need—and I prefer keeping my content in one place rather than scattered across multiple apps. I've ruled out Obsidian since I prefer rich-text block-based systems. Capacities does have one standout feature: the ability to mark individual blocks as database objects with properties while keeping the block content embedded in place. This feature is particularly valuable for learning, and I highly recommend it. This isn't meant to be a comprehensive overview of my content/knowledge management practices—just a brief explanation of my organizational preferences for interview preparation.

### Classification

I maintained 7 major sections in my interview preparation page: [Statistical Knowledge](https://mimansajaiswal.github.io/posts/llm-ml-job-interviews-resources/#statistics), [ML Knowledge](https://mimansajaiswal.github.io/posts/llm-ml-job-interviews-resources/#ml-knowledge), [ML Design](https://mimansajaiswal.github.io/posts/llm-ml-job-interviews-resources/#ml-system-design), [DSA Coding](https://mimansajaiswal.github.io/posts/llm-ml-job-interviews-resources/#leetcode), [ML Coding](https://mimansajaiswal.github.io/posts/llm-ml-job-interviews-resources/#ml-dl-llm-coding), [Behavioral](https://mimansajaiswal.github.io/posts/llm-ml-job-interviews-resources/#behavioral), and “[I know of these papers](https://mimansajaiswal.github.io/posts/llm-ml-job-interviews-resources/#i-know-of-these-papers)”. Each section (other than the papers) had a Questions subpage with 4 categories: "Aced it," "Took time," "Didn't get it," and "Just saw it somewhere."

As mentioned in my [![](https://mimansajaiswal.github.io/_astro/briefcase_yellow.CAPP3IXy_Z1EYdWV.svg)LLM (ML) Job Interviews (Fall 2024) - Process](https://mimansajaiswal.github.io/posts/llm-ml-job-interviews-fall-2024-process/) post, I meticulously documented and categorized every interview question in the question bank, along with my performance evaluation. For those familiar with JEE preparation, this approach might sound familiar—it's my tried and tested method (though I failed at JEE), and it's the only system I know (fingers crossed for better results this time). Whenever questions repeated across interviews, I added comments in Notion noting their previous appearances and updated their categories based on my latest performance.

## Leetcode

Companies consistently test candidates with LeetCode questions during interviews—yes, even startups. Here's the preparation strategy that worked for me.

Most companies don't allow code execution during interviews. While some Microsoft and Apple teams did permit running code (which was helpful), I learned to spot and prevent common mistakes without needing to test the code.

I tackled all questions from both recommended lists systematically. My first pass focused on finding solutions—whether brute force or optimized. After at least two days, I'd return to each question to implement an optimized solution. When successful, I'd move on. When stuck, I'd consult ChatGPT for the algorithm, implement it myself, and submit the solution. To maintain accurate progress tracking, I'd then intentionally submit an incorrect version to keep the question unmarked. After another two-day break following failed attempts, I'd try implementing the optimized version again.

I continued this process until I mastered all questions on the [Grind 75 list](https://leetcode.com/problem-list/rab78cw1/), completed all easy and medium questions, and solved 50% of hard questions on the [NeetCode 150 list](https://leetcode.com/problem-list/plakya4j/).

- [Grind 75](https://leetcode.com/problem-list/rab78cw1/)
- [NeetCode 150](https://leetcode.com/problem-list/plakya4j/)

## ML/DL/LLM Coding

While these concepts can be coded given sufficient time and debugging capabilities, interview settings present unique challenges. You'll typically have just 25-35 minutes, need flawless execution, and must maintain precise matrix dimensions throughout. That's why practicing implementation is crucial, even if you're already comfortable with the concepts. To help you prepare, I've compiled repositories showcasing interview-style code examples, categorized into basic ML and LLMs. Following my Contextual interview, I expanded this collection by implementing various RAG/LLM inference papers—a category I found particularly engaging.

-   For basic ML questions:
    
    [Machine-Learning-Interviews: MLC notebooks (alirezadir)](https://github.com/alirezadir/Machine-Learning-Interviews/tree/main/src/MLC/notebooks): a guide for Machine Learning/AI technical interviews.
    
    -   One major omission in this resource is the implementation of basic neural networks using NumPy and PyTorch. To address this gap, I practiced coding feedforward neural networks (FNN), LSTMs, and RNNs from scratch using random values, then validated my implementations against online examples.
-   For LLMs, I worked with the "Hands-on Large Language Models" repository. Rather than following the book directly, I turned the notebooks into fill-in-the-blank exercises and completed them systematically.
    
    [Hands-On Large Language Models (HandsOnLLM)](https://github.com/HandsOnLLM/Hands-On-Large-Language-Models/): official code repo for the O'Reilly book.
    
    -   Be prepared for questions about implementing different attention mechanisms — including cached, grouped query, multi-head, and single-head attention. You'll need to know how to code these using just NumPy and PyTorch. I recently found this post, which provides a thorough overview.
        
        [Transformers Laid Out](https://goyalpramod.github.io/blogs/Transformers_laid_out/)
        
    -   Master implementing key components of the transformer architecture, including token embeddings, layer normalization, encoder layers, and techniques for integrating Mixture of Experts (MoE) into existing model architectures.
        -   I studied several codebases thoroughly, particularly the LLaMA model implementation and OLMo repository, to deepen my understanding of these concepts.
            
            [huggingface/transformers: LLaMA model implementation](https://github.com/huggingface/transformers/tree/main/src/transformers/models/llama)

            [allenai/OLMo](https://github.com/allenai/OLMo): modeling, training, eval, and inference code for OLMo.
            
-   Make sure to include dimensions in your variable names or comments. This helps with debugging, and interviewers frequently check dimensions to test your understanding—not just your ability to memorize. While I started by using comments, I later found Noam Shazeer's Shape Suffixes approach ([Shape Suffixes — Good Coding Style](https://medium.com/@NoamShazeer/shape-suffixes-good-coding-style-f836e72e24fd)), which proved to be an excellent—perhaps even superior—method.
-   For RAG/LLM inference papers, I practiced implementing core components: a basic embedding model with clustering capabilities, along with common decoding strategies like top-p, top-k, temperature control, and beam vs. greedy search. I also selected 10 papers from my favorite researchers and worked through their implementations independently. Since running the full models wasn't practical due to their size, I used LLM chatbots to validate my approaches, drawing on my ML expertise to critically evaluate their responses and identify potential issues.
-   Finally, for additional practice, I worked through every question here using [the same approach I used with LeetCode](https://mimansajaiswal.github.io/posts/llm-ml-job-interviews-resources/#leetcode), though I rarely needed multiple attempts this time.
    
    [Deep-ML](https://www.deep-ml.com/): practice machine learning and data science with hands-on coding challenges, real datasets, and interactive labs.
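Several of the bullets above (attention variants, dimension tracking, shape suffixes) come together in one small exercise. Here is a minimal sketch of single-head scaled dot-product attention in NumPy, using Shazeer-style shape suffixes in variable names (T = sequence length, D = head dimension); this is my own illustrative sketch, not code from any specific interview:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_TD, k_TD, v_TD, causal=False):
    """Single-head scaled dot-product attention.

    Shape suffixes: T = sequence length, D = head dimension.
    """
    T, D = q_TD.shape
    scores_TT = q_TD @ k_TD.T / np.sqrt(D)     # similarity of each query to each key
    if causal:
        # Mask out future positions so token t attends only to tokens <= t.
        mask_TT = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores_TT = np.where(mask_TT, -1e9, scores_TT)
    weights_TT = softmax(scores_TT, axis=-1)   # each row sums to 1
    return weights_TT @ v_TD                   # weighted sum of values

rng = np.random.default_rng(0)
x_TD = rng.normal(size=(4, 8))
out_TD = attention(x_TD, x_TD, x_TD, causal=True)
print(out_TD.shape)  # (4, 8)
```

Multi-head attention repeats this per head on D/H-sized slices and concatenates the results; the shape suffixes make it easy to verify every matmul's dimensions out loud, which is exactly what interviewers probe.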
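For the decoding strategies mentioned above, here is a compact sketch of temperature, top-k, and top-p (nucleus) filtering over a single logit vector (again a sketch under my own naming, not a reference implementation):

```python
import numpy as np

def sample_next_token(logits_V, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Pick the next token id from a vocabulary-sized logit vector.

    Shape suffix: V = vocabulary size. temperature < 1 sharpens the
    distribution, top_k keeps the k highest-probability tokens, top_p keeps
    the smallest set of tokens whose cumulative probability reaches p.
    """
    rng = rng or np.random.default_rng()
    logits_V = logits_V / temperature
    probs_V = np.exp(logits_V - logits_V.max())
    probs_V /= probs_V.sum()

    if top_k is not None:
        # Zero out everything outside the k most probable tokens.
        kth = np.sort(probs_V)[-top_k]
        probs_V = np.where(probs_V >= kth, probs_V, 0.0)
    if top_p is not None:
        # Keep the smallest prefix (by descending probability) whose mass >= top_p.
        order = np.argsort(probs_V)[::-1]
        cutoff = np.searchsorted(np.cumsum(probs_V[order]), top_p) + 1
        mask_V = np.zeros_like(probs_V, dtype=bool)
        mask_V[order[:cutoff]] = True
        probs_V = np.where(mask_V, probs_V, 0.0)

    probs_V /= probs_V.sum()
    return int(rng.choice(len(probs_V), p=probs_V))
```

Greedy decoding is just `int(np.argmax(logits_V))`; beam search generalizes it by carrying the B highest-scoring partial sequences forward instead of one.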
    

## Statistics

I originally hadn't planned to include statistics in my preparation, but after being caught off guard by an unexpected Amazon interview loop early in my job search, I quickly realized I needed to. I then set out to find and practice statistics questions, using the resources listed below.

I'll be candid: there were many times when I struggled to remember concepts or needed to better understand the reasoning behind material I had simply memorized in college. I turned to ChatGPT and Claude to help clarify these topics. Though I always fact-checked their responses and ensured I had proper context, this approach actually improved my ability to explain concepts clearly and systematically (even if I still tended to ramble too much).
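To give a flavor of the probability questions these resources drill, here is a classic Bayes' rule computation (a toy example with made-up numbers, not a question from my interviews):

```python
# P(disease | positive test) via Bayes' rule, with made-up numbers:
# prevalence 1%, sensitivity 95%, false-positive rate 5%.
p_disease = 0.01
p_pos_given_disease = 0.95   # sensitivity
p_pos_given_healthy = 0.05   # false-positive rate

# Total probability of testing positive (law of total probability).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior: most positives are false positives despite a "95% accurate" test.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # 0.161
```

Being able to narrate each step (prior, likelihood, normalization) mattered more in my interviews than getting the arithmetic instantly.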

[Probability Cheatsheet (wzchen)](https://github.com/wzchen/probability_cheatsheet): a comprehensive 10-page cheatsheet covering a semester's worth of introductory probability.

Some linear algebra concepts from here (I know it's not statistics, but it's still relevant)

[MIT 18.06 Linear Algebra: Zoom Notes (PDF)](https://ocw.mit.edu/courses/18-06-linear-algebra-spring-2010/4d876a9159e32543eb0d73b4d4382f4c_MIT18_06S10ZoomNotes.pdf)

For data science question preparation, I used this resource:

[Data-Science-Interview-Questions-Answers (youssefHosni)](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/tree/main): a curated list of data science interview questions and answers.

I found Chapter 5 of Chip Huyen's book particularly useful:

[Chapter 5. Math · Machine Learning Interviews Book](https://huyenchip.com/ml-interviews-book/contents/chapter-5.-math.html)

## ML Knowledge

I organized the Machine Learning knowledge section into two parts: (a) ML fundamentals and (b) Large Language Models (LLMs) and Transformers. Looking back, I should have created three parts by making NLP fundamentals its own section. Depending on your specialization, this third section could instead cover Computer Vision with Vision Language Models, or other cutting-edge architectures.

### ML Fundamentals

I began my preparation with these resources, which cover the fundamentals:

[The Little Book of Deep Learning (François Fleuret)](https://fleuret.org/francois/lbdl.html)

[AI Flashcards](https://aiflashcards.com/): 119 visual flashcards on machine learning and AI concepts.

[Machine Learning Glossary](https://ml-cheatsheet.readthedocs.io/en/latest/)

[machine-learning-cheat-sheet (soulmachine)](https://github.com/soulmachine/machine-learning-cheat-sheet): classical equations and diagrams in machine learning.

Next, I worked through chapters 7 and 8 of the ML interviews book:

[Chapter 7. Machine Learning Workflows · Machine Learning Interviews Book](https://huyenchip.com/ml-interviews-book/contents/chapter-7.-machine-learning-workflows.html)

I primarily used these two resources in my preparation. For each question I encountered, I documented it and explored potential follow-up questions. My study method was straightforward: I'd first attempt to answer the question independently. If successful, I'd use ChatGPT to generate follow-up questions and practice those too. When I couldn't answer a question, I'd start a fresh chat to research the topic, document my findings with notes and comments in Notion, and categorize it under the relevant toggle section.

Overall, your list should include:

-   General breadth topics:
    -   Machine Learning Fundamentals: Supervised vs. unsupervised learning, logistic regression for classification, bag-of-words, bagging vs. boosting, decision trees/random forests, confusion matrix, regularization methods, clustering (k-means/k-medians), distance measures, convexity, loss functions.
    -   Statistics and Probability: Bias-variance tradeoff, imputation techniques, handling different dataset sizes, causal inference, statistical significance, regression analysis, Bayesian analysis, time series analysis, active learning, hypothesis testing, probability calculations.
    -   Models and Performance: Overfitting vs. underfitting, model performance monitoring, online/offline evaluation metrics.
    -   Dimensionality Reduction: Feature selection methods, matrix factorization, autoencoders, curse of dimensionality, PCA vs. t-SNE vs. MDS vs. autoencoders.
    -   Time Series Analysis: Stationary time series, frequency domain analysis, forecasting models.
    -   Neural Networks: Attention mechanisms, ReLU activation, deep neural networks (DNNs), recurrent neural networks (RNNs) including LSTMs, convolutional neural networks (CNNs), transformer models.
    -   Algorithms:
        -   Clustering: K-means, LDA, word2vec, hierarchical clustering, Mean-Shift, DBSCAN, expectation-maximization using GMM.
        -   Classification: Logistic regression, Naïve Bayes, k-nearest neighbors (KNN), decision trees, support vector machines (SVMs).
        -   Optimization: Unconstrained continuous optimization, gradient descent methods (batch vs. stochastic), maximizing continuous functions.
    -   Other Topics: Probability distributions and limit theorems, discriminative vs. generative models, methodology (online vs. offline learning, cross-validation), online experimentation (A/B testing), binary and multi-class classification, hyperparameters, feature engineering.
-   Area-specific topics:
    -   Personalization/Recommendation Systems: Collaborative filtering, transformer architecture, active learning, self-attention, multiple attention heads, pooling methods, visualizing attention layers, ranking, multi-armed bandits (MABs), evaluation metrics, cold-start recommenders.
    -   Deep Learning: Transformers, dropout, normalization, maxout, softmax, word embeddings.
    -   Natural Language Processing: Latent Dirichlet Allocation (LDA), language modeling, handling sparse data, text normalization techniques, statistical language modeling, information retrieval, word2vec, CNNs, transformers.
    -   Computer Vision: Image and signal processing, linear algebra, image retrieval, CNNs, residual networks, etc.
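Several of the breadth topics above (k-means in particular) also show up as short coding exercises rather than pure knowledge questions. A minimal Lloyd's-algorithm sketch in NumPy, illustrative only and using my own variable naming:

```python
import numpy as np

def kmeans(X_ND, k, iters=100, seed=0):
    """Lloyd's algorithm. Shape suffixes: N = points, D = features, K = clusters."""
    rng = np.random.default_rng(seed)
    # Initialize centers as k distinct random data points.
    centers_KD = X_ND[rng.choice(len(X_ND), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        d2_NK = ((X_ND[:, None, :] - centers_KD[None, :, :]) ** 2).sum(-1)
        labels_N = d2_NK.argmin(axis=1)
        # Recompute each center as the mean of its assigned points;
        # keep the old center if a cluster went empty.
        new_KD = np.array([
            X_ND[labels_N == j].mean(axis=0) if (labels_N == j).any() else centers_KD[j]
            for j in range(k)
        ])
        if np.allclose(new_KD, centers_KD):
            break
        centers_KD = new_KD
    return centers_KD, labels_N

# Two well-separated blobs should be recovered as two clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
centers, labels = kmeans(X, k=2)
```

Interviewers often follow up on initialization sensitivity (k-means++), the choice of k, and why the objective monotonically decreases, so it is worth rehearsing those answers alongside the code.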

### LLMs

I organized the material into 5 key sections:

-   Architecture and Training: Covered transformer architecture fundamentals, pre-training versus fine-tuning approaches, training objectives (with emphasis on next-token prediction), tokenization methods, scaling laws, and parameter-efficient techniques like LoRA and Prefix Tuning.
-   Generation Control: Explored methods for controlling model outputs through temperature and top-p sampling, effective prompt engineering, few-shot and in-context learning techniques, chain-of-thought prompting for complex reasoning, and strategies to minimize hallucinations.
-   LLM Evaluation: Examined key metrics (including perplexity, ROUGE, and BLEU), standard benchmarks like MMLU and BigBench, approaches to human evaluation, and tools for bias detection and mitigation.
-   Optimization and Deployment: Studied practical aspects like quantization, model distillation, inference optimization, prompt caching, load balancing, and cost-effective deployment strategies.
-   Safety and Ethics: Focused on essential safeguards including content filtering, output sanitization, protection against jailbreaking attempts, and data privacy protocols for responsible LLM deployment.
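Of the evaluation metrics in the list above, perplexity is the one I was most often asked to define precisely; a small sketch computing it from raw logits (a toy example of my own, not from any interview):

```python
import numpy as np

def perplexity(logits_TV, targets_T):
    """Perplexity of a target sequence under per-step logits.

    Shape suffixes: T = sequence length, V = vocabulary size.
    Perplexity = exp(mean negative log-likelihood of the targets).
    """
    # Stable log-softmax over the vocabulary axis.
    logits_TV = logits_TV - logits_TV.max(axis=-1, keepdims=True)
    logprobs_TV = logits_TV - np.log(np.exp(logits_TV).sum(axis=-1, keepdims=True))
    # Negative log-probability of each target token.
    nll_T = -logprobs_TV[np.arange(len(targets_T)), targets_T]
    return float(np.exp(nll_T.mean()))

# Sanity check: a model that is perfectly uniform over V tokens
# has perplexity exactly V.
uniform_logits = np.zeros((5, 10))   # T=5 steps, V=10 vocabulary
print(perplexity(uniform_logits, np.arange(5)))  # ≈ 10.0
```

The uniform-distribution sanity check is a handy one to state out loud: it shows you understand perplexity as the effective branching factor, not just a formula.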

* * *

One of the best analogies I learned came from Kevin Murphy's PML book, as quoted in Yuan's blog post ( [![](https://www.yuan-meng.com/favicon.ico) Attention as Soft Dictionary Lookup](https://www.yuan-meng.com/posts/attention_as_dict/)).

[![We can think of attention as a soft dictionary look up, in which we compare the query q to each key k_i, and then retrieve the corresponding value v_i." - Chapter 15, p. 513](https://mimansajaiswal.github.io/_astro/image.DYgpbHZr_ZEnjyz.webp)](https://mimansajaiswal.github.io/_astro/image.DYgpbHZr_ZEnjyz.webp)

"We can think of attention as a soft dictionary lookup, in which we compare the query *q* to each key *k_i*, and then retrieve the corresponding value *v_i*." - Chapter 15, p. 513

While this architecture is widely known, I found it valuable as a framework for organizing my review of architecture-related papers and questions.

[![Differences between Transformer and LLaMA architectures, as found in the Medium post "Unlocking Low-Resource Language Understanding”](https://mimansajaiswal.github.io/_astro/image.D3LRsVtE_Z1H8Fle.webp)](https://mimansajaiswal.github.io/_astro/image.D3LRsVtE_Z1H8Fle.webp)

Differences between Transformer and LLaMA architectures, as found in the Medium post "[Unlocking Low-Resource Language Understanding](https://medium.com/@ccibeekeoc42/unlocking-low-resource-language-understanding-enhancing-translation-with-llama-3-fine-tuning-df8f1d04d206)”

Here are the key resources and materials I studied:

-   General content
    
    [LLM Notes (Google Doc)](https://docs.google.com/document/d/1K7ahLiopilE0TxpkRcrrzgUPTcjZq0aY_5J797_0o98/edit?tab=t.0)
    
    [Understanding LLMs Part 1 - Transformers (Miro board)](https://miro.com/app/board/uXjVLBwJaV4=/)
    
    [Aman's AI Journal: Overview of Large Language Models](https://aman.ai/primers/ai/LLM/)
    
    [Transformer Math 101 (EleutherAI)](https://blog.eleuther.ai/transformer-math/)
    
    [YouTube video](https://www.youtube.com/watch?v=9vM4p9NN0Ts&feature=youtu.be)
    
    [Intro to Transformers (Lucas Beyer)](http://lucasb.eyer.be/transformer)
    
    [LM-class](https://lm-class.org/lectures): an education resource for contemporary language modeling, broadly construed.
    
    [arXiv:2501.09223](https://arxiv.org/pdf/2501.09223)
    
    [How To Scale Your Model](https://jax-ml.github.io/scaling-book/): a book on how TPUs and GPUs work, how LLMs run on real hardware, and how to parallelize models during training and inference so they run efficiently at massive scale.
    
    [Transformers from Scratch (Brandon Rohrer)](https://www.brandonrohrer.com/transformers)
    
    [GenAI Handbook](https://genai-handbook.github.io/)
    
-   For LLM architecture, I focused on:
    -   Encoder vs decoder
        
        [The Transformer Family Version 2.0 (Lilian Weng)](https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/): a refactored and expanded survey of Transformer architecture improvements, with consistent notation for attention shapes and variants.
        
    -   Training objectives
        
        > As of Dec 30, 2024, MTP is an interesting mechanic to examine, and you should be prepared for questions about it. [Deepseek-v3 has an implementation](https://arxiv.org/abs/2412.19437) that you can reference.
        
    -   Self-attention (the sizes of Q, K, and V and their corresponding weight matrices),
        
        [Attention as Soft Dictionary Lookup](https://www.yuan-meng.com/posts/attention_as_dict/)
        
        This video is great if you want to explain the intuition to someone else:
        
        [https://youtu.be/eMlx5fFNoYc](https://youtu.be/eMlx5fFNoYc?si=bNsHDCbPuU95j7Av)
        
    -   Multi-head attention (concatenation of head outputs rather than averaging), grouped query attention, and other variants.
        
        > Something to note: MLA became a popular mechanism _while_ I was interviewing, so I learned about it during the process, aided by my "I know of these papers" section.
        
    -   Position-wise Feed-Forward Network (FFN size, ReLU activation, comparisons between ReLU, Swish, SiLU, and SwiGLU variants),
    -   Positional embeddings (dimensions, sizing, learned embeddings, various embedding types, and Rotary Position Embeddings - RoPE),
    -   Layer normalization (underlying intuition and why LayerNorm parameters are separate from layer parameters),
    -   Cross-attention (comparing cross-attention with masked self-attention, understanding decoder-only architectures, and handling initial inputs in decoder-only models).
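
Interviewers often probe the shape bookkeeping here, so it helps to be able to write scaled dot-product attention from scratch. Below is a minimal sketch in plain Python with toy dimensions (all names and sizes are illustrative, not any library's API); a multi-head version would run `h` of these with `d_k = d/h` and concatenate the head outputs before a final output projection:

```python
import math
import random

random.seed(0)

def matmul(A, B):
    # naive matrix multiply: (n x k) @ (k x m) -> (n x m), with plain lists
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, Wq, Wk, Wv):
    # X: (L x d); Wq, Wk: (d x d_k); Wv: (d x d_v)
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d_k = len(Wq[0])
    # scores: (L x L); the 1/sqrt(d_k) scaling keeps softmax inputs well-ranged
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d_k) for kj in K]
              for qi in Q]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(weights, V)                   # (L x d_v)

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

L, d, d_k = 4, 8, 8  # sequence length, model dim, head dim (toy values)
X = rand_matrix(L, d)
out = self_attention(X, rand_matrix(d, d_k), rand_matrix(d, d_k), rand_matrix(d, d_k))
print(len(out), len(out[0]))  # L rows of size d_v: 4 8
```

The scaling by the square root of the key dimension is what keeps the softmax from saturating as `d_k` grows.
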
-   For MoE, I started with this blogpost:
    
    [Knowing Enough About MoE to Explain Dropped Tokens in GPT-4](https://152334h.github.io/blog/knowing-enough-about-moe/)
    
    I then focused on the expert feed-forward networks (FFNs), studying their structure, placement, gating networks, routing mechanisms, load balancing, and training-stability challenges. As of Dec 30, 2024, DeepSeek's technical report offers valuable insights into practical MoE implementation.
    
    [DeepSeek-V3 Technical Report](https://arxiv.org/abs/2412.19437): a 671B-parameter MoE model (37B activated per token) combining Multi-head Latent Attention, DeepSeekMoE, an auxiliary-loss-free load-balancing strategy, and a multi-token prediction objective.
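
As a mental model for the routing described above, here is a toy top-k gating sketch (scalar tokens and linear "experts" for brevity; every name here is illustrative). Real MoE layers operate on vector hidden states with learned gates, and add load-balancing losses and per-expert capacity limits, which is where dropped tokens come from:

```python
import math
import random

random.seed(0)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(tokens, experts, gate_weights, top_k=2):
    """Route each token to its top_k experts and mix their outputs
    by the renormalized gate probabilities (the usual top-k gating)."""
    outputs = []
    for x in tokens:
        logits = [x * w for w in gate_weights]  # toy linear gate: one logit per expert
        probs = softmax(logits)
        chosen = sorted(range(len(experts)), key=lambda i: -probs[i])[:top_k]
        norm = sum(probs[i] for i in chosen)    # renormalize over chosen experts
        outputs.append(sum((probs[i] / norm) * experts[i](x) for i in chosen))
    return outputs

# four toy "experts", each just a scalar multiply
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [0.1, 0.5, -0.3, 0.2]
out = moe_layer([1.0, -2.0, 0.5], experts, gate_weights, top_k=2)
print(len(out))  # one mixed output per token: 3
```
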
    
-   For embeddings, I studied token and position embeddings, building on basic LLM embedding concepts, and specifically explored type/segment embeddings, directionality, and various tokenization methods.
-   The key inference topics included:
    
    -   Attention mechanisms in training versus inference, including KV cache architecture and dimensionality
    -   Batch processing strategies and optimizations
    -   Decoding approaches, from simple greedy search to more sophisticated methods like beam search, along with sampling parameters (top-p, top-k, temperature) and their interplay
    -   Context window constraints and management
    -   Model quantization techniques
    -   Special tokens and their roles in chat-tuned models, plus early stopping criteria
    -   Long-range dependency handling through sparse attention mechanisms
    
    And some reference material:
    
    [Generation configurations: temperature, top-k, top-p, and test time compute (Chip Huyen)](https://huyenchip.com/2024/01/16/sampling.html)
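
The sampling parameters above compose in a fixed order: temperature rescales the logits, then top-k and top-p truncate the distribution before sampling. A minimal sketch of that interplay (illustrative only, not any library's actual implementation):

```python
import math
import random

random.seed(0)

def sample_next(logits, temperature=1.0, top_k=None, top_p=None):
    """Temperature-scale the logits, apply optional top-k then top-p (nucleus)
    truncation, and sample a token index from the renormalized distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k is not None:
        order = order[:top_k]              # keep only the k most likely tokens
    if top_p is not None:
        keep, cum = [], 0.0
        for i in order:                    # smallest set whose mass reaches top_p
            keep.append(i)
            cum += probs[i]
            if cum >= top_p:
                break
        order = keep
    mass = sum(probs[i] for i in order)    # renormalize over survivors
    r = random.random() * mass
    for i in order:
        r -= probs[i]
        if r <= 0:
            return i
    return order[-1]

logits = [2.0, 1.0, 0.5, -1.0]
samples = [sample_next(logits, temperature=0.7, top_k=3, top_p=0.9) for _ in range(20)]
print(sorted(set(samples)))
```

With these settings the low-probability tail is cut by top-k and then the nucleus cutoff, so repeated calls only ever return the surviving indices.
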
    
-   The main fine-tuning topics included:
    
    -   Task-specific heads and adapter layers
    -   Parameter-efficient methods like LoRA (including mathematical foundations, comparisons between QLoRA and LoRA, and GaLore)
    -   Fine-tuning objectives and frameworks
    -   Hyperparameters and training approaches (supervised vs. unsupervised, instruction fine-tuning)
    -   Practical implementation steps (loading libraries, datasets, model configuration, tokenization, zero-shot inference, model evaluation)
    -   Direct Preference Optimization (DPO) in fine-tuning
    
    I made sure to practice writing boilerplate code for the libraries section, which also ties into the LLM coding portion.
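
For the LoRA math specifically, the core update is compact enough to write from memory: the frozen weight `W` is augmented with a scaled rank-r product `B @ A`, with `B` initialized to zero so fine-tuning starts exactly at the base model. A framework-free sketch (toy sizes; all names are mine):

```python
import random

random.seed(0)

def matvec(M, v):
    # (rows x cols) @ (cols,) -> (rows,)
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha=16):
    """y = W x + (alpha / r) * B (A x): the frozen base weight W plus
    a low-rank update, with the usual alpha / r scaling."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))  # rank-r bottleneck: d_in -> r -> d_out
    scale = alpha / len(A)           # len(A) is the rank r
    return [b + scale * d for b, d in zip(base, delta)]

d_in, d_out, r = 6, 6, 2
W = [[random.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.1) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: the update is a no-op at first
x = [1.0] * d_in
y = lora_forward(W, A, B, x, alpha=16)
print(y == matvec(W, x))  # True: with B = 0 the LoRA path adds nothing
```

Only `A` and `B` are trained (roughly `r * (d_in + d_out)` parameters instead of `d_in * d_out`), which is the whole point of the method.
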
    
-   Retrieval and RAG. Although Retrieval-Augmented Generation (RAG) could be viewed as its own topic, I've incorporated it as part of the broader LLM pipeline based on my hands-on experience and how it currently integrates with language models. Here are the resources I used:
    
    [An Evolution of Learning to Rank](https://www.yuan-meng.com/posts/ltr/)
    
    [Negative Sampling (yuan-meng.com)](https://www.yuan-meng.com/posts/negative_sampling/)
    
    [An Introduction to Embedding-Based Retrieval](https://www.yuan-meng.com/posts/ebr/)
    
    [An Overview on RAG Evaluation (Weaviate)](https://weaviate.io/blog/rag-evaluation)
    
    The main RAG topics included:
    
    -   Knowledge base fundamentals (chunking, embeddings such as Word2Vec/Sentence-BERT/GPT models, contrastive loss, and vector storage)
    -   Query understanding
    -   Information retrieval components:
        -   Candidate generation (BM25, TF-IDF, Learning to Rank)
        -   Relevance scoring and dense retrieval
        -   Reranking and filtering
        -   Diversity sampling
        -   Context window selection
        -   Metadata extraction and result aggregation
    -   Context integration and response generation
    -   Evaluation metrics (NDCG, Mean Reciprocal Rank)
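
For the candidate-generation side, it is worth being able to write a lexical scorer like BM25 by hand. A compact sketch over whitespace-tokenized documents (toy corpus; `k1=1.5` and `b=0.75` are the common defaults, and this naive version recomputes document frequencies per document purely for readability):

```python
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each whitespace-tokenized doc against the query with BM25."""
    N = len(docs)
    toks = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in toks) / N  # average document length
    scores = []
    for t in toks:
        s = 0.0
        for term in query.lower().split():
            df = sum(1 for d in toks if term in d)  # document frequency
            if df == 0:
                continue
            idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
            tf = t.count(term)
            # term-frequency saturation (k1) and length normalization (b)
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

docs = [
    "the cat sat on the mat",
    "dogs and cats living together",
    "a transformer attends to every token",
]
scores = bm25_scores("cat mat", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(best)  # doc 0 is the only one containing both query terms
```

This also makes the exact-match limitation concrete: "cats" in the second document scores zero against the query term "cat", which is where stemming or dense retrieval comes in.
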
-   RLHF (Reinforcement Learning from Human Feedback) is a topic I wish I had studied more thoroughly. Although I covered the basics—comparing PPO and DPO approaches and using ChatGPT to explore the subject through follow-up questions—I reached my knowledge limit fairly quickly. I still feel there's _**much**_ more to learn. Here are some resources I used:
    
    [RLHF: Reinforcement Learning from Human Feedback (Chip Huyen)](https://huyenchip.com/2023/05/02/rlhf.html)
    
    [The N Implementation Details of RLHF with PPO (Hugging Face)](https://huggingface.co/blog/the_n_implementation_details_of_rlhf_with_ppo)
    
    [The 37 Implementation Details of Proximal Policy Optimization (ICLR Blog Track)](https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/)
    
    [Illustrating Reinforcement Learning from Human Feedback (RLHF) (Hugging Face)](https://huggingface.co/blog/rlhf)
    
    [rl-for-llms.md (Yoav Goldberg, GitHub Gist)](https://gist.github.com/yoavg/6bff0fecd65950898eba1bb321cfbd81)
    
    > Note that GRPO became a popular mechanism _after_ my interviews, so I learned about it later, purely out of interest.
    
    [A vision researcher’s guide to some RL stuff: PPO & GRPO](https://yugeten.github.io/posts/2025/01/ppogrpo/)
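
Since PPO-vs-DPO comparisons came up repeatedly, it helps that the DPO objective itself is compact: a logistic loss on the policy's preference margin over a frozen reference model. A minimal per-pair sketch (the log-probabilities below are made-up numbers for illustration):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: push the policy to prefer the chosen response y_w
    over the rejected y_l, measured relative to a frozen reference model."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1 / (1 + math.exp(-beta * margin)))  # -log sigmoid(beta * margin)

# if the policy already prefers the chosen response more than the reference
# does, the margin is positive and the loss is small
low = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-6.0)
# if it prefers the rejected response, the loss grows
high = dpo_loss(logp_w=-9.0, logp_l=-5.0, ref_logp_w=-6.0, ref_logp_l=-6.0)
print(low < high)  # True
```

No reward model or sampling loop is needed at training time, which is the practical contrast with PPO worth being able to articulate.
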
    
-   Since I have extensive experience with synthetic data generation and evaluation, I kept my preparation for these topics minimal. Although I have access to various resources on synthetic data generation and evaluation (including one I created), I didn't need to review them in depth.
    
    ![](https://mimansajaiswal.github.io/_astro/square-dashed_pink.D5DgnwzE_Z1EYdWV.svg)
    
    I would appreciate suggestions for additional resources to include in this section.
    
    [How To T̶r̶a̶i̶n̶ Synthesize Your D̶r̶a̶g̶o̶n̶ Data (Answer.AI)](https://www.answer.ai/posts/2024-10-15-how-to-synthesize-data.html): the art and science of crafting synthetic data for AI training.
    
-   Since test-time computation was a relatively new concept and I already had limited knowledge of reinforcement learning techniques, I chose not to focus heavily on this area. I realized I would only be able to offer surface-level responses to related questions. Nevertheless, I did review these key resources:
    
    [https://t.co/9tJY26eUcP](https://t.co/9tJY26eUcP) (YouTube)
    
    [https://t.co/lnvDrweyfK](https://t.co/lnvDrweyfK) (YouTube)
    
    [Reverse engineering OpenAI’s o1 (Interconnects)](https://www.interconnects.ai/p/reverse-engineering-openai-o1)
    
    [OpenAI’s o1 using “search” was a PSYOP (Interconnects)](https://www.interconnects.ai/p/openais-o1-using-search-was-a-psyop)
    
-   Since my knowledge of model optimization is limited, I welcome suggestions for additional reading material. Here are some resources that may be helpful:
    
    [Flash LLM (YouTube playlist)](https://www.youtube.com/playlist?list=PLO45-80-XKkT6BUKCYeBMTEqnlcpYavxq)
    
    [https://www.youtube.com/watch?v=139UPjoq7Kw](https://www.youtube.com/watch?v=139UPjoq7Kw)
    
    [llm_distillation.pdf (Google Drive)](https://drive.google.com/file/d/1xMohjQcTmQuUd_OiZ3hB1r47WB1WM3Am/view?pli=1)
    

## ML (System) Design

While I didn't extensively prepare for these interviews, I was fortunate that the companies I targeted mainly focused on research experiment design rather than general system design questions. I encountered a few questions about recommendation systems and sentiment analysis design, which I handled as well as I could. Working with primarily research-based design questions was a relief.

When I first started interviewing, I made the mistake of hesitating to ask for clarification. After finding two helpful blog posts and a YouTube video about asking better questions, I created a ChatGPT prompt to practice. I would pose questions like "How would you set up a research question design for LLaMA Guard fine-tuning?" and use ChatGPT's suggested clarifying questions for practice. This helped me realize that it's completely fine to spend up to 10 minutes asking clarifying questions.

-   Just the free section here:
    
    [Setting Up Machine Learning Systems for Scalable Design (Educative)](https://www.educative.io/courses/grokking-the-machine-learning-interview/setting-up-a-machine-learning-system)
    
-   The tips here:
    
    [System design interview questions (IGotAnOffer)](https://igotanoffer.com/blogs/tech/system-design-interview-questions)
    
-   And these two YouTube videos from Exponent:
    
    [https://www.youtube.com/watch?v=ZjNoipQAqRM](https://www.youtube.com/watch?v=ZjNoipQAqRM&list=PLrtCHHeadkHqYX7O5cjHeWHzH2jzQqWg5&index=7)
    
    [https://www.youtube.com/watch?v=6erP70R_NGs](https://www.youtube.com/watch?v=6erP70R_NGs&list=PLrtCHHeadkHp92TyPt1Fj452_VGLipJnL&index=38)
    

Another concern I had, which I've mentioned before, is that many people now default to using large language models for everything—even for tasks that simpler models like BERT could handle effectively, or when BM25 would work better than text embeddings with cosine similarity for information retrieval problems.

The key question is: When should we use these powerful generative models versus smaller discriminative models? After discussing this with my friends ([Yash](https://yashbhalgat.github.io/) and [June](https://www.linkedin.com/in/june-yuan-shangguan-8b281228/)), they helped me craft the perfect response, which I've since memorized: "There are two ways I can approach this problem. One uses a discriminative method when optimization and latency are priorities, though it's more restrictive in its outputs. The other uses a generative method throughout, which may introduce latency but offers more flexibility. I can briefly discuss the pros and cons of both approaches, and then you can guide which path you'd prefer me to explore."

I really appreciate them suggesting this approach. I now consistently use this method when tackling any design questions.

## Behavioral

You've probably heard of the STAR method by now. Stick to it—I did in my process too. I used natural transition phrases to avoid sounding rehearsed while keeping my responses structured. For example, I'd say "When this happened" or "For some context" as a personal reminder to describe the situation. Then I'd say "I had to do" to signal the task portion—these were cues for myself, not for the interviewer. I kept an eye on the clock to ensure I wasn't spending more than 30 seconds on any one part. Then I'd transition with phrases like "Here's what I did" or "I chose to do this," followed by "This was the result." I made sure the core response fit within two minutes. After that, I'd elaborate with additional details, often without prompting, but those first two minutes were always laser-focused.

I used this question bank set for behavioral questions and had stories for all of them.

[GitHub - ashishps1/awesome-behavioral-interviews: Tips and resources to prepare for Behavioral interviews.](https://github.com/ashishps1/awesome-behavioral-interviews)

Companies vary in their interview formats. Some ask just a few behavioral questions per round; others, like Meta, run extensive multi-round sessions with up to 10 behavioral questions, which can catch candidates off guard.

When interviewing with multiple companies, you'll need distinct examples for each question within a single interview loop. Repeatedly telling the same stories during your job search can make your responses sound mechanical. Here's a useful tip from my advisor: take a sip of water after each paragraph to stay present and keep your responses natural.

This principle extends to your introduction. Using the same introduction repeatedly can make you appear stiff and disconnected. Though interviewers expect candidates to prepare their responses, a natural delivery creates better engagement with your interviewer.

## I Know of These Papers

I deliberately chose the word "know" for an important reason—it captures multiple levels of familiarity: deeply reading papers, skimming them, finding them through social media, learning about them from others' presentations or discussions, and implementing them firsthand. This knowledge extends beyond academic papers to include Twitter threads, implementation guides, verified code examples, blog posts, and other sources. This broad scope is why I kept the heading intentionally general.

When interviewers ask open-ended questions, I make a point to cite my sources, saying things like "I learned this from a blog post" or "I've seen this discussed widely on Twitter." I maintain a broad collection of references and always stay transparent about my depth of understanding for each paper. As someone who's constantly online, I use Zotero and Raindrop to organize papers, related discussions, and emerging research. I skim every paper before adding it to my Zotero library, categorizing them by potential interview questions—which is exactly why I created this page.

I tracked open-ended questions in this section and researched relevant papers I may have overlooked, regularly expanding the list. I also maintained a collection of papers worth mentioning for specific topics, which proved invaluable. I'll share some resources to showcase recent papers that caught my attention. I should admit that my paper knowledge comes primarily from my somewhat excessive social media use rather than formal newsletters or collections. These resources might help you build your own reference library for handling open-ended questions.

Here are some resources I use to stay on top of the vast number of research papers in my field:

-   People who share others’ work and also themselves work on topics I am interested in
    
    -   [x.com/LChoshen](https://x.com/LChoshen)
    -   [x.com/yoavgo](https://x.com/yoavgo)
    
-   People who share other people’s recent work (I prefer their curation)
    
    -   [x.com/fly51fly](https://x.com/fly51fly)
    -   [x.com/gm8xx8](https://x.com/gm8xx8)
    -   [x.com/TheXeophon](https://x.com/TheXeophon)
    -   [bsky.app/profile/reachsumit.bsky.social](https://bsky.app/profile/reachsumit.bsky.social)
    -   [x.com/omarsar0](https://x.com/omarsar0)
    
    I speed skim this list:
    
    -   [arxiv cs.CL (@arxiv-cs-cl.bsky.social)](https://bsky.app/profile/arxiv-cs-cl.bsky.social): Computation and Language papers, sourced from export.arxiv.org/rss/cs.CL
    -   [x.com/arXivGPT](https://x.com/arXivGPT)
    -   [x.com/TheAITimeline](https://x.com/TheAITimeline)
    
-   People who summarize recent work or talk about ML fundamentals
    
    -   [Ahead of AI | Sebastian Raschka, PhD](https://magazine.sebastianraschka.com/)
    -   [Artificial Fintelligence | Finbarr Timbers](https://www.artfintel.com/)
    -   [Lil’Log | Lilian Weng](https://lilianweng.github.io/)
    
-   People whose work I follow
    
    -   [x.com/besanushi](https://x.com/besanushi)
    -   [x.com/jaseweston](https://x.com/jaseweston?lang=en)
    -   [x.com/BlancheMinerva](https://x.com/BlancheMinerva)
    

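Since the cs.CL feed above is just RSS (export.arxiv.org/rss/cs.CL), speed-skimming it is easy to script. Here is a small sketch that pulls item titles out of an RSS 2.0 document; the embedded sample feed and its paper titles are made up for illustration, and on the real feed you would fetch the XML first (e.g. with `urllib.request`).

```python
import xml.etree.ElementTree as ET

# A tiny sample shaped like an arXiv RSS 2.0 feed (titles are made up).
SAMPLE_RSS = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>cs.CL updates on arXiv.org</title>
    <item><title>Paper A: An Illustrative Title</title><link>https://arxiv.org/abs/0000.00001</link></item>
    <item><title>Paper B: Another Illustrative Title</title><link>https://arxiv.org/abs/0000.00002</link></item>
  </channel>
</rss>"""

def skim_titles(rss_xml: str) -> list[str]:
    """Pull item titles out of an RSS 2.0 feed for a quick daily skim."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("title", default="") for item in root.iter("item")]

titles = skim_titles(SAMPLE_RSS)
```

A couple of minutes scanning titles like this each day is roughly what I mean by "speed skim": you only open the handful of abstracts whose titles match topics you expect to be asked about.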
## Finally

Interview experiences can vary greatly depending on the luck of the draw. A skilled interviewer makes the process flow naturally, and while strategically steering conversations toward your strengths is valuable, some interviewers may be less flexible in their approach. Since resource quality and relevance can differ widely, it's crucial to prioritize materials that directly support your specific career aspirations.

I'm eager to learn from others who have successfully juggled machine learning studies with their other life responsibilities. If you know of any useful interview preparation resources, please share them—I'll gladly add them here.

Good luck with your preparation!

* * *

## Cite This Page

```
@article{jaiswal2024llmmljobintervi,
  title   = {LLM (ML) Job Interviews - Resources},
  author  = {Jaiswal, Mimansa},
  journal = {mimansajaiswal.github.io},
  year    = {2024},
  month   = {Dec},
  url     = {https://mimansajaiswal.github.io/posts/llm-ml-job-interviews-resources/}
}
```