This post has two parts:
- Job Search Mechanics (including context, applying, and industry information), which you can read at LLM (ML) Job Interviews (Fall 2024) - Process, and,
- Preparation Material and Overview of Questions, which you can continue reading below.
Last Updated: Feb 24, 2025
This is my personal process, which may work differently for others depending on their circumstances. I'm writing this in December 2024, based on my experience during Fall 2024. While the field of LLMs evolves rapidly and specific details may become outdated, the core principles should remain useful.
I want to be clear: this isn't a definitive guide or preparation manual, nor is it a guaranteed path to success. It's simply a record of what I did, without any claims about its effectiveness. If you find it helpful, wonderful—if not, I'd love to hear what worked for you instead. I'm sharing my experience, not prescribing a universal approach.
I typically avoid rewriting my personal posts using LLMs, but this content has been heavily edited (not generated), primarily using Notion AI, Claude 3.5 Sonnet, and GPT-4o, to keep it concise and professional while maintaining my authentic voice—more or less.
Database View and Contributions
Some readers just want a simple list without the detailed explanations and context—and that's perfectly fine.
I've included two toggles below. The first shows these same resources in a Notion database view that you can freely explore—expand it, open it in a new page, or browse as needed.
The second toggle contains a contribution form for a separate database. I personally review all submissions to filter out spam, and I'll move valuable contributions to the main database. Contributing is entirely optional—while I don't expect anyone to do so, I welcome any worthwhile additions.
The collection currently lacks resources in certain areas, particularly multimodal and speech-based content. Though I can't write extensively about these topics myself, I'm happy to serve as a curator. If you have relevant resources to share, please use the form—I'll review submissions and incorporate valuable ones into the main database.
To view this content as a searchable database, expand this toggle. You can explore the embedded database by expanding it or opening it in a new page. (Last Updated: Feb 6, 2025)
To contribute to this collection, expand this toggle to find the submission form. I review all entries to filter out spam and add valuable contributions to the main database.
Content Storage
I use Notion pages for content storage. Though Capacities would have been ideal, its lack of a reliable output API meant I'd need to copy content into Notion anyway for this post. Additionally, Capacities doesn't support multi-user collaboration—a feature I occasionally need—and I prefer keeping my content in one place rather than scattered across multiple apps. I've ruled out Obsidian since I prefer rich-text block-based systems. Capacities does have one standout feature: the ability to mark individual blocks as database objects with properties while keeping the block content embedded in place. This feature is particularly valuable for learning, and I highly recommend it. This isn't meant to be a comprehensive overview of my content/knowledge management practices—just a brief explanation of my organizational preferences for interview preparation.
Classification
I maintained 7 major sections in my interview preparation page: Statistical Knowledge, ML Knowledge, ML Design, DSA Coding, ML Coding, Behavioral, and “I know of these papers”. Each section (other than the papers section) had a Question subpage with 4 categories: "Aced it," "Took time," "Didn't get it," and "Just saw it somewhere."
As mentioned in my LLM (ML) Job Interviews (Fall 2024) - Process post, I meticulously documented and categorized every interview question in the question bank, along with my performance evaluation. For those familiar with JEE preparation, this approach might sound familiar—it's my tried and tested method (though I failed at JEE), and it's the only system I know (fingers crossed for better results this time). Whenever questions repeated across interviews, I added comments in Notion noting their previous appearances and updated their categories based on my latest performance.
Leetcode
Let me walk you through my LeetCode interview preparation approach. Companies consistently test candidates with LeetCode questions during interviews—yes, even startups. Here's the preparation strategy that worked for me.
Most companies don't allow code execution during interviews. While some Microsoft and Apple teams did permit running code (which was helpful), I learned to spot and prevent common mistakes without needing to test the code.
I tackled all questions from both recommended lists systematically. My first pass focused on finding solutions—whether brute force or optimized. After at least two days, I'd return to each question to implement an optimized solution. When successful, I'd move on. When stuck, I'd consult ChatGPT for the algorithm, implement it myself, and submit the solution. To maintain accurate progress tracking, I'd then intentionally submit an incorrect version to keep the question unmarked. After another two-day break following failed attempts, I'd try implementing the optimized version again.
I continued this process until I mastered all questions on the Grind 75 list, completed all easy and medium questions, and solved 50% of hard questions on the NeetCode 150 list.
ML/DL/LLM Coding
While these concepts can be coded given sufficient time and debugging capabilities, interview settings present unique challenges. You'll typically have just 25-35 minutes, need flawless execution, and must maintain precise matrix dimensions throughout. That's why practicing implementation is crucial, even if you're already comfortable with the concepts. To help you prepare, I've compiled repositories showcasing interview-style code examples, categorized into basic ML and LLMs. Following my Contextual interview, I expanded this collection by implementing various RAG/LLM inference papers—a category I found particularly engaging.
- For basic ML questions:
- One major omission in this resource is the implementation of basic neural networks using NumPy and PyTorch. To address this gap, I practiced coding feedforward neural networks (FNNs), RNNs, and LSTMs from scratch using random values, then validated my implementations against online examples (a minimal FNN sketch appears after this list).
- For LLMs, I worked with the "Hands-on Large Language Models" repository. Rather than following the book directly, I turned the notebooks into fill-in-the-blank exercises and completed them systematically.
- Be prepared for questions about implementing different attention mechanisms — including cached, grouped query, multi-head, and single-head attention. You'll need to know how to code these using just NumPy and PyTorch (see the attention sketch after this list). I recently found this post, which provides a thorough overview.
- Master implementing key components of the transformer architecture, including token embeddings, layer normalization, encoder layers, and techniques for integrating Mixture of Experts (MoE) into existing model architectures.
- I studied several codebases thoroughly, particularly the LLaMA model implementation and OLMo repository, to deepen my understanding of these concepts.
- Make sure to include dimensions in your variable names or comments. This helps with debugging, and interviewers frequently check dimensions to test your understanding—not just your ability to memorize. While I started by using comments, I later found Noam Shazeer's Shape Suffixes approach (Shape Suffixes — Good Coding Style), which proved to be an excellent—perhaps even superior—method; the sketches after this list use it.
- For RAG/LLM inference papers, I practiced implementing core components: a basic embedding model with clustering capabilities, along with common decoding strategies like top-p, top-k, temperature control, and beam vs. greedy search (a sampling sketch appears after this list). I also selected 10 papers from my favorite researchers and worked through their implementations independently. Since running the full models wasn't practical due to their size, I used LLM chatbots to validate my approaches, drawing on my ML expertise to critically evaluate their responses and identify potential issues.
- Finally, for additional practice, I worked through every question here using the same approach I used with LeetCode, though I rarely needed multiple attempts this time.
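To make the from-scratch neural-network practice above concrete, here is a minimal two-layer feedforward network forward pass in NumPy. The layer sizes, He initialization, and shape-suffix names are my own illustrative choices, not from any particular interview.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def forward(x_BD, W1_DH, b1_H, W2_HO, b2_O):
    """Two-layer MLP; suffixes encode shapes (B=batch, D=input, H=hidden, O=output)."""
    h_BH = relu(x_BD @ W1_DH + b1_H)
    logits_BO = h_BH @ W2_HO + b2_O
    return logits_BO

B, D, H, O = 4, 8, 16, 3                              # arbitrary sizes for practice
x_BD = rng.normal(size=(B, D))
W1_DH = rng.normal(size=(D, H)) * np.sqrt(2.0 / D)    # He init for ReLU layers
b1_H = np.zeros(H)
W2_HO = rng.normal(size=(H, O)) * np.sqrt(2.0 / H)
b2_O = np.zeros(O)

print(forward(x_BD, W1_DH, b1_H, W2_HO, b2_O).shape)  # (4, 3)
```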
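For the attention bullet, here is a sketch of causal multi-head self-attention in PyTorch, written in the shape-suffix style mentioned above. It is a bare-bones single forward pass: no KV cache, no grouped queries; all names and sizes are assumptions for illustration.

```python
import math
import torch

def multi_head_attention(x_BTD, Wq_DD, Wk_DD, Wv_DD, Wo_DD, n_heads):
    """Suffixes: B=batch, T=sequence length, D=model dim, H=heads, K=head dim."""
    B, T, D = x_BTD.shape
    K = D // n_heads
    # Project, split D into (H, K), and move heads to a batch-like axis.
    q_BHTK = (x_BTD @ Wq_DD).view(B, T, n_heads, K).transpose(1, 2)
    k_BHTK = (x_BTD @ Wk_DD).view(B, T, n_heads, K).transpose(1, 2)
    v_BHTK = (x_BTD @ Wv_DD).view(B, T, n_heads, K).transpose(1, 2)
    scores_BHTT = q_BHTK @ k_BHTK.transpose(-2, -1) / math.sqrt(K)
    # Causal mask: position t may only attend to positions <= t.
    mask_TT = torch.triu(torch.ones(T, T), diagonal=1).bool()
    scores_BHTT = scores_BHTT.masked_fill(mask_TT, float("-inf"))
    out_BHTK = scores_BHTT.softmax(dim=-1) @ v_BHTK
    # Concatenate head outputs (not average) back into the model dimension.
    out_BTD = out_BHTK.transpose(1, 2).reshape(B, T, D)
    return out_BTD @ Wo_DD

B, T, D, H = 2, 5, 32, 4
x = torch.randn(B, T, D)
Ws = [torch.randn(D, D) / math.sqrt(D) for _ in range(4)]
print(multi_head_attention(x, *Ws, n_heads=H).shape)  # torch.Size([2, 5, 32])
```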
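And for the decoding strategies, a minimal sketch of temperature, top-k, and top-p (nucleus) filtering applied to one logits vector. Real implementations work on batched tensors, and libraries differ slightly in cutoff conventions; treat this as interview-whiteboard code, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits_V, temperature=1.0, top_k=None, top_p=None):
    """Sample a token id from logits (V = vocab size)."""
    logits_V = logits_V / max(temperature, 1e-8)       # temperature rescales confidence
    probs_V = np.exp(logits_V - logits_V.max())        # stable softmax
    probs_V /= probs_V.sum()
    if top_k is not None:                              # keep only the k most likely tokens
        kth = np.sort(probs_V)[-top_k]
        probs_V = np.where(probs_V >= kth, probs_V, 0.0)
    if top_p is not None:                              # keep the smallest set with mass >= top_p
        order = np.argsort(probs_V)[::-1]
        cumulative = np.cumsum(probs_V[order])
        cutoff = order[cumulative > top_p]
        if len(cutoff) > 1:                            # always keep the token crossing the threshold
            probs_V[cutoff[1:]] = 0.0
    probs_V /= probs_V.sum()
    return rng.choice(len(probs_V), p=probs_V)

print(sample_next_token(rng.normal(size=10), temperature=0.8, top_k=5, top_p=0.9))
```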
Statistics
I originally hadn't planned to include statistics in my preparation, but after being caught off guard by an unexpected Amazon interview loop early in my job search, I quickly realized I needed to. I then set out to find and practice statistics questions, using the resources listed below.
I'll be candid: there were many times when I struggled to remember concepts or needed to better understand the reasoning behind material I had simply memorized in college. I turned to ChatGPT and Claude to help clarify these topics. Though I always fact-checked their responses and ensured I had proper context, this approach actually improved my ability to explain concepts clearly and systematically (even if I still tended to ramble too much).
Some linear algebra concepts from here (I know it's not statistics, but it's still relevant)
For data science question preparation, I used this resource:
I found Chapter 5 of Chip Huyen's book particularly useful:
ML Knowledge
I organized the Machine Learning knowledge section into two parts: (a) ML fundamentals and (b) Large Language Models (LLMs) and Transformers. Looking back, I should have created three parts by making NLP fundamentals its own section. Depending on your specialization, this third section could instead cover Computer Vision with Vision Language Models, or other cutting-edge architectures.
ML Fundamentals
I began my preparation with this ML cheatsheet, which covers the fundamentals:
Next, I worked through chapters 7 and 8 of the ML interviews book:
I primarily used these two resources in my preparation. For each question I encountered, I documented it and explored potential follow-up questions. My study method was straightforward: I'd first attempt to answer the question independently. If successful, I'd use ChatGPT to generate follow-up questions and practice those too. When I couldn't answer a question, I'd start a fresh chat to research the topic, document my findings with notes and comments in Notion, and categorize it under the relevant toggle section.
Overall, your list should include:
- General breadth topics:
- Machine Learning Fundamentals: Supervised vs. unsupervised learning, logistic regression for classification, bag-of-words, bagging vs. boosting, decision trees/random forests, confusion matrix, regularization methods, clustering (k-means/k-medians), distance measures, convexity, loss functions.
- Statistics and Probability: Bias-variance tradeoff, imputation techniques, handling different dataset sizes, causal inference, statistical significance, regression analysis, Bayesian analysis, time series analysis, active learning, hypothesis testing, probability calculations.
- Models and Performance: Overfitting vs. underfitting, model performance monitoring, online/offline evaluation metrics.
- Dimensionality Reduction: Feature selection methods, matrix factorization, autoencoders, curse of dimensionality, PCA vs. t-SNE vs. MDS vs. autoencoders.
- Time Series Analysis: Stationary time series, frequency domain analysis, forecasting models.
- Neural Networks: Attention mechanisms, ReLU activation, deep neural networks (DNNs), recurrent neural networks (RNNs) including LSTMs, convolutional neural networks (CNNs), transformer models.
- Algorithms:
- Clustering: K-means, LDA, word2vec, hierarchical clustering, Mean-Shift, DBSCAN, expectation-maximization using GMM.
- Classification: Logistic regression, Naïve Bayes, k-nearest neighbors (KNN), decision trees, support vector machines (SVMs).
- Optimization: Unconstrained continuous optimization, gradient descent methods (batch vs. stochastic), maximizing continuous functions.
- Other Topics: Probability distributions and limit theorems, discriminative vs. generative models, methodology (online vs. offline learning, cross-validation), online experimentation (A/B testing), binary and multi-class classification, hyperparameters, feature engineering.
- Area-specific topics:
- Personalization/Recommendation Systems: Collaborative filtering, transformer architecture, active learning, self-attention, multiple attention heads, pooling methods, visualizing attention layers, ranking, multi-armed bandits (MABs), evaluation metrics, cold-start recommenders.
- Deep Learning: Transformers, dropout, normalization, maxout, softmax, word embeddings.
- Natural Language Processing: Latent Dirichlet Allocation (LDA), language modeling, handling sparse data, text normalization techniques, statistical language modeling, information retrieval, word2vec, CNNs, transformers.
- Computer Vision: Image and signal processing, linear algebra, image retrieval, CNNs, residual networks, etc.
LLMs
I organized the material into 5 key sections:
- Architecture and Training: Covered transformer architecture fundamentals, pre-training versus fine-tuning approaches, training objectives (with emphasis on next-token prediction), tokenization methods, scaling laws, and parameter-efficient techniques like LoRA and Prefix Tuning.
- Generation Control: Explored methods for controlling model outputs through temperature and top-p sampling, effective prompt engineering, few-shot and in-context learning techniques, chain-of-thought prompting for complex reasoning, and strategies to minimize hallucinations.
- LLM Evaluation: Examined key metrics (including perplexity, ROUGE, and BLEU), standard benchmarks like MMLU and BigBench, approaches to human evaluation, and tools for bias detection and mitigation.
- Optimization and Deployment: Studied practical aspects like quantization, model distillation, inference optimization, prompt caching, load balancing, and cost-effective deployment strategies.
- Safety and Ethics: Focused on essential safeguards including content filtering, output sanitization, protection against jailbreaking attempts, and data privacy protocols for responsible LLM deployment.
One of the best analogies I learned came from Kevin Murphy's PML book, as quoted in Yuan's blog post (Attention as Soft Dictionary Lookup).
While the transformer architecture itself is widely known, I found this framing valuable for organizing my review of architecture-related papers and questions.
Here are the key resources and materials I studied:
- General content
- For LLM architecture, I focused on:
- Encoder vs decoder
- Training objectives. As of Dec 30, 2024, multi-token prediction (MTP) is an interesting mechanism to examine, and you should be prepared for questions about it; DeepSeek-V3 has an implementation that you can reference.
- Self-attention (QKV sizes and weight matrices of QKV),
This video is great if you want to explain the intuition to someone else:
- Multi-head attention (concatenation of head outputs rather than averaging), grouped query attention, and other variants. Something to note: MLA (Multi-head Latent Attention) became a popular mechanism while I was interviewing, so I learned about it during the process, aided by my “I know of these papers” section.
- Position-wise Feed-Forward Network (FFN size, ReLU activation, comparisons between ReLU, Swish, SiLU, and SwiGLU variants),
- Positional embeddings (dimensions, sizing, learned embeddings, various embedding types, and Rotary Position Embeddings - RoPE; a RoPE sketch appears at the end of this list),
- Layer normalization (underlying intuition and why LayerNorm parameters are separate from layer parameters),
- Cross-attention (comparing cross-attention with masked self-attention, understanding decoder-only architectures, and handling initial inputs in decoder-only models).
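As one worked example from the list above, here is a minimal NumPy sketch of Rotary Position Embeddings: each consecutive pair of dimensions is rotated by a position-dependent angle. The base of 10000 follows the original RoFormer paper; the shapes and names are my own illustration.

```python
import numpy as np

def apply_rope(x_TD, base=10000.0):
    """Rotate consecutive dimension pairs of x (T=positions, D=even head dim)."""
    T, D = x_TD.shape
    pos_T1 = np.arange(T)[:, None]
    inv_freq_F = base ** (-np.arange(0, D, 2) / D)    # one frequency per pair (F = D/2)
    theta_TF = pos_T1 * inv_freq_F                    # angle grows with position
    cos_TF, sin_TF = np.cos(theta_TF), np.sin(theta_TF)
    x_even_TF, x_odd_TF = x_TD[:, 0::2], x_TD[:, 1::2]
    out_TD = np.empty_like(x_TD)
    out_TD[:, 0::2] = x_even_TF * cos_TF - x_odd_TF * sin_TF
    out_TD[:, 1::2] = x_even_TF * sin_TF + x_odd_TF * cos_TF
    return out_TD

q = np.random.default_rng(0).normal(size=(6, 8))      # 6 positions, head dim 8
print(apply_rope(q).shape)                            # (6, 8); position 0 is unchanged
```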
- For MoE, I started with this blogpost:
I then focused on Feed-Forward Networks (FFNs), studying their structure, placement, gating networks, routing mechanisms, load balancing, and training stability challenges. As of Dec 30, 2024, DeepSeek's technical report offers valuable insights into practical MoE implementation.
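To ground the gating and routing pieces, here is a toy top-2 token-choice MoE layer in PyTorch. This is a didactic sketch, not how DeepSeek (or any production system) implements MoE; real kernels batch tokens per expert and add load-balancing losses, which I omit here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Minimal top-2 MoE: a linear router selects 2 experts per token."""
    def __init__(self, d_model=32, d_ff=64, n_experts=4, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x_ND):                        # N = tokens, D = model dim
        gate_NE = self.router(x_ND)                 # E = number of experts
        weight_NK, index_NK = gate_NE.topk(self.k, dim=-1)
        weight_NK = F.softmax(weight_NK, dim=-1)    # renormalize over selected experts only
        out_ND = torch.zeros_like(x_ND)
        for e, expert in enumerate(self.experts):   # a real kernel batches tokens per expert
            hit_NK = index_NK == e
            rows = hit_NK.any(dim=-1)
            if rows.any():
                w_N1 = (weight_NK * hit_NK).sum(dim=-1, keepdim=True)
                out_ND[rows] += w_N1[rows] * expert(x_ND[rows])
        return out_ND

moe = TinyMoE()
print(moe(torch.randn(10, 32)).shape)               # torch.Size([10, 32])
```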
- I focused on embeddings, studying token and position embeddings while building upon basic LLM embedding concepts. I specifically explored type/segment embeddings, directionality, and various tokenization approaches and methods.
- The key inference topics included:
- Attention mechanisms in training versus inference, including KV cache architecture and dimensionality (a toy KV-cache sketch follows this list)
- Batch processing strategies and optimizations
- Decoding approaches, from simple greedy search to more sophisticated methods like beam search, along with sampling parameters (top-p, top-k, temperature) and their interplay
- Context window constraints and management
- Model quantization techniques
- Special tokens and their roles in chat-tuned models, plus early stopping criteria
- Long-range dependency handling through sparse attention mechanisms
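Here is a toy single-head decode step showing why the KV cache helps: only the new token is projected each step, while the cached keys and values grow by one row. Shapes and names are my own illustration, and I leave out multi-head splitting and batching.

```python
import math
import torch

def attend_with_cache(x_new_1D, Wq_DD, Wk_DD, Wv_DD, cache):
    """One decode step: project only the new token, append its K/V, attend over all of it."""
    q_1D = x_new_1D @ Wq_DD
    cache["k"] = torch.cat([cache["k"], x_new_1D @ Wk_DD])   # K grows to (T, D)
    cache["v"] = torch.cat([cache["v"], x_new_1D @ Wv_DD])
    scores_1T = (q_1D @ cache["k"].T) / math.sqrt(q_1D.shape[-1])
    return scores_1T.softmax(dim=-1) @ cache["v"]            # (1, D)

D = 16
Wq, Wk, Wv = (torch.randn(D, D) / math.sqrt(D) for _ in range(3))
cache = {"k": torch.empty(0, D), "v": torch.empty(0, D)}
for t in range(5):                                           # decode 5 tokens one at a time
    out_1D = attend_with_cache(torch.randn(1, D), Wq, Wk, Wv, cache)
print(cache["k"].shape)                                      # torch.Size([5, 16])
```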
And some reference material:
- The main fine-tuning topics included:
- Task-specific heads and adapter layers
- Parameter-efficient methods like LoRA (including mathematical foundations, comparisons between QLoRA and LoRA, and GaLore; see the LoRA sketch after this list)
- Fine-tuning objectives and frameworks
- Hyperparameters and training approaches (supervised vs. unsupervised, instruction fine-tuning)
- Practical implementation steps (loading libraries, datasets, model configuration, tokenization, zero-shot inference, model evaluation)
- Direct Preference Optimization (DPO) in fine-tuning
I made sure to practice writing boilerplate code for the libraries section, which also ties into the LLM coding portion.
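Since LoRA's math comes up often, here is a minimal sketch of a LoRA-augmented linear layer: the frozen base weight plus a trainable low-rank update scaled by alpha/r, with B zero-initialized so training starts from the pre-trained behavior. The rank and alpha values here are illustrative defaults, not recommendations.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)               # pre-trained weight stays frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # down-projection, small init
        self.B = nn.Parameter(torch.zeros(d_out, r))         # up-projection, zero init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(32, 32)
print(layer(torch.randn(4, 32)).shape)                       # torch.Size([4, 32])
# Only A and B receive gradients: roughly 2*r*d trainable parameters instead of d*d.
```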
- Retrieval and RAG. Although Retrieval-Augmented Generation (RAG) could be viewed as its own topic, I've incorporated it as part of the broader LLM pipeline based on my hands-on experience and how it currently integrates with language models. Here are the resources I used:
The main RAG topics included:
- Knowledge base fundamentals (chunking, embeddings such as Word2Vec/Sentence-BERT/GPT models, contrastive loss, and vector storage)
- Query understanding
- Information retrieval components:
- Candidate generation (BM25, TF-IDF, Learning to Rank; a toy BM25 scorer appears after this list)
- Relevance scoring and dense retrieval
- Reranking and filtering
- Diversity sampling
- Context window selection
- Metadata extraction and result aggregation
- Context integration and response generation
- Evaluation metrics (NDCG, Mean Reciprocal Rank)
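For the candidate-generation bullet, here is a toy BM25 scorer; k1=1.5 and b=0.75 are the usual defaults, and the whitespace tokenization is deliberately naive. A real system would use an inverted index (e.g., Lucene/Elasticsearch) rather than scoring every document.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with BM25 (naive whitespace tokenization)."""
    toks = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in toks) / len(toks)
    N = len(docs)
    df = Counter(term for t in toks for term in set(t))       # document frequencies
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)  # smoothed IDF
            s += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

docs = ["the cat sat on the mat", "dogs chase cats", "retrieval with sparse methods"]
print(bm25_scores("cat retrieval", docs))
```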
- RLHF (Reinforcement Learning from Human Feedback) is a topic I wish I had studied more thoroughly. Although I covered the basics—comparing PPO and DPO approaches and using ChatGPT to explore the subject through follow-up questions—I reached my knowledge limit fairly quickly. I still feel there's much more to learn. Note that GRPO (Group Relative Policy Optimization) became a popular mechanism after my interviews, so I learned about it later, purely out of interest. Here are some resources I used:
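Alongside those resources, here is a minimal sketch of the DPO objective as I understand it from the paper: a logistic loss on the gap between policy-versus-reference log-ratios for the chosen and rejected responses. The inputs are assumed to be summed per-token log-probabilities, and the toy numbers below are made up.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp, ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO: -log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))).
    Each argument is the summed log-prob of a full response under the policy
    or the frozen reference model."""
    chosen_reward = beta * (policy_chosen_lp - ref_chosen_lp)
    rejected_reward = beta * (policy_rejected_lp - ref_rejected_lp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy batch of 3 preference pairs (the log-probs are invented for illustration)
pi_w = torch.tensor([-12.0, -8.5, -20.0])
pi_l = torch.tensor([-14.0, -9.0, -19.5])
ref_w = torch.tensor([-13.0, -8.0, -21.0])
ref_l = torch.tensor([-13.5, -8.5, -20.0])
print(dpo_loss(pi_w, pi_l, ref_w, ref_l))
```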
- Since I have extensive experience with synthetic data generation and evaluation, I kept my preparation for these topics minimal. Although I have access to various resources on synthetic data generation and evaluation (including one I created), I didn't need to review them in depth. I would appreciate suggestions for additional resources to include in this section.
- Since test-time computation was a relatively new concept and I already had limited knowledge of reinforcement learning techniques, I chose not to focus heavily on this area. I realized I would only be able to offer surface-level responses to related questions. Nevertheless, I did review these key resources:
- Since my knowledge of model optimization is limited, I welcome suggestions for additional reading material. Here are some resources that may be helpful:
ML (System) Design
While I didn't extensively prepare for these interviews, I was fortunate that the companies I targeted mainly focused on research experiment design rather than general system design questions. I encountered a few questions about recommendation systems and sentiment analysis design, which I handled as well as I could. Working with primarily research-based design questions was a relief.
When I first started interviewing, I made the mistake of hesitating to ask for clarification. After finding two helpful blog posts and a YouTube video about asking better questions, I created a ChatGPT prompt to practice. I would pose questions like "How would you set up a research question design for LLaMA Guard fine-tuning?" and use ChatGPT's suggested clarifying questions for practice. This helped me realize that it's completely fine to spend up to 10 minutes asking clarifying questions.
- Just the free section here:
- The tips here
- And these two YouTube videos from Exponent
Another concern I had, which I've mentioned before, is that many people now default to using large language models for everything—even for tasks that simpler models like BERT could handle effectively, or when BM25 would work better than text embeddings with cosine similarity for information retrieval problems.
The key question is: When should we use these powerful generative models versus smaller discriminative models? After discussing this with my friends (Yash and June), they helped me craft the perfect response, which I've since memorized: "There are two ways I can approach this problem. One uses a discriminative method when optimization and latency are priorities, though it's more restrictive in its outputs. The other uses a generative method throughout, which may introduce latency but offers more flexibility. I can briefly discuss the pros and cons of both approaches, and then you can guide which path you'd prefer me to explore."
I really appreciate them suggesting this approach. I now consistently use this method when tackling any design questions.
Behavioral
You've probably heard of the STAR method by now. Stick to it—I did in my process too. I used natural transition phrases to avoid sounding rehearsed while keeping my responses structured. For example, I'd say "When this happened" or "For some context" as a personal reminder to describe the situation. Then I'd say "I had to do" to signal the task portion—these were cues for myself, not for the interviewer. I kept an eye on the clock to ensure I wasn't spending more than 30 seconds on any one part. Then I'd transition with phrases like "Here's what I did" or "I chose to do this," followed by "This was the result." I made sure the core response fit within two minutes. After that, I'd elaborate with additional details, often without prompting, but those first two minutes were always laser-focused.
I used this question bank set for behavioral questions and had stories for all of them.
Companies vary in their interview formats. While some may ask just a few behavioral questions per round, others—like Meta, with its extensive multi-round sessions containing up to 10 questions—can catch candidates unprepared.
When interviewing with multiple companies, you'll need distinct examples for each question within a single interview loop. Repeatedly telling the same stories during your job search can make your responses sound mechanical. Here's a useful tip from my advisor: take a sip of water after each paragraph to stay present and keep your responses natural.
This principle extends to your introduction. Using the same introduction repeatedly can make you appear stiff and disconnected. Though interviewers expect candidates to prepare their responses, a natural delivery creates better engagement with your interviewer.
I Know of These Papers
I deliberately chose the word "know" for an important reason—it captures multiple levels of familiarity: deeply reading papers, skimming them, finding them through social media, learning about them from others' presentations or discussions, and implementing them firsthand. This knowledge extends beyond academic papers to include Twitter threads, implementation guides, verified code examples, blog posts, and other sources. This broad scope is why I kept the heading intentionally general.
When interviewers ask open-ended questions, I make a point to cite my sources, saying things like "I learned this from a blog post" or "I've seen this discussed widely on Twitter." I maintain a broad collection of references and always stay transparent about my depth of understanding for each paper. As someone who's constantly online, I use Zotero and Raindrop to organize papers, related discussions, and emerging research. I skim every paper before adding it to my Zotero library, categorizing them by potential interview questions—which is exactly why I created this page.
I tracked open-ended questions in this section and researched relevant papers I may have overlooked, regularly expanding the list. I also maintained a collection of papers worth mentioning for specific topics, which proved invaluable. I'll share some resources to showcase recent papers that caught my attention. I should admit that my paper knowledge comes primarily from my somewhat excessive social media use rather than formal newsletters or collections. These resources might help you build your own reference library for handling open-ended questions.
Here are some resources I use to stay on top of the vast number of research papers in my field:
- People who share others’ work and who themselves work on topics I’m interested in
- People who share other people’s recent work (I prefer their curation)
I speed skim this list:
- People who summarize recent work or talk about ML fundamentals
- People whose work I follow
Finally
Interview experiences can vary greatly depending on the luck of the draw. A skilled interviewer makes the process flow naturally, and while strategically steering conversations toward your strengths is valuable, some interviewers may be less flexible in their approach. Since resource quality and relevance can differ widely, it's crucial to prioritize materials that directly support your specific career aspirations.
I'm eager to learn from others who have successfully juggled machine learning studies with their other life responsibilities. If you know of any useful interview preparation resources, please share them—I'll gladly add them here.
Good luck with your preparation!