Course Resources
- The YouTube video
- The specific online whiteboard tool: Excalidraw
Pre-Training and Base Model
The pre-training stage gives the model world knowledge. This stage is usually the most time consuming, taking months to train.
Data Preprocessing
The ChatGPT vendors all have something similar to FineWeb for collecting data from the internet.
Pay attention to the FineWeb recipe pipeline for data preprocessing.
Tokenization
To see how ChatGPT tokenizes input text, try the online Tiktokenizer for visualization. Tokenization is the fundamental first step, so it is very important.
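If you prefer to poke at tokenization locally rather than in the browser, here is a minimal sketch using OpenAI's tiktoken library, assuming the cl100k_base encoding used by GPT-3.5/GPT-4-era models:

```python
# Minimal sketch: inspect how text is split into tokens with tiktoken,
# assuming the cl100k_base encoding (GPT-3.5/GPT-4-era models).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Tokenization is the fundamental first step."
token_ids = enc.encode(text)

print(token_ids)                              # list of integer token ids
print([enc.decode([t]) for t in token_ids])   # the text piece behind each token
```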
Neural Network Training
The LLM neural network transformer visualization tool.
The practice of reproducing GPT-2.
Tip: during training of a big neural network, you can print the generated text every 20 training steps and watch how the model gradually improves at predicting the next words (see the toy sketch below).
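A toy illustration of this tip, assuming PyTorch: a tiny character-level bigram model trained on a short string, printing a sample every 20 steps so you can watch the generations improve. Real LLM training does the same thing at a vastly larger scale.

```python
import torch
import torch.nn as nn

# Toy character-level bigram "language model": row i of the embedding table
# holds the next-character logits given character i.
text = "hello world. hello there. hello again. "
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for c, i in stoi.items()}
data = torch.tensor([stoi[c] for c in text])

emb = nn.Embedding(len(chars), len(chars))
opt = torch.optim.Adam(emb.parameters(), lr=0.1)

def sample(n=30):
    idx = torch.tensor([stoi["h"]])
    out = ["h"]
    for _ in range(n):
        probs = torch.softmax(emb(idx), dim=-1)
        idx = torch.multinomial(probs, 1).squeeze(0)
        out.append(itos[idx.item()])
    return "".join(out)

for step in range(1, 201):
    logits = emb(data[:-1])                               # predict the next character
    loss = nn.functional.cross_entropy(logits, data[1:])
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 20 == 0:                                    # print a sample every 20 steps
        print(f"step {step:3d} loss {loss.item():.3f} | {sample()}")
```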
The lecturer then moves to Hyperbolic to try inference on a base model; he uses Llama as the example.
Now we have a base model, but it is still just an internet document simulator: it can generate text sequentially from the initial input, whereas what we actually want is an assistant system that answers questions.
How Does a Neural Network Hold Knowledge?
LLMs store learned knowledge in the model’s parameters (weights), not in a database or memory. During training, the model adjusts its millions (or even billions) of weights to recognize patterns, relationships, and structures in language.
- Neural Network Weights: These weights encode statistical relationships between words, phrases, and concepts.
- Hidden Representations: The model learns abstract representations of language, enabling it to generate relevant responses based on context.
- No Direct Storage of Training Data: The model doesn’t store exact documents or books but compresses useful patterns and knowledge into its weights.
When you ask a question, the model doesn’t “look up” an answer from storage. Instead, it generates responses dynamically based on learned patterns. The model predicts the most probable next words given the input, guided by the patterns encoded in its weights.
Objective of Pre-Training Stage
The model is trained on a massive dataset (books, articles, code, etc.), learning to predict missing words, the next word, or even reconstruct corrupted text.
Common pre-training objectives:
- Causal Language Modeling (CLM): Predict the next token (used in models like GPT).
- Masked Language Modeling (MLM): Predict missing words in a sentence (used in BERT).
- Sequence-to-Sequence Learning: For tasks like translation (used in models like T5).
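To make these objectives concrete, here is a tiny sketch using a toy token list (not a real tokenizer) of how the training targets differ for causal vs masked language modeling:

```python
# Toy illustration of pre-training targets; real models use tokenizer ids and large batches.
tokens = ["The", "cat", "sat", "on", "the", "mat"]

# Causal LM (GPT-style): predict token i+1 from the prefix tokens[:i+1].
clm_pairs = [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]

# Masked LM (BERT-style): hide a token and predict it from the rest of the sentence.
masked = tokens.copy()
masked[2] = "[MASK]"
mlm_pair = (masked, {"position": 2, "target": "sat"})

print(clm_pairs[0])   # (['The'], 'cat')
print(mlm_pair)
```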
What Does the Model Learn?
- Statistical patterns in language: It learns which words/tokens frequently appear together.
- Syntax and grammar: It picks up grammatical structures by learning associations between words.
- Semantics and meaning: The model develops an understanding of concepts through word embeddings (e.g., “Paris” is related to “France”).
- World knowledge: It passively absorbs factual information from its dataset.
- Basic reasoning: By recognizing complex relationships, it can perform simple inference.
Post-Training
This stage is called supervised finetuning (SFT). With human curation, we show the model problems together with demonstrated solutions for it to imitate.
We need to provide a conversation dataset (prompt + answer) and then train the model on it; this takes much less time than the pre-training stage.
- Human labelers are employed (this can also be done by software) to create these conversations, i.e. to come up with the prompt and an ideal response; the InstructGPT paper describes this.
- There are open-source reproductions of such conversation training datasets created by human labelers, and UltraChat can help with multi-round dialog data. So in fact you are talking to a simulation of human labelers, not a magical AI.
- Tokenization of the conversations: similar to a TCP packet structure, we define a structure to encode the conversations before feeding them to the model. In the GPT-4 tokenizer, for example, you will see special tags like "<|im_start|>" used to group the content (see the sketch below).
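A minimal sketch of the idea (the exact special tokens and layout vary by model; this is not the literal GPT-4 format):

```python
# Toy sketch: flatten a multi-turn conversation into a single text stream using
# ChatML-style markers before tokenization. Real chat templates differ in detail.
conversation = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
]

def render(conv):
    parts = []
    for turn in conv:
        parts.append(f"<|im_start|>{turn['role']}\n{turn['content']}<|im_end|>\n")
    return "".join(parts)

print(render(conversation))
```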
Hallucination
Hallucination is an issue where the model does not know the answer and produces a fabricated one. It can be mitigated in the following ways:
- Use model interrogation to discover the model's knowledge, and programmatically augment its training dataset with knowledge-based refusals for the cases where the model doesn't know.
- Allow the model to search via special search trigger tokens when it doesn't know the answer; how the search trigger tokens get used is also learned from the training dataset (see the sketch after this list).
- You can explicitly tell LLM to use/not to use any tool.
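A toy sketch of the search-trigger idea; the marker tokens and the run_search helper here are hypothetical stand-ins, not any vendor's actual format:

```python
# If the model emits special markers around a query, the harness runs the search
# and pastes the results back into the context so generation can continue.
def run_search(query: str) -> str:
    return f"[search results for '{query}']"   # hypothetical stand-in for a real search tool

model_output = "I'm not sure. <SEARCH_START>capital of Burkina Faso<SEARCH_END>"

if "<SEARCH_START>" in model_output:
    query = model_output.split("<SEARCH_START>")[1].split("<SEARCH_END>")[0]
    context_addition = run_search(query)
    print("Appending to context:", context_addition)
```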
Knowledge vs Working Memory
- Knowledge in the parameters (weights): the vague recollection (e.g. of something you read a month ago).
- Knowledge in the tokens of the context window: the working memory.
Knowledge of self
This is also achieved with a dataset that trains the model to have a self-identity, for example: https://huggingface.co/datasets/allenai/olmo-2-hard-coded. Without such a dataset the model knows nothing about itself.
Model needs tokens to think
For example, for a math problem, you need to train the model to use the context to infer the result and to distribute the reasoning/computation across tokens before the final answer.
Don't give the answer in a short sentence right at the beginning; that does not help model training. If the dataset gives the answer first, the model just learns to justify it afterwards.
Question: "I bought 3 apples and 2 oranges. Each orange costs $2, and the total cost is $13. How much does each apple cost?"
Bad answer: "The answer is $3. This is because 2 oranges at $2 are $4 total. So 3 apples cost..."
Good answer: "The total cost of the 2 oranges is $4. 13 - 4 = 9, so the 3 apples cost $9. 9 / 3 = 3, so each apple costs $3."
To be less error-prone, you can ask the model to "use code/tools" rather than computing mentally.
Because of the token-based nature of the model, it is not good at counting or spelling; ask it to "use code/tools" to get the right answer when possible.
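A tiny example of the "use code/tool" idea: counting letters is trivial in code but hard for a model that sees tokens rather than individual characters.

```python
# Counting characters programmatically instead of asking the model to do it "mentally".
word = "strawberry"
print(word.count("r"))   # 3
```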
Reinforcement Learning
In this stage, we take the SFT (supervised finetuning) model into reinforcement learning, the last major stage of training. Basically what we do is: prompt the model to practice, trial & error, until it reaches the correct answer.
For example, given a problem statement (prompt) and the final answer, we generate 15 solutions; only 4 of them reach the right answer. We pick the top solution based on some criteria and train on it, and repeat this many times to encourage the model to produce such tokens (see the toy sketch below).
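A toy sketch of that loop, with the sampler and the "training" step as stand-ins (real systems sample solutions from the LLM and apply a policy-gradient-style update):

```python
import random

# Sample many candidate solutions, keep only those that reach the correct final
# answer, and pretend to "train" on the best one.
PROBLEM = "3 apples cost $9 in total. How much is one apple?"
CORRECT_ANSWER = 3

def sample_solution():
    # stand-in for sampling a chain of thought + answer from the model
    guess = random.choice([2, 3, 4])
    return {"reasoning": f"9 / 3 = {guess}", "answer": guess}

candidates = [sample_solution() for _ in range(15)]
correct = [c for c in candidates if c["answer"] == CORRECT_ANSWER]
print(f"{len(correct)} of {len(candidates)} samples reached the right answer")
if correct:
    best = correct[0]   # stand-in for "pick the top solution by some criteria"
    print("Would train on:", best["reasoning"])
```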
DeepSeek-R1 published its reinforcement learning approach, which drew public attention because RL details are usually kept secret inside AI companies.
For example, ChatGPT's o1, o3-mini, and o3-mini-high are all RL models, while the earlier models are just SFT models.
RL can go beyond human expertise.
RL in Un-verifiable Domains
For example, “write a joke about pelicans”, how do we score the answer?
We need a scalable approach: RLHF (reinforcement learning from human feedback). The core ideas are:
- Take a small sample of the results and have humans order them from best to worst.
- Train a neural net simulator of human preferences (“reward model”).
- Run RL as usual, but using the simulator instead of humans.
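A minimal sketch of the reward-model step, assuming PyTorch: score a preferred and a rejected response and train with a pairwise (Bradley-Terry style) loss so the preferred one scores higher. The response "features" here are random stand-ins; a real reward model is itself a large LLM.

```python
import torch
import torch.nn as nn

# Tiny stand-in reward model: maps response features to a scalar score.
reward_model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Fake "features" of human-preferred (chosen) and rejected responses.
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)

for step in range(100):
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    # Pairwise preference loss: push the preferred response's score above the other.
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```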
RL Downside
RL discovers ways to "game" the lossy simulation of humans.
For example, after 1000 updates, the top joke about pelicans is not what you want, but something totally nonsensical like "the the the the the". This kind of input is not in the reward model's training set, and it happens to get a high score.
So you cannot run RL indefinitely in un-verifiable domains.
About Knowledge Distillation
Distillation in LLMs refers to a technique called knowledge distillation, which is used to train a smaller, more efficient model (called the student model) by transferring knowledge from a larger, more powerful model (the teacher model). The goal is to retain most of the teacher model’s performance while reducing computational costs.
Approaches to Distillation
If You Own the Teacher Model (Full Access)
- You can directly use its logits (probability distributions over outputs) or intermediate layer representations to guide the student model’s training.
- You can access its training data and generate additional “soft labels” for better supervision.
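A minimal sketch of logit-based (white-box) distillation, assuming PyTorch: soften both distributions with a temperature and minimize the KL divergence between teacher and student. The teacher logits and the tiny linear "student" here are stand-ins for full LLMs.

```python
import torch
import torch.nn.functional as F

vocab_size, temperature = 1000, 2.0
teacher_logits = torch.randn(4, vocab_size)        # pretend teacher outputs
student = torch.nn.Linear(32, vocab_size)          # pretend student model
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
inputs = torch.randn(4, 32)

student_logits = student(inputs)
# Soften both distributions with a temperature, then minimize KL(teacher || student).
loss = F.kl_div(
    F.log_softmax(student_logits / temperature, dim=-1),
    F.softmax(teacher_logits / temperature, dim=-1),
    reduction="batchmean",
) * temperature ** 2
optimizer.zero_grad()
loss.backward()
optimizer.step()
```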
If You Don’t Own the Teacher Model (Black-Box Distillation)
- You can use the API of the teacher model (if available) to query it and collect outputs (e.g., responses or probabilities).
- This is often called zero-shot or black-box distillation, where you use the teacher’s responses to fine-tune a smaller model.
- A famous example is training smaller models based on OpenAI’s GPT-4 responses without having access to GPT-4’s internals.
Limitations of Black-Box Distillation:
- You are limited to what the API provides (e.g., if it only gives text responses and not token probabilities, the student learns less fine-grained knowledge).
- It can be expensive if querying a paid API.
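A toy sketch of black-box distillation: collect (prompt, response) pairs from the teacher's API and save them as a supervised fine-tuning dataset for the student. The query_teacher function and file name are hypothetical stand-ins.

```python
import json

def query_teacher(prompt: str) -> str:
    # Hypothetical stand-in for a real API call to the teacher model.
    return f"(teacher's answer to: {prompt})"

prompts = ["Explain photosynthesis simply.", "Write a haiku about rain."]
dataset = [{"prompt": p, "response": query_teacher(p)} for p in prompts]

# Save as a supervised fine-tuning dataset for the student model.
with open("distill_sft.jsonl", "w") as f:
    for example in dataset:
        f.write(json.dumps(example) + "\n")
```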
Pros and Cons of Distillation
| Aspect | Pros (Advantages) | Cons (Disadvantages) |
|---|---|---|
| Efficiency & Performance | Produces a smaller, faster model with similar performance to the larger teacher model. Reduces computational cost and memory usage. | The student model usually cannot match the teacher model's full performance. Some knowledge is inevitably lost during distillation. |
| Training Cost | Requires fewer resources compared to training from scratch. Leverages the pre-trained teacher model to guide learning. | Training the student model still requires significant compute, especially if distilling from a very large teacher model. If using a black-box API, querying the teacher model can be costly. |
| Data Dependency | Can work without access to the original training data of the teacher model (if using API-based distillation). | If training data is available, distillation is more effective, but obtaining high-quality labeled data can be expensive. |
| Flexibility | Can be used with various architectures, allowing compression of transformer-based models like GPT, BERT, etc. Can be applied to different NLP tasks (e.g., text generation, classification). | Some architectures may not benefit as much from distillation. Requires careful tuning of hyperparameters to balance knowledge transfer. |
| Inference Speed | Leads to much faster inference, making LLMs deployable on edge devices or mobile platforms. Reduces latency in real-time applications (e.g., chatbots, search engines). | The trade-off between speed and accuracy needs to be balanced; aggressive compression can degrade quality. |
| Knowledge Transfer | Allows a smaller model to capture soft labels and knowledge (such as uncertainty and hidden patterns) from a larger model. | Some complex reasoning or long-context dependencies from the teacher model may not transfer well. |
| Accessibility | If a teacher model is available via API, distillation can be done without full access to the source code or training data. | Black-box distillation is limited by what the API exposes (e.g., no access to logits or internal activations). |
| Security & Privacy | Can be used to create private models without exposing original training data. Helps in model compression for on-premises deployment. | If distilling from an API-based teacher model, there is potential for bias transfer or unintentional memorization of sensitive data. |
| Adaptability | The student model can be fine-tuned on specific domains (e.g., legal, medical) after distillation. | If the teacher model updates frequently, the distilled model may become outdated unless re-distilled. |
Preview of Things to Come
- multimodal (not just text but audio, images, video, natural conversations)
- tasks -> agents (long, coherent, error-correcting contexts)
- pervasive, invisible
- computer-using
- test-time training, etc
Where to Keep Track of Them
- https://lmarena.ai/?leaderboard, but don't take the rankings too seriously.
- https://buttondown.com/ainews
- X / Twitter
- Run model locally: https://lmstudio.ai/