For my single-VM and multi-VM compose setup, please see this InfraTree.
I have known Vagrant since 2019, from the book "Ansible: Up and Running",
as it makes it easy to set up and test Ansible, Jenkins, and many other
software on a local VM (using Docker is another choice).
Vagrant Box Search
Vagrant Box Search provides pre-built VM images, all-in-one boxes, etc.
But you can also build your own customized image from a root image via Packer.
Introduction
Vagrant is a tool for building and managing virtual machine environments in a
single workflow. You can find everything from Vagrant site.
It leverages a declarative configuration file which describes all your
software requirements, packages, operating system configuration, users, and more.
Vagrant also integrates with your existing configuration management tooling like
Ansible, Chef, Docker, Puppet or Salt, so you can use the same scripts to
configure Vagrant for production.
The Vagrantfile is meant to be committed to version control with your
project, if you use version control. This way, every person working with that
project can benefit from Vagrant without any upfront work.
The syntax of Vagrantfiles is Ruby, but knowledge of the Ruby programming
language is not necessary to make modifications to the Vagrantfile, since it is
mostly simple variable assignment.
```shell
# init project
mkdir vagrant_getting_started
cd vagrant_getting_started

# --minimal: generate a minimal Vagrantfile
# hashicorp/bionic64: the box name
# skip this if you already have a Vagrantfile
vagrant init hashicorp/bionic64 [--minimal]
```
Of course you can create a Vagrantfile manually:
```shell
# download the box image
# you don't need to do this explicitly; vagrant will handle the
# download based on the Vagrantfile
vagrant box add hashicorp/bionic64
```
In the above command, you will notice that boxes are namespaced. Boxes are
broken down into two parts - the username and the box name - separated by a
slash. In the example above, the username is “hashicorp”, and the box is
“bionic64”.
Vagrantfile Example
Edit the Vagrantfile; here is a simple example that brings up a Jenkins cluster:
```ruby
# Jenkins server, set as primary
config.vm.define "server", primary: true do |server|
  server.vm.hostname = "jenkins-server"
  # private network
  # jenkins uses port 8080 in the browser
  server.vm.network "private_network", ip: SERVER_IP
  # provisioning
  server.vm.provision :shell, path: "./provision/server.sh", privileged: true
end

# agents setup
(1..2).each do |i|
  config.vm.define "agent#{i}" do |agent|
    agent.vm.hostname = "jenkins-agent#{i}"
    # private network
    agent.vm.network "private_network", ip: "#{AGENT_IP}#{i}"
    # provisioning
    agent.vm.provision :shell, path: "./provision/agent.sh", privileged: true
  end
end
end  # closes the enclosing Vagrant.configure block
```
Network Config
By using private_network, there is no need for port forwarding: you can
access the service directly via the VM's IP from your host. If you check the
VirtualBox configuration, it uses a host-only adapter.
Please stay within the private network range. To verify access, you can try
Chrome incognito mode or Firefox, or use curl/wget.
For SSH access, the VMs must be in the same private network:
```ruby
# assign a private IP
config.vm.network "private_network", ip: "192.168.2.2"
```
This configuration uses a private network. The VM can be accessed only
from another VM that runs Vagrant and within the same range. You won’t be able
to connect to this IP address from another physical host, even if it’s on the
same network as the VM. However, different Vagrant VMs can connect to each
other.
SSH agent forwarding lets you, for example, git clone inside the VM using the
private key of the host:

```ruby
config.ssh.forward_agent = true
```
Bring up the environment:
```shell
# validate syntax
vagrant validate
# check machine status
vagrant status
# at your project root directory
vagrant up
# ssh to the primary machine in your project
vagrant ssh [vm name]

# when you finish working
# destroy the machine
# -f: skip the confirmation prompt
vagrant destroy -f
# remove the box
vagrant box remove
```
Note that by default vagrant ssh logs in as the vagrant user, not root; you can use sudo to execute commands or run sudo su - first.
SSH to VM
This is to demystify how SSH to VM from our host works.
```
Host default
  HostName 127.0.0.1
  User vagrant
  Port 2222
  UserKnownHostsFile /dev/null
  StrictHostKeyChecking no
  PasswordAuthentication no
  IdentityFile <absolute path>/.vagrant/machines/<vm name>/virtualbox/private_key
  IdentitiesOnly yes
  LogLevel FATAL
```
Vagrant sets up host-to-guest port forwarding on a high port on localhost
(e.g., 127.0.0.1:2222 → guest:22). You can also change the default SSH port
mapping (see this blog), for example:
```ruby
# id: "ssh"
# map host port 13001 to guest port 22
config.vm.network :forwarded_port, guest: 22, host: 13001, id: "ssh"
```
The underlying SSH command can be seen by ps aux | grep ssh:
Vagrant generates a new SSH key pair per VM, unless you disable insert_key in
the Vagrantfile:

```ruby
config.ssh.insert_key = false
```
Then Vagrant uses the insecure default key; if you check the verbose ssh
command underlying vagrant ssh, you will see the location of that insecure key:

```shell
ps aux | grep ssh
# -i <user home>/.vagrant.d/insecure_private_key
```
Synced Folder
By using synced folders, Vagrant will automatically sync your files to and
from the guest machine. By default, Vagrant shares your project directory
(remember, that is the one with the Vagrantfile) to the /vagrant directory in
your guest machine.
Synced folders are mounted before provisioning runs.
Vagrant has built-in support for automated provisioning. Using this feature,
Vagrant will automatically install software when you vagrant up:
```ruby
# use the Ansible provisioner
VAGRANTFILE_API_VERSION = "2"
Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  config.vm.box = "ubuntu/trusty64"
  config.vm.provision "ansible" do |ansible|
    ansible.playbook = "playbook.yml"
  end
end
```
Vagrant will not run it a second time unless you force it.
```shell
# reboot the machine and reload provision settings if it is already running
# if no provisioning is needed, just run: vagrant reload
# --provision: force provisioning to run again
vagrant reload --provision
# force a provisioning rerun while the machine is running
vagrant provision
```
Methods to teardown:
```shell
# hibernate, save state to disk
vagrant suspend
vagrant resume

# normal power off
vagrant halt
# reclaim all resources
vagrant destroy
# boot again
vagrant up
```
The question is: how can I tell whether a MySQL server is a primary, a replica, or both (dual)?
Is It a Replica
```sql
SHOW REPLICA STATUS\G
```
What to look for in the output:
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
If both of these are Yes, the server is running as a replica and is actively
trying to pull and apply changes from a primary. If either is No or Connecting,
it’s configured as a replica but replication might be stopped or encountering
issues.
Source_Host (or Master_Host in older versions): This field will show the
hostname or IP address of the server it’s replicating from. If this is
populated, it’s a strong indicator that this server is a replica. If it’s empty,
it’s not configured as a replica.
If SHOW REPLICA STATUS returns an empty set, it means the server is NOT
configured as a replica.
Is It a Master
The master must have the binlog enabled if it is used as a source for replicas:
```sql
SHOW VARIABLES WHERE variable_name IN ('log_bin', 'binlog_format');

SHOW MASTER STATUS\G
```
If it has replicas and replication is running:
```sql
SHOW REPLICAS\G

SHOW PROCESSLIST\G
```
Dual
A server with both properties above, for example a MySQL server used as a
cascading (intermediate) replica.
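The checks above can be combined into one small decision rule. Here is a minimal Python sketch: the field names mirror SHOW REPLICA STATUS output, but the dicts are hand-built for illustration (a real script would populate them from a MySQL client connection):

```python
def classify_role(replica_status, has_replicas):
    """Classify a MySQL server as primary, replica, dual, or standalone.

    replica_status: dict of SHOW REPLICA STATUS fields, or None when the
    statement returned an empty set (not configured as a replica).
    has_replicas: True when SHOW REPLICAS lists connected replicas.
    """
    is_replica = replica_status is not None and bool(replica_status.get("Source_Host"))
    if is_replica and has_replicas:
        return "dual"      # e.g. a cascading (intermediate) replica
    if is_replica:
        return "replica"
    if has_replicas:
        return "primary"
    return "standalone"

# a cascading replica: replicates from a source and serves its own replicas
status = {"Source_Host": "10.0.0.1",
          "Replica_IO_Running": "Yes",
          "Replica_SQL_Running": "Yes"}
print(classify_role(status, has_replicas=True))  # dual
print(classify_role(None, has_replicas=True))    # primary
```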
At its core, a lock is a mechanism that allows a database system (like MySQL) to
regulate concurrent access to data. When a transaction or operation needs to
read or modify data, it acquires a lock on that data. This lock prevents other
transactions from accessing or modifying the same data in a way that would lead
to inconsistency or data corruption.
Key Principles of Locking
Concurrency Control: Locks ensure that multiple users or processes can access
the database simultaneously without interfering with each other’s operations.
Data Consistency: They prevent phenomena like dirty reads (reading uncommitted
data), non-repeatable reads (reading the same data twice and getting different
results because another transaction committed a change in between), and
phantom reads (seeing new rows appear or disappear in a range query due to
concurrent insertions/deletions).
Data Integrity: They protect data from being corrupted by simultaneous updates
that could overwrite each other’s changes.
Transaction Isolation: Locks are fundamental to implementing ACID properties,
specifically “Isolation,” ensuring that concurrent transactions appear to execute
serially.
Types of Locks (Simplified for understanding)
MySQL (especially with InnoDB, its primary storage engine) uses a sophisticated
locking mechanism. The most common types are:
Shared Locks (S-locks / Read Locks)
Acquired when a transaction wants to read data.
Multiple transactions can hold shared locks on the same data simultaneously.
A shared lock prevents an exclusive lock from being acquired on the same data.
Analogy: Multiple people can read the same book at the same time (e.g., in a
library reading room), but no one can modify it.
Exclusive Locks (X-locks / Write Locks)
Acquired when a transaction wants to modify (insert, update, delete) data.
Only one transaction can hold an exclusive lock on particular data at any given
time.
An exclusive lock prevents any other shared or exclusive locks from being
acquired on the same data.
Analogy: Only one person can borrow a book and make annotations or changes to
it at a time.
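The interaction between these two lock modes boils down to a small compatibility table. The sketch below is illustrative only, not MySQL's actual implementation:

```python
# Lock-mode compatibility: can a new request of mode `requested` be
# granted while another transaction holds mode `held` on the same data?
# S = shared (read) lock, X = exclusive (write) lock.
COMPATIBLE = {
    ("S", "S"): True,   # many readers can share the same row
    ("S", "X"): False,  # an existing reader blocks a writer
    ("X", "S"): False,  # an existing writer blocks readers
    ("X", "X"): False,  # only one writer at a time
}

def can_grant(held, requested):
    return COMPATIBLE[(held, requested)]

print(can_grant("S", "S"))  # True: shared locks coexist
print(can_grant("X", "S"))  # False: an exclusive lock blocks everything else
```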
Lock Granularity
Locks can be applied at different levels of granularity:
Row-level locks: The most common and desirable for high concurrency. Only the
specific rows being accessed are locked. InnoDB primarily uses row-level locking.
Page-level locks: Less granular than row-level, but more granular than table-level.
Table-level locks: Locks the entire table, preventing any other operations
(reads or writes) on that table. MyISAM primarily uses table-level locking.
InnoDB also uses table-level locks for DDL operations (e.g., ALTER TABLE).
Metadata Locks (MDL): Protects database objects (tables, functions, etc.) from
concurrent DDL and DML operations that would conflict. For example, an ALTER
TABLE cannot proceed if there are active queries on the table, and vice-versa.
Lock Commands
It’s crucial to understand that MySQL’s primary storage engine, InnoDB (the
default), handles most locking automatically at the row level within transactions.
You rarely explicitly “lock a row” or “lock a page” yourself in application code.
The LOCK TABLES statement is an explicit table-level lock that bypasses
InnoDB’s finer-grained control and is generally discouraged for high-concurrency
applications using InnoDB.
Instance Wide Lock
For replication, we take an instance-wide lock even before dump/load, because
we want to capture the binlog file and position at that point.
```sql
-- wait for active transactions to complete and flush caches
mysql> FLUSH TABLES WITH READ LOCK;

-- this lock is released automatically when the session that issued the
-- command disconnects, so quitting the session is equivalent to:
mysql> UNLOCK TABLES;
```
I verified that after running FTWRL, an insert from a newly logged-in session
hangs (blocked) until the lock is released.
Table Lock
Read Lock
```sql
LOCK TABLES products READ;
```
What it does: Allows the session holding the lock to read from products. Allows
other sessions to read from products without explicitly acquiring a lock. Prevents
any session (including the one holding the lock) from writing to products.
Behavior for session holding the lock:
Can execute SELECT statements on products.
Cannot execute INSERT, UPDATE, DELETE on products.
Behavior for other sessions:
Can execute SELECT statements on products.
INSERT, UPDATE, DELETE statements on products will be blocked (wait) until the
lock is released.
Write Lock
```sql
LOCK TABLES products WRITE;
```
What it does: Grants exclusive access to the table for the session holding the
lock. The session holding the lock can both read and write. All other sessions
are blocked from both reading and writing to products.
Behavior for session holding the lock:
Can execute SELECT, INSERT, UPDATE, DELETE statements on products.
Behavior for other sessions:
All SELECT, INSERT, UPDATE, DELETE statements on products will be blocked (wait)
until the lock is released.
```
mysql> \s
--------------
mysql  Ver 14.14 Distrib 5.7.33, for Linux (x86_64) using EditLine wrapper

Connection id:          513
Current database:
Current user:           root@localhost
SSL:                    Not in use
Current pager:          stdout
Using outfile:          ''
Using delimiter:        ;
Server version:         8.0.16 MySQL Community Server - GPL
Protocol version:       10
Connection:             Localhost via UNIX socket
Server characterset:    utf8mb4
Db     characterset:    utf8mb4
Client characterset:    utf8
Conn.  characterset:    utf8
UNIX socket:            /tmp/mysql.sock
Uptime:                 4 days 21 hours 55 min 37 sec
```
Threads: This indicates the total number of currently connected client threads
(connections) to the MySQL server
Questions: This is a cumulative counter representing the total number of
queries (statements) that the server has executed since it was last started
Slow queries: This is a cumulative counter for the number of queries that have
taken longer than the long_query_time system variable setting to execute. By
default, long_query_time is usually 10 seconds.
Opens: This is a cumulative counter for the number of files (tables, logs, etc.)
that MySQL has opened.
Flush tables: This is a cumulative counter for the number of times a FLUSH TABLES
command (or similar flush operation) has been executed. FLUSH TABLES forces MySQL
to close all open tables and reload them from disk. This is often done after
making changes to table structures, privileges, or for backup purposes.
Open tables: This represents the actual number of tables that are currently
open in the MySQL table cache. These tables are kept open to avoid the overhead
of opening and closing them repeatedly, improving performance.
Query per second: It’s calculated as Questions / Uptime (where Uptime is the
server’s running time in seconds).
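To sanity-check that arithmetic with the uptime from the status output above (the Questions value here is made up for illustration, since it is truncated in the capture):

```python
# Uptime from the status output: 4 days 21 hours 55 min 37 sec
uptime = 4 * 86400 + 21 * 3600 + 55 * 60 + 37  # 424537 seconds

# Questions is a cumulative counter; this value is made up for illustration
questions = 2_122_685

qps = questions / uptime
print(f"{qps:.3f}")  # 5.000
```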
Each server (source and all replicas) in a replication chain must have a
unique server_id. When a source server writes changes to its binary log, it
includes its server_id with each event.
Replicas use this ID to:
Prevent loops: A replica will skip applying events that originated from a
server with its own server_id to avoid infinite loops in multi-master or
circular replication topologies.
Identify the source: In more complex replication setups, the server_id helps
identify which server generated a particular set of changes
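The loop-prevention rule can be sketched as a simple filter; the event dicts below are hypothetical stand-ins for binlog events:

```python
def events_to_apply(events, my_server_id):
    """Skip binlog events that originated from this server itself,
    which prevents infinite loops in circular replication topologies."""
    return [e for e in events if e["server_id"] != my_server_id]

# hypothetical binlog events tagged with the originating server_id
events = [
    {"server_id": 1, "stmt": "INSERT INTO t VALUES (1)"},
    {"server_id": 2, "stmt": "INSERT INTO t VALUES (2)"},  # our own, came back around
    {"server_id": 3, "stmt": "INSERT INTO t VALUES (3)"},
]
print([e["server_id"] for e in events_to_apply(events, my_server_id=2)])  # [1, 3]
```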
These are the most common and direct ways to see the current user.
USER() returns the username and host that the client attempted to
authenticate with.
CURRENT_USER() returns the username and host that the MySQL server actually
authenticated the client connection with. This is the user account that determines
your privileges.
They can sometimes be different (e.g., if you tried to connect with a
non-existent user, MySQL might connect you as an anonymous user).
```sql
-- with an IP range
CREATE USER 'myuser'@'192.168.1.0/24' IDENTIFIED BY 'YourSecurePasswordHere';
CREATE USER 'myuser'@'192.168.1.%' IDENTIFIED BY 'YourSecurePasswordHere';
```
Grant Privileges
Grant privileges to user:
```sql
mysql> GRANT EVENT ON animals.* TO 'myuser'@'%';
mysql> FLUSH PRIVILEGES;
```
```
*************************** 2. row ***************************
     Id: 2620
   User: myuser
   Host: 35.194.40.52:59948
     db: NULL
Command: Binlog Dump
   Time: 615115
  State: Source has sent all binlog to replica; waiting for more updates
   Info: NULL
```

```
-- when GTID is ON
*************************** 1. row ***************************
     Id: 11437
   User: myuser
   Host: 34.121.192.197:56772
     db: NULL
Command: Binlog Dump GTID
   Time: 800846
  State: Master has sent all binlog to slave; waiting for more updates
   Info: NULL
```
This thread is responsible for connecting to the source server and reading the
binary log events. Think of it as the “data fetcher.” It streams the changes
(SQL statements, row changes, etc.) from the source’s binary logs.
Once received, the I/O thread writes these events to a local file on the replica
called the relay log. The relay log acts as a temporary cache of the binary log
events from the source.
If a replica is offline for a period, the I/O thread can quickly retrieve all
the accumulated binary log events from the source once it reconnects, even if the
SQL thread takes longer to process them.
This thread is responsible for reading the events from the relay log (written by
the I/O thread) and executing them on the replica’s database. Think of it as the
“data applier.”
Its goal is to apply the changes to the replica’s data as quickly and efficiently
as possible to keep the replica synchronized with the source.
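The interplay of the two threads can be sketched with a simple queue standing in for the relay log (illustrative only; the real replication threads run concurrently):

```python
from collections import deque

# The relay log decouples fetching from applying: the I/O thread keeps
# appending events even while the SQL thread lags behind.
relay_log = deque()
applied = []

def io_thread_fetch(source_events):
    # "data fetcher": stream binlog events from the source into the relay log
    relay_log.extend(source_events)

def sql_thread_apply(batch_size):
    # "data applier": read events from the relay log and execute them locally
    for _ in range(min(batch_size, len(relay_log))):
        applied.append(relay_log.popleft())

io_thread_fetch(["e1", "e2", "e3", "e4"])  # the fetcher runs ahead
sql_thread_apply(batch_size=2)             # the applier lags behind
print(applied)          # ['e1', 'e2']
print(list(relay_log))  # ['e3', 'e4'] still buffered in the relay log
```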
```sql
-- if you enabled the binlog, the retention is set automatically
-- check binlog retention; binlog_expire_logs_seconds is for recent versions
mysql> SHOW VARIABLES WHERE variable_name IN
    -> ('binlog_expire_logs_seconds', 'expire_logs_days');
+----------------------------+---------+
| Variable_name              | Value   |
+----------------------------+---------+
| binlog_expire_logs_seconds | 2592000 |
| expire_logs_days           | 0       |
+----------------------------+---------+
2 rows in set (0.00 sec)
```
One server can have both MyISAM and InnoDB for different tables.
```sql
SELECT count(table_name) AS table_count
FROM information_schema.tables tab
WHERE engine != 'InnoDB'
  AND table_type = 'BASE TABLE'
  AND table_schema NOT IN
    ('information_schema', 'sys', 'performance_schema', 'mysql');
```
This quick note is mainly about the CLI ctr. Please note that the ctr tool is
made for debugging containerd; it doesn't support all the features you may be
used to from Docker, such as port publishing, automatic container restart on
failure, or browsing container logs.
```
● containerd.service - containerd container runtime
     Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; preset: disabled)
     Active: active (running) since Sat 2025-06-14 22:45:52 UTC; 1 week 0 days ago
       Docs: https://containerd.io
   Main PID: 461 (containerd)
      Tasks: 256
     Memory: 1.5G
        CPU: 4h 43min 43.197s
     CGroup: /system.slice/containerd.service
```
Namespace
You can have both Docker and containerd containers running on the same
server; Docker itself uses containerd as the underlying runtime. If you have
ctr available, you can examine the relationship:
```shell
$ ctr ns ls
NAME LABELS
ns1
ns2

$ ctr -n ns2 c ls
CONTAINER                                                           IMAGE    RUNTIME
742bd6fc9cd8efdac4dbdf1bd302e4c9ecd1d259224596746939ef5ae0167d47    -        io.containerd.runc.v2
ebe97a5c7a289273ce333dae921b7e4f8931cd5edc27749c8b6d445867935a12    -        io.containerd.runc.v2

$ docker ps
CONTAINER ID   IMAGE                                          COMMAND                  CREATED      STATUS      PORTS   NAMES
ebe97a5c7a28   us-docker.pkg.dev/demo/images/example:latest   "/mysql-scripts/start"   7 days ago   Up 7 days           mysqld
742bd6fc9cd8   us-docker.pkg.dev/demo/images/example:latest   "/proxy_server -logt…"   7 days ago   Up 7 days           proxy_server
```
Container and Task
The separation of containers and tasks in ctr (and in containerd’s architecture)
might seem a bit different from Docker’s more unified view of a running container.
A container in containerd is primarily an isolated metadata and configuration
entity. It doesn't have a running process (task) associated with it yet, so
you can have a container in containerd without any live task (the init
process) running; it is just empty.
Please note that in containerd, one container has at most one task (the init process), but one task can have more than one process!
Commands
Image Pull
Let's see an example that demonstrates containers vs tasks in containerd. I
have a ctr alias set up as below for the target namespace:
```shell
alias ctr='sudo ctr -n ns1'
```
Please note that images are also associated with a namespace! Also, you have
to pull the image first; ctr won't do it automatically for you when you
create a container:
```shell
$ ctr image pull docker.io/library/alpine:latest
```
List the local images:
```shell
# -q: only show image paths
ctr i ls -q
docker.io/library/alpine:latest
```
Inspect Image Internals
Instead of running a container with a task and exec-ing into it, one way to
check the image internals (e.g., the built-in files) is to mount the image
to a local folder and inspect that folder:
```shell
$ mkdir /tmp/agent_rootfs
$ ctr i mount docker.io/library/alpine:latest /tmp/agent_rootfs

# now examine the alpine image internals
$ ls /tmp/agent_rootfs
bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var

# unmount
$ ctr i unmount /tmp/agent_rootfs
```
Create Container
Then, create a container named test and give it a default process /bin/sh. If
you don't specify a command, the default command from the image's
configuration is used (which might exit immediately if it isn't designed to
run indefinitely).
```shell
ctr c create docker.io/library/alpine:latest test /bin/sh
```
Check the container is created:
```shell
$ ctr c ls
test    docker.io/library/alpine:latest
```
You can view the /bin/sh we specified in the args field:
```shell
ctr c info test | grep -A3 args
```
Sometimes you want to check the container start timestamp:
```shell
ctr c info test | jq -r '.UpdatedAt'
```
Please note that no task is running from test container so far:
```shell
# you should see an empty result
ctr t ls | grep test
```
Start Container
Now start the container in detached mode. This command starts the initial
task associated with the container test. By default, the init task ID is the
same as the container ID, and /bin/sh is the init process of the task.
```shell
$ ctr t start -d test

$ ctr t ls | grep test
test    125090    RUNNING
```
Create Child Processes
Let’s create 3 processes for task test:
```shell
ctr t exec -d --exec-id pro1 test /bin/sh
ctr t exec -d --exec-id pro2 test /bin/sh
ctr t exec -d --exec-id pro3 test /bin/sh
```
Now check the processes in container test; please note we still have only one
task (ctr t ls | grep test):
```shell
$ ctr t ps test
PID       INFO
125090    -
125191    exec_id:"pro1"
125221    exec_id:"pro2"
125250    exec_id:"pro3"
```
The PIDs here (125191, 125221, 125250) are the host PIDs of the processes, so
you can kill them from the host; ps aux | grep can also be used to find the
processes on the host VM. To kill the process pro1:
```shell
sudo kill -9 125191
```
Exec into Container
To exec into the container, you need to specify an exec ID:
```shell
# foo is the exec ID
ctr t exec -t --exec-id foo test sh
```
Or use a random exec ID:
```shell
ctr t exec -t --exec-id ${RANDOM} test sh
```
Remove
Finally, kill the task test:
```shell
ctr t kill -s 9 test
```
Clean up the task test:
```shell
ctr t rm -f test
```
Clean up the container test:
```shell
ctr c rm test
```
Containerd Container Lifecycle Management
As we mentioned, containerd (and ctr) has no restart-on-failure configuration
like Docker; one way to restart on failure is to set up a systemd service,
for example:
```shell
$ systemctl status example.service

● example.service - Example: app
     Loaded: loaded (/etc/systemd/system/example.service; disabled; preset: disabled)
     Active: active (running) since Sat 2025-06-14 22:46:54 UTC; 1 week 1 day ago
   Main PID: 2965 (bash)
      Tasks: 2 (limit: 17980)
     Memory: 2.0M
        CPU: 45.258s
```
There is a good free course with informative visualizations for containerd:
link. I will reuse some of its images in this blog for personal learning
purposes.
Core Concepts
Containerd uses namespaces to provide isolation for different sets of
containers and resources.
A task represents a running process within a container. A single container
can have one init task running inside it.
Containerd maintains a local store (usually a SQLite database) to keep track
of the state of all the objects it manages: namespaces, containers, images,
tasks, and snapshots.
Containerd has a plugin-based architecture. This allows for extending its
functionality and integrating with other systems.
Runc to Containerd
Containerd cannot run containers on its own. To put it simply, runc is a
command-line tool that knows how to create, start, stop, and delete
containers, given a container configuration and a root filesystem.
Docker (through containerd), Podman, Kubernetes, and other “higher-level”
container runtimes and orchestrators under the hood rely on runc (or an
alternative OCI Runtime implementation) to run containers.
This blog is about gcloud configuration setup and management, recapped from
my daily work. These settings are useful, especially api_endpoint_overrides,
which saves me from running the lengthy API-call counterpart.
List gcloud configuration
You may have multiple gcloud configuration entities:
[ ] I want to build my own screencast extension, linked to my GCS, so the link
can be rendered by a markdown tag in my blog.
[ ] Battle of Chinese and Western mythologies (中外神话大战), for a global audience
You can watch YouTube videos to learn how to use them:
ChatGPT: advanced data analysis.
Claude Artifacts: create web apps etc.; practice it, and an artifact can be
published to run on the web for everyone to access.
Tip: during the training of a big neural network, you can print the generated
text every 20 training steps, and you will see how the model gradually
improves at predicting the next words.
Then the lecturer moves to Hyperbolic to try inference on a base model; he
uses Llama as an example.
Now we have a base model, but it is still just an internet document
simulator: it can generate text sequentially from the initial input, but we
actually want to build an assistant system that answers questions.
How Does Neural Network Hold knowledge?
LLMs store learned knowledge in the model’s parameters (weights), not in a
database or memory. During training, the model adjusts its millions (or even billions)
of weights to recognize patterns, relationships, and structures in language.
Neural Network Weights: These weights encode statistical relationships between
words, phrases, and concepts.
Hidden Representations: The model learns abstract representations of language,
enabling it to generate relevant responses based on context.
No Direct Storage of Training Data: The model doesn’t store exact documents or
books but compresses useful patterns and knowledge into its weights.
When you ask a question, the model doesn’t “look up” an answer from storage.
Instead, it generates responses dynamically based on learned patterns. The model
predicts the most probable next words given the input, guided by the patterns
encoded in its weights.
Objective of Pre-Training Stage
The model is trained on a massive dataset (books, articles, code, etc.), learning
to predict missing words, the next word, or even reconstruct corrupted text.
Common pre-training objectives:
Causal Language Modeling (CLM): Predict the next token (used in models like GPT).
Masked Language Modeling (MLM): Predict missing words in a sentence (used in BERT).
Sequence-to-Sequence Learning: For tasks like translation (used in models like T5).
What Does the Model Learn?
Statistical patterns in language: It learns which words/tokens frequently appear
together.
Syntax and grammar: It picks up grammatical structures by learning associations
between words.
Semantics and meaning: The model develops an understanding of concepts through
word embeddings (e.g., “Paris” is related to “France”).
World knowledge: It passively absorbs factual information from its dataset.
Basic reasoning: By recognizing complex relationships, it can perform simple
inference.
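As a toy illustration of the first point, the crudest possible "language model" just counts which token most often follows another; real pre-training encodes such statistics in neural network weights rather than an explicit table:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# count bigram frequencies: which word follows which
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    # the most frequent next token given the previous one
    return following[word].most_common(1)[0][0]

print(predict_next("the"))  # cat ('cat' follows 'the' twice, 'mat'/'fish' once)
```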
Post-Training
We call this stage supervised finetuning: with human curation, we show the
model problems and their demonstrated solutions, for imitation.
We need to provide a conversation (prompt + answer) dataset and then train
the model, but it takes much less time compared with the pre-training stage.
Human labelers are employed (this can be done by software as well) to create
these conversations, for example to come up with the prompt and the ideal
response; the InstructGPT paper describes this.
There are open-source reproductions of such human-labeled conversation
training datasets, and UltraChat can help with multi-round dialog data. So in
fact, you are talking to a simulation of human labelers, not magic AI.
Tokenization of the conversations: similar to a TCP packet structure, we
define a structure to encode the conversations before feeding them to the
model; see for example the GPT-4 tokenizer, where you will see special tokens
like "<|im_start|>" used to group the content.
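A minimal sketch of such an encoding: the exact special tokens are tokenizer-specific, and the <|im_start|>/<|im_end|> markers below follow the ChatML-style convention used by the GPT-4 tokenizer:

```python
# ChatML-style encoding of a conversation before tokenization; the exact
# special tokens vary by tokenizer.
def encode_conversation(turns):
    parts = []
    for role, content in turns:
        parts.append(f"<|im_start|>{role}\n{content}<|im_end|>")
    return "\n".join(parts)

text = encode_conversation([
    ("user", "What is 2+2?"),
    ("assistant", "2+2 equals 4."),
])
print(text)
```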
Hallucination
An issue where the model does not know the answer and produces a fake one. It can be mitigated in a few ways:
Use model interrogation to discover model’s knowledge, and programmatically
augment its training dataset with knowledge-based refusals in cases where the
model doesn’t know.
Allow the model to search via search-trigger tokens when it doesn't know the
answer; how the search-trigger token gets used is also trained through the dataset.
You can explicitly tell LLM to use/not to use any tool.
Knowledge vs Working Memory
Knowledge in the parameters (weights): the vague recollection (e.g., of
something you read a month ago).
Knowledge in the tokens of the context window: the working memory.
For example, for a math problem, you need to train the model to use the
context to infer the result, and to distribute the reasoning/computation
before the final answer.
Don't give the answer immediately in a short sentence at the beginning; this
does not help model training. If the dataset gives the answer first, the
model will merely try to justify it after the fact.
Question:
```
I bought 3 apples and 2 oranges. Each orange costs $2, and the total
cost is $13. What is the cost of the apples?
```
Bad answer:
```
The answer is $3. This is because 2 oranges at $2 are $4 total. So 3
apples cost $9, and therefore each apple is 9/3 = $3.
```
Good answer:
```
The total cost of the 2 oranges is $4. 13 - 4 = $9, so the cost of the
3 apples is $9. 9/3 = 3, so each apple costs $3. The answer is $3.
```
To be less error-prone, you can ask the model to "use code/tool" rather than
computing mentally.
Because of the nature of tokens, the model is not good at counting/spelling;
try "use code/tool" to get the right answer if possible.
Reinforcement Learning
In this stage, we take the SFT (supervised finetuning) model into
reinforcement learning, the last major stage of training. Basically, what we
do is: give a prompt to practice on, then trial & error until the correct
answer is reached.
For example, given the problem statement (prompt) and the final answer, we
generate 15 solutions; suppose only 4 of them get the right answer. We pick
the top solution based on some criteria and train on it, repeating many times
to encourage the model to produce such tokens.
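That sampling loop can be sketched as follows; here `solve` is a random stand-in for the model generating a solution, and picking the shortest correct solution is just one illustrative selection criterion:

```python
import random

random.seed(0)  # deterministic for the example

def solve(prompt):
    # stand-in for the model sampling one solution; a real model would
    # generate a token sequence ending in an answer
    return {"steps": random.randint(1, 5),
            "answer": random.choice([9, 13, 3])}

def rl_step(prompt, correct_answer, n_samples=15):
    # sample many solutions, keep only those reaching the right answer,
    # then pick the "best" (here: shortest) one to train on
    solutions = [solve(prompt) for _ in range(n_samples)]
    correct = [s for s in solutions if s["answer"] == correct_answer]
    if not correct:
        return None  # nothing to reinforce this round
    return min(correct, key=lambda s: s["steps"])

best = rl_step("apples and oranges problem", correct_answer=3)
print(best)
```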
DeepSeek-R1 published its reinforcement learning approach, which drew public
attention because RL is kind of a secret within AI companies.
For example, ChatGPT's o3-mini, o3-mini-high, and o1 are all RL models; the
earlier ones are just SFT models.
RL can go beyond human expertise.
RL in Un-verifiable Domains
For example, “write a joke about pelicans”, how do we score the answer?
We need a scalable approach: RLHF (reinforcement learning from human feedback).
The core ideas are:
Take a small portion of the results and have humans order them from best to worst.
Train a neural net simulator of human preferences (“reward model”).
Run RL as usual, but using the simulator instead of humans.
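The second step is typically trained with a pairwise ranking objective; here is a minimal Bradley-Terry-style sketch of the loss, with hypothetical reward-model scores:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def reward_model_loss(score_preferred, score_rejected):
    # pairwise ranking loss: push the reward model to score the
    # human-preferred completion above the rejected one
    return -math.log(sigmoid(score_preferred - score_rejected))

# a reward model that already ranks the preferred joke higher incurs
# a smaller loss than one that ranks it lower
print(reward_model_loss(2.0, 0.5) < reward_model_loss(0.5, 2.0))  # True
```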
RL Downside
RL discovers ways to "game" the lossy simulation of humans.
For example, after 1000 updates, the top joke about pelicans is not what you
want, but something totally nonsensical like "the the the the the"; this kind
of input is not in the simulation model's training set and happens to receive
a high score.
So you cannot run RL indefinitely in un-verifiable domains.
About Knowledge Distillation
Distillation in LLMs refers to a technique called knowledge distillation, which
is used to train a smaller, more efficient model (called the student model) by
transferring knowledge from a larger, more powerful model (the teacher model).
The goal is to retain most of the teacher model’s performance while reducing
computational costs.
Approaches to Distillation
If You Own the Teacher Model (Full Access)
You can directly use its logits (probability distributions over outputs) or
intermediate layer representations to guide the student model’s training.
You can access its training data and generate additional “soft labels” for
better supervision.
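A minimal sketch of training against the teacher's soft labels: soften both distributions with a temperature, then minimize the student's cross-entropy against the teacher. This is pure-Python for illustration; real implementations use a framework loss such as KL divergence over batched, temperature-scaled logits:

```python
import math

def softmax(logits, temperature=1.0):
    # temperature > 1 softens the distribution, exposing "dark knowledge"
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    # cross-entropy of the student against the teacher's softened
    # distribution (the "soft labels"); lower means closer to the teacher
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))

teacher  = [2.0, 1.0, 0.1]  # hypothetical teacher logits over 3 tokens
aligned  = [2.1, 0.9, 0.2]  # student close to the teacher
diverged = [0.0, 0.0, 3.0]  # student far from the teacher
print(distill_loss(teacher, aligned) < distill_loss(teacher, diverged))  # True
```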
If You Don’t Own the Teacher Model (Black-Box Distillation)
You can use the API of the teacher model (if available) to query it and collect
outputs (e.g., responses or probabilities).
This is often called zero-shot or black-box distillation, where you use
the teacher’s responses to fine-tune a smaller model.
A famous example is training smaller models based on OpenAI’s GPT-4 responses
without having access to GPT-4’s internals.
Limitations of Black-Box Distillation:
You are limited to what the API provides (e.g., if it only gives text responses
and not token probabilities, the student learns less fine-grained knowledge).
It can be expensive if querying a paid API.
Pros and Cons of Distillation
Efficiency & Performance
Pros: Produces a smaller, faster model with similar performance to the larger teacher model; reduces computational cost and memory usage.
Cons: The student model usually cannot match the teacher model's full performance; some knowledge is inevitably lost during distillation.

Training Cost
Pros: Requires fewer resources compared to training from scratch; leverages the pre-trained teacher model to guide learning.
Cons: Training the student model still requires significant compute, especially if distilling from a very large teacher model; if using a black-box API, querying the teacher model can be costly.

Data Dependency
Pros: Can work without access to the original training data of the teacher model (if using API-based distillation).
Cons: If training data is available, distillation is more effective, but obtaining high-quality labeled data can be expensive.

Flexibility
Pros: Can be used with various architectures, allowing compression of transformer-based models like GPT, BERT, etc.; can be applied to different NLP tasks (e.g., text generation, classification).
Cons: Some architectures may not benefit as much from distillation; requires careful tuning of hyperparameters to balance knowledge transfer.

Inference Speed
Pros: Leads to much faster inference, making LLMs deployable on edge devices or mobile platforms; reduces latency in real-time applications (e.g., chatbots, search engines).
Cons: The trade-off between speed and accuracy needs to be balanced; aggressive compression can degrade quality.

Knowledge Transfer
Pros: Allows a smaller model to capture soft labels and knowledge (such as uncertainty and hidden patterns) from a larger model.
Cons: Some complex reasoning or long-context dependencies from the teacher model may not transfer well.

Accessibility
Pros: If a teacher model is available via API, distillation can be done without full access to the source code or training data.
Cons: Black-box distillation is limited by what the API exposes (e.g., no access to logits or internal activations).

Security & Privacy
Pros: Can be used to create private models without exposing original training data; helps in model compression for on-premises deployment.
Cons: If distilling from an API-based teacher model, there is potential for bias transfer or unintentional memorization of sensitive data.

Adaptability
Pros: The student model can be fine-tuned on specific domains (e.g., legal, medical) after distillation.
Cons: If the teacher model updates frequently, the distilled model may become outdated unless re-distilled.
Preview of Things to Come
Multimodal (not just text, but audio, images, video, and natural conversations)