2025-06-23 content updated.

For my single-VM and multi-VM compose setup, please see this InfraTree.

I have known Vagrant since 2019, from the book "Ansible: Up and Running"; it makes it easy to set up and test Ansible, Jenkins, and many other pieces of software on local VMs (using Docker is another option).

Vagrant Box Search provides pre-built VM images, all-in-one boxes, etc.

You can also build your own customized image from a base image via Packer.

Introduction

Vagrant is a tool for building and managing virtual machine environments in a single workflow. You can find everything on the Vagrant site.

It leverages a declarative configuration file which describes all your software requirements, packages, operating system configuration, users, and more.

Vagrant also integrates with your existing configuration management tooling like Ansible, Chef, Docker, Puppet or Salt, so you can use the same scripts to configure Vagrant for production.

For a comparison, see Vagrant vs. Other Software.

Install

Installation is easy: install Vagrant and VirtualBox. Other providers are possible as well, for example Docker or VMware.
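A minimal sketch for macOS with Homebrew (an assumption; on Linux, use your distribution's or HashiCorp's packages instead):

# install Vagrant and the VirtualBox provider
brew install --cask vagrant
brew install --cask virtualbox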

Project

The Vagrantfile is meant to be committed to version control with your project, if you use version control. This way, every person working with that project can benefit from Vagrant without any upfront work.

The syntax of Vagrantfiles is Ruby, but knowledge of the Ruby programming language is not necessary to make modifications to the Vagrantfile, since it is mostly simple variable assignment.

# init project
mkdir vagrant_getting_started
cd vagrant_getting_started

# --minimal: generate minimal Vagrantfile
# hashicorp/bionic64: the box name
# if you already have a Vagrantfile, this step is not needed
vagrant init hashicorp/bionic64 [--minimal]

Of course you can also create a Vagrantfile manually, and you can add the box explicitly:

# download box image
# you don't need to explicitly do this, vagrant will handle download from Vagrantfile
vagrant box add hashicorp/bionic64

In the above command, you will notice that boxes are namespaced. Boxes are broken down into two parts - the username and the box name - separated by a slash. In the example above, the username is “hashicorp”, and the box is “bionic64”.

Vagrantfile Example

Edit the Vagrantfile; here is a simple example that brings up a Jenkins cluster:

# -*- mode: ruby -*-
# vi: set ft=ruby :

# image info for all machines
IMAGE_NAME = "generic/centos7"
IMAGE_VERSION = "3.0.10"

# server static ip
SERVER_IP = "192.168.3.2"
# agent static ip, start from 192.168.3.1x
AGENT_IP = "192.168.3.1"

Vagrant.configure("2") do |config|
# box for virtual machines
config.vm.box = IMAGE_NAME
config.vm.box_version = IMAGE_VERSION

# virtualbox configuration for virtual machines
config.vm.provider "virtualbox" do |v|
v.memory = 512
v.cpus = 1
end

# synced folder
config.vm.synced_folder ".", "/vagrant", owner: "root", group: "root"

# Jenkins server, set as primary
config.vm.define "server", primary: true do |server|
server.vm.hostname = "jenkins-server"
# private network
# jenkins uses port 8080 in browser
server.vm.network "private_network", ip: SERVER_IP
# provisioning
server.vm.provision :shell, path: "./provision/server.sh", privileged: true
end

# agents setup
(1..2).each do |i|
config.vm.define "agent#{i}" do |agent|
agent.vm.hostname = "jenkins-agent#{i}"
# private network
agent.vm.network "private_network", ip: "#{AGENT_IP}#{i}"
# provisioning
agent.vm.provision :shell, path: "./provision/agent.sh", privileged: true
end
end
end

Network Config

With private_network there is no need for port forwarding: you can access the service directly by its IP from your host. If you check the VirtualBox configuration, it uses a Host-only adapter.

Please stay within the allowed private network ranges. To test, try Chrome incognito mode or Firefox, or use curl/wget.
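For example, with the Jenkins Vagrantfile above and Jenkins listening on port 8080, a quick check from the host:

# the IP comes from SERVER_IP in the example Vagrantfile
curl -I http://192.168.3.2:8080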

For SSH access between VMs, they must be in the same private network:

# assign private IP
config.vm.network "private_network", ip: "192.168.2.2"

This configuration uses a private network. The VM can be accessed only from another VM that runs Vagrant and within the same range. You won’t be able to connect to this IP address from another physical host, even if it’s on the same network as the VM. However, different Vagrant VMs can connect to each other.

SSH agent forwarding lets you, for example, git clone inside the VM using the host's private key:

config.ssh.forward_agent = true
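A quick way to verify forwarding works, assuming your key is already loaded into the host's ssh-agent:

# on the host: list keys loaded into the agent
ssh-add -l
# inside the guest: the forwarded agent should list the same key
vagrant ssh -c 'ssh-add -l'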

Bring up the environment:

# validate syntax
vagrant validate
# check machine status
vagrant status
# at your project root directory
vagrant up
# ssh to primary machine in your project
vagrant ssh [vm name]

# when you finish working
# destroy machine
# -f: confirm
vagrant destroy -f
# remove the box
vagrant box remove <box name>

Note that by default vagrant ssh logs in as the user vagrant, not root; use sudo to execute commands or run sudo su - first.

SSH to VM

This demystifies how SSH from the host to the VM works.

# vagrant must be running
vagrant ssh-config

Host default
HostName 127.0.0.1
User vagrant
Port 2222
UserKnownHostsFile /dev/null
StrictHostKeyChecking no
PasswordAuthentication no
IdentityFile <absolute path>/.vagrant/machines/<vm name>/virtualbox/private_key
IdentitiesOnly yes
LogLevel FATAL

Vagrant sets up host-to-guest port forwarding on a high port on localhost (e.g., 127.0.0.1:2222 → guest:22). You can also change the default SSH port mapping (see this blog), for example:

# id: "ssh"
# map host port 13001 to guest port 22
config.vm.network :forwarded_port, guest: 22, host: 13001, id: "ssh"

The underlying SSH command can be seen by ps aux | grep ssh:

# -q: quiet mode
ssh vagrant@127.0.0.1 \
-p 2222 \
-o LogLevel=FATAL \
-o Compression=yes \
-o DSAAuthentication=yes \
-o IdentitiesOnly=yes \
-o StrictHostKeyChecking=no \
-o UserKnownHostsFile=/dev/null \
-i <absolute path>/.vagrant/machines/<vm name>/virtualbox/private_key

Vagrant generates a new SSH key pair per VM. You can disable this with insert_key in the Vagrantfile:

config.ssh.insert_key = false

Then Vagrant uses the insecure default key; if you check the underlying ssh command of vagrant ssh, you will see the location of that insecure key:

ps aux | grep ssh
#-i <user home>/.vagrant.d/insecure_private_key

Synced Folder

By using synced folders, Vagrant will automatically sync your files to and from the guest machine. By default, Vagrant shares your project directory (remember, that is the one with the Vagrantfile) to the /vagrant directory in your guest machine.

Synced folders are mapped before provisioning runs.

Vagrant also supports rsync, primarily in situations where other synced folder mechanisms are not available: https://www.vagrantup.com/docs/synced-folders/rsync.html
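For rsync-type synced folders the sync happens at boot; a minimal sketch of the commands to re-sync afterwards:

# push changes from host to guest once
vagrant rsync
# watch the folder and keep re-syncing on change
vagrant rsync-auto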

Vagrant has built-in support for automated provisioning. Using this feature, Vagrant will automatically install software when you vagrant up:

# use Ansible provisioner
VAGRANTFILE_API_VERSION = "2"
Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  config.vm.box = "ubuntu/trusty64"
  config.vm.provision "ansible" do |ansible|
    ansible.playbook = "playbook.yml"
  end
end

Vagrant will not run the provisioner a second time unless you force it.

# reboot machine and reload provision setting if machine is already running
# if no provision needed, just run vagrant reload
# --provision: force run provision again
vagrant reload --provision
# force rerun provision when machine is running
vagrant provision

Ways to tear down the environment:

# hibernate, save states in disk
vagrant suspend
vagrant resume

# normal power off
vagrant halt
# reclaim all resources
vagrant destroy
# boot again
vagrant up

Multi-Machine

This is helpful for cluster setup: https://www.vagrantup.com/docs/multi-machine

Convert ova to box

Convert a Virtualbox ova to a Vagrant box https://gist.github.com/chengdol/315d3cbb83cf224c3b34913095b7fff9

The question is: how can I tell whether a MySQL server is a primary, a replica, or both?

Is It a Replica

SHOW REPLICA STATUS\G

What to look for in the output:

  • Replica_IO_Running: Yes
  • Replica_SQL_Running: Yes

If both of these are Yes, the server is running as a replica and is actively trying to pull and apply changes from a primary. If either is No or Connecting, it’s configured as a replica but replication might be stopped or encountering issues.

  • Source_Host (or Master_Host in older versions): This field will show the hostname or IP address of the server it’s replicating from. If this is populated, it’s a strong indicator that this server is a replica. If it’s empty, it’s not configured as a replica.

If SHOW REPLICA STATUS returns an empty set, it means the server is NOT configured as a replica.

Is It a Master

The primary must have the binlog enabled if it is used as a source for replicas:

mysql> show variables where variable_name in ('log_bin', 'binlog_format');

SHOW MASTER STATUS\G

To see whether it has replicas and replication is running:

SHOW REPLICAS\G

SHOW PROCESSLIST\G

Dual

A server with both properties above, for example a MySQL server used as a cascading replica (a replica that is itself a source).

At its core, a lock is a mechanism that allows a database system (like MySQL) to regulate concurrent access to data. When a transaction or operation needs to read or modify data, it acquires a lock on that data. This lock prevents other transactions from accessing or modifying the same data in a way that would lead to inconsistency or data corruption.

Key Principles of Locking

  • Concurrency Control: Locks ensure that multiple users or processes can access the database simultaneously without interfering with each other’s operations.

  • Data Consistency: They prevent phenomena like dirty reads (reading uncommitted data), non-repeatable reads (reading the same data twice and getting different results because another transaction committed a change in between), and phantom reads (seeing new rows appear or disappear in a range query due to concurrent insertions/deletions).

  • Data Integrity: They protect data from being corrupted by simultaneous updates that could overwrite each other’s changes.

  • Transaction Isolation: Locks are fundamental to implementing ACID properties, specifically “Isolation,” ensuring that concurrent transactions appear to execute serially.

Types of Locks (Simplified for understanding)

MySQL (especially with InnoDB, its primary storage engine) uses a sophisticated locking mechanism. The most common types are:

Shared Locks (S-locks / Read Locks)

  • Acquired when a transaction wants to read data.
  • Multiple transactions can hold shared locks on the same data simultaneously.
  • A shared lock prevents an exclusive lock from being acquired on the same data.
  • Analogy: Multiple people can read the same book at the same time (e.g., in a library reading room), but no one can modify it.

Exclusive Locks (X-locks / Write Locks)

  • Acquired when a transaction wants to modify (insert, update, delete) data.
  • Only one transaction can hold an exclusive lock on particular data at any given time.
  • An exclusive lock prevents any other shared or exclusive locks from being acquired on the same data.
  • Analogy: Only one person can borrow a book and make annotations or changes to it at a time.
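The S and X row locks above can be observed directly with locking reads. A minimal sketch, assuming an InnoDB table named products with an id column (FOR SHARE is MySQL 8.0 syntax; older versions use LOCK IN SHARE MODE):

-- session 1: take a shared (S) lock on one row; other readers proceed, writers wait
START TRANSACTION;
SELECT * FROM products WHERE id = 1 FOR SHARE;

-- session 2: tries to take an exclusive (X) lock on the same row; it blocks until session 1 finishes
START TRANSACTION;
SELECT * FROM products WHERE id = 1 FOR UPDATE;

-- back in session 1: committing releases the lock and lets session 2 continue
COMMIT;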

Lock Granularity

Locks can be applied at different levels of granularity:

  • Row-level locks: The most common and desirable for high concurrency. Only the specific rows being accessed are locked. InnoDB primarily uses row-level locking.

  • Page-level locks: Less granular than row-level, but more granular than table-level.

  • Table-level locks: Locks the entire table, preventing any other operations (reads or writes) on that table. MyISAM primarily uses table-level locking. InnoDB also uses table-level locks for DDL operations (e.g., ALTER TABLE).

  • Metadata Locks (MDL): Protects database objects (tables, functions, etc.) from concurrent DDL and DML operations that would conflict. For example, an ALTER TABLE cannot proceed if there are active queries on the table, and vice-versa.

Lock Commands

It’s crucial to understand that MySQL’s primary storage engine, InnoDB (the default), handles most locking automatically at the row level within transactions.

You rarely explicitly “lock a row” or “lock a page” yourself in application code. The LOCK TABLES statement is an explicit table-level lock that bypasses InnoDB’s finer-grained control and is generally discouraged for high-concurrency applications using InnoDB.

Instance Wide Lock

For replication, we take an instance-wide lock before dump/load because we want to capture the binlog file and position at that point.

# Wait for active transactions to complete or flush caches
mysql> FLUSH TABLES WITH READ LOCK;

# This lock is released automatically when the session that issued the command
# disconnects, so quitting the session is the same as running UNLOCK TABLES
mysql> UNLOCK TABLES;

I tested that after running FTWRL, an insert from a new session hangs (is blocked) until the lock is released.
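A typical flow when preparing a replica, sketched with the SHOW MASTER STATUS command covered later in this post:

mysql> FLUSH TABLES WITH READ LOCK;
mysql> SHOW MASTER STATUS\G
-- record File and Position, take the dump from another session, then:
mysql> UNLOCK TABLES;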

Table Lock

Read Lock

LOCK TABLES products READ;

What it does: Allows the session holding the lock to read from products. Allows other sessions to read from products without explicitly acquiring a lock. Prevents any session (including the one holding the lock) from writing to products.

Behavior for session holding the lock:

  • Can execute SELECT statements on products.
  • Cannot execute INSERT, UPDATE, DELETE on products.

Behavior for other sessions:

  • Can execute SELECT statements on products.
  • INSERT, UPDATE, DELETE statements on products will be blocked (wait) until the lock is released.

Write Lock

LOCK TABLES products WRITE;

What it does: Grants exclusive access to the table for the session holding the lock. The session holding the lock can both read and write. All other sessions are blocked from both reading and writing to products.

Behavior for session holding the lock:

  • Can execute SELECT, INSERT, UPDATE, DELETE statements on products.

Behavior for other sessions:

  • All SELECT, INSERT, UPDATE, DELETE statements on products will be blocked (wait) until the lock is released.

Release Lock

UNLOCK TABLES;

Check Status

mysql> \s
--------------
mysql Ver 14.14 Distrib 5.7.33, for Linux (x86_64) using EditLine wrapper

Connection id: 513
Current database:
Current user: root@localhost
SSL: Not in use
Current pager: stdout
Using outfile: ''
Using delimiter: ;
Server version: 8.0.16 MySQL Community Server - GPL
Protocol version: 10
Connection: Localhost via UNIX socket
Server characterset: utf8mb4
Db characterset: utf8mb4
Client characterset: utf8
Conn. characterset: utf8
UNIX socket: /tmp/mysql.sock
Uptime: 4 days 21 hours 55 min 37 sec

Threads: 3 Questions: 388 Slow queries: 0 Opens: 368 Flush tables: 6 Open tables: 23 Queries per second avg: 0.000
--------------
  • Threads: This indicates the total number of currently connected client threads (connections) to the MySQL server
  • Questions: This is a cumulative counter representing the total number of queries (statements) that the server has executed since it was last started
  • Slow queries: This is a cumulative counter for the number of queries that have taken longer than the long_query_time system variable setting to execute. By default, long_query_time is usually 10 seconds.
  • Opens: This is a cumulative counter for the number of files (tables, logs, etc.) that MySQL has opened.
  • Flush tables: This is a cumulative counter for the number of times a FLUSH TABLES command (or similar flush operation) has been executed. FLUSH TABLES forces MySQL to close all open tables and reload them from disk. This is often done after making changes to table structures, privileges, or for backup purposes.
  • Open tables: This represents the actual number of tables that are currently open in the MySQL table cache. These tables are kept open to avoid the overhead of opening and closing them repeatedly, improving performance.
  • Query per second: It’s calculated as Questions / Uptime (where Uptime is the server’s running time in seconds).
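For the output above, that is 388 questions over an uptime of 4 days 21 hours 55 minutes 37 seconds ≈ 424,537 seconds, which is about 0.0009 queries per second; the client rounds it down to 0.000.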

Check Server ID

mysql> SHOW VARIABLES LIKE 'server_id';

+---------------+-----------+
| Variable_name | Value |
+---------------+-----------+
| server_id | 633025519 |
+---------------+-----------+
1 row in set (0.00 sec)

Each server (source and all replicas) in a replication chain must have a unique server_id. When a source server writes changes to its binary log, it includes its server_id with each event.

Replicas use this ID to:

  • Prevent loops: A replica will skip applying events that originated from a server with its own server_id to avoid infinite loops in multi-master or circular replication topologies.
  • Identify the source: In more complex replication setups, the server_id helps identify which server generated a particular set of changes
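A minimal sketch of assigning a unique ID (an assumption: SET PERSIST requires MySQL 8.0; on older versions, set server_id under [mysqld] in my.cnf and restart; the value 2 is only an example):

mysql> SET PERSIST server_id = 2;
mysql> SHOW VARIABLES LIKE 'server_id';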

Check Users

mysql> SELECT User, Host FROM mysql.user;

+------------------+-----------+
| User | Host |
+------------------+-----------+
| root | % |
| myuser | % |
| mysql.infoschema | localhost |
| mysql.session | localhost |
| mysql.sys | localhost |
| root | localhost |
+------------------+-----------+
6 rows in set (0.00 sec)

Or you can use:

SELECT USER(); 
SELECT CURRENT_USER();

These are the most common and direct ways to see the current user.

  • USER() returns the username and host that the client attempted to authenticate with.
  • CURRENT_USER() returns the username and host that the MySQL server actually authenticated the client connection with. This is the user account that determines your privileges.

They can sometimes be different (e.g., if you tried to connect with a non-existent user, MySQL might connect you as an anonymous user).

Check Active User States

mysql> SELECT Id, User, Host, db, Command, Time, State, Info \
FROM information_schema.processlist\G

*************************** 1. row ***************************
Id: 8433
User: root
Host: localhost
db: NULL
Command: Query
Time: 0
State: executing
Info: SELECT Id, User, Host, db, Command, Time, State, Info FROM information_schema.processlist
*************************** 2. row ***************************
Id: 4
User: event_scheduler
Host: localhost
db: NULL
Command: Daemon
Time: 2859469
State: Waiting on empty queue
Info: NULL
2 rows in set (0.01 sec)

Show DBs/Schemas

mysql> show databases;
mysql> show schemas;

+--------------------+
| Database |
+--------------------+
| animals |
| information_schema |
| mysql |
| performance_schema |
| sys |
+--------------------+
5 rows in set (0.00 sec)

Drop Schema

DROP SCHEMA animals;

List Tables

use <db>;
show tables;

Describe Table

mysql> describe animals.animals;

+---------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------+--------------+------+-----+---------+----------------+
| id | mediumint(9) | NO | PRI | NULL | auto_increment |
| name | varchar(255) | YES | | NULL | |
| species | varchar(255) | YES | | NULL | |
| cute | tinyint(1) | YES | | NULL | |
+---------+--------------+------+-----+---------+----------------+
4 rows in set (0.01 sec)

mysql> SHOW CREATE TABLE animals.animals\G

*************************** 1. row ***************************
Table: animals
Create Table: CREATE TABLE `animals` (
`id` mediumint(9) NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`species` varchar(255) DEFAULT NULL,
`cute` tinyint(1) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)

Drop Table

Please check the table's dependencies first.

DROP TABLE animals.animals;

Create User

For example, create a user for replication purposes while logged in as root (or at least as a user with the CREATE USER privilege):

CREATE USER 'myuser'@'%' IDENTIFIED BY 'YourSecurePasswordHere';

-- with IP range
CREATE USER 'myuser'@'192.168.1.0/24' IDENTIFIED BY 'YourSecurePasswordHere';
CREATE USER 'myuser'@'192.168.1.%' IDENTIFIED BY 'YourSecurePasswordHere';

Grant Privileges

Grant privileges to user:

mysql> GRANT EVENT ON animals.* TO 'myuser'@'%';
mysql> FLUSH PRIVILEGES;

Show Privileges

mysql>  SHOW GRANTS FOR 'myuser'@'%';

+------------------------------------------------------------------------------------------------------------------+
| Grants for myuser@% |
+------------------------------------------------------------------------------------------------------------------+
| GRANT SELECT, RELOAD, EXECUTE, REPLICATION SLAVE, REPLICATION CLIENT, SHOW VIEW, TRIGGER ON *.* TO `myuser`@`%` |
+------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Revoke Privileges

mysql> REVOKE EVENT ON `animals`.* FROM `myuser`@`%`;
mysql> FLUSH PRIVILEGES;

Check Replication

This is important once you have launched replication and want to check its progress or status.

On Source Server

SHOW MASTER STATUS\G

-- Only binlog
*************************** 1. row ***************************
File: binlog.000041
Position: 556
Binlog_Do_DB:
Binlog_Ignore_DB:
Executed_Gtid_Set:
1 row in set (0.00 sec)


-- when GTID is ON
*************************** 1. row ***************************
File: mysql-bin.000451
Position: 194
Binlog_Do_DB:
Binlog_Ignore_DB:
Executed_Gtid_Set: c390f45d-d4b5-11e9-99a5-42010af00084:1-96
1 row in set (0.00 sec)


SHOW VARIABLES LIKE 'server_id';

+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| server_id | 1 |
+---------------+-------+


SHOW PROCESSLIST\G

*************************** 2. row ***************************
Id: 2620
User: myuser
Host: 35.194.40.52:59948
db: NULL
Command: Binlog Dump
Time: 615115
State: Source has sent all binlog to replica; waiting for more updates
Info: NULL


-- when GTID is ON
*************************** 1. row ***************************
Id: 11437
User: myuser
Host: 34.121.192.197:56772
db: NULL
Command: Binlog Dump GTID
Time: 800846
State: Master has sent all binlog to slave; waiting for more updates
Info: NULL

On Replica

SHOW SLAVE STATUS\G
mysql> SHOW REPLICA STATUS\G

*************************** 1. row ***************************
Replica_IO_State: Waiting for source to send event
Source_Host: 35.204.135.141
Source_User: speckle
Source_Port: 3306
Connect_Retry: 60
Source_Log_File: mysql-bin.000009
Read_Source_Log_Pos: 194
Relay_Log_File: relay-log.000021
Relay_Log_Pos: 410
Relay_Source_Log_File: mysql-bin.000009
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table: mysql.%
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Source_Log_Pos: 194
Relay_Log_Space: 701
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Source_SSL_Allowed: No
Source_SSL_CA_File:
Source_SSL_CA_Path:
Source_SSL_Cert:
Source_SSL_Cipher:
Source_SSL_Key:
Seconds_Behind_Source: 0
Source_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Source_Server_Id: 1
Source_UUID: c390f45d-d4b5-11e9-99a5-42010af00084
Source_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Replica_SQL_Running_State: Replica has read all relay log; waiting for more updates
Source_Retry_Count: 86400
Source_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Source_SSL_Crl:
Source_SSL_Crlpath:
Retrieved_Gtid_Set: c390f45d-d4b5-11e9-99a5-42010af00084:2
Executed_Gtid_Set: 7b5e76d5-4971-11f0-8015-42010a400007:1-26,
c390f45d-d4b5-11e9-99a5-42010af00084:1-2
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Source_TLS_Version:
Source_public_key_path:
Get_Source_public_key: 0
Network_Namespace:
1 row in set (0.00 sec)

Replica_IO_Running (I/O Thread / Receiver Thread):

This thread is responsible for connecting to the source server and reading the binary log events. Think of it as the “data fetcher.” It streams the changes (SQL statements, row changes, etc.) from the source’s binary logs.

Once received, the I/O thread writes these events to a local file on the replica called the relay log. The relay log acts as a temporary cache of the binary log events from the source.

If a replica is offline for a period, the I/O thread can quickly retrieve all the accumulated binary log events from the source once it reconnects, even if the SQL thread takes longer to process them.

Replica_SQL_Running (SQL Thread / Applier Thread):

This thread is responsible for reading the events from the relay log (written by the I/O thread) and executing them on the replica’s database. Think of it as the “data applier.”

Its goal is to apply the changes to the replica’s data as quickly and efficiently as possible to keep the replica synchronized with the source.
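The two threads can be stopped and started independently, which is useful when diagnosing lag. A minimal sketch using MySQL 8.0.22+ syntax (older versions use STOP/START SLAVE with the same thread options):

mysql> STOP REPLICA IO_THREAD;
mysql> START REPLICA IO_THREAD;
mysql> STOP REPLICA SQL_THREAD;
mysql> START REPLICA SQL_THREAD;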

Check Binlog

-- log_bin_basename is the binlog file path
mysql> SHOW VARIABLES LIKE 'log_bin%';
+---------------------------------+-----------------------------+
| Variable_name | Value |
+---------------------------------+-----------------------------+
| log_bin | ON |
| log_bin_basename | /var/lib/mysql/binlog |
| log_bin_index | /var/lib/mysql/binlog.index |
| log_bin_trust_function_creators | OFF |
| log_bin_use_v1_row_events | OFF |
+---------------------------------+-----------------------------+
5 rows in set (0.00 sec)

mysql> show variables where variable_name in \
('log_bin', 'binlog_format');
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| binlog_format | ROW |
| log_bin | ON |
+---------------+-------+
2 rows in set (0.00 sec)


-- if you enabled binlog, the retention is set automatically
-- check binlog retention binlog_expire_logs_seconds is for latest version
mysql> show variables where variable_name in \
('binlog_expire_logs_seconds', 'expire_logs_days');
+----------------------------+---------+
| Variable_name | Value |
+----------------------------+---------+
| binlog_expire_logs_seconds | 2592000 |
| expire_logs_days | 0 |
+----------------------------+---------+
2 rows in set (0.00 sec)

mysql> show binary logs;
+---------------+-----------+-----------+
| Log_name | File_size | Encrypted |
+---------------+-----------+-----------+
| binlog.000017 | 201 | No |
| binlog.000018 | 201 | No |
| binlog.000019 | 201 | No |
| binlog.000020 | 201 | No


-- check the location of bin log file and the file name pattern:
mysql> show variables like '%log_bin%';
+---------------------------------+---------------------------------------+
| Variable_name | Value |
+---------------------------------+---------------------------------------+
| log_bin | ON |
| log_bin_basename | /usr/local/mysql/data/mysql-bin |
| log_bin_index | /usr/local/mysql/data/mysql-bin.index |
| log_bin_trust_function_creators | OFF |
| log_bin_use_v1_row_events | OFF |
| sql_log_bin | ON |
+---------------------------------+---------------------------------------+

Inspect Binlog Event

You can inspect binlogs with the mysqlbinlog CLI, but we can also do it from SQL; reference: the CSQL PITR public document.

SHOW BINARY LOGS;
+---------------+-----------+-----------+
| Log_name | File_size | Encrypted |
+---------------+-----------+-----------+
| binlog.000031 | 809 | No |
| binlog.000032 | 201 | No |
| binlog.000033 | 2948 | No |
| binlog.000034 | 519 | No |
| binlog.000035 | 201 | No |
| binlog.000036 | 201 | No |


SHOW BINLOG EVENTS IN 'binlog.000033' limit 10;
+---------------+-----+----------------+-----------+-------------+----------------------------------------------------+
| Log_name | Pos | Event_type | Server_id | End_log_pos | Info |
+---------------+-----+----------------+-----------+-------------+----------------------------------------------------+
| binlog.000033 | 4 | Format_desc | 1 | 126 | Server ver: 8.0.42-0ubuntu0.20.04.1, Binlog ver: 4 |
| binlog.000033 | 126 | Previous_gtids | 1 | 157 | |
| binlog.000033 | 157 | Anonymous_Gtid | 1 | 236 | SET @@SESSION.GTID_NEXT= 'ANONYMOUS' |
| binlog.000033 | 236 | Query | 1 | 307 | BEGIN |
| binlog.000033 | 307 | Table_map | 1 | 505 | table_id: 523 (mysql.user) |
| binlog.000033 | 505 | Delete_rows | 1 | 693 | table_id: 523 flags: STMT_END_F |
| binlog.000033 | 693 | Xid | 1 | 724 | COMMIT /* xid=1459 */ |
| binlog.000033 | 724 | Anonymous_Gtid | 1 | 801 | SET @@SESSION.GTID_NEXT= 'ANONYMOUS' |
| binlog.000033 | 801 | Query | 1 | 912 | drop schema animals /* xid=1621 */ |
| binlog.000033 | 912 | Anonymous_Gtid | 1 | 989 | SET @@SESSION.GTID_NEXT= 'ANONYMOUS' |
+---------------+-----+----------------+-----------+-------------+----------------------------------------------------+
10 rows in set (0.00 sec)


SHOW BINLOG EVENTS IN 'binlog.000033' from 157 limit 3;
+---------------+-----+----------------+-----------+-------------+--------------------------------------+
| Log_name | Pos | Event_type | Server_id | End_log_pos | Info |
+---------------+-----+----------------+-----------+-------------+--------------------------------------+
| binlog.000033 | 157 | Anonymous_Gtid | 1 | 236 | SET @@SESSION.GTID_NEXT= 'ANONYMOUS' |
| binlog.000033 | 236 | Query | 1 | 307 | BEGIN |
| binlog.000033 | 307 | Table_map | 1 | 505 | table_id: 523 (mysql.user) |
+---------------+-----+----------------+-----------+-------------+--------------------------------------+
3 rows in set (0.00 sec)
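For completeness, a minimal sketch of the mysqlbinlog CLI mentioned above; the file path is an assumption based on the log_bin_basename shown earlier:

# decode row events into readable pseudo-SQL
mysqlbinlog --base64-output=decode-rows -vv /var/lib/mysql/binlog.000033 | less

# limit output to a position range seen in SHOW BINLOG EVENTS
mysqlbinlog --start-position=157 --stop-position=505 /var/lib/mysql/binlog.000033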

Check GTID is ON

mysql> show variables where variable_name in \
('gtid_mode','enforce_gtid_consistency');

+--------------------------+-------+
| Variable_name | Value |
+--------------------------+-------+
| enforce_gtid_consistency | ON |
| gtid_mode | ON |
+--------------------------+-------+
2 rows in set (0.01 sec)

Check MyISAM vs InnoDB

One server can have both MyISAM and InnoDB for different tables.

SELECT count(table_name) as table_count FROM information_schema.tables tab
WHERE engine != 'InnoDB'
AND table_type = 'BASE TABLE'
AND table_schema not in ('information_schema','sys','performance_schema','mysql');

This quick note is mainly about the ctr CLI. Note that ctr is made for debugging containerd; it doesn't support all the features you may be used to from Docker, such as port publishing, automatic container restart on failure, or browsing container logs.

Recommended containerd course: link.

Containerd Daemon

Assume you have containerd installed:

sudo systemctl status containerd

● containerd.service - containerd container runtime
Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; preset: disabled)
Active: active (running) since Sat 2025-06-14 22:45:52 UTC; 1 week 0 days ago
Docs: https://containerd.io
Main PID: 461 (containerd)
Tasks: 256
Memory: 1.5G
CPU: 4h 43min 43.197s
CGroup: /system.slice/containerd.service

Namespace

You can have both Docker and containerd containers running on the same server; Docker also uses containerd as its underlying runtime. If you have ctr available, you can examine the relationship:

$ ctr ns ls
NAME LABELS
ns1
ns2

$ ctr -n ns2 c ls
CONTAINER IMAGE RUNTIME
742bd6fc9cd8efdac4dbdf1bd302e4c9ecd1d259224596746939ef5ae0167d47 - io.containerd.runc.v2
ebe97a5c7a289273ce333dae921b7e4f8931cd5edc27749c8b6d445867935a12 - io.containerd.runc.v2

$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ebe97a5c7a28 us-docker.pkg.dev/demo/images/example:latest "/mysql-scripts/start" 7 days ago Up 7 days mysqld
742bd6fc9cd8 us-docker.pkg.dev/demo/images/example:latest "/proxy_server -logt…" 7 days ago Up 7 days proxy_server

Container and Task

The separation of containers and tasks in ctr (and in containerd’s architecture) might seem a bit different from Docker’s more unified view of a running container.

A container in containerd is primarily an isolated metadata and configuration entity. It doesn't have a running process (task) associated with it yet. So you can have a container in containerd without any live task (the init process) running; it is just metadata.

Note that in containerd one container has only one task (the init process), while one task can have more than one process!

Commands

Image Pull

Let's walk through an example demonstrating containers vs. tasks in containerd. I have a ctr alias set up as below for the target namespace:

alias ctr='sudo ctr -n ns1'

Note that the image is also associated with a namespace! You also have to pull it first; ctr won't do it automatically when you create a container:

$ ctr image pull docker.io/library/alpine:latest

List the local images:

# -q: only show image path
ctr i ls -q

docker.io/library/alpine:latest

Inspect Image Internals

Instead of running a container with a task and exec'ing into it, one way to check the image internals (e.g., the built-in files) is to mount the image to a local folder and inspect that folder:

$ mkdir /tmp/agent_rootfs
$ ctr i mount docker.io/library/alpine:latest /tmp/agent_rootfs

# now examine the alpine image internals
$ ls /tmp/agent_rootfs
bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var

# unmount
$ ctr i unmount /tmp/agent_rootfs

Create Container

Then create a container named test and give it a default process /bin/sh. If you don't specify a command, it will use the default command from the image's configuration (which might just exit immediately if it's not designed to run indefinitely).

ctr c create docker.io/library/alpine:latest test /bin/sh

Check the container is created:

$ ctr c ls

test docker.io/library/alpine:latest

You can view the /bin/sh argument we specified in the args field:

ctr c info test | grep -A3 args

Sometimes you want to check the container start timestamp:

ctr c info test | jq -r '.UpdatedAt'

Note that no task is running for the test container so far:

ctr t ls | grep test

Start Container

Now start the container in detached mode; this starts the initial task associated with the container test. By default the init task ID is the same as the container ID, and /bin/sh is the init process of the task:

$ ctr t start -d test

$ ctr t ls | grep test
test 125090 RUNNING

Create Child Processes

Let’s create 3 processes for task test:

ctr t exec -d  --exec-id pro1 test /bin/sh
ctr t exec -d --exec-id pro2 test /bin/sh
ctr t exec -d --exec-id pro3 test /bin/sh

Now check the processes in container test; note we still have only one task (ctr t ls | grep test):

$ ctr t ps test

PID INFO
125090 -
125191 exec_id:"pro1"
125221 exec_id:"pro2"
125250 exec_id:"pro3"

The PIDs here (125191, 125221, 125250) are the host PIDs of the processes, so you can kill them from the host; ps aux | grep can also be used to find them on the host VM. To kill the process pro1:

sudo kill -9 125191

Exec into Container

To exec into the container, you need to specify a process name:

# foo is the exec process ID
ctr t exec -t --exec-id foo test sh

Or use a random exec ID:

ctr t exec -t --exec-id ${RANDOM} test sh

Remove

Finally, kill the task test:

ctr t kill -s 9 test

Clean up the task test:

ctr t rm -f test

Clean up the container test:

ctr c rm test

Containerd Container Lifecycle Management

As mentioned, containerd/ctr does not have a restart-on-failure option like Docker; one way to restart on failure is to set up a systemd service, for example:

$ systemctl status example.service

● example.service - Example: app
Loaded: loaded (/etc/systemd/system/example.service; disabled; preset: disabled)
Active: active (running) since Sat 2025-06-14 22:46:54 UTC; 1 week 1 day ago
Main PID: 2965 (bash)
Tasks: 2 (limit: 17980)
Memory: 2.0M
CPU: 45.258s
$ cat /etc/systemd/system/example.service

[Unit]
Description=Example: app
Wants=example-setup-vmparams-noncritical.service
After=example-setup-vmparams-noncritical.service
[Service]
User=example
Restart=always
RestartSec=5
StartLimitBurst=10000
ExecStart=/bin/bash /var/lib/example/noncritical/bin/manage.sh start
ExecStop=-/bin/bash /var/lib/example/noncritical/bin/manage.sh stop
[Install]
WantedBy=example-non-critical.target

There is a good free course with informative visualizations for containerd: link. I reference some images from it in this blog for personal learning purposes.

Core Concepts

  • Containerd uses namespaces to provide isolation for different sets of containers and resources.

  • A task represents a running process within a container. A single container can have one init task running inside it.

  • Containerd maintains a local store (usually a SQLite database) to keep track of the state of all the objects it manages: namespaces, containers, images, tasks, and snapshots.

  • Containerd has a plugin-based architecture. This allows for extending its functionality and integrating with other systems.

Runc to Containerd

containerd cannot run containers on its own. To put it simply, runc is a command-line tool that knows how to create, start, stop, and delete containers given a container configuration and a root filesystem.

Docker (through containerd), Podman, Kubernetes, and other “higher-level” container runtimes and orchestrators under the hood rely on runc (or an alternative OCI Runtime implementation) to run containers.
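A minimal sketch of driving runc directly, which is roughly what containerd does under the hood (assumes runc is installed; the bundle path and container name are arbitrary examples):

# prepare an OCI bundle: a directory with a rootfs inside
mkdir -p /tmp/bundle/rootfs
# copy an extracted root filesystem into rootfs, e.g. the alpine files mounted earlier with ctr i mount
cd /tmp/bundle
# generate a default OCI runtime config.json
runc spec
# create and start a container named demo, then list containers known to runc
sudo runc run demo
sudo runc list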

CNI to Containerd

CNI is the network plugin framework used by containerd.

This blog is about gcloud configuration setup and management, recapped from my daily work. These are useful, especially api_endpoint_overrides, so I don't need to run the lengthy API call counterparts.

List gcloud configuration

You may have multiple gcloud configurations:

$ gcloud config configurations list

NAME IS_ACTIVE ACCOUNT PROJECT COMPUTE_DEFAULT_ZONE COMPUTE_DEFAULT_REGION
env-139 False chengdol@example.com chengdol-demo
env-149 True chengdol@example.com chengdol-demo
default False chengdol@example.com chengdol-demo

Activate gcloud configuration

To activate the target configuration:

gcloud config configurations activate env-149

Describe activated gcloud configuration

Check the active gcloud config:

$ gcloud config list

[api_endpoint_overrides]
sql = https://env-149.sandbox.googleapis.com/
[billing]
quota_project = other-project
[core]
account = chengdol@example.com
disable_usage_reporting = True
project = chengdol-demo

Your active configuration is: [env-149]

Or using:

$ gcloud config configurations describe env-149
is_active: true
name: env-149
properties:
api_endpoint_overrides:
sql: https://env-149.sandbox.googleapis.com/
billing:
quota_project: other-project
core:
account: chengdol@example.com
project: chengdol-demo

  • api_endpoint_overrides: enables you to run gcloud sql against the overridden endpoint (e.g., a development backend).

  • quota_project: for quota control; you can update it with:

gcloud config set billing/quota_project other-project
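Similarly, the endpoint override itself can be set or cleared per configuration; a minimal sketch using the sandbox URL from this example:

# point the sql surface at the sandbox endpoint
gcloud config set api_endpoint_overrides/sql https://env-149.sandbox.googleapis.com/
# remove the override and fall back to the default endpoint
gcloud config unset api_endpoint_overrides/sql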

Delete gcloud configuration

gcloud config configurations delete env-149

  1. Master In-House Tools
  • In-house ChatGPT can help this learning journey.
  • Deepen expertise in the company's in-house tools to accelerate system design.
  • Leverage this knowledge to make well-informed tool selection proposals.
  2. Strengthen System Design Skills
  • Focus on system design principles rather than being constrained by programming languages.
  • Utilize AI tools to speed up coding tasks and free up time for higher-level design work.
  3. Be More Assertive/Demanding in Work and Projects
  • Evaluate tasks critically: Does this align with your level? Does it make sense? Push back when necessary:
    • Clarify scope (overall boundaries) before committing.
      • Objectives – What is the goal of the task or project?
      • Deliverables – What exactly needs to be done?
      • Constraints – Are there limitations in terms of time, resources, or technology?
      • Dependencies – Does this work rely on other teams or systems?
      • Priority – Is this work aligned with business or team priorities (e.g., P0 S1 OKR)?
    • Avoid accommodating requests unless they are OKR-aligned.
    • Involve your manager when necessary to prioritize requests.
  • Take ownership of L5-level design work.
  • When working as a team, lead significant portions.
  4. Advocate for Your Work, Beat Your Drum, and Manage Up
  • Learn to sell your work and communicate its impact effectively.
  • Build strong support for performance reviews by ensuring influential people recognize your contributions.
  • Improve status updates to your manager using a structured approach:
    • Red, Yellow, Green status indicators.
    • Clearly state risks, timelines, and actions.
  5. Step Out of Your Comfort Zone
  • Continuously acquire new skills.
  • Stay updated on industry trends to remain competitive.
  6. Think About Career Growth
  • Define your next career steps and proactively work toward them.
  7. Be More Aware of Team and Organizational Context
  • Pay attention to the projects and priorities of others in the organization.
  8. Apply the ‘5 Whys’ Method
  • When receiving a request, ask “Why?” five times to uncover the root need before acting.
  9. Be Proactive in Working with PMs/TPMs
  • Don't wait for responses; schedule meetings or reach out directly to ensure alignment.
  • PMs/TPMs usually are not responsive.
  10. Factor in Lead Time for Partner Teams
  • Plan ahead and engage partner teams early.
    • For example, for 2025 S1 planning, reach out at least 2 weeks in advance to finalize planning/prioritization.

[ ] I want to build my own screencast extension linked to my GCS, so the link can be rendered by a markdown tag in my blog. [ ] A "Chinese vs. foreign mythology battle" story, for a global audience

You can watch the YouTube videos to learn how to use them:

  • chatgpt: advanced data analysis

  • claude artifacts: create web apps, etc.; practice with it, and the result can be published to run on the web for everyone to access

  • cursor: composer manages your local codebase, using claude remotely

    • pricing: $20/month
    • can be used to write code in context
    • explains code for you, for example, from open source
    • compare with github copilot: price, features, etc.
  • true voice mode in ChatGPT, not just the voice -> text tokens

  • grok voice mode is interesting; it can play many roles

  • google notebooklm: read and extract

    • generate podcast
  • image input

    • show in text to confirm the model sees the image correctly
    • in ChatGPT, you can paste a screenshot directly into it
  • image output:

  • Video input:

    • maybe the model is still processing a stream of images from the video
  • Video output:

  • ChatGPT memory, you need to instruct it:

    • please remember it to our conversation context.
    • ChatGPT will store it in its memory bank about you, a separate DB.
  • You can personalize the ChatGPT globally in settings.

  • Create your own custom ChatGPT with instructions.

    • so LLM has the instructions as context to handle your questions
      • give the LLM example question/answer formats, etc.
      • for example: break Korean into words, making it easy to create flashcards later

Course Resources

Pre-Training and Base Model

The pre-training stage gives the model world knowledge; this stage is usually the most time-consuming, taking months to train.

Data Preprocessing

The ChatGPT vendors all have something similar to FineWeb to collect data from the internet.

Pay attention to the FineWeb recipe pipeline for data preprocessing.

Tokenization

To see how ChatGPT tokenizes input text, try the online tiktokenizer for visualization; this is very important and is the fundamental step.

Neural Network Training

The LLM neural network transformer visualization tool.

The practice of reproducing GPT-2.

Tip: During the training of the big neural network, you can print the generated text every 20 training steps, and you will see how the model gradually improves at predicting the next words.

Then the lecturer moves to Hyperbolic to try inference from a base model; he uses Llama as the example.

Now we have a base model, but it is still just an internet document simulator: it generates text sequentially from the initial input, whereas we actually want to build an assistant system that answers questions.

How Does Neural Network Hold knowledge?

LLMs store learned knowledge in the model’s parameters (weights), not in a database or memory. During training, the model adjusts its millions (or even billions) of weights to recognize patterns, relationships, and structures in language.

  • Neural Network Weights: These weights encode statistical relationships between words, phrases, and concepts.
  • Hidden Representations: The model learns abstract representations of language, enabling it to generate relevant responses based on context.
  • No Direct Storage of Training Data: The model doesn’t store exact documents or books but compresses useful patterns and knowledge into its weights.

When you ask a question, the model doesn’t “look up” an answer from storage. Instead, it generates responses dynamically based on learned patterns. The model predicts the most probable next words given the input, guided by the patterns encoded in its weights.

Objective of Pre-Training Stage

The model is trained on a massive dataset (books, articles, code, etc.), learning to predict missing words, the next word, or even reconstruct corrupted text.

Common pre-training objectives:

  • Causal Language Modeling (CLM): Predict the next token (used in models like GPT).
  • Masked Language Modeling (MLM): Predict missing words in a sentence (used in BERT).
  • Sequence-to-Sequence Learning: For tasks like translation (used in models like T5).

What Does the Model Learn?

  • Statistical patterns in language: It learns which words/tokens frequently appear together.
  • Syntax and grammar: It picks up grammatical structures by learning associations between words.
  • Semantics and meaning: The model develops an understanding of concepts through word embeddings (e.g., “Paris” is related to “France”).
  • World knowledge: It passively absorbs factual information from its dataset.
  • Basic reasoning: By recognizing complex relationships, it can perform simple inference.

Post-Training

We call this stage supervised finetuning, with human curation, showing the model problems and their demonstrated solutions for it to imitate.

We need to provide a conversation (prompt + answer) dataset and then train the model, but it takes much less time compared with the pre-training stage.

  • Human labelers are employed (it can be done by software as well) to create these conversations, for example coming up with the prompt and the ideal response; the InstructGPT paper describes this.

  • There are open-source reproductions of the conversation training datasets created by human labelers, and UltraChat can help with multi-round dialog data; so in fact you are actually talking to a simulation of human labelers, not magic AI.

  • Tokenization of the conversations: similar to a TCP packet structure, we define a structure to encode the conversations before feeding them to the model, for example the GPT-4 tokenizer; here you will see special tags like “<|im_start|>” to group the content.

Hallucination

Hallucination is when the model does not know the answer and produces a fabricated one; it can be mitigated:

  • Use model interrogation to discover model’s knowledge, and programmatically augment its training dataset with knowledge-based refusals in cases where the model doesn’t know.

  • Allow the model to search via search-trigger tokens when it doesn't know the answer; how the search-trigger token gets used is also trained via the dataset.

    • You can explicitly tell LLM to use/not to use any tool.

Knowledge vs Working Memory

  • Knowledge in the parameters (weights): this is a vague recollection (e.g., of something you read a month ago).

  • Knowledge in the tokens of the context window: this is the working memory.

Knowledge of self

This is still achieved via a dataset that trains the model to have a self-identity, for example: https://huggingface.co/datasets/allenai/olmo-2-hard-coded. The model knows nothing about itself without this dataset.

Model needs tokens to think

For example, for a math problem, you need to train the model to use context to infer the result, and distribute the reasoning/computation before the final answer.

Don't immediately give the answer in a short sentence at the beginning; this does not help model training. If you give the answer first in the dataset, the model will try to justify it afterward.

Question:

I buy 3 apples and 2 oranges. Each orange costs $2, and the total cost is $13.
What is the cost of apples?

Bad answer:

The answer is $3. This is because 2 oranges at $2 are $4 total. So 3 apples cost
$9, and therefore each apple is 9/3=$3.

Good answer:

The total cost of the 2 oranges is $4. 13-4=$9, the cost of 3 apples is $9. 9/3=3
so each apple costs $3. The answer is $3.

To be less error-prone, you can ask model to “use code/tool”, rather than computing mentally.

Because of the token nature, the model is not good at counting/spelling; try “use code/tool” to get the right answer when possible.

Reinforcement Learning

In this stage, we move the SFT (supervised finetuning) model to reinforcement learning, the last major stage of training. Basically what we do is: prompt, practice, trial & error until reaching the correct answer.

For example, given the problem statement (prompt) and the final answer, we generate 15 solutions; only 4 of them get the right answer. We pick the top solution based on some criteria and train on it, repeating many times to encourage the model to produce such tokens.

DeepSeek-R1 published its reinforcement learning approach, which drew public attention because RL details are kind of a secret within AI companies.

For example, ChatGPT's o3-mini, o3-mini-high, and o1 are all RL models; the previous ones are just SFT models.

RL can go beyond human expertise.

RL in Un-verifiable Domains

For example, “write a joke about pelicans”, how do we score the answer?

We need a scalable approach: RLHF (reinforcement learning from human feedback). The core ideas are:

  1. Take a small portion of result, human order them from best to worst.
  2. Train a neural net simulator of human preferences (“reward model”).
  3. Run RL as usual, but using the simulator instead of humans.

RL Downside

RL discovers ways to “game” the lossy simulation of humans.

For example, after 1000 updates, the top joke about pelicans is not what you want, but something totally nonsensical like “the the the the the”; this kind of input is not in the reward model's training set and happens to get a high score.

So you cannot run RL indefinitely in un-verifiable domains.

About Knowledge Distillation

Distillation in LLMs refers to a technique called knowledge distillation, which is used to train a smaller, more efficient model (called the student model) by transferring knowledge from a larger, more powerful model (the teacher model). The goal is to retain most of the teacher model’s performance while reducing computational costs.

Approaches to Distillation

If You Own the Teacher Model (Full Access)

  • You can directly use its logits (probability distributions over outputs) or intermediate layer representations to guide the student model’s training.
  • You can access its training data and generate additional “soft labels” for better supervision.

If You Don’t Own the Teacher Model (Black-Box Distillation)

  • You can use the API of the teacher model (if available) to query it and collect outputs (e.g., responses or probabilities).
  • This is often called zero-shot or black-box distillation, where you use the teacher’s responses to fine-tune a smaller model.
  • A famous example is training smaller models based on OpenAI’s GPT-4 responses without having access to GPT-4’s internals.

Limitations of Black-Box Distillation:

  • You are limited to what the API provides (e.g., if it only gives text responses and not token probabilities, the student learns less fine-grained knowledge).
  • It can be expensive if querying a paid API.

Pros and Cons of Distillation

  • Efficiency & Performance
    • Pros: produces a smaller, faster model with similar performance to the larger teacher model; reduces computational cost and memory usage.
    • Cons: the student model usually cannot match the teacher model's full performance; some knowledge is inevitably lost during distillation.
  • Training Cost
    • Pros: requires fewer resources compared to training from scratch; leverages the pre-trained teacher model to guide learning.
    • Cons: training the student model still requires significant compute, especially if distilling from a very large teacher model; if using a black-box API, querying the teacher model can be costly.
  • Data Dependency
    • Pros: can work without access to the original training data of the teacher model (if using API-based distillation).
    • Cons: if training data is available, distillation is more effective, but obtaining high-quality labeled data can be expensive.
  • Flexibility
    • Pros: can be used with various architectures, allowing compression of transformer-based models like GPT, BERT, etc.; can be applied to different NLP tasks (e.g., text generation, classification).
    • Cons: some architectures may not benefit as much from distillation; requires careful tuning of hyperparameters to balance knowledge transfer.
  • Inference Speed
    • Pros: leads to much faster inference, making LLMs deployable on edge devices or mobile platforms; reduces latency in real-time applications (e.g., chatbots, search engines).
    • Cons: the trade-off between speed and accuracy needs to be balanced; aggressive compression can degrade quality.
  • Knowledge Transfer
    • Pros: allows a smaller model to capture soft labels and knowledge (such as uncertainty and hidden patterns) from a larger model.
    • Cons: some complex reasoning or long-context dependencies from the teacher model may not transfer well.
  • Accessibility
    • Pros: if a teacher model is available via API, distillation can be done without full access to the source code or training data.
    • Cons: black-box distillation is limited by what the API exposes (e.g., no access to logits or internal activations).
  • Security & Privacy
    • Pros: can be used to create private models without exposing original training data; helps in model compression for on-premises deployment.
    • Cons: if distilling from an API-based teacher model, there is potential for bias transfer or unintentional memorization of sensitive data.
  • Adaptability
    • Pros: the student model can be fine-tuned on specific domains (e.g., legal, medical) after distillation.
    • Cons: if the teacher model updates frequently, the distilled model may become outdated unless re-distilled.

Preview of Things to Come

  • multimodal (not just text but audio, images, video, natural conversations)
  • tasks -> agents (long, coherent, error-correcting contexts)
  • pervasive, invisible
  • computer-using
  • test-time training, etc

Where to Keep Track of Them

  1. https://lmarena.ai/?leaderboard, but don’t be too serious about the ranking.
  2. https://buttondown.com/ainews
  3. X / Twitter
  4. Run model locally: https://lmstudio.ai/