Revisit on 2023 March because of SPLA.
Intro
Cassandra quick started with
single docker node. You can extend it to test any Cassandra client, for example
gocql
, build the app with golang image and bring it up in the same docker
network.
Cassandra basics to understand consepts such as ring(cassandra cluster, masterless), horizontal scaling(aka scale-out), partitioning, RF(replication factor), CL(consistency level), quorum(RF/2+1), CAP(cassandar by default AP).
How the quorum is calculated.
Replication Strategy
Keyspace
: a namespace that defines data replication on nodes, one keyspace may
have mutliple related tables, the Replicatin Strategy
is keyspace-wide:
- simply strategy for single data center
- network topology strategy for multiple data centers
Data center can have multiple racks.
For example, one data center has 2 racks, rack1 has 2 nodes, rack2 has one node, if the simply strategy is 1, then rack1 owns 50% data, each node in rack1 owns 25%, rack2 owns 50%, since rack2 only contains 1 node, so that node owns 50%.
Tunable Consistency
Coordinator Node
: client connect to perform actions. Each connection to
Cassandra may have a different coordinator node, any node can be the coordinator.
You can configure consistency on a cluster, datacenter, or per individual read or write operation. see this doc for details.
Consistency level for write:
ONE
,TWO
,THREE
QUORUM
(majority of nodes succeeds)ALL
(must all good)ANY
(include coordinator itself).
Hinted Handoff
: when one write node is unavaiable, the data is written to
coordinator node, the coordinator node will try repeatedly write to the
unavailable node until succeeded.
Consistency level for read: how many nodes to consult to return the most current data to caller.
SERIAL
: see this docONE
,TWO
,THREE
QUORUM
(majority of nodes succeeds)ALL
(must all good)
Read Repair
: 当对一个node写失败了but back online later,在read时如果有多个replicas
的数据可以参考,则对那个node可重新写入上次wirte失败的数据. Run nodetool repair
periodically will resolve the inconsistencies in cluster.
Achieving strong consistency:
- write consistency + read consistency > replication factor
Multiple data center consistency level:
EACH_QUORUM
LOCAL_QUORUM
: local means current coordinator node data centerLOCAL_ONE
: the same asONE
Partition, Composite and Clustering Key
To correctly run cql, especially the order by
, you need to understand how to
define primary key and use partition, composite and clustering
keys efficiently:
https://www.baeldung.com/cassandra-keys
CQL
A single Cassandra docker node is enough for CQL.
Cassandra Query Language:
- https://cassandra.apache.org/doc/latest/cql/index.html
- Keyspace, table and basic data type
- CRUD operations
- Counters
- Aggregate functions
With cqlsh
script, you can specify remote Cassandra node with port to connect,
by default it will connect to localhost 9042 port.
Keyspace -> Tables -> partitions -> row.
In brief, each table requires a unique primary key
. The first field listed is
the partition key
, since its hashed value is used to determine the node to
store the data. If those fields are wrapped in parentheses then the partition
key is composite. Otherwise the first field is the partition key. Any fields
listed after the primary key are called clustering columns
. These store data
in ascending or descending order within the partition for the fast retrieval of
similar values. All the fields together are the primary key.
1 | ## help |
Create a keyspace with:
1 | create keyspace pluralsight with replication = {'class':'SimpleStrategy', 'replication_factor':1}; |
Create a table in this keyspace with:
1 | use pluralsight; |
Optionally attempt to create the table again with:
1 | create table if not exists courses (id varchar primary key); |
(and note that you will not get an error as long as the ‘if not exists’ is present)
Add a few columns to the courses table with:
1 | alter table courses add duration int; |
Add a comment to the table with:
1 | alter table courses with comment = 'A table of courses'; |
View the complete table and all its default properties with:
1 | -- describe |
Drop and recreate a more complete courses table with:
1 | drop table courses; |
(Note that when entering the lines as above cqlsh will automatically detect a multi-line CQL statement)
Exit cqlsh:
1 | exit |
Load course data by running a series of CQL commands from an external file
1 | cat courses.cql | cqlsh |
Verify that the CQL commands in the file were indeed executed:
1 | use pluralsight; |
(The ‘desc tables’ should show a single ‘courses’ table, and the ‘select’ statement should show 5 rows of sample data.)
The ‘expand’ cqlsh command will display the query results in a ‘one column per line’ format:
1 | -- pretty format |
You can display the time a piece of data was written with the ‘writetime’ function:
1 | select id, cc, writetime(cc) from courses where id = 'advanced-javascript'; |
We can update this cc column with an ‘update’ statement:
1 | update courses set cc = true where id = 'advanced-javascript'; |
Now re-run the select statement containing the ‘writetime’ function and notice that the time has changed. You can prove to yourself that this write time is stored on a per column basis by selecting this for a different column:
1 | select id, name, writetime(name) from courses where id = 'advanced-javascript'; |
Note that this writetime value is the same as the one returned by our first ‘cc’ query.
Cassandra also provides a function for returning the token associated with a partition key:
1 | select id, token(id) from courses; |
If you try to select from a column other than the primary key, you’ll get an error:
1 | select * from courses where author = 'Cory House'; |
(We’ll show how to do this in a later module.)
Let’s create a users table:
1 | create table users ( |
Then we’ll insert and “upsert” two rows of data:
1 | insert into users (id, first_name, last_name) values ('john-doe', 'John', 'Doe'); |
(Note that the net effect of the insert and update are the same.)
Now we’ll add a new ‘reset_token’ column to this table, and add a value to this column with a TTL:
1 | alter table users add reset_token varchar; |
We can retrieve the time remaining for a ttl with the ‘ttl’ query function:
1 | select ttl(reset_token) from users where id = 'john-doe'; |
We can turn on tracing and do a select to see that there are currently no tombstones:
1 | tracing on; |
(Re-run this several times until the 2 minutes have elasped and the token_value will be gone, and tracing will show a tombstone.)
Turn off tracing:
1 | tracing off; |
Create a ratings table with two counter columns:
1 | create table ratings ( |
Now let’s increment both counter columns to represent receiving a new course rating of 4:
1 | update ratings set ratings_count = ratings_count + 1, ratings_total = ratings_total + 4 where course_id = 'nodejs-big-picture'; |
(The select should show the data we just upserted.)
Now let’s add a second course rating of 3:
1 | update ratings set ratings_count = ratings_count + 1, ratings_total = ratings_total + 3 where course_id = 'nodejs-big-picture'; |
This should show the new values of “2” and “7” for ratings_count and ratings_total respectively.
Drop and re-create “ratings” to use with the “avg” aggregate function
1 | drop table ratings; |
Insert a few sample ratings
1 | insert into ratings (course_id, user_id, rating) values ('cassandra-developers', 'user1', 4); |
You can select the average for a single course (across users):
1 | select course_id, avg(rating) from ratings where course_id = 'cassandra-developers'; |
However, you can’t apply aggregate functions across partition keys:
1 | select course_id, avg(rating) from ratings; -- incorrect results |
Multi-Row Partition
Composite Key
Previously we only have one primary key in table, that primary is the partition key. But it could be:
1 | -- composite key |
partition_key
can also be composite.
There is no join
operation in Cassandra.
Drop this table and create a new one to hold both course and module data
1 | drop table courses; |
Insert data for the course, plus the first two modules
1 | insert into courses (id, name, author, audience, duration, cc, released, module_id, module_name, module_duration) |
Select the data we just inserted
1 | -- get same result |
Now we can include both id and module_id in our where clause
1 | select * from courses where id = 'nodejs-big-picture' and module_id = 2; |
We can’t select by just module, unless we enable ‘ALLOW FILTERING’
1 | -- if no partition_key, performance downgrade |
Now insert the remaining modules for the course
1 | insert into courses (id, name, author, audience, duration, cc, released, module_id, module_name, module_duration) |
We can also use module_id as part of an “in” clause
1 | select * from courses where id = 'nodejs-big-picture' and module_id in (2,3,4); |
And we can order by module_id
1 | select * from courses where id = 'nodejs-big-picture' order by module_id desc; |
We can “select distinct” just the id, but not the id and course name:
1 | select distinct id from courses; // succeeds |
Static Columns
Static Columns are static within the partition. Its the common data in a partition.
From cqlsh, drop and recreate the courses table, using static columns
1 | use pluralsight; |
Insert just the course data, and select it back
1 | insert into courses (id, name, author, audience, duration, cc, released) |
Now insert the module data for the first two modules
1 | insert into courses (id, module_id, module_name, module_duration) |
Selecting from courses now returns both course and module data in each row
1 | select * from courses where id = 'nodejs-big-picture'; |
Insert the third module, but also change the name of the course. Select all rows to show the course title changed everywhere.
1 | insert into courses (id, name, module_id, module_name, module_duration) |
Insert the fourth module, and fix the course name
1 | insert into courses (id, name, module_id, module_name, module_duration) |
Insert the remaining course module
1 | insert into courses (id, module_id, module_name, module_duration) |
The ‘in’ and ‘order by’ clauses work the same as before
1 | select * from courses where id = 'nodejs-big-picture' and module_id in (2,3,4); |
Select course info, repeated based on the number of modules in the course
1 | select id, name, author, audience, duration, cc, released from courses; |
Now “select distinct” course info and only get one row back
1 | select distinct id, name, author, audience, duration, cc, released from courses; |
Select just the module information for the course
1 | select module_id, module_name, module_duration from courses where id = 'nodejs-big-picture'; |
Load module-level course data by running a series of CQL commands from an external file
1 | cat data/courses2.cql | cqlsh |
Select module information for the ‘advanced-javascript’ course
1 | use pluralsight; |
Select module information for the ‘docker-fundamentals’ course
1 | select module_id, module_name, module_duration from courses where id = 'advanced-python'; |
Select just the course-level information for all 5 courses
1 | select distinct id, name, author from courses; |
Time Series Data
Launch our one Cassandra node and (when it’s ready) load our sample course data
1 | cat data/courses2.cql | cqlsh |
From cqlsh, create a new table to hold course page views
1 | use pluralsight; |
Insert a row into this table, using “now()” to create a timeuuid with the current date/time. Include a one year TTL.
1 | insert into course_page_views (course_id, view_id) |
Insert another row into the table with a manually generated v1 UUID (also with a TTL)
1 | insert into course_page_views (course_id, view_id) |
Insert two more rows using “now()”
1 | insert into course_page_views (course_id, view_id) |
Select the rows, and then use dateOf() to extract the date/time portion of the view_id
1 | select * from course_page_views; |
Reverse the date order of the results
1 | select dateOf(view_id) from course_page_views where course_id = 'nodejs-big-picture' order by view_id asc; |
Select only those dates based on Timeuuids that span a 2 day range
1 | select dateOf(view_id) from course_page_views where course_id = 'nodejs-big-picture' |
Truncate the table, and add a static column
1 | truncate course_page_views; |
Now insert three rows, using “now()” for both Timeuuids (with TTLs)
1 | insert into course_page_views (course_id, last_view_id, view_id) |
Selecting all rows shows different view_ids but the same last_view_id for all rows
1 | select * from course_page_views; |
Use ‘select distinct’ to get just the latest page view for this course
1 | select distinct course_id, last_view_id from course_page_views; |
For just one course, this can also be accomplished with the view_id and a LIMIT clause
1 | select course_id, view_id from course_page_views where course_id = 'nodejs-big-picture' limit 1; |
However, a ‘limit’ won’t work across multiple courses. Insert multiple views for another course.
1 | insert into course_page_views (course_id, last_view_id, view_id) |
Select latest view_id from each course, using the limit clause
1 | select course_id, view_id from course_page_views where course_id = 'nodejs-big-picture' limit 1; |
Retrieve the latest course page view for all courses with ‘select distinct’ and the static column
1 | select distinct course_id, last_view_id from course_page_views; |
Select all the individual views for each course, one at a time
1 | select course_id, view_id from course_page_views where course_id = 'nodejs-big-picture'; |
课程后面的东西目前用不到,到时候再接着看。