Apache Cassandra

Revisit on 2023 March because of SPLA.

Intro

Cassandra quick started with single docker node. You can extend it to test any Cassandra client, for example gocql, build the app with golang image and bring it up in the same docker network.

Cassandra basics to understand consepts such as ring(cassandra cluster, masterless), horizontal scaling(aka scale-out), partitioning, RF(replication factor), CL(consistency level), quorum(RF/2+1), CAP(cassandar by default AP).

How the quorum is calculated.

Replication Strategy

Keyspace: a namespace that defines data replication on nodes, one keyspace may have mutliple related tables, the Replicatin Strategy is keyspace-wide:

  • simply strategy for single data center
  • network topology strategy for multiple data centers

Data center can have multiple racks.

For example, one data center has 2 racks, rack1 has 2 nodes, rack2 has one node, if the simply strategy is 1, then rack1 owns 50% data, each node in rack1 owns 25%, rack2 owns 50%, since rack2 only contains 1 node, so that node owns 50%.

Tunable Consistency

Coordinator Node: client connect to perform actions. Each connection to Cassandra may have a different coordinator node, any node can be the coordinator.

You can configure consistency on a cluster, datacenter, or per individual read or write operation. see this doc for details.

Consistency level for write:

  • ONE, TWO, THREE
  • QUORUM(majority of nodes succeeds)
  • ALL(must all good)
  • ANY(include coordinator itself).

Hinted Handoff: when one write node is unavaiable, the data is written to coordinator node, the coordinator node will try repeatedly write to the unavailable node until succeeded.

Consistency level for read: how many nodes to consult to return the most current data to caller.

  • SERIAL: see this doc
  • ONE, TWO, THREE
  • QUORUM(majority of nodes succeeds)
  • ALL (must all good)

Read Repair: 当对一个node写失败了but back online later,在read时如果有多个replicas 的数据可以参考,则对那个node可重新写入上次wirte失败的数据. Run nodetool repair periodically will resolve the inconsistencies in cluster.

Achieving strong consistency:

  • write consistency + read consistency > replication factor

Multiple data center consistency level:

  • EACH_QUORUM
  • LOCAL_QUORUM: local means current coordinator node data center
  • LOCAL_ONE: the same as ONE

Partition, Composite and Clustering Key

To correctly run cql, especially the order by, you need to understand how to define primary key and use partition, composite and clustering keys efficiently:

https://www.baeldung.com/cassandra-keys

CQL

A single Cassandra docker node is enough for CQL.

Cassandra Query Language:

With cqlsh script, you can specify remote Cassandra node with port to connect, by default it will connect to localhost 9042 port.

Keyspace -> Tables -> partitions -> row.

In brief, each table requires a unique primary key. The first field listed is the partition key, since its hashed value is used to determine the node to store the data. If those fields are wrapped in parentheses then the partition key is composite. Otherwise the first field is the partition key. Any fields listed after the primary key are called clustering columns. These store data in ascending or descending order within the partition for the fast retrieval of similar values. All the fields together are the primary key.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
## help
cqlsh> help

Documented shell commands:
===========================
CAPTURE CLS COPY DESCRIBE EXPAND LOGIN SERIAL SOURCE UNICODE
CLEAR CONSISTENCY DESC EXIT HELP PAGING SHOW TRACING

## specified
cqlsh> help consistency;

CQL help topics:
================
AGGREGATES CREATE_KEYSPACE DROP_TRIGGER TEXT
ALTER_KEYSPACE CREATE_MATERIALIZED_VIEW DROP_TYPE TIME
ALTER_MATERIALIZED_VIEW CREATE_ROLE DROP_USER TIMESTAMP
ALTER_TABLE CREATE_TABLE FUNCTIONS TRUNCATE
ALTER_TYPE CREATE_TRIGGER GRANT TYPES
ALTER_USER CREATE_TYPE INSERT UPDATE
APPLY CREATE_USER INSERT_JSON USE
ASCII DATE INT UUID
BATCH DELETE JSON
BEGIN DROP_AGGREGATE KEYWORDS
BLOB DROP_COLUMNFAMILY LIST_PERMISSIONS
BOOLEAN DROP_FUNCTION LIST_ROLES
COUNTER DROP_INDEX LIST_USERS
CREATE_AGGREGATE DROP_KEYSPACE PERMISSIONS
CREATE_COLUMNFAMILY DROP_MATERIALIZED_VIEW REVOKE
CREATE_FUNCTION DROP_ROLE SELECT
DROP_TABLE SELECT_JSON

Create a keyspace with:

1
create keyspace pluralsight with replication = {'class':'SimpleStrategy', 'replication_factor':1};

Create a table in this keyspace with:

1
2
use pluralsight;
create table courses (id varchar primary key);

Optionally attempt to create the table again with:

1
create table if not exists courses (id varchar primary key);

(and note that you will not get an error as long as the ‘if not exists’ is present)

Add a few columns to the courses table with:

1
2
3
alter table courses add duration int;
alter table courses add released timestamp;
alter table courses add author varchar;

Add a comment to the table with:

1
alter table courses with comment = 'A table of courses';

View the complete table and all its default properties with:

1
2
-- describe
desc table courses;

Drop and recreate a more complete courses table with:

1
2
3
4
5
6
7
8
9
10
11
drop table courses;

create table courses (
id varchar primary key,
name varchar,
author varchar,
audience int,
duration int,
cc boolean,
released timestamp
) with comment = 'A table of courses';

(Note that when entering the lines as above cqlsh will automatically detect a multi-line CQL statement)

Exit cqlsh:

1
exit

Load course data by running a series of CQL commands from an external file

1
cat courses.cql | cqlsh

Verify that the CQL commands in the file were indeed executed:

1
2
3
use pluralsight;
desc tables;
select * from courses;

(The ‘desc tables’ should show a single ‘courses’ table, and the ‘select’ statement should show 5 rows of sample data.)

The ‘expand’ cqlsh command will display the query results in a ‘one column per line’ format:

1
2
3
4
-- pretty format
expand on;
select * from courses;
expand off;

You can display the time a piece of data was written with the ‘writetime’ function:

1
select id, cc, writetime(cc) from courses where id = 'advanced-javascript';

We can update this cc column with an ‘update’ statement:

1
update courses set cc = true where id = 'advanced-javascript';

Now re-run the select statement containing the ‘writetime’ function and notice that the time has changed. You can prove to yourself that this write time is stored on a per column basis by selecting this for a different column:

1
select id, name, writetime(name) from courses where id = 'advanced-javascript';

Note that this writetime value is the same as the one returned by our first ‘cc’ query.

Cassandra also provides a function for returning the token associated with a partition key:

1
select id, token(id) from courses;

If you try to select from a column other than the primary key, you’ll get an error:

1
select * from courses where author = 'Cory House';

(We’ll show how to do this in a later module.)

Let’s create a users table:

1
2
3
4
5
6
7
create table users (
id varchar primary key,
first_name varchar,
last_name varchar,
email varchar,
password varchar
) with comment = 'A table of users';

Then we’ll insert and “upsert” two rows of data:

1
2
3
insert into users (id, first_name, last_name) values ('john-doe', 'John', 'Doe');
update users set first_name = 'Jane', last_name = 'Doe' where id = 'jane-doe';
select * from users;

(Note that the net effect of the insert and update are the same.)

Now we’ll add a new ‘reset_token’ column to this table, and add a value to this column with a TTL:

1
2
alter table users add reset_token varchar;
update users using ttl 120 set reset_token = 'abc123' where id = 'john-doe';

We can retrieve the time remaining for a ttl with the ‘ttl’ query function:

1
select ttl(reset_token) from users where id = 'john-doe';

We can turn on tracing and do a select to see that there are currently no tombstones:

1
2
tracing on;
select * from users where id = 'john-doe';

(Re-run this several times until the 2 minutes have elasped and the token_value will be gone, and tracing will show a tombstone.)

Turn off tracing:

1
tracing off;

Create a ratings table with two counter columns:

1
2
3
4
5
create table ratings (
course_id varchar primary key,
ratings_count counter,
ratings_total counter
) with comment = 'A table of course ratings';

Now let’s increment both counter columns to represent receiving a new course rating of 4:

1
2
update ratings set ratings_count = ratings_count + 1, ratings_total = ratings_total + 4 where course_id = 'nodejs-big-picture';
select * from ratings;

(The select should show the data we just upserted.)

Now let’s add a second course rating of 3:

1
2
3
update ratings set ratings_count = ratings_count + 1, ratings_total = ratings_total + 3 where course_id = 'nodejs-big-picture';
select * from ratings;
exit

This should show the new values of “2” and “7” for ratings_count and ratings_total respectively.

Drop and re-create “ratings” to use with the “avg” aggregate function

1
2
3
4
5
6
7
8
drop table ratings;

create table ratings (
course_id varchar,
user_id varchar,
rating int,
primary key (course_id, user_id)
);

Insert a few sample ratings

1
2
3
4
insert into ratings (course_id, user_id, rating) values ('cassandra-developers', 'user1', 4);
insert into ratings (course_id, user_id, rating) values ('cassandra-developers', 'user2', 5);
insert into ratings (course_id, user_id, rating) values ('cassandra-developers', 'user3', 4);
insert into ratings (course_id, user_Id, rating) values ('advanced-python', 'user1', 5);

You can select the average for a single course (across users):

1
2
select course_id, avg(rating) from ratings where course_id = 'cassandra-developers';
select course_id, avg(rating) from ratings where course_id = 'advanced-python';

However, you can’t apply aggregate functions across partition keys:

1
select course_id, avg(rating) from ratings;  -- incorrect results

Multi-Row Partition

Composite Key

Previously we only have one primary key in table, that primary is the partition key. But it could be:

1
2
-- composite key
PRIMARY KEY (partition_key, clustering_key, ...)

partition_key can also be composite.

There is no join operation in Cassandra.

Drop this table and create a new one to hold both course and module data

1
2
3
4
5
6
7
8
9
10
11
12
13
14
drop table courses;
create table courses (
id varchar,
name varchar,
author varchar,
audience int,
duration int,
cc boolean,
released timestamp,
module_id int,
module_name varchar,
module_duration int,
primary key (id, module_id)
) with comment = 'A table of courses and modules';

Insert data for the course, plus the first two modules

1
2
3
4
5
insert into courses (id, name, author, audience, duration, cc, released, module_id, module_name, module_duration)
values ('nodejs-big-picture','Node.js: The Big Picture','Paul O''Fallon', 1, 3240, true, '2019-06-03',1,'Course Overview',70);

insert into courses (id, name, author, audience, duration, cc, released, module_id, module_name, module_duration)
values ('nodejs-big-picture','Node.js: The Big Picture','Paul O''Fallon', 1, 3240, true, '2019-06-03',2,'Considering Node.js',900);

Select the data we just inserted

1
2
3
-- get same result
select * from courses;
select * from courses where id = 'nodejs-big-picture';

Now we can include both id and module_id in our where clause

1
select * from courses where id = 'nodejs-big-picture' and module_id = 2;

We can’t select by just module, unless we enable ‘ALLOW FILTERING’

1
2
3
-- if no partition_key, performance downgrade
select * from courses where module_id = 2; // fails
select * from courses where module_id = 2 allow filtering; // succeeds

Now insert the remaining modules for the course

1
2
3
4
5
6
7
8
insert into courses (id, name, author, audience, duration, cc, released, module_id, module_name, module_duration)
values ('nodejs-big-picture','Node.js: The Big Picture','Paul O''Fallon', 1, 3240, true, '2019-06-03', 3, 'Thinking Asynchronously', 1304);

insert into courses (id, name, author, audience, duration, cc, released, module_id, module_name, module_duration)
values ('nodejs-big-picture','Node.js: The Big Picture','Paul O''Fallon', 1, 3240, true, '2019-06-03', 4, 'Defining an Application and Managing Dependencies', 525);

insert into courses (id, name, author, audience, duration, cc, released, module_id, module_name, module_duration)
values ('nodejs-big-picture','Node.js: The Big Picture','Paul O''Fallon', 1, 3240, true, '2019-06-03', 5, 'Assembling a Development Toolset', 489);

We can also use module_id as part of an “in” clause

1
select * from courses where id = 'nodejs-big-picture' and module_id in (2,3,4);

And we can order by module_id

1
select * from courses where id = 'nodejs-big-picture' order by module_id desc;

We can “select distinct” just the id, but not the id and course name:

1
2
select distinct id from courses;         // succeeds
select distinct id, name from courses; // fails

Static Columns

Static Columns are static within the partition. Its the common data in a partition.

From cqlsh, drop and recreate the courses table, using static columns

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
use pluralsight;
drop table courses;
create table courses (
id varchar,
name varchar static,
author varchar static,
audience int static,
duration int static,
cc boolean static,
released timestamp static,
module_id int,
module_name varchar,
module_duration int,
primary key (id, module_id)
) with comment = 'A table of courses and modules';

Insert just the course data, and select it back

1
2
3
4
insert into courses (id, name, author, audience, duration, cc, released)
values ('nodejs-big-picture','Node.js: The Big Picture','Paul O''Fallon', 1, 3240, true, '2019-06-03');

select * from courses where id = 'nodejs-big-picture';

Now insert the module data for the first two modules

1
2
3
4
5
insert into courses (id, module_id, module_name, module_duration)
values ('nodejs-big-picture',1,'Course Overview',70);

insert into courses (id, module_id, module_name, module_duration)
values ('nodejs-big-picture',2,'Considering Node.js',900);

Selecting from courses now returns both course and module data in each row

1
2
select * from courses where id = 'nodejs-big-picture';
select * from courses where id = 'nodejs-big-picture' and module_id = 2;

Insert the third module, but also change the name of the course. Select all rows to show the course title changed everywhere.

1
2
3
4
insert into courses (id, name, module_id, module_name, module_duration)
values ('nodejs-big-picture', 'The Big Node.js Picture', 3, 'Thinking Asynchronously', 1304);

select * from courses where id = 'nodejs-big-picture';

Insert the fourth module, and fix the course name

1
2
insert into courses (id, name, module_id, module_name, module_duration)
values ('nodejs-big-picture', 'Node.js: The Big Picture', 4, 'Defining an Application and Managing Dependencies', 525);

Insert the remaining course module

1
2
insert into courses (id, module_id, module_name, module_duration)
values ('nodejs-big-picture', 5, 'Assembling a Development Toolset', 489);

The ‘in’ and ‘order by’ clauses work the same as before

1
2
3
select * from courses where id = 'nodejs-big-picture' and module_id in (2,3,4);

select * from courses where id = 'nodejs-big-picture' order by module_id desc;

Select course info, repeated based on the number of modules in the course

1
select id, name, author, audience, duration, cc, released from courses;

Now “select distinct” course info and only get one row back

1
select distinct id, name, author, audience, duration, cc, released from courses;

Select just the module information for the course

1
select module_id, module_name, module_duration from courses where id = 'nodejs-big-picture';

Load module-level course data by running a series of CQL commands from an external file

1
cat data/courses2.cql | cqlsh

Select module information for the ‘advanced-javascript’ course

1
2
use pluralsight;
select module_id, module_name, module_duration from courses where id = 'advanced-javascript';

Select module information for the ‘docker-fundamentals’ course

1
select module_id, module_name, module_duration from courses where id = 'advanced-python';

Select just the course-level information for all 5 courses

1
select distinct id, name, author from courses;

Time Series Data

Launch our one Cassandra node and (when it’s ready) load our sample course data

1
cat data/courses2.cql | cqlsh

From cqlsh, create a new table to hold course page views

1
2
3
4
5
6
use pluralsight;
create table course_page_views (
course_id varchar,
view_id timeuuid,
primary key (course_id, view_id)
) with clustering order by (view_id desc);

Insert a row into this table, using “now()” to create a timeuuid with the current date/time. Include a one year TTL.

1
2
insert into course_page_views (course_id, view_id)
values ('nodejs-big-picture', now()) using TTL 31536000;

Insert another row into the table with a manually generated v1 UUID (also with a TTL)

1
2
insert into course_page_views (course_id, view_id)
values ('nodejs-big-picture', bb9807aa-fb68-11e9-8f0b-362b9e155667) using TTL 31536000;

Insert two more rows using “now()”

1
2
3
4
5
insert into course_page_views (course_id, view_id)
values ('nodejs-big-picture', now()) using TTL 31536000;

insert into course_page_views (course_id, view_id)
values ('nodejs-big-picture', now()) using TTL 31536000;

Select the rows, and then use dateOf() to extract the date/time portion of the view_id

1
2
select * from course_page_views;
select dateOf(view_id) from course_page_views where course_id = 'nodejs-big-picture';

Reverse the date order of the results

1
select dateOf(view_id) from course_page_views where course_id = 'nodejs-big-picture' order by view_id asc;

Select only those dates based on Timeuuids that span a 2 day range

1
2
3
4
5
select dateOf(view_id) from course_page_views where course_id = 'nodejs-big-picture'
and view_id >= maxTimeuuid('2019-10-30 00:00+0000')
and view_id < minTimeuuid('2019-11-02 00:00+0000');

-- adjust these dates as necessary to match a more current date range

Truncate the table, and add a static column

1
2
truncate course_page_views;
alter table course_page_views add last_view_id timeuuid static;

Now insert three rows, using “now()” for both Timeuuids (with TTLs)

1
2
3
4
5
6
7
8
insert into course_page_views (course_id, last_view_id, view_id)
values ('nodejs-big-picture', now(), now()) using TTL 31536000;

insert into course_page_views (course_id, last_view_id, view_id)
values ('nodejs-big-picture', now(), now()) using TTL 31536000;

insert into course_page_views (course_id, last_view_id, view_id)
values ('nodejs-big-picture', now(), now()) using TTL 31536000;

Selecting all rows shows different view_ids but the same last_view_id for all rows

1
select * from course_page_views;

Use ‘select distinct’ to get just the latest page view for this course

1
select distinct course_id, last_view_id from course_page_views;

For just one course, this can also be accomplished with the view_id and a LIMIT clause

1
select course_id, view_id from course_page_views where course_id = 'nodejs-big-picture' limit 1;

However, a ‘limit’ won’t work across multiple courses. Insert multiple views for another course.

1
2
3
4
5
6
7
8
insert into course_page_views (course_id, last_view_id, view_id)
values ('advanced-javascript', now(), now()) using TTL 31536000;

insert into course_page_views (course_id, last_view_id, view_id)
values ('advanced-javascript', now(), now()) using TTL 31536000;

insert into course_page_views (course_id, last_view_id, view_id)
values ('advanced-javascript', now(), now()) using TTL 31536000;

Select latest view_id from each course, using the limit clause

1
2
select course_id, view_id from course_page_views where course_id = 'nodejs-big-picture' limit 1;
select course_id, view_id from course_page_views where course_id = 'advanced-javascript' limit 1;

Retrieve the latest course page view for all courses with ‘select distinct’ and the static column

1
select distinct course_id, last_view_id from course_page_views;

Select all the individual views for each course, one at a time

1
2
select course_id, view_id from course_page_views where course_id = 'nodejs-big-picture';
select course_id, view_id from course_page_views where course_id = 'advanced-javascript';

课程后面的东西目前用不到,到时候再接着看。

0%