Database Two-Column Table Design

This is a design and good practice about how to manage data in table for Spanner, and you can extend it to other types of DB accordingly.

Table Schema

Basically, the table will have 2 columns:

  • Object: proto or encoded JSON, can be easily backward compatible in case of adding new fields or deprecating old fields.
  • Metadata: proto or encoded JSON, stores the info associated with the paired Object.

For Metadata, it consists of:

  • version_id: uint64, such as 100000000023
  • version_state: enum
    • UNDEFINED: should never be used
    • LATEST: current object
    • ARCHIVED: stale object (must be followed by either LATEST or DELETED)
    • DELETED: deleted object
  • created_by: string, such as bob@corp.com
  • last_update_time: UTC, timestamp or uint64 for seconds

The primary key of the table is the combination of Metadata.verion_id and certain fields from Object.

How it works

The primary idea is that we never physically delete the record from the table within the data retention period.

  1. A new object foo is created by user alice@corp.com, the metadata#1 is
1
2
3
version_id: 100000000001
version_state: LATEST
created_by: alice@corp.com
  1. Some fields of the object foo’s are updated by leo@corp.com, the original foo record metadata#1 will be changed to
1
2
3
version_id: 100000000001
version_state: ARCHIVED
created_by: alice@corp.com

At meanwhile, a new foo object is written into the table with the metadata#2 below:

1
2
3
version_id: 100000000002
version_state: LATEST
created_by: leo@corp.com
  1. Later, user john@corp.com want to delete the foo object, the metadata#2 will be changed to:
1
2
3
version_id: 100000000002
version_state: ARCHIVED
created_by: leo@corp.com

At meanwhile, the foo object with metadata#3 is created:

1
2
3
version_id: 100000000003
version_state: DELETED
created_by: john@corp.com

In this foo’s lifecycle, we totally have 3 records and we know who did what and when.

Please note that each operation(create, update, delete) should be in a transaction as old one(if exists) needs to be marked as ARCHIVE and new one will be created (LATEST or DELETED), otherwise there will be duplicate records from Object perspective although the Metadata.version_id is different.

For DB which has changelogs for history, auditing, debugging, you may not need this design as it compliates the client side implementation, but they may be not that straightforward in real case for quickly checking.

Data Access Layer

You need to create data access layer for common operations like above, the app or service should not access the DB directly via SQL query but using the API provided by data access layer, for example:

  • list object
  • upsert object
    • archive object
  • delete object
    • archive object
  • query object by certain fields

For human, it is still available to run CRUD on DB directly, but should be cautious and with AOD(access on demand) for production.

Base Management Class

A base management class can be created to handle common operations and specific Object class inherits from it, for example:

Data Retention

If physical deletion is required in terms of data retention policy, you can have scheduled job to scan the table in off-peak time in 2 steps:

  1. loop through all LATEST objects, if invalid, logically delete it in a transaction.

  2. loop through all DELETED objects, if invalid, physically delete it and all ARCHIVED ones, no transaction is needed.

0%