Goals

Learn about DynamoDB and its use of two-dimensional aggregates for structuring, distributing, and associating data. Discover how global indexes can automatically make data queryable from multiple dimensions. Design and implement applications using partitions and sorting.

Introduction

Context

Hash Tables

Wide-column stores typically use a Distributed Hash Table to distribute data across clusters and rings of machines.

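Traditional hash tables are data structures used to efficiently map keys to values. Each key is passed through a hash function, which deterministically produces an identifier (the hash) for the input. This identifier is then used to map the key to a specific slot in an array, where the corresponding value is stored. Because two different keys can produce the same identifier, implementations also need a strategy for handling such collisions.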

hash_table.drawio.svg

When a key needs to be looked up, it is passed through the hash function to produce its identifier, which is then used to directly access the corresponding slot in the array. This allows for lookups in constant time on average, regardless of the size of the data set. In contrast to distributed hash tables, traditional hash tables are typically implemented on a single machine and are not distributed across a network.
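The lookup path described above can be sketched in a few lines of Python. This is only an illustration, not how any particular database implements it: the class name `HashTable`, the slot count, and the use of separate chaining for collisions are assumptions made for this example, and the name of `projects:2` is hypothetical sample data.

```python
class HashTable:
    """A tiny hash table that resolves collisions by separate chaining."""

    def __init__(self, num_slots=8):
        # Each slot holds a list of (key, value) pairs that hash to it.
        self.slots = [[] for _ in range(num_slots)]

    def _slot_index(self, key):
        # hash() yields an integer; the modulo maps it onto the array.
        return hash(key) % len(self.slots)

    def put(self, key, value):
        bucket = self.slots[self._slot_index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite the existing entry
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        bucket = self.slots[self._slot_index(key)]
        for existing_key, value in bucket:
            if existing_key == key:
                return value
        return default


table = HashTable()
table.put("projects:1", "Launch Website")
table.put("projects:2", "Write Docs")  # hypothetical sample value
```

A call such as `table.get("projects:1")` hashes the key once and then only inspects the entries in a single slot, which is why the cost does not grow with the total number of stored keys as long as the slots stay short.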

Two-dimensional Aggregates

So far, we have only examined one-dimensional aggregates, where a single key corresponds to a single value. However, wide-column stores like DynamoDB and Cassandra use two-dimensional aggregates, where a single key is made up of a row key and a column key. Both keys are necessary to look up a value. The term "wide-column" stems from the fact that each column can store a differently-sized aggregate, and as the number of attributes in the aggregate grows, the columns get wider.

column_store_structure.svg

1. Dimension: Row Key

The row key is used to group a list of columns. A group of columns is stored and replicated by multiple physical nodes. Using a Distributed Hash Table, the row key is used to find a node storing the columns.

Sometimes, it helps to think of a row as a folder containing files:

$ tree .
.
├── node1
│   └── projects:1 <- this is the row key
│       ├── meta
│       ├── tasks:1
│       ├── tasks:2
│       ├── users:alice
│       └── users:bob
├── node2
└── node3
    └── projects:2 <- this is also a row key
        ├── meta
        ├── tasks:1
        ├── users:bob
        └── users:eve

6 directories, 9 files

2. Dimension: Column Key

A column is a single aggregate containing attributes with keys and values. The column key is used to select a single column from the row.

Using a column key can be thought of as selecting a single file from a folder. The file, in turn, contains keys and values:

$ cat node1/projects:1/meta 
name:Launch Website
created_at:2021-09-12
done:false

$ cat node1/projects:1/users:alice 
name:Alice
added_at:2021-09-13
tasks: [1]
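The folder analogy can also be expressed as a nested dictionary, where the outer key plays the role of the row key and the inner key the role of the column key. The following is a minimal sketch using the values from the listings above; the helper `get_column` is a hypothetical name, and a real wide-column store would of course not keep everything in one in-memory dictionary.

```python
# Row key -> column key -> attributes (values copied from the listings above).
store = {
    "projects:1": {
        "meta": {
            "name": "Launch Website",
            "created_at": "2021-09-12",
            "done": False,
        },
        "users:alice": {
            "name": "Alice",
            "added_at": "2021-09-13",
            "tasks": [1],
        },
    },
}


def get_column(store, row_key, column_key):
    """Look up one column; both the row key and the column key are required."""
    row = store.get(row_key)
    if row is None:
        return None
    return row.get(column_key)


meta = get_column(store, "projects:1", "meta")
alice = get_column(store, "projects:1", "users:alice")
```

Neither key is sufficient on its own to reach a value: the row key only narrows the search to a group of columns, and the column key then selects one aggregate within that group.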

Query-driven Modeling

With good data modeling, the performance of wide-column stores is excellent and highly scalable. For instance, a website that displays data for projects:1 only needs to issue a single request. This request returns the columns for project metadata, users, and tasks, because they all live in the same row. This performance comes at the price of flexibility: there are no joins, because they would be too slow (debrie20).
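To make the single-request idea concrete, here is a small Python sketch under stated assumptions: `rows` is a hypothetical in-memory stand-in for the data held by one node, `fetch_row` stands for the one request, and the column attributes are mostly omitted because they do not matter for the illustration. Returning the columns ordered by column key is also an assumption of this sketch.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for one node: row key -> columns.
rows = {
    "projects:1": {
        "meta": {"name": "Launch Website"},
        "tasks:1": {},  # attributes omitted in this sketch
        "tasks:2": {},
        "users:alice": {"name": "Alice"},
        "users:bob": {},
    },
}


def fetch_row(rows, row_key):
    """One 'request': return every column of the row, ordered by column key."""
    columns = rows.get(row_key, {})
    return sorted(columns.items())


def group_by_kind(columns):
    """Group column keys by the part of the key before the first ':'."""
    grouped = defaultdict(list)
    for column_key, _attributes in columns:
        kind = column_key.split(":", 1)[0]
        grouped[kind].append(column_key)
    return dict(grouped)


# Everything the project page needs arrives from a single row lookup.
page_data = group_by_kind(fetch_row(rows, "projects:1"))
```

The page never has to combine results from several rows, which is the work a join would otherwise do.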

Access Patterns

To model data for a wide-column store, we must define all of our access patterns up front and then derive the data model from them. This allows the model to answer those specific queries efficiently. Some examples of access patterns are:

DynamoDB

DynamoDB is a NoSQL database service that was launched by Amazon Web Services (AWS) in 2012. It is designed to be highly scalable and to support high-availability applications. It is based on the principles of a Distributed Hash Table and uses consistent hashing to distribute data across a cluster of machines.
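Consistent hashing can be illustrated with a short Python sketch. It is not DynamoDB's actual implementation, whose internals are not described here. The class `ConsistentHashRing`, the use of SHA-256, and the number of virtual nodes are all assumptions chosen for this example; the node names are taken from the folder analogy earlier in this section.

```python
import bisect
import hashlib


def ring_position(value):
    """Map a string to a stable integer position on the ring."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16)


class ConsistentHashRing:
    def __init__(self, nodes, virtual_nodes=16):
        # Each physical node is placed at several positions to spread load.
        self._ring = sorted(
            (ring_position(f"{node}#{i}"), node)
            for node in nodes
            for i in range(virtual_nodes)
        )
        self._positions = [position for position, _ in self._ring]

    def node_for(self, row_key):
        # Walk clockwise to the first node position at or after the key,
        # wrapping around to the start of the ring if necessary.
        index = bisect.bisect_left(self._positions, ring_position(row_key))
        return self._ring[index % len(self._ring)][1]


ring = ConsistentHashRing(["node1", "node2", "node3"])
owner = ring.node_for("projects:1")
```

The benefit over a plain `hash(key) % number_of_nodes` scheme is that adding a node only reassigns the keys that now fall to the new node; every other key stays where it was.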