Chapter 3 - Storage and Retrieval

Simplest database

db_set () {
    echo "$1,$2" >> database
}

db_get () {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}

Try it!

$ db_set 123456 '{"name":"London","attractions":["Big Ben","London Eye"]}'

$ db_set 42 '{"name":"San Francisco","attractions":["Golden Gate Bridge"]}'

$ db_get 42
{"name":"San Francisco","attractions":["Golden Gate Bridge"]}

Couldn't be better at write operations, is pretty crap at read (think of this file having lots of entries). Introducing Indexes.

Indexing

Hash map

In our simplest database, records are written on disk (database file). It is essentially a long string:

123456,{"name":"London","attractions":["Big Ben","London Eye"]}\n42,{"name":"San Francisco","attractions":["Golden Gate Bridge"]}

For faster lookups for keys, we can keep track of the position of the keys in this long chain, such as:

key	byte offset
123456	0
42	64

Keeping an index of all keys in memory can be a problem with large amounts of keys.

SSTables and LSM trees

SSTables: Sorted String Tables
- Keeps key-value pairs sorted by key for efficient lookups
LSM: Log-Structured Merge tree
- Same principle, gives its name to databases using this principle

Log structured are based on append-only principle. Never updates a file that has been written.

B-Trees

Most common type of index.

Also keeps sorting by keys
Breaks down the database into fixed-size segments, referenced like so:

B-trees are based on edit-in-place principle. Disks are a set of fixed-size pages which can be overridden.

Transaction vs Analytics models

Transaction (OLTP: Online Transaction Processing)
- The day-to-day operations: placing an order, comment on blog post, ...
Analytics (OLAP: Online Analytical Processing)
- Get insights from data: sales in January, average temperature in March, ...

Comparison:

Property	OLTP	OLAP
Main read pattern	Small number of records per query	Aggregate over large number of records
Main write pattern	Low latency writes from user input	Bulk import or event stream
Used by	End user via web app	Internal analyst for business intelligence
Data represents	Latest state (current)	History of events
Dataset size	GB to TB	TB to PB

OLAP and OLTP where using the same DB to start with but it's been found more appropriate to separate them, so that analytics have no influence (performance wise in particular) to the day-to-day critical operations: introducing the data warehouse for analytics.

Data warehouse solutions

Name	Type
Terradata	Proprietary
Vertica	Proprietary
SAP HANA	Proprietary
ParAccel	Proprietary
Apache Hive	Open Source
Apache Spark SQL	Open Source
Cloudera Impala	Open Source
Facebook Presto	Open Source
Apache Tajo	Open Source
Apache Drill	Open Source

Note that Amazon Redshift is a hosted version of ParAccel.

Unlike OLTP databases, where storage is done per row / document, in OLAP it's done per column.

Ch3 - Storage engines