Chapter 3 - Storage and Retrieval

Simplest database

db_set () {
    echo "$1,$2" >> database
}

db_get () {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}

Try it!

$ db_set 123456 '{"name":"London","attractions":["Big Ben","London Eye"]}'

$ db_set 42 '{"name":"San Francisco","attractions":["Golden Gate Bridge"]}'

$ db_get 42
{"name":"San Francisco","attractions":["Golden Gate Bridge"]}

Couldn't be better at write operations, is pretty crap at read (think of this file having lots of entries). Introducing Indexes.

Indexing

Hash map

In our simplest database, records are written on disk (database file). It is essentially a long string:

123456,{"name":"London","attractions":["Big Ben","London Eye"]}\n42,{"name":"San Francisco","attractions":["Golden Gate Bridge"]}

For faster lookups for keys, we can keep track of the position of the keys in this long chain, such as:

key byte offset
123456 0
42 64

Keeping an index of all keys in memory can be a problem with large amounts of keys.

SSTables and LSM trees

  • SSTables: Sorted String Tables
    • Keeps key-value pairs sorted by key for efficient lookups
  • LSM: Log-Structured Merge tree
    • Same principle, gives its name to databases using this principle

Log structured are based on append-only principle. Never updates a file that has been written.

B-Trees

Most common type of index.

  • Also keeps sorting by keys
  • Breaks down the database into fixed-size segments, referenced like so:

B-trees are based on edit-in-place principle. Disks are a set of fixed-size pages which can be overridden.

Transaction vs Analytics models

Comparison:

Property OLTP OLAP
Main read pattern Small number of records per query Aggregate over large number of records
Main write pattern Low latency writes from user input Bulk import or event stream
Used by End user via web app Internal analyst for business intelligence
Data represents Latest state (current) History of events
Dataset size GB to TB TB to PB

OLAP and OLTP where using the same DB to start with but it's been found more appropriate to separate them, so that analytics have no influence (performance wise in particular) to the day-to-day critical operations: introducing the data warehouse for analytics.

ETL

Data warehouse solutions

Name Type
Terradata Proprietary
Vertica Proprietary
SAP HANA Proprietary
ParAccel Proprietary
Apache Hive Open Source
Apache Spark SQL Open Source
Cloudera Impala Open Source
Facebook Presto Open Source
Apache Tajo Open Source
Apache Drill Open Source

Note that Amazon Redshift is a hosted version of ParAccel.

Unlike OLTP databases, where storage is done per row / document, in OLAP it's done per column.

results matching ""

    No results matching ""