Chapter 4 - Encoding

Formats for encoding data

Typical encoding formats:

* XML
* JSON
* CSV

Limitations:

XML
- Cannot distinguish numbers from strings made of digits, unless an external schema is used
JSON
- Distinguishes numbers from strings, but cannot distinguish integers from floating-point numbers
CSV
- Cannot distinguish numbers from strings made of digits
- No schema

Example of record:

{
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"]
}

JSON encoding here is schemaless, so the field names (userName, favoriteNumber, ...) have to be included in the encoded data itself.

Encoded size: 81 bytes (with whitespace removed).
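
A minimal sketch to check that size, using only Python's standard `json` module. The variable names are illustrative; the compact separators strip the optional whitespace, which is what the 81-byte figure assumes.

```python
import json

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

# separators=(",", ":") drops the optional whitespace after commas and colons.
compact = json.dumps(record, separators=(",", ":")).encode("utf-8")

print(compact.decode("utf-8"))
print(len(compact))  # 81
```

Note that every field name is spelled out in the output, which is where most of the bytes go.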

Encoding with MessagePack

MessagePack is a binary encoding for JSON. Like JSON it is schemaless, so the field names are still part of the encoded data.

The record above encodes to:

83 a8 75 73 65 72 4e 61 6d 65 a6 4d 61 72 74 69 6e ae 66 61 76 6f 72 69 74 65 4e 75 6d 62 65 72 cd 05 39 a9 69 6e 74 65 72 65 73 74 73 92 ab 64 61 79 64 72 65 61 6d 69 6e 67 a7 68 61 63 6b 69 6e 67

Encoded size: 66 bytes.
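
To see where those bytes come from, here is a hand-rolled sketch of a MessagePack encoder. It is not the real `msgpack` library: `pack` is a hypothetical helper that covers only the handful of type tags this record needs (fixmap, fixarray, fixstr, small unsigned integers).

```python
import struct


def pack(value):
    """Encode a small subset of MessagePack (enough for the example record)."""
    if isinstance(value, str):
        raw = value.encode("utf-8")
        assert len(raw) < 32, "this sketch only handles fixstr"
        return bytes([0xA0 | len(raw)]) + raw        # fixstr: 101xxxxx + bytes
    if isinstance(value, int):
        if 0 <= value < 128:
            return bytes([value])                     # positive fixint
        if value < 256:
            return b"\xcc" + struct.pack(">B", value)  # uint 8
        assert value < 65536, "this sketch stops at uint 16"
        return b"\xcd" + struct.pack(">H", value)      # uint 16, big-endian
    if isinstance(value, list):
        assert len(value) < 16, "this sketch only handles fixarray"
        return bytes([0x90 | len(value)]) + b"".join(pack(v) for v in value)
    if isinstance(value, dict):
        assert len(value) < 16, "this sketch only handles fixmap"
        body = b"".join(pack(k) + pack(v) for k, v in value.items())
        return bytes([0x80 | len(value)]) + body
    raise TypeError(f"unsupported type: {type(value).__name__}")


record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

packed = pack(record)
print(" ".join(f"{b:02x}" for b in packed))
print(len(packed))  # 66
```

The saving over JSON is small because the field names are still written out in full; only the punctuation and the number got cheaper.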

Encoding with Thrift and Protocol Buffers

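Apache Thrift and Google Protocol Buffers are binary encodings too, but both require a schema. Field names are replaced by the tag numbers declared in the schema, so they don't appear in the encoded data.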

Example with Thrift (IDL):

struct Person {
  1: required string       userName,
  2: optional i64          favoriteNumber,
  3: optional list<string> interests
}

Encoded size:

* Thrift (BinaryProtocol): 59 bytes
* Thrift (CompactProtocol): 34 bytes
* Protocol buffers: 33 bytes
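
As a rough illustration of why the schema-based encodings are so much smaller, the sketch below builds the Protocol Buffers bytes by hand. It assumes a message with the same fields and tag numbers as the Thrift struct above (string, 64-bit integer, repeated string); the helper names are made up for this example and this is not the generated-code API.

```python
VARINT, LEN = 0, 2  # wire types: variable-length integer, length-delimited


def varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def key(field_number, wire_type):
    """Each field starts with its tag number and wire type packed together."""
    return varint((field_number << 3) | wire_type)


def string_field(field_number, text):
    raw = text.encode("utf-8")
    return key(field_number, LEN) + varint(len(raw)) + raw


def int_field(field_number, n):
    return key(field_number, VARINT) + varint(n)


encoded = (
    string_field(1, "Martin")
    + int_field(2, 1337)
    + string_field(3, "daydreaming")  # a repeated field is simply
    + string_field(3, "hacking")      # written once per element
)

print(" ".join(f"{b:02x}" for b in encoded))
print(len(encoded))  # 33
```

No field name appears anywhere in the output: each field costs one byte of tag plus its value.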

How do you update the schema then?

* Add new fields: OK, just give each new field a new tag number. A new field can't be required, though: old data won't contain it, so new code could no longer read old data (backward compatibility breaks). It has to be optional or have a default value.
* Remove fields: OK, with the mirror-image constraint. Only optional fields can be removed, and a removed field's tag number must never be reused.
* Change a field's type: may be OK, check case by case (values can get truncated, e.g. when a 64-bit value is read into a 32-bit field). Otherwise, replace the field with a new one.
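
Why adding a field with a new tag number is safe can be sketched with a toy decoder. It assumes the Protocol Buffers wire format and a hypothetical old schema that knows tags 1 to 3 only; tag 4 is an invented new field. The old reader simply skips tags it does not recognise.

```python
def read_varint(buf, pos):
    """Return (value, next_pos) for the varint starting at buf[pos]."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


# Tag numbers known to the *old* reader (hypothetical old schema).
KNOWN_FIELDS = {1: "userName", 2: "favoriteNumber", 3: "interests"}


def decode(buf):
    record = {"interests": []}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:            # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:          # length-delimited
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        name = KNOWN_FIELDS.get(field_number)
        if name is None:
            continue                  # unknown tag: skip it, keep going
        if isinstance(value, bytes):
            value = value.decode("utf-8")
        if name == "interests":
            record[name].append(value)
        else:
            record[name] = value
    return record


# Data written by newer code that added tag 4 (an invented string field).
new_data = bytes.fromhex(
    "0a 06 4d 61 72 74 69 6e"     # tag 1: "Martin"
    "10 b9 0a"                    # tag 2: 1337
    "1a 07 68 61 63 6b 69 6e 67"  # tag 3: "hacking"
    "22 02 66 72"                 # tag 4: "fr", unknown to the old reader
)

print(decode(new_data))
```

The wire type tells the old reader how many bytes to skip, which is also why a tag number can never be reused for a different field.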

Encoding with Apache Avro

Apache Avro is another binary encoding format, also using a schema.

Schema in Avro IDL (intended for human readers):

record Person {
    string               userName;
    union { null, long } favoriteNumber = null;
    array<string>        interests;
}

And in JSON (intended for machines):

{
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName",       "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
        {"name": "interests",      "type": {"type": "array", "items": "string"}}
    ]
}

Notice that, compared to Thrift and Protocol Buffers, there are no tag numbers to identify fields. This means the encoded data can only be decoded with the exact schema it was written with (the writer's schema). The reader may use a different but compatible schema, and Avro resolves the differences by matching fields by name. Also notice that there are no optional or required markers: every field must be present. To make a field optional, make it nullable by unioning its type with null and give it a default value (as done here for favoriteNumber).

Encoded size: 32 bytes (the smallest)
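
The sketch below reproduces that size by hand-encoding the record in Avro's binary format. It is not the Avro library: `zigzag_varint`, `avro_string` and `encode_person` are hypothetical helpers hard-wired to the Person schema above.

```python
def zigzag_varint(n):
    """Avro int/long: zigzag-map the sign bit, then emit a base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def avro_string(text):
    raw = text.encode("utf-8")
    return zigzag_varint(len(raw)) + raw  # length prefix, then UTF-8 bytes


def encode_person(user_name, favorite_number, interests):
    # Fields are written in schema order, with no names and no tags.
    out = avro_string(user_name)
    if favorite_number is None:
        out += zigzag_varint(0)           # union branch 0: null
    else:
        out += zigzag_varint(1)           # union branch 1: long
        out += zigzag_varint(favorite_number)
    if interests:
        out += zigzag_varint(len(interests))  # one block holding all items
        for item in interests:
            out += avro_string(item)
    out += zigzag_varint(0)               # zero-length block ends the array
    return out


encoded = encode_person("Martin", 1337, ["daydreaming", "hacking"])
print(" ".join(f"{b:02x}" for b in encoded))
print(len(encoded))  # 32
```

Nothing in the output says which field is which or what type it has, so a decoder that walks the bytes with a different schema would silently misread them.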

Using schemas with XML and JSON

Thrift, Protocol Buffers and Avro all use a schema. XML and JSON also have schema languages (XML Schema, JSON Schema), which support more complex validation rules, e.g. this field has to be within the range 0-100, or has to match this regular expression.
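
To give a feel for that kind of rule, here is a plain-Python stand-in. It is not JSON Schema or XML Schema, and the `score` field and both rules are invented for the example.

```python
import re

# Hypothetical per-field rules: a range check and a regular expression.
RULES = {
    "score": lambda v: isinstance(v, int) and 0 <= v <= 100,
    "userName": lambda v: isinstance(v, str)
    and re.fullmatch(r"[A-Za-z]+", v) is not None,
}


def validate(record):
    """Return the names of fields that are missing or break their rule."""
    return [
        name
        for name, is_valid in RULES.items()
        if name not in record or not is_valid(record[name])
    ]


print(validate({"userName": "Martin", "score": 42}))   # []
print(validate({"userName": "M4rtin", "score": 101}))  # ['score', 'userName']
```

A real schema language expresses the same constraints declaratively, so they can be shared between programs written in different languages.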

Modes of dataflow

How can data be passed from one process to another?

* Via databases
* Via service calls (REST and RPC)
* Via asynchronous message passing

Via Databases

One process writes encoded data, according to a particular schema. Another process reads it again sometime in the future, possibly according to another (newer or older) schema.

Via service calls

One process sends a request over the network and expects another process to reply as quickly as possible.

Via message passing

One process sends a message to a queue or topic through a message broker, and normally doesn't expect any response.

Message brokers:

* Ensure delivery of the message
* Act as buffers
* Can redeliver messages when not consumed by recipient
* Allow a single message to be sent to multiple recipients
* Decouple sender and recipient
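
A toy, in-process sketch of the buffering and decoupling points, using `queue.Queue` from the standard library as a stand-in for a broker. A real broker runs as a separate service over the network; the message strings here are made up.

```python
import queue
import threading

broker = queue.Queue()  # stand-in for a broker queue: buffers messages
received = []
STOP = object()         # sentinel telling the consumer to exit


def consumer():
    while True:
        message = broker.get()
        if message is STOP:
            break
        received.append(message)


# The sender enqueues and moves on; it does not wait for any reply
# and does not know who will consume the messages.
for text in ("signup:martin", "login:martin"):
    broker.put(text)
broker.put(STOP)

# The consumer only starts now, yet nothing is lost: the queue held the messages.
worker = threading.Thread(target=consumer)
worker.start()
worker.join()

print(received)  # ['signup:martin', 'login:martin']
```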

Open source message brokers:

* [RabbitMQ](https://www.rabbitmq.com/)
* [Apache ActiveMQ](http://activemq.apache.org/)
* [JBoss](http://www.jboss.org/) [HornetQ](http://hornetq.jboss.org/)
* [NATS](https://nats.io/)
* [Apache Kafka](https://kafka.apache.org/)

Have a read about distributed actor frameworks (like Akka), which provide an interesting programming model for concurrency and distribution across multiple nodes, without typical problems such as race conditions or deadlocks.
