Chapter 4 - Encoding
Formats for encoding data
Typical encoding formats:
* XML
* JSON
* CSV
Limitations:
- XML
  - Cannot distinguish numbers from strings made of digits, unless relying on an external schema
- JSON
  - Cannot distinguish ints from floats
- CSV
  - Cannot distinguish numbers from strings made of digits
  - No schema
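The CSV and JSON limitations can be observed with a few lines of Python. This is only a sketch using the standard library: the values are made up, and the last step imitates a parser that stores every JSON number as a double (as JavaScript does) rather than showing what Python's own `json` module does.

```python
import csv
import io
import json

# Write one row holding an integer and a string made of digits.
buffer = io.StringIO()
csv.writer(buffer).writerow([1337, "1337"])

# Read it back: CSV carries no type information, so both cells are strings.
buffer.seek(0)
csv_row = next(csv.reader(buffer))

# JSON does keep strings and numbers apart...
json_values = json.loads('[1337, "1337"]')

# ...but a parser that maps every number to an IEEE 754 double cannot
# represent integers above 2**53 exactly.
big = 2**53 + 1
survives_as_double = float(big) == big
```

After running this, `csv_row` holds two identical strings, while `survives_as_double` is false because the integer was rounded on conversion.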
Example of record:
{
"userName": "Martin",
"favoriteNumber": 1337,
"interests": ["daydreaming", "hacking"]
}
JSON encoding here is schemaless, so the field names (`userName`, `favoriteNumber`, ...) have to be included in the encoded data.
Encoded size: 81 bytes (with whitespace removed).
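The 81-byte figure can be checked by serializing the record without any optional whitespace. A minimal sketch with Python's standard `json` module (the variable names are arbitrary):

```python
import json

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

# separators=(",", ":") drops the spaces json.dumps inserts by default.
compact = json.dumps(record, separators=(",", ":")).encode("utf-8")
json_size = len(compact)
```

Most of those bytes are field names and punctuation rather than values, which is what the binary formats below try to reduce.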
Encoding with MessagePack
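MessagePack is a binary encoding for JSON-like data. Like JSON, it has no schema, so the field names are still part of the encoded bytes.
For the record above, the result is: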
83 a8 75 73 65 72 4e 61 6d 65 a6 4d 61 72 74 69 6e ae 66 61 76 6f 72 69 74 65 4e 75 6d 62 65 72 cd 05 39 a9 69 6e 74 65 72 65 73 74 73 92 ab 64 61 79 64 72 65 61 6d 69 6e 67 a7 68 61 63 6b 69 6e 67
Encoded size: 66 bytes.
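To see where those 66 bytes come from, here is a hand-written encoder for the small subset of MessagePack that this record needs (short maps, short arrays, short strings, unsigned integers up to 16 bits). It is a sketch for illustration, not a replacement for a real MessagePack library; the function name `pack` is an assumption.

```python
import struct

def pack(value):
    """Encode a tiny subset of MessagePack, enough for the example record."""
    if isinstance(value, dict) and len(value) < 16:
        out = bytes([0x80 | len(value)])           # fixmap: 1000xxxx
        for key, item in value.items():
            out += pack(key) + pack(item)
        return out
    if isinstance(value, list) and len(value) < 16:
        out = bytes([0x90 | len(value)])           # fixarray: 1001xxxx
        for item in value:
            out += pack(item)
        return out
    if isinstance(value, str):
        raw = value.encode("utf-8")
        if len(raw) < 32:
            return bytes([0xA0 | len(raw)]) + raw  # fixstr: 101xxxxx
    if isinstance(value, int) and not isinstance(value, bool):
        if 0 <= value < 0x80:
            return bytes([value])                  # positive fixint
        if 0x80 <= value <= 0xFF:
            return b"\xcc" + bytes([value])        # uint 8
        if 0x100 <= value <= 0xFFFF:
            return b"\xcd" + struct.pack(">H", value)  # uint 16, big-endian
    raise ValueError(f"value not supported by this sketch: {value!r}")

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}
packed = pack(record)
```

Each string costs one extra byte for its length prefix, and 1337 becomes `cd 05 39`. The saving over JSON comes only from dropping quotes, colons and commas.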
Encoding with Thrift and Protocol Buffers
Apache Thrift and Google Protocol Buffers are also binary encodings, but they require a schema.
Example with Thrift (IDL):
struct Person {
1: required string userName,
2: optional i64 favoriteNumber,
3: optional list<string> interests
}
Encoded size:
* Thrift (BinaryProtocol): 59 bytes
* Thrift (CompactProtocol): 34 bytes
* Protocol Buffers: 33 bytes
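The Protocol Buffers size can be reproduced by hand from its wire format: each field starts with a key byte combining the tag number and a wire type, integers are variable-length, and field names never appear. The sketch below assumes a Protocol Buffers schema with the same tag numbers as the Thrift IDL above (1, 2, 3) and only handles non-negative integers; the helper names are made up.

```python
VARINT, LENGTH_DELIMITED = 0, 2  # Protocol Buffers wire types used here

def varint(n):
    """Base-128 varint: 7 bits per byte, least significant group first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def key(field_number, wire_type):
    return varint((field_number << 3) | wire_type)

def int_field(field_number, n):
    return key(field_number, VARINT) + varint(n)

def string_field(field_number, text):
    raw = text.encode("utf-8")
    return key(field_number, LENGTH_DELIMITED) + varint(len(raw)) + raw

encoded = (
    string_field(1, "Martin")
    + int_field(2, 1337)
    # A repeated field is just the same tag written once per element.
    + b"".join(string_field(3, s) for s in ["daydreaming", "hacking"])
)
protobuf_size = len(encoded)
```

The field names are replaced by single key bytes (`0a`, `10`, `1a`), which is why the tag numbers, not the names, are what must stay stable over time.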
How to update a schema then?
* Add new fields: OK, just give each one a new tag number. A new field cannot be required, though: it must be optional or have a default value, otherwise new code could not read data written by old code (backward compatibility).
* Remove fields: OK, with the same constraint in reverse: only _optional_ fields can be removed, and a removed field's tag number must never be reused.
* Change a field's type: may be possible, but check case by case, since values can lose precision or be truncated. Otherwise, add a new field with the new type.
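The compatibility rules above can be illustrated with a toy model in which encoded data is a mapping from tag number to value. This is not how Thrift or Protocol Buffers are implemented; the schema layout and the `decode` helper are assumptions made for the sketch.

```python
# Each schema version maps tag number -> (field name, default value).
SCHEMA_V1 = {1: ("userName", None), 2: ("favoriteNumber", None)}
SCHEMA_V2 = {**SCHEMA_V1, 3: ("interests", None)}  # new optional field, new tag

def decode(tagged_values, schema):
    """Turn {tag: value} into {name: value} according to one schema version."""
    record = {}
    for tag, (name, default) in schema.items():
        # A tag missing from the data falls back to the default.
        record[name] = tagged_values.get(tag, default)
    # Tags present in the data but unknown to this schema are skipped.
    return record

old_data = {1: "Martin", 2: 1337}
new_data = {1: "Martin", 2: 1337, 3: ["hacking"]}

new_code_reads_old = decode(old_data, SCHEMA_V2)  # backward compatibility
old_code_reads_new = decode(new_data, SCHEMA_V1)  # forward compatibility
```

If tag 3 were later reused for a different field, old data carrying the previous meaning of tag 3 would be misread, which is why tag numbers are never recycled.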
Encoding with Apache Avro
Apache Avro is another binary encoding format, also using a schema.
Schema in Avro IDL (intended for human readers):
record Person {
string userName;
union { null, long } favoriteNumber = null;
array<string> interests;
}
And in JSON (intended for machines):
{
"type": "record",
"name": "Person",
"fields": [
{"name": "userName", "type": "string"},
{"name": "favoriteNumber", "type": ["null", "long"], "default": null},
{"name": "interests", "type": {"type": "array", "items": "string"}}
]
}
Notice that, unlike Thrift and Protocol Buffers, there is no tag number to identify fields. The encoded data is just the field values concatenated in schema order, so the decoder needs the exact schema the data was written with (the writer's schema). The reader's own schema may differ, as long as the two are compatible.
Also notice that Avro has no `required` or `optional` markers: every field must be present. To let a field hold no value, make it nullable by unioning its type with `null`, as is done here for `favoriteNumber`.
Encoded size: 32 bytes (the smallest)
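The 32 bytes can be derived by hand from Avro's binary encoding rules: integers are zigzag-encoded varints, strings are a length followed by UTF-8 bytes, a union is a branch index followed by the value, and an array is a block count, the items, then a zero. The sketch below hard-codes the Person schema instead of interpreting it, and the function names are assumptions.

```python
def avro_long(n):
    """Zigzag-map the sign into the low bit, then emit a base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def avro_string(text):
    raw = text.encode("utf-8")
    return avro_long(len(raw)) + raw

def avro_array(items, encode_item):
    if not items:
        return avro_long(0)
    body = b"".join(encode_item(item) for item in items)
    return avro_long(len(items)) + body + avro_long(0)  # one block, then end marker

def encode_person(person):
    """Fields are written in schema order, with no names or tags in between."""
    out = avro_string(person["userName"])
    number = person["favoriteNumber"]
    if number is None:
        out += avro_long(0)                      # union branch 0: null
    else:
        out += avro_long(1) + avro_long(number)  # union branch 1: long
    return out + avro_array(person["interests"], avro_string)

avro_bytes = encode_person({
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
})
```

Nothing in `avro_bytes` says which value belongs to which field or what type it has, so decoding with a different field order would silently produce garbage.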
Using schemas with XML and JSON
Thrift, Protocol Buffers and Avro all use schemas. XML and JSON also have optional schema languages (XML Schema and JSON Schema), which support more complex validation rules, for example that a field must be within the range 0-100 or match a regular expression.
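As a rough illustration of such rules, the schema below uses real JSON Schema keywords (`minimum`, `maximum`, `pattern`, `required`), but the validator is a hand-written toy that understands only those keywords. The field name `score` and the `validate` function are assumptions for this sketch; a real project would use a full JSON Schema implementation.

```python
import re

SCHEMA = {
    "type": "object",
    "required": ["score", "userName"],
    "properties": {
        "score": {"type": "integer", "minimum": 0, "maximum": 100},
        "userName": {"type": "string", "pattern": "^[A-Za-z]+$"},
    },
}

def validate(document, schema):
    """Check only the keywords used above; return a list of problems."""
    problems = [f"{name}: missing" for name in schema["required"] if name not in document]
    for name, rules in schema["properties"].items():
        if name not in document:
            continue
        value = document[name]
        if rules["type"] == "integer":
            if isinstance(value, bool) or not isinstance(value, int):
                problems.append(f"{name}: not an integer")
            elif not rules["minimum"] <= value <= rules["maximum"]:
                problems.append(f"{name}: out of range")
        elif rules["type"] == "string":
            if not isinstance(value, str):
                problems.append(f"{name}: not a string")
            elif re.search(rules["pattern"], value) is None:
                problems.append(f"{name}: does not match pattern")
    return problems

ok = validate({"score": 42, "userName": "Martin"}, SCHEMA)
too_high = validate({"score": 101, "userName": "Martin"}, SCHEMA)
```

An empty list means the document passed. The binary schema languages above describe structure and types only, not value constraints like these.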
Modes of dataflow
How to pass data from one process to another?
* Via databases
* Via service calls (REST and RPC)
* Via asynchronous message passing
Via databases
One process writes encoded data according to a particular schema. Another process reads it back at some point in the future, possibly according to a different schema.
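One practical consequence is that older code may read and rewrite records containing fields it does not know about, and it should not drop them. The sketch below uses an in-memory SQLite table holding JSON text as a stand-in for a real database; the table layout and the `rename_user` function are assumptions.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, body TEXT)")

# Newer code writes a record that includes a field older code does not know.
newer_record = {"userName": "Martin", "favoriteNumber": 1337, "interests": ["hacking"]}
db.execute("INSERT INTO people (id, body) VALUES (?, ?)", (1, json.dumps(newer_record)))

def rename_user(conn, person_id, new_name):
    """Older code: understands only userName, but leaves other fields intact."""
    (body,) = conn.execute(
        "SELECT body FROM people WHERE id = ?", (person_id,)
    ).fetchone()
    record = json.loads(body)
    record["userName"] = new_name  # touch only the field this code knows
    conn.execute(
        "UPDATE people SET body = ? WHERE id = ?", (json.dumps(record), person_id)
    )

rename_user(db, 1, "Marty")
(stored,) = db.execute("SELECT body FROM people WHERE id = 1").fetchone()
after_update = json.loads(stored)
```

Had the older code decoded the row into a fixed object with only the fields it knows and written that back, `interests` would have been lost.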
Via service calls
One process sends a request over the network and expects another process to reply as quickly as possible.
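The request/response shape can be sketched without a real network by connecting two threads through a local socket pair. Everything here is a stand-in: the JSON-lines protocol, the `serve_once` function and the greeting payload are assumptions, not part of any REST or RPC framework.

```python
import json
import socket
import threading

def serve_once(conn):
    """Hypothetical service: read one JSON request line, write one JSON reply."""
    with conn, conn.makefile("rw", encoding="utf-8") as stream:
        request = json.loads(stream.readline())
        reply = {"greeting": f"hello {request['userName']}"}
        stream.write(json.dumps(reply) + "\n")
        stream.flush()

client_sock, server_sock = socket.socketpair()
server = threading.Thread(target=serve_once, args=(server_sock,))
server.start()

# The client encodes a request, sends it, then blocks until the reply arrives.
with client_sock, client_sock.makefile("rw", encoding="utf-8") as stream:
    stream.write(json.dumps({"userName": "Martin"}) + "\n")
    stream.flush()
    response = json.loads(stream.readline())

server.join()
```

Client and server must agree on the encoding of both the request and the reply, and since they are usually deployed separately, each has to tolerate older and newer versions of the other.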
Via message passing
One process sends a message to a queue or topic through a message broker, and normally doesn't expect any response.
Message brokers:
* Ensure delivery of messages
* Act as buffers
* Can redeliver messages that the recipient has not consumed
* Allow a single message to be sent to multiple recipients
* Decouple sender and recipient
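A toy in-memory broker makes some of these properties concrete: buffering, fan-out to several recipients, and a sender that never learns who the recipients are. `ToyBroker` and the topic name are hypothetical; the sketch has no persistence, acknowledgements or redelivery, which real brokers provide.

```python
import queue
from collections import defaultdict

class ToyBroker:
    """Hypothetical in-memory broker: one buffer (queue) per subscriber per topic."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic):
        inbox = queue.Queue()
        self._subscribers[topic].append(inbox)
        return inbox

    def publish(self, topic, message):
        # The sender does not wait for, or know about, the recipients.
        for inbox in self._subscribers[topic]:
            inbox.put(message)

broker = ToyBroker()
billing = broker.subscribe("signups")
emails = broker.subscribe("signups")

broker.publish("signups", {"userName": "Martin"})
broker.publish("signups", {"userName": "Ada"})

# Each subscriber consumes from its own buffer, at its own pace.
billing_first = billing.get_nowait()
emails_all = [emails.get_nowait(), emails.get_nowait()]
```

After this runs, `billing` still has one message waiting in its buffer while `emails` has drained both, and the publisher was not blocked in either case.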
Open source message brokers:
* [RabbitMQ](https://www.rabbitmq.com/)
* [Apache ActiveMQ](http://activemq.apache.org/)
* [JBoss](http://www.jboss.org/) [HornetQ](http://hornetq.jboss.org/)
* [NATS](https://nats.io/)
* [Apache Kafka](https://kafka.apache.org/)
Also worth reading about: distributed actor frameworks (like Akka), which provide an interesting programming model for concurrency and distribution across multiple nodes while avoiding typical problems such as race conditions and deadlocks.