JSON vs Google Protocol Buffers (Protobuf): when to use which one?

Before we compare these two data formats, let’s understand the context.

Every application in the digital world deals with some sort of data, and there is often a need to exchange that data between two applications. As part of that exchange, the data typically traverses some medium (storage, memory, a database, a file, a network, etc.). This movement of data through a medium goes through a pair of processes called Serialization and Deserialization.

Serialization

In the context of data, Serialization is the process of converting data into a format:

  • that can be stored (in storage, memory, a file, etc.) or transmitted over a network,
  • and that maintains the integrity of the data’s structure and content, so that the data can be reconstructed later into its original form.

Deserialization

Deserialization is the reverse process: it reconstructs the data from its stored / transmitted form back into its original form.
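As a minimal sketch of both processes in Python, using only the standard json module (the record below is a made-up example):

=============================
import json

# A record as an in-memory object (dict) in the producing application.
record = {"model_name": "MTO8943", "currency": "USD"}

# Serialization: convert the object into a storable / transmittable form.
payload = json.dumps(record)  # '{"model_name": "MTO8943", "currency": "USD"}'

# Deserialization: reconstruct the original structure from that form.
restored = json.loads(payload)
assert restored == record  # structure and content are preserved
=============================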

[Figure: Serialization and Deserialization]

Data Format

Data being stored or transmitted can be in many different formats; some of the popular ones are plain text, JSON, XML, HTML, and binary.

Let’s say an Application X wants to use data produced by an Application Y (via storage or transmission). For Application X to truly make use of the content, it needs to understand the associated structure (schema); without this understanding, the content may be useless.

Example: one particular data object from Application Y has content like:

[
  "MTO8943",
  "-2F",
  "60F",
  "50",
  "8000",
  "35,000.00",
  "USD"
]


For Application X to process this data, it needs to know what each data point means. There are two common ways to achieve that:

  1. With each response, include the relevant (semantic) structure.
  2. Establish a convention (a protocol) between the two applications, so that Application X knows what each data point refers to.


Option 1:

A format like JSON helps here.

{
  "Machine Specs": {
    "Model Name": "MTO8943",
    "Lowest Operating Temperature": "-2F",
    "Maximum Operating Temperature": "60F",
    "Daily Processing Capacity in Tons": "50",
    "Maximum Output Count": "8000",
    "Price": {
      "Retail Price": "35,000.00",
      "Currency": "USD"
    }
  }
}


The challenge with this approach: if your application exchanges hundreds of thousands of such data objects, a great deal of bandwidth (in both storage and transmission) is wasted carrying the same schema notes over and over to the same application. Processing time is much higher as well. The quick measurement sketched below illustrates this.
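As a rough illustration using Python’s standard json module (the record mirrors the example above), the self-describing keys account for the bulk of each payload’s bytes:

=============================
import json

# The self-describing record from the example above.
record = {
    "Machine Specs": {
        "Model Name": "MTO8943",
        "Lowest Operating Temperature": "-2F",
        "Maximum Operating Temperature": "60F",
        "Daily Processing Capacity in Tons": "50",
        "Maximum Output Count": "8000",
        "Price": {"Retail Price": "35,000.00", "Currency": "USD"},
    }
}

# The same content with the schema (keys) stripped away.
values_only = ["MTO8943", "-2F", "60F", "50", "8000", "35,000.00", "USD"]

with_keys = len(json.dumps(record))
without_keys = len(json.dumps(values_only))
print(with_keys, without_keys)  # the keyed payload is several times larger
=============================

Multiply that difference by hundreds of thousands of objects and the overhead becomes significant.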

This is where Option 2 can help. A protocol like Protocol Buffers (Protobuf) can establish an understanding between the two applications, after which transferring only the data content suffices.

Protocol Buffers (Protobuf):

Protocol Buffers is a method (developed at Google) that establishes a shared understanding between applications by defining the data structure in an Interface Description Language (IDL), and that handles serialization / deserialization through its own binary encoding / decoding mechanism.

The above example, described in IDL, may look like:

Schema:

=============================
syntax = "proto3";

message MachineSpecsResponse {

  // In proto3 every field is optional by default; the proto2
  // "required" label is not supported.
  string model_name = 1;
  string lowest_operating_temperature = 2;
  string maximum_operating_temperature = 3;
  int32 daily_processing_capacity_in_tons = 4;
  int32 maximum_output_count = 5;

  message Price {
    float retail_price = 1;
    string currency = 2;
  }

  // A field of the nested type, so the price data is carried in the message.
  Price price = 6;
}
=============================


With this schema shared with Application X in advance, it can consume just the content (for its corresponding calls) and process the data much faster. A usage sketch follows.
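As a sketch, assume the schema above is saved as machine_specs.proto and compiled with protoc --python_out=. machine_specs.proto; the machine_specs_pb2 module and its names below simply follow from that assumption:

=============================
# Requires the protobuf package (pip install protobuf) and the
# machine_specs_pb2 module generated by protoc as described above.
import machine_specs_pb2

# Application Y: populate the message and serialize it to compact binary.
msg = machine_specs_pb2.MachineSpecsResponse(
    model_name="MTO8943",
    lowest_operating_temperature="-2F",
    maximum_operating_temperature="60F",
    daily_processing_capacity_in_tons=50,
    maximum_output_count=8000,
)
msg.price.retail_price = 35000.00
msg.price.currency = "USD"
payload = msg.SerializeToString()  # raw bytes; no field names inside

# Application X: reconstruct the message using the same shared schema.
decoded = machine_specs_pb2.MachineSpecsResponse()
decoded.ParseFromString(payload)
print(decoded.model_name, decoded.price.currency)
=============================

Because the field names live in the shared schema rather than in the payload, the serialized bytes carry only field numbers and values.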

Similar Interface Description Language (IDL) based protocols:

  • Apache Thrift (developed at Facebook)
  • Ion (developed at Amazon)
  • Microsoft Bond protocol (do I need to tell you who developed it? Really??)

Let’s look at a quick comparison of the pros and cons of Protobuf, which can also guide when to use it and when to move to other formats (such as JSON):


Advantages of Protobuf:

  • High performance, especially at decoding time (part of deserialization)
  • Well-defined schema – the structure and type of every field is agreed up front
  • Lower cost due to lower data bandwidth consumption – especially when you are charged for data transfer out (e.g., from the AWS network); see the sketch after this list
  • Supports many popular languages: C++, Java, Python, and Objective-C, with the latest version (proto3) also supporting Dart, Go, Ruby, and C#

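To make the bandwidth advantage concrete, here is a hypothetical comparison reusing the machine_specs_pb2 module assumed earlier; exact numbers vary, but the binary payload is typically a small fraction of the equivalent JSON:

=============================
import json
import machine_specs_pb2  # hypothetical module generated earlier

msg = machine_specs_pb2.MachineSpecsResponse(
    model_name="MTO8943",
    lowest_operating_temperature="-2F",
    maximum_operating_temperature="60F",
    daily_processing_capacity_in_tons=50,
    maximum_output_count=8000,
)
msg.price.retail_price = 35000.00
msg.price.currency = "USD"

as_protobuf = msg.SerializeToString()
as_json = json.dumps({
    "Machine Specs": {
        "Model Name": "MTO8943",
        "Lowest Operating Temperature": "-2F",
        "Maximum Operating Temperature": "60F",
        "Daily Processing Capacity in Tons": "50",
        "Maximum Output Count": "8000",
        "Price": {"Retail Price": "35,000.00", "Currency": "USD"},
    }
}).encode()

print(len(as_protobuf), len(as_json))  # Protobuf comes out far smaller
=============================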

Disadvantages of Protobuf (these are also the cases where JSON or other formats may be better options):

  • It requires a well-defined schema. When the data schema changes a lot at run-time, or there is no way to put a well-defined structure in place, JSON may be the better option (because you send the structure along with the data)
  • Serialized data is not human-readable (or at least not human-understandable). When humans need to be able to read the data, in addition to machines consuming it, JSON is the better option
  • When data is read directly by a web browser (browsers can parse JSON natively, but not Protobuf without extra tooling)
  • Though many popular languages are supported, many others are not
  • Community support for Protobuf is much smaller compared to formats like JSON


In conclusion, it makes sense to adopt Protobuf over the likes of JSON when you have a large amount of data to deal with, particularly data with a well-defined schema. This is one of the reasons why Protobuf is becoming more popular with data scientists on machine learning projects. Having said that, there are many scenarios where JSON would serve better.