I admit that I had never heard of this data format before this spring.
I was getting some errors following the official documentation of an open source library for creating ASR systems[1].
Nothing was making sense until I noticed I wasn’t working with a JSON array of objects, but rather JSON objects separated by a newline. Eureka!
In its simplicity it is a truly ingenious solution to the problem of deserializing large JSON files: you read the file line by line and parse each line on its own, which is simple, flexible, easy to understand and very effective!
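To make the line-by-line idea concrete, here is a minimal Python sketch. The function name `read_json_lines` and the sample records are made up for illustration, and skipping blank lines is my own assumption about how lenient a reader should be, not something either spec requires.

```python
import io
import json
from typing import Any, Iterator, TextIO


def read_json_lines(stream: TextIO) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of the stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # assumption: tolerate blank lines instead of failing
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            # Report which line is broken; the rest of the file is unaffected.
            raise ValueError(f"line {line_number}: invalid JSON ({exc.msg})") from exc


# Hypothetical sample: two objects separated by newlines, not a JSON array.
sample = io.StringIO('{"id": 1, "text": "hello"}\n{"id": 2, "text": "world"}\n')
records = list(read_json_lines(sample))
```

Because the function is a generator, only one line is held in memory at a time, so the same code should work on a file opened with `open(path, encoding="utf-8")` regardless of its size.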
ndjson-lines-seq what?
This format is very simple, but its standardization is not; in fact there are two commonly accepted specs (not officially recognized, but identical in practice):
- Newline Delimited JSON (ndjson)
- JSON Lines (jsonl)[2]
The only difference I could find between those two specs is that ndjson says:
All serialized data MUST use the UTF8 encoding.
while jsonl says:
JSON allows encoding Unicode strings with only ASCII escape sequences, however those escapes will be hard to read when viewed in a text editor. The author of the JSON Lines file may choose to escape characters to work with plain ASCII files.
In practice they are identical, and the original author of ndjson stated that he's not against deprecating it in favour of jsonl[3].
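The encoding difference quoted above is easy to see with Python's standard `json` module. This is only a sketch with a made-up record: the default `ensure_ascii=True` produces the ASCII-escaped form that jsonl tolerates, while `ensure_ascii=False` keeps the characters as they are, ready to be written out as UTF-8.

```python
import json

# Hypothetical record containing a non-ASCII character.
record = {"word": "perché"}

# Default behaviour: non-ASCII characters become \uXXXX escapes.
escaped_line = json.dumps(record)

# Keep the characters as-is; encode the result as UTF-8 when writing it out.
utf8_line = json.dumps(record, ensure_ascii=False)

# Both lines decode to exactly the same object.
same_data = json.loads(escaped_line) == json.loads(utf8_line)
```

Either way a reader gets the same data back, which is why the two specs behave identically in practice.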
There is a third contender, RFC 7464 (JSON Text Sequences), published in February 2015, which has some more rules about encoding, delimiting and invalid or incomplete JSON texts.
As far as I understand, this RFC is still a Proposed Standard, while jsonl and ndjson are more widespread even though neither has ever been submitted as an RFC[4].
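For comparison, here is a rough sketch of the RFC 7464 framing, where each JSON text is preceded by an ASCII record separator (0x1E) and followed by a newline. The function names are hypothetical, and silently skipping a chunk that fails to parse is just one possible recovery policy, chosen here to keep the example short.

```python
import json
from typing import Any, Iterable, Iterator

RS = "\x1e"  # ASCII record separator that starts every JSON text


def write_json_seq(records: Iterable[Any]) -> str:
    """Frame each record as RS + JSON text + newline."""
    return "".join(f"{RS}{json.dumps(record)}\n" for record in records)


def read_json_seq(text: str) -> Iterator[Any]:
    """Yield every parseable JSON text, resuming at the next RS on errors."""
    for chunk in text.split(RS):
        chunk = chunk.strip()
        if not chunk:
            continue  # nothing before the first RS
        try:
            yield json.loads(chunk)
        except json.JSONDecodeError:
            continue  # assumption: drop truncated or invalid texts


encoded = write_json_seq([{"a": 1}, [2, 3]])
# Simulate a writer that crashed halfway through a third record.
damaged = encoded + RS + '{"trunc'
recovered = list(read_json_seq(damaged))
```

The explicit separator is what lets a reader resynchronize after a damaged record, at the cost of a control character that most text editors display poorly.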
OK, this format is extremely simple, so why does the formalism matter?
If we talk about the file format itself, not having a standard causes few issues, as long as both parties involved know how to serialize and deserialize it. After all, there are countless non-standardized formats serialized and deserialized billions of times every day, no?
The issues arise when we work with web servers and browsers: they need to know the right MIME type in order to support the format properly. And since there is no standard, these formats are either totally unsupported or supported only sporadically.
At the moment ndjson is (or better, SHOULD be) reported as application/x-ndjson, and jsonl is reported as application/jsonl (see this issue if you want to help write the relevant RFC).
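On the server side, the MIME types mentioned above could be chosen with something like the following sketch. The helper `content_type_for` and the file names are hypothetical, and the fallback to `application/octet-stream` is my own choice; the two line-delimited types are the de facto values from this post, not registered standards.

```python
from pathlib import PurePath

# De facto content types discussed in this post, plus plain JSON.
CONTENT_TYPES = {
    ".ndjson": "application/x-ndjson",
    ".jsonl": "application/jsonl",
    ".json": "application/json",
}


def content_type_for(filename: str) -> str:
    """Pick a Content-Type header value from the file extension."""
    suffix = PurePath(filename).suffix.lower()
    # Assumption: unknown extensions fall back to a generic binary type.
    return CONTENT_TYPES.get(suffix, "application/octet-stream")


jsonl_type = content_type_for("train_manifest.jsonl")
ndjson_type = content_type_for("events.NDJSON")
```

A client that does not recognize either value will most likely treat the response as an opaque download, which is exactly the sporadic support described above.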
I personally prefer jsonl since it was the first one I came across at my job… and the extension is shorter 👼!
I’d like to rant about the fact that the official documentation had such a format ambiguity, easily fixable by using the right file extension; but the project is MIT licensed, and having neither reported the problem nor opened a pull request for it, I can’t complain.
Moreover, my employer has its “non-policies”, so I just have to forget about it and move on. ↩︎
My networking teacher at university liked to say that most computer standards are the result of “un accidente della storia” or, in English, “an accident of history”. ↩︎