October 28, 2011

The return of the CSV format

15 years ago we never thought that a flat format such as CSV would become the dominant format for open data and public data exchange, but these days it seems it is. That is remarkable, since there have been so many attempts to create rich, structured data formats. XML was getting popular as a hierarchical and easy-to-parse language, but these days it is heavily used for lower-level data layers and less for downloadable, easily accessible open public data. Excel used to be all over the world, but it turned out not to be a proper format for public data, since it is not open. It is also not well designed, mixing formatting constructs with data typing. RDF was promising, but was not the format of choice for hard-core data processing. SDMX tried in the field of statistics, but is still fighting for recognition. JSON is popular in the field of APIs, not so much for bare data downloads.

CSV is the winner... back to basics... easy to understand... plenty of tools to work with it. Google's DSPL uses it, along with a rich model to describe the metadata of the data. The United Nations database UNdata uses it. Open Data UK uses it, and many more...

So the world of public data is getting flatter. The new data paradigm is to combine a simple flat data format with rich metadata about that flat data. Certain metadata, such as data types and hierarchical relationships among variables, can be included explicitly in the metadata, but they can also be derived from the flat data itself. This is how the treemapping tools of DrasticData work: they derive data types and hierarchical relationships from the bare flat data, as the sketch below illustrates. Try the CSV examples included in the DrasticTreemapDesktop release and see how it works...
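
To give an idea of what "deriving metadata from the flat data" can mean, here is a minimal Python sketch. It is not the DrasticData implementation, just an illustration under simple assumptions: the file name example.csv is hypothetical, types are guessed by trying to parse each column, and a hierarchy candidate is proposed wherever one column functionally determines a coarser one (every child value always co-occurs with a single parent value).

import csv
from collections import defaultdict

def infer_type(values):
    """Classify a column as 'int', 'float' or 'text' from its values."""
    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False
    if all_parse(int):
        return "int"
    if all_parse(float):
        return "float"
    return "text"

def infer_hierarchy(columns):
    """Propose (parent, child) pairs: every value of the child column maps to
    exactly one value of the parent column, and the parent is coarser."""
    candidates = []
    names = list(columns)
    for child in names:
        for parent in names:
            if child == parent:
                continue
            mapping = {}
            functional = True
            for c_val, p_val in zip(columns[child], columns[parent]):
                if mapping.setdefault(c_val, p_val) != p_val:
                    functional = False
                    break
            if functional and len(set(columns[parent])) < len(set(columns[child])):
                candidates.append((parent, child))
    return candidates

# Read a CSV with a header row into per-column value lists.
with open("example.csv", newline="") as f:   # hypothetical file name
    rows = list(csv.DictReader(f))

columns = defaultdict(list)
for row in rows:
    for name, value in row.items():
        columns[name].append(value)

for name, values in columns.items():
    print(name, "->", infer_type(values))
print("parent/child candidates:", infer_hierarchy(columns))

Running this on a file with, say, continent, country and city columns would report all three as text and propose continent as a parent of country and country as a parent of city, which is exactly the kind of hierarchy a treemap needs.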