# CSV-LD: A CSV-based Serialization for Linked Data
Comma-separated-values (CSV) is a useful data serialization and sharing format.
This document introduces the idea of CSV-LD as a CSV-based format to serialize
Linked Data, mirroring the way that [JSON-LD](https://www.w3.org/TR/json-ld11/)
is a JSON-based format to serialize Linked Data. "CSV" here includes any dialect
that uses a different delimiter, such as tab-separated-values (TSV).
The syntax of CSV-LD is designed to easily integrate into deployed systems that
already use CSV, and provides a smooth upgrade path from CSV to CSV-LD. It is
primarily intended to be a way to use Linked Data as part of spreadsheet-based
data entry; to facilitate data validation, display, and conversion of CSV into
other formats via use of [CSV on the Web (CSVW)
metadata](https://www.w3.org/ns/csvw); and to build
[FAIR](https://doi.org/10.1038/sdata.2016.18) data services.
The term "CSV-LD" was previously used to describe [a now-obsoleted
precursor](https://github.com/gkellogg/csv-ld) to the [CSVW
specifications](https://www.w3.org/2013/csvw/wiki/Main_Page); both approaches
require a second file, a JSON-LD template document, to be shared along with a
CSV file. The approach described herein, in contrast, requires only a CSV file
from the data producer, one that includes links to CSVW-powered metadata.
## Data Producers
Data producers need to add two header rows to their CSV to make it CSV-LD: the
"key-spec" row, and the "column-spec" row. This mirrors the way that JSON can
minimally be made JSON-LD by adding a "@context" key-value pair. CSV producers
commonly include a header row to label columns with names, and it is not
uncommon to see CSV files with additional header rows (`headerRowCount` is part
of the [CSVW vocabulary](https://www.w3.org/ns/csvw)).
### The Key-Spec Row
The most important job of the top-most row, the key-spec row, is not at all
about specifying keys. Rather, it is to communicate to a data consumer that this
file is special in some way -- the left-most cell must have a URL in it, and
hopefully, someone who has never seen CSV-LD before will click on the link.
The link is `http://example.org/csv-ld/2021/01/inKey` (not really, but I'll
change this soon -- I registered `csv-ld.org` and currently point it to the CSVW
vocabulary page), and it will provide a friendly introduction to CSV-LD. The
file could indicate that it uses a different version of CSV-LD by using a
different prefix before `inKey`, which would link to that version's page.
The page will also explain that `inKey` marks a CSV column as being part of how
to identify a row uniquely. A CSV table could have a single key, such as an ID
column, or a compound key, such as year and semester (e.g. "Fall", "Spring")
uniquely identifying each row as an academic term. All key columns must be
contiguous and start on the left side -- this (1) makes the links easy to spot
for someone unfamiliar with CSV-LD, and (2) makes it easier for a data
consumer/steward to implement a CSV-LD processor.
Cell values in the key-spec row after (to the right of) the key columns (the
columns containing the `inKey` link) have only one requirement: they cannot have
the same `inKey`-link value as the key-column cells. They can be blank,
comments, whatever.
So, to summarize, the key-spec row (1) communicates that the file is a CSV-LD
file, and (2) communicates the (possibly compound) key that uniquely identifies
a row.
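
The two jobs of the key-spec row can be sketched in a few lines of Python. This is only an illustration, not the reference processor: the function names `key_column_count` and `duplicate_keys` are hypothetical, the `inKey` URL is the placeholder used above, and the sketch assumes that a leading "#" comment marker on the first cell is stripped before comparison.

```python
IN_KEY = "http://example.org/csv-ld/2021/01/inKey"  # placeholder URL from the text


def key_column_count(key_spec_row, in_key=IN_KEY):
    """Return how many leading columns are marked as key columns."""
    cells = [cell.strip() for cell in key_spec_row]
    # Assumption: the first cell may carry a leading "#" comment marker.
    if cells and cells[0].startswith("#"):
        cells[0] = cells[0][1:].strip()
    count = 0
    for cell in cells:
        if cell != in_key:
            break
        count += 1
    if count == 0:
        raise ValueError("left-most key-spec cell must hold the inKey URL")
    # Cells to the right of the key columns must not repeat the inKey link.
    if in_key in cells[count:]:
        raise ValueError("key columns must be contiguous and start on the left")
    return count


def duplicate_keys(rows, n_key):
    """Return the key tuples that occur in more than one data row."""
    seen, dupes = set(), set()
    for row in rows:
        key = tuple(row[:n_key])
        if key in seen:
            dupes.add(key)
        seen.add(key)
    return dupes
```

For the academic-term example, a key-spec row with two `inKey` cells yields a count of 2, and `duplicate_keys` then reports any (year, semester) pair that appears on more than one row.
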
### The Column-Spec Row
The job of the column-spec row is to be an unambiguous labeling of each column.
It is fine for there to be a header row below the column-spec row that exhibits
the common header-row practice of using short names ("x", "y", etc.) as column
labels.
For a data producer, this task should be a simple matter of using a template or
reference guide authored by a data steward that provides URLs for each column of
interest. For example, one might be given the following table of terms to record
environmental metadata for collected biosamples:
|Term URL|Comment|
|---|---|
|http://example.org/nmdc/id|Sample ID|
|http://example.org/nmdc/lat_lon|Latitude and longitude|
|http://example.org/nmdc/ecosystem|Type of ecosystem|
|...|...|
These aren't real URLs (but I'll eventually update this example to be real).
It's also possible that a data steward may provide namespaces for data
producers, e.g. `http://example.org/nmdc/team42/`, and producers can use the
namespaces to prefix invented terms that will later resolve to working URLs
through work done by the data steward.
The term URLs should resolve to pages that explain how values should be
formatted. For example, `http://example.org/nmdc/lat_lon` could explain that the
value should be latitude in degrees, a space, and longitude in degrees. This
explanation could be automatically generated by CSVW metadata (authored by the
data steward) that will also be used by a CSV-LD processor to validate the data.
For example, the CSVW metadata for this field could look like
```json
{
  "@context": {"@vocab": "http://www.w3.org/ns/csvw#"},
  "name": "lat_lon",
  "separator": " ",
  "ordered": true,
  "datatype": {
    "base": "number",
    "minimum": "-180",
    "maximum": "180"
  }
}
```
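
To make the validation step concrete, here is a minimal sketch of how a processor might check one cell against metadata of this shape. The function name `validate_cell` is hypothetical, and the sketch reads only the `separator` and `datatype` fields; it is not a full implementation of CSVW validation.

```python
# Column metadata mirroring the example above (only the fields used here).
LAT_LON_COLUMN = {
    "name": "lat_lon",
    "separator": " ",
    "datatype": {"base": "number", "minimum": "-180", "maximum": "180"},
}


def validate_cell(value, column):
    """Return a list of error messages for one cell; empty if it is valid."""
    datatype = column["datatype"]
    low, high = float(datatype["minimum"]), float(datatype["maximum"])
    errors = []
    # Split the cell on the column's separator and check each part on its own.
    for part in value.split(column["separator"]):
        try:
            number = float(part)
        except ValueError:
            errors.append(f"{part!r} is not a number")
            continue
        if not low <= number <= high:
            errors.append(f"{number} is outside [{low}, {high}]")
    return errors
```

Note that metadata of this shape applies the same bounds to every part of the value, so it cannot by itself require exactly two parts or restrict latitude to a narrower range than longitude.
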
## Data Consumers
A CSV-LD file is still just a CSV file, so a data consumer can simply ignore the
"extra" header rows. Each of the "extra" header rows is prefixed by a "#" and a
space, so that popular parsers can recognize these rows as "comment" lines and
skip to the "real" header row, e.g. `pandas.read_csv(...,comment="#")` for the
popular Python `pandas` data-processing library. They could also click any link
in the column-spec header to learn more about how to interpret the data in that
column. If they have access to a CSV-LD processor, they can use it to validate
the data and/or convert it to another format like JSON (i.e., JSON-LD).
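
For consumers without `pandas`, the same effect can be had with the standard library. The sketch below assumes the layout described in this document (two "#"-prefixed rows followed by a conventional header row); the sample rows and the function name `read_plain_rows` are invented for illustration.

```python
import csv
import io

# Invented sample: key-spec row, column-spec row, ordinary header, two records.
SAMPLE = """\
# http://example.org/csv-ld/2021/01/inKey,,
# http://example.org/nmdc/id,http://example.org/nmdc/lat_lon,http://example.org/nmdc/ecosystem
id,lat_lon,ecosystem
s1,46.37 -117.05,soil
s2,35.97 -84.28,freshwater
"""


def read_plain_rows(text):
    """Read a CSV-LD file as ordinary CSV, skipping lines that start with '#'."""
    lines = (line for line in io.StringIO(text) if not line.startswith("#"))
    return list(csv.DictReader(lines))


rows = read_plain_rows(SAMPLE)
print(rows[0]["lat_lon"])
```

This line-based filter is deliberately naive: a quoted field containing a line break followed by "#" would be cut short, so a real processor would need a more careful approach.
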
## Data Stewards
Data stewards are concerned with managing data integrity. They can author
JSON-LD metadata for `csvw:Column` entities, as shown in the above example for
`http://example.org/nmdc/lat_lon`, and make that metadata downloadable from the
URL used for the column.
A CSV-LD processor will request `http://example.org/nmdc/lat_lon` using an HTTP
`Accept` header that expresses a preference for a JSON-LD response, whereas a
human loading `http://example.org/nmdc/lat_lon` in their browser will get a web
page (HTML) response that the data steward has produced (perhaps auto-generated
from the metadata JSON-LD).
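
On the processor side, this content negotiation could look roughly like the following, using only the Python standard library. The function names are hypothetical, and the exact `Accept` value (preferring `application/ld+json`, falling back to plain JSON) is an assumption rather than something this document specifies.

```python
import json
import urllib.request


def build_metadata_request(column_url):
    """Build a request that prefers a JSON-LD response over HTML."""
    return urllib.request.Request(
        column_url,
        # Assumed preference order; a specification may settle this differently.
        headers={"Accept": "application/ld+json, application/json;q=0.9"},
    )


def fetch_column_metadata(column_url, timeout=10):
    """Download and parse the metadata document served at a column URL."""
    request = build_metadata_request(column_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_column_metadata("http://example.org/nmdc/lat_lon")` would only succeed once a data steward actually serves metadata at that address; the URL here is the placeholder used throughout this document.
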
Thus, a data steward needs to know how to serve web content, or needs to
collaborate with someone who can. I hope to provide in this repository a
reference server implementation, written in Python.
## Implementers of CSV-LD Processors
The final stakeholder in the CSV-LD world is the implementer of a CSV-LD
processor. A more detailed specification is to come, but I will try to adhere to
a ["worse is better"](https://en.wikipedia.org/wiki/Worse_is_better) approach
that prioritizes simplicity of implementation. Furthermore, this repository will
host a reference CSV-LD processor implementation, written in Python. Perhaps the
reference implementation will be good enough for most.
One thing to note here is that the job of the data producer is to aggregate a
set of records (rows) with well-defined fields (columns). What a record *is* or
*should be*, i.e. its type or class, and thus e.g. which columns are required,
is left open to data consumers depending on the application.
Validating a record as a whole is important, and the CSVW metadata spec can help
with this task. I expect to elaborate later on how exactly a CSV-LD processor
may be invoked e.g. as
```
csvld --entity "http://example.org/nmdc/BioSample" --out data.json data.csv
```
to validate each row of the CSV-LD, in addition to independently validating each
column value.