****
dsch
****
Introduction
============
Dsch provides a way to store data and its metadata in a structured, reliable
way. It is built upon well-known data storage engines, such as the `HDF5`_ file
format, providing performance and long-term stability.
The core feature is the schema-based approach to data storage, which means that
a pre-defined schema specification is used to determine:

* which data fields are available
* the (hierarchical) structure of data fields
* metadata of the stored values (e.g. physical units)
* expected data types and constraints for the stored values

In fact, this is similar to an API specification, but it can be attached to and
stored with the data. Programs *writing* datasets benefit from data validation
and the high-level interface. *Reading* programs can determine the given data's
schema upfront and process the data accordingly. This is especially useful when
schemas evolve over time.
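
The idea of a schema that fixes fields, structure, metadata and constraints can be illustrated with a minimal, library-independent sketch. It does not use or reproduce the dsch API: ``FieldSpec``, ``SCHEMA`` and ``validate`` are hypothetical names chosen only for this example.

```python
# Hypothetical, library-independent sketch; this is NOT the dsch API.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldSpec:
    """Schema entry for one data field (illustrative names only)."""
    dtype: type                      # expected data type
    unit: str = ""                   # metadata, e.g. a physical unit
    minimum: Optional[float] = None  # optional lower bound
    maximum: Optional[float] = None  # optional upper bound


# The schema fixes which fields exist and how they are nested.
SCHEMA = {
    "sensor": {
        "temperature": FieldSpec(float, unit="K", minimum=0.0),
        "samples": FieldSpec(int, minimum=1),
    },
    "operator": FieldSpec(str),
}


def validate(data, schema, path=""):
    """Return a list of human-readable problems; empty means valid."""
    problems = []
    # Fields that the schema does not know about are rejected.
    for name in sorted(data.keys() - schema.keys()):
        problems.append(f"{path}{name}: unknown field")
    for name, spec in schema.items():
        here = f"{path}{name}"
        if name not in data:
            problems.append(f"{here}: missing")
        elif isinstance(spec, dict):
            # Nested group: recurse into the sub-schema.
            if isinstance(data[name], dict):
                problems += validate(data[name], spec, here + ".")
            else:
                problems.append(f"{here}: expected a group of fields")
        else:
            value = data[name]
            if not isinstance(value, spec.dtype):
                problems.append(f"{here}: expected {spec.dtype.__name__}")
            elif spec.minimum is not None and value < spec.minimum:
                problems.append(f"{here}: below minimum {spec.minimum}")
            elif spec.maximum is not None and value > spec.maximum:
                problems.append(f"{here}: above maximum {spec.maximum}")
    return problems
```

A writing program would call ``validate`` before storing a dataset, while a reading program could inspect ``SCHEMA`` to learn the available fields and their units without looking at the data itself.
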
For persistent storage, dsch supports multiple storage engines via its
*backends*, but all through a single, transparent interface. Usually, there are
no client code changes required to support a new backend, and custom backends
can easily be added to dsch.
Currently, backends exist for these storage engines:

* `HDF5`_ files (through `h5py`_)
* `NumPy .npz`_ files
* `MATLAB .mat`_ files (through `SciPy`_)

Note that dsch is only a thin layer, so that users can still benefit from the
performance of the underlying storage engine. Also, files created with dsch can
always be opened directly (i.e. without dsch) and still provide all relevant
information, even the metadata!
.. _HDF5: https://hdfgroup.org
.. _h5py: http://www.h5py.org
.. _NumPy .npz: https://docs.scipy.org/doc/numpy/reference/generated/numpy.savez.html
.. _MATLAB .mat: https://www.mathworks.com/products/matlab.html
.. _SciPy: https://docs.scipy.org/doc/scipy-0.19.0/reference/io.html
Reasoning
=========
Dsch is a response to the challenges in low-level data acquisition scenarios,
which are commonly found in labs at universities or R&D departments. Both
hardware and software change frequently in these environments, and since those
changes are often made by different people, the data acquisition hardware, the
acquisition software and the data consumption software tend to get out of sync.
At the same time, datasets are often stored (and used!) for many years, which
makes backward compatibility a significant issue.
Dsch aims to counteract these problems by making the data exchange process more
explicit. Using pre-defined schemas preserves backward compatibility for as long as
possible, and when it can no longer be retained, provides a clear way to detect
(and properly handle) multiple schema versions. Also, schema-based validation
detects possible errors upfront, so that most non-security-related
checks do not have to be re-implemented in data consuming applications.
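
One way to detect and handle multiple schema versions is to derive an identifier from the schema stored with the data and dispatch on it. The following is a minimal sketch under that assumption. The hashing scheme, the dataset layout and all names (``schema_id``, ``READERS``, ``temperature_in_kelvin``) are hypothetical and not taken from dsch.

```python
# Hypothetical sketch; names and the hashing scheme are assumptions,
# not taken from dsch.
import hashlib
import json


def schema_id(schema):
    """Stable identifier for a JSON-serializable schema description."""
    # Sorting keys makes the identifier independent of dict ordering.
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two versions of the same schema that differ in the stored unit.
SCHEMA_V1 = {"temperature": {"type": "float", "unit": "degC"}}
SCHEMA_V2 = {"temperature": {"type": "float", "unit": "K"}}


def read_v1(data):
    # Old files stored degrees Celsius; convert to kelvin.
    return data["temperature"] + 273.15


def read_v2(data):
    # Current files already store kelvin.
    return data["temperature"]


# Map each known schema identifier to the reader that understands it.
READERS = {schema_id(SCHEMA_V1): read_v1, schema_id(SCHEMA_V2): read_v2}


def temperature_in_kelvin(dataset):
    """Dispatch on the schema stored alongside the data."""
    reader = READERS.get(schema_id(dataset["schema"]))
    if reader is None:
        raise ValueError("unsupported schema version")
    return reader(dataset["data"])
```

Because the schema travels with the data, a consumer can refuse unknown versions explicitly instead of misinterpreting values that changed meaning between versions.
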
Note that dsch is targeted primarily at these low-level applications. When using
high-level data processing or even data science and machine learning techniques,
data is often pre-processed and aggregated with regard to a specific
application, which often eliminates the need for some of dsch's features, such
as the metadata storage. One might think of dsch as the tool to handle data
*before* it is loaded into something like `pandas`_.
.. _pandas: https://pandas.pydata.org/
*********
Changelog
*********
This project follows the guidelines of `Keep a changelog`_ and adheres to
`Semantic versioning`_.
.. _Keep a changelog: http://keepachangelog.com/
.. _Semantic versioning: https://semver.org/
`0.3.0`_ - 2021-02-12
=====================
Added
-----
* New ``data_tree`` method for exporting data as nested ``dict``/``list``
  structures.
Changed
-------
* Improve documentation.
* Improve tests.
Fixed
-----
* Minor updates to handle ``h5py`` deprecations.
`0.2.1`_ - 2018-02-02
=====================
Changed
-------
* ``h5py`` and ``scipy``, needed for HDF5 and MAT file support, respectively,
  are now listed as extras / optional dependencies in ``setup.py``.
Fixed
-----
* Fix missing type conversion for ``Scalar`` in ``inmem`` backend that caused
  validation to incorrectly fail in some cases.
`0.2.0`_ - 2018-02-01
=====================
Added
-----
* New node type for ``bytes`` data.
* In-memory backend, for handling data without needing e.g. a file on disk.
* Support for copying data between different storages.
* Support for creating new storages from existing ones, a.k.a. "save as".
* ``PseudoStorage`` abstraction class for unified data access in libraries.
* Human-readable tree representation of data nodes for use in interactive
  sessions.
* Support ``==`` operator for schema nodes.
Changed
-------
* Data nodes in Compilations and Lists can no longer be overwritten
  accidentally when trying to overwrite their stored value.
* Improve structure and conciseness of docs.
* Change List to evaluate ``empty``-ness recursively.
* Replace generic exceptions like ``TypeError`` by custom dsch exceptions.
`0.1.3`_ - 2018-01-11
=====================
Changed
-------
* Attempting to open a non-existent file now shows a sensible error message.
* Attempting to create an existing file now shows a sensible error message.
Fixed
-----
* Fix error when handling partially filled compilations.
* Fix typo in documentation.
`0.1.2`_ - 2017-08-25
=====================
Fixed
-----
* Fix incorrect ordering of list items.
`0.1.1`_ - 2017-06-09
=====================
Added
-----
* Cover additional topics in documentation.
Fixed
-----
* Fix error when handling single-element lists with ``mat`` backend.
`0.1.0`_ - 2017-05-18
=====================
Added
-----
* First preview release.
.. _Unreleased: https://github.com/emtpb/dsch
.. _0.3.0: https://github.com/emtpb/dsch/releases/tag/0.3.0
.. _0.2.1: https://github.com/emtpb/dsch/releases/tag/0.2.1
.. _0.2.0: https://github.com/emtpb/dsch/releases/tag/0.2.0
.. _0.1.3: https://github.com/emtpb/dsch/releases/tag/0.1.3
.. _0.1.2: https://github.com/emtpb/dsch/releases/tag/0.1.2
.. _0.1.1: https://github.com/emtpb/dsch/releases/tag/0.1.1
.. _0.1.0: https://github.com/emtpb/dsch/releases/tag/0.1.0