# `dantro`: handle, transform, and visualize hierarchically structured data
`dantro`—from *data* and *dentro* (Greek for *tree*)—is a Python package that provides a uniform interface for hierarchically structured and semantically heterogeneous data.
It is built around three main features:
* **data handling:** loading heterogeneous data into a tree-like data structure and providing a uniform interface for it
* **data transformation:** performing arbitrary operations on the data, if necessary using lazy evaluation
* **data visualization:** creating a visual representation of the processed data
Together, these stages constitute a **data processing pipeline**: an automated sequence of predefined, configurable operations.
Akin to a Continuous Integration pipeline, a data processing pipeline provides a uniform, consistent, and easily extensible infrastructure that contributes to more efficient and reproducible workflows.
This can be beneficial especially in a scientific context, for instance when handling data that was generated by computer simulations.
`dantro` is meant to be **integrated** into projects and to be used to set up such a data processing pipeline.
It is designed to be **easily customizable** to the requirements of the project it is integrated into, even if the involved data is hierarchically structured or semantically heterogeneous.
Furthermore, it allows a **configuration-based specification** of all operations via YAML configuration files; the resulting pipeline can then be controlled entirely via these configuration files and without requiring code changes.
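
To make the idea of a configuration-controlled pipeline more concrete, here is a minimal sketch in plain Python. It is *not* `dantro`'s API: the nested-dict tree, the `get_by_path` and `run_pipeline` helpers, and the `select`/`operation` keys are all hypothetical and chosen only for illustration. The point is that the operations are chosen by a configuration (which could be loaded from a YAML file), so changing the analysis does not require changing code.

```python
# Illustrative sketch only: plain Python, not dantro's actual interface.
from functools import reduce

# Hierarchically structured, heterogeneous data, modelled here as nested dicts
tree = {
    "simulation": {
        "cfg": {"seed": 42},
        "data": {"population": [10, 12, 15, 19]},
    }
}


def get_by_path(node, path):
    """Uniform access: resolve a '/'-separated path within the nested dicts."""
    return reduce(lambda sub, key: sub[key], path.split("/"), node)


# Operations that a configuration entry may refer to by name
OPERATIONS = {"sum": sum, "max": max, "len": len}

# This list could equally be loaded from a YAML file; all keys are hypothetical
config = [
    {"name": "total", "select": "simulation/data/population", "operation": "sum"},
    {"name": "peak", "select": "simulation/data/population", "operation": "max"},
]


def run_pipeline(tree, config):
    """Select data from the tree and apply the configured operation to it."""
    results = {}
    for step in config:
        data = get_by_path(tree, step["select"])
        results[step["name"]] = OPERATIONS[step["operation"]](data)
    return results


results = run_pipeline(tree, config)
print(results)  # {'total': 56, 'peak': 19}
```

For the actual interface and configuration schema, refer to the package documentation.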
The `dantro` package is **open source software** released under the LGPLv3+ license (see [copyright notice](#copyright) below).
It was developed alongside the [Utopia project][utopia-project], but is an independent package.
We describe the motivation and scope of `dantro` in more detail in [this publication in the Journal of Open Source Software][dantro-joss-doi].
For more information on the package, its features, philosophy, and integration, please visit its **documentation** at [`dantro.readthedocs.io`][dantro-docs].
If you encounter any issues with `dantro` or have suggestions or questions of any kind, please open an issue via the [**project page**][dantro-project].
## Installing dantro
The `dantro` package is available [on the Python Package Index][pypi-dantro] and [via `conda-forge`][conda-forge-dantro].
If you are unsure which installation method works best for you, we recommend using `conda`.
Note that `dantro` is meant to be *integrated* into your project and customized to its needs; only then can you make full use of its features.
Basic usage examples and an integration guide can be found in the [package documentation][dantro-docs].
### Installation via [`conda`][conda]
As a first step, install [Anaconda][Anaconda] or [Miniconda][Miniconda], if you have not already done so.
You can then use the following command to install dantro and its dependencies:
```bash
$ conda install -c conda-forge dantro
```
### Installation via [`pip`][pip]
If you already have a Python installation on your system, you probably already have `pip` installed as well.
To install dantro and its dependencies, invoke the following command:
```bash
$ pip install dantro
```
In case the `pip` command is not available, follow [these instructions][pip-installation] to install it or switch to the `conda`-based installation.
_Note_ that if you have both Python 2 and Python 3 installed, you might have to use the `pip3` command instead.
### Dependencies
`dantro` is implemented for [Python >= 3.6][Python3] and depends on the following Python packages:
| Package Name | Minimum Version | Purpose |
| ----------------------------- | ---------------- | ------------------------ |
| [numpy][numpy] | 1.17.4 | |
| [xarray][xarray] | 0.16 | For labelled N-dimensional arrays |
| [dask][dask] | 2.10.1 | To work with large data |
| [toolz][toolz] | 0.10 | For [dask.delayed][dask-delayed] |
| [distributed][distributed] | 2.10 | For distributed computing |
| [scipy][scipy] | 1.4.1 | As engine for NetCDF files |
| [sympy][sympy] | 1.6.1 | For symbolic math operations |
| [h5py][h5py] | 2.10 | For reading HDF5 datasets |
| [matplotlib][matplotlib] | 3.2.1 | For data visualization |
| [networkx][networkx] | 2.2 | For network visualization |
| [ruamel.yaml][ruamelyaml] | 0.16.10 | For parsing YAML configuration files |
| [paramspace][paramspace] | 2.5 | For dictionary- or YAML-based parameter spaces |
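
To check whether an existing environment satisfies these minimum versions, something along the following lines may help. This is a sketch, not part of `dantro`: it assumes Python 3.8 or later (for `importlib.metadata`), the helper names are hypothetical, and the version parsing is deliberately simplistic, looking only at the leading numeric components of each version string.

```python
import re
from importlib import metadata

# Minimum versions as listed in the table above
MINIMUM_VERSIONS = {
    "numpy": "1.17.4",
    "xarray": "0.16",
    "dask": "2.10.1",
    "toolz": "0.10",
    "distributed": "2.10",
    "scipy": "1.4.1",
    "sympy": "1.6.1",
    "h5py": "2.10",
    "matplotlib": "3.2.1",
    "networkx": "2.2",
    "ruamel.yaml": "0.16.10",
    "paramspace": "2.5",
}


def parse_version(version):
    """Return the leading numeric components, e.g. '2.10.1rc1' -> (2, 10, 1)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def at_least(installed, minimum):
    """Compare two version strings after padding them to equal length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_minimum_versions(minimums):
    """Map each package name to 'ok', 'missing', or a 'too old' message."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if at_least(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed} < {minimum})"
    return report


if __name__ == "__main__":
    for name, status in check_minimum_versions(MINIMUM_VERSIONS).items():
        print(f"{name:<14} {status}")
```

For anything beyond a quick check, a dedicated version parser handles pre-releases and other edge cases more reliably than this simplified comparison.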
## Developing dantro
### Installation for developers
To install versions that are not on PyPI, `pip` allows specifying a URL to a git repository:
```bash
$ pip install git+<clone-link>@<some-branch-name>
```
Here, replace `<clone-link>` with the clone URL of this project and `<some-branch-name>` with the name of the branch that you want to install the package from (see the [`pip` documentation][pip-install-docs] for details).
Alternatively, omit the `@` and everything after it to install from the repository's default branch.
If you do not have SSH keys available, use the HTTPS clone URL.
If you would like to contribute to `dantro` (yeah!), you should clone the repository to a local directory:
```bash
$ git clone <clone-link>
```
For development purposes, it makes sense to work in a specific [virtual environment][venv] for dantro and install dantro in editable mode:
```bash
$ python3 -m venv ~/.virtualenvs/dantro
$ source ~/.virtualenvs/dantro/bin/activate
(dantro) $ pip install -e ./dantro
```
### Additional dependencies
For development purposes, the following additional packages are required.
| Package Name | Minimum Version | Purpose |
| ----------------------------- | ---------------- | ------------------------ |
| [pytest][pytest] | 3.4 | Testing framework |
| [pytest-cov][pytest-cov] | 2.5.1 | Coverage report |
| [tox][tox] | 3.1.2 | Test environments |
| [Sphinx][sphinx] | 2.4 (< 3.0) | Documentation generator |
| [sphinx_rtd_theme][sphinxrtd] | 0.5 | Documentation HTML theme |
To install these development-related dependencies, enter the virtual environment, navigate to the cloned repository, and perform the installation using:
```bash
(dantro) $ cd dantro
(dantro) $ pip install -e ".[dev]"
```
### Testing framework
To assert correct functionality, tests are written alongside all features.
The [`pytest`][pytest] and [`tox`][tox] packages are used as testing frameworks.
All tests are run for Python 3.6 through 3.8 using GitLab CI/CD and the newest versions of all [dependencies](#dependencies).
When merging to the master branch, `dantro` is additionally tested against the specified _minimum_ versions.
Test coverage and pipeline status can be seen on [the project page][dantro-project].
#### Running tests
To run all [defined tests](tests/), call:
```bash
(dantro) $ python -m pytest -v tests/ --cov=dantro --cov-report=term-missing
```
This also provides a coverage report, showing the lines that are *not* covered by the tests.
Alternatively, [`tox`][tox] makes it possible to select different Python environments for testing.
Provided the corresponding interpreter is available, the tests for a specific environment can be run with the following command:
```bash
(dantro) $ tox -e py37
```
### Documentation
#### Locally building the documentation
To build `dantro`'s documentation locally via [Sphinx][sphinx], install the required dependencies and invoke the `make doc` command:
```bash
(dantro) $ cd doc
(dantro) $ make doc
```
You can then view the documentation by ope