# Clust
Optimised consensus clustering of one or more heterogeneous datasets.
Try the beta version of the *clust* website front-end at http://clust.baselabujamous.com.
Or read below for an easy-to-use *clust* command line!
### Contents
* [What does *Clust* do?](#what-does-clust-do)
* [How does *Clust* do it?](#how-does-clust-do-it)
* [Install *Clust*](#install-clust)
* [Run *Clust*](#run-clust)
* [Normalisation](#normalisation)
* [Handling replicates](#handling-replicates)
* [Data from multiple species](#data-from-multiple-species)
* [Data from multiple technologies (e.g. mixing RNA-seq and microarrays)](#data-from-multiple-technologies-eg-mixing-rna-seq-and-microarrays)
* [Handling missing genes](#handling-missing-genes)
* [Handling genes with low expression](#handling-genes-with-low-expression)
* [Are you obtaining noisy clusters?](#are-you-obtaining-noisy-clusters)
* [List of all parameters](#list-of-all-parameters)
* [Example datasets](#example-datasets)
* [Citation](#citation)
# What does Clust do?
*Clust* is a fully automated method for identification of clusters (groups) of genes that are consistently
co-expressed (well-correlated) in one or more heterogeneous datasets from one or multiple species.
#### The single dataset case:
![Clusters_oneDS](Images/Clusters_1DS.png)
*Figure 1: Clust processes one gene expression dataset to identify (*K*) clusters of co-expressed genes. Clust
automatically identifies the number of clusters (*K*).*
#### The multiple datasets case:
![Clusters_multiDS](Images/Clusters.png)
*Figure 2: Clust processes multiple gene expression datasets (X1, X2, ... X(*L*)) to identify clusters of genes
that are co-expressed (well-correlated) in each of the input datasets. The left-hand panel shows the gene expression
profiles of all genes in each one of the input datasets, while the right-hand panel shows the gene expression profiles
of the genes in the clusters (C1, C2, ... C(*k*)). Note that the number of conditions or time points differs
between datasets.*
### Features!
1. No need to pre-process your data; *clust* automatically normalises the data.
2. No need to preset the number of clusters; *clust* finds this number automatically.
3. You can control the tightness of the clusters by varying a single parameter, `-t`.
4. It is okay if the datasets:
    * Were generated by different technologies (e.g. RNA-seq or microarrays)
    * Are from different species
    * Have different numbers of conditions or time points
    * Have multiple replicates for the same condition
    * Require different types of normalisation
    * Were generated in different years and laboratories
    * Have some missing values
    * Do not include every single gene in every single dataset
5. *Clust* generates the following output files:
    * A table of clustering statistics
    * A table listing genes included in each cluster
    * Pre-processed (normalised, summarised, and filtered) datasets' files
    * Plotted gene expression profiles of clusters (a PDF file)
# How does Clust do it?
![Clust workflow](Images/Workflow_PyPkg.png)
*Figure 3: Automatic Clust analysis pipeline*
# Install *Clust*
### Way 1
* `sudo pip install clust`
Then run it from any directory as:
* `clust ...`
### Way 2
* `pip install --user clust`
Then run it from any directory as:
* `clust ...`
### Way 3 (less recommended)
First, make sure you have all of the following Python packages installed:
* numpy
* scipy
* matplotlib
* scikit-learn
* pandas
* joblib
* portalocker
Then, download the latest release file (clust-*.*.*.tar.gz) from the
[release tab](https://github.com/BaselAbujamous/clust/releases)
and run *clust* directly, without installation, using the script `clust.py`
in the top-level directory of the source code:
* `python clust.py ...`
**Hint**: you can check which packages you have installed with:
* `pip freeze`
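
As an alternative to reading the `pip freeze` output by eye, the check can be scripted. The following is a minimal sketch, not part of *clust* itself: the function name `missing_packages` and the mapping `REQUIRED_MODULES` are hypothetical names chosen for this illustration. It relies only on the standard library and on the fact that scikit-learn is imported under the module name `sklearn`.

```python
from importlib.util import find_spec

# Package name (as listed above) -> module name used at import time.
# Only scikit-learn differs: it is imported as "sklearn".
REQUIRED_MODULES = {
    "numpy": "numpy",
    "scipy": "scipy",
    "matplotlib": "matplotlib",
    "scikit-learn": "sklearn",
    "pandas": "pandas",
    "joblib": "joblib",
    "portalocker": "portalocker",
}


def missing_packages(modules=REQUIRED_MODULES):
    """Return a sorted list of package names whose module cannot be found."""
    return sorted(
        name for name, module in modules.items() if find_spec(module) is None
    )


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages appear to be installed.")
```

Any package reported as missing can then be installed with `pip` before running `python clust.py ...`.
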
### Upgrade clust to a newer version
If you already have *clust* and want to upgrade it, use the command that matches
the way you installed it (from the ways above):
- Way 1. `sudo pip install clust --upgrade`
- Way 2. `pip install --user clust --upgrade`
- Way 3. Download the newer release file (clust-*.*.*.tar.gz) and use it
to run clust instead of the older one
### For Windows users
Clust has not been tested thoroughly on Windows. If you try it, your feedback will be much appreciated.
We recommend that you download and install WinPython from http://winpython.github.io/, which provides
many of the Python packages that *clust* requires.
Open `WinPython Powershell Prompt.exe` from the directory in which you installed WinPython.
Run:
* `pip install clust`
Then you can run *clust* by:
* `clust ...`
# Run *Clust*
For normalised homogeneous datasets, simply run:
- `clust data_path`
- `clust data_path -o output_directory [...]`
Where `data_path` is either the path to a single data file (**v1.8.5+**),
or a path to a directory including one or more data files. This command
runs *clust* with default parameters. If the output directory is not
provided using the `-o` option, *clust* creates a new directory for the
results within the current working directory.
For raw RNA-seq TPM, FPKM, or RPKM data, consider the [Normalisation](#normalisation) section below.
Other sections below address handling [replicates](#handling-replicates), handling data from
[multiple species](#data-from-multiple-species), and handling
[microarray data](#data-from-multiple-technologies-eg-mixing-rna-seq-and-microarrays) (only or mixed with RNA-seq data).
### Data files format
Each dataset is represented in a single TAB delimited (TSV) file in which the first column represents gene IDs,
the first row represents unique labels of the samples, and the rest of the file includes numerical values, mainly
gene expression values.
![Data_simple](Images/Data_simple.png)
*Figure 4: Snapshots of the first few lines of three data files X1.txt, X2.txt, and X3.txt.*
* When the same gene ID appears in different datasets, it is considered to refer to the same gene.
* If more than one row in the same file has the same identifier, these rows are automatically summarised by
summing up their values.
* **IMPORTANT**: Gene names should not include spaces, commas, or semicolons.
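
The format rules above can be illustrated with a short reader. This is a minimal sketch written for this README, not the code that *clust* runs internally: the function name `read_dataset` and the gene and sample labels in the example are hypothetical, and the sketch assumes that every value is present and numeric (it does not handle missing values).

```python
import csv
import io
import re

# Characters that gene names should not contain: spaces, commas, semicolons.
FORBIDDEN = re.compile(r"[ ,;]")


def read_dataset(handle):
    """Read a TAB-delimited dataset from an open text handle.

    The first row holds the sample labels and the first column holds the
    gene IDs. Rows sharing a gene ID are summed, as described above.
    Returns (sample_labels, {gene_id: [values]}).
    """
    reader = csv.reader(handle, delimiter="\t")
    samples = next(reader)[1:]
    if len(set(samples)) != len(samples):
        raise ValueError("sample labels must be unique")

    genes = {}
    for row in reader:
        if not row:
            continue  # skip empty lines
        gene = row[0]
        if FORBIDDEN.search(gene):
            raise ValueError("invalid gene name: %r" % gene)
        values = [float(v) for v in row[1:]]
        if gene in genes:
            # Duplicate identifier: add the values element-wise.
            genes[gene] = [a + b for a, b in zip(genes[gene], values)]
        else:
            genes[gene] = values
    return samples, genes


if __name__ == "__main__":
    # Hypothetical three-row dataset in which gene G1 appears twice.
    text = "Genes\tS1\tS2\nG1\t1.0\t2.0\nG2\t3.0\t4.0\nG1\t0.5\t0.5\n"
    samples, genes = read_dataset(io.StringIO(text))
    print(samples)
    print(genes)
```

To read a real file, pass a handle opened with `open(path, newline="")`, as the `csv` module documentation recommends.
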
# Normalisation
**NEW FEATURE: AUTOMATIC NORMALISATION! (V1.7.0 and newer)**
*Clust* applies data normalisation during its pre-processing step.
* Version 1.7.0 and newer: *Clust* **automatically detects** the most suitable normalisation for each dataset unless
otherwise stated by the user via the `-n` option. The normalisation codes that *clust* decides to
apply are stored in the output file `/Normalisation_actual`
* Version 1.6.0 and earlier: The required normalisation techniques should be stated by the user via the `-n` option.
Otherwise, no normalisation is applied.
#### The `-n` option:
Tell *clust* how to normalise your data in one of two ways:
1. `clust data_path -n code1 [code2 code3 ...] [...]` **(V1.7.0 and newer)**
    * List one or more normalisation codes (from the table below) to be applied to your one or more datasets
    * Example: `clust data_path -n 101 3 4 [...]`
2. `clust data_path -n normalisation_file [...]`
    * Provide a file listing the normalisation codes for each dataset (see Fig. 5).
    * Each line of the file includes these elements in order:
        1. The name of the dataset file (e.g. X0.txt)
        2. One or more normalisation codes. **The order** of these codes defines the order of the
           application of normalisation techniques.
    * Delimiters between these elements can be spaces, TABs, commas, or semicolons.
![NormalisationFile](Images/NormalisationFile.png)
*Figure 5: Normalisation file indicating the types of normalisation that should be applied to each of the datasets.*
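
To make the layout of a normalisation file concrete, the sketch below parses lines of the kind described above. It is an illustration only and does not reproduce *clust*'s own parser: the function name `parse_normalisation_lines` is hypothetical, the pairing of file names with codes in the example is invented, and the sketch assumes that dataset file names contain none of the delimiter characters.

```python
import re

# Any run of spaces, TABs, commas, or semicolons separates two elements.
DELIMITERS = re.compile(r"[ \t,;]+")


def parse_normalisation_lines(lines):
    """Map each dataset file name to its ordered list of normalisation codes.

    The codes are kept as strings, in the order given on the line, because
    that order defines the order in which the techniques are applied.
    """
    result = {}
    for line in lines:
        fields = [f for f in DELIMITERS.split(line.strip()) if f]
        if not fields:
            continue  # skip blank lines
        result[fields[0]] = fields[1:]
    return result


if __name__ == "__main__":
    # Hypothetical content mixing the allowed delimiters.
    example = [
        "X1.txt 101 3 4",
        "X2.txt\t101,4",
        "X3.txt;101;3;4",
    ]
    for name, codes in parse_normalisation_lines(example).items():
        print(name, codes)
```
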
#### Codes suggested for commonly used datasets
* RNA-seq TPM, FPKM, and RPKM data: **101 3 4**
* Log2 RNA-seq TPM, FPKM, and RPKM data: **101 4**
* One-colour microarray gene expression da