# refineC
[![GitHub release (latest by date including pre-releases)](https://img.shields.io/github/v/release/genomewalker/refine-contigs?include_prereleases&label=version)](https://github.com/genomewalker/refine-contigs/releases) [![refine-contigs](https://github.com/genomewalker/refine-contigs/workflows/refineC_ci/badge.svg)](https://github.com/genomewalker/refine-contigs/actions) [![PyPI](https://img.shields.io/pypi/v/refine-contigs)](https://pypi.org/project/refine-contigs/) [![Conda](https://img.shields.io/conda/v/genomewalker/refine-contigs)](https://anaconda.org/genomewalker/refine-contigs)
refineC is a simple tool to identify potential misassemblies in contigs recovered from ancient metagenomes. Assembling ancient metagenomic data is challenging because the reads are short and carry post-mortem damage. The short read length pushes assemblers to their limits, and the recovered contigs might contain misassemblies that can, for example, lead to misleading ecological insights, spurious pangenomic analyses, erroneous predictions of functional potential, or a compromised binning process in which distantly related or unrelated phylogenetic marker genes are mixed. When assembling ancient metagenomic data, we can face different problems:
- **Inter-genomic mosaic**: Chimeric contigs containing a mixture of sequence from multiple organisms
- **Intra-genomic mosaic**: Chimeric contigs mixing different genomic regions from the same organism
- **Temporal mosaic**: Chimeric contigs containing a mixture of sequence from different times (and organisms)
At the moment, refineC can mitigate the effects of the first and second types of problems by exploiting the information from de Bruijn graph assemblers (MEGAHIT/SPAdes) and overlap-based assemblers (PLASS/PenguiN). While the de Bruijn graph assemblers produce longer contigs, the overlap-based assemblers recover more of the sequence space present in the ancient sample. In any case, both types of assemblers end up generating misassembled contigs, especially when we fine-tune them to recover as much as possible.
refineC follows a simple approach to identify the misassemblies:
- Perform an all-vs-all contig comparison
- Identify groups of contigs that share a certain amount of sequence identity and coverage
- Find overlapping regions for each contig supported by other contigs and extract the longest one. Keep the leftover parts of the contigs if they are longer than a certain threshold
- Remove redundancy of the overlapping fragments by sequence clustering
- Add the new contig set to the original set of contigs without hits in the all-vs-all comparison
- Remove redundancy by sequence clustering
![assets/images/refineC-wf.png](assets/images/refineC-wf.png#center)
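The fragment-extraction step in the list above can be illustrated with a small interval sketch. This is only one possible reading of that step, not refineC's actual code: the function names, the half-open coordinate convention, and the choice to merge overlapping hit intervals before picking the longest region are all assumptions made for illustration. The 500 bp threshold mirrors the documented `--frag-min-len` default.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # half-open [start, end) coordinates on a contig (assumed convention)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into maximal supported regions."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous region when the new interval overlaps or touches it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def split_contig(
    length: int, hits: List[Interval], frag_min_len: int = 500
) -> Tuple[Optional[Interval], List[Interval]]:
    """Return the longest supported region and the leftover flanks worth keeping.

    `hits` are regions of this contig covered by other contigs (hypothetical input).
    Leftover flanks shorter than `frag_min_len` are discarded.
    """
    regions = merge_intervals(hits)
    if not regions:
        # No support from other contigs: the whole contig is the only candidate.
        whole = [(0, length)] if length >= frag_min_len else []
        return None, whole
    longest = max(regions, key=lambda r: r[1] - r[0])
    flanks = [(0, longest[0]), (longest[1], length)]
    kept = [(s, e) for s, e in flanks if e - s >= frag_min_len]
    return longest, kept


# A 3000 bp contig with three hypothetical hits from other contigs.
region, leftovers = split_contig(3000, [(100, 900), (800, 1500), (2000, 2300)])
print(region, leftovers)
```

In this toy case the two overlapping hits merge into the region 100 to 1500, which is extracted as the longest supported fragment. The 100 bp left flank is dropped, and the 1500 bp right flank is kept because it exceeds the threshold.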
In addition, `refineC` has a **merge** option that tries to find contigs that might be merged using Minimus2. On some occasions, there are overlaps between contigs that are very well supported by other contigs in the sample, but refineC cannot collapse them. This usually happens when the terminal regions of contigs overlap other terminal regions. For this reason, although it is not usually recommended, we use Minimus2 to merge overlapping contigs. We apply a conservative approach to the already refined contigs: we find overlaps in the same manner as in the `split` subcommand, but in this case we select the maximum clique in each component to be merged by Minimus2.
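To make the "maximum clique in a component" idea concrete, here is a minimal, dependency-free sketch. It assumes that overlaps have already been reduced to pairs of contig names; the function names and the toy edge list are hypothetical, and the plain Bron-Kerbosch search used here is a stand-in for whatever refineC does internally.

```python
from typing import Dict, Iterable, List, Set, Tuple

Graph = Dict[str, Set[str]]


def build_graph(edges: Iterable[Tuple[str, str]]) -> Graph:
    """Build an undirected overlap graph from pairs of contig names."""
    graph: Graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def connected_components(graph: Graph) -> List[Set[str]]:
    """Group contigs that are linked by any chain of overlaps."""
    seen: Set[str] = set()
    components: List[Set[str]] = []
    for node in graph:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            cur = stack.pop()
            if cur in comp:
                continue
            comp.add(cur)
            stack.extend(graph[cur] - comp)
        seen |= comp
        components.append(comp)
    return components


def maximum_clique(graph: Graph, nodes: Set[str]) -> Set[str]:
    """Largest set of mutually overlapping contigs (Bron-Kerbosch, no pivoting).

    Exponential in the worst case, so only suitable for small components.
    """
    best: Set[str] = set()

    def expand(r: Set[str], p: Set[str], x: Set[str]) -> None:
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = set(r)
            return
        for v in list(p):
            nbrs = graph[v] & nodes
            expand(r | {v}, p & nbrs, x & nbrs)
            p = p - {v}
            x = x | {v}

    expand(set(), set(nodes), set())
    return best


# Hypothetical overlaps between refined contigs.
edges = [("c1", "c2"), ("c1", "c3"), ("c2", "c3"), ("c3", "c4"), ("c5", "c6")]
graph = build_graph(edges)
to_merge = [sorted(maximum_clique(graph, comp)) for comp in connected_components(graph)]
print(to_merge)
```

Restricting the merge to a clique is what makes the approach conservative: in the toy graph, `c4` overlaps only `c3`, so it is left out of the group that would be handed to Minimus2.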
# Installation
We recommend having [**conda**](https://docs.conda.io/en/latest/) installed to manage the virtual environments.
### Using pip
First, we create a conda virtual environment with:
```bash
wget https://raw.githubusercontent.com/genomewalker/refine-contigs/master/environment.yml
conda env create -f environment.yml
```
Then we proceed to install using pip:
```bash
pip install refine-contigs
```
### Using conda
```bash
conda install -c conda-forge -c bioconda -c genomewalker refine-contigs
```
### Install from source to use the development version
Using pip:
```bash
pip install git+ssh://git@github.com/genomewalker/refine-contigs.git
```
By cloning in a dedicated conda environment:
```bash
git clone git@github.com:genomewalker/refine-contigs.git
cd refine-contigs
conda env create -f environment.yml
conda activate refine-contigs
pip install -e .
```
# Usage
refineC only needs a contig file. For a complete list of options:
```
$ refineC --help
usage: refineC [-h] [--version] [--debug] {split,merge} ...

Finds misassemblies in ancient data

positional arguments:
  {split,merge}  positional arguments
    split        Find misassemblies
    merge        Merge potential overlaps

optional arguments:
  -h, --help     show this help message and exit
  --version      Print program version
  --debug        Print debug messages (default: False)
```
For the `split` mode:
```
$ refineC split --help
usage: refineC split [-h] [--tmp DIR] [--threads INT] [--keep-files]
                     [--output OUT] [--prefix PREFIX] --contigs FILE
                     [--min-id FLOAT] [--min-cov FLOAT] [--glob-cls-id FLOAT]
                     [--glob-cls-cov FLOAT] [--frag-min-len INT]
                     [--frag-cls-id FLOAT] [--frag-cls-cov FLOAT]

optional arguments:
  -h, --help            show this help message and exit
  --tmp DIR             Temporary directory (default:./tmp)
  --threads INT         Number of threads (default: 2)
  --keep-files          Keep temporary data (default: False)
  --output OUT          Fasta file name to save the merged contigs (default:
                        contigs)
  --prefix PREFIX       Prefix for contigs name (default: contig)

required arguments:
  --contigs FILE        Contig file to check for misassemblies

overlap identification arguments:
  --min-id FLOAT        Minimun id to use for the overlap (default: 0.9)
  --min-cov FLOAT       Minimun percentage of the coverage for the overlap
                        (default: 0.25)

global clustering arguments:
  --glob-cls-id FLOAT   Minimum identity to cluster the refined dataset
                        (default: 0.99)
  --glob-cls-cov FLOAT  Minimum coverage to cluster the refined dataset
                        (default: 0.9)

fragment refinement arguments:
  --frag-min-len INT    Minimum fragment length to keep (default: 500)
  --frag-cls-id FLOAT   Minimum identity to cluster the fragments (default:
                        0.9)
  --frag-cls-cov FLOAT  Minimum coverage to cluster the fragments (default:
                        0.6)
```
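As an illustration of how the options above fit together, the following sketch assembles a `refineC split` invocation from Python. It uses only flags that appear in the help text, with their documented defaults. The helper name `build_split_command` and the input file name `my_contigs.fasta` are hypothetical, and the command is only printed here rather than executed.

```python
import shlex
from typing import List


def build_split_command(
    contigs: str,
    output: str = "contigs",
    prefix: str = "contig",
    threads: int = 2,
    min_id: float = 0.9,
    min_cov: float = 0.25,
    frag_min_len: int = 500,
) -> List[str]:
    """Assemble the argument list for `refineC split` using documented flags."""
    return [
        "refineC", "split",
        "--contigs", contigs,
        "--output", output,
        "--prefix", prefix,
        "--threads", str(threads),
        "--min-id", str(min_id),
        "--min-cov", str(min_cov),
        "--frag-min-len", str(frag_min_len),
    ]


# `my_contigs.fasta` is a placeholder for your own assembly.
cmd = build_split_command("my_contigs.fasta", threads=8)
print(shlex.join(cmd))

# To actually run it (requires refineC to be installed and on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.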
And for the `merge` mode:
```
$ refineC merge --help
usage: refineC merge [-h] [--tmp DIR] [--threads INT] [--keep-files]
                     [--output OUT] [--prefix PREFIX] --contigs FILE
                     [--min-id FLOAT] [--min-cov FLOAT] [--glob-cls-id FLOAT]
                     [--glob-cls-cov FLOAT] [--mnm2-threads INT]
                     [--mnm2-overlap INT] [--mnm2-minid FLOAT]
                     [--mnm2-maxtrim INT] [--mnm2-conserr FLOAT]

optional arguments:
  -h, --help            show this help message and exit
  --tmp DIR             Temporary directory (default:./tmp)
  --threads INT         Number of threads (default: 2)
  --keep-files          Keep temporary data (default: False)
  --output OUT          Fasta file name to save the merged contigs (default:
                        contigs)
  --prefix PREFIX       Prefix for contigs name (default: contig)

required arguments:
  --contigs FILE        Contig file to check for misassemblies

overlap identification arguments:
  --min-id FLOAT        Minimun id to use for the overlap (default: 0.9)
  --min-cov FLOAT       Minimun percentage of the coverage for the overlap
                        (default: 0.25)

global clustering arguments:
  --glob-cls-id FLOAT   Minimum identity to cluster the refined dataset
                        (default: 0.99)
  --glob-cls-cov FLOAT  Minimum coverage to cluster the refined dataset
                        (def
```