معرفی شرکت ها

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

توضیحات

Set of utilities on sequences and BAM files

ویژگی	مقدار
سیستم عامل	-
نام فایل	CMSeq-1.0.4
نام	CMSeq
نسخه کتابخانه	1.0.4
نگهدارنده	[]
ایمیل نگهدارنده	[]
نویسنده	Moreno Zolfo
ایمیل نویسنده	moreno.zolfo@unitn.it
آدرس صفحه اصلی	http://github.com/SegataLab/cmseq/
آدرس اینترنتی	https://pypi.org/project/CMSeq/
مجوز	LICENSE.txt

# CMSeq # CMSeq is a set of commands to provide an interface to .bam files for coverage and sequence consensus. **Requires:** * samtools (> 1.x) * numpy * pysam * pandas * Biopython with bcbio-gff module _(warning: Biopython <= 1.76 is required for `polymut.py`)_ ## Installation [![PyPi version](https://pypip.in/v/CMSeq/badge.png)](https://pypi.org/project/CMSeq/) [![Anaconda-Server Badge](https://anaconda.org/bioconda/cmseq/badges/version.svg)](https://anaconda.org/bioconda/cmseq) [![Anaconda-Server Badge](https://anaconda.org/bioconda/cmseq/badges/installer/conda.svg)](https://conda.anaconda.org/bioconda) ### Install via source and pip ### ``` git clone https://github.com/SegataLab/cmseq.git pip install . ``` ### Install via PyPi ### `pip install CMSeq` ### Install via Bioconda ### `conda install -c bioconda cmseq` ## Functions * [Breadth and Depth of Coverage](#breadth-and-depth-of-coverage-with-breadth_depthpy) * [Polymorphic Rate on CDS](#polymorphic-rate-over-protein-coding-genes-with-polymutpy) * [Polymorphic Rate on the whole genome](#polymorphic-rate-with-polypy) * [Reference free consensus](#reference-free-but-guided-consensus-with-consensuspy) **Note: CMSeq can be used [as python module](README_class.md) as well** ## Breadth and Depth of coverage with breadth_depth.py Provides breadth and depth of coverage the references of BAM alignment file, in tabular format. The file must be indexed and sorted (alternatively, --sortindex can be used). ``` usage: breadth_depth.py [-h] [-c REFERENCE ID] [-f] [--sortindex] [--minlen MINLEN] [--minqual MINQUAL] [--mincov MINCOV] [--truncate TRUNCATE] BAMFILE ``` ``` positional arguments: BAMFILE The file on which to operate optional arguments: -h, --help show this help message and exit -c REFERENCE ID, --contig REFERENCE ID Focus on a subset of references in the BAM file. Can be a list of references separated by commas or a FASTA file (the IDs are used to subset) -f If set unmapped (FUNMAP), secondary (FSECONDARY), qc- fail (FQCFAIL) and duplicate (FDUP) are excluded. If unset ALL reads are considered (bedtools genomecov style). Default: unset --sortindex Sort and index the file --minlen MINLEN Minimum Reference Length for a reference to be considered --minqual MINQUAL Minimum base quality. Bases with quality score lower than this will be discarded. This is performed BEFORE --mincov. Default: 30 --mincov MINCOV Minimum position coverage to perform the polymorphism calculation. Position with a lower depth of coverage will be discarded (i.e. considered as zero-coverage positions). This is calculated AFTER --minqual. Default: 1 --truncate TRUNCATE Number of nucleotides that are truncated at either contigs end before calculating coverage values. ``` Breadh and Depth of coverage outputs a table with the breadth of coverage, average and median depth-of-coverage of each reference. Values are calculated only on the covered portion of the reference: |contig|Breadth|Depth (avg)|Depth (median)| |------|-------|-----------|--------------| |EF401177.1.1491|0.101274312542|1.0|1.0| |EF405039.1.1494|0.101070950469|2.69536423841|3.0| |all_contigs|-|1.84768211921|1.0| The last line is a summary line calculated as if all the reads were coming from the same (big) contig. ### Examples: ### Extract breadth and depth of coverage for all the references within a sorted and indexed `BAM` file ``` breadth_depth.py mybam.sorted.bam ``` Extract breadth and depth of coverage for all the references within an unsorted `BAM` file ``` breadth_depth.py --sortindex mybam.sorted.bam ``` Extract breadth and depth of coverage for all the references within a sorted `BAM` file, count only the reads with minimum quality of 25 and positions with a minimum coverage of 10 ``` breadth_depth.py --mincov 10 --minqual 20 mybam.bam ``` Extract breadth and depth of coverage for the references: genome_1 and genome_2 within a sorted `BAM` file ``` breadth_depth.py -c genome_1,genome_2 mybam.bam ``` Extract breadth and depth of coverage for the references present in MYFASTA.fasta, within a sorted `BAM` file ``` breadth_depth.py -c MYFASTA.fasta mybam.sorted.bam ``` ## Polymorphic rate over protein-coding genes with polymut.py **Warning:** Biopython <= 1.76 is required for `polymut.py` This function calculates polymorphic site rates over protein coding genes. It considers dominant and second-dominant alleles over protein-coding genes on the nucleotide level, translates the ORFs into proteins and then calculates and outputs the number of synonymous and non-synonymous mutations (on the protein level) between the dominant and second-dominant protein sequences. Positions with a ratio between second-dominant and dominant allele coverage smaller than dominant_frq_thrsh are considered non-variant. This function was used in [Pasolli et al., 2019](https://pubmed.ncbi.nlm.nih.gov/30661755/) as an ad-hoc measure to calculate strain heterogeneity in metagenomes. Since the likelihood of finding more than one strain in the same gut varies strongly across gut commensals (as well as different within-species genetic diversity), this function does not allow a rigorous classification of metagenomes into strain-mixed and non-strain-mixed, but it can be shown that - considering polymorphic site rates over i.e. core genes of any given speices - samples with a higher polymorphic site rate are more likely to harbour more than one strain. Please supply a gff file from Prokka and make sure that the contig names between the bam file and the gff file can be matched. ``` usage: polymut.py [-h] [-c REFERENCE ID] [-f] [--sortindex] [--minlen MINLEN] [--minqual MINQUAL] [--mincov MINCOV] [--dominant_frq_thrsh DOMINANT_FRQ_THRSH] [--gff_file GFF_FILE] BAMFILE Reports the polymorpgic rate of each reference (polymorphic bases / total bases). Focuses only on covered regions (i.e. depth >= 1) positional arguments: BAMFILE The file on which to operate optional arguments: -h, --help show this help message and exit -c REFERENCE ID, --contig REFERENCE ID Focus on a subset of references in the BAM file. Can be a list of references separated by commas or a FASTA file (the IDs are used to subset) -f If set unmapped (FUNMAP), secondary (FSECONDARY), qc- fail (FQCFAIL) and duplicate (FDUP) are excluded. If unset ALL reads are considered (bedtools genomecov style). Default: unset --sortindex Sort and index the file --minlen MINLEN Minimum Reference Length for a reference to be considered. Default: 0 --minqual MINQUAL Minimum base quality. Bases with quality score lower than this will be discarded. This is performed BEFORE --mincov. Default: 30 --mincov MINCOV Minimum position coverage to perform the polymorphism calculation. Position with a lower depth of coverage will be discarded (i.e. considered as zero-coverage positions). This is calculated AFTER --minqual. Default:1 --dominant_frq_thrsh DOMINANT_FRQ_THRSH Cutoff for degree of `allele dominance` for a position to be considered polymorphic. Default: 0.8 --gff_file GFF_FILE GFF file used to extract protein-coding genes ``` The functions prints three values: * the total number of non-synonymous mutations * the total number of synonymous mutations * the total number of considered positions (total number of positions covered higher than the parameter specified with --mincov) Please note that this function is meant to be used on multi-contig genomes, so **polymut.py reports the sum of non-synonimous and synonimous positions** for all the contigs considered. If you specify a list of contigs with `-c`, only those will be considered. ### Examples ### Calculate the number of non-synonymous, synonymous and the total number of considered positions (on the nucleotide level!) over your contig of interest. ``` python polymut.py -c "contig_of_interest" bam_of_interest.bam --mincov 10 --minqual 30 --dominant_frq_thrsh 0.8 --gff_file gff_from_prokka.gff ``` ## Polymorphic Rate with poly.py Provides the Polymorphic-rate of each reference in a sorted and indexed BAMFILE. The polymorphic rate is defined as: number_of_polymorhpic_sites / number_of_total_nucleotides. Beware that *number_of_total_nucleotides* depends on --minqual and --mincov, as if a position is not covered (e.g. coverage = 0) will not be counted in the denominator. ``` usage: poly.py [-h] [-c REFERENCE ID] [-f] [--sortindex] [--minlen MINLEN] [--minqual MINQUAL] [--mincov MINCOV] [--pvalue PVALUE] [--seq_err SEQ_ERR] [--dominant_frq_thrsh DOMINANT_FRQ_THRSH] BAMFILE Reports the polymorpgic rate of each reference (polymorphic bases / total bases). Focuses only on covered regions (i.e. depth >= 1) positional arguments: BAMFILE The file on which to operate optional arguments: -h, --help show this help message and exit -c REFERENCE ID, --contig REFERENCE ID Focus on a subset of references in the BAM file. Can be a list of references separated by commas or a FASTA file (the IDs are used to subset) -f If set unmapped (FUNMAP), secondary (FSECONDARY), qc- fail (FQCFAIL) and duplicate (FDUP) are excluded. If unset ALL reads are considered (bedtools genomecov style). Default: unset --sortindex Sort and index the file --minlen MINLEN Minimum Reference Length for a reference to be considered. Default: 0 --minqual MINQUAL Minimum base quality. Bases with quality score lower than this will be discarded. This is performed BEFORE --mincov. Default: 30 --mincov MINCOV Minimum position coverage to perform the polymorphism calculation. Position with a lower depth of coverage will be discarded (i.e. considered as zero-coverage positions). This is calculated AFTER --minqual. Default:1 --pvalue PVALUE Binomial p-value threshold for the binomal-polymorphic test. Default: 0.01 --seq_err SEQ_ERR Sequencing error rate. Default: 0.001 --dominant_frq_thrsh DOMINANT_FRQ_THRSH Cutoff for degree of `allele dominance` for a position to be considered polymorphic. Default: 0.8 ``` The output is strucutred as follows: ``` |referenceID|dominant_allele_distr_mean|dominant_allele_distr_perc_10|...|dominant_allele_distr_sd|tot_covered_bases|tot_polymorphic_bases|polymorphic_rate| |----|----|----|----|----|----|----|----|----| |EF401177.1.1491|-|-...|-|151.00|0.00|0.00| |EF405039.1.1494|0.65|0.67|...|0.04|151.00|13.00|0.09| |-GENOME-|0.65|0.67|...|0.04|302.00|13.00|0.04| ``` As for ``breadh_depth.py``, also the polymorphic rate analyisis is subjected to ``mincov``, ``minqual``, and ``minlen``. Additionally, two parameters can be set to decide when a site is polymorphic: * ``dominant_frq_thrsh`` is a percentage: if the majoritary allele frequency at position x is greater than the threshold, x is considered non-polymorphic. Otherwise, a binomial test is performed to assure that x is polymorpfic (polymorphic if p < ``pvalue``) ### Examples ### Extract polymorphic rate from a sorted and indexed bam file ``` poly.py mybam.sorted.bam ``` Extract polymorphic rate from an unsorted bam file ``` poly.py --sortindex mybam.sorted.bam ``` Extract polymorphic rate from an unsorted bam file, counting only bases with minimum quality of 30 and minimum position-coverage of 10 ``` poly.py --sortindex --mincov 10 --minqual 30 mybam.unsorted.bam ``` Extract polymorphic rate from an unsorted bam file, only for reads aligning against genome_1 or genome_2. Consider polymorphic only sites with majoritary-allele-freq < 70% ``` poly.py --sortindex -c genome_1,genome_2 --dominant_frq_thrsh 0.7 mybam.unsorted.bam ``` ## Reference Free (but guided) consensus with consensus.py Provides the Reference Free consensus for the references in a BAM alignment file, reconstructing the sequence from the raw reads, in FASTA format to standard output. The file must be indexed and sorted (alternatively, --sortindex can be used). Note that the length of the reconstructed sequence is bound to the original length of the reference. On that length, not all the positions may be covered. This can happen because: * there are no reads mapping to the position * there are too few reads (*i.e < ``mincov``*) mapping to the position * the reads that map to the position have a low quality (*i.e. < ``minqual``*) * the distribution of nucleotides at that position is potentially problematic (*i.e. dominant_allele_frequency < ``dominant_frq_thrsh``*): in this case, the position is excluded to reduce noise. ``` usage: consensus.py [-h] [-c REFERENCE ID] [-f] [--sortindex] [--minqual MINQUAL] [--mincov MINCOV] [--dominant_frq_thrsh DOMINANT_FRQ_THRSH] [--minlen MINLEN] [--trim TRIM] BAMFILE outputs the consensus in FASTA format. Non covered positions (or quality- trimmed positions) are reported as a dashes: - positional arguments: BAMFILE The file on which to operate optional arguments: -h, --help show this help message and exit -c REFERENCE ID, --contig REFERENCE ID Focus on a subset of references in the BAM file. Can be a list of references separated by commas or a FASTA file (the IDs are used to subset) -f If set unmapped (FUNMAP), secondary (FSECONDARY), qc- fail (FQCFAIL) and duplicate (FDUP) are excluded. If unset ALL reads are considered (bedtools genomecov style). Default: unset --sortindex Sort and index the file --minqual MINQUAL Minimum base quality. Bases with quality score lower than this will be discarded. This is performed BEFORE --mincov. Default: 30 --mincov MINCOV Minimum position coverage to perform the polymorphism calculation. Position with a lower depth of coverage will be discarded (i.e. considered as zero-coverage positions). This is calculated AFTER --minqual. Default: 0 --dominant_frq_thrsh DOMINANT_FRQ_THRSH Cutoff for degree of `allele dominance` for a position to be considered polymorphic. Default: 0.8 --minlen MINLEN Minimum Reference Length for a reference to be considered. Default: 0 --trim TRIM Trim the reads before computing the consensus. A value of 10:10 means that the first and last 10 positions of each read will be ignored. Default: None ``` Note that positions with a majoritary allele frequency lower than dominant_frq_thrsh will be considered "problematic" and substituted with a "-", even with sufficient coverage and quality. ``` consensus.py ~/tmp.bam.sorted -c EF401177.1.1491,EF405039.1.1494 --mincov 1 --dominant_frq_thrsh 0.5 >EF401177.1.1491_consensus ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ -----------------------------------TACGTAGGGGGCAAGCGTTATCCGG ATTTACTGGGTGTAAAGGGAGCGTAGACGGCGAGACAAGTCTGAAGTGAAAGCCCGGGGC TCAACCCCGGGACTGCTTTGGAAACTGCCTTGCTAGAGTGCTGGAGAGGTAAGTGGAATT CCTAGT------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ --------------------------------------------------- >EF405039.1.1494_consensus ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ -------------------------------------TACGTAGGTGGCAAGCGTTATCC GGATTTACTGGGTGTAAAGGGCGTGCAGCCGGGTCTGCAAGTCAGATGTGAAATCCATGG GCTCAACCCATGAACTGCATTTGAAACTGTAGATCTTGAGTGTCGGAGGGGCAATCGGAA TTCCTAGT---------------------------------------------------- ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------ ``` ### Examples ### Extract the consensus from all the references from a sorted and indexed BAM file, in FASTA format: ``` consensus.py mybam.sorted.bam ``` Extract the consensus from all the references from an unsorted BAM file, in FASTA format: ``` consensus.py --sortindex mybam.sorted.bam ``` Extract the consensus of genome_1 and genome_2 from a BAM file. Positions with coverage lower than 5 are ignored (- is reported instead of base-call): ``` consensus.py --mincov 5 -c genome_1,genome_2 mybam.sorted.bam ``` Extract the consensus of genome_1 and genome_2 from a BAM file. Positions with coverage lower than 5 "high quality" bases are ignored (- is reported instead of base-call). Additionally, positions with less than 50% majoritary-letters will be substituted by a "-": ``` consensus.py --mincov 5 --minqual 30 -c genome_1,genome_2 --dominant_frq_thrsh 0.5 mybam.sorted.bam ``` Same as above, but a FASTA file is used to filter references instead: ``` consensus.py --mincov 5 --minqual 30 -c FILTER_FASTA.fasta --dominant_frq_thrsh 0.5 mybam.sorted.bam ``` ### Ancient DNA consensus Extract the consensus of genome from a BAM file, in the scenario of ancient metagenomics study. Positions with coverage lower than 5 and damage probability (Stats_out_MCMC_correct_prob.csv from mapDamage2) higher than 0.95 are ignored. ``` consensus_aDNA.py --mincov 5 -r reference.fna --pos_specific_prob_tab Stats_out_MCMC_correct_prob.csv --pos_damage_prob_thrsh 0.95 mybam.sorted.bam ```

نیازمندی

مقدار	نام
-	numpy
-	scipy
-	pysam
-	pandas
-	biopython
-	bcbio-gff

نحوه نصب

نصب پکیج whl CMSeq-1.0.4:

pip install CMSeq-1.0.4.whl

نصب پکیج tar.gz CMSeq-1.0.4:

pip install CMSeq-1.0.4.tar.gz