معرفی شرکت ها

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

توضیحات

Annotation of Genomic and Transcriptomic Intervals using GTF/GFF files.

ویژگی	مقدار
سیستم عامل	-
نام فایل	AGouTI-1.0.3
نام	AGouTI
نسخه کتابخانه	1.0.3
نگهدارنده	[]
ایمیل نگهدارنده	[]
نویسنده	Jan Kosinski
ایمیل نویسنده	jankos@amu.edu.pl
آدرس صفحه اصلی	https://github.com/zywicki-lab/agouti
آدرس اینترنتی	https://pypi.org/project/AGouTI/
مجوز	GNU General Public License

# AGouTI - Annotation of Genomic and Transcriptomic Intervals ## Introduction High-throughput sequencing techniques have become very popular in molecular biology research. In many cases, obtained results are described by positions corresponding to the transcript, gene, or genome. Usually, to infer the biological function of such regions, it is necessary to annotate these regions with overlapping known genomic features, such as genes, transcripts, exons, UTRs, CDSs, etc. AGouTI is a tool designed to annotate any genomic or transcriptomic coordinates using known genome annotation data in GTF or GFF files. #### Main features 1. AGouTI works with coordinates describing positions within the genome and within the transcripts, 2. Ability to assign intragenic regions from provided GTF/GFF annotation (UTRs, CDS, etc.) or de novo (5’ part, middle, 3’ part, whole), 3. Annotation of intervals in standard BED or custom column-based text files (TSV, CSV, etc.) in any non-standard format, 4. Flexible handling of multiple annotations for a single region, 5. Flexible selection of GTF/GFF attributes to include in the annotation. <hr> ## Installation Python >= 3.7 is required to run this software. You can install AGouTI using `pip` as follows (recommended) `pip install AGouTI` or by `python setup.py install` #### Having troubles with an older Python version? You can easily manage python versions using `conda` and the concept of virtual environments. `Anaconda` can be downloaded from https://www.anaconda.com/distribution/#download-section (Python 3.x version) To create a virtual environment with a specified Python version, you can type ` conda create --name your_virtualenv_name python=3.8 ` Afterward, you need to activate your virtual environment. ` conda activate your_virtualenv_name ` Now you can install AGouTI using `pip` or `conda` and perform your analysis. After your job is finished, you can leave your virtual environment. ` conda deactivate ` As an alternative to conda you can use Python's `venv` <hr> ## Run AGouTI You can now access AGouTI as follows: ` agouti --help ` Running AGouTI is a two-step process. First, you need to create a dedicated database based on your annotation file (GTF/GFF3) using the `create_db` module and then annotate your intervals with the features of interest using `annotate`. ` agouti create_db --help ` ` agouti annotate --help ` ### Step 1. Create the database We have decided to rely on the SQLite database to efficiently store and access annotation data. Thus, the first step of our annotation pipeline is to create such a database using information included in GTF or GFF3 files. All feature names and attributes from the GTF/GFF3 file are automatically converted to lowercase to uniform the feature selection during the annotation step. By default, the initial database is created in RAM. Then it is inspected to provide the user with a list of features and attributes available in the GTF/GFF file and visualize the hierarchy of those features in a graph-based manner. The inspection process is efficient while the initial database is stored in memory (by default), but for low-memory machines, it can also be stored on the hard drive (an SSD drive would be recommended for speed). Memory consumption depends on the number of features and attributes that must be stored. For example, estimated memory use for Gencode annotation files of the human genome is up to 5GB. Finally, the database is written on a hard drive and can be used for annotation. All those steps are done automatically using create_db mode. Example invocations: ` agouti create_db -f GTF -a gtf_of_your_choice.gtf -d database_name ` or ` agouti create_db -f GFF3 -a gff3_of_your_choice.gff3 -d database_name ` ##### Required options -a, --annotation : Input file containing gene annotations in GTF or GFF3 format. -f, --format : Input file format (GTF or GFF3), can be gzip-compressed. -d, --db : Name for the output database. ##### Additional options -l, --low-ram : Creates the database as a sqlite3 file directly on your disk. By default, the initial database is created in RAM to quickly inspect contents and relations between features and afterward saved on a local drive. Using this option may significantly slow down database creation. Use only when your RAM size is limited in comparison to the expected size of your database. -i, --infer_genes : Infer genes. Use only with GTF files that do not contain separate lines describing genes. This step might be very time-consuming. -j, --infer_transcripts : Infer transcripts. Use only with GTF files that do not contain separate lines describing transcripts. This step might be very time-consuming. ##### Output The output files include: 1. database_name - the SQLite database file 2. database_name.relations - text file storing relations between feature types. This file is required for annotation with AGouTI. Therefore it must be stored in the same directory as the database file. 3. database_name.attributes_and_features.pickle - python dictionary stored as a pickle file. This file is required for annotation with AGouTI. Therefore it must be stored in the same directory as the database file. 4. database_name.database.structure.txt - additional text file listing all the features and attributes present in the GTF/GFF3 file and showing relations between them in a tree-like structure The output consists of a tree-based structure representing the hierarchy of the features in the GTF or GFF3 file and a list of available attributes for each feature type. That information is by default displayed on stdout. Having insights into the database contents, users can choose only a subset of features and attributes to annotate the dataset of interest. ### Step 2. Annotate your file After creating the database, you can run annotation with AGouTI using the `agouti` annotate command. For example, AGouTI can annotate intervals stored in standard BED or any column-based text file containing information about genomic or transcriptomic coordinates (see the `--custom` option). The transcriptomic mode enables the annotation of intervals encoded as positions within the transcripts instead of chromosomes. However, it is instrumental in annotating results from transcript-focused analyses (e.g., RBPs binding sites, identification of RNA structural motifs, etc.), it is essential to use the same source and version of annotation files as during the generation of the intervals submitted for annotation. The transcript IDs often change between the annotation releases as the transcript layout is improved (e.g., ENST00000613119.1 -> ENST00000613119.2 -> ENST00000613119.3, etc.). Since the transcript IDs are part of the coordinate system in transcriptomic mode, any difference will result in an annotation error. Basic command: `agouti annotate -i input.bed -d database_name` ##### Required Options -i, --input : Input file in BED or another column-based format (see --custom). -d, --database : Database created by agouti create_db. ##### Additional options -m, --custom : Specify that the input text file is in custom format, besides BED. It should contain columns with information about feature id (id), chromosome (chr), start (s), and end (e) coordinates. Users can optionally specify a column with strand information (strand); otherwise, AGouTI will set it to '.'. Format should be specified as column indexes (starting from 1), in the following order: "id,chr,s,e,strand" or "id,chr,s,e", e.g. --custom 1,2,4,5,6. The field separator used in your file can be provided using the --separator option. -p, --separator : Field separator for the --custom option. Default is tabulator. -b, --first_base_num' : The first base number in the input file (BED/CUSTOM). Either 0 (0-based coordinates) or 1 (1-based coordinates). Default is 0 (standard for genomic BED files). -n, --header_lines : The number of header lines. 0 by default. If a single header line is present, set this parameter to 1, etc. -t, --transcriptomic : Transcriptomic annotation mode. In this mode, transcript IDs from the GTF/GFF3 are expected to be placed in the first column of provided BED file instead of chromosome names. Coordinates in this mode are assumed to reflect positions within the transcript. Optional. -f, --select_features : Comma-separated list of feature names to be reported, e.g., "mRNA,CDS". Refer to [db_name].database.structure.txt file written during the database creation for a list of valid features. By default, all features are reported. -a, --select_attributes : Comma-separated list of attribute names to be reported, e.g., "ID,description". Refer to [db_name].database.structure.txt file written during the database creation for a list of valid attributes. By default, all attributes are reported. -c, --combine : List of specific feature-attribute combinations to be reported. Desired combination should be specified in the format: 'feature1-attribute1:attribute2,feature2-attribute1', e.g., "mRNA:description,biotype" for each mRNA, will provide annotation of mRNA description and mRNA biotype. -s, --strand_specific : Strand-specific search. -w, --completly_within : The feature must lie entirely within the GTF/GFF3 feature to be reported. -l, --level : Group results on a specific level (e.g., 1 for gene level, 2 for mRNA, tRNA). Must be one of [1,2]. An annotation may be done on gene or transcript levels (level 1 or 2) so that each output line will correspond to the gene or transcript. Please note that --level 1 cannot be combined with --transcriptomic. Default is 2. -r, --annotate_feature_region : Report region within the GTF or GFF3 feature, which overlaps with an entry from the input file. Designed to work with `--transcriptomic`. Possible values: 1. `5 prime` - when the annotated feature starts within the first quarter of gene or transcript and ends in the first half 2. `middle` - starts and ends within the second and third quarter 3. `3 prime` - starts within the third quarter and ends in the last one 4. `whole` - starts within the first quarter and ends within the last one. The length of the annotated feature does not exceed 90% of transcript or gene length. 5. `full` - starts within the first quarter and ends within the last one. The length of the annotated feature does exceed 90% of transcript or gene length. --statistics: Calculate additional statistics. Those will be displayed at the end of the software\'s output (starting with #). --stats_only: Display statistics only. ##### Output Output is by default displayed on stdout in the form of a self-explanatory `.tsv` table. <hr> ## Test case 1 All files used in this case are available at https://github.com/zywicki-lab/agouti (`agouti/agouti_pkg/sample_data.tar.gz`). ### Annotate human single nucleotide polymorphisms (SNPs) stored in the BED file (`common_snp.bed`). We've downloaded BED file from the UCSC Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables) using the following filters: Clade: Mammal Genome: Human Assembly: Dec. 2013 (GRCh38/hg38) Group: Variation Track: Common SNPs(151) Table: snp151Common Position: chrX Output format: BED - browser extensible data and saved only the first 1000 SNPs in `common_snp.bed`. Furthermore, we've downloaded gene annotations (only the X chromosome, genome version GRCh38.p13) in the GFF3 file format from the Ensembl database (https://www.ensembl.org) and subtracted to contain only the first 18178 lines. To run AGouTI, the User needs to create the database based on the contents of the GFF3 file. It can be done using the command `agouti create_db -f GFF3 -a Homo_sapiens.GRCh38.105.chromosome.X.gff3.gz -d Homo_sapiens.GRCh38.105.chromosome.X.db` After the job completes, the User can explore the structure and contents of the database and GFF3 file by examining the `Homo_sapiens.GRCh38.105.chromosome.X.db.database.structure.txt`. Let’s say that we are not interested in pseudogenes, unconfirmed_transcripts, and non-coding RNAs. To annotate SNPs using our database and calculate additional statistics (`--statistics`), we can type `agouti annotate -d Homo_sapiens.GRCh38.105.chromosome.X.db -i common_snp.bed -f gene,lnc_rna,exon,mrna,five_prime_utr,three_prime_utr,cds --statistics > annotated_snp.tsv` You can examine the results stored in the `annotated_snp.tsv` Please note! Some SNPs lie in the intergenic regions, and the closest gene upstream/downstream is marked as `None`, because we operate near the chromosome border, and no gene is annotated upstream or downstream from such SNPs. To discard SNPs located in the intergenic regions, we can use `grep`, `awk`, or similar tools, e.g. `grep -v intergenic annotated_snp.tsv > annotated_snp_intergenic_discarded.tsv` You can also make annotations on gene instead of transcript level (option `-l`) `agouti annotate -d Homo_sapiens.GRCh38.105.chromosome.X.db -i common_snp.bed -f gene,lnc_rna,exon,mrna,five_prime_utr,three_prime_utr,cds -l 1 --statistics | grep -v intergenic > annotated_snp_l1.tsv` You can explore the differences by examining the `annotated_snp_l1.tsv` file. <hr> ## Test case 2 All files used in this case are available at https://github.com/zywicki-lab/agouti (`agouti/agouti_pkg/sample_data.tar.gz`). ### Annotate sample results obtained with the missRNA software (`missRNA.tsv`). Records represent coordinates of the small RNA molecules excised from the longer RNA molecules. We've downloaded human gene annotations (only the 1st chromosome, genome version GRCh38.p13) in the GFF3 file format from the Ensembl database (https://www.ensembl.org) and subtracted to contain only the first 68571 lines. To create the databasae, run: `agouti create_db -f GFF3 -a Homo_sapiens.GRCh38.105.chromosome.1.gff3.gz -d Homo_sapiens.GRCh38.105.chr1.db` To annotate the file in format other than standard BED using transcriptomic mode, use: `agouti annotate -i missRNA.tsv -d Homo_sapiens.GRCh38.105.chr1.db -t -r -m 2,1,4,5,6 > missRNA-results.tsv` The self-explanatory output file is stored as `missRNA-results.tsv` in the same directory as sample datasets. <hr> ### Contribute If you notice any errors and mistakes or would like to suggest some new features, please use Github's issue tracking system to report them. You are also welcome to send a pull request with your corrections and suggestions. ### License This project is licensed under the GNU General Public License v3.0 license terms. Anytree (Copyright (c) 2016 c0fec0de) and gffutils (Copyright (c) 2013 Ryan Dale) packages are distributed with this software to ensure full compatibility. Documentation, authors, license and additional information can be found at https://anytree.readthedocs.io/en/latest/ and https://github.com/daler/gffutils.

نیازمندی

مقدار	نام
-	numpy
-	pandas

زبان مورد نیاز

مقدار	نام
>=3.7.0	Python

نحوه نصب

نصب پکیج whl AGouTI-1.0.3:

pip install AGouTI-1.0.3.whl

نصب پکیج tar.gz AGouTI-1.0.3:

pip install AGouTI-1.0.3.tar.gz