معرفی شرکت ها

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

توضیحات

Perform homolog detectability

ویژگی	مقدار
سیستم عامل	-
نام فایل	absense-0.1.6
نام	absense
نسخه کتابخانه	0.1.6
نگهدارنده	[]
ایمیل نگهدارنده	['Troy Comi <tcomi@princeton.edu>']
نویسنده	-
ایمیل نویسنده	Troy Comi <tcomi@princeton.edu>
آدرس صفحه اصلی	-
آدرس اینترنتی	https://pypi.org/project/absense/
مجوز	-

# abSENSE: a method to interpret undetected homologs [![Try Now!][pyodide-badge]][pyodide-link] [![Actions Status][actions-badge]][actions-link] [![Documentation Status][rtd-badge]][rtd-link] [![PyPI version][pypi-version]][pypi-link] [![PyPI platforms][pypi-platforms]][pypi-link] [pyodide-badge]: https://img.shields.io/badge/Pyodide-Try%20It!-blue.svg [pyodide-link]: https://potential-waffle-8b9309b7.pages.github.io/ [actions-badge]: https://github.com/PrincetonUniversity/abSENSE/workflows/CI/badge.svg [actions-link]: https://github.com/PrincetonUniversity/abSENSE/actions [pypi-link]: https://pypi.org/project/abSENSE/ [pypi-platforms]: https://img.shields.io/pypi/pyversions/abSENSE [pypi-version]: https://badge.fury.io/py/abSENSE.svg [rtd-badge]: https://readthedocs.org/projects/abSENSE/badge/?version=latest [rtd-link]: https://abSENSE.readthedocs.io/en/latest/?badge=latest ## INTRODUCTION abSENSE is a method that calculates the probability that a homolog of a given gene would fail to be detected by a homology search (using BLAST or a similar method) in a given species, even if the homolog were present and evolving normally. The result of this calculation informs how one interprets the result of a homology search failing to find homologs of a gene in some species. One possibility to explain such a result is that the gene is _actually absent_ from the genome in that species: a biological, and potentially interesting (e.g. if due to a gene loss or the birth of a new gene), result. A second explanation, often ignored, is that the homolog _is_ present in the genome of that species, but that the homology search merely lacks statistical power to detect it. Here, the apparent absense of the homolog is a technical/statistical limitation, and does not reflect underlying biology. By calculating the probability that your homology search would fail to detect a homolog _even if one were present_ and _even if it were evolving normally_ (e.g. no rate accelerations on a specific branch, potentially suggestive of biologically interesting changes), abSENSE informs the interpretation of a negative homology search result. If abSENSE finds that there is a high probability of a homolog being undetected even if present, you may not be as inclined to invoke a biological explanation for the result: the null model of a failure of the homology search is sufficient to explain what you observe. **The method is explained in complete detail in the paper in which it's introduced. This is the paper that is referred to in the rest of this README:** **Weisman CM, Murray AW, Eddy SR (2020) Many, but not all, lineage-specific genes can be explained by homology detection failure. PLoS Biol 18(11): e3000862. https://doi.org/10.1371/journal.pbio.3000862** There, it is applied to the specific case of lineage-specific genes, for which homologs appear absent in all species outside of a narrow lineage. The method itself is applicable to any case in which a homolog appears absent (e.g. a single species missing a homolog that one might interpret as a gene loss), and likewise, this code is applicable to all such cases. This is a rewrite of the [original repository](https://github.com/caraweisman/abSENSE/tree/c355c458e83722a0ffdf7284d4ea1f6f29ce205f) ## Usage ### abSENSE THE BASICS #### Quickstart: main analysis The main analysis script, `Run_abSENSE.py`, calculates the probabilities of homologs of genes from some "focal" species being undetected in a set of N other species. It can perform this analysis for an arbitrary number of genes in the taxon at one time. It requires a minimum of two input files: i) A file containing the bitscores of homologs of each gene to be analyzed in at least three of the species (including the focal species itself, so two others). ii) A file containing the N evolutionary distances, in substitutions/site, between the focal species and each other species. The distance between the focal species and itself should be 0. (If you don't already have such distances, a description of how to calculate them relatively painlessly can be found in Weisman et al 2020). Examples of both of these files for a subset of genes from S. cerevisiae and their orthologs 11 other fungal species (the same species analyzed in Weisman et al 2020) can be found in the folder Quickstart\_Examples: the bitscore file is Fungi\_Example\_Bitscores, and the distance file is Fungi\_Distances. They exemplify the formatting required for abSENSE to run (explained in more detail below). To run abSENSE on a given bitscore and distance file: ``` python Run_abSENSE.py \ --distfile <distance_file> \ --scorefile <score_file> ``` For example, to run abSENSE on the example fungal genes, type: ``` python Run_abSENSE.py \ --distfile Quickstart_Examples/Fungi_Distances \ --scorefile Quickstart_Examples/Fungi_Example_Bitscores ``` For each gene in the input bitscore file, the following will be computed: a) The probabilities of a homolog being undetected in each species (in the file `Detection_Failure_Probabilities`); b) The expected bitscores of homologs in each species (in the file `Predicted_bitscores`); c) The 99% confidence interval around this bitscore in each species (low and high bounds listed in separate files: `Bitscore_99PI_lowerbound_predictions` and `Bitscore_99PI_higherbound_predictions`) These results will be output to a set of tab-delimited files in a separate directory, by default named with the start time of the analysis. (You can specify the name with a command line option, see below). Additional information on output files is below. #### Quickstart: visualization A supplemental visualization script, `Plot_abSENSE.py`, performs the same analysis as above, but for one gene at a time, and also produces a visualization of the results (see Weisman et al 2020). It is run in the same way, except that it also requires specifying which single gene in the bitscore input file you wish to analyze. To run abSENSE on gene GENEID contained in a given bitscore with a given distance file: ``` python Plot_abSENSE.py \ --distfile <distance_file> \ --scorefile <score_file> \ --gene <GENEID> ``` For example, to analyze the S. cerevisiae gene Uli1, listed in the bitscore file under its RefSeq ID (NP\_116682.3), type: ``` python Plot_abSENSE.py \ --distfile Quickstart_Examples/Fungi_Distances \ --scorefile Quickstart_Examples/Fungi_Example_Bitscores \ --gene NP_116682.3 ``` The same results as above will be computed, but now they will be output to the terminal, and then the visualization will be shown. ### All Options You can specify advanced options with the additional command line options. You can view them all with `python Run_abSENSE.py --help` They are: `--out`: The prefix of the directory to which your results will be output. Default is the time at which you ran the analysis (to avoid overwriting of results). `--Eval`: The E-value threshold to be used (above this value, homologs will be considered undetected). Default is 0.001 (fairly permissive). `--genelenfile`: Allows you to specify a file containing the lengths (in aa) of all genes in the bitscore file to be analyzed. Default is 400 amino acids (~average protein size in many species) for all proteins. abSENSE predicts a bitscore, which is then converted to an E-value to determine detectability; this conversation technically requires knowledge of both the size of the database in which the search occurs (see below) and the length of the gene being searched. Because the conversion between these values and E-value is logarithmic, though, only fairly large changes in these values substantially affect results. Examples of such files containing the lengths of all S. cerevisiae and D. melanogaster genes can be found in Fungi\_Data/S\_cer\_Protein\_Lengths and Insect\_Data/D\_mel\_Protein\_Lengths. The format required is described more below. `--dblenfile`: Allows you to specify a file containing the sizes (in aa) of the database in which the homology search for each of your N species is performed. Default is 400 amino acids * 20,000 amino acids / gene = 8,000,000 amino acids (~average protein and proteome size in many species) for all species. abSENSE predicts a bitscore, which is then converted to an E-value to determine detectability; this conversation technically requires knowledge of both the size of the database in which the search occurs and the length of the gene being searched (see above). Because the conversion between these values and E-value is logarithmic, though, only fairly large changes in these values substantially affect results. Examples containing the lengths of all S. cerevisiae and D. melanogaster genes can be found in Fungi\_Data/Fungi_Database_Lengths and Insect\_Data/Insect\_Database\_Lengths. The format required is described more below. `--predall`: Default is False. When True, Causes abSENSE to calculate the probability of homologs being undetected, the expected bitscores, and 99\% confidence intervals not only in species in which homologs were actually undetected, but also for species in which homologs have been found. This is obviously not the main use case, and is especially uninformative when those homologs and their bitscores have been used in the prediction itself (see below). May potentially be useful to see if a homolog in one species, although detected, seems to be behaving anomalously compared to those in other species (eg due to rate acceleration). `--includeonly`: Allows you to restrict the species whose bitscores are used in predicting bitscores in other species. Mainly useful to do control-type analyses, such as Figure 5 in Weisman et al 2020, to show that predictions made from only a subset of the data are nonetheless reliable. If not specified, abSENSE uses all available bitscores in the prediction. The format here should be a list of the species names as they are written in the distance and bitscore files, separated by commas and NO SPACES. For example, to include only data from the yeasts S. cerevisiae, S. paradoxus, and S. kudriavzevii in an analysis based on the files in `Fungal_Data,` you would type: `--includeonly S_cer,S_par,S_kud` For example, to run an analysis on all S. cerevisiae proteins in the selected fungal species in which the lengths of each S. cerevisiae protein and the sizes of each species' search database (their annotated proteomes as indicated in the supplement of Weisman et al 2020) are specified: ``` python Run_abSENSE.py \ --distfile Fungi_Data/Fungi_Distances \ --scorefile Fungi_Data/Fungi_Bitscores \ --genelenfile Fungi_Data/S_cer_Protein_Lengths \ --dblenfile Fungi_Data/Fungi_Database_Lengths ``` The visualization script `Plot_abSENSE.py` takes all of the same options, with the exception of again requiring that the gene to be analyzed is specified by the --gene option, and also that the --genelenfile option is instead --genelen, after which should be entered an integer corresponding to the length of the gene. (With a single gene, it's hardly worth requiring a whole file: just give the number.) ### Output Files `Run_abSENSE.py` outputs six output files. Examples resulting from running abSENSE on the provided insect and fungal data can be found in `Fungi_Data/Fungi_abSENSE_Results/` and `Insect_Data/Insect_abSENSE_Results/` respectively. #### `Detection_failure_probabilities` The central output of the program. For each gene in the analysis, this contains the predicted probability that a homolog in each species would be undetected at the specified E-value by a homology search, even if the homolog were present. By default, this is only calculated in species in which the gene was not detected. Results for species in which homologs were detected are therefore listed as "detected". The setting --predall will calculate this value for all species, even those in which a homolog was in fact detected. If not enough data for a gene was provided to generate a bitscore prediction (bitscores of homologs from at least three species are needed), the results will read "not\_enough\_data". #### `Predicted_bitscores` For each gene in the analysis, this contains the predicted (maximum likelihood) bitscore of a homolog in each species. By default, bitscores are only predicted in species in which the gene was not detected. Results for species in which homologs were detected are therefore listed as "detected". The setting --predall will calculates this value for all species, even those for which a homolog was in fact detected. Here, the known bitscore (often used in the prediction process; see the option --includeonly) will be shown alongside the prediction. If the known bitscore was used in the fitting process, of course, these will usually be quite similar! If not enough data for a gene was provided to generate a bitscore prediction (bitscores of homologs from at least three species are needed), the results will read "not\_enough\_data". #### `Bitscore_99PI_upperbound_predictions` For each gene in the analysis, this contains the upper bound of the 99\% confidence interval for the bitscore of a homolog in each species. By default, this is only calculated in species in which the gene was not detected. Results for species in which homologs were detected are therefore listed as "detected". The setting --predall will calculates this value for all species, even those for which a homolog was in fact detected. Here, the known bitscore (often used in the prediction process; see the option --includeonly) will be shown alongside the prediction. If the known bitscore was used in the fitting process, of course, these will usually be quite similar! If not enough data for a gene was provided to generate a bitscore prediction (bitscores of homologs from at least three species are needed), the results will read "not\_enough\_data". #### `Bitscore_99PI_lowerbound_predictions` For each gene in the analysis, this contains the lower bound of the 99\% confidence interval for the bitscore of a homolog in each species. By default, this is only calculated in species in which the gene was not detected. Results for species in which homologs were detected are therefore listed as "detected". The setting --predall will calculates this value for all species, even those for which a homolog was in fact detected. Here, the known bitscore (often used in the prediction process; see the option --includeonly) will be shown alongside the prediction. If the known bitscore was used in the fitting process, of course, these will usually be quite similar! If not enough data for a gene was provided to generate a bitscore prediction (bitscores of homologs from at least three species are needed), the results will read "not\_enough\_data". #### `Parameter_values` For each gene in the analysis, this contains the best-fit (maximum likelihood) values of the a and b parameters. (See Weisman et al 2020 for full explanation.) These a and b parameters are calculated from bitscores of homologs in species included in the prediction process. If the command line option --includeonly is used, this will be only the species specified by that option. By default, all provided bitscores are used. If not enough data for a gene was provided to generate a bitscore prediction (bitscores of homologs from at least three species are needed), the results will read "not\_enough\_data". #### `Run_info` Contains information about the analysis, including names of input files, options/settings used, and the analysis time. ### Input File Formats Required files: #### The bitscore file For an analysis of M genes in N species (including the focal species), the bitscore file should be a tab-delimited file of N+1 columns by M+1 rows. The first row should begin with a blank entry, and should be followed by N entries containing the names of the N species in your analysis. These names should match those in the distance file (below) exactly. The remaining M rows should each begin with the name/identifier of the gene from the focal species to be analyzed, followed by the bitscore of that gene against its homolog in the species indicated at the top of the given column. For species in which homologs are undetected, this value should be __0__. For species in which a homolog is detected, but the orthology is unclear and so you wish to exclude it from being used in the fit (see Weisman et al 2020), this value should be __'N/A'__. Two examples are provided: Fungi\_Data/Fungi_\Bitscores and Insect_\Data/Insect\_Bitscores. #### The distance file For an analysis with N species (including the focal species), the distance file should be a tab-delimited file of 2 columns by N rows. Entries in the first column should contain the name of each species in your analysis. These names should match those in the bitscore file (above) exactly. Entries in the second column should contain the evolutionary distance between each species in the indicated column and the focal species. (The distance between the focal species and itself should always be 0.) Two examples are provided: Fungi\_Data/Fungi\_Distances and Insect\_Data/Insect\_Distances. #### Optional files: ##### The gene length file For an analysis of M genes, the gene length file should be a tab-delimited file of 2 columns by M rows. The first column should contain the names/identifiers of the gene from the focal species to be analyzed. These should match exactly the names in the bitscore file (above). The second column should contain the length in amino acids of that gene. If you don't already have such a file, here is a command to make one (named OUTPUTFILENAME), from a FASTA file (SEQFILENAME) containing all of the sequences: (It requires the easel package, which comes with the HMMER software, available at http://hmmer.org/documentation.html.) ``` esl-seqstat -a (SEQFILENAME) | \ awk '{print $2 "\t" $3}' | \ tac | \ sed -e '1,7d' | \ tac > (OUTFILENAME) ``` ##### The database length file For an analysis of N species, the database length file should be a tab-delimited file of 2 columns by N rows. The first column should contain the names of the N species. These should match exactly the names in the bitscore and distance files (above). The second column should contain the sizes, in amino acids, of the database in which the homology search for each species is performed. For example, if you are searching for a gene in each of the species' annotated proteomes, it should be the size of that species' proteome in amino acids. If instead you are searching against a pan-specific database, like for example NR, for all species, it should be the size of that database in amino acids. If you don't already have such a file, you can make one easily if you have the databases themselves in eg FASTA format (again requiring the easel package, downloadable with HMMER as in c) above: just run the command `esl-seqstat (FASTA FILE)` on each database; this will report the total length in aa of each database file. You can then put these into a tab-delimited file manually. If your BLAST search is done on a nucleotide genome of the outgroup species via TBLASTN, the database length that you should use for each genome is 2N, where N is the genome length in nucleotides. (For a genome of N nucleotides, there are ~N/3 codons in it, which can be read in each of 6 possible reading frames, for a total of 6N/3 = 2N amino acids.)

نیازمندی

مقدار	نام
-	click>=8.1.2
-	matplotlib>=3.5.1
-	numpy>=1.21.1
-	pandas>=1.1.5
-	scipy>=1.6.1
=3.7	typing_extensions
xtr	pytest>=6;
xtr	pytest-mock>=3.7.0;
xtr	Sphinx>=4.0;
xtr	myst_parser>=0.13;
xtr	sphinx-book-theme>=0.1.0;
xtr	sphinx_copybutton;
xtr	pytest>=6;
xtr	pytest-mock>=3.7.0;

زبان مورد نیاز

مقدار	نام
>=3.7	Python

نحوه نصب

نصب پکیج whl absense-0.1.6:

pip install absense-0.1.6.whl

نصب پکیج tar.gz absense-0.1.6:

pip install absense-0.1.6.tar.gz