# IBSpy

Python library to identify Identical By State (IBS) regions.

To build the k-mer database for KMC that the tests use, run this command:
```sh
kmc -k31 -r -ci1 -fm data/test4B.jagger.fa data/test4B.jagger.kmc_k31 tmp
```
## Installing IBSpy
The easiest way to install IBSpy is to use pip3.
```sh
pip3 install IBSpy
```
If ```pip3``` fails, you can clone the project and compile it with:
```sh
pip3 install cython biopython pyfaidx
python3 setup.py develop
```
Then you should have the IBSpy command available.
### KMC3
If you want to use the [KMC](https://github.com/refresh-bio/KMC) bindings, install KMC and compile its Python API following the KMC instructions.
Then run the following commands to set up the path for it.
```sh
cd KMC/py_kmc_api
source set_path.sh
```
## Preparing the databases
IBSpy requires a k-mer database built from the sequencing files. Currently two formats are supported:
1. Jellyfish: Follow the instructions on its [website](https://github.com/gmarcais/Jellyfish/blob/master/doc/Readme.md).
2. kmerGWAS: Uses an ad hoc file format that contains only the k-mers, in a binary representation and sorted. This option is faster than the Jellyfish version, but creating the k-mer table is less straightforward. The manual is [here](https://github.com/voichek/kmersGWAS/blob/master/manual.pdf).
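
To illustrate why a sorted binary k-mer table allows fast lookups, here is a minimal sketch. It is not the kmersGWAS file layout: the 2-bit encoding and the function names (`encode_kmer`, `build_table`, `contains`) are assumptions made for illustration only.

```python
from bisect import bisect_left

# 2-bit code per base; an assumption for illustration, not the kmersGWAS layout.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer):
    """Pack a k-mer into an integer, two bits per base."""
    value = 0
    for base in kmer.upper():
        value = (value << 2) | BASE_CODE[base]
    return value


def build_table(sequence, k):
    """Return the sorted, de-duplicated encoded k-mers of a sequence."""
    kmers = {encode_kmer(sequence[i:i + k]) for i in range(len(sequence) - k + 1)}
    return sorted(kmers)


def contains(table, kmer):
    """Binary search for a k-mer in the sorted table."""
    code = encode_kmer(kmer)
    pos = bisect_left(table, code)
    return pos < len(table) and table[pos] == code


table = build_table("ACGTACGTTG", k=4)
print(contains(table, "CGTA"))  # k-mer present in the sequence
print(contains(table, "GGGG"))  # k-mer absent from the sequence
```

Because the table is sorted, each lookup is a binary search rather than a scan, which is the general idea behind storing sorted binary k-mers.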
## Running unit tests
To make sure that your changes haven't broken the core of IBSpy, run the unit tests:
```sh
python3 setup.py test
```
## Running IBSpy
IBSpy has relatively few options; you can list them with the ```--help``` option.
```sh
IBSpy --help
usage: IBSPy [-h] [-w WINDOW_SIZE] [-k KMER_SIZE] [-d DATABASE] [-r REFERENCE]
[-z] [-o OUTPUT] [-f {kmerGWAS,jellyfish}]
optional arguments:
-h, --help show this help message and exit
-w WINDOW_SIZE, --window_size WINDOW_SIZE
window size to analyze
-k KMER_SIZE, --kmer_size KMER_SIZE
Kmer size of the database
-d DATABASE, --database DATABASE
Kmer database
-r REFERENCE, --reference REFERENCE
The reference with the position of the kmers
-z, --compress When an ouput file is present, it is compressed as .gz
-o OUTPUT, --output OUTPUT
Output file. If missing, the ouptut is sent to stdout
-f {kmerGWAS,kmerGWAS_mmap,jellyfish,kmc3}, --database_format {kmerGWAS,kmerGWAS_mmap,jellyfish,kmc3}
Database format
```
To generate the table with the number of observed k-mers and variants using a k-mer database from kmerGWAS, run the following command:
```sh
IBSpy --output "kmer_windows_LineXXX.tsv.gz" --database kmers_with_strand --reference arinaLrFor.fa --window_size 50000 --compress --database_format kmerGWAS
```
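For KMC3, the ```--database``` argument is the output name given to ```kmc``` when the database was created (the prefix shared by the ```.kmc_pre``` and ```.kmc_suf``` files), not the name of a file.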
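
The idea of counting observed k-mers and variants per window can be sketched as follows. This is an illustration under stated assumptions, not IBSpy's implementation: the function name `count_windows`, the use of a plain Python set as the k-mer database, and the rule that one run of consecutive missing k-mers counts as one variation are all hypothetical.

```python
def count_windows(reference, kmer_db, k, window_size):
    """Yield (start, end, total_kmers, observed_kmers, variations) per window.

    Assumption: a "variation" is one run of consecutive reference k-mers
    that are absent from the database.
    """
    for start in range(0, len(reference), window_size):
        end = min(start + window_size, len(reference))
        total = observed = variations = 0
        in_gap = False
        # Only k-mers that fit entirely inside the window are counted,
        # so k-mers spanning a window boundary are skipped in this sketch.
        for i in range(start, end - k + 1):
            total += 1
            if reference[i:i + k] in kmer_db:
                observed += 1
                in_gap = False
            elif not in_gap:
                variations += 1
                in_gap = True
        yield start, end, total, observed, variations


reference = "ACGTACGTTGCA"
query = "ACGTACGATGCA"  # differs from the reference at one position
k = 3
kmer_db = {query[i:i + k] for i in range(len(query) - k + 1)}

for row in count_windows(reference, kmer_db, k, window_size=6):
    print(*row, sep="\t")
```

In this toy input, the first window matches the query completely, while the single substitution in the second window removes two overlapping reference k-mers that form one run.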
## Running IBSplot
List the IBSplot options using ```--help```.
```sh
IBSplot --help
usage: IBSplot [-h] [-i IBSPY_COUNTS] [-w WINDOW_SIZE] [-f FILTER_COUNTS]
[-n N_COMPONENTS] [-c COVARIANCE_TYPE] [-s STITCH_NUMBER]
[-o OUTPUT] [-r REFERENCE] [-q QUERY] [-p PLOT_OUTPUT]
optional arguments:
-h, --help show this help message and exit
-i IBSPY_COUNTS, --IBSpy_counts IBSPY_COUNTS
tvs file genetared by IBSpy output
-w WINDOW_SIZE, --window_size WINDOW_SIZE
Windows size to count variations within
-f FILTER_COUNTS, --filter_counts FILTER_COUNTS
Filter number of variaitons above this threshold to
compute GMM model, default=None
-n N_COMPONENTS, --n_components N_COMPONENTS
Number of componenets for the GMM model, default=3
-c COVARIANCE_TYPE, --covariance_type COVARIANCE_TYPE
type of covariance used for GMM model, default="full"
-s STITCH_NUMBER, --stitch_number STITCH_NUMBER
Consecutive "outliers" in windows to stitch, default=3
-o OUTPUT, --output OUTPUT
tsv file with variations count by windows and summary
statistics
-r REFERENCE, --reference REFERENCE
genome reference name
-q QUERY, --query QUERY
query sample
-p PLOT_OUTPUT, --plot_output PLOT_OUTPUT
histograms and ascatter files in .PDF format
```
IBSplot uses the output table generated by IBSpy described above (e.g., ```"kmer_windows_LineXXX.tsv.gz"```). It can aggregate the variant counts into larger windows; the example below uses 400,000 bp windows to fit a Gaussian mixture model (GMM) and generate the plots. The GMM is described [here](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture).
To generate the table with variant counts categorized by the GMM as IBS or non-IBS and generate the plots, run the following command:
```sh
# minimal arguments
IBSplot --IBSpy_counts "kmeribs-Wheat_Jagger-Flame.tsv.gz" --window_size 400000 --output gmm_ibs.tsv.gz --reference Jagger --query Flame --plot_output gmm_plots.pdf
```
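
Aggregating counts from small windows into larger ones can be pictured with the sketch below. It only illustrates the idea; the tuple layout `(seqname, start, variations)` and the function name `aggregate_windows` are hypothetical and do not reflect the actual columns of the IBSpy table or the IBSplot code.

```python
from collections import defaultdict


def aggregate_windows(rows, window_size):
    """Sum the variation counts of small windows into larger windows.

    rows: iterable of (seqname, start, variations) tuples (hypothetical layout).
    Returns a sorted list of (seqname, start, end, variations) tuples.
    """
    totals = defaultdict(int)
    for seqname, start, variations in rows:
        # Start coordinate of the larger window that contains this small window
        bin_start = (start // window_size) * window_size
        totals[(seqname, bin_start)] += variations
    return [(seq, s, s + window_size, n) for (seq, s), n in sorted(totals.items())]


small_windows = [
    ("chr1", 0, 2),
    ("chr1", 50000, 1),
    ("chr1", 400000, 7),
    ("chr2", 0, 3),
]
for row in aggregate_windows(small_windows, window_size=400000):
    print(*row, sep="\t")
```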
In addition, you can include some or all of the following options to tune the GMM parameters and better separate IBS from non-IBS regions for the reference and query sample used:
```sh
IBSplot --filter_counts 1000 --n_components 3 --covariance_type 'full' --stitch_number 3
```
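
One possible reading of ```--stitch_number``` is sketched below. This is an assumption, not a description of what IBSplot does: here a short run of non-IBS windows (at most `stitch_number` long) that is flanked by IBS windows on both sides is relabelled as IBS. The function name `stitch` and the label strings are hypothetical.

```python
from itertools import groupby


def stitch(labels, stitch_number=3):
    """Relabel short interior runs of "non-IBS" windows as "IBS".

    Assumption: a run is stitched when it has at most `stitch_number`
    windows and is not at either end of the sequence of windows.
    """
    runs = [(label, len(list(group))) for label, group in groupby(labels)]
    stitched = []
    for idx, (label, length) in enumerate(runs):
        interior = 0 < idx < len(runs) - 1
        if label == "non-IBS" and interior and length <= stitch_number:
            label = "IBS"  # with two labels, interior runs are flanked by IBS
        stitched.extend([label] * length)
    return stitched


labels = (
    ["IBS"] * 2 + ["non-IBS"] * 2 + ["IBS"] + ["non-IBS"] * 4 + ["IBS"] + ["non-IBS"]
)
print(stitch(labels, stitch_number=3))
```

In this toy input the run of two non-IBS windows is stitched, the run of four is kept because it is longer than the threshold, and the trailing window is kept because it is not flanked on both sides.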