# Relative K-mer Project
## Abstract
### WGS analysis reveals extended natural transformation in Campylobacter impacting diagnostics and the pathogens adaptive potential.
### Running title: WGS analysis of Campylobacter hybrid strains
### Julia C. Golz 1a, Lennard Epping 2#, Marie-Theres Knüver 1a, Maria Borowiak 1b, Felix Hartkopf 2, Carlus Deneke 1b, Burkhard Malorny 1b, Torsten Semmler 2, Kerstin Stingl 1a*
1 German Federal Institute for Risk Assessment, Department of Biological Safety, a National Reference Laboratory for *Campylobacter*, b Study Centre for Genome Sequencing and Analysis, Berlin, Germany
2 Robert Koch Institute, Microbial Genomics, Berlin, Germany
\# sharing first author
\* corresponding author
Over the past decade, *Campylobacter* infections have become more common worldwide. They can cause diarrhea, abdominal pain, fever, headache, nausea, and/or vomiting, and they pose a serious danger to public health. This has prompted efforts to improve prevention and treatment and to reduce transmission. As described by Kaakoush et al. [1], the main risk factors are the consumption of animal products and water, contact with animals, and international travel.
As the threat to public health differs among *Campylobacter* species, it is important to identify dangerous species and to characterize their genotype and phenotype. In this work, a k-mer mapping approach is used to identify recombination events and the genes involved in order to describe hybrid species. Hybrids of *Campylobacter jejuni* and *Campylobacter coli* are analyzed to validate this approach and to develop a workflow that can be applied to emerging hybrids in general, allowing fast and reliable classification of hybrids.
KMC3 [2] and BEDTools [5] are used to extract k-mers from *Campylobacter* genomes and to calculate the k-mers shared between the two species and their hybrids. These k-mers are then used in combination with BLAST [3] and Bowtie 2 [4] to select genes that are shared with the hybrid genomes. The genes can be grouped into batches, each involved in a single recombination event. A visualization of the gene coverage, generated with R, provides further information about the selected genes.
This work will provide a new generic tool for hybrid analysis that could be expanded to other bacteria and enable researchers to classify new species and recombination events in a fast and reliable manner.
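The shared k-mer idea described above can be illustrated in plain Python. This is only a minimal sketch, not the KMC3/BEDTools pipeline that RKP runs: the function names `kmers` and `donor_kmers_in_hybrid` are hypothetical, the toy sequences are invented, reverse complements are ignored, and it assumes the k-mers of interest are those the hybrid shares with the donor but not with the acceptor.

```python
def kmers(sequence, k):
    """Return the set of all k-mers (substrings of length k) in a sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def donor_kmers_in_hybrid(acceptor, donor, hybrid, k):
    """K-mers found in the donor and the hybrid but absent from the acceptor."""
    donor_specific = kmers(donor, k) - kmers(acceptor, k)
    return donor_specific & kmers(hybrid, k)


# Toy sequences, invented for illustration only.
acceptor = "ATGCATGCATGC"
donor = "ATGGGTTTACGC"
hybrid = "ATGCATGGGTTT"  # acceptor-like start followed by a donor-like stretch

shared = donor_kmers_in_hybrid(acceptor, donor, hybrid, k=4)
print(sorted(shared))
```

With real genomes, sets of this kind become very large, which is one reason a dedicated k-mer counter such as KMC3 is used instead.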
[1] Nadeem O. Kaakoush, Natalia Castaño-Rodríguez, Hazel M. Mitchell, Si Ming Man, Global Epidemiology of Campylobacter Infection, Clinical Microbiology Reviews, Volume 28, Issue 3, June 2015, Pages 687–720, https://doi.org/10.1128/CMR.00006-15
[2] Marek Kokot, Maciej Długosz, Sebastian Deorowicz, KMC 3: counting and manipulating k-mer statistics, Bioinformatics, Volume 33, Issue 17, 01 September 2017, Pages 2759–2761, https://doi.org/10.1093/bioinformatics/btx304
[3] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, David J. Lipman, Basic local alignment search tool, Journal of Molecular Biology, Volume 215, Issue 3, 1990, Pages 403–410, https://doi.org/10.1016/S0022-2836(05)80360-2
[4] Ben Langmead, Steven L. Salzberg, Fast gapped-read alignment with Bowtie 2, Nature Methods, Volume 9, 2012, Pages 357–359
[5] Aaron R. Quinlan, Ira M. Hall, BEDTools: a flexible suite of utilities for comparing genomic features, Bioinformatics, Volume 26, Issue 6, 15 March 2010, Pages 841–842, https://doi.org/10.1093/bioinformatics/btq033
## Requirements
+ [Conda](https://docs.conda.io/en/latest/)
or
+ Python 3.X
+ numpy = 1.17.3
+ matplotlib = 3.1.2
+ pandas = 0.25.3
+ biopython = 1.76
+ argparse = 1.4.0
+ tqdm = 4.41.1
+ kmc = 3.1.1
+ bowtie2 = 2.3.5
+ bedtools = 2.29.2
+ r = 3.6
+ pheatmap = 1.0.12
+ gplots = 3.0.1.1
+ blast = 2.9.0
+ samtools = 1.10
+ bedops = 2.4.37
+ seqkit = 0.11.0
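When the dependencies are installed manually rather than through Conda, it can help to confirm that the command-line tools are reachable before running RKP. The following is a minimal sketch using only the Python standard library; the executable names in `REQUIRED_TOOLS` are assumptions derived from the package list above (for example `blastn` for the blast package), and `missing_tools` is a hypothetical helper that is not part of RKP.

```python
import shutil

# Executable names are assumptions based on the package list above.
REQUIRED_TOOLS = ["kmc", "bowtie2", "bedtools", "blastn", "samtools", "seqkit", "Rscript"]


def missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required command-line tools were found.")
```

This only checks that the executables exist; it does not verify the versions listed above.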
## Installation
1. Change to the src directory in the RKP repository:
```bash
cd path/to/repo/src
```
2. Create an environment with all dependencies needed by RKP:
```bash
conda env create -f RKP.yaml
```
3. Activate the RKP environment:
```bash
conda activate RKP
```
4. Run RKP:
```bash
python RKP.py -A <acceptor genome dir A> -B <hybrid genome dir B> -C <donor genome dir C> -k <kmerlength> -a <acceptor threshold> -c <donor threshold> -g <acceptor reference genome fasta> -f <acceptor reference genome gff> -o <output directory>
```
Required parameters:
| Parameter | Description |
|------------|--------------|
| -A, -C | Directories with the acceptor (-A) and donor (-C) genomes (.fna) |
| -B | Directory with the hybrid genomes (.fasta) and fnn files |
| -k | K-mer length |
| -at | Fraction (0 to 1) of acceptor isolates in which a k-mer should be present |
| -dt | Fraction (0 to 1) of donor isolates in which a k-mer should be present |
| -g | Acceptor reference genome (FASTA) |
| -f | Acceptor reference GFF file |
| -o | Output directory |
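The two threshold parameters can be read as a minimum fraction of isolates that must contain a given k-mer. The sketch below illustrates that reading on toy data; it is not taken from RKP, the helper `kmers_above_threshold` is hypothetical, and treating the threshold as inclusive (`>=`) is an assumption.

```python
from collections import Counter


def kmers_above_threshold(isolate_kmer_sets, threshold):
    """Return k-mers present in at least `threshold` (0 to 1) of the isolates."""
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    # Count in how many isolates each k-mer occurs (each set counts once).
    counts = Counter(kmer for kmer_set in isolate_kmer_sets for kmer in kmer_set)
    minimum = threshold * len(isolate_kmer_sets)
    return {kmer for kmer, n in counts.items() if n >= minimum}


# Toy k-mer sets for four acceptor isolates, invented for illustration.
acceptor_isolates = [
    {"ATGC", "TGCA", "GCAT"},
    {"ATGC", "TGCA"},
    {"ATGC", "CATG"},
    {"ATGC", "TGCA", "CATG"},
]

print(sorted(kmers_above_threshold(acceptor_isolates, 0.75)))
```

A threshold of 1 keeps only k-mers found in every isolate, while lower values tolerate k-mers that are missing from some isolates.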
Optional parameters:
| Parameter | Description |
|------------|--------------|
| -d | Keep all temporary files |
| --version | Show version of RKP |
| -h | Show help |
| -t | Number of threads (default: 8) |
## File structure of output
```
output
│
├───Acceptor
│   │   (only temporary files)
│
├───Hybrid
│   │   *_iso_seq_protein.fasta
│   │   *_iso_seq.fasta
│   │   mapping_result_Genes_count.csv
│   │   mapping_result_Genes_cutoff_20.csv
│   │   mapping_result_Genes_raw.csv
│   │   mapping_result.csv
│   │   mapping_result.pdf
│   │   recombination_cov_<kmerLength>_W50.pdf
│   │   recombination_cov_<kmerLength>_W100.pdf
│   │   recombination_cov_<kmerLength>_W200.pdf
│   │   recombination_cov_<kmerLength>_W300.pdf
│   │   recombination_cov_<kmerLength>_W400.pdf
│   │   recombination_cov_<kmerLength>_W500.pdf
│   │   Recombination_result_<kmerLength>_W50.csv
│   │   Recombination_result_<kmerLength>_W100.csv
│   │   Recombination_result_<kmerLength>_W200.csv
│   │   Recombination_result_<kmerLength>_W300.csv
│   │   Recombination_result_<kmerLength>_W400.csv
│   │   Recombination_result_<kmerLength>_W500.csv
│
├───Donor
│   │   (only temporary files)
│
└───RKP.log
```
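The recombination results follow a regular naming pattern, so the expected files can be located programmatically after a run. The snippet below is a minimal sketch under two assumptions: that `<kmerLength>` is replaced by the integer passed with `-k`, and that the six window sizes shown in the tree are the only ones produced. The helper `recombination_result_paths`, the output directory name `output`, and the k-mer length 21 are illustrative choices, not RKP defaults.

```python
from pathlib import Path

# Window sizes as they appear in the output tree above.
WINDOW_SIZES = (50, 100, 200, 300, 400, 500)


def recombination_result_paths(output_dir, kmer_length):
    """Map each window size to the expected recombination result CSV path."""
    hybrid_dir = Path(output_dir) / "Hybrid"
    return {
        window: hybrid_dir / f"Recombination_result_{kmer_length}_W{window}.csv"
        for window in WINDOW_SIZES
    }


# Example values; adjust to the -o and -k values used for the actual run.
paths = recombination_result_paths("output", 21)
for window, path in paths.items():
    status = "found" if path.exists() else "missing"
    print(f"W{window}: {path} ({status})")
```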
## Call structure
```mermaid
graph TD;
RKP.py-->create_kmers.sh;
create_kmers.sh-->map_kmers.sh;
RKP.py-->heatmap.R;
```
## Workflow
