# CORNCOB: Beta-binomial regression of count data
## Introduction
This is a python implementation of the [R corncob package](https://github.com/bryandmartin/corncob).
This is a beta-binomal model-based approach for hypothesis testing of count data such as that generated by microbiome 16S rRNA or RNAseq experiments.
The method was first developed by [Dr. Willis](http://statisticaldiversitylab.com/team) and her mentee [Bryan D Martin](https://bryandmartin.github.io/), and is fully described in [Modeling microbial abundances and dysbiosis with beta-binomial regression](https://projecteuclid.org/euclid.aoas/1587002666).
CORNCOB stands for: **CO**unt **R**egressio**N** for **C**orrelated **O**bservations with the **B**eta-binomial
Why you should use this technique is detailed below.
## Installation
Installation can be done via pip:
`pip install corncob`
## Usage
`corncob -C counts.csv -VA covariates_abund.csv -VD covariates_disp.csv -O output.csv`
### _counts.csv_ format
```
element,specimen_label_01,specimen_label_02,....
total,1244,1344,....
sv_name,20,0,.....
```
The header row should provide specimen (nee observation) labels.
The first data row should be for the total count across all elements for each specimen (e.g. the total count of reads for all ASVs in a 16S rRNA experiment.)
Each row after should be for one element (ESV, OTU, gene, etc). The first column should be an unique identifier for that element. The remainder of columns should be the counts for _that_ element.
Zeros are fine. Unlike other count-based metrics (i.e. DESEQ2) you do _not_ need to zero-inflate and in fact should not. Nulls or blank cells will be replaced with zeros.
### _covariates.csv_ format
```
cov1,cov2,cov3,...
331,0.12,0000.1,...
....
```
This is an exogenous matrix for the covariates. Each column corresponds to one covariate to be fitted. An intercept column is always added. Any empty cells will be replaced with zeros. If not provided, only an intercept will be used for regression.
One can provide distinct covariate tables for abundance (`-VA`) and dispersion (`-VD`)
## Output
A CSV file with a header. The header has the various attributes, in the rough format of
`element_id,converged,abd__cov_name__Estimate,abd__cov_name__se,abd__cov_name__t,abd__cov_name__p,...`
The first block indicates if this is for abundance (`abd`) or dispersion (`disp`).
The second block in the same of the covariate, taken directly from the input files.
The third block is
- `Estimate`: The fitted estimated coefficent for that covariate for abundance or dispersion
- `se`: The standard error of the estimate
- `t`: The t-value, derived via the Wald statistic
- `p`: The p-value (derived from the t-value). Two tailed.
## Why use beta-binomial regression?
Thanks to next-generation sequencing, biological scientists have an opportunity to make discoveries with count data. Whether RNAseq, ATAC-seq, microbiome 16S rRNA amplicons, or shotgun metagenomes, often the data we wish to compare to our outcomes are ultimately a big table of number of reads assigned in each specimen to a gene, locus, microbe, etc. These read-counts pose an analytic challenge.
While superficially appearing like any other continuous measure (particularly when normalized to relative abundance), these counts-of-reads are tricky. It is typical that the total number of reads to vary a lot from specimen-to-specimen. Even modest noise in the most predominant features (e.g. OTU) can wreak havoc on the precision of true featues further into the tail. Each read-count is not _really_ independent; thre are likely all sorts of latent covariance we are barely beginning to understand. Compounding all this is that most of these read-count matricies are sparse; most of the values are zeros.
*An ideal regression method for read-count data would:*
- Be fine with zeros, even a _lot_ of zeros, and not neet to do something like "add one to each count" to work.
- Recognize read-counts are likely correlated with one another in ways that we do not fully understand
- Take advantage of the variable total read depth (total reads) per specimen.
- Consider how our covariates (outcomes) relate not just to the relative abundance but also the variability (dispersion) of a given feature's read count.
The adaptation of *beta-binomial* models by [Martin et al](https://projecteuclid.org/euclid.aoas/1587002666) fullfils these criteria.