# biobear (v0.4.0)
biobear is a Python library for reading and searching bioinformatic file formats. It uses Rust as its backend and produces Polars DataFrames or Arrow record batches as its output.
The Python package has minimal dependencies and only requires Polars. biobear can read various bioinformatic file formats, including FASTA, FASTQ, VCF, BAM, and GFF. It can also query some indexed file formats, including VCF and BAM.
- [Installation](#installation)
- [Usage](#usage)
- [Similar Packages](#similar-packages)
- [API Documentation](#api-documentation)
- [vcf\_reader](#vcf_reader)
- [VCFReader](#vcfreader)
- [VCFIndexedReader](#vcfindexedreader)
- [genbank\_reader](#genbank_reader)
- [GenbankReader](#genbankreader)
- [fasta\_reader](#fasta_reader)
- [FastaReader](#fastareader)
- [compression](#compression)
- [Compression](#compression-1)
- [\_\_init\_\_](#__init__-4)
- [bam\_reader](#bam_reader)
- [BamReader](#bamreader)
- [BamIndexedReader](#bamindexedreader)
- [fastq\_reader](#fastq_reader)
- [FastqReader](#fastqreader)
- [gff\_reader](#gff_reader)
- [GFFReader](#gffreader)
## Installation
pip install biobear
Python 3.10 or higher is recommended, though Python 3.7+ should work.
## Usage
Read a FASTQ file:
import biobear as bb
df = bb.FastqReader("test.fq").read()
# ┌─────────┬───────────────────────┬───────────────────────────────────┬───────────────────────────────────┐
# │ name ┆ description ┆ sequence ┆ quality │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ str ┆ str │
# ╞═════════╪═══════════════════════╪═══════════════════════════════════╪═══════════════════════════════════╡
# │ SEQ_ID ┆ This is a description ┆ GATTTGGGGTTCAAAGCAGTATCGATCAAATA… ┆ !''*((((***+))%%%++)(%%%%).1***-… │
# │ SEQ_ID2 ┆ null ┆ GATTTGGGGTTCAAAGCAGTATCGATCAAATA… ┆ !''*((((***+))%%%++)(%%%%).1***-… │
# └─────────┴───────────────────────┴───────────────────────────────────┴───────────────────────────────────┘
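The `quality` column comes back as a plain string, so per-base scores have to be decoded separately. A minimal sketch, assuming the file uses the common Phred+33 encoding (an assumption about the data, not something biobear reports); `decode_phred33` is a hypothetical helper and not part of biobear:

```python
from typing import List


def decode_phred33(quality: str) -> List[int]:
    """Convert a FASTQ quality string to integer Phred scores.

    Assumes Phred+33 (Sanger / Illumina 1.8+) encoding.
    """
    return [ord(char) - 33 for char in quality]


# First four characters of the quality string shown above.
print(decode_phred33("!''*"))  # [0, 6, 6, 9]
```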
Read a gzipped FASTQ file:
import biobear as bb
from biobear.compression import Compression
df = bb.FastqReader("./python/tests/data/test.fastq.gz", compression=Compression.GZIP).read()
# ┌─────────┬─────────────┬───────────────────────────────────┬───────────────────────────────────┐
# │ name ┆ description ┆ sequence ┆ quality │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ str ┆ str │
# ╞═════════╪═════════════╪═══════════════════════════════════╪═══════════════════════════════════╡
# │ SEQ_ID ┆ null ┆ GATTTGGGGTTCAAAGCAGTATCGATCAAATA… ┆ !''*((((***+))%%%++)(%%%%).1***-… │
# │ SEQ_ID2 ┆ null ┆ GATTTGGGGTTCAAAGCAGTATCGATCAAATA… ┆ !''*((((***+))%%%++)(%%%%).1***-… │
# └─────────┴─────────────┴───────────────────────────────────┴───────────────────────────────────┘
# The compression type is also inferred from the extension of the file
df = bb.FastqReader("test.fq.gz").read()
# ┌─────────┬─────────────┬───────────────────────────────────┬───────────────────────────────────┐
# │ name ┆ description ┆ sequence ┆ quality │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ str ┆ str │
# ╞═════════╪═════════════╪═══════════════════════════════════╪═══════════════════════════════════╡
# │ SEQ_ID ┆ null ┆ GATTTGGGGTTCAAAGCAGTATCGATCAAATA… ┆ !''*((((***+))%%%++)(%%%%).1***-… │
# │ SEQ_ID2 ┆ null ┆ GATTTGGGGTTCAAAGCAGTATCGATCAAATA… ┆ !''*((((***+))%%%++)(%%%%).1***-… │
# └─────────┴─────────────┴───────────────────────────────────┴───────────────────────────────────┘
Query an indexed VCF file:
import biobear as bb
# Will error if test.vcf.gz.tbi is not present
df = bb.VCFIndexedReader("test.vcf.gz").query("1")
# ┌────────────┬──────────┬───────┬───────────┬───┬───────────────┬────────┬───────────────────────────────────┬────────────────┐
# │ chromosome ┆ position ┆ id ┆ reference ┆ … ┆ quality_score ┆ filter ┆ info ┆ format │
# │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ i32 ┆ str ┆ str ┆ ┆ f32 ┆ str ┆ str ┆ str │
# ╞════════════╪══════════╪═══════╪═══════════╪═══╪═══════════════╪════════╪═══════════════════════════════════╪════════════════╡
# │ 1 ┆ 3000150 ┆ ┆ C ┆ … ┆ 59.200001 ┆ PASS ┆ AN=4;AC=2 ┆ GT:GQ │
# │ 1 ┆ 3000151 ┆ ┆ C ┆ … ┆ 59.200001 ┆ PASS ┆ AN=4;AC=2 ┆ GT:DP:GQ │
# │ 1 ┆ 3062915 ┆ id3D ┆ GTTT ┆ … ┆ 12.9 ┆ q10 ┆ DP4=1,2,3,4;AN=4;AC=2;INDEL;STR=… ┆ GT:GQ:DP:GL │
# │ 1 ┆ 3062915 ┆ idSNP ┆ G ┆ … ┆ 12.6 ┆ test ┆ TEST=5;DP4=1,2,3,4;AN=3;AC=1,1 ┆ GT:TT:GQ:DP:GL │
# │ 1 ┆ 3106154 ┆ ┆ CAAA ┆ … ┆ 342.0 ┆ PASS ┆ AN=4;AC=2 ┆ GT:GQ:DP │
# └────────────┴──────────┴───────┴───────────┴───┴───────────────┴────────┴───────────────────────────────────┴────────────────┘
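The `info` column is returned as a raw string such as `AN=4;AC=2`. If individual fields are needed, they can be split out after the query. A rough sketch; `parse_info` is a hypothetical helper, not part of biobear, and it keeps every value as a string:

```python
from typing import Dict, Optional


def parse_info(info: str) -> Dict[str, Optional[str]]:
    """Split a VCF INFO string into a dict.

    Flag entries without a value (e.g. INDEL) map to None.
    """
    fields: Dict[str, Optional[str]] = {}
    for entry in info.split(";"):
        if not entry or entry == ".":
            continue  # skip empty entries and the VCF "missing" marker
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else None
    return fields


print(parse_info("TEST=5;DP4=1,2,3,4;AN=3;AC=1,1"))
# {'TEST': '5', 'DP4': '1,2,3,4', 'AN': '3', 'AC': '1,1'}
```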
## Similar Packages
Similar packages and/or inspiration for this package:
- https://github.com/abdenlab/saimin/
- https://github.com/tshauck/brrrr/
- https://github.com/natir/vcf2parquet/
- https://github.com/zaeleus/noodles/
- https://github.com/eto-ai/lance
## API Documentation
These docs are auto-generated. Please file an issue if something is amiss.
<a id="vcf_reader"></a>
### vcf\_reader
VCF File Readers.
<a id="vcf_reader.VCFReader"></a>
#### VCFReader
class VCFReader()
A VCF File Reader.
This class is used to read a VCF file and convert it to a polars DataFrame.
<a id="vcf_reader.VCFReader.__init__"></a>
##### \_\_init\_\_
def __init__(path: Path)
Initialize the VCFReader.
- `path` _Path_ - Path to the VCF file.
<a id="vcf_reader.VCFReader.to_arrow_record_batch_reader"></a>
##### to\_arrow\_record\_batch\_reader
def to_arrow_record_batch_reader() -> pa.RecordBatchReader
Convert the VCF reader to an arrow batch reader.
<a id="vcf_reader.VCFReader.to_arrow_scanner"></a>
##### to\_arrow\_scanner
def to_arrow_scanner() -> ds.Scanner
Convert the VCF reader to an arrow scanner.
<a id="vcf_reader.VCFReader.read"></a>
##### read
def read() -> pl.DataFrame
Read the VCF file and return a polars DataFrame.
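When a file is too large to load in one go with `read()`, the record batch reader can be consumed incrementally instead. The sketch below assumes only that the returned object behaves like a standard `pyarrow.RecordBatchReader`, which is iterable and yields batches exposing `num_rows`. `count_records` is a hypothetical helper, not part of biobear:

```python
def count_records(reader) -> int:
    """Count records by iterating Arrow batches instead of calling read().

    `reader` is any object with a to_arrow_record_batch_reader() method,
    such as the readers documented in this section.
    """
    total = 0
    for batch in reader.to_arrow_record_batch_reader():
        total += batch.num_rows
    return total
```

It would be called with a reader instance, for example a `VCFReader` constructed as described above.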
<a id="vcf_reader.VCFIndexedReader"></a>
#### VCFIndexedReader
class VCFIndexedReader()
An Indexed VCF File Reader.
This class is used to read or query an indexed VCF file and convert it to a
polars DataFrame.
<a id="vcf_reader.VCFIndexedReader.__init__"></a>
##### \_\_init\_\_
def __init__(path: Path)
Initialize the VCFIndexedReader.
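The usage example notes that the reader errors when `test.vcf.gz.tbi` is not present. A small pre-flight check can give a clearer message before the reader is constructed. This sketch assumes the index always sits next to the data file with `.tbi` appended, as in that example; `require_tabix_index` is a hypothetical helper, not part of biobear:

```python
from pathlib import Path


def require_tabix_index(vcf_path) -> Path:
    """Return the expected .tbi path, raising if it does not exist."""
    vcf_path = Path(vcf_path)
    # Assumed convention: "<file>.tbi" next to the data file.
    index_path = vcf_path.with_name(vcf_path.name + ".tbi")
    if not index_path.is_file():
        raise FileNotFoundError(f"Missing tabix index: {index_path}")
    return index_path
```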
<a id="vcf_reader.VCFIndexedReader.read"></a>
##### read
def read() -> pl.DataFrame
Read the VCF file and return a polars DataFrame.
<a id="vcf_reader.VCFIndexedReader.to_arrow_record_batch_reader"></a>
##### to\_arrow\_record\_batch\_reader
def to_arrow_record_batch_reader() -> pa.RecordBatchReader
Convert the VCF reader to an arrow batch reader.
<a id="vcf_reader.VCFIndexedReader.to_arrow_scanner"></a>
##### to\_arrow\_scanner
def to_arrow_scanner() -> ds.Scanner
Convert the VCF reader to an arrow scanner.
<a id="vcf_reader.VCFIndexedReader.query"></a>
##### query
def query(region: str) -> pl.DataFrame
Query the VCF file and return a polars DataFrame.
- `region` _str_ - The region to query.
<a id="genbank_reader"></a>
### genbank\_reader
Genbank file reader.
<a id="genbank_reader.GenbankReader"></a>
#### GenbankReader
class GenbankReader()
<a id="genbank_reader.GenbankReader.__init__"></a>
##### \_\_init\_\_
def __init__(path: Path)
Read a GenBank file.
- `path` _Path_ - Path to the GenBank file.
<a id="genbank_reader.GenbankReader.read"></a>
##### read
def read() -> pl.DataFrame
Read the GenBank file and return a polars DataFrame.
<a id="genbank_reader.GenbankReader.to_arrow_scanner"></a>
##### to\_arrow\_scanner
def to_arrow_scanner() -> ds.Scanner
Convert the GenBank reader to an arrow scanner.
<a id="genbank_reader.GenbankReader.to_arrow_record_batch_reader"></a>
##### to\_arrow\_record\_batch\_reader
def to_arrow_record_batch_reader() -> pa.RecordBatchReader
Convert the GenBank reader to an arrow batch reader.
<a id="fasta_reader"></a>
### fasta\_reader
FASTA file reader.
<a id="fasta_reader.FastaReader"></a>
#### FastaReader
class FastaReader()
<a id="fasta_reader.FastaReader.__init__"></a>
##### \_\_init\_\_
def __init__(path: Path, compression: Compression = Compression.INFERRED)
Read a fasta file.
- `path` _Path_ - Path to the fasta file.
- `compression` _Compression_ - Compression type of the file. Defaults to `Compression.INFERRED`.
<a id="fasta_reader.FastaReader.read"></a>
##### read
def read() -> pl.DataFrame
Read the fasta file and return a polars DataFrame.
<a id="fasta_reader.FastaReader.to_arrow_scanner"></a>
##### to\_arrow\_scanner
def to_arrow_scanner() -> ds.Scanner
Convert the fasta reader to an arrow scanner.
<a id="fasta_reader.FastaReader.to_arrow_record_batch_reader"></a>
##### to\_arrow\_record\_batch\_reader
def to_arrow_record_batch_reader() -> pa.RecordBatchReader
Convert the fasta reader to an arrow batch reader.
<a id="compression"></a>
### compression
Compression configuration.
<a id="compression.Compression"></a>
#### Compression
class Compression(Enum)
Compression types for files.
<a id="compression.Compression.from_file"></a>
##### from\_file
def from_file(cls, path: os.PathLike) -> "Compression"
Infer the compression type from the file extension.
<a id="compression.Compression.infer_or_use"></a>
##### infer\_or\_use
def infer_or_use(path: os.PathLike) -> "Compression"
Infer the compression type from the file extension if needed.
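To make the relationship between the two methods concrete, here is an illustrative stand-in. It is not biobear's implementation: the class name `SketchCompression`, the `NONE` member, and the `.gz`-only rule are assumptions. Only `INFERRED`, `GZIP`, and the two method names come from this README.

```python
import os
from enum import Enum
from pathlib import Path


class SketchCompression(Enum):
    """Illustrative stand-in for Compression (not the real class)."""

    INFERRED = "inferred"
    NONE = "none"  # assumed member for uncompressed files
    GZIP = "gzip"

    @classmethod
    def from_file(cls, path: os.PathLike) -> "SketchCompression":
        """Infer the compression type from the file extension."""
        if Path(path).suffix == ".gz":
            return cls.GZIP
        return cls.NONE

    def infer_or_use(self, path: os.PathLike) -> "SketchCompression":
        """Infer from the extension only when set to INFERRED."""
        if self is SketchCompression.INFERRED:
            return SketchCompression.from_file(path)
        return self


print(SketchCompression.INFERRED.infer_or_use("test.fq.gz"))
# SketchCompression.GZIP
```

An explicitly chosen value is returned unchanged, so passing `compression=Compression.GZIP` would take precedence over whatever the extension suggests under this reading.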
<a id="__init__"></a>
### \_\_init\_\_
Main biobear package.
<a id="bam_reader"></a>
### bam\_reader
BAM File Readers.
<a id="bam_reader.BamReader"></a>
#### BamReader
class BamReader()
A BAM File Reader.
<a id="bam_reader.BamReader.__init__"></a>
##### \_\_init\_\_
def __init__(path: Path)
Initialize the BamReader.
- `path` _Path_ - Path to the BAM file.
<a id="bam_reader.BamReader.to_arrow_record_batch_reader"></a>
##### to\_arrow\_record\_batch\_reader
def to_arrow_record_batch_reader() -> pa.RecordBatchReader
Convert the BAM reader to an arrow batch reader.
<a id="bam_reader.BamReader.to_arrow_scanner"></a>
##### to\_arrow\_scanner
def to_arrow_scanner() -> ds.Scanner
Convert the BAM reader to an arrow scanner.
<a id="bam_reader.BamReader.read"></a>
##### read
def read() -> pl.DataFrame
Read the BAM file and return a polars DataFrame.
<a id="bam_reader.BamIndexedReader"></a>
#### BamIndexedReader
class BamIndexedReader()
An Indexed BAM File Reader.
<a id="bam_reader.BamIndexedReader.__init__"></a>
##### \_\_init\_\_
def __init__(path: Path, index: Path)
Initialize the BamIndexedReader.
- `path` _Path_ - Path to the BAM file.
- `index` _Path_ - Path to the BAM index file.
<a id="bam_reader.BamIndexedReader.read"></a>
##### read
def read() -> pl.DataFrame
Read the BAM file and return a polars DataFrame.
<a id="bam_reader.BamIndexedReader.to_arrow_record_batch_reader"></a>
##### to\_arrow\_record\_batch\_reader
def to_arrow_record_batch_reader() -> pa.RecordBatchReader
Convert the BAM reader to an arrow batch reader.
<a id="bam_reader.BamIndexedReader.to_arrow_scanner"></a>
##### to\_arrow\_scanner
def to_arrow_scanner() -> ds.Scanner
Convert the BAM reader to an arrow scanner.
<a id="bam_reader.BamIndexedReader.query"></a>
##### query
def query(chrom: str, start: int, end: int) -> pl.DataFrame
Query the BAM file and return a polars DataFrame.
- `chrom` _str_ - The chromosome to query.
- `start` _int_ - The start position to query.
- `end` _int_ - The end position to query.
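For large regions it can help to query in fixed-size windows and process each result separately. The sketch below only generates the window coordinates. It assumes `start` and `end` are inclusive, which the description above does not state, so check that against the actual behaviour. `iter_windows` is a hypothetical helper, not part of biobear:

```python
from typing import Iterator, Tuple


def iter_windows(start: int, end: int, size: int) -> Iterator[Tuple[int, int]]:
    """Yield consecutive (start, end) windows covering [start, end].

    Bounds are treated as inclusive (an assumption, see above).
    """
    if size <= 0:
        raise ValueError("size must be positive")
    window_start = start
    while window_start <= end:
        window_end = min(window_start + size - 1, end)
        yield window_start, window_end
        window_start = window_end + 1


print(list(iter_windows(1, 250, 100)))
# [(1, 100), (101, 200), (201, 250)]
```

Each pair could then be passed to `query(chrom, start, end)`. Records that span a window boundary may show up in more than one result, so deduplicate if that matters.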
<a id="fastq_reader"></a>
### fastq\_reader
FASTQ reader.
<a id="fastq_reader.FastqReader"></a>
#### FastqReader
class FastqReader()
<a id="fastq_reader.FastqReader.__init__"></a>
##### \_\_init\_\_
def __init__(path: os.PathLike, compression: Compression = Compression.INFERRED)
Read a fastq file.
- `path` _Path_ - Path to the fastq file.
- `compression` _Compression_ - Compression type of the file. Defaults to `Compression.INFERRED`.
<a id="fastq_reader.FastqReader.read"></a>
##### read
def read() -> pl.DataFrame
Read the fastq file and return a polars DataFrame.
<a id="fastq_reader.FastqReader.to_arrow_scanner"></a>
##### to\_arrow\_scanner
def to_arrow_scanner() -> ds.Scanner
Convert the fastq reader to an arrow scanner.
<a id="fastq_reader.FastqReader.to_arrow_record_batch_reader"></a>
##### to\_arrow\_record\_batch\_reader
def to_arrow_record_batch_reader() -> pa.RecordBatchReader
Convert the fastq reader to an arrow batch reader.
<a id="gff_reader"></a>
### gff\_reader
GFF File Reader.
<a id="gff_reader.GFFReader"></a>
#### GFFReader
class GFFReader()
A GFF File Reader.
<a id="gff_reader.GFFReader.__init__"></a>
##### \_\_init\_\_
def __init__(path: Path, compression: Compression = Compression.INFERRED)
Initialize the GFFReader.
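- `path` _Path_ - Path to the GFF file.
- `compression` _Compression_ - Compression type of the file. Defaults to `Compression.INFERRED`.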
<a id="gff_reader.GFFReader.read"></a>
##### read
def read() -> pl.DataFrame
Read the GFF file and return a polars DataFrame.
<a id="gff_reader.GFFReader.to_arrow_record_batch_reader"></a>
##### to\_arrow\_record\_batch\_reader
def to_arrow_record_batch_reader() -> pa.RecordBatchReader
Convert the GFF reader to an arrow batch reader.
<a id="gff_reader.GFFReader.to_arrow_scanner"></a>
##### to\_arrow\_scanner
def to_arrow_scanner() -> ds.Scanner
Convert the GFF reader to an arrow scanner.