<div align="center"><img style="width: 50%" src="assets/logo.png" alt="logo"></div>
---
[![Python package](https://github.com/uniwue-zpd/PAGETools/actions/workflows/python-package.yml/badge.svg?branch=main)](https://github.com/uniwue-zpd/PAGETools/actions/workflows/python-package.yml) [![Upload Python Package](https://github.com/uniwue-zpd/PAGETools/actions/workflows/python-publish.yml/badge.svg?branch=main)](https://github.com/uniwue-zpd/PAGETools/actions/workflows/python-publish.yml)
A small collection of Python scripts for working with [PAGE XML](https://github.com/PRImA-Research-Lab/PAGE-XML), used at the
[Centre for Philology and Digitality (ZPD), University of Würzburg](https://github.com/uniwue-zpd).
## Installing
### Installation using pip
The suggested method is to install `pagetools` into a virtual environment using pip:
```bash
python -m venv VENV_NAME
source VENV_NAME/bin/activate
pip install pagetools
```
To install the package from source, clone this repository and run the following inside the project directory:
```bash
python -m venv VENV_NAME
source VENV_NAME/bin/activate
pip install .
```
## Usage
### Transformations
#### Extraction
```
Usage: pagetools extract [OPTIONS] XMLS...
Extract elements as image (optionally with text) files.
Options:
--include [TextRegion|ImageRegion|LineDrawingRegion|GraphicRegion|TableRegion|ChartRegion|MapRegion|SeparatorRegion|MathsRegion|ChemRegion|MusicRegion|AdvertRegion|NoiseRegion|NoiseRegion|UnknownRegion|CustomRegion|TextLine|*]
PAGE XML element types to extract (highest
priority).
--exclude [TextRegion|ImageRegion|LineDrawingRegion|GraphicRegion|TableRegion|ChartRegion|MapRegion|SeparatorRegion|MathsRegion|ChemRegion|MusicRegion|AdvertRegion|NoiseRegion|NoiseRegion|UnknownRegion|CustomRegion|TextLine|*]
PAGE XML element types to exclude from
extraction (lowest priority).
--no-text Suppresses text extraction.
-ie, --image-extension TEXT Extension of image files. Must be in the
same directory as corresponding XML file.
-o, --output TEXT Path where generated files will get saved.
-e, --enumerate-output Enumerates output file names instead of
using original names.
-z, --zip-output Add generated output to zip archive.
-bg, --background-color INTEGER...
RGB color code used to fill up background.
Used when padding and / or deskewing.
--background-mode [median|mean|dominant]
Color calc mode to fill up background
(overwrites -bg / --background-color).
-p, --padding INTEGER... Padding in pixels around the line image
cutout (top, bottom, left, right).
-ad, --auto-deskew Automatically deskew extracted line images
(Experimental!).
-d, --deskew FLOAT Angle for manual clockwise rotation of the
line images.
-gt, --gt-index INTEGER Index of the TextEquiv elements containing
ground truth.
-pred, --pred-index INTEGER Index of the TextEquiv elements containing
predicted text.
--help Show this message and exit.
```
##### Examples
Only extract `TextLine` elements:
```bash
pagetools extract <Path/to/xml/files>/*.xml -ie <img_extension> -o <Path/to/output/dir> --include TextLine --exclude "*"
```
Keep in mind that `--include` / `--exclude` currently work differently from the same arguments in tools such as `rsync` (due to limitations of the `click` library). Inclusion of an element type always trumps exclusion of the same type, regardless of the order of the options in the call.
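The precedence rule can be pictured with a short Python sketch. This is not the code used by `pagetools`; the function name `should_extract` is hypothetical, and the behaviour for types named by neither option is an assumption.

```python
def should_extract(element_type, include=(), exclude=()):
    """Decide whether an element type is extracted (illustrative only)."""
    # An explicit include always wins, even against a wildcard exclude.
    if element_type in include or "*" in include:
        return True
    # Excludes are only consulted when no include matched.
    if element_type in exclude or "*" in exclude:
        return False
    # Assumption: types named by neither option are extracted.
    return True


# Mirrors the call above: --include TextLine --exclude "*"
selected = [
    t for t in ("TextRegion", "TextLine", "ImageRegion")
    if should_extract(t, include=("TextLine",), exclude=("*",))
]
print(selected)  # ['TextLine']
```

Because the include check runs first, swapping the order of the two options on the command line would not change the result.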
#### line2page
Merges line images and their corresponding text files into page images and PAGE XML files.
```
Usage: pagetools line2page [OPTIONS]
Links line images and corresponding texts in a page and creates a combined
image and XML-File of each page
Options:
-c, --creator TEXT Creator tag for PAGE XML
-s, --source-folder TEXT Path to images and GT [required]
-i, --image-folder TEXT Path to images
-gt, --gt-folder TEXT Path to GT
-d, --dest-folder TEXT Path to merge objects
-e, --ext TEXT Image extension
-p, --pred BOOLEAN Set flag to also store .pred.txt
-l, --lines INTEGER RANGE Lines per page
-ls, --line-spacing INTEGER RANGE
Spacing between lines in pixel
-b, --border INTEGER RANGE... Border in pixel: top bottom left right
--debug [10|20|30|40|50] Sets the level of feedback to receive:
DEBUG=10, INFO=20, WARNING=30, ERROR=40,
CRITICAL=50
--threads INTEGER RANGE Thread count to be used
--xml-schema [17|19] Sets the year of the xml-Schema to be used
--help Show this message and exit.
```
Please note that each image file must share its base name with its ground truth file:
```
foo.nrm.png -> foo.gt.txt (& foo.pred.txt)
bar.bin.png -> bar.gt.txt (& bar.pred.txt)
```
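A quick way to check a folder against this naming scheme is to derive the expected companion file names from each image name. The sketch below assumes that the shared base name is everything before the first dot, which matches the two examples above but is not confirmed for other file names; `companion_names` is a hypothetical helper, not part of `pagetools`.

```python
from pathlib import Path


def companion_names(image_name):
    """Return the expected GT and prediction file names for a line image."""
    # Assumption: the shared base name is everything before the first dot,
    # so "foo.nrm.png" and "foo.gt.txt" both reduce to "foo".
    base = Path(image_name).name.split(".", 1)[0]
    return f"{base}.gt.txt", f"{base}.pred.txt"


print(companion_names("foo.nrm.png"))  # ('foo.gt.txt', 'foo.pred.txt')
print(companion_names("bar.bin.png"))  # ('bar.gt.txt', 'bar.pred.txt')
```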
#### Regularization
```
Usage: pagetools regularize [OPTIONS] XMLS...
Regularize the text content of PAGE XML files using custom rulesets.
Options:
--remove-default [various|quotes|ligatures_consonantal|ligatures_vocal|roman_digits|uvius|punctuation|spaces]
Removes specified default ruleset.
--add-default [various|quotes|ligatures_consonantal|ligatures_vocal|roman_digits|uvius|punctuation|spaces]
Adds specified default ruleset. Overrides
all other default options.
-nd, --no-default Disables all default rulesets.
-r, --rules PATH File(s) which contains serialized ruleset.
-nu, --normalize-unicode [NFC|NFD|NFKC|NFKD]
Normalize unicode for both rules and PAGE
XML tests.
-s, --safe / -us, --unsafe Creates backups of original files before
overwriting.
--help Show this message and exit.
```
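To illustrate what the `-nu` / `--normalize-unicode` option refers to, the following Python sketch normalizes both the rules and the text to the same Unicode form before applying the replacements. The ruleset shown here is a hypothetical list of `(search, replacement)` pairs; the actual file format expected by `--rules` is not covered by this example, and `regularize` is not the function used by `pagetools`.

```python
import unicodedata

# Hypothetical ruleset: (search, replacement) pairs.
RULES = [("ſ", "s"), ("æ", "ae")]


def regularize(text, rules, form="NFC"):
    """Normalize text and rules to one Unicode form, then apply the rules."""
    text = unicodedata.normalize(form, text)
    for search, replacement in rules:
        # Rules are normalized too, so composed and decomposed spellings match.
        search = unicodedata.normalize(form, search)
        replacement = unicodedata.normalize(form, replacement)
        text = text.replace(search, replacement)
    return text


# "u" followed by a combining diaeresis (U+0308) is composed to "ü" under NFC.
print(regularize("Waſſer fu\u0308r", RULES))  # Wasser für
```

Without the shared normalization step, a rule written with a precomposed character would silently miss text that stores the same character in decomposed form.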
#### Change index
```
Usage: pagetools change-index [OPTIONS] XMLS... SOURCE TARGET
Change index on TextEquiv elements.
Options:
-s, --safe / -us, --unsafe Creates backups of original files before
overwriting.
--help Show this message and exit.
```
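Conceptually, changing an index means rewriting the `index` attribute on matching `TextEquiv` elements. The sketch below shows this with the standard library on a minimal, well-formed (not schema-valid) sample that uses the PAGE 2019 namespace. It is an assumption-laden illustration rather than the tool's implementation; in particular, how `pagetools` treats elements that already carry the target index is not shown here.

```python
import xml.etree.ElementTree as ET

NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

# Minimal sample for illustration; a real PAGE XML file has more content.
SAMPLE = f"""<PcGts xmlns="{NS}"><Page imageFilename="p.png" imageWidth="1" imageHeight="1">
<TextRegion id="r0"><TextLine id="l0">
<TextEquiv index="0"><Unicode>ground truth</Unicode></TextEquiv>
<TextEquiv index="1"><Unicode>prediction</Unicode></TextEquiv>
</TextLine></TextRegion></Page></PcGts>"""


def change_index(root, source, target):
    """Rewrite index=source to index=target on all TextEquiv elements."""
    changed = 0
    for text_equiv in root.iter(f"{{{NS}}}TextEquiv"):
        if text_equiv.get("index") == str(source):
            text_equiv.set("index", str(target))
            changed += 1
    return changed


root = ET.fromstring(SAMPLE)
print(change_index(root, 1, 2))  # 1
```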
### Analytics
#### Get Codec
```
Usage: pagetools get-codec [OPTIONS] FILES...
Retrieves codec of PAGE XML files.
Options:
-l, --level [region|line|word|glyph]
-idx, --index INTEGER Considers only text from TextEquiv elements
with a certain index.
-mc, --most-common INTEGER Only prints n most common entries. Shows all
by default.
-o, --output TEXT File to which results are written.
-rw, --remove-whitespace
-of, --output-format [json|csv|txt]
Available result formats.
-freq, --frequencies Outputs character frequencies.
--text-output-newline Inserts new line after every character in
txt output. Only applies when frequencies
aren't output.
--verbose / --silent Choose between verbose or silent output.
--help Show this message and exit.
```
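The idea behind a codec with frequencies can be sketched in a few lines of Python: collect every character that occurs in the text and count it. The function below is a hypothetical stand-in that works on plain strings; it only mirrors the `-rw` and `-mc` options conceptually and does not read PAGE XML.

```python
from collections import Counter


def get_codec(texts, remove_whitespace=False, most_common=None):
    """Count characters over all texts, most frequent first."""
    counts = Counter()
    for text in texts:
        if remove_whitespace:
            text = "".join(ch for ch in text if not ch.isspace())
        counts.update(text)
    # most_common(None) returns all entries, like the CLI default.
    return counts.most_common(most_common)


lines = ["abba", "a c"]
print(get_codec(lines, remove_whitespace=True, most_common=2))  # [('a', 3), ('b', 2)]
```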
#### Get text count
```
Usage: pagetools get-text-count [OPTIONS] FILES...
Returns the amount of text equiv elements in certain elements for certain
indices.
Options:
-e, --element [TextRegion|TextLine|Word]
-i, --index TEXT [required]
-so, --stats-out TEXT Output directory for detailed stats csv
file.
--help Show this message and exit.
```
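As a rough picture of what is being counted, the following Python sketch counts the `TextEquiv` children with a given index below a chosen element type. It uses a minimal, well-formed (not schema-valid) sample with the PAGE 2019 namespace; `count_text_equivs` is a hypothetical helper, and counting only direct children is an assumption about the intended behaviour.

```python
import xml.etree.ElementTree as ET

NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

# Minimal sample for illustration; a real PAGE XML file has more content.
SAMPLE = f"""<PcGts xmlns="{NS}"><Page imageFilename="p.png" imageWidth="1" imageHeight="1">
<TextRegion id="r0">
<TextLine id="l0">
<TextEquiv index="0"><Unicode>first line</Unicode></TextEquiv>
<TextEquiv index="1"><Unicode>frist line</Unicode></TextEquiv>
</TextLine>
<TextLine id="l1">
<TextEquiv index="0"><Unicode>second line</Unicode></TextEquiv>
</TextLine>
</TextRegion></Page></PcGts>"""


def count_text_equivs(root, element="TextLine", index=0):
    """Count TextEquiv children with the given index below `element`."""
    count = 0
    for parent in root.iter(f"{{{NS}}}{element}"):
        # Assumption: only direct TextEquiv children of the element count.
        for text_equiv in parent.findall(f"{{{NS}}}TextEquiv"):
            if text_equiv.get("index") == str(index):
                count += 1
    return count


root = ET.fromstring(SAMPLE)
print(count_text_equivs(root, "TextLine", 0))  # 2
print(count_text_equivs(root, "TextLine", 1))  # 1
```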