<!-- <p align="center">
<img height="200" src="https://cdn-icons-png.flaticon.com/512/2092/2092791.png" alt="ALToolbox" />
</p>
-->
# 🛠 ALToolbox
[PyPI](https://pypi.python.org/pypi/acleto/) | [License](./LICENSE) | [Documentation](https://al-toolbox.readthedocs.io/en/latest/?badge=latest) | ![Tests](reports/junit/tests-badge.svg)
ALToolbox is a framework for practical active learning in NLP.
<hr>
[Installation](#installation) | [Quick Start](#quick_start) | [Overview](#overview) | [Docs](#documentation) | [Citation](#citation)
ALToolbox is a framework for **active learning** annotation in natural language processing. It currently supports text classification and sequence tagging tasks. ALToolbox provides state-of-the-art query strategies, a serverless annotation tool for the Jupyter IDE, and a set of tools that help reduce the computational overhead and duration of AL iterations and increase the reusability of annotated data.
## <a name="installation"></a>вљ™пёЏ Installation
```bash
pip install acleto
```
To annotate instances for active learning in Jupyter Notebook or Jupyter Lab, you need to install an additional widget after installing the framework. For Jupyter Notebook, run:
```bash
jupyter nbextension install --py --symlink --sys-prefix text_selector
jupyter nbextension enable --py --sys-prefix text_selector
```
For Jupyter Lab, run:
```bash
jupyter labextension install js
jupyter labextension install text_selector
```
## <a name="quick_start"></a>рџ’« Quick Start
For a quick start, please see the examples of launching active learning annotation or benchmarking a novel query strategy / unlabeled pool subsampling strategy for sequence tagging and text classification tasks:
| # | Notebook |
|-----|--------------------------------------------------------------------------------------------------------------------------|
| 1 | [Launching Active Learning for Token Classification](jupyterlab_demo/ner_demo.ipynb) |
| 2 | [Launching Active Learning for Text Classification](jupyterlab_demo/cls_demo.ipynb) |
| 3 | [Benchmarking a novel AL query strategy / unlabeled pool subsampling strategy](examples/benchmark_custom_strategy.ipynb) |
## <a name="overview"></a>рџ” Overview
### 1. Query Strategies
| # | Strategy | Citation |
|-----|------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|
| 1 | [ALPS](acleto/al4nlp/query_strategies/alps_sampling.py) | [Citation](https://aclanthology.org/2020.emnlp-main.637/) |
| 2 | [BADGE](acleto/al4nlp/query_strategies/badge_sampling.py) | [Citation](https://openreview.net/forum?id=ryghZJBKPS) |
| 3 | [BAIT](acleto/al4nlp/query_strategies/bait_sampling.py) | [Citation](https://proceedings.neurips.cc/paper/2021/file/4afe044911ed2c247005912512ace23b-Paper.pdf) |
| 4 | [BALD](acleto/al4nlp/query_strategies/bald_sampling.py) | [Citation](https://arxiv.org/abs/1112.5745) |
| 5 | [BatchBALD](acleto/al4nlp/query_strategies/batchbald_sampling.py) | [Citation](https://proceedings.neurips.cc/paper/2019/file/95323660ed2124450caaac2c46b5ed90-Paper.pdf) |
| 6 | [Breaking Ties (BT) (also Maximum Margin)](acleto/al4nlp/query_strategies/breaking_ties_sampling.py) | [Citation](https://ieeexplore.ieee.org/document/1334570) |
| 7 | [Contrastive Active Learning (CAL)](acleto/al4nlp/query_strategies/cal_sampling.py) | [Citation](https://aclanthology.org/2021.emnlp-main.51/) |
| 8 | [Cluster Margin](acleto/al4nlp/query_strategies/cluster_margin_sampling.py) | [Citation](https://arxiv.org/abs/2107.14263) |
| 9 | [Coreset](acleto/al4nlp/query_strategies/coreset_sampling.py) | [Citation](https://openreview.net/forum?id=H1aIuk-RW) |
| 10  | [Expected Gradient Length (EGL)](acleto/al4nlp/query_strategies/egl_sampling.py)                      | [Citation](https://openreview.net/forum?id=ryghZJBKPS) (?)                                             |
| 11 | [Embeddings KM](acleto/al4nlp/query_strategies/embeddings_km_sampling.py) | [Citation](https://aclanthology.org/2020.emnlp-main.637/) |
| 12 | [Entropy](acleto/al4nlp/query_strategies/entropy_sampling.py) | [Citation](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.77.9855&rep=rep1&type=pdf) |
| 13 | [Least Confidence (LC)](acleto/al4nlp/query_strategies/lc_sampling.py) | [Citation](https://arxiv.org/abs/cmp-lg/9407020) |
| 14 | [Mahalanobis Distance](acleto/al4nlp/query_strategies/mahalanobis_sampling.py) | [Citation](https://proceedings.neurips.cc/paper/2018/file/abdeb6f575ac5c6676b747bca8d09cc2-Paper.pdf) |
| 15 | [Maximum Normalized Log-Probability (MNLP)](acleto/al4nlp/query_strategies/mnlp_sampling.py) | [Citation](https://aclanthology.org/W17-2630/) |
| 16 | [Random (No AL)](acleto/al4nlp/query_strategies/random_sampling.py) | - |
### 2. Unlabeled Pool Subsampling Strategies
| # | Strategy | Citation |
|-----|--------------------------------------------------------------------------|--------------------------------------------------------------|
| 1 | [UPS](acleto/al4nlp/pool_subsampling_strategies/ups_subsampling.py) | [Citation](https://aclanthology.org/2022.findings-naacl.90/) |
| 2   | [Naïve](acleto/al4nlp/pool_subsampling_strategies/naive_subsampling.py)  | [Citation](https://aclanthology.org/2022.findings-naacl.90/) |
| 3 | [Random](acleto/al4nlp/pool_subsampling_strategies/random_subsampling.py) | - |
### 3. Pipelines for post-processing of annotated data and preparation of acquisition models
* PLASM postprocessing pipeline for annotated data reusability.
* Acquisition model distillation.
* Domain adaptation of acquisition models.
### 4. GUI annotation tool in the Jupyter IDE
Our framework provides a serverless GUI annotation tool integrated into the Jupyter IDE:

### 5. Extensible benchmark for query strategies
TODO:
## <a name="documentation"></a>рџ“• Documentation
### Usage
The `configs` folder contains config files with general settings, and the `experiments` folder contains config files with experimental designs. To run an experiment with a chosen configuration, specify the config file name in the `HYDRA_CONFIG_NAME` variable and run the `train.sh` script (see `./examples/al` for details).
For example, to launch PLASM on AG-News with ELECTRA as the successor model:
```bash
cd PATH_TO_THIS_REPO
HYDRA_CONFIG_PATH=../experiments/ag_news HYDRA_EXP_CONFIG_NAME=ag_plasm python active_learning/run_tasks_on_multiple_gpus.py
```
### Config structure explanation
- `cuda_devices`: list of CUDA devices to use, one experiment per CUDA device. `cuda_devices=[0,1]` means using the zeroth and first devices.
- `config_name`: name of config from **configs** folder with general settings: dataset, experiment setting (e.g. LC/ASM/PLASM), model checkpoints, hyperparameters etc.
- `config_path`: path to config with general settings.
- `command`: **.py** file to run. For AL experiments, use **run_active_learning.py**.
- `args`: arguments to modify from a general config in the current experiment. `acquisition_model.name=xlnet-base-cased` means that _xlnet-base-cased_ will be used as an acquisition model.
- `seeds`: random seeds to use. `seeds=[4837, 23419]` means that two separate experiments with the same settings (except for **seed**) will be run: one with **seed == 4837**, one with **seed == 23419**.
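
For illustration, a hypothetical experiment config combining the keys above might look as follows (the values are placeholders, not a recommended setup):

```yaml
cuda_devices: [0, 1]              # run on the zeroth and first devices
config_name: ag_news_lc           # general config from the configs folder
config_path: ../configs
command: run_active_learning.py   # the .py file to run for AL experiments
args: acquisition_model.name=xlnet-base-cased
seeds: [4837, 23419]              # two runs, one per seed
```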
### Output Explanation
By default, the results will be saved to the folder `RUN_DIRECTORY/workdir/run_active_learning/${DATE_OF_RUN}/${TIME_OF_RUN}_${SEED}_${MODEL_CHECKPOINT}`. For instance, when launching from the repository folder: `al_nlp_feasible/workdir/run_active_learning/2022-06-11/15-59-31_23419_distilbert_base_uncased_bert_base_uncased`.
- When running a classic AL experiment (acquisition and successor models coincide, regardless of using UPS), the file with the model metrics is `acquisition_metrics.json`.
- When running an acquisition-successor mismatch experiment, the file with the model metrics is `successor_metrics.json`.
- When running a PLASM experiment, the file with the model metrics is `target_tracin_quantile_-1.0_metrics.json` (**-1.0** stands for the filtering value, meaning an adaptive filtering rate; when using a deterministic filtering rate (e.g. **0.1**), the file will be named `target_tracin_quantile_0.1_metrics.json`). The file with the metrics of the model **without filtering** is `target_metrics.json`.
### Post-processing
Our framework provides tools for effective post-processing of annotated data, which improve its reusability and make it possible to build powerful models on it.
PLASM, which aims to alleviate the acquisition-successor mismatch problem and allows building a model of an arbitrary type on the labeled data without performance degradation, is implemented in `post_processing/pipeline_plasm`.
It uses the `cls_plasm` / `ner_plasm` configs (from `jupyterlab_demo/configs`). A brief explanation of the config structure:
- pseudo-labeling model parameters are contained in the key `labeling_model`;
- successor model parameters are contained in the key `successor_model`;
- post-processing options are contained in the key `post_processing`:
  - `label_smoothing`: str / float / None, a parameter for label smoothing (LS) of the pseudo-labeled instances. Accepts several options:
    - "adaptive": the LS value equals the quality of the labeling model on the validation data;
    - float, 0 < value < 1: absolute value of label smoothing;
    - None (default): no label smoothing is used.
  - `labeled_weight`: int / float, 1 < value < +inf, weight of the human-labeled data.
  - `use_subsample_for_pl`: int / float / None, the size of the subsample used for pseudo-labeling (a float means taking this share of the unlabeled data). None means that no subsampling is used.
  - `uncertainty_threshold`: float / None, the threshold for filtering by uncertainty. If None, no filtering by uncertainty is used.
  - `filter_by_quantile`: bool, only used for classification; ignored if `uncertainty_threshold` is None. If True, the `uncertainty_threshold` share of the most uncertain instances is filtered. Otherwise, all instances with (1 - max_prob) < `uncertainty_threshold` are filtered.
  - `tracin`:
    - `use`: bool, whether to use TracIn for filtering;
    - `max_num_processes`: int, value > 0, maximum number of processes per GPU;
    - `quantile`: str / float (0 < value < 1), share of the unlabeled data instances to filter using the TracIn score;
    - `num_model_checkpoints`: int, value > 0, how many model checkpoints to save and use for TracIn;
    - `nu`: float / int, value of the nu hyperparameter of the TracIn algorithm.
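
For reference, a hypothetical `post_processing` block combining these keys might look like this (the values are placeholders):

```yaml
post_processing:
  label_smoothing: adaptive    # or a float in (0, 1), or null to disable
  labeled_weight: 2            # up-weight the human-labeled data
  use_subsample_for_pl: 0.5    # pseudo-label half of the unlabeled pool
  uncertainty_threshold: 0.1
  filter_by_quantile: true     # classification only
  tracin:
    use: true
    max_num_processes: 2
    quantile: 0.1
    num_model_checkpoints: 3
    nu: 1
```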
### 🆕️ Adding new query strategies
An AL query strategy should be designed as a function that:
1) Receives 3 positional arguments and additional strategy kwargs:
- `model` of a class inherited from `TransformersBaseWrapper`, `PytorchEncoderWrapper`, or `FlairModelWrapper`: the model wrapper;
- `X_pool` of class `Dataset` or `TransformersDataset`: dataset with the unlabeled instances;
- `n_instances` of class `int`: number of instances to query;
- `kwargs`: additional strategy-specific arguments.
2) Outputs 3 objects in the following order:
- `query_idx` of class `array-like`: array with the indices of the queried instances;
- `query` of class `Dataset` or `TransformersDataset`: dataset with the queried instances;
- `uncertainty_estimates` of class `np.ndarray`: uncertainty estimates of the instances from `X_pool`; the higher the value, the more uncertain the model is about the instance.
The function implementing the strategy must have the same name as the file it is placed in (e.g. a function `def my_strategy` inside the file `path_to_strategy/my_strategy.py`).
To use your strategy, set `al.strategy=PATH_TO_FILE_YOUR_STRATEGY` in the experiment config.
A full example is presented in `examples/benchmark_custom_strategy.ipynb`; a minimal sketch is given below.
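
For illustration, a minimal least-confidence-style strategy that follows this contract might look as follows. The `predict_proba` call is an assumption about the wrapper API (consult the wrapper classes for the actual prediction method); `select` is available on Hugging Face `datasets.Dataset` objects:

```python
import numpy as np

def my_strategy(model, X_pool, n_instances, **kwargs):
    # Hypothetical wrapper call: replace with the actual prediction method
    # of your model wrapper; expected shape is [n_pool, n_classes].
    probas = model.predict_proba(X_pool)
    # Least confidence: the lower the top class probability, the more uncertain.
    uncertainty_estimates = 1.0 - probas.max(axis=-1)
    # Indices of the n_instances most uncertain instances.
    query_idx = np.argsort(-uncertainty_estimates)[:n_instances]
    query = X_pool.select(query_idx)
    return query_idx, query, uncertainty_estimates
```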
### 🆕️ Adding new pool subsampling strategies
Adding a new pool subsampling strategy is similar to adding an AL query strategy. A subsampling strategy should be designed as a function that:
1) Receives 2 positional arguments and additional subsampling strategy kwargs:
- `uncertainty_estimates` of class `np.ndarray`: uncertainty estimates of the instances in the order they are stored in the unlabeled data;
- `gamma_or_k_confident_to_save` of class `float` or `int`: either a share / number of instances to save (as in random / naive subsampling) or an internal parameter (as in UPS);
- `kwargs`: additional subsampling strategy specific arguments.
2) Outputs the indices of the instances to use (sampled indices) as an `np.ndarray`.
The function implementing the subsampling strategy must have the same name as the file it is placed in (e.g. a function `def my_subsampling_strategy` inside the file `path_to_strategy/my_subsampling_strategy.py`).
To use your subsampling strategy, set `al.sampling_type=PATH_TO_FILE_YOUR_SUBSAMPLING_STRATEGY` in the experiment config.
A full example is presented in `examples/benchmark_custom_strategy.ipynb`; a minimal sketch is given below.
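
A minimal sketch under these assumptions (the keep-the-most-uncertain rule is purely illustrative, not the UPS algorithm):

```python
import numpy as np

def my_subsampling_strategy(uncertainty_estimates, gamma_or_k_confident_to_save, **kwargs):
    n_total = len(uncertainty_estimates)
    # Interpret a float as a share of the pool and an int as an absolute count.
    if isinstance(gamma_or_k_confident_to_save, float):
        n_keep = max(1, round(gamma_or_k_confident_to_save * n_total))
    else:
        n_keep = min(gamma_or_k_confident_to_save, n_total)
    # Keep the n_keep most uncertain instances; return their indices in pool order.
    sampled_idx = np.argsort(-np.asarray(uncertainty_estimates))[:n_keep]
    return np.sort(sampled_idx)
```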
### Datasets
The research employed two Token Classification datasets (CoNLL-2003, OntoNotes-2012) and two Text Classification datasets (AG-News, IMDB). To launch an experiment on a custom dataset, add it in one of the following ways (a config-override sketch follows the list):
1) Upload it to [Hugging Face datasets](https://huggingface.co/datasets) and set: `config.data.path=datasets, config.data.dataset_name=DATASET_NAME, config.data.text_name=COLUMN_WITH_TEXT_OR_TOKENS_NAME, config.data.label_name=COLUMN_WITH_LABELS_OR_NER_TAGS_NAME`
2) Upload it to the **data/DATASET_NAME** folder, create a **train.csv** / **train.json** file with the dataset, and set: `config.data.path=PATH_TO_THIS_REPO/data, config.data.dataset_name=DATASET_NAME, config.data.text_name=COLUMN_WITH_TEXT_OR_TOKENS_NAME, config.data.label_name=COLUMN_WITH_LABELS_OR_NER_TAGS_NAME`
3) \* Upload **train.txt**, **dev.txt**, and **test.txt** files to **data/DATASET_NAME** and set the arguments as in the previous point.
4) \*\* Upload to **data/DATASET_NAME** one folder per class, where each file in a folder contains a text labeled with the name of the folder. For details, please see the **bbc_news** dataset in **./data**. The arguments must be set as in the previous two points.
\* - only for Token Classification datasets
\*\* - only for Text Classification datasets
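For instance, the overrides for option 1 with a hypothetical Hugging Face dataset whose text and label columns are named `text` and `label` could look like this:

```
config.data.path=datasets
config.data.dataset_name=my_dataset
config.data.text_name=text
config.data.label_name=label
```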
### Models
The current version of the repository supports all models from [HuggingFace Transformers](https://huggingface.co/models) that can be used with the `AutoModelForSequenceClassification` / `AutoModelForTokenClassification` classes (for Text / Token Classification, respectively). For CNN-based / BiLSTM-CRF models, please see the **al_cls_cnn.yaml** / **al_ner_bilstm_crf_flair.yaml** configs in the **./configs** folder for details.
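
For example, any checkpoint that loads with these Auto classes should work; a quick compatibility check (the checkpoint name and label count below are placeholders):

```python
from transformers import AutoModelForSequenceClassification

# If this call succeeds for your checkpoint, the model can be used for
# text classification within the framework.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=4
)
```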
### Testing
By default, the tests run on the `cuda:0` device if CUDA is available, or on the CPU otherwise. To manually specify the device for running the tests:
- On CPU: `CUDA_VISIBLE_DEVICES="" python -m pytest PATH_TO_REPO/tests`;
- On CUDA: `CUDA_VISIBLE_DEVICES="DEVICE_OR_DEVICES_NUMBER" python -m pytest PATH_TO_REPO/tests`.
We recommend using the CPU for the robustness of the results. The CUDA tests were written for a **Tesla V100-SXM3 32GB, CUDA V.10.1.243**.
## 👯 Alternatives
[FAMIE](https://github.com/nlp-uoregon/famie), [Small-Text](https://github.com/webis-de/small-text), [modAL](https://github.com/modAL-python/modAL), [ALiPy](https://github.com/NUAA-AL/ALiPy), [libact](https://github.com/ntucllab/libact)
## <a name="citation"></a>рџ’¬ Citation
```
```
## 📄 License
© 2022 Autonomous Non-Profit Organization "Artificial Intelligence Research Institute" (AIRI). All rights reserved.
Licensed under the [MIT License](LICENSE).