

chariot-0.5.6



Description

Deliver the ready-to-train data to your NLP model.
| Feature | Value |
|---|---|
| Operating system | - |
| File name | chariot-0.5.6 |
| Name | chariot |
| Library version | 0.5.6 |
| Maintainer | [] |
| Maintainer email | [] |
| Author | icoxfog417 |
| Author email | icoxfog417@yahoo.co.jp |
| Homepage | https://github.com/chakki-works/chariot |
| URL | https://pypi.org/project/chariot/ |
| License | Apache License 2.0 |
# chariot

[![PyPI version](https://badge.fury.io/py/chariot.svg)](https://badge.fury.io/py/chariot) [![Build Status](https://travis-ci.org/chakki-works/chariot.svg?branch=master)](https://travis-ci.org/chakki-works/chariot) [![codecov](https://codecov.io/gh/chakki-works/chariot/branch/master/graph/badge.svg)](https://codecov.io/gh/chakki-works/chariot)

**Deliver the ready-to-train data to your NLP model.**

* Prepare Dataset
  * You can prepare typical NLP datasets through [chazutsu](https://github.com/chakki-works/chazutsu).
* Build & Run Preprocess
  * You can build the preprocessing pipeline like a [scikit-learn Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).
  * Preprocessing for each dataset column is executed in parallel by [Joblib](https://pythonhosted.org/joblib/index.html).
  * Multi-language text tokenization is supported by [spaCy](https://spacy.io/).
* Format Batch
  * Sample a batch from the preprocessed dataset and format it to train the model (padding etc.).
  * You can use pre-trained word vectors through [chakin](https://github.com/chakki-works/chakin).

**chariot** enables you to concentrate on training your model!

![chariot flow](./docs/images/chariot_feature.gif)

## Install

```
pip install chariot
```

## Prepare dataset

You can download various datasets by using [chazutsu](https://github.com/chakki-works/chazutsu).

```py
import chazutsu
from chariot.storage import Storage

storage = Storage("your/data/root")
r = chazutsu.datasets.MovieReview.polarity().download(storage.path("raw"))
df = storage.chazutsu(r.root).data()
df.head(5)
```

Then:

```
  polarity                                             review
0        0  synopsis : an aging master art thief , his sup...
1        0  plot : a separated , glamorous , hollywood cou...
2        0  a friend invites you to a movie . this film wo...
```

The `Storage` class manages a directory structure that follows [cookiecutter-data-science](https://drivendata.github.io/cookiecutter-data-science/).

```
Project root
└── data
    ├── external   <- Data from third party sources (ex. word vectors).
    ├── interim    <- Intermediate data that has been transformed.
    ├── processed  <- The final, canonical datasets for modeling.
    └── raw        <- The original, immutable data dump.
```

## Build & Run Preprocess

### Build a preprocess pipeline

All preprocessors are defined in `chariot.transformer`. Transformers are implemented by extending the [scikit-learn `Transformer`](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html) interface, so the API will feel familiar, and you can mix in [scikit-learn's preprocessors](https://scikit-learn.org/stable/modules/preprocessing.html).

```py
import chariot.transformer as ct
from chariot.preprocessor import Preprocessor

preprocessor = Preprocessor()
preprocessor\
    .stack(ct.text.UnicodeNormalizer())\
    .stack(ct.Tokenizer("en"))\
    .stack(ct.token.StopwordFilter("en"))\
    .stack(ct.Vocabulary(min_df=5, max_df=0.5))\
    .fit(train_data)

preprocessor.save("my_preprocessor.pkl")
loaded = Preprocessor.load("my_preprocessor.pkl")
```

Six types of transformers are prepared in chariot (a combined sketch follows the list):

* TextPreprocessor
  * Preprocess the text before tokenization.
  * `TextNormalizer`: Normalize text (replace some characters etc.).
  * `TextFilter`: Filter the text (delete some spans in the text etc.).
* Tokenizer
  * Tokenize the texts.
  * Powered by [spaCy](https://spacy.io/); for Japanese you can choose [MeCab](https://github.com/taku910/mecab) or [Janome](https://github.com/mocobeta/janome).
* TokenPreprocessor
  * Normalize/filter the tokens after tokenization.
  * `TokenNormalizer`: Normalize tokens (to lower case, to original form etc.).
  * `TokenFilter`: Filter tokens (extract only nouns etc.).
* Vocabulary
  * Make the vocabulary and convert tokens to indices.
* Formatter
  * Format (preprocessed) data for training your model.
* Generator
  * Generate target data to train your (language) model.
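To make the mapping between categories and modules concrete, here is a minimal sketch that stacks one transformer per category. It reuses only classes that appear elsewhere in this README; the toy `train_data` corpus is an assumption for illustration.

```py
import chariot.transformer as ct
from chariot.preprocessor import Preprocessor

# Category -> concrete class (all taken from examples in this README):
#   TextPreprocessor  -> ct.text.UnicodeNormalizer
#   Tokenizer         -> ct.Tokenizer("en")
#   TokenPreprocessor -> ct.token.StopwordFilter("en")
#   Vocabulary        -> ct.Vocabulary
train_data = ["A tiny toy corpus .", "Another toy sentence ."]

reference = Preprocessor()
reference\
    .stack(ct.text.UnicodeNormalizer())\
    .stack(ct.Tokenizer("en"))\
    .stack(ct.token.StopwordFilter("en"))\
    .stack(ct.Vocabulary(min_df=1, max_df=1.0))\
    .fit(train_data)
```

Formatter transformers such as `Padding` and `CategoricalLabel` appear in the per-column example below.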
### Build a preprocess for dataset

When you want to apply a preprocess to each column of your dataset, you can use `DatasetPreprocessor`.

```py
from chariot.dataset_preprocessor import DatasetPreprocessor
from chariot.transformer.formatter import Padding

pad_length = 100  # example maximum sequence length

dp = DatasetPreprocessor()
dp.process("review")\
    .by(ct.text.UnicodeNormalizer())\
    .by(ct.Tokenizer("en"))\
    .by(ct.token.StopwordFilter("en"))\
    .by(ct.Vocabulary(min_df=5, max_df=0.5))\
    .by(Padding(length=pad_length))\
    .fit(train_data["review"])
dp.process("polarity")\
    .by(ct.formatter.CategoricalLabel(num_class=3))

preprocessed = dp.preprocess(data)

# DatasetPreprocessor has multiple preprocessors.
# Because of this, the save file format is `tar.gz`.
dp.save("my_dataset_preprocessor.tar.gz")

loaded = DatasetPreprocessor.load("my_dataset_preprocessor.tar.gz")
```

## Train your model with chariot

`chariot` has features to train your model.

```py
formatted = dp(train_data).preprocess().format().processed
model.fit(formatted["review"], formatted["polarity"], batch_size=32,
          validation_split=0.2, epochs=15, verbose=2)
```

```py
for batch in dp(train_data).preprocess().iterate(batch_size=32, epoch=10):
    model.train_on_batch(batch["review"], batch["polarity"])
```

You can use pre-trained word vectors via [chakin](https://github.com/chakki-works/chakin).

```py
from chariot.storage import Storage
from chariot.transformer.vocabulary import Vocabulary

# Download word vector
storage = Storage("your/data/root")
storage.chakin(name="GloVe.6B.50d")

# Make embedding matrix
vocab = Vocabulary()
vocab.set(["you", "loaded", "word", "vector", "now"])
embed = vocab.make_embedding(storage.path("external/glove.6B.50d.txt"))

print(embed.shape)  # (len(vocab.count), 50)
```
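The resulting `embed` matrix can then initialize a model's embedding layer. Below is a minimal sketch in the Keras style that the `model.fit` call above implies; the `Embedding` layer and its `weights` argument are tf.keras (2.x-era) API, not part of chariot.

```py
from tensorflow.keras.layers import Embedding

# Initialize the layer with the pre-trained matrix and freeze it.
# embed.shape is (vocabulary size, 50) for GloVe.6B.50d.
embedding_layer = Embedding(
    input_dim=embed.shape[0],
    output_dim=embed.shape[1],
    weights=[embed],
    trainable=False)
```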


How to install


Installing the chariot-0.5.6 whl package:

    pip install chariot-0.5.6.whl


Installing the chariot-0.5.6 tar.gz package:

    pip install chariot-0.5.6.tar.gz
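
After installation, a quick smoke test from a Python shell confirms the package imports correctly (these are the same imports used in the README above):

    import chariot.transformer as ct
    from chariot.preprocessor import Preprocessor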