compress-fasttext-0.1.3



Description

A set of tools to compress gensim fasttext models
| Attribute | Value |
| --- | --- |
| Operating system | - |
| File name | compress-fasttext-0.1.3 |
| Name | compress-fasttext |
| Library version | 0.1.3 |
| Maintainer | [] |
| Maintainer email | [] |
| Author | David Dale |
| Author email | dale.david@mail.ru |
| Home page | https://github.com/avidale/compress-fasttext |
| Package URL | https://pypi.org/project/compress-fasttext/ |
| License | MIT |
# Compress-fastText

This Python 3 package lets you compress fastText word embedding models (from the `gensim` package) by orders of magnitude without significantly affecting their quality.

This [blogpost in Russian](https://habr.com/ru/post/489474) and [this one in English](https://towardsdatascience.com/eb212e9919ca) give more details about the motivation and methods for compressing fastText models.

**Note: gensim==4.0.0 has introduced some backward-incompatible changes:**

* With gensim<4.0.0, please use compress-fasttext<=0.0.7 (and optionally Russian models from [our first release](https://github.com/avidale/compress-fasttext/releases/tag/v0.0.1)).
* With gensim>=4.0.0, please use compress-fasttext>=0.1.0 (and optionally Russian or English models from [our 0.1.0 release](https://github.com/avidale/compress-fasttext/releases/tag/gensim-4-draft)).
* Some models are no longer supported in the new version of gensim+compress-fasttext (for example, multiple models from [RusVectores](https://rusvectores.org/ru/models/) that use `compatible_hash=False`).
* For any particular model, compatibility should be determined experimentally. If you notice any strange behaviour, please report it in the GitHub issues.

The package can be installed with `pip`:

```commandline
pip install compress-fasttext[full]
```

If you are not going to perform matrix decomposition or quantization, you can install a lighter variant with fewer dependencies:

```commandline
pip install compress-fasttext
```

### Model compression

You can use this package to compress your own fastText model (or one downloaded, e.g., from [RusVectores](https://rusvectores.org/ru/models/)).

Compress a model in Gensim format:

```python
import gensim
import compress_fasttext

big_model = gensim.models.fasttext.FastTextKeyedVectors.load('path-to-original-model')
small_model = compress_fasttext.prune_ft_freq(big_model, pq=True)
small_model.save('path-to-new-model')
```

Import a model in the original Facebook format and compress it:

```python
from gensim.models.fasttext import load_facebook_model
import compress_fasttext

big_model = load_facebook_model('path-to-original-model').wv
small_model = compress_fasttext.prune_ft_freq(big_model, pq=True)
small_model.save('path-to-new-model')
```

To perform this compression, you will need to `pip install pqkmeans` beforehand, together with a gensim version that matches your compress-fasttext version (see the compatibility note above).

Different compression methods include:

- matrix decomposition (`svd_ft`)
- product quantization (`quantize_ft`)
- optimization of feature hashing (`prune_ft`)
- feature selection (`prune_ft_freq`)

The recommended approach is a combination of feature selection and quantization (`prune_ft_freq` with `pq=True`).

### Model usage

If you just need a tiny fastText model for Russian, you can download [this](https://github.com/avidale/compress-fasttext/releases/download/gensim-4-draft/geowac_tokens_sg_300_5_2020-100K-20K-100.bin) 21-megabyte model. It is a compressed version of the [geowac_tokens_none_fasttextskipgram_300_5_2020](http://vectors.nlpl.eu/repository/20/214.zip) model from [RusVectores](https://rusvectores.org/ru/models/).

If `compress-fasttext` is already installed, you can download and use this tiny model:

```python
import compress_fasttext

small_model = compress_fasttext.models.CompressedFastTextKeyedVectors.load(
    'https://github.com/avidale/compress-fasttext/releases/download/gensim-4-draft/geowac_tokens_sg_300_5_2020-100K-20K-100.bin'
)
print(small_model['спасибо'])
# [ 0.26762889 0.35489027 ... -0.06149674] # a 300-dimensional vector
print(small_model.most_similar('котенок'))
# [('кот', 0.7391024827957153), ('пес', 0.7388300895690918), ('малыш', 0.7280327081680298), ... ]
```

The class `CompressedFastTextKeyedVectors` inherits from `gensim.models.fasttext.FastTextKeyedVectors`, but makes a few additional optimizations.

For English, you can use [this](https://github.com/avidale/compress-fasttext/releases/download/v0.0.4/cc.en.300.compressed.bin) tiny model, obtained by compressing [the model by Facebook](https://fasttext.cc/docs/en/crawl-vectors.html).

```python
import compress_fasttext

small_model = compress_fasttext.models.CompressedFastTextKeyedVectors.load(
    'https://github.com/avidale/compress-fasttext/releases/download/v0.0.4/cc.en.300.compressed.bin'
)
print(small_model['hello'])
# [ 1.84736611e-01 6.32683930e-03 4.43901886e-03 ... -2.88431027e-02] # a 300-dimensional vector
print(small_model.most_similar('Python'))
# [('PHP', 0.5252903699874878), ('.NET', 0.5027452707290649), ('Java', 0.4897131323814392), ... ]
```

More compressed models for 101 languages can be found at https://zenodo.org/record/4905385.

### Example of application

In practical applications, you usually feed fastText embeddings to some other model. The class `FastTextTransformer` uses [the scikit-learn interface](https://scikit-learn.org/stable/data_transforms.html) and represents a text as the average of the embeddings of its words. With it you can, for example, train a classifier on top of fastText to tell edible things from inedible ones:

```python
import compress_fasttext
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from compress_fasttext.feature_extraction import FastTextTransformer

small_model = compress_fasttext.models.CompressedFastTextKeyedVectors.load(
    'https://github.com/avidale/compress-fasttext/releases/download/v0.0.4/cc.en.300.compressed.bin'
)

classifier = make_pipeline(
    FastTextTransformer(model=small_model),
    LogisticRegression()
).fit(
    ['banana', 'soup', 'burger', 'car', 'tree', 'city'],
    [1, 1, 1, 0, 0, 0]
)
classifier.predict(['jet', 'train', 'cake', 'apple'])
# array([0, 0, 1, 1])
```

### Notes

This code is heavily based on the [navec](https://github.com/natasha/navec) package by Alexander Kukushkin and [the blogpost](https://medium.com/@vasnetsov93/shrinking-fasttext-embeddings-so-that-it-fits-google-colab-cd59ab75959e) by Andrey Vasnetsov about shrinking fastText embeddings.
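
To make the averaging idea behind `FastTextTransformer` more concrete, here is a minimal pure-Python sketch. It is not the package's implementation: the toy `EMBEDDINGS` table, the `average_embedding` helper, and the whitespace tokenization are all assumptions made for illustration. A real fastText model can also build vectors for unseen words from character n-grams, whereas this toy version simply skips them.

```python
from typing import Dict, List

# Hypothetical 3-dimensional toy embeddings; a real model has e.g. 300 dimensions.
EMBEDDINGS: Dict[str, List[float]] = {
    "hot": [1.0, 0.0, 2.0],
    "soup": [3.0, 2.0, 0.0],
}


def average_embedding(text: str, table: Dict[str, List[float]], dim: int = 3) -> List[float]:
    """Represent a text as the mean of its known word vectors (zeros if none are known)."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    vectors = [table[word] for word in text.lower().split() if word in table]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) iterates over coordinates, so each column is averaged separately.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


print(average_embedding("hot soup", EMBEDDINGS))  # [2.0, 1.0, 1.0]
```

The resulting fixed-length vector is what a downstream estimator such as `LogisticRegression` would receive as its feature row.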


Installation


Install the compress-fasttext-0.1.3 whl package:

    pip install compress-fasttext-0.1.3.whl


Install the compress-fasttext-0.1.3 tar.gz package:

    pip install compress-fasttext-0.1.3.tar.gz
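
After installing from a local file, it may be worth checking that the installed `gensim` fits the pairing given in the description (gensim>=4.0.0 with compress-fasttext>=0.1.0, gensim<4.0.0 with compress-fasttext<=0.0.7). The following is a rough sketch under stated assumptions: `parse_version` and `versions_match` are hypothetical helpers, not part of either package, and they compare only the leading numeric components of each version string.

```python
import re
from importlib import metadata
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Keep the leading numeric components, padded to three: '4.3.2.post1' -> (4, 3, 2)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = [int(part) for part in match.group().split(".")]
    parts += [0] * (3 - len(parts))  # pad '4' to (4, 0, 0) so tuple comparison behaves
    return tuple(parts)


def versions_match(gensim_version: str, compress_fasttext_version: str) -> bool:
    """Apply the pairing rule from the compatibility note (hypothetical helper)."""
    gensim_v = parse_version(gensim_version)
    package_v = parse_version(compress_fasttext_version)
    if gensim_v >= (4, 0, 0):
        return package_v >= (0, 1, 0)
    return package_v <= (0, 0, 7)


if __name__ == "__main__":
    try:
        ok = versions_match(metadata.version("gensim"), metadata.version("compress-fasttext"))
        print("pairing matches the note" if ok else "pairing not covered by the note")
    except metadata.PackageNotFoundError as error:
        print(f"package not installed: {error}")
```

A passing check only reflects the version rule; as the description says, compatibility of any particular model still has to be determined experimentally.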