

embedded-topic-model-1.0.2



Description

A package to run embedded topic modelling
| Feature | Value |
| --- | --- |
| Operating system | - |
| Filename | embedded-topic-model-1.0.2 |
| Name | embedded-topic-model |
| Version | 1.0.2 |
| Maintainer | [] |
| Maintainer email | [] |
| Author | Luiz F. Matos |
| Author email | lfmatosmelo@id.uff.br |
| Homepage | https://github.com/lffloyd/embedded-topic-model |
| PyPI URL | https://pypi.org/project/embedded-topic-model/ |
| License | MIT license |
# Embedded Topic Model

[![PyPI version](https://badge.fury.io/py/embedded-topic-model.svg)](https://badge.fury.io/py/embedded-topic-model) [![Actions Status](https://github.com/lffloyd/embedded-topic-model/workflows/Python%20package/badge.svg)](https://github.com/lffloyd/embedded-topic-model/actions) [![License](http://img.shields.io/badge/license-MIT-blue.svg?style=flat)](https://github.com/lffloyd/embedded-topic-model/blob/main/LICENSE)

This package was made to easily run embedded topic modelling on a given corpus. ETM is a topic model that marries the probabilistic topic modelling of Latent Dirichlet Allocation with the contextual information brought by word embeddings, most specifically word2vec. ETM models topics as points in the word embedding space, arranging together topics and words with similar context. As such, ETM can either learn word embeddings alongside topics, or be given pretrained embeddings to discover the topic patterns in the corpus.

ETM was originally published by Adji B. Dieng, Francisco J. R. Ruiz, and David M. Blei in an article titled ["Topic Modeling in Embedding Spaces"](https://arxiv.org/abs/1907.04907) in 2019. This code is an adaptation of the [original](https://github.com/adjidieng/ETM) provided with the article. Most of the original code was kept here, with some changes here and there, mostly for ease of use. With the tools provided here, you can run ETM on your dataset in a few simple steps.

# Installation

You can install the package using ```pip``` by running:

```pip install -U embedded_topic_model```

# Usage

To use ETM on your corpus, you must first preprocess the documents into a format understandable by the model. This package has a quick-use preprocessing script. The only requirement is that the corpus must consist of a list of strings, where each string corresponds to a document in the corpus.

You can preprocess your corpus as follows:

```python
from embedded_topic_model.utils import preprocessing
import json

# Loading a dataset in JSON format. As said, documents must be composed by string sentences
corpus_file = 'datasets/example_dataset.json'
documents_raw = json.load(open(corpus_file, 'r'))
documents = [document['body'] for document in documents_raw]

# Preprocessing the dataset
vocabulary, train_dataset, _ = preprocessing.create_etm_datasets(
    documents,
    min_df=0.01,
    max_df=0.75,
    train_size=0.85,
)
```

Then, you can train word2vec embeddings to use with the ETM model. This step is optional: if you're not interested in training your own embeddings, you can either pass a pretrained word2vec embeddings file to ETM or let ETM learn the embeddings itself. If you want ETM to learn its word embeddings, just pass ```train_embeddings=True``` as an instance parameter.

To pretrain the embeddings, you can do the following:

```python
from embedded_topic_model.utils import embedding

# Training word2vec embeddings
embeddings_mapping = embedding.create_word2vec_embedding_from_dataset(documents)
```

To create and fit the model using the training data, execute:

```python
from embedded_topic_model.models.etm import ETM

# Training an ETM instance
etm_instance = ETM(
    vocabulary,
    embeddings=embeddings_mapping,  # You can pass here the path to a word2vec file or
                                    # a KeyedVectors instance
    num_topics=8,
    epochs=300,
    debug_mode=True,
    train_embeddings=False,  # Optional. If True, ETM will learn word embeddings jointly with
                             # topic embeddings. By default, this is False. If the 'embeddings'
                             # argument is being passed, this argument must not be True
)

etm_instance.fit(train_dataset)
```

Also, to obtain the topics, topic coherence, or topic diversity of the model, you can do as follows:

```python
topics = etm_instance.get_topics(20)
topic_coherence = etm_instance.get_topic_coherence()
topic_diversity = etm_instance.get_topic_diversity()
```

# Citation

To cite ETM, use the original article's citation:

```
@article{dieng2019topic,
    title = {Topic modeling in embedding spaces},
    author = {Dieng, Adji B and Ruiz, Francisco J R and Blei, David M},
    journal = {arXiv preprint arXiv:1907.04907},
    year = {2019}
}
```

# Acknowledgements

Credits go to Adji B. Dieng, Francisco J. R. Ruiz, and David M. Blei for the original work.

# License

Licensed under the [MIT](LICENSE) license.

# Changelog

This changelog was inspired by the [keep-a-changelog](https://github.com/olivierlacan/keep-a-changelog) project and follows [semantic versioning](https://semver.org).

## [1.0.2] - 2021-06-23

### Changed

- deactivates debug mode by default
- documents get_most_similar_words method

## [1.0.1] - 2021-02-15

### Changed

- optimizes original word2vec TXT file input for model training
- updates README.md

## [1.0.0] - 2021-02-15

### Added

- adds support for original word2vec pretrained embeddings files in both formats (BIN/TXT)

### Changed

- optimizes handling of gensim's word2vec mapping file for better memory usage

## [0.1.1] - 2021-02-01

### Added

- support for python 3.6

## [0.1.0] - 2021-02-01

### Added

- ETM training with partially tested support for original ETM features
- ETM corpus preprocessing scripts, including word2vec embeddings training, adapted from the original code
- adds methods to retrieve document-topic and topic-word probability distributions from the trained model
- adds docstrings for tested API methods
- adds unit and integration tests for ETM and preprocessing scripts
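The usage section above shows fitting with pretrained embeddings; for the alternative path it mentions (passing ```train_embeddings=True```), here is a minimal sketch. It reuses `vocabulary` and `train_dataset` from the preprocessing step, and the variable name `etm_joint` is illustrative, not part of the package:

```python
from embedded_topic_model.models.etm import ETM

# Sketch: let ETM learn word embeddings jointly with topic embeddings.
# No 'embeddings' argument is passed in this mode, per the README note above.
etm_joint = ETM(
    vocabulary,
    num_topics=8,
    epochs=300,
    train_embeddings=True,
)
etm_joint.fit(train_dataset)

# Topics can then be inspected exactly as in the pretrained-embeddings example
topics = etm_joint.get_topics(20)
```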


Requirements

| Name | Value |
| --- | --- |
| gensim | ==3.8.3 |
| nltk | ==3.5 |
| numpy | ==1.19.5 |
| scikit-learn | ==0.23.2 |
| scipy | ==1.5.2 |
| torch | ==1.6.0 |
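For reference, the pins above can be written as a requirements file. This is a sketch assembled from the table, not a file shipped with the package:

```
gensim==3.8.3
nltk==3.5
numpy==1.19.5
scikit-learn==0.23.2
scipy==1.5.2
torch==1.6.0
```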


Required language

| Name | Value |
| --- | --- |
| Python | >=3.6 |


How to install


Install the whl package embedded-topic-model-1.0.2:

    pip install embedded-topic-model-1.0.2.whl


Install the tar.gz package embedded-topic-model-1.0.2:

    pip install embedded-topic-model-1.0.2.tar.gz
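
After installing, you can verify that the package resolves. A minimal check, not part of the package itself; `pkg_resources` ships with setuptools and works on the Python >=3.6 requirement above:

```python
# Sanity check: the top-level module imports and the installed version matches.
import pkg_resources
from embedded_topic_model.models.etm import ETM  # should import without errors

print(pkg_resources.get_distribution('embedded-topic-model').version)  # e.g. 1.0.2
```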