Description

A Python package for exploratory analysis of text data

Attribute	Value
Operating system	OS Independent
File name	arabica-1.0.2
Name	arabica
Library version	1.0.2
Maintainer	[]
Maintainer email	[]
Author	Petr Koráb
Author email	Petr Korab <xpetrkorab@gmail.com>
Home page	https://github.com/PetrKorab/Arabica
URL	https://pypi.org/project/arabica/
License	-

# Arabica

**A Python package for exploratory analysis of text data**

Text data is often recorded as a time series with significant variability over time. Examples of time-series text data include Twitter tweets, product reviews, and newspaper headlines. Arabica provides functions that make the exploratory analysis of such datasets simple.

Arabica provides the following method:

* **arabica_freq**: calculates unigram, bigram, and trigram frequencies over a period (year, month, day)

It can apply all or a selected combination of the following cleaning operations:

* Remove digits from the text
* Remove punctuation from the text
* Remove a standard list of stopwords
* Remove an additional specific list of words

`arabica` uses `clean-text` for punctuation cleaning and the `nltk` corpus of stopwords.

Arabica works with **texts** in languages based on the Latin alphabet and enables stopword removal for languages in the nltk corpus of stopwords.

It reads **dates** in standard date and datetime formats (e.g., 2013-12-31, 2013/12/31, 09-Feb-2009, 2013-12-31 11:46:17, 09/02/2009 09:26). It is preferable to use US-style dates (MM/DD/YYYY) rather than European-style dates (DD/MM/YYYY), since months and days might be mismatched in small datasets.

## Installation

Arabica requires [Python 3](https://www.python.org/downloads/), [NLTK](http://www.nltk.org/install.html), [clean-text](https://pypi.org/project/cleantext/#description), and [numpy](https://pypi.org/project/numpy/) to execute.

To install using pip, use:

`pip install arabica`

## Usage

* **Import the library**:

``` python
from arabica import arabica_freq
```

* **Choose a method:**

**arabica_freq** returns a dataframe with aggregated unigram, bigram, and trigram frequencies over a period.
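To make "aggregated n-gram frequencies over a period" concrete, here is a minimal standard-library sketch. It is not Arabica's implementation: `ngrams` and `ngram_freq_by_month` are hypothetical helpers, the sample texts are made up, and tokenization is a plain lowercase whitespace split with no stopword or punctuation handling.

```python
from collections import Counter, defaultdict
from datetime import datetime


def ngrams(tokens, n):
    """Return all n-grams in a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_freq_by_month(texts, dates, n=1, date_format="%Y-%m-%d"):
    """Count n-grams per calendar month (hypothetical helper, not part of arabica)."""
    counts = defaultdict(Counter)
    for text, date in zip(texts, dates):
        # Map each date to its "YYYY-MM" aggregation period.
        period = datetime.strptime(date, date_format).strftime("%Y-%m")
        tokens = text.lower().split()
        counts[period].update(ngrams(tokens, n))
    return dict(counts)


# Made-up sample data, for illustration only.
texts = ["great service great price", "slow delivery", "great service again"]
dates = ["2013-08-08", "2013-08-20", "2013-09-08"]

bigrams = ngram_freq_by_month(texts, dates, n=2)
for period, counter in sorted(bigrams.items()):
    print(period, counter.most_common(2))
```

Calling the helper with `n=1` or `n=3` gives the unigram or trigram counts for the same periods.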
To remove stopwords, select an aggregation period, and choose a specific set of cleaning operations:

``` python
def arabica_freq(text: str,                # Text
                 time: str,                # Time
                 stopwords: [],            # Languages for stop words
                 skip: [],                 # Strings to be skipped
                 punct: bool = False,      # Remove all punctuation
                 lower_case: bool = False, # Make all text lowercase before n-gram calculation
                 max_words: int ='',       # Max number for unigrams, bigrams and trigrams displayed
                 time_freq: str ='',       # Aggregation period: 'Y'/'M'/'D', if no aggregation: 'ungroup'
                 numbers: bool = False     # Remove all digits
                 )
```

A list of available languages for stopwords is printed with:

``` python
from nltk.corpus import stopwords
print(stopwords.fileids())
```

It is possible to remove several sets of stopwords at once with `stopwords = ['language 1', 'language2','etc..']`

## Examples

### Time-series n-gram analysis

Returns a table with unigram, bigram, and trigram frequencies over a period of time.

``` python
import pandas as pd
from arabica import arabica_freq
```

``` python
data = pd.DataFrame({'text': ['The ordering process was very easy & straight forward. They have great customer service and sorted any issues out very quickly.',
                              'So far seems to be the wrong product for me :-/ grrrrr...',
                              'Excellent, service, thank you really, really, really much!!!'],
                     'time': ['2013-08-8', '2013-09-8','2014-10-8']})
```

``` python
arabica_freq(text = data['text'],
             time = data['time'],
             time_freq = 'M',           # Calculates monthly n-gram frequencies
             max_words = 2,             # Displays only the first two most frequent unigrams, bigrams, and trigrams
             stopwords = ['english'],   # Removes English set of stopwords
             skip = ['grrrrr'],         # Excludes string from n-gram calculation
             numbers = True,            # Removes numbers
             punct = True,              # Removes punctuation
             lower_case = True)         # Makes all text lowercase before n-gram calculation
```

### Descriptive n-gram analysis

Returns unigram, bigram, and trigram frequencies without period aggregation.
``` python
arabica_freq(text = data['text'],
             time = data['time'],
             time_freq = 'ungroup',     # No aggregation made
             max_words = 2,
             stopwords = ['english'],
             skip = ['grrrrr'],
             numbers = True,
             punct = True,
             lower_case = True)
```

## Tutorial

For more coding examples, read these tutorials:

**Text as Time Series: Arabica 1.0.0 Brings New Features for Exploratory Text Data Analysis** [here](https://towardsdatascience.com/text-as-time-series-arabica-1-0-brings-new-features-for-exploratory-text-data-analysis-88eaabb84deb?sk=229ec0602d0b8514f25bce501ed9ecb9)

**Arabica: A Python Package for Exploratory Analysis of Text Data** [here](https://towardsdatascience.com/arabica-a-python-package-for-exploratory-analysis-of-text-data-3bb8d7379bd7?sk=cc91cabb56d44e0f285825d9a666b064)

## License

##### MIT

For any questions, issues, bugs, and suggestions, please visit [here](https://github.com/PetrKorab/arabica/issues).

Requirements

Name	Value
pandas	-
nltk	>3.6.1
numpy	>=1.23.1
regex	-
cleantext	>=1.1.4
ipython	>=7.26.0

Required language

Name	Value
Python	>=3.7

How to install

Install the arabica-1.0.2 whl package:

pip install arabica-1.0.2.whl

Install the arabica-1.0.2 tar.gz package:

pip install arabica-1.0.2.tar.gz