# connlp
A collection of Python code for analyzing text data in the construction industry.
It mainly repackages pre-existing Python libraries for Natural Language Processing (NLP).
## _Project Information_
- Supported by C!LAB (@Seoul Nat'l Univ.)
## _Contributors_
- Seonghyeon Boris Moon (blank54@snu.ac.kr, https://github.com/blank54/)
- Sehwan Chung (hwani751@snu.ac.kr)
- Jungyeon Kim (janykjy@snu.ac.kr)
# Initialize
## _Setup_
Install _**connlp**_ with _pip_.
```shell
pip install connlp
```
Install the packages listed in _requirements.txt_.
```shell
cd WORKSPACE
wget -O requirements_connlp.txt https://raw.githubusercontent.com/blank54/connlp/master/requirements.txt
pip install -r requirements_connlp.txt
```
## _Test_
If the code below runs with no error, _**connlp**_ is installed successfully.
```python
from connlp.test import hello
hello()
# 'Helloworld'
```
# Preprocess
The preprocessing module supports English and Korean.
NOTE: There is no plan to support other languages (as of 2021.04.02).
## _Normalizer_
_**Normalizer**_ normalizes the input text by eliminating unnecessary characters and punctuation marks while retaining numbers and alphabet letters; the output is lowercased.
```python
from connlp.preprocess import Normalizer
normalizer = Normalizer()
normalizer.normalize(text='I am a boy!')
# 'i am a boy'
```
## _EnglishTokenizer_
_**EnglishTokenizer**_ tokenizes English input text based on word spacing.
N-gram-based tokenization is under development.
```python
from connlp.preprocess import EnglishTokenizer
tokenizer = EnglishTokenizer()
tokenizer.tokenize(text='I am a boy!')
# ['I', 'am', 'a', 'boy!']
```
## _KoreanTokenizer_
_**KoreanTokenizer**_ tokenizes Korean input text based on either a pre-trained or an unsupervised approach.
We recommend the pre-trained method unless you have a large corpus; this is the default setting.
To use a pre-trained tokenizer, select which analyzer to use. The available analyzers are based on KoNLPy (https://konlpy.org/ko/latest/api/konlpy.tag/), a Python package for Korean language processing. The default analyzer is _**Hannanum**_.
```python
from connlp.preprocess import KoreanTokenizer
tokenizer = KoreanTokenizer(pre_trained=True, analyzer='Hannanum')
```
If your corpus is large, you may use the unsupervised method, which is based on _**soynlp**_ (https://github.com/lovit/soynlp), an unsupervised text analyzer for Korean.
```python
from connlp.preprocess import KoreanTokenizer
tokenizer = KoreanTokenizer(pre_trained=False)
```
### _train_
If your _**KoreanTokenizer**_ is pre-trained, you can skip this step.
Otherwise (i.e., if you are using the unsupervised approach), the _**KoreanTokenizer**_ object first needs to be trained on an (unlabeled) corpus. A 'word score' is calculated for every subword in the corpus.
```python
from connlp.preprocess import KoreanTokenizer
tokenizer = KoreanTokenizer(pre_trained=False)
docs = ['코퍼스의 첫 번째 문서입니다.', '두 번째 문서입니다.', '마지막 문서']
tokenizer.train(text=docs)
print(tokenizer.word_score)
# {'서': 0.0, '코': 0.0, '째': 0.0, '.': 0.0, '의': 0.0, '마': 0.0, '막': 0.0, '번': 0.0, '문': 0.0, '코퍼': 1.0, '번째': 1.0, '마지': 1.0, '문서': 1.0, '코퍼스': 1.0, '문서입': 0.816496580927726, '마지막': 1.0, '코퍼스의': 1.0, '문서입니': 0.8735804647362989, '문서입니다': 0.9036020036098448, '문서입니다.': 0.9221079114817278}
```
### _tokenize_
If you are using a pre-trained _**KoreanTokenizer**_, the selected KoNLPy analyzer will tokenize the input sentence based on morphological analysis.
```python
from connlp.preprocess import KoreanTokenizer
tokenizer = KoreanTokenizer(pre_trained=True, analyzer='Hannanum')
doc = docs[0] # '코퍼스의 첫 번째 문서입니다.'
tokenizer.tokenize(doc)
# ['코퍼스', '의', '첫', '번째', '문서', '입니다', '.']
```
If you are using an unsupervised _**KoreanTokenizer**_, tokenization is based on the 'word score' calculated by the _**KoreanTokenizer.train**_ method.
For each blank-separated token, the subword with the maximum 'word score' is selected as an individual 'word' and separated from the remaining part.
```python
from connlp.preprocess import KoreanTokenizer
tokenizer = KoreanTokenizer(pre_trained=False)
doc = docs[0] # '코퍼스의 첫 번째 문서입니다.'
tokenizer.tokenize(doc)
# ['코퍼스의', '첫', '번째', '문서', '입니다.']
```
## _StopwordRemover_
_**StopwordRemover**_ removes stopwords from a given sentence based on a user-customized stopword list.
Before utilizing _**StopwordRemover**_, the user should normalize and tokenize the docs.
```python
from connlp.preprocess import Normalizer, EnglishTokenizer, StopwordRemover
normalizer = Normalizer()
eng_tokenizer = EnglishTokenizer()
stopword_remover = StopwordRemover()
docs = ['I am a boy!', 'He is a boy..', 'She is a girl?']
tokenized_docs = []
for doc in docs:
    normalized_doc = normalizer.normalize(text=doc)
    tokenized_doc = eng_tokenizer.tokenize(text=normalized_doc)
    tokenized_docs.append(tokenized_doc)
print(docs)
print(tokenized_docs)
# ['I am a boy!', 'He is a boy..', 'She is a girl?']
# [['i', 'am', 'a', 'boy'], ['he', 'is', 'a', 'boy'], ['she', 'is', 'a', 'girl']]
```
The user should prepare a customized stopword list (i.e., _stoplist_).
The _stoplist_ should contain the user-customized stopwords separated by '\n', and the file should be in ".txt" format.
```text
a
is
am
```
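For convenience, a stoplist like the one above can also be written from Python. The snippet below is a plain-Python sketch (not a _**connlp**_ API), and the path matches the one passed to _initiate_ below.
```python
import os

# The same filepath is passed to StopwordRemover.initiate() below.
fpath_stoplist = 'test/thesaurus/stoplist.txt'
os.makedirs(os.path.dirname(fpath_stoplist), exist_ok=True)

# One stopword per line, as described above.
with open(fpath_stoplist, 'w', encoding='utf-8') as f:
    f.write('\n'.join(['a', 'is', 'am']))
```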
Initialize the _**StopwordRemover**_ with the filepath of the user-customized stopword list.
If no stoplist file exists at the filepath, the stoplist remains an empty list.
```python
fpath_stoplist = 'test/thesaurus/stoplist.txt'
stopword_remover.initiate(fpath_stoplist=fpath_stoplist)
print(stopword_remover)
# <connlp.preprocess.StopwordRemover object at 0x7f163e70c050>
```
The user can count the word frequencies and figure out additional stopwords based on the results.
```python
stopword_remover.count_freq_words(docs=tokenized_docs)
# ========================================
# Word counts
# | [1] a: 3
# | [2] boy: 2
# | [3] is: 2
# | [4] i: 1
# | [5] am: 1
# | [6] he: 1
# | [7] she: 1
# | [8] girl: 1
```
After finalizing the _stoplist_, use the _**remove**_ method to remove the stopwords from the text.
```python
stopword_removed_docs = []
for doc in tokenized_docs:
    stopword_removed_docs.append(stopword_remover.remove(sent=doc))
print(stopword_removed_docs)
# [['i', 'boy'], ['he', 'boy'], ['she', 'girl']]
```
The user can check which stopwords were removed with the _**check_removed_words**_ method.
```python
stopword_remover.check_removed_words(docs=tokenized_docs, stopword_removed_docs=stopword_removed_docs)
# ========================================
# Check stopwords removed
# | [1] BEFORE: a(3) ->
# | [2] BEFORE: boy -> AFTER: boy(2)
# | [3] BEFORE: is(2) ->
# | [4] BEFORE: i -> AFTER: i(1)
# | [5] BEFORE: am(1) ->
# | [6] BEFORE: he -> AFTER: he(1)
# | [7] BEFORE: she -> AFTER: she(1)
# | [8] BEFORE: girl -> AFTER: girl(1)
```
# Embedding
## _Vectorizer_
_**Vectorizer**_ includes several commonly used text embedding methods.
### _tfidf_
TF-IDF is one of the most commonly used techniques for text embedding.
The TF-IDF model counts the term frequency (TF) and inverse document frequency (IDF) of the given documents.
The results include the following:
- TF-IDF Vectorizer (an instance of 'sklearn.feature_extraction.text.TfidfVectorizer')
- TF-IDF Matrix
- TF-IDF Vocabulary
```python
from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()
docs = ['I am a boy', 'He is a boy', 'She is a girl']
tfidf_vectorizer, tfidf_matrix, tfidf_vocab = vectorizer.tfidf(docs=docs)
type(tfidf_vectorizer)
# <class 'sklearn.feature_extraction.text.TfidfVectorizer'>
```
The user can get a document vector by indexing the _**tfidf_matrix**_.
```python
tfidf_matrix[0]
# (0, 2) 0.444514311537431
# (0, 0) 0.34520501686496574
# (0, 1) 0.5844829010200651
# (0, 5) 0.5844829010200651
```
The _**tfidf_vocab**_ maps every token to its index.
```python
print(tfidf_vocab)
# {'i': 5, 'am': 1, 'a': 0, 'boy': 2, 'he': 4, 'is': 6, 'she': 7, 'girl': 3}
```
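To read a document vector together with its tokens, the sparse matrix and the vocabulary can be combined. The following is a minimal sketch that assumes _tfidf_matrix_ is the scipy sparse matrix produced by scikit-learn's TfidfVectorizer, as shown above.
```python
# Invert the vocabulary: index -> token.
idx2token = {idx: token for token, idx in tfidf_vocab.items()}

# Pair every non-zero entry of the first document vector with its token.
doc_vector = tfidf_matrix[0]
token_weights = {idx2token[idx]: weight
                 for idx, weight in zip(doc_vector.indices, doc_vector.data)}
print(token_weights)
# e.g. {'boy': 0.44..., 'a': 0.34..., 'am': 0.58..., 'i': 0.58...}
```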
### _word2vec_
Word2Vec is a distributed representation language model for word embedding.
The Word2Vec model trains on tokenized docs and returns word vectors.
The result is an instance of 'gensim.models.word2vec.Word2Vec'.
```python
from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()
docs = ['I am a boy', 'He is a boy', 'She is a girl']
tokenized_docs = [tokenizer.tokenize(text=doc) for doc in docs]
w2v_model = vectorizer.word2vec(docs=tokenized_docs)
type(w2v_model)
# <class 'gensim.models.word2vec.Word2Vec'>
```
The user can get a word vector via the _**.wv**_ attribute.
```python
w2v_model.wv['boy']
# [-2.0130998e-03 -3.5652996e-03 2.7793974e-03 ...]
```
The Word2Vec model provides the _topn_ most similar words.
```python
w2v_model.wv.most_similar('boy', topn=3)
# [('He', 0.05311150848865509), ('a', 0.04154288396239281), ('She', -0.029122961685061455)]
```
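The scores above are cosine similarities between word vectors. The minimal numpy sketch below reproduces the computation for a single pair of words; the exact value depends on the random initialization of this tiny model.
```python
import numpy as np

# Cosine similarity between the vectors of 'boy' and 'He'.
vec_boy, vec_he = w2v_model.wv['boy'], w2v_model.wv['He']
cosine = np.dot(vec_boy, vec_he) / (np.linalg.norm(vec_boy) * np.linalg.norm(vec_he))
print(cosine)
# Matches the score reported by most_similar for 'He'.
```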
### _word2vec (update)_
The user can update the Word2Vec model with new data.
```python
new_docs = ['Tom is a man', 'Sally is not a boy']
tokenized_new_docs = [tokenizer.tokenize(text=doc) for doc in new_docs]
w2v_model_updated = vectorizer.word2vec_update(w2v_model=w2v_model, new_docs=tokenized_new_docs)
w2v_model_updated.wv['man']
# [4.9649975e-03 3.8002312e-04 -1.5773597e-03 ...]
```
### _doc2vec_
Doc2Vec is a distributed representation language model for embedding longer text (e.g., sentences, paragraphs, documents).
The Doc2Vec model trains on tokenized docs with tags and returns document vectors.
The result is an instance of 'gensim.models.doc2vec.Doc2Vec'.
```python
from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()
docs = ['I am a boy', 'He is a boy', 'She is a girl']
tagged_docs = [(idx, tokenizer.tokenize(text=doc)) for idx, doc in enumerate(docs)]
d2v_model = vectorizer.doc2vec(tagged_docs=tagged_docs)
type(d2v_model)
# <class 'gensim.models.doc2vec.Doc2Vec'>
```
The Doc2Vec model can infer a vector for a new document.
```python
test_doc = ['My', 'name', 'is', 'Peter']
d2v_model.infer_vector(doc_words=test_doc)
# [4.8494316e-03 -4.3647490e-03 1.1437446e-03 ...]
```
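Inferred vectors can then be compared with each other, for example with cosine similarity. The sketch below uses plain numpy and only the _d2v_model_ trained above.
```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two document vectors.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

vec_a = d2v_model.infer_vector(doc_words=['I', 'am', 'a', 'boy'])
vec_b = d2v_model.infer_vector(doc_words=['She', 'is', 'a', 'girl'])
print(cosine(vec_a, vec_b))
```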
# Analysis
## _TopicModel_
_**TopicModel**_ is a class for topic modeling based on the gensim LDA model.
It provides a simple way to train an LDA model and assign topics to docs.
Before using LDA topic modeling, the user should install the following package.
```shell
pip install pyldavis==2.1.2
```
_**TopicModel**_ requires two inputs:
- a dict of docs whose keys are the tags
- the number of topics for modeling
```python
from connlp.analysis_lda import TopicModel
num_topics = 2
docs = {'doc1': ['I', 'am', 'a', 'boy'],
'doc2': ['He', 'is', 'a', 'boy'],
'doc3': ['Cat', 'on', 'the', 'table'],
'doc4': ['Mike', 'is', 'a', 'boy'],
'doc5': ['Dog', 'on', 'the', 'table'],
}
lda_model = TopicModel(docs=docs, num_topics=num_topics)
```
### _learn_
The user can train the model with the _learn_ method.
Unless parameters are provided by the user, the model trains with the default parameters.
After _learn_, _**TopicModel**_ provides the _model_ attribute, which is an instance of 'gensim.models.ldamodel.LdaModel'.
```python
parameters = {
'iterations': 100,
'alpha': 0.7,
'eta': 0.05,
}
lda_model.learn(parameters=parameters)
type(lda_model.model)
# <class 'gensim.models.ldamodel.LdaModel'>
```
### _coherence_
_**TopicModel**_ provides a coherence value for model evaluation.
The coherence value is automatically calculated right after model training.
```python
print(lda_model.coherence)
# 0.3607990279229385
```
### _assign_
The user can easily assign the most appropriate topic to each doc using the _assign_ method.
After _assign_, the _**TopicModel**_ provides the _tag2topic_ and _topic2tag_ attributes for convenience.
```python
lda_model.assign()
print(lda_model.tag2topic)
print(lda_model.topic2tag)
# defaultdict(<class 'int'>, {'doc1': 1, 'doc2': 1, 'doc3': 0, 'doc4': 1, 'doc5': 0})
# defaultdict(<class 'list'>, {1: ['doc1', 'doc2', 'doc4'], 0: ['doc3', 'doc5']})
```
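Because _lda_model.model_ is a plain gensim _LdaModel_ (see above), the representative words of each topic can also be inspected with gensim's own API. The snippet below is only an illustration; the resulting keywords depend on training.
```python
for topic_id in range(num_topics):
    # show_topic returns (word, probability) pairs for the given topic.
    print(topic_id, lda_model.model.show_topic(topic_id, topn=3))
```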
## _NamedEntityRecognition_
Before using the NER modules, the user should install the proper versions of TensorFlow and Keras.
```shell
pip install config==0.4.2 gensim==3.8.1 gpustat==0.6.0 GPUtil==1.4.0 h5py==2.10.0 JPype1==0.7.1 Keras==2.2.4 konlpy==0.5.2 nltk==3.4.5 numpy==1.18.1 pandas==1.0.1 scikit-learn==0.22.1 scipy==1.4.1 silence-tensorflow==1.1.1 soynlp==0.0.493 tensorflow==1.14.0 tensorflow-gpu==1.14.0
```
The modules may also require _keras-contrib_.
The user can install it as follows.
```shell
git clone https://www.github.com/keras-team/keras-contrib.git
cd keras-contrib
python setup.py install
```
### _Labels_
_**NER_Model**_ is a class to conduct named entity recognition using Bi-directional Long-Short Term Memory (Bi-LSTM) and Conditional Random Field (CRF).
Appropriate labels are required first.
The labels should be numbered starting from 0.
```python
from connlp.analysis_ner import NER_Labels
label_dict = {'NON': 0, #None
'PER': 1, #PERSON
'FOD': 2,} #FOOD
ner_labels = NER_Labels(label_dict=label_dict)
```
### _Corpus_
Next, the user should prepare the data, which consists of sentences and labels matched to each other by the same tag.
The tokenized sentences and labels are combined via _**NER_LabeledSentence**_.
With the data, labels, and a proper _max_sent_len_ (i.e., the maximum sentence length for analysis), the _**NER_Corpus**_ is built.
Once the corpus is built, every sentence and label sequence is padded to the length of _max_sent_len_.
```python
from connlp.preprocess import EnglishTokenizer
from connlp.analysis_ner import NER_LabeledSentence, NER_Corpus
tokenizer = EnglishTokenizer()
data_sents = {'sent1': 'Sam likes pizza',
'sent2': 'Erik eats pizza',
'sent3': 'Erik and Sam are drinking soda',
'sent4': 'Flora cooks chicken',
'sent5': 'Sam ordered a chicken',
'sent6': 'Flora likes chicken sandwitch',
'sent7': 'Erik likes to drink soda'}
data_labels = {'sent1': [1, 0, 2],
'sent2': [1, 0, 2],
'sent3': [1, 0, 1, 0, 0, 2],
'sent4': [1, 0, 2],
'sent5': [1, 0, 0, 2],
'sent6': [1, 0, 2, 2],
'sent7': [1, 0, 0, 0, 2]}
docs = []
for tag, sent in data_sents.items():
    words = [str(w) for w in tokenizer.tokenize(text=sent)]
    labels = data_labels[tag]
    docs.append(NER_LabeledSentence(tag=tag, words=words, labels=labels))
max_sent_len = 10
ner_corpus = NER_Corpus(docs=docs, ner_labels=ner_labels, max_sent_len=max_sent_len)
type(ner_corpus)
# <class 'connlp.analysis_ner.NER_Corpus'>
```
### _Word Embedding_
Every word in the _**NER_Corpus**_ should be embedded into numeric vector space.
The user can conduct the embedding with Word2Vec, which is provided by the _**Vectorizer**_ of _**connlp**_.
Note that the embedding process of _**NER_Corpus**_ only requires the dictionary of word vectors and the feature size.
```python
from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()
tokenized_sents = [tokenizer.tokenize(sent) for sent in data_sents.values()]
w2v_model = vectorizer.word2vec(docs=tokenized_sents)
word2vector = vectorizer.get_word_vectors(w2v_model)
feature_size = w2v_model.vector_size
ner_corpus.word_embedding(word2vector=word2vector, feature_size=feature_size)
print(ner_corpus.X_embedded)
# [[[-2.40120804e-03 1.74632657e-03 ...]
# [-3.57543468e-03 2.86567654e-03 ...]
# ...
# [ 0.00000000e+00 0.00000000e+00 ...]] ...]
```
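As a quick sanity check (assuming the embedded corpus can be viewed as an array), the shape of _X_embedded_ should correspond to the number of sentences, _max_sent_len_, and _feature_size_.
```python
import numpy as np

# Expected shape: (number of sentences, max_sent_len, feature_size).
print(np.asarray(ner_corpus.X_embedded).shape)
# e.g. (7, 10, feature_size) for the seven example sentences above.
```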
### _Model Initialization_
The parameters for the Bi-LSTM and for model training should be provided; they can be composed as a single dictionary.
The user should initialize the _**NER_Model**_ with _**NER_Corpus**_ and the parameters.
```python
from connlp.analysis_ner import NER_Model
parameters = {
# Parameters for Bi-LSTM.
'lstm_units': 512,
'lstm_return_sequences': True,
'lstm_recurrent_dropout': 0.2,
'dense_units': 100,
'dense_activation': 'relu',
# Parameters for model training.
'test_size': 0.3,
'batch_size': 1,
'epochs': 100,
'validation_split': 0.1,
}
ner_model = NER_Model()
ner_model.initialize(ner_corpus=ner_corpus, parameters=parameters)
type(ner_model)
# <class 'connlp.analysis_ner.NER_Model'>
```
### _Model Training_
The user can train the _**NER_Model**_ with customized parameters.
The model automatically gets the dataset from the _**NER_Corpus**_.
```python
ner_model.train(parameters=parameters)
# Train on 3 samples, validate on 1 samples
# Epoch 1/100
# 3/3 [==============================] - 3s 1s/step - loss: 1.4545 - crf_viterbi_accuracy: 0.3000 - val_loss: 1.0767 - val_crf_viterbi_accuracy: 0.8000
# Epoch 2/100
# 3/3 [==============================] - 0s 74ms/step - loss: 0.8602 - crf_viterbi_accuracy: 0.7000 - val_loss: 0.5287 - val_crf_viterbi_accuracy: 0.8000
# ...
```
### _Model Evaluation_
The model performance is reported as a confusion matrix and F1 scores.
```python
ner_model.evaluate()
# |--------------------------------------------------
# |Confusion Matrix:
# [[ 3 0 3 6]
# [ 1 3 0 4]
# [ 0 0 2 2]
# [ 4 3 5 12]]
# |--------------------------------------------------
# |F1 Score: 0.757
# |--------------------------------------------------
# | [NON]: 0.600
# | [PER]: 0.857
# | [FOD]: 0.571
```
### _Save_
The user can save the _**NER_Model**_.
Saving stores the model itself ("\<FileName\>.pk") and the dataset ("\<FileName\>-dataset.pk") used in model development.
Note that the directory should exist before saving the model.
```python
from connlp.util import makedir
fpath_model = 'test/ner/model.pk'
makedir(fpath=fpath_model)
ner_model.save(fpath_model=fpath_model)
```
### _Load_
To load an already trained model, just create an _**NER_Model**_ instance and call _load_.
```python
fpath_model = 'test/ner/model.pk'
ner_model = NER_Model()
ner_model.load(fpath_model=fpath_model, ner_corpus=ner_corpus, parameters=parameters)
```
### _Prediction_
_**NER_Model**_ can conduct a new NER task on a given sentence.
The result is an instance of _**NER_Result**_.
```python
from connlp.preprocess import EnglishTokenizer
tokenizer = EnglishTokenizer()
new_sent = 'Tom eats apple'
tokenized_sent = tokenizer.tokenize(new_sent)
ner_result = ner_model.predict(sent=tokenized_sent)
print(ner_result)
# Tom/PER eats/NON apple/FOD
```
## _Web Crawling_
_**connlp**_ currently provides web crawling for Naver news articles.
### _Query_
The user should prepare the queries first.
A single text file (.txt) should include all of the query information as below.
- Date Start
- Date End
- Keywords
Keywords separated by '\n\n' are combined and used within the same query.
Meanwhile, keywords separated by '\n' are used as different queries.
For example, if the queries are given as below, the web crawler searches the articles with six queries: "smart+construction+safety at 20210718", "smart+construction+management at 20210718", "smart+construction+safety at 20210719", ...
```plain
20210718
20210720
smart
construction
safety
management
```
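To make the six-query expansion concrete, the following sketch in plain Python (not part of _**connlp**_; the keyword grouping below is an assumption for illustration) enumerates the dates in the range and combines the keyword groups into query strings.
```python
from datetime import datetime, timedelta
from itertools import product

# Assumed inputs for illustration only.
date_start, date_end = '20210718', '20210720'
keyword_groups = [['smart'], ['construction'], ['safety', 'management']]

# Enumerate every date in the range (inclusive).
start = datetime.strptime(date_start, '%Y%m%d')
end = datetime.strptime(date_end, '%Y%m%d')
dates = [(start + timedelta(days=d)).strftime('%Y%m%d')
         for d in range((end - start).days + 1)]

# The cross product of the keyword groups gives the query strings.
queries = ['+'.join(combination) for combination in product(*keyword_groups)]
print(queries)                    # ['smart+construction+safety', 'smart+construction+management']
print(len(dates) * len(queries))  # 6
```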
The _**NewsQueryParser**_ parses the queries into appropriate formats.
```python
from connlp.web_crawling import NewsQueryParser
query_parser = NewsQueryParser()
fpath_query = 'FILEPATH_OF_YOUR_QUERY'
query_list, date_list = query_parser.parse(fpath_query=fpath_query)
```
### _URLs_
In the second step, the web crawler parses the web pages that list the news articles.
_**NaverNewsListScraper**_ provides the function to parse these list pages.
We recommend saving the URL lists and loading them later (see the sketch after the code below).
```python
from connlp.web_crawling import NaverNewsListScraper
list_scraper = NaverNewsListScraper()
for date in sorted(date_list, reverse=False):
    for query in query_list:
        url_list = list_scraper.get_url_list(query=query, date=date)
```
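Because scraping the lists can take a while, the collected URL lists can be saved as they are scraped and loaded later, as recommended above. The sketch below uses plain _pickle_ and a hypothetical file-naming scheme, not a _**connlp**_ API.
```python
import os
import pickle

from connlp.web_crawling import NaverNewsListScraper

list_scraper = NaverNewsListScraper()
fdir_url_list = 'DIRECTORY_FOR_URLS'
os.makedirs(fdir_url_list, exist_ok=True)

for date in sorted(date_list, reverse=False):
    for query in query_list:
        url_list = list_scraper.get_url_list(query=query, date=date)
        # Hypothetical naming scheme: one pickle per (query, date) pair.
        fpath_url_list = os.path.join(fdir_url_list, '{}_{}.pk'.format(query, date))
        with open(fpath_url_list, 'wb') as f:
            pickle.dump(url_list, f)
```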
### _Articles_
The last step is to parse each article page and extract information from the article.
_**NaverNewsArticleParser**_ returns an instance of _**Article**_ for a given article.
Remember to extend the query list of the article.
```python
from connlp.web_crawling import NaverNewsArticleParser
article_parser = NaverNewsArticleParser()
# fname_url_list: the filename of a previously saved URL list.
query_list, _ = query_parser.urlname2query(fname_url_list=fname_url_list)
for url in url_list:
    article = article_parser.parse(url=url)
    article.extend_query(query_list)
```
### _Status_
_**NewsStatus**_ provides the status of the crawled corpus for given directories.
```python
from connlp.web_crawling import NewsStatus
news_status = NewsStatus()
fdir_queries = 'DIRECTORY_FOR_QUERIES'
fdir_url_list = 'DIRECTORY_FOR_URLS'
fdir_article = 'DIRECTORY_FOR_ARTICLES'
news_status.queries(fdir_queries=fdir_queries)
news_status.urls(fdir_urls=fdir_url_list)
news_status.articles(fdir_articles=fdir_article)
```
# Visualization
## _Visualizer_
_**Visualizer**_ includes several simple tools for text visualization.
Install the following packages.
```shell
pip install networkx wordcloud
```
### _network_
The _**network**_ method creates a word network from tokenized docs.
```python
from connlp.preprocess import EnglishTokenizer
from connlp.visualize import Visualizer
tokenizer = EnglishTokenizer()
visualizer = Visualizer()
docs = ['I am a boy', 'She is a girl']
tokenized_docs = [tokenizer.tokenize(text=doc) for doc in docs]
word_network = visualizer.network(docs=tokenized_docs, show=True)
```
The word network is a _matplotlib.pyplot_ object.
The user can save the figure with the _.savefig()_ method.
```python
word_network.savefig(FILEPATH)
```
### _wordcloud_
The _**wordcloud**_ method creates a word cloud from tokenized docs.
```python
from connlp.preprocess import EnglishTokenizer
from connlp.visualize import Visualizer
tokenizer = EnglishTokenizer()
visualizer = Visualizer()
docs = ['I am a boy', 'She is a girl']
tokenized_docs = [tokenizer.tokenize(text=doc) for doc in docs]
wordcloud = visualizer.wordcloud(docs=tokenized_docs, show=True)
```
The wordcloud is a _matplotlib.pyplot_ object.
The user can save the figure with the _.savefig()_ method.
```python
wordcloud.savefig(FILEPATH)
```
# Extracting Text
## _TextConverter_
_**TextConverter**_ includes several methods that extract raw text from various types of files (e.g., PDF, HWP) and/or convert them into plain text files (e.g., TXT).
### _hwp2txt_
The _**hwp2txt**_ method converts an HWP file into a plain text file.
Dependencies: the pyhwp package.
Install pyhwp (the pre-release version is required).
```shell
pip install --pre pyhwp
```
Example
```python
from connlp.text_extract import TextConverter
converter = TextConverter()
hwp_fpath = '/data/raw/hwp_file.hwp'
output_fpath = '/data/processed/extracted_text.txt'
converter.hwp2txt(hwp_fpath, output_fpath) # returns 0 if no error occurs
```
# GPU Utils
## _GPUMonitor_
_**GPUMonitor**_ is a class that monitors and displays the GPU status based on nvidia-smi.
Refer to "https://github.com/anderskm/gputil" and "https://data-newbie.tistory.com/561" for usage details.
Install the _GPUtil_ module with _pip_.
```shell
pip install GPUtil
```
Write your code between the initialization of the _**GPUMonitor**_ and _**monitor.stop()**_.
```python
from connlp.util import GPUMonitor
monitor = GPUMonitor(delay=3)
# >>>Write your code here<<<
monitor.stop()
# | ID | GPU | MEM |
# ------------------
# | 0 | 0% | 0% |
# | 1 | 1% | 0% |
# | 2 | 0% | 94% |
```