# aipalette_nlp
The `aipalette_nlp` Python package collects NLP functions used across Ai Palette, and further modules will be added over time. For now it provides a module with tokenizers for different languages and a text-preprocessing module with several cleaning functions, including language detection.
<br>
## How to Use
Install the package with pip (`pip install aipalette_nlp`) and import it directly in your code.
<br>
## Modules
### *Module 1: tokenizer*
Below is an example of how to use the `word_tokenize` function in the tokenizer module, which automatically detects the input language and calls the corresponding tokenizer.
```python
from aipalette_nlp.tokenizer import word_tokenize
text = "우아아 제 요리에 날개를 달아주는 아름다운 <키친콤마> 식품들이 도착했어요. 저당질, 저탄수화물로 만들어져 건강과 다이어트 그리고 맛까지 한꺼번에 챙길 수 있는 필수템입니다! 처음 호기심에서 시작한 저탄고지 키토식단을 유지한지 어느덧 2년 가까이 되었어요. 저탄고지는 살을 빼기위해 무작정 탄수화물을 끊는다거나 몸에 무리가 갈 수 있는 저칼로리 / 저염식이 아니에요. 내 몸에서 나타나는 반응에 좀더 귀기울이고 끊임없이 공부하고 좋은 음식을 섭취하려고 노력하는 라이프스타일 입니다."
print(word_tokenize(text))
```
**Output:**
```
{'tokenized_text': ['우아아', '제', '요리에', '날개를', '달아주는', '아름다운', '<키친콤마>', '식품들이', '도착했어요', '저당질,', '저탄수화물로', '만들어져', '건강과', '다이어트', '그리고', '맛까지', '한꺼번에', '챙길', '수', '있는', '필수템입니다!', '처음', '호기심에서', '시작한', '저탄고지', '키토식단을', '유지한지', '어느덧', '2년', '가까이', '되었어요', '저탄고지는', '살을', '빼기위해', '무작정', '탄수화물을', '끊는다거나', '몸에', '무리가', '갈', '수', '있는', '저칼로리', '/', '', '저염식이', '아니에요', '내', '몸에서', '나타나는', '반응에', '좀더', '귀기울이고', '끊임없이', '공부하고', '좋은', '음식을', '섭취하려고', '노력하는', '라이프스타일', '입니다']}
```
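The returned value is a dictionary with a `tokenized_text` key, as shown above. A quick usage sketch on English input (the exact tokens depend on the English tokenizer used under the hood):
```python
from aipalette_nlp.tokenizer import word_tokenize

# English input; the language is detected automatically, just as with the
# Korean example above.
text = "Dinner at the new bistro was fantastic, especially the gnocchi!"
result = word_tokenize(text)
print(result["tokenized_text"])
```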
### *Module 2: text_cleaning*
Below is an example of how you can use the functions in the text_cleaning module.
```python
from aipalette_nlp.preprocessing import detect_language, clean_text, remove_stopwords
text = """Dinner at @docksidevancouver . Patio season is definitely here!Support your local restaurants.
#foodie #facestuffing #scoutmagazine #vancouvermagazine #dailyhivevancouver #ediblevancouver #eatmagazine #vancouverisawesome #vancouverfoodie #food #foodlover
#curiocityvancouver #foodporn #foodlover #eat #foodgasm #foodinsta #foodinstagram #instafood #instafoodie #foodlover #foodpics #foodiesofinstagram #restaurant #homechef #foodphotography #nomnomnom #georgiastraight #docksiderestaurant #granvilleisland #gnocchi #dinner"""
print("language detected of the given text is : ", detect_language(text))
print(remove_stopwords(text))
print(clean_text(text))
```
**Output:**
```
language detected of the given text is :  en
dinner @docksidevancouver . patio season definitely here!support local restaurants. #foodie #facestuffing #scoutmagazine #vancouvermagazine #dailyhivevancouver #ediblevancouver #eatmagazine #vancouverisawesome #vancouverfoodie #food #foodlover #curiocityvancouver #foodporn #foodlover #eat #foodgasm #foodinsta #foodinstagram #instafood #instafoodie #foodlover #foodpics #foodiesofinstagram #restaurant #homechef #foodphotography #nomnomnom #georgiastraight #docksiderestaurant #granvilleisland #gnocchi #dinner
{'hashtags': ['foodie', 'facestuffing', 'scoutmagazine', 'vancouvermagazine', 'dailyhivevancouver', 'ediblevancouver', 'eatmagazine', 'vancouverisawesome', 'vancouverfoodie', 'food', 'foodlover', 'curiocityvancouver', 'foodporn', 'foodlover', 'eat', 'foodgasm', 'foodinsta', 'foodinstagram', 'instafood', 'instafoodie', 'foodlover', 'foodpics', 'foodiesofinstagram', 'restaurant', 'homechef', 'foodphotography', 'nomnomnom', 'georgiastraight', 'docksiderestaurant', 'granvilleisland', 'gnocchi', 'dinner'], 'cleaned_text': 'dinner username patio season definitely support local restaurants', 'text_length': 65}
```
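The functions above can be combined into a small preprocessing pipeline. The sketch below is illustrative only; it assumes `clean_text` returns the dictionary shown in the output above, and the helper name `preprocess_post` is hypothetical:
```python
from aipalette_nlp.preprocessing import detect_language, clean_text

def preprocess_post(caption: str) -> dict:
    # clean_text returns 'hashtags', 'cleaned_text' and 'text_length'
    # (see the output above); we attach the detected language alongside them.
    result = clean_text(caption)
    result["language"] = detect_language(caption)
    return result

post = "Trying the new ramen spot with @a_friend tonight! #ramen #foodie"
print(preprocess_post(post))
```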
<br>
## Complete list of tokenizers supported:
['english', 'french', 'italian', 'portuguese', 'spanish', 'swedish', 'turkish', 'russian', 'mandarin', 'thai', 'japanese', 'korean', 'vietnamese', 'german', 'arabic']
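Any of these languages can be passed to `word_tokenize` in the same way, since the language is detected automatically. For example, with French input (the exact tokens depend on the French tokenizer used under the hood):
```python
from aipalette_nlp.tokenizer import word_tokenize

# French is in the supported list; no language needs to be specified.
french_text = "Ce soir, nous avons dégusté un excellent risotto aux champignons."
print(word_tokenize(french_text)["tokenized_text"])
```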
<br>
## Text Processing/Cleaning Functions
The `clean_text` function from the text_cleaning module performs the following steps (an illustrative sketch follows this list):
* Replace hashtags (#______) in the main caption with the original form of the word.
* Replace all mentioned usernames (@_______) with the word “\<username>”.
* Remove punctuation.
* Remove stopwords (using the nltk package).
* Detect the language of the text.
* Replace all links/URLs.
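The snippet below is a minimal sketch of those steps using `re` and `nltk`; it is not the package's actual implementation, and the function name `clean_text_sketch` is hypothetical.
```python
import re
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

def clean_text_sketch(text: str, language: str = "english") -> dict:
    # Replace links/URLs with a placeholder.
    text = re.sub(r"https?://\S+|www\.\S+", "<url>", text)
    # Replace mentioned usernames with "<username>".
    text = re.sub(r"@\w+", "<username>", text)
    # Strip the '#' so hashtags revert to the original word form.
    text = re.sub(r"#(\w+)", r"\1", text)
    # Remove punctuation (keeping the placeholder angle brackets).
    text = re.sub(r"[^\w\s<>]", " ", text)
    # Remove stopwords with nltk.
    stops = set(stopwords.words(language))
    tokens = [t for t in text.lower().split() if t not in stops]
    return {"cleaned_text": " ".join(tokens), "text_length": len(tokens)}
```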
#### Languages supported by our language detector:
`af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu`