DomainThesaurus
================
Introduction
------------
**DomainThesaurus** is a python package offering a technique of extracting domain-specific
thesaurus which is commonly used in *Natural Language Processing*. Here is one item of generated
thesaurus::
{ "internet explorer":
{"abbreviation":["ie"],
"synonym":["internet explorers", "internet explorere", "internetexplorer"],
"other":["firefox","chrome","opera"]}
}
Except for domain-specific thesaurus, the package also provides several useful modules.
For example, **DomainTerm** for extracting domain-specific term and **WordDiscrimination**
for discriminate words (e.g. abbreviation, synonyms).
Details of the implemented approaches can be found in our publication:
`SEthesaurus: WordNet in Software Engineering
<https://ieeexplore.ieee.org/document/8827962>`_. (IEEE Transactions on Software Engineering 2019)
Domain-Specific term
::::::::::::::::::::::::::::::
**DomainTerm** can automatically extract domain-specific terms from domain corpus.
For example, *Javascript* in the domain of computer science and technology and *karush kuhn tucker* in
domain of mathematics.
Abbreviations and Synonyms
:::::::::::::::::::::::::::
The module **WordDiscrimination** can divide semantic-related words into different types.
The default module can recognize semantic-related words as `abbreviation` and `synonym`. Note that,
in our module, the `synonym` means that two words are semantic-related word and they are morphologically similar.
For example, *ie* is the abbreviation of *internet explorer* and *javascripts* is
the synonym of *javascript*.
Installation
------------
**DomainThesaurus** is tested to work under `Python 3.x`. Please use it in `Python 3.x`.
We will try to support *Python 2.x*.
Dependency requirements:
* gensim(>=3.6.0)
* networkx(>=2.1)
**DomainThesaurus** is currently available on the PyPi's repository and you can
install it via `pip`::
pip install DomainThesaurus
If you prefer, you can clone it and run the setup.py file. Use the following
command to get a copy from GitHub::
git clone https://github.com/DunZhang/DomainSpecificThesaurus.git
Usage
----------
A simple example::
>>> dst = DomainThesaurus(domain_specific_corpus_path="your domain specific corpus path",
>>> general_vocab_path="your general vocab path",
>>> outputDir="path of output")
>>> # extract domain thesauruss
>>> your_thesaurus = dst.extract()
If you don't have any datasets, you can copy and run this code:
https://github.com/DunZhang/DomainSpecificThesaurus/blob/master/docs/notebooks/domain_thesaurus.ipynb .
This code will automatically download datasets for you.
The code design is flexible, you can replace the default `function class` with your own `function class` to get better
performance.
You can find more usage in https://github.com/DunZhang/DomainSpecificThesaurus/blob/master/docs/notebooks
Acknowledgements
-----------------
In this project, we use `Levenshtein Distance` and `GoogleDriveDownloader` from https://pypi.org/project/jellyfish/
and https://github.com/ndrplz/google-drive-downloader. Thanks for their code.