compound-split-1.0.2.dev4



Description

Splits a compound into its body and head. So far, German and Dutch are supported.
| Feature | Value |
| --- | --- |
| Operating system | - |
| Filename | compound-split-1.0.2.dev4 |
| Package name | compound-split |
| Version | 1.0.2.dev4 |
| Maintainer | Joel Niklaus |
| Maintainer email | me@joelniklaus.ch |
| Author | Don Tuggener |
| Author email | don.tuggener@gmail.com |
| Homepage | https://github.com/JoelNiklaus/CompoundSplit |
| Project URL | https://pypi.org/project/compound-split/ |
| License | GPL-3.0 |
# CharSplit - An *ngram*-based compound splitter for German

Splits a German compound into its body and head, e.g.

> Autobahnraststätte -> Autobahn - Raststätte

Implementation of the method described in the appendix of the thesis:

Tuggener, Don (2016). *Incremental Coreference Resolution for German.* University of Zurich, Faculty of Arts.

### TL;DR

The method calculates the probabilities of ngrams occurring at the beginning, at the end and in the middle of words, and identifies the most likely position for a split. The method achieves ~95% accuracy for head detection on the [Germanet compound test set](http://www.sfs.uni-tuebingen.de/lsd/compounds.shtml). A model is provided, trained on 1 million German nouns from Wikipedia.
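As a rough, illustrative sketch of this idea (the function names, raw-count scoring and absence of smoothing below are simplified assumptions, not the package's actual implementation), a split position can be scored by how much the ngram to its left looks like a word ending and the ngram to its right looks like a word beginning:

```python
from collections import Counter

def train_ngram_tables(words, n=3):
    """Count character ngrams by position: word-initial, word-final, word-internal."""
    prefix, suffix, infix = Counter(), Counter(), Counter()
    for word in words:
        word = word.lower()
        for i in range(len(word) - n + 1):
            gram = word[i:i + n]
            if i == 0:
                prefix[gram] += 1
            elif i == len(word) - n:
                suffix[gram] += 1
            else:
                infix[gram] += 1
    return prefix, suffix, infix

def rank_splits(word, tables, n=3):
    """Rank split positions: a plausible split has a word-final ngram on its
    left and a word-initial ngram on its right, rather than word-internal ones."""
    prefix, suffix, infix = tables
    w = word.lower()
    candidates = []
    for pos in range(n, len(w) - n + 1):
        left_gram = w[pos - n:pos]   # ngram ending at the split point
        right_gram = w[pos:pos + n]  # ngram starting at the split point
        score = (suffix[left_gram] - infix[left_gram]) \
              + (prefix[right_gram] - infix[right_gram])
        candidates.append((score, word[:pos], word[pos:].capitalize()))
    return sorted(candidates, reverse=True)
```

Trained on a large noun list, `rank_splits` produces a ranking similar in spirit to the `char_split.py` output shown under Usage below.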
### Usage

### Train a new model:

```
$ python char_split_train.py <your_train_file>
```

where `<your_train_file>` contains one word (noun) per line.

### Compound splitting

From the command line:

```
$ python char_split.py <word>
```

Outputs all possible splits, ranked by their score, e.g.

```
$ python char_split.py Autobahnraststätte
0.84096566854 Autobahn Raststätte
-0.54568851959 Auto Bahnraststätte
-0.719082070993 Autobahnrast Stätte
...
```

As a module:

```
$ python
>>> from compound_split import char_split
>>> char_split.split_compound('Autobahnraststätte')
[[0.7945872450631273, 'Autobahn', 'Raststätte'],
 [-0.7143290887876655, 'Auto', 'Bahnraststätte'],
 [-1.1132332878581173, 'Autobahnrast', 'Stätte'],
 [-1.4010051533086552, 'Aut', 'Obahnraststätte'],
 [-2.3447843979244944, 'Autobahnrasts', 'Tätte'],
 [-2.4761904761904763, 'Autobahnra', 'Ststätte'],
 [-2.4761904761904763, 'Autobahnr', 'Aststätte'],
 [-2.5733333333333333, 'Autob', 'Ahnraststätte'],
 [-2.604651162790698, 'Autobahnras', 'Tstätte'],
 [-2.7142857142857144, 'Autobah', 'Nraststätte'],
 [-2.730248306997743, 'Autobahnrastst', 'Ätte'],
 [-2.8033113109925973, 'Autobahnraststä', 'Tte'],
 [-3.0, 'Autoba', 'Hnraststätte']]
```

### Document splitting

From the command line:

```
$ python doc_split.py <dict>
```

Reads everything from standard input and writes out the same text, with the best splits separated by the middle dot character `·`. Each word is split as many times as possible based on the file `<dict>`, which contains German words, one per line (comment lines beginning with `#` are allowed). The name of the default dictionary is in the file `doc_config.py`.

Note that the `doc_split` module retains a cache of words already split, so long documents will typically be processed proportionately faster than short ones. The cache is discarded when the program ends.

```
$ cat sentence1.txt
Um die in jeder Hinsicht zufriedenzustellen, tüftelt er einen Weg aus, sinnlose Bürokratie wie Ladenschlußgesetz und Nachtbackverbot auszutricksen.
$ python doc_split.py <sentence1.txt
Um die in jeder Hinsicht zufriedenzustellen, tüftelt er einen Weg aus, sinnlose Bürokratie wie Laden·schluß·gesetz und Nacht·back·verbot auszutricksen.
```

As a module:

```
$ python
>>> from compound_split import doc_split
>>> # Constant containing a middle dot
>>> doc_split.MIDDLE_DOT
'·'
>>> # Split a word as much as possible, return a list
>>> doc_split.maximal_split('Verfassungsschutzpräsident')
['Verfassungs', 'Schutz', 'Präsident']
>>> # Split a word as much as possible, return a word with middle dots
'Verfassungs·schutz·präsident'
>>> # Split all splittable words in a sentence
>>> doc_split.doc_split('Der Marquis schlug mit dem Handteller auf sein Regiepult.')
'Der Marquis schlug mit dem Hand·teller auf sein Regie·pult.'
```

### Document splitting server

Because of the startup time, you can run the document splitter as a simple server, and the responses will be quicker.

```
$ python doc_server [ -d ] <dict> <port>
```

The server will load `<dict>` and listen on `<port>`. The client must send the raw data in UTF-8 encoding to the port and close the write side of the connection; the server will return the split data. The option `-d` causes the server to return a sorted dictionary of split words instead: each word is on a single line, with the original word followed by a tab character followed by the split word. Because of Python restrictions, the server is single-threaded. The default dictionary and port are in the file `doc_config.py`.

A trivial client is provided:

```
$ python doc_client <port> <host>
```

It reads a document from standard input, sends it to the server running on `<host>` and `<port>`, and writes the server's output to standard output. It thus has the same interface as `doc_split` (except that the dictionary cannot be specified), but should run somewhat faster. The default host and port are in the file `doc_config.py`.

## Downloading dictionaries

To download German and Dutch dictionaries for `doc_split` and `doc_server`:

```
$ cd dicts
$ sh getdicts
```

This will download the spelling plugins from the LibreOffice site, extract the wordlists, and write the dictionary files into the current directory. It leaves a good many files in `/tmp`, which are not needed further.

* The dictionaries `de-DE.dic`, `de-AT.dic`, and `de-CH.dic` are fairly extensive (about 250,000 words each) and provide current German, Austrian, and Swiss spelling.
* The file `de-1901.dic` provides the spelling used between 1901 and 1996.
* The file `misc.dic` is a collection of nouns that are mis-split and are therefore included in the dictionary so that they won't be split.
* The file `legal.dic` contains legal terms. Remove it before running `getdicts` if you don't want it to be included.
* The file `de-mixed.dic` is a merger of all of the other files.
* The file `nl-NL.dic` is from OpenOffice and provides Dutch spelling (not currently used).

You can add your own wordlists before running `getdicts` if you want. They must be plain UTF-8 text with one word per line, and their filenames must begin with the correct language code (`de` for German). If the program is not splitting aggressively enough for your purposes, you may want to find and use a smaller dictionary.

Since only the exact word is looked up in these dictionaries, the following problem can arise: "Beschwerden" is not split because the dictionaries only contain "Beschwerde"! A solution would be to do the compound splitting only on lemmatized text, with dictionaries containing lemmatized words.

=> TODO: implement this OR make it possible to run it on a list of tokens!

TODO: Write more documentation
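As a hedged sketch of that lemmatization workaround (assuming spaCy and its `de_core_news_sm` German model as an extra dependency; neither is part of this package), the text can be lemmatized before it is handed to `doc_split`:

```python
import spacy  # assumed extra dependency, not shipped with compound-split
from compound_split import doc_split

nlp = spacy.load("de_core_news_sm")  # assumed German model name

def doc_split_lemmatized(text):
    """Lemmatize first, then split: an inflected form such as 'Beschwerden'
    is reduced to 'Beschwerde', which the dictionaries do contain.
    Whitespace and punctuation handling here is deliberately naive."""
    lemmas = " ".join(tok.lemma_ for tok in nlp(text))
    return doc_split.doc_split(lemmas)
```

Note the trade-off: the output contains split lemmas rather than the original surface forms, so this is closer to the token-list variant mentioned in the TODO than to a drop-in replacement.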


Language requirement

| Name | Value |
| --- | --- |
| Python | >=3 |


Installation


Install the compound-split-1.0.2.dev4 whl package:

    pip install compound-split-1.0.2.dev4.whl


Install the compound-split-1.0.2.dev4 tar.gz package:

    pip install compound-split-1.0.2.dev4.tar.gz
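After either install, a quick sanity check (mirroring the `split_compound` example from the README above):

```python
from compound_split import char_split

# The best-scoring split should be listed first.
best = char_split.split_compound('Autobahnraststätte')[0]
print(best)  # e.g. [0.7945872450631273, 'Autobahn', 'Raststätte']
```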