

codeprep-1.0.5



Description

A toolkit for pre-processing large source code corpora
| Feature | Value |
|---|---|
| Operating system | - |
| Filename | codeprep-1.0.5 |
| Package name | codeprep |
| Version | 1.0.5 |
| Maintainer | [] |
| Maintainer email | [] |
| Author | Hlib Babii |
| Author email | hlibbabii@gmail.com |
| Homepage | http://github.com/giganticode/codeprep |
| Package URL | https://pypi.org/project/codeprep/ |
| License | Apache-2.0 |
<!-- SPDX-FileCopyrightText: 2020 Hlib Babii <hlibbabii@gmail.com>
SPDX-License-Identifier: Apache-2.0 -->

# Codeprep

[![Build Status](https://travis-ci.org/giganticode/codeprep.svg?branch=master)](https://travis-ci.org/giganticode/codeprep)
[![Maintainability](https://api.codeclimate.com/v1/badges/64c9b107bc09fdb1b3b1/maintainability)](https://codeclimate.com/github/giganticode/codeprep/maintainability)
[![Test Coverage](https://api.codeclimate.com/v1/badges/64c9b107bc09fdb1b3b1/test_coverage)](https://codeclimate.com/github/giganticode/codeprep/test_coverage)
[![PyPI version fury.io](https://badge.fury.io/py/codeprep.svg)](https://pypi.python.org/pypi/codeprep/)

**This is a tool for preprocessing source code corpora according to a specified vocabulary modeling choice.**

Supported modeling choices are:

* Splitting algorithm (no identifier splitting, camel-case splitting, snake-case splitting, BPE (byte-pair encoding), number splitting, ronin: http://joss.theoj.org/papers/10.21105/joss.00653);
* Number of merges if using BPE;
* Ignoring/preserving string literals;
* Ignoring/preserving comments;
* Preserving case/lowercasing;
* Preserving/ignoring newlines and tabs;
* Applying/not applying stemming after basic splitting.

# Getting started

Make sure you have Python >= 3.6 installed on your system and that pip, setuptools, and wheel are up to date.

```bash
python --version
python -m pip install --upgrade pip setuptools wheel
```

Install the **codeprep** lib:

```bash
pip install codeprep
```

In order to run the **ronin** algorithm, you will additionally have to install the Spiral module (https://github.com/casics/spiral/):

```bash
pip install git+https://github.com/casics/spiral.git
```

The tool can be used **as a python library** as well as a standalone module runnable with a **CLI**. You can pass either the text itself or the path to a dataset to be preprocessed. When using the Python API, import methods from the `codeprep.api.text` module to preprocess text directly, and from `codeprep.api.corpus` to preprocess a dataset on disk. Below you can see the general patterns of usage.

Python API:

```python
>>> import codeprep.api.text as cp
>>> cp.<command>('Some code to be split')
```

```python
>>> import codeprep.api.corpus as cp
>>> cp.<command>('/path/to/the/dataset')
```

CLI:

```bash
codeprep <command> "Some code to be split"
```

```bash
codeprep <command> --path /path/to/the/dataset
```

Hereafter we will demonstrate the usage as a python library. The CLI is analogous to the python API; you can find the documentation on how to use it [here](codeprep/cli/spec.py).

## Usage examples

### Basic splitting

Tokenization + camelCase and snake_case splitting:

```python
>>> import codeprep.api.text as cp
>>> input_code = '''void test_WordUeberraschungPrinter() {
...	if (eps >= 0.345e+4) { // FIXME
...		printWord("	...	Überraschung");
...	}
... }'''
>>> cp.basic(input_code)
['void', '<w>', 'test', '_', 'Word', 'Ueberraschung', 'Printer', '</w>', '(', ')', '{', '\n', '\t', 'if', '(', 'eps', '>', '=', '0', '.', '<w>', '345', 'e', '</w>', '+', '4', ')', '{', '/', '/', 'FIXME', '\n', '\t', '\t', '<w>', 'print', 'Word', '</w>', '(', '"', '\t', '.', '.', '.', '\t', 'Überraschung', '"', ')', ';', '\n', '\t', '}', '\n', '}']
```

### Tokenize but don't split identifiers

```python
>>> import codeprep.api.text as cp
>>> input_code = '''void test_WordUeberraschungPrinter() {
...	if (eps >= 0.345e+4) { // FIXME
...		printWord("	...	Überraschung");
...	}
... }'''
>>> cp.nosplit(input_code)
['void', 'test_WordUeberraschungPrinter', '(', ')', '{', '\n', '\t', 'if', '(', 'eps', '>', '=', '0', '.', '345e', '+', '4', ')', '{', '/', '/', 'FIXME', '\n', '\t', '\t', 'printWord', '(', '"', '\t', '.', '.', '.', '\t', 'Überraschung', '"', ')', ';', '\n', '\t', '}', '\n', '}']
```
### BPE (byte-pair encoding)

The following code does **camelCase** and **snake_case** splitting and applies **BPE with 10k merges** on top:

```python
>>> import codeprep.api.text as cp
>>> input_code = '''void test_WordUeberraschungPrinter() {
...	if (eps >= 0.345e+4) { // FIXME
...		printWord("	...	Überraschung");
...	}
... }'''
>>> cp.bpe(input_code, bpe_codes_id='10k')
['v', 'oid</t>', 'test_', 'Word', 'U', 'eb', 'err', 'as', 'ch', 'un', 'g', 'Print', 'er</t>', '(</t>', ')</t>', '{</t>', '\n', '\t', 'i', 'f</t>', '(</t>', 'e', 'ps</t>', '></t>', '=</t>', '0</t>', '.</t>', '34', '5', 'e</t>', '+</t>', '4</t>', ')</t>', '{</t>', '/</t>', '/</t>', 'FIX', 'M', 'E</t>', '\n', '\t', '\t', 'print', 'Word</t>', '(</t>', '"</t>', '\t', '.</t>', '.</t>', '.</t>', '\t', 'Ü', 'b', 'err', 'as', 'ch', 'un', 'g</t>', '"</t>', ')</t>', ';</t>', '\n', '\t', '}</t>', '\n', '}</t>']
```

By default, **codeprep** does BPE using BPE codes learned on [the Github Java Corpus](http://groups.inf.ed.ac.uk/cup/javaGithub/). The argument `bpe_codes_id='10k'` tells the **codeprep** tool to use 10,000 BPE merges. Other possible values are `1k` and `5k` (1,000 and 5,000 merges respectively). Please refer to the section [Learning custom BPE codes](#learning-custom-bpe-codes) to train custom BPE codes.

**For other commands and options like `chars`, `--split-numbers`, `--ronin`, `--stem`, please refer to the [docs](codeprep/cli/spec.py).**

## Calculate vocabulary

Set the `calc_vocab` param to `True` when calling a preprocessing method to calculate the vocabulary of the preprocessed corpus, e.g.:

```python
>>> import codeprep.api.corpus as cp
>>> cp.basic('/path/to/train/on', calc_vocab=True)
...
Vocab is available at /path/to/vocab
```

## Learning custom BPE codes

If you don't want to use pre-trained BPE codes, it's possible to train custom ones. For example, to train 10,000 merges on the corpus located at `/path/to/train/on`, run the following command (CLI only):

```bash
codeprep learn-bpe 10000 -p /path/to/train/on --id custom-bpe-codes
```

Now it is possible to do BPE splitting by running the bpe command with any number of merges from 0 to 10,000 (for example, with 3,500 merges):

```bash
codeprep bpe custom-bpe-codes-3500 -p /path/to/preprocess
```

Before BPE codes are trained, the [basic preprocessing](#basic-splitting) is done, which can also be tuned with the arguments described in the section [Tweaking preprocessing](#tweaking-preprocessing).

## Additional options

### Tweaking preprocessing

You can pass the following parameters with a `True` value (they all default to `False`) to tweak the way the input is preprocessed:

* `no_str` - replace strings with `<string>` placeholders;
* `no_com` - replace comments with `<comment>` placeholders;
* `no_spaces` - remove newlines and tabs;
* `no_unicode` - replace words containing non-ASCII characters with `<non-en>` placeholders;
* `no_case` - lowercase words and encode information about case in `<Cap>` and `<CAPS>` tokens.

```python
>>> import codeprep.api.text as cp
>>> input_code = '''void test_WordUeberraschungPrinter() {
...	if (eps >= 0.345e+4) { // FIXME
...		printWord("	...	Überraschung");
...	}
... }'''
>>> cp.basic(input_code, no_spaces=True, no_unicode=True, no_case=True, no_com=True, no_str=True)
['void', '<w>', 'test', '_', '<Cap>', 'word', '<Cap>', 'ueberraschung', '<Cap>', 'printer', '</w>', '(', ')', '{', 'if', '(', 'eps', '>', '=', '0', '.', '<w>', '345', 'e', '</w>', '+', '4', ')', '{', '/', '/', '<CAPS>', 'fixme', '<w>', 'print', '<Cap>', 'word', '</w>', '(', '"', '.', '.', '.', '<Cap>', '<non-en>', '"', ')', ';', '}', '}']
```

Similar params can be specified as the switches `--no-str`, `--no-com`, `--no-spaces`, `--no-unicode`, `--no-case` in CLI commands.
### Specifying the language

Unless explicitly specified, **codeprep** will assume the language is Java. To make sure the input is preprocessed as intended, it is always **highly recommended** to specify it:

```python
>>> import codeprep.api.text as cp
>>> cp.bpe("volatile", '1k')
['volatile']
>>> cp.bpe("volatile", '1k', extension="py")
['v', 'ol', 'a', 'ti', 'le</t>']
# Since 'volatile' is a keyword in Java, it is represented as one token, unlike in Python,
# where it is fairly rare as an identifier and is therefore represented as multiple subtokens.
```

When preprocessing a corpus, **codeprep** identifies the language based on the file extension. If you want only files with (a) certain extension(s) to be preprocessed, you can specify the `--ext` param:

```bash
codeprep basic --path /path/to/be/preprocessed --ext "java"
# or if you want to pre-process multiple types of files:
codeprep basic --path /path/to/be/preprocessed --ext "java|c|py|js"
```

### Miscellaneous

You can specify the path to which the preprocessed corpus will be written:

```bash
codeprep basic --path /path/to/preprocess --output-path /path/to/output
```

To print logs with log level DEBUG and higher to stdout:

```bash
codeprep basic --path /path/to/preprocess --verbose
```

## Getting Help

To get help on commands and options:

```bash
codeprep --help
```

## Paper

This library was built to run experiments for our paper accepted at ICSE 2020: [Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code](https://arxiv.org/pdf/2003.07914.pdf)

If you use the library or the results, please cite the paper:

```
@article{karampatsis2020big,
    title={Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code},
    author={Karampatsis, Rafael-Michael and Babii, Hlib and Robbes, Romain and Sutton, Charles and Janes, Andrea},
    journal={arXiv preprint arXiv:2003.07914},
    year={2020}
}
```

# Advanced

### Caching

When preprocessing a dataset, **codeprep** first parses the source code and converts it into an internal representation, which is then converted into a preprocessed dataset depending on the provided parameters. The intermediate representation is cached, so that when the same dataset is preprocessed again with different parameters, **codeprep** (provided no changes have been made to the dataset) will use the cache rather than parse the source code again.

To store the cache, **codeprep** uses the directory `$XDG_CACHE_HOME/codeprep/<codeprep_version>` if `$XDG_CACHE_HOME` is set, and `$HOME/.cache/codeprep/<codeprep_version>` otherwise. Removing the cache will not change the final result, but it will make pre-processing slower.
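To make the lookup above concrete, here is a minimal sketch of the cache-directory resolution. This is an illustration only, not codeprep's actual code, and the helper name `codeprep_cache_dir` is made up:

```python
import os
from pathlib import Path

def codeprep_cache_dir(codeprep_version: str) -> Path:
    """Resolve the cache directory as described above:
    $XDG_CACHE_HOME/codeprep/<codeprep_version> if XDG_CACHE_HOME is set,
    $HOME/.cache/codeprep/<codeprep_version> otherwise."""
    base = os.environ.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(base) / "codeprep" / codeprep_version

# Deleting this directory is safe: codeprep will simply re-parse the corpus,
# so the final result is unchanged and only pre-processing speed suffers.
print(codeprep_cache_dir("1.0.5"))
```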
# Releases

## 1.0.3

- Allow more flexibility in versions of dependencies

## 1.0.1

- Fix training custom BPE codes (thanks to @mir-am)
- Fix corpus pre-processing on Windows

## 1.0.0

- DOI assigned

## 1.0.0-alpha.12

- Bugfixes and minor improvements

## 1.0.0-alpha.11 (NOT backward compatible with 1.0.0-alpha.10)

- Include token types in the metadata
- Expand the token type hierarchy
- Make it possible to return the full token index in the iterator

## 1.0.0-alpha.10 (NOT backward compatible with 1.0.0-alpha.9)

- Add boundaries of comments to pre-processing metadata
- Add Windows and OS X support
- Switch from unittest to pytest+doctest
- Bugfixes related to the literal representation of tokens on disk
- Bugfixes related to adding `</t>` to mark the end of a full token

## 1.0.0-alpha.9 (NOT backward compatible with 1.0.0-alpha.7)

- Add `get_corpus_size()` method to the `PreprocessedCorpus` class
- Do BPE splitting without splitting by convention first
- Use `</t>` to mark the last sub-token of a token
- Replace non-ASCII sequences with a special char
- Follow symlinks when reading a dataset
- Make it possible to preserve case when doing stemming
- Bugfixes

## 1.0.0-alpha.7 (NOT backward compatible with 1.0.0-alpha.6)

- Store the version in `codeprep.__version__`
- Implement `--full-strings` and `--max-str-length` options
- Replace the `ronin` method/command with a `--ronin` option and apply the ronin algorithm at the word level instead of the full-identifier level
- If the `split_numbers` option is set to `True`, split numbers not only in code but also in strings and comments
- Change placeholder values to more human-readable ones
- Improve log output
- Bugfixes

## 1.0.0-alpha.6

Initial PyPI release


Requirements

| Name | Version constraint |
|---|---|
| appdirs | >=1.4, <2 |
| dill | >=0.3.1.1, <0.4 |
| docopt | >=0.6.2, <0.7 |
| docopt-subcommands | >=3.0.0, <4 |
| jsons | >=1.0, <2 |
| nltk | >=3.4.5, <4 |
| Pygments | >=2.5.2, <3 |
| PyYAML | >=5.1, <6 |
| regex | >=2019.11.1, <=2020.5.14 |
| tqdm | >=4.39, <5 |


Language requirement

| Name | Version |
|---|---|
| Python | >=3.6 |
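As a quick way to verify the interpreter requirement above, a generic one-liner sketch (not part of codeprep itself):

```python
import sys
# codeprep 1.0.5 declares support for Python >= 3.6 only.
assert sys.version_info >= (3, 6), "codeprep requires Python >= 3.6"
```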


Installation


Install the codeprep-1.0.5 whl package:

    pip install codeprep-1.0.5.whl


Install the codeprep-1.0.5 tar.gz package:

    pip install codeprep-1.0.5.tar.gz
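After installing from either artifact, a quick sanity check can confirm the version; `codeprep.__version__` is documented in the release notes above, and the expected output assumes this exact package was installed:

```python
>>> import codeprep
>>> codeprep.__version__  # expected to be '1.0.5' for this package
'1.0.5'
```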