Description

Aligning BPE and AST

Attribute	Value
Operating system	-
File name	code-tokenizers-0.0.5
Name	code-tokenizers
Library version	0.0.5
Maintainer	[]
Maintainer email	[]
Author	ncoop57
Author email	nacooper01@wm.edu
Home page	https://github.com/ncoop57/code_tokenizers
URL	https://pypi.org/project/code-tokenizers/
License	Apache Software License 2.0

code_tokenizers
================

This library is built on top of the awesome [transformers](https://github.com/huggingface/transformers) and [tree-sitter](https://github.com/tree-sitter/py-tree-sitter) libraries. It provides a simple interface to align the tokens produced by a BPE tokenizer with the tokens produced by a tree-sitter parser.

## Install

``` sh
pip install code_tokenizers
```

## How to use

The main interface of `code_tokenizers` is the [`CodeTokenizer`](https://ncoop57.github.io/code_tokenizers/core.html#codetokenizer) class. You can use a pretrained BPE tokenizer from the popular [transformers](https://huggingface.co/docs/transformers/quicktour#autotokenizer) library, and a tree-sitter parser from the [tree-sitter](https://tree-sitter.github.io/tree-sitter/using-parsers#python) library.

To specify a [`CodeTokenizer`](https://ncoop57.github.io/code_tokenizers/core.html#codetokenizer) using the `gpt2` BPE tokenizer and the `python` tree-sitter parser, you can do:

``` python
from code_tokenizers.core import CodeTokenizer

py_tokenizer = CodeTokenizer.from_pretrained("gpt2", "python")
```

    None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.

You can specify any pretrained BPE tokenizer from the [huggingface hub](https://hf.co/models) or a local directory and the language to parse the AST for.

Now, we can tokenize some code:

``` python
from pprint import pprint

code = """
def foo():
    print("Hello world!")
"""

encoding = py_tokenizer(code)
pprint(encoding, depth=1)
```

    {'ast_ids': [...],
     'attention_mask': [...],
     'input_ids': [...],
     'is_builtins': [...],
     'is_internal_methods': [...],
     'merged_ast': [...],
     'offset_mapping': [...],
     'parent_ast_ids': [...]}

And we can print out the associated AST types:

<div>

> **Note**
>
> Here the N/As are the tokens that are not part of the AST, such
> as the spaces and the newline characters. Their IDs are set to -1.

</div>

``` python
for ast_id, parent_ast_id in zip(encoding["ast_ids"], encoding["parent_ast_ids"]):
    if ast_id != -1:
        print(py_tokenizer.node_types[parent_ast_id], py_tokenizer.node_types[ast_id])
    else:
        print("N/A")
```

    N/A
    function_definition def
    function_definition identifier
    parameters (
    N/A
    N/A
    N/A
    N/A
    call identifier
    argument_list (
    argument_list string
    argument_list string
    argument_list string
    argument_list )
    N/A
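To make the idea of aligning subword tokens with AST nodes more concrete, here is a minimal, library-free sketch. It is not the implementation used by `code_tokenizers`; it only assumes that each token and each AST leaf node can be described by a `(start, end)` character span, and labels a token with the type of the leaf that fully contains it. The function name `align_tokens_to_nodes`, the hand-written offsets, and the node type names are all hypothetical and chosen for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def align_tokens_to_nodes(
    token_offsets: List[Span],
    node_spans: List[Tuple[Span, str]],
) -> List[str]:
    """Label each token with the type of the leaf node that contains it.

    Tokens that no node covers (for example whitespace) are labeled "N/A".
    """
    labels = []
    for tok_start, tok_end in token_offsets:
        label = "N/A"
        for (node_start, node_end), node_type in node_spans:
            # A token belongs to a node when its span lies inside the node's span.
            if node_start <= tok_start and tok_end <= node_end:
                label = node_type
                break
        labels.append(label)
    return labels


code = "def foo():\n    pass\n"

# Hypothetical subword offsets, written by hand (not produced by a real BPE model).
# Note that "pass" is split into two subwords, "pa" and "ss".
token_offsets = [
    (0, 3), (3, 4), (4, 7), (7, 8), (8, 9), (9, 10),
    (10, 15), (15, 17), (17, 19), (19, 20),
]

# Hypothetical leaf node spans, written by hand (not produced by a real parser).
node_spans = [
    ((0, 3), "def"),
    ((4, 7), "identifier"),
    ((7, 8), "("),
    ((8, 9), ")"),
    ((9, 10), ":"),
    ((15, 19), "pass"),
]

labels = align_tokens_to_nodes(token_offsets, node_spans)
for (start, end), label in zip(token_offsets, labels):
    print(repr(code[start:end]), label)
```

Both subwords of `pass` receive the same label, while the whitespace and newline tokens fall outside every leaf and come out as `N/A`, which mirrors the `-1` IDs described in the note above. A real pipeline would also have to reconcile units, since a parser may report byte offsets while a tokenizer reports character offsets, and would have to decide how to treat tokens that carry a leading space.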

Requirements

Version	Name
-	fastcore
-	pandas
<5	transformers
==0.20.1	tree-sitter
-	black[jupyter]
<3	datasets
<3	nbdev
-	twine
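Because some of these requirements are pinned or capped (for example `tree-sitter` at `==0.20.1`), it can help to see what is actually installed before debugging an import error. The sketch below uses only the standard library's `importlib.metadata` (available from Python 3.8) to report installed versions; the helper name `installed_versions` is hypothetical, and the script only reports versions rather than checking them against the constraints.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional

# Distribution names taken from the requirements table above
# ("black[jupyter]" is listed here under its base name, "black").
REQUIREMENTS = [
    "fastcore", "pandas", "transformers", "tree-sitter",
    "black", "datasets", "nbdev", "twine",
]


def installed_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(REQUIREMENTS)
for name, version in report.items():
    print(f"{name}: {version if version is not None else 'not installed'}")
```

Comparing the printed versions against the table by hand is usually enough for a quick check of an environment.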

Required language

Version	Name
>=3.7	Python

How to install

Install the code-tokenizers-0.0.5 whl package:

pip install code-tokenizers-0.0.5.whl

Install the code-tokenizers-0.0.5 tar.gz package:

pip install code-tokenizers-0.0.5.tar.gz