Description

tools for loading corpora

Attribute	Value
Operating system	-
File name	corpusinterface-0.1.3
Name	corpusinterface
Library version	0.1.3
Maintainer	[]
Maintainer email	[]
Author	Robert Lieck
Author email	robert.lieck@epfl.ch
Home page	https://github.com/DCMLab/corpusinterface
Project URL	https://pypi.org/project/corpusinterface/
License	-

# Corpus Interface

![build](https://github.com/DCMLab/CorpusInterface/workflows/build/badge.svg)
[![PyPI version](https://badge.fury.io/py/corpusinterface.svg)](https://badge.fury.io/py/corpusinterface)
![tests](https://github.com/DCMLab/CorpusInterface/workflows/tests/badge.svg)
[![codecov](https://codecov.io/gh/DCMLab/CorpusInterface/branch/master/graph/badge.svg?token=BooAiwbcyk)](https://codecov.io/gh/DCMLab/CorpusInterface)
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)

Basic functionality to maintain and load corpora.

## Installation

`pip install corpusinterface`

## Corpora

A number of corpora are available in the following two config files:

- [DCML_corpora.ini](DCML_corpora.ini): contains corpora maintained by the DCML. Some of these are not publicly accessible, so downloading them will fail. Please contact us at <dcml@epfl.ch> to request access.
- [external_corpora.ini](external_corpora.ini): contains external corpora that are not maintained by the DCML.

Here are some references with more corpora (some of which are also in the config files above):

- https://github.com/dharasim/MCR/wiki
- https://github.com/ismir/mir-datasets

## Managing Corpora

### Adding your own corpus

#### TL;DR

Provide a config file `your-corpus.ini`

```ini
[Your Corpus]
access: zip
url: http://your-website.com/your-corpus.zip
```

and load it using `init_config("your-corpus.ini")`.

#### More details

Say you packaged a number of files into a corpus

```
your-corpus
|- file_1.txt
|- file_2.txt
|- dir_1
   |- file_3.txt
   |- file_4.txt
```

and let's assume you made it available as a zip archive at `http://your-website.com/your-corpus.zip` (`git` repos and `tar.gz` files are also supported).
Without a config file, this corpus can be added and accessed as follows:

```python
from corpusinterface import config, load

# add your corpus
config.add_corpus("Your Corpus",
                  access="zip",
                  url="http://your-website.com/your-corpus.zip")
# load the corpus
corpus = load("Your Corpus", download=True)
# access the data (using a file_reader of your choice)
for file in corpus.data(file_reader=lambda file, **kwargs: f"reading: {file}"):
    print(file)
```

This will print

```
reading: ~/corpora/Your Corpus/file_1.txt
reading: ~/corpora/Your Corpus/file_2.txt
reading: ~/corpora/Your Corpus/dir_1/file_3.txt
reading: ~/corpora/Your Corpus/dir_1/file_4.txt
```

with `~` replaced by your home directory (paths might be displayed differently, depending on your operating system).

Config files let you automate the procedure of adding a corpus and are a convenient way to provide more detailed information, in particular for other people who want to use your corpus.

### Config files

Instead of specifying the necessary information from within Python, you can also put it in a config file:

```ini
[Your Corpus]
access: zip
url: http://your-website.com/your-corpus.zip
```

If you put this file at the default location `~/corpora/corpora.ini` in your home directory, or name it `corpora.ini` and place it in the current working directory, it is automatically loaded by `init_config` on package import. Otherwise, you can load any config file by either calling `reset_config`

```python
config.reset_config("your-config-file.ini")
```

which clears the config and reinitialises it, adding `your-config-file.ini` (see `init_config` for more fine-grained control), or by loading it separately

```python
config.load_config("your-config-file.ini")
```

#### Default config

A default config file is shipped with the `corpusinterface` package and automatically loaded by `init_config`. It defines some useful defaults that are used for newly added corpora if no corpus-specific values are specified.
You can see all the config information associated with your corpus by printing a summary:

```python
print(config.summary(corpus="Your Corpus"))
```

```ini
[Your Corpus]
access: zip
url: http://your-website.com/your-corpus.zip
info: None
root: ~/corpora
path: ~/corpora/Your Corpus
parent: None
loader: FileCorpus
```

In particular, the default `root` directory `~/corpora` was added and the corpus is stored in a `path` that is a subdirectory `~/corpora/Your Corpus` named after the corpus (more on `root` and `path` below). Moreover, by default the corpus is assumed to be a `FileCorpus`, that is, a simple collection of files.

#### Special parameters

The parameters `root`, `path`, `parent`, `download`, `loader`, `access`, and `url` are special and their values are treated in a particular way.

##### `root`

Root directory to store the corpus in. This should be an absolute path; `~` is expanded to the user home. If a relative path is specified, a warning is issued and it is interpreted relative to the current working directory. If `parent` is non-empty, the value of `root` is ignored and the parent's `path` is used instead. A call to `config.get(Name, 'root')` returns the effective value.

##### `path`

Directory to store the corpus in. This can be

1. an absolute path (`~` is expanded to the user home), in which case `root` is ignored,
2. a relative path, in which case it is appended to `root`, or
3. empty, in which case the corpus `[Name]` is appended to `root`.

A call to `config.get(Name, 'path')` returns the effective value. Note that for sub-corpora (with non-empty `parent`) the parent's `path` is used instead of `root`.

##### `parent`

A parent corpus name or empty. If non-empty, the parent corpus should be defined separately and the value of `root` is ignored and replaced by the parent's `path`. Initialisation (e.g. downloading from `url` with the `access` method) is delegated to the parent corpus when loading a sub-corpus.
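
The interplay of `root`, `path`, and `parent` described above can be sketched in plain Python. This is only an illustration of the stated rules, not the library's implementation: the helpers `effective_root` and `effective_path`, the dictionary layout, and the example directories are all hypothetical, and the warning issued for relative roots is omitted.

```python
from pathlib import Path

DEFAULT_ROOT = "~/corpora"  # default root named in the text above


def effective_root(corpora, name):
    """Hypothetical helper: the parent's path if a parent is set, else the expanded root."""
    entry = corpora[name]
    parent = entry.get("parent")
    if parent:
        # sub-corpus: the parent's effective path replaces root
        return effective_path(corpora, parent)
    root = Path(entry.get("root") or DEFAULT_ROOT).expanduser()
    # a relative root is interpreted relative to the current working directory
    return root if root.is_absolute() else Path.cwd() / root


def effective_path(corpora, name):
    """Hypothetical helper: resolve path following the three cases listed above."""
    path = corpora[name].get("path")
    if not path:
        # empty path: append the corpus name to the effective root
        return effective_root(corpora, name) / name
    path = Path(path).expanduser()
    if path.is_absolute():
        # absolute path: root is ignored
        return path
    # relative path: append it to the effective root
    return effective_root(corpora, name) / path


# illustrative entries only; names and directories are made up
corpora = {
    "Your Corpus": {"root": "/data/corpora"},
    "Sub Corpus": {"parent": "Your Corpus", "path": "dir_1"},
    "Elsewhere": {"root": "/data/corpora", "path": "/mnt/elsewhere"},
}

for name in corpora:
    print(f"{name}: {effective_path(corpora, name)}")
```

In this sketch the sub-corpus ends up inside its parent's directory, while the entry with an absolute `path` ignores its `root` entirely.
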
##### `download`, `loader`, `access`, `url`

See the section on [loading a corpus](#loading-a-corpus).

#### Additional parameters

You can specify additional parameters that are handed over to the loader and (in case of the `FileCorpus` loader) further passed on to your `file_reader` function. For instance, you could specify

```ini
prefix: my prefix
```

in the config file or equivalently

```python
config.add_corpus("Your Corpus", ..., prefix="my prefix")
```

from within Python. Your file reader can then make use of this parameter (it is provided as a keyword argument, so you have to refer to it by the correct name)

```python
file_reader=lambda file, prefix, **kwargs: f"{prefix}: {file}"
```

```
my prefix: ~/corpora/Your Corpus/file_1.txt
...
```

This is also the reason why a reader function always needs `**kwargs`: it has to accept all keyword arguments that are provided, even if it does not use them.

The config values can be dynamically overwritten in the `load` function

```python
corpus = load("Your Corpus", ..., prefix="other prefix")
```

```
other prefix: ~/corpora/Your Corpus/file_1.txt
...
```

or in the `data` function:

```python
for file in corpus.data(..., prefix="still different"):
    ...
```

```
still different: ~/corpora/Your Corpus/file_1.txt
...
```

#### Controlling initialisation

You have full control over how the config is (re)initialised. A call to `config.init_config()` or `config.reset_config()` without any arguments will load the default config, then look for `corpora.ini` in `~/corpora` and in the current working directory and load those, too, if present. This is equivalent to calling

```python
config.init_config(default=None, home=None, local=None)
```

or

```python
config.reset_config(default=None, home=None, local=None)
```

respectively.
For each of these parameters you may alternatively specify `True` (meaning that you _expect_ the respective config file to be loaded; an error is raised otherwise) or `False` (meaning that the respective config file is _not_ loaded, even if it exists). You may also specify one or more additional files to load

```python
config.init_config("/path/to/file_1.ini", "/path/to/file_2.ini", ...)
```

## Loading a corpus

Corpora are loaded with the `load` function

```python
from corpusinterface import load

# load the corpus
corpus = load("Your Corpus", download=True)
```

Specifying `download=True` indicates that the corpus should be downloaded if it cannot be found on disk. The `load` function looks up the given corpus in the config, retrieving any parameters (including default parameters) specified there. If you provide additional keyword arguments, these will overwrite parameters from the config with the same name. So you could, for instance, specify a different URL for downloading

```python
corpus = load("Your Corpus", url="some-other-url.com/corpus.zip", download=True)
```

or a custom path in which to look for the corpus on disk and/or to download it to

```python
corpus = load("Your Corpus", path="/my/custom/path/for/corpus/", download=True)
```

Four parameters are processed by the `load` function itself (`download`, `access`, `url`, `loader`). `download` and `url` play the obvious role described above. `access` specifies how the content should be accessed and, together with `url`, is handled by the `download` function (called by `load` if `download=True` is specified). `access` can be a string (`"git"`, `"zip"`, or `"tar.gz"`), in which case the corpus is downloaded and unpacked accordingly. It can also be a callable provided as a keyword argument to `load`. In that case the corpus `path` is created on disk and the provided callable is called with the corpus name and all keyword arguments, including any parameters specified in the config.
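
As a rough illustration of a callable `access`, the sketch below defines a function that fills the corpus directory with generated files instead of downloading anything. The function name `write_demo_files` and the file contents are hypothetical. The sketch also assumes that the corpus name arrives as the first positional argument and that the effective `path` is among the keyword arguments, which the description above suggests but does not spell out. To stay runnable without the package, the function is called directly rather than through `load`.

```python
import tempfile
from pathlib import Path


def write_demo_files(name, path, **kwargs):
    """Hypothetical access callable: populate `path` with two small text files."""
    target = Path(path)
    # the text above says the library creates `path`; this is harmless if it already exists
    target.mkdir(parents=True, exist_ok=True)
    for i in (1, 2):
        (target / f"file_{i}.txt").write_text(f"{name}: demo file {i}\n")
    return target


# direct call for demonstration; url is passed only to show that extra keywords are absorbed
with tempfile.TemporaryDirectory() as tmp:
    created = write_demo_files("Your Corpus", path=Path(tmp) / "Your Corpus", url=None)
    names = sorted(p.name for p in created.iterdir())

print(names)

# With the package installed, the callable would presumably be passed like this:
# corpus = load("Your Corpus", access=write_demo_files, download=True)
```

Accepting `**kwargs` keeps the callable tolerant of whatever other config parameters are forwarded to it.
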
The `loader` parameter is handled in a special way. If it is a callable, the `load` function will ensure the corpus exists (potentially downloading it) and then call the specified callable with all provided keyword arguments, including any parameters specified in the config. This means that you can simply specify any custom loader function you would like to use

```python
corpus = load("Your Corpus", loader=my_special_loader_function)
```

If `loader` is a string, `load` tries to look up the appropriate function in the `loaders` dictionary. So you can also add it there and request it by providing the corresponding string in the `load` function

```python
from corpusinterface import load, loaders

loaders["my custom loader"] = my_special_loader_function
corpus = load("Your Corpus", loader="my custom loader")
```

The advantage of this approach is that you can specify it in a config file, so you don't need to pass it to `load` each time

```ini
loader: my custom loader
```

Adding the loader function can also be automated. For instance, if you have a special corpus type that you provide in a separate Python module, you can simply add the loader function there

```python
from corpusinterface import loaders

class MySpecialCorpus:
    ...

loaders["my custom loader"] = MySpecialCorpus
```

Given your custom config file, your corpus can then be loaded simply as

```python
corpus = load("Your Corpus")
```

without having to specify anything manually. Note that any loader function is provided with all keyword arguments, so it might be a good idea to use `**kwargs` to handle any unforeseen additional parameters, even if they are not used.

## FileCorpus

The default corpus type is defined by the `FileCorpus` class. In a config file, it is specified by

```ini
loader: FileCorpus
```

which is the default if this parameter is not explicitly specified for a corpus.
When calling `load`, the keyword argument `loader="FileCorpus"` is looked up in `loaders` and the actual `FileCorpus` constructor is called. In fact, the static `FileCorpus.init` method is called to check for the mandatory `path` argument and provide an interpretable error message if it is missing.

The `FileCorpus` class expects to find a collection of files at `path` and makes them available via the `files` and `data` methods. Additionally, it accepts four more parameters:

- `file_regex`: a regular expression for file names; if provided, files whose name does _not_ match are ignored
- `path_regex`: a regular expression for paths; if provided, paths (including the file name) that do _not_ match are ignored
- `file_exclude_regex`: like `file_regex`, but _matches_ are ignored
- `path_exclude_regex`: like `path_regex`, but _matches_ are ignored

All additional keyword arguments are stored and passed on to calls of `data` and `metadata`.

### `files`

The `files` function returns an iterator over files (after applying the `*_regex` expressions, if provided). It returns their absolute paths.

### `data`

The `data` function iterates over `files` and attempts to read them. If a `file_reader` function is provided as a keyword argument upon initialisation or directly to `data`, it is called with the full path of the respective file as first argument and all keyword arguments. Otherwise (or if `file_reader=None`) `data` returns the absolute paths just like `files`.

### `metadata`

The `metadata` function looks for metadata in the `path` location of the corpus. If a `meta_reader` function is provided as a keyword argument upon initialisation or directly to `metadata`, it is called with the full `path` of the corpus as first argument and all keyword arguments. Otherwise (or if `meta_reader=None`) the full `path` is returned.
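
The effect of the four regex parameters can be illustrated with a small standalone sketch. It does not use `FileCorpus` itself: the helper `filter_files` is hypothetical, and it assumes that patterns are applied with `re.search` and that path patterns see the full path, neither of which is confirmed by the description above.

```python
import os
import re
import tempfile
from pathlib import Path


def filter_files(root, file_regex=None, path_regex=None,
                 file_exclude_regex=None, path_exclude_regex=None):
    """Hypothetical helper: yield absolute paths under root that pass all given filters."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full = os.path.abspath(os.path.join(dirpath, filename))
            # inclusion filters: skip anything that does NOT match
            if file_regex and not re.search(file_regex, filename):
                continue
            if path_regex and not re.search(path_regex, full):
                continue
            # exclusion filters: skip anything that DOES match
            if file_exclude_regex and re.search(file_exclude_regex, filename):
                continue
            if path_exclude_regex and re.search(path_exclude_regex, full):
                continue
            yield full


with tempfile.TemporaryDirectory() as tmp:
    # recreate the example layout used earlier in this README
    for rel in ("file_1.txt", "file_2.txt", "dir_1/file_3.txt", "dir_1/file_4.txt"):
        target = Path(tmp) / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("demo\n")

    # keep .txt files, but drop everything inside dir_1
    top_level = sorted(
        Path(p).name
        for p in filter_files(tmp, file_regex=r"\.txt$",
                              path_exclude_regex=r"[\\/]dir_1[\\/]")
    )
    # keep only files whose name mentions file_3
    only_file_3 = sorted(Path(p).name for p in filter_files(tmp, file_regex=r"file_3"))

print(top_level)
print(only_file_3)
```

Whether the real class matches with `search`, `match`, or `fullmatch` semantics is worth checking before relying on anchors such as `^` and `$`.
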

Requirements

Name	Value
GitPython	-

Required language

Name	Value
Python	>=3.6

How to install

Install the corpusinterface-0.1.3 whl package:

pip install corpusinterface-0.1.3.whl

Install the corpusinterface-0.1.3 tar.gz package:

pip install corpusinterface-0.1.3.tar.gz