معرفی شرکت ها

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

توضیحات

Corpus of Annual Reports in Japan

ویژگی	مقدار
سیستم عامل	-
نام فایل	coarij-0.2.8
نام	coarij
نسخه کتابخانه	0.2.8
نگهدارنده	[]
ایمیل نگهدارنده	[]
نویسنده	icoxfog417
ایمیل نویسنده	icoxfog417@yahoo.co.jp
آدرس صفحه اصلی	https://github.com/chakki-works/CoARiJ
آدرس اینترنتی	https://pypi.org/project/coarij/
مجوز	MIT

# CoARiJ: Corpus of Annual Reports in Japan [![PyPI version](https://badge.fury.io/py/coarij.svg)](https://badge.fury.io/py/coarij) [![Build Status](https://travis-ci.org/chakki-works/coarij.svg?branch=master)](https://travis-ci.org/chakki-works/coarij) [![codecov](https://codecov.io/gh/chakki-works/coarij/branch/master/graph/badge.svg)](https://codecov.io/gh/chakki-works/coarij) We organized Japanese financial reports to encourage applying NLP techniques to financial analytics. ## Dataset The corpora are separated to each financial years. master version. | fiscal_year | Raw file version (F) | Text extracted version (E) | |-------------|-------------------|-----------------| | 2014 | [.zip (9.3GB)](https://s3-ap-northeast-1.amazonaws.com/chakki.esg.financial.jp/dataset/release/chakki_esg_financial_2014.zip) | [.zip (269.9MB)](https://s3-ap-northeast-1.amazonaws.com/chakki.esg.financial.jp/dataset/release/chakki_esg_financial_extracted_2014.zip) | | 2015 | [.zip (9.8GB)](https://s3-ap-northeast-1.amazonaws.com/chakki.esg.financial.jp/dataset/release/chakki_esg_financial_2015.zip) | [.zip (291.1MB)](https://s3-ap-northeast-1.amazonaws.com/chakki.esg.financial.jp/dataset/release/chakki_esg_financial_extracted_2015.zip) | | 2016 | [.zip (10.2GB)](https://s3-ap-northeast-1.amazonaws.com/chakki.esg.financial.jp/dataset/release/chakki_esg_financial_2016.zip) | [.zip (334.7MB)](https://s3-ap-northeast-1.amazonaws.com/chakki.esg.financial.jp/dataset/release/chakki_esg_financial_extracted_2016.zip) | | 2017 | [.zip (9.1GB)](https://s3-ap-northeast-1.amazonaws.com/chakki.esg.financial.jp/dataset/release/chakki_esg_financial_2017.zip) | [.zip (309.4MB)](https://s3-ap-northeast-1.amazonaws.com/chakki.esg.financial.jp/dataset/release/chakki_esg_financial_extracted_2017.zip) | | 2018 | [.zip (10.5GB)](https://s3-ap-northeast-1.amazonaws.com/chakki.esg.financial.jp/dataset/release/chakki_esg_financial_2018.zip) | [.zip (260.9MB)](https://s3-ap-northeast-1.amazonaws.com/chakki.esg.financial.jp/dataset/release/chakki_esg_financial_extracted_2018.zip) | * financial data is from [決算短信情報](http://db-ec.jpx.co.jp/category/C027/). * We use non-cosolidated data if it exist. * stock data is from [月間相場表（内国株式）](http://db-ec.jpx.co.jp/category/C021/STAT1002.html). * `close` is fiscal period end and `open` is 1 year before of it. ### Past release * [v1.0](https://github.com/chakki-works/CoARiJ/blob/master/releases/v1.0.md) ### Statistics | fiscal_year | number_of_reports | has_csr_reports | has_financial_data | has_stock_data | |-------------|-------------------|-----------------|--------------------|----------------| | 2014 | 3,724 | 92 | 3,583 | 3,595 | | 2015 | 3,870 | 96 | 3,725 | 3,751 | | 2016 | 4,066 | 97 | 3,924 | 3,941 | | 2017 | 3,578 | 89 | 3,441 | 3,472 | | 2018 | 3,513 | 70 | 2,893 | 3,413 | ### File structure #### Raw file version (`--kind F`) The structure of dataset is following. ``` chakki_esg_financial_{year}.zip └──{year} ├── documents.csv └── docs/ ``` `docs` includes XBRL and PDF file. * XBRL file of annual reports (files are retrieved from [EDINET](http://disclosure.edinet-fsa.go.jp/)). * PDF file of CSR reports (additional content). `documents.csv` has metadata like following. Please refer the detail at [Wiki](https://github.com/chakki-works/CoARiJ/wiki/Columns-on-the-file). * edinet_code: `E0000X` * filer_name: `XXX株式会社` * fiscal_year: `201X` * fiscal_period: `FY` * doc_path: `docs/S000000X.xbrl` * csr_path: `docs/E0000X_201X_JP_36.pdf` #### Text extracted version (`--kind E`) Text extracted version includes `txt` files that match each part of an annual report. The extracted parts are defined at [`xbrr`](https://github.com/chakki-works/xbrr/blob/master/docs/edinet.md). ``` chakki_esg_financial_{year}_extracted.zip └──{year} ├── documents.csv └── docs/ ``` ## Tool You can download dataset by command line tool. ``` pip install coarij ``` Please refer the usage by `--` (using [fire](https://github.com/google/python-fire)). ``` coarij -- ``` Example command. ```bash # Download raw file version dataset of 2014. coarij download --kind F --year 2014 # Extract business.overview_of_result part of TIS.Inc (sec code=3626). coarij extract business.overview_of_result --sec_code 3626 # Tokenize text by Janome (`janome` or `sudachi` is supported). pip install janome coarij tokenize --tokenizer janome # Show tokenized result (words are separated by \t). head -n 5 data/processed/2014/docs/S100552V_business_overview_of_result_tokenized.txt 1 【業績等の概要】 ( 1 ) 業績当連結会計年度における我が国経済は、消費税率引上げに伴う駆け込み需要の反動や海外景気動向に対する先行き懸念等から弱い動きも見られましたが、企業収益の改善等により全体 ... ``` If you want to download latest dataset, please specify `--version master` when download the data. * About the parsable part, please refer the [`xbrr`](https://github.com/chakki-works/xbrr/blob/master/docs/edinet.md). You can use `Ledger` to select your necessary file from overall CoARiJ dataset. ```python from coarij.storage import Storage storage = Storage("your/data/directory") ledger = storage.get_ledger() collected = ledger.collect(edinet_code="E00021") ```

نحوه نصب

نصب پکیج whl coarij-0.2.8:

pip install coarij-0.2.8.whl

نصب پکیج tar.gz coarij-0.2.8:

pip install coarij-0.2.8.tar.gz