معرفی شرکت ها


data-engineering-extract-metadata-2.0.0


Card image cap
تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر
Card image cap
تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر
Card image cap
تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر
Card image cap
تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر
Card image cap
تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

توضیحات

A python package for extracting and formatting table metadata from databases
ویژگی مقدار
سیستم عامل -
نام فایل data-engineering-extract-metadata-2.0.0
نام data-engineering-extract-metadata
نسخه کتابخانه 2.0.0
نگهدارنده []
ایمیل نگهدارنده []
نویسنده Alec Johnson
ایمیل نویسنده alec.johnson@digital.justice.gov.uk
آدرس صفحه اصلی https://github.com/moj-analytical-services/data-engineering-extract-metadata
آدرس اینترنتی https://pypi.org/project/data-engineering-extract-metadata/
مجوز MIT
# Extract metadata for data engineering pipelines This repo lets you extract metadata from a database and shape it into a folder of json files that [etl_manager](https://github.com/moj-analytical-services/etl_manager) can read. The `create_all_metadata` function will do most of the process in one go. See 'quickstart' below for a summary, or more detailed documentation below that. The end result will be: - a json file containing a filtered list of tables - a subfolder for the metadata - in that subfolder, a database.json file with overall metadata for the database - also in that subfolder, another .json file listing the columns and other metadata for each table ## Requirements This runs in Python 3.6+. You'll need to [install cx_Oracle 8.0.0+](https://cx-oracle.readthedocs.io/en/latest/user_guide/installation.html). Installing cx_Oracle also involves installing its client libraries. You can [download it from Oracle](https://www.oracle.com/database/technologies/instant-client/downloads.html) or, if you're on Mac, install by Homebrew: `brew tap InstantClientTap/instantclient` `brew install instantclient-basic` ### cx_Oracle problems 1: client host name If you're on a Mac, you might get errors like "cx_Oracle.DatabaseError: ORA-24454: client host name is not set". If you do, you'll need to adjust your hosts file. To do this: - go to system preferences, then sharing, and note the computer name at the top - go to your hard disk, then the `etc` folder and find the `hosts` file - back up your hosts file in case anything weird happens - in hosts, find the line that says `127.0.0.1 localhost` - under it, add a new line that says `127.0.0.01 computer_name`, where computer_name is the one you got from system preferences/sharing - save the hosts file ### cx_Oracle problems 2: client library location Depending on how you installed the Oracle client libraries, the create_oracle_connection function might not work with default parameters. If it won't connect, try specifying the location of your client libraries using the oracle_client_lib parameter. ## Quick start Here's an example of creating a full metadata folder by using create_all_metadata and a custom function to filter the table list: ``` python from pathlib import Path from extract_metadata.connect import create_oracle_connection from extract_metadata.metadata import create_all_metadata def table_filter(table_list): """Takes a list of (table, tablespace) tuples and filters it down to tables that have 'REPLICATION_TEST' in their name""" return [t[0] for t in table_list if "REPLICATION_TEST" in t[0]] settings_location = Path.cwd().parent / "settings_folder" connection = create_oracle_connection("delius_sandpit", settings_location) create_all_metadata( connection, save_folder="delius", title="delius_sandpit_test", description="Here's a description", schema="DELIUS_ANALYTICS_PLATFORM", source_bucket="mojap-raw-hist-dev", source_folder="hmpps/delius/DELIUS_ANALYTICS_PLATFORM", filter_function=table_filter, ) connection.close() ``` If you save this in a script called `get_metadata.py` in /metadata/ folder, you'll end up with this folder structure: ``` metadata |-- get_metadata.py |-- delius | |-- delius_sandpit_test.json | |-- delius_sandpit_test | | |-- database.json | | |-- table1.json | | |-- table2.json ``` ## Step by step There are 3 steps to getting metadata from a database: 1. Connect using connect.create_oracle_connection 2. Make a list of all the tables and filter it to the ones you want - you can do both with table_list.get_table_list 3. Get metadata from the filtered list of tables using metadata.create_metadata_folder ### 1. Connecting You'll need some database connection settings. These should be in a json file structured like this: ``` json { "host": "HOST", "password": "PASSWORD", "port": 1234, "service_name": "SERVICE_NAME", "user": "database_username" } ``` Then pass this file and its location to create_oracle_connection in the connect module. You might also need to use the oracle_client_lib parameter to specify where your Oracle client libraries are. See the cx_Oracle connection problems, above. ### 2. Making a list of tables get_table_list in the table_list module will use your connection to get a list of tables from the database. It will create a timestamped json file listing the tables. Specify the file's save_folder and title when you call the function. Also specify the database schema to get the tables from. You might not want the file to list all the tables. In this case, pass a filter_function as well. This should be a function that takes a single argument - a list, in this format: `[("TABLE_NAME", "TABLESPACE"), ("ANOTHER_TABLE_NAME", "TABLESPACE")]` Return the tables you want as a list of table names. So if you only wanted the first one from this list, you'd want to return: `["TABLE_NAME"]` ### 3. Get the metadata for the tables To read a folder with etl_manager, it must contain a database.json file specifying overall features of the database, plus a json file for each table. You can create these in one go using metadata.create_metadata_folder, or separately with metadata.create_json_for_database and metadata.create_json_for_tables. ## Tests Unit tests are written for pytest. Run `pytest` from the root folder to start them. Where functions involve SQL queries, the unit tests don't check these queries - only the Python surrounding them. ## Githooks This repo comes with some githooks to make standard checks before you commit files to Github. The checks are: - if you're using git-crypt, run `git-crypt status` and check for unencrypted file warnings - run Black on Python files - run Flake8 on Python files - run yamllint on yaml files If you want to use these, run this command from the repo's root directory: `git config core.hooksPath githooks` See the [data engineering template repo](https://github.com/moj-analytical-services/data-engineering-template) for details. ## Licence [MIT Licence](LICENCE.md) # Extract metadata for data engineering pipelines This repo lets you extract metadata from a database and shape it into a folder of json files that [etl_manager](https://github.com/moj-analytical-services/etl_manager) can read. The `create_all_metadata` function will do most of the process in one go. See 'quickstart' below for a summary, or more detailed documentation below that. The end result will be: - a json file containing a filtered list of tables - a subfolder for the metadata - in that subfolder, a database.json file with overall metadata for the database - also in that subfolder, another .json file listing the columns and other metadata for each table ## Requirements This runs in Python 3.6+. You'll need to [install cx_Oracle 8.0.0+](https://cx-oracle.readthedocs.io/en/latest/user_guide/installation.html). Installing cx_Oracle also involves installing its client libraries. You can [download it from Oracle](https://www.oracle.com/database/technologies/instant-client/downloads.html) or, if you're on Mac, install by Homebrew: `brew tap InstantClientTap/instantclient` `brew install instantclient-basic` ### cx_Oracle problems 1: client host name If you're on a Mac, you might get errors like "cx_Oracle.DatabaseError: ORA-24454: client host name is not set". If you do, you'll need to adjust your hosts file. To do this: - go to system preferences, then sharing, and note the computer name at the top - go to your hard disk, then the `etc` folder and find the `hosts` file - back up your hosts file in case anything weird happens - in hosts, find the line that says `127.0.0.1 localhost` - under it, add a new line that says `127.0.0.01 computer_name`, where computer_name is the one you got from system preferences/sharing - save the hosts file ### cx_Oracle problems 2: client library location Depending on how you installed the Oracle client libraries, the create_oracle_connection function might not work with default parameters. If it won't connect, try specifying the location of your client libraries using the oracle_client_lib parameter. ## Quick start Here's an example of creating a full metadata folder by using create_all_metadata and a custom function to filter the table list: ``` python from pathlib import Path from extract_metadata.connect import create_oracle_connection from extract_metadata.metadata import create_all_metadata def table_filter(table_list): """Takes a list of (table, tablespace) tuples and filters it down to tables that have 'REPLICATION_TEST' in their name""" return [t[0] for t in table_list if "REPLICATION_TEST" in t[0]] settings_location = Path.cwd().parent / "settings_folder" connection = create_oracle_connection("delius_sandpit", settings_location) create_all_metadata( connection, save_folder="delius", title="delius_sandpit_test", description="Here's a description", schema="DELIUS_ANALYTICS_PLATFORM", source_bucket="mojap-raw-hist-dev", source_folder="hmpps/delius/DELIUS_ANALYTICS_PLATFORM", filter_function=table_filter, ) connection.close() ``` If you save this in a script called `get_metadata.py` in /metadata/ folder, you'll end up with this folder structure: ``` metadata |-- get_metadata.py |-- delius | |-- delius_sandpit_test.json | |-- delius_sandpit_test | | |-- database.json | | |-- table1.json | | |-- table2.json ``` ## Step by step There are 3 steps to getting metadata from a database: 1. Connect using connect.create_oracle_connection 2. Make a list of all the tables and filter it to the ones you want - you can do both with table_list.get_table_list 3. Get metadata from the filtered list of tables using metadata.create_metadata_folder ### 1. Connecting You'll need some database connection settings. These should be in a json file structured like this: ``` json { "host": "HOST", "password": "PASSWORD", "port": 1234, "service_name": "SERVICE_NAME", "user": "database_username" } ``` Then pass this file and its location to create_oracle_connection in the connect module. You might also need to use the oracle_client_lib parameter to specify where your Oracle client libraries are. See the cx_Oracle connection problems, above. ### 2. Making a list of tables get_table_list in the table_list module will use your connection to get a list of tables from the database. It will create a timestamped json file listing the tables. Specify the file's save_folder and title when you call the function. Also specify the database schema to get the tables from. You might not want the file to list all the tables. In this case, pass a filter_function as well. This should be a function that takes a single argument - a list, in this format: `[("TABLE_NAME", "TABLESPACE"), ("ANOTHER_TABLE_NAME", "TABLESPACE")]` Return the tables you want as a list of table names. So if you only wanted the first one from this list, you'd want to return: `["TABLE_NAME"]` ### 3. Get the metadata for the tables To read a folder with etl_manager, it must contain a database.json file specifying overall features of the database, plus a json file for each table. You can create these in one go using metadata.create_metadata_folder, or separately with metadata.create_json_for_database and metadata.create_json_for_tables. ## Tests Unit tests are written for pytest. Run `pytest` from the root folder to start them. Where functions involve SQL queries, the unit tests don't check these queries - only the Python surrounding them. ## How to update Update and release new versions using Poetry. Make sure to change the version number in `pyproject.toml` and describe the change in `CHANGELOG.md`. If you've changed any dependencies in `pyproject.yaml`, run `poetry update` to update `poetry.lock`. Once you've created a release in GitHub, to publish the latest version to PyPI, run: ``` poetry build poetry publish -u <username> ``` Here, you should replace `<username>` with your PyPI username. To publish to PyPI, you must be an owner of the project. ## Licence [MIT Licence](LICENCE.md)


نیازمندی

مقدار نام
>=8.0.0,<9.0.0 cx_Oracle


زبان مورد نیاز

مقدار نام
>=3.6,<4.0 Python


نحوه نصب


نصب پکیج whl data-engineering-extract-metadata-2.0.0:

    pip install data-engineering-extract-metadata-2.0.0.whl


نصب پکیج tar.gz data-engineering-extract-metadata-2.0.0:

    pip install data-engineering-extract-metadata-2.0.0.tar.gz