Description

Build and deploy a serverless data pipeline with no effort on AWS.

Attribute	Value
Operating system	-
File name	datajob-0.9.0
Name	datajob
Library version	0.9.0
Maintainer	[]
Maintainer email	[]
Author	Vincent Claes
Author email	vincent.v.claes@gmail.com
Home page	https://github.com/vincentclaes/datajob
URL	https://pypi.org/project/datajob/
License	Apache Software License (Apache 2.0)

[![Awesome](https://awesome.re/badge.svg)](https://github.com/kolomied/awesome-cdk#high-level-frameworks)

![logo](./assets/logo.png)

<div align="center">
  <b>Build and deploy a serverless data pipeline on AWS with no effort.</b></br>
  <i>Our goal is to let developers think about the business logic, datajob does the rest...</i>
  </br>
  </br>
  </br>
</div>

</br>

- Deploy code to python shell / pyspark **AWS Glue jobs**.
- Use **AWS Sagemaker** to create ML Models.
- Orchestrate the above jobs using **AWS Stepfunctions** as simple as `task1 >> task2`
- Let us [know](https://github.com/vincentclaes/datajob/discussions) **what you want to see next**.

</br>

<div align="center">

:rocket: :new: :rocket:
</br>
</br>

[Check our new example of an End-to-end Machine Learning Pipeline with Glue, Sagemaker and Stepfunctions](examples/ml_pipeline_end_to_end)

</br>
</br>
:rocket: :new: :rocket:
</br></br>
</div>

</br>

# Installation

Datajob can be installed using pip. <br/>
Beware that we depend on [aws cdk cli](https://github.com/aws/aws-cdk)!

    pip install datajob
    npm install -g aws-cdk@1.109.0 # the latest version of datajob depends on this version

# Quickstart

You can find the full example in [examples/data_pipeline_simple](./examples/data_pipeline_simple/).

We have a simple data pipeline composed of [2 glue jobs](./examples/data_pipeline_simple/glue_jobs/) orchestrated sequentially using step functions.

```python
from aws_cdk import core

from datajob.datajob_stack import DataJobStack
from datajob.glue.glue_job import GlueJob
from datajob.stepfunctions.stepfunctions_workflow import StepfunctionsWorkflow

app = core.App()

# The datajob_stack is the instance that will result in a cloudformation stack.
# We inject the datajob_stack object through all the resources that we want to add.
with DataJobStack(scope=app, id="data-pipeline-simple") as datajob_stack:

    # We define 2 glue jobs with the relative path to the source code.
    task1 = GlueJob(
        datajob_stack=datajob_stack, name="task1", job_path="glue_jobs/task.py"
    )
    task2 = GlueJob(
        datajob_stack=datajob_stack, name="task2", job_path="glue_jobs/task2.py"
    )

    # We instantiate a step functions workflow and orchestrate the glue jobs.
    with StepfunctionsWorkflow(datajob_stack=datajob_stack, name="workflow") as sfn:
        task1 >> task2

app.synth()
```

We add the above code in a file called `datajob_stack.py` in the [root of the project](./examples/data_pipeline_with_packaged_project/).

### Configure CDK

Follow the steps [here](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html#cli-configure-quickstart-config) to configure your credentials.

```shell script
export AWS_PROFILE=default
# use the aws cli to get your account number
export AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text --profile $AWS_PROFILE)
export AWS_DEFAULT_REGION=eu-west-1

# init cdk
cdk bootstrap aws://$AWS_ACCOUNT/$AWS_DEFAULT_REGION
```

### Deploy

Deploy the pipeline using CDK.

```shell
cd examples/data_pipeline_simple
cdk deploy --app "python datajob_stack.py" --require-approval never
```

### Execute

```shell script
datajob execute --state-machine data-pipeline-simple-workflow
```

The terminal will show a link to the step functions page to follow up on your pipeline run.

![sfn](./assets/sfn.png)

### Destroy

```shell script
cdk destroy --app "python datajob_stack.py"
```

# Examples

- [Data pipeline with parallel steps](./examples/data_pipeline_parallel/)
- [Data pipeline for processing big data using PySpark](./examples/data_pipeline_pyspark/)
- [Data pipeline where you package and ship your project as a wheel](./examples/data_pipeline_with_packaged_project/)
- [Machine Learning pipeline where we combine glue jobs with sagemaker](examples/ml_pipeline_end_to_end)

All our examples are in [./examples](./examples)

# Functionality

<details>
<summary>Deploy to a stage</summary>

Specify a stage to deploy an isolated pipeline.

Typical examples would be `dev`, `prod`, ...

```shell
cdk deploy --app "python datajob_stack.py" --context stage=my-stage
```

</details>

<details>
<summary>Using datajob's S3 data bucket</summary>

Dynamically pass the `datajob_stack` data bucket name in the arguments of your GlueJob by calling `datajob_stack.context.data_bucket_name`.

```python
import pathlib

from aws_cdk import core
from datajob.datajob_stack import DataJobStack
from datajob.glue.glue_job import GlueJob
from datajob.stepfunctions.stepfunctions_workflow import StepfunctionsWorkflow

current_dir = str(pathlib.Path(__file__).parent.absolute())

app = core.App()

with DataJobStack(
    scope=app, id="datajob-python-pyspark", project_root=current_dir
) as datajob_stack:

    pyspark_job = GlueJob(
        datajob_stack=datajob_stack,
        name="pyspark-job",
        job_path="glue_job/glue_pyspark_example.py",
        job_type="glueetl",
        glue_version="2.0",  # we only support glue 2.0
        python_version="3",
        worker_type="Standard",  # options are Standard / G.1X / G.2X
        number_of_workers=1,
        arguments={
            "--source": f"s3://{datajob_stack.context.data_bucket_name}/raw/iris_dataset.csv",
            "--destination": f"s3://{datajob_stack.context.data_bucket_name}/target/pyspark_job/iris_dataset.parquet",
        },
    )

    with StepfunctionsWorkflow(datajob_stack=datajob_stack, name="workflow") as sfn:
        pyspark_job >> ...
```

You can find this example [here](./examples/data_pipeline_pyspark/glue_job/glue_pyspark_example.py).

</details>

<details>
<summary>Deploy files to the datajob's deployment bucket</summary>

Specify the path to the folder we would like to include in the deployment bucket.

```python
from aws_cdk import core
from datajob.datajob_stack import DataJobStack

app = core.App()

with DataJobStack(
    scope=app, id="some-stack-name", include_folder="path/to/folder/"
) as datajob_stack:

    ...
```

</details>

<details>
<summary>Package your project as a wheel and ship it to AWS</summary>

You can find the example [here](./examples/data_pipeline_with_packaged_project/).

```python
# We add the path to the project root in the constructor of DataJobStack.
# By specifying project_root, datajob will look for a .whl in
# the dist/ folder in your project_root.
with DataJobStack(
    scope=app, id="data-pipeline-pkg", project_root=current_dir
) as datajob_stack:
    ...  # add your resources here
```

Package your project using [poetry](https://python-poetry.org/)

```shell
poetry build
cdk deploy --app "python datajob_stack.py"
```

Package your project using [setup.py](./examples/data_pipeline_with_packaged_project)

```shell
python setup.py bdist_wheel
cdk deploy --app "python datajob_stack.py"
```

You can also use the datajob cli to run the two commands at once:

```shell
# for poetry
datajob deploy --config datajob_stack.py --package poetry

# for setup.py
datajob deploy --config datajob_stack.py --package setuppy
```

</details>

<details>
<summary>Processing big data using a Glue Pyspark job</summary>

```python
import pathlib

from aws_cdk import core
from datajob.datajob_stack import DataJobStack
from datajob.glue.glue_job import GlueJob

current_dir = str(pathlib.Path(__file__).parent.absolute())

app = core.App()

with DataJobStack(
    scope=app, id="datajob-python-pyspark", project_root=current_dir
) as datajob_stack:

    pyspark_job = GlueJob(
        datajob_stack=datajob_stack,
        name="pyspark-job",
        job_path="glue_job/glue_pyspark_example.py",
        job_type="glueetl",
        glue_version="2.0",  # we only support glue 2.0
        python_version="3",
        worker_type="Standard",  # options are Standard / G.1X / G.2X
        number_of_workers=1,
        arguments={
            "--source": f"s3://{datajob_stack.context.data_bucket_name}/raw/iris_dataset.csv",
            "--destination": f"s3://{datajob_stack.context.data_bucket_name}/target/pyspark_job/iris_dataset.parquet",
        },
    )
```

The full example can be found in [examples/data_pipeline_pyspark](examples/data_pipeline_pyspark).

</details>

<details>
<summary>Orchestrate stepfunctions tasks in parallel</summary>

```python
# Task2 comes after task1. Task4 comes after task3.
# Task5 depends on both task2 and task4 to be finished.
# Therefore task1 and task3 can run in parallel,
# as well as task2 and task4.
with StepfunctionsWorkflow(datajob_stack=datajob_stack, name="workflow") as sfn:
    task1 >> task2
    task3 >> task4
    task2 >> task5
    task4 >> task5
```

More can be found in [examples/data_pipeline_parallel](./examples/data_pipeline_parallel)

</details>

<details>
<summary>Orchestrate 1 stepfunction task</summary>

Use the [Ellipsis](https://docs.python.org/dev/library/constants.html#Ellipsis) object to be able to orchestrate 1 job via step functions.

```python
some_task >> ...
```

</details>

<details>
<summary>Notify in case of error/success</summary>

Provide the parameter `notification` in the constructor of a `StepfunctionsWorkflow` object.
This will create an SNS Topic which will be triggered in case of failure or success.
The email address is subscribed to the topic and receives the notifications in its inbox.

```python
with StepfunctionsWorkflow(datajob_stack=datajob_stack,
                           name="workflow",
                           notification="email@domain.com") as sfn:
    task1 >> task2
```

You can provide 1 email or a list of emails `["email1@domain.com", "email2@domain.com"]`.

</details>

# Datajob in depth

The `datajob_stack` is the instance that will result in a cloudformation stack.
The path in `project_root` helps `datajob_stack` locate the root of the project where
the setup.py/poetry pyproject.toml file can be found, as well as the `dist/` folder with the wheel of your project.

```python
import pathlib

from aws_cdk import core
from datajob.datajob_stack import DataJobStack

current_dir = pathlib.Path(__file__).parent.absolute()
app = core.App()

with DataJobStack(
    scope=app, id="data-pipeline-pkg", project_root=current_dir
) as datajob_stack:

    ...
```

When __entering the context manager__ of DataJobStack:

A [DataJobContext](./datajob/datajob_stack.py#L48) is initialized
to deploy and run a data pipeline on AWS.
The following resources are created:

1) "data bucket"
    - an S3 bucket that you can use to dump ingested data, dump intermediate results and the final output.
    - you can access the data bucket as a [Bucket](https://docs.aws.amazon.com/cdk/api/latest/python/aws_cdk.aws_s3/Bucket.html) object via ```datajob_stack.context.data_bucket```
    - you can access the data bucket name via ```datajob_stack.context.data_bucket_name```
2) "deployment bucket"
    - an S3 bucket to deploy code, artifacts, scripts, config, files, ...
    - you can access the deployment bucket as a [Bucket](https://docs.aws.amazon.com/cdk/api/latest/python/aws_cdk.aws_s3/Bucket.html) object via ```datajob_stack.context.deployment_bucket```
    - you can access the deployment bucket name via ```datajob_stack.context.deployment_bucket_name```

When __exiting the context manager__, all the resources of our DataJobStack object are created.

<details>
<summary>We can write the above example more explicitly...</summary>

```python
import pathlib

from aws_cdk import core

from datajob.datajob_stack import DataJobStack
from datajob.glue.glue_job import GlueJob
from datajob.stepfunctions.stepfunctions_workflow import StepfunctionsWorkflow

current_dir = pathlib.Path(__file__).parent.absolute()

app = core.App()

datajob_stack = DataJobStack(scope=app, id="data-pipeline-pkg", project_root=current_dir)
datajob_stack.init_datajob_context()

task1 = GlueJob(datajob_stack=datajob_stack, name="task1", job_path="glue_jobs/task.py")
task2 = GlueJob(datajob_stack=datajob_stack, name="task2", job_path="glue_jobs/task2.py")

with StepfunctionsWorkflow(datajob_stack=datajob_stack, name="workflow") as step_functions_workflow:
    task1 >> task2

datajob_stack.create_resources()
app.synth()
```

</details>

# Ideas

Any suggestions can be shared by starting a [discussion](https://github.com/vincentclaes/datajob/discussions).

These are the ideas we find interesting to implement:

- add a time based trigger to the step functions workflow.
- add an s3 event trigger to the step functions workflow.
- add a lambda that copies data from one s3 location to another.
- version your data pipeline.
- cli command to view the logs / glue jobs / s3 bucket
- implement sagemaker services
    - processing jobs
    - hyperparameter tuning jobs
    - training jobs
- implement lambda
- implement ECS Fargate
- create a serverless UI that follows up on the different pipelines deployed on possibly different AWS accounts using Datajob

> [Feedback](https://github.com/vincentclaes/datajob/discussions) is much appreciated!
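
To build intuition for the `task1 >> task2` syntax and the parallel orchestration described under Functionality, here is a minimal, self-contained sketch. It is not datajob's implementation: the `Workflow` and `Task` classes and the `stages` method are hypothetical names invented for this illustration. It only assumes that `>>` is handled through Python's `__rshift__` operator hook, that the edges are recorded while the `with` block is active, and that the Ellipsis form `some_task >> ...` registers a single task.

```python
from collections import defaultdict


class Workflow:
    """Hypothetical stand-in for a workflow that records `a >> b` edges."""

    current = None  # the workflow whose `with` block is currently active

    def __init__(self, name):
        self.name = name
        self.depends_on = defaultdict(set)  # task name -> names it waits for

    def __enter__(self):
        Workflow.current = self
        return self

    def __exit__(self, exc_type, exc, tb):
        Workflow.current = None
        return False  # do not swallow exceptions

    def add_task(self, task):
        self.depends_on[task.name]  # touching the key registers the task

    def add_edge(self, upstream, downstream):
        self.add_task(upstream)
        self.depends_on[downstream.name].add(upstream.name)

    def stages(self):
        """Group tasks into stages; tasks within one stage can run in parallel."""
        remaining = {task: set(deps) for task, deps in self.depends_on.items()}
        result = []
        while remaining:
            ready = sorted(task for task, deps in remaining.items() if not deps)
            if not ready:
                raise ValueError("cyclic dependency between tasks")
            result.append(ready)
            for task in ready:
                del remaining[task]
            for deps in remaining.values():
                deps.difference_update(ready)
        return result


class Task:
    """Hypothetical task that only carries a name."""

    def __init__(self, name):
        self.name = name

    def __rshift__(self, other):
        if Workflow.current is None:
            raise RuntimeError("use `>>` inside a `with Workflow(...)` block")
        if other is Ellipsis:
            # `some_task >> ...` registers a single task without a successor
            Workflow.current.add_task(self)
            return self
        Workflow.current.add_edge(self, other)
        return other  # returning the right-hand side allows `a >> b >> c`


task1, task2, task3, task4, task5 = (Task(f"task{i}") for i in range(1, 6))

with Workflow("workflow") as wf:
    task1 >> task2
    task3 >> task4
    task2 >> task5
    task4 >> task5

print(wf.stages())  # [['task1', 'task3'], ['task2', 'task4'], ['task5']]
```

Grouping tasks by "all dependencies finished" is one simple way to see which tasks may run side by side. Datajob lists `toposort` among its requirements, which suggests a similar ordering step, but that is an inference and not something the README states.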

Requirements

Version constraint	Name
>=2.1.0,<3.0.0	stepfunctions
>=2.4,<3.0	contextvars
>=0.3.2,<0.4.0	typer
>=1.181,<2.0	aws-cdk.core
>=1.181,<2.0	aws-cdk.aws-glue
>=1.181,<2.0	aws-cdk.aws-s3-deployment
>=1.181,<2.0	aws-cdk.aws-stepfunctions
>=1.181,<2.0	aws-cdk.aws-sns-subscriptions
>=9.13.0,<10.0.0	rich
>=1.6,<2.0	toposort

Required language

Version constraint	Name
>=3.8,<4.0	Python

How to install

Install the datajob-0.9.0 whl package:

pip install datajob-0.9.0.whl

Install the datajob-0.9.0 tar.gz package:

pip install datajob-0.9.0.tar.gz
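
After installing, it can be useful to confirm which version ended up in the active environment. The sketch below uses only the standard library (`importlib.metadata`, available from Python 3.8, which matches the language requirement listed above). The helper name `installed_version` is made up for this example and is not part of datajob.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("datajob")
if version is None:
    print("datajob is not installed in this environment")
else:
    print(f"datajob {version} is installed")
```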