Technical Documentation
This section provides an in-depth look at the architecture, configuration, and core concepts behind Kedro-Dagster. Here you'll find details on how Kedro projects are mapped to Dagster constructs, how to configure orchestration, and how to customize the integration for advanced use cases.
Danger
This documentation section is a work in progress. The translation configuration and logic are not yet fully documented here. Please check back later for a more complete guide!
Project Configuration
Kedro-Dagster expects a standard Kedro project structure. The main configuration file for the Dagster integration is `dagster.yml`, located in your Kedro project's `conf/<ENV_NAME>/` directory.
dagster.yml
This YAML file defines jobs, executors, and schedules for your project.
Example
```yaml
schedules:
  my_job_schedule: # Name of the schedule
    cron_schedule: "0 0 * * *" # Parameters of the schedule

executors:
  my_executor: # Name of the executor
    multiprocess: # Parameters of the executor
      max_concurrent: 2

jobs:
  my_job: # Name of the job
    pipeline: # Parameters of its corresponding pipeline
      pipeline_name: __default__
      node_namespace: my_namespace
    executor: my_executor
    schedule: my_job_schedule
```
- `jobs`: Map Kedro pipelines to Dagster jobs, with optional filtering.
- `executors`: Define how jobs are executed (in-process, multiprocess, Kubernetes, etc.) by choosing from the executors implemented in Dagster.
- `schedules`: Set up cron-based or custom schedules for jobs.
Customizing Schedules
You can define multiple schedules for your jobs using cron syntax. See the Dagster scheduling documentation and the API Reference for more details.
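For instance, here is a minimal sketch defining two cron-based schedules, following the schema shown above (the schedule names are illustrative):

```yaml
schedules:
  my_daily_schedule:   # runs every day at midnight
    cron_schedule: "0 0 * * *"
  my_hourly_schedule:  # runs at the start of every hour
    cron_schedule: "0 * * * *"
```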
Customizing Executors
Kedro-Dagster supports several executor types for running your jobs, such as in-process, multiprocess, Dask, Docker, Celery, and Kubernetes. You can customize executor options in your `dagster.yml` file under the `executors` section.
For each available Dagster executor, there is a corresponding Pydantic configuration model documented in the API Reference.
Example: Custom Multiprocess Executor
We can select `multiprocess` as the executor type, corresponding to the multiprocess Dagster executor, and configure it according to `MultiprocessExecutorOptions`.
```yaml
executors:
  my_multiprocess_executor:
    multiprocess:
      max_concurrent: 4
```
Example: Custom Docker Executor
Similarly, we can configure a Dagster Docker executor with the available parameters defined in `DockerExecutorOptions`.
```yaml
executors:
  my_docker_executor:
    docker_executor:
      image: my-custom-image:latest
      registry: "my_registry.com"
      network: "my_network"
      networks: ["my_network_1", "my_network_2"]
      container_kwargs:
        volumes:
          - "/host/path:/container/path"
        environment:
          - "ENV_VAR=value"
```
Note
The `docker_executor` requires the `dagster-docker` package.
Customizing Jobs
You can filter which nodes, tags, or inputs/outputs are included in each job. See the Kedro pipeline documentation for more on pipelines and filtering. The accepted pipeline parameters are documented in the associated Pydantic model, `PipelineOptions`.
Each job can also be assigned a schedule and/or an executor by name, provided they were previously defined in the configuration file.
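As a sketch, assuming the pipeline filter fields mirror Kedro's standard pipeline filtering arguments (the pipeline name, `tags` value, and the executor and schedule names below are illustrative):

```yaml
jobs:
  my_filtered_job:
    pipeline:
      pipeline_name: data_processing  # hypothetical registered pipeline
      tags: ["daily"]                 # assumed filter: keep only nodes tagged "daily"
    executor: my_executor             # executor defined under `executors`
    schedule: my_job_schedule         # schedule defined under `schedules`
```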
definitions.py
The `definitions.py` file is auto-generated by the plugin and serves as the main entry point for Dagster to discover all translated Kedro objects. It contains the Dagster `Definitions` object, which registers all jobs, assets, resources, schedules, and sensors derived from your Kedro project.
In most cases, you should not manually edit `definitions.py`; instead, update your Kedro project or `dagster.yml` configuration.
Kedro-Dagster Concept Mapping
Kedro-Dagster translates core Kedro concepts into their Dagster equivalents. Understanding this mapping helps you reason about how your Kedro project appears and behaves in Dagster.
| Kedro Concept | Dagster Concept | Description |
|---|---|---|
| Node | Op, Asset | Each Kedro node becomes a Dagster op; nodes that return outputs are also mapped to assets. Node parameters are passed as config. |
| Pipeline | Job | Each Kedro pipeline is translated into a Dagster job. Jobs can be filtered, scheduled, and assigned an executor. |
| Dataset | Asset, IO Manager | Each dataset in the Kedro Data Catalog becomes a Dagster asset managed by a dedicated IO manager. |
| Hooks | Hooks, Sensors | Kedro hooks are executed at the appropriate points in the Dagster job lifecycle. |
| Parameters | Config, Resources | Kedro parameters are passed as Dagster config. |
| Logging | Logger | Kedro logging is integrated with Dagster's logging system. |
Catalog
Kedro-Dagster translates Kedro datasets into Dagster assets and IO managers. This allows you to use Kedro's Data Catalog with Dagster's asset materialization and IO management features.
For the Kedro pipelines specified in `dagster.yml`, the following Dagster objects are defined:
- External assets: Input datasets of the pipelines are registered as Dagster external assets.
- Assets: Output datasets of the pipelines are defined as Dagster assets.
- IO Managers: A dedicated Dagster IO manager is created for each dataset involved in the deployed pipelines, wrapping the dataset's save and load functions.
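For illustration, a standard Kedro `catalog.yml` entry such as the following (the dataset name and filepath are hypothetical) would be surfaced in Dagster as an asset backed by its own IO manager:

```yaml
companies:
  type: pandas.CSVDataset
  filepath: data/01_raw/companies.csv
```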
See the API reference for `CatalogTranslator` for more details.
Node
Kedro nodes are translated into Dagster ops and assets. Each node becomes a Dagster op, and nodes that return outputs are additionally mapped to Dagster multi-assets.
For the Kedro pipelines specified in `dagster.yml`, the following Dagster objects are defined:
- Ops: Each Kedro node within the pipelines is mapped to a Dagster op.
- Assets: Kedro nodes that return output datasets are registered as Dagster multi-assets.
- Parameters: Node parameters are passed as Dagster config so that they can be modified from the Dagster run launchpad.
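As an illustration, a Kedro `parameters.yml` entry like the following (names and values are hypothetical) would appear as editable run config for the ops that consume it:

```yaml
model_options:
  test_size: 0.2
  random_state: 3
```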
See the API reference for `NodeTranslator` for more details.
Pipeline
Kedro pipelines are translated into Dagster jobs. Each job can be filtered, scheduled, and assigned an executor via configuration.
- Jobs: Each pipeline is mapped to a Dagster job.
- Filtering: Jobs can be defined granularly from Kedro pipelines by filtering their nodes, namespaces, tags, and inputs/outputs.
See the API reference for `PipelineTranslator` for more details.
Hook
Kedro-Dagster preserves all Kedro hooks in the Dagster context. Hooks are executed at the appropriate points in the Dagster job lifecycle. Catalog hooks are called in the `handle_output` and `load_input` methods of each Dagster IO manager. Node hooks are plugged into the appropriate Dagster ops. Context hooks are called, along with the `before_pipeline_run` pipeline hook, within a Dagster op that runs at the beginning of each job, and the `after_pipeline_run` hook is called in a Dagster op that runs at the end of each job. Finally, the `on_pipeline_error` pipeline hook is embedded in a dedicated Dagster sensor triggered by a run failure.
Next Steps
- Getting Started: Follow the step-by-step tutorial to set up Kedro-Dagster in your project.
- Example: See the Example Documentation for a real-world use case.
- API Reference: Explore the API Reference for details on available classes, functions, and configuration options.
- External Documentation: For more on Kedro concepts, see the Kedro documentation. For Dagster concepts, see the Dagster documentation.