Getting Started
This guide walks you through setting up and deploying a Kedro project with Dagster using the Kedro‑Dagster plugin. The examples below use the Kedro `spaceflights-pandas` starter project, but you can use your own Kedro project. If you do, skip step 1.
1. Create a Kedro Project (Optional)
Skip this step if you already have a Kedro project you want to deploy with Dagster.
If you don't already have a Kedro project, you can create one using a starter template:
```bash
kedro new --starter=spaceflights-pandas
```
Follow the prompts to set up your project. Once the project is created, install its dependencies:
```bash
cd spaceflights-pandas
pip install -r requirements.txt
```
2. Installation
Install the plugin with `pip`:

```bash
pip install kedro-dagster
```
3. Initialize Dagster Integration
Use `kedro dagster init` to initialize Kedro‑Dagster:

```bash
kedro dagster init --env local
```
This creates:
- `src/definitions.py`: Dagster entrypoint file that exposes all translated Kedro objects as Dagster objects.
"""Dagster definitions."""
import dagster as dg
from kedro_dagster import KedroProjectTranslator
translator = KedroProjectTranslator(env="local")
dagster_code_location = translator.to_dagster()
resources = dagster_code_location.named_resources
# The "io_manager" key handles how Kedro MemoryDatasets are handled by Dagster
resources |= {
"io_manager": dg.fs_io_manager,
}
# Define the default executor for Dagster jobs
default_executor = dg.multiprocess_executor.configured(dict(max_concurrent=2))
defs = dg.Definitions(
assets=list(dagster_code_location.named_assets.values()),
resources=resources,
jobs=list(dagster_code_location.named_jobs.values()),
schedules=list(dagster_code_location.named_schedules.values()),
sensors=list(dagster_code_location.named_sensors.values()),
loggers=dagster_code_location.named_loggers,
executor=default_executor,
)
- `conf/local/dagster.yml`: Dagster configuration file for the `local` Kedro environment.
```yaml
# `dagster dev` configuration
dev:
  log_level: "info"
  log_format: "colored"
  port: "3000"
  host: "127.0.0.1"
  live_data_poll_rate: "2000"

# Dagster schedules configuration
schedules:
  daily: # Schedule name
    cron_schedule: "0 0 * * *" # Schedule parameters

# Dagster executors configuration
executors:
  sequential: # Executor name
    in_process: # Executor parameters

# Dagster jobs configuration
jobs:
  # You may filter pipelines using e.g. `node_names` to define a job
  # data_processing: # Job name
  #   pipeline: # Pipeline filter parameters
  #     pipeline_name: data_processing
  #     node_names:
  #       - preprocess_companies_node
  #       - preprocess_shuttles_node
  default:
    pipeline:
      pipeline_name: __default__
    schedule: daily
    executor: sequential
```
There's no need to modify the Dagster `definitions.py` file to get started, but let's take a closer look at the `dagster.yml` file.
4. Configure Jobs, Executors, and Schedules
The Kedro‑Dagster configuration file `dagster.yml` includes the following sections:
- `schedules`: Used to set up cron schedules for jobs.
- `executors`: Used to specify the compute targets for jobs (in-process, multiprocess, k8s, etc.).
- `jobs`: Used to describe jobs by filtering Kedro pipelines.
You can edit the automatically generated `conf/local/dagster.yml` to customize jobs, executors, and schedules:
```yaml
schedules:
  daily: # Schedule name
    cron_schedule: "0 0 * * *" # Schedule parameters

executors:
  sequential: # Executor name
    in_process: # Executor parameters
  multiprocess:
    multiprocess:
      max_concurrent: 2

jobs:
  default: # Job name
    pipeline: # Pipeline filter parameters
      pipeline_name: __default__
    executor: sequential

  parallel_data_processing:
    pipeline:
      pipeline_name: data_processing
      node_names:
        - preprocess_companies_node
        - preprocess_shuttles_node
    schedule: daily
    executor: multiprocess

  data_science:
    pipeline:
      pipeline_name: data_science
    schedule: daily
    executor: sequential
```
Here, we have added a "parallel_data_processing" and a "data_science" job to the jobs configuration. The first makes use of the `node_names` Kedro pipeline filter argument to create a sub-pipeline of the Kedro "data_processing" pipeline from a list of two Kedro nodes: "preprocess_companies_node" and "preprocess_shuttles_node". Both jobs run daily using the "daily" schedule, whose `cron_schedule` is "0 0 * * *". "parallel_data_processing" runs on the "multiprocess" executor with `max_concurrent` set to 2, and "data_science" runs sequentially.
See the Technical Documentation for more on customizing the Dagster configuration file.
5. Browse the Dagster UI
Use `kedro dagster dev` to start the Dagster development server:

```bash
kedro dagster dev --env local
```
Note

The `dagster.yml` file also includes a `dev` section containing the default parameters of this command. Check out the API Reference for more info.
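Those defaults can typically be overridden when launching the server. The flags below are assumptions mirroring the `dev` section keys; run `kedro dagster dev --help` to confirm which options your version exposes.

```bash
# Assumed flags matching the `dev` section keys; verify with `kedro dagster dev --help`
kedro dagster dev --env local --port 3001 --log-level debug
```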
The Dagster UI will be available at http://127.0.0.1:3000 by default.
You can inspect assets, jobs, and resources, trigger or automate jobs, and monitor runs from the UI.
Assets
Moving to the "Assets" tab leads to the list of assets generated from the Kedro datasets involved in the filtered pipelines specified in dagster.yml
.
Each asset is prefixed by the Kedro environment that was passed to the `KedroProjectTranslator` in `definitions.py`. If the Kedro dataset was generated from a dataset factory, the namespace that prefixed its name will also appear as a prefix, allowing easy browsing of assets per environment and per namespace.
Clicking on the "Asset lineage" link at the top right of the window leads to the Dagster asset lineage graph, where you can observe the dependencies between assets and check their status and description.
Resources
Kedro‑Dagster defines one Dagster IO manager per Kedro dataset to handle its saving and loading. As with assets, IO managers are defined per Kedro environment and their names are prefixed accordingly.
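The generated assets and resources (including the dataset IO managers) can also be listed from a Python session. The snippet below is a sketch reusing the attributes shown in `definitions.py`, run from the project root; the exact names depend on your Kedro environment and namespaces.

```python
# Sketch: inspect the asset and resource names generated by Kedro-Dagster.
from kedro_dagster import KedroProjectTranslator

dagster_code_location = KedroProjectTranslator(env="local").to_dagster()

print("Assets:", sorted(dagster_code_location.named_assets))
print("Resources:", sorted(dagster_code_location.named_resources))
```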
Automation
Moving to the "Automation" tab, you can see a list of the defined schedules and sensors.
Jobs
To see the different jobs defined in `dagster.yml`, click on the "Jobs" tab.
Clicking on the "parallel_data_processing" job brings you to a graph representation of the corresponding Dagster-translated Kedro pipeline. before_pipeline_run
and after_pipeline_run
are included as the first and final nodes of the job graph.
The job can be run by clicking on the "Launchpad" sub-tab. The Kedro pipeline, its parameters (mapped to Dagster Config), and the Kedro datasets (mapped to IO managers) can be modified before launching a run.
Next Steps
- Visit the Example section for a more advanced example.
- Explore the Technical Documentation for advanced configuration and customization.
- See the API Reference for details on available classes and functions.