The building blocks of a Jetstream pipeline are templates. Templates are scripts that describe the steps required for a pipeline. Some pipelines can be a single template file. Other pipelines will require several templates and supporting data files (see pipelines).
Table of contents:
- Syntax
- Dependencies between Tasks in templates
- Variables and Logic
- Template rendering data
- Modularity
Templates describe a set of tasks that can be run for multiple projects. You can think of templates like a declarative scripting language for building pipelines. They usually have variable components that will be filled in with project data at runtime. Once the variables are filled in, the tasks are rendered into finalized commands ready to execute.
Here is an example workflow template with one task to complete:
- name: hello_world
  cmd: echo "{{ greeting }} world"
Here is a template with two tasks:
- name: task1
  cmd: echo "{{ greeting }} world"
- name: task2
  after: task1
  cmd: echo "It is $(date)"
And here is a template with 100 tasks (the loop index keeps each task name unique; the `after-re` pattern directive used by the final task is covered below):
{% for i in range(99) %}
- name: hello_{{ i }}
  cmd: echo "{{ greeting }} world number {{ i }}"
{% endfor %}
- name: task_100
  after-re: hello_.*
  cmd: echo "It is $(date)"
The foundation of template syntax is YAML. Each task starts with a hyphen, `-`, and then a set of task directives are given as `directive: value` lines. This is YAML syntax for describing a sequence of mappings.
Directives are the instructions that the runner will use to determine when and how to execute the task. See Tasks for more information.
Variables and logic can be included to make templates dynamic. In the example above, `{{ greeting }}` will be filled in with data provided by the user when the template is run with `jetstream run`. Variables and logic provide endless potential for modeling complex workflows.
Templates can be run with the command-line tool:
$ jetstream run example.jst -c greeting hello
Dependencies between tasks can be specified with "flow" directives: `before`, `after`, `input`, and `output`. In this example, "task2" needs to run after "task1". Adding the `after` directive establishes a dependency where "task2" depends on "task1".
- name: task1
  cmd: hostname > info.txt
- name: task2
  after: task1
  cmd: date >> info.txt
Another way to set up dependencies is with the `input` and `output` directives. This example sets up the same workflow as above, but using different directives.
- cmd: hostname > info.txt
  output: info.txt
- cmd: date >> info.txt
  input: info.txt
`output` directives can be declared without having a task with a matching `input` directive. But, an error will be raised if there are `input` directives with no matching `output` directives.
All dependency directives (`before`, `after`, `input`, `output`, and the `*-re` variants) can be sequences. This example will set up a third task that waits for both setup tasks to complete before it executes:
- name: setup_task_1
  cmd: hostname > info.txt
- name: setup_task_2
  after: setup_task_1
  cmd: date >> info.txt
- name: start
  after:
    - setup_task_1
    - setup_task_2
  cmd: cat info.txt
Note that there are multiple syntax options for a sequence in YAML.
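For example, the same `after` dependency can be written as a YAML flow sequence on a single line:

- name: start
  after: [setup_task_1, setup_task_2]
  cmd: cat info.txt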
Some dependency directives (`before-re`, `after-re`, `input-re`) are treated as regex patterns. Regex patterns can be an easier way of setting up dependencies when there are many tasks you need to link. Here's the same workflow written with a pattern for the `after-re` directive instead of a sequence.
Note that patterns and input/output directives will increase the time it takes to calculate the workflow graph. For most workflows this is negligible but may become a concern for very large workflows.
- name: setup_task_1
  cmd: hostname > info.txt
- name: setup_task_2
  after: setup_task_1
  cmd: date >> info.txt
- name: start
  after-re: setup_task_.*
  cmd: cat info.txt
Similarly, the `input` and `output` directives can be used:
- name: setup_task_1
  output: info1.txt
  cmd: hostname > info1.txt
- name: setup_task_2
  after: setup_task_1
  output: info2.txt
  cmd: date > info2.txt
- name: start
  input-re: ".*\.txt"
  cmd: cat info1.txt info2.txt
Details for the variables and logic syntax can be found in the Jinja2 designer documentation.
Jinja2 can be used to add variables and logic to workflow templates. Prior to loading tasks and connecting dependencies, Jetstream will render templates with Jinja2. During the render, variables in the template are replaced with actual data given as command arguments, config files, or saved in the project and pipelines. Templating is a pattern used widely in web development, but other examples of using Jinja and YAML together to create dynamic structured documents can be found in tools like Ansible or SaltStack.
Examples:

- Add variables to templates with the double-curly-bracket syntax:

  - name: say_hello
    cmd: say "hello {{ name }}"

- Use logical expressions with curly-bracket-percent syntax:

  # comments are allowed throughout these documents following either yaml or
  # jinja2 comment syntax
  {% for name in names %}
  - name: say_hello
    cmd: say "hello {{ name }}"
  {% endfor %}

- And much, much more... see the Jinja2 designer documentation for details
In addition to the global functions and filters included with Jinja2, several other tools have been added by Jetstream and can be used inside templates:
- `raise`: Raise an error while rendering the template
  - Example: `{% if foo < 42 %}{{ raise('foo should be at least 42') }}{% endif %}`
- `log`: Log messages to the Jetstream logger while the template renders
  - Example: `{{ log('Foo is {}'.format(foo), level='CRITICAL') }}`
- `env`: Returns an environment variable value, raising an error if the variable is not set
  - Example: `echo foo is {{ env('FOO') }}`
- `getenv`: Returns an environment variable value; this will return None if the value is not set, whereas `env` will raise an error. A different fallback value can be given as the second argument.
  - Example: `echo foo is {{ getenv('FOO', None) }}`
- `setenv`: Sets an environment variable when the template is rendered
  - Example: `{{ setenv('FOO', '42') }}`
- `fromjson`: Parse a json string as an object
- `basename`: Returns the basename of a path
- `dirname`: Returns the directory name of a path
- `urlparse`: Parse a url string as an object
- `sha256`: Returns the sha256 hexdigest for a string
- `md5`: Returns the md5sum of a file defined with a path
  - Example: `{{ required_scripts.some_script.path | md5 }}`
- `assignbin`: Returns the 0-based bin the value falls in
  - The default bin edges are 0 to infinity, meaning this will return 0 if the bin edges are not defined.
  - Returns -1 if the input value is out of bounds.
  - Any value landing on an edge will floor to the lower bin.
  - Also accepts a list of labels, such that `{{ assignbin(5, [0,2,4,6], ['low','med','high']) }}` returns 'high', and `{{ assignbin(4, [0,2,4,6], ['low','med','high']) }}` returns 'med'.
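As a minimal sketch of how a few of these helpers can combine in practice (the `FOO` environment variable and the threshold of 42 are hypothetical):

{% set foo = getenv('FOO', '0') | int %}
{% if foo < 42 %}{{ log('FOO is low, using defaults', level='CRITICAL') }}{% endif %}
- name: report_foo
  cmd: echo "foo is {{ foo }}"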
When the template is rendered, data is pulled from several sources. Each is explained below in further detail (highest priority first):
- command-line arguments: `-c/--config` or `-C/--config-file`
- data stored in the project index (if using a project): `<project>/jetstream/project.yaml`
- data stored in the pipeline manifest (if using a pipeline): `<pipeline>/pipeline.yaml`
After these sources are loaded, they're collapsed into a single config object (a dictionary) that is used by Jinja2 as the context for rendering the template. Higher-priority data sources will overwrite other sources.
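For instance, if a project's `project.yaml` stores `name: leela` (a hypothetical entry) and the same key is passed on the command line, the command-line value wins (`hello.jst` is a hypothetical template):

$ jetstream render hello.jst -c name bender
# renders with name == "bender" because command-line arguments
# have the highest priority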
In the example below, the variable `{{ name }}` is what we want to replace, so we need to pass in config data with the key `name`. To pass a single variable with command-line arguments, use the `-c/--config` option. It takes two arguments: the first is the key, the second is the value. The key can optionally include the type of the variable being passed with the syntax `type:key value`. If the type is not given, the data will be loaded as a string. This syntax supports a wide range of datatypes, including entire files.
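For example, assuming `int` is one of the supported type names (see the command help for the full list), a numeric value could be passed alongside a string (the `threads` key is hypothetical):

$ jetstream run example1.jst -c name bender -c int:threads 4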
These arguments are also allowed when creating projects with `jetstream init`.
You can test this process with the `jetstream render` command. The resulting template will be printed to stdout:
# example1.jst
- name: say_hello
  cmd: say "hello {{ name }}"
$ jetstream render example1.jst -c name bender

# example1.jst
- name: say_hello
  cmd: say "hello bender"
Batches of config data can also be loaded from files. Here the example template has been modified to accept a set of names. We can load that set of names from a json file:
example2.jst

# example2.jst
{% for name in names %}
- name: say_hello
  cmd: say "hello {{ name }}"
{% endfor %}
config.json
{"names": ["Philip J. Fry", "Bender Bending Rodriguez", "Leela Turanga"]}
To run, use `-C/--config-file` to load the entire file of variables.
$ jetstream render example2.jst -C config.json

# example2.jst
- name: say_hello
  cmd: say "hello Philip J. Fry"
- name: say_hello
  cmd: say "hello Bender Bending Rodriguez"
- name: say_hello
  cmd: say "hello Leela Turanga"
There is a dynamic file loader type that will handle json, yaml, and many tabular text file formats. It will determine the file type based on the extension of the path, but this can be overridden with the `--config-file-type` option. See the `jetstream render -h` help for a list of supported file types for your configuration.
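For instance, a JSON config file saved with a non-standard extension could be loaded by naming the type explicitly (assuming `json` is among the supported type names):

$ jetstream render example2.jst -C names.config --config-file-type json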
Projects are an optional but very helpful feature in Jetstream. A project is a directory that contains a `jetstream` folder and a `project.yaml` (this folder will be referred to as the project index). You can create these directories with the `jetstream init` command.
Config variables can be used during project init. Any data will be saved into the project index so that it is available when running templates or pipelines on that project in the future.
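A minimal sketch (the directory name and config key are hypothetical):

$ mkdir my_project && cd my_project
$ jetstream init -c reference_path /path/to/reference.fa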
When running many jetstream commands (`project`, `tasks`, `run`, `pipelines`, etc.), projects help jetstream organize task data, logs, and workflow progress. Projects also serve to store configuration data that will be used when running templates and pipelines. In the project index (the `jetstream` folder) there is a `project.yaml` file that contains info about the project from when it was created. This file is always included as a config data source when rendering templates or running pipelines.
Pipelines may specify additional data that is available when rendering their templates. Pipelines should include a `pipeline.yaml` file. Inside this file, the required field `__pipeline__` contains information about the pipeline, but any additional fields can be specified.
An example use case would be a pipeline with a set of possible options to choose from. They could be stored in the `pipeline.yaml`, reducing the amount of config data that needs to be included for each invocation of the pipeline. Here's an example:
pipeline.yaml

__pipeline__:
  name: example_pipe
  ...
reference_file: gs://bucket/path/to/reference.file
run_modes:
  default:
    threads: 4
    index_uri: gs://bucket/path/to/basic.index
  faster:
    threads: 16
    index_uri: gs://bucket/path/to/faster.index
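A template in this pipeline could then select one of the presets with a single config variable; in this sketch, `mode` is a hypothetical user-supplied key and `aligner` is a stand-in command:

{% set mode = mode | default('default') %}
- name: align
  cmd: aligner --threads {{ run_modes[mode].threads }} --index {{ run_modes[mode].index_uri }} --reference {{ reference_file }}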
Templates can be modularized, or divided into smaller pieces, to improve organization and reusability. There are a few ways to modularize code: `include`, `extends`, and macros are some options. Full details can be found in the Jinja2 documentation on template inheritance.
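For example, a macro lets you define a block of tasks once and stamp it out with different arguments (a minimal sketch reusing the `say_hello` task from earlier):

{% macro say_hello(name) %}
- name: say_hello_{{ name }}
  cmd: say "hello {{ name }}"
{% endmacro %}
{{ say_hello('bender') }}
{{ say_hello('leela') }}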
Here is an example using the `{% include %}` statement to use code from another template file:
# This template includes code from the next template
{% for sample in samples %}
- name: download_{{ sample }}
  cmd: wget $DOWNLOAD_URL_ROOT/{{ sample }}.gz

{% include 'process_sample.jst' with context %}
{% endfor %}

- name: finalize
  after-re: .*
  cmd: echo All done
# This template code is used in the template above
- name: decompress_{{ sample }}
  after: download_{{ sample }}
  cmd: gunzip {{ sample }}.gz

- name: transform_{{ sample }}
  after: decompress_{{ sample }}
  cmd: sed -i 's/monday/friday/' {{ sample }}.txt

- name: compress_{{ sample }}
  after: transform_{{ sample }}
  cmd: gzip {{ sample }}.txt
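Note the `with context` modifier on the `{% include %}` statement above: it makes the enclosing context, including the loop's `sample` variable, available inside the included template, so each included copy renders tasks for the current sample.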