Introducing hetida designer — a graphical workflow composer leveraging the Python data science stack

stewit
14 min read · Apr 22, 2021

We recently open sourced hetida designer on GitHub, a collaboration, development, and runtime environment for data science. With this article, we want to introduce hetida designer to the data science community.

Table of Contents

· hetida designer in a nutshell
· Why?
· What is hetida designer?
· Using hetida designer — an example
Creating a workflow
Creating a component
Model training workflow
Inference workflow
Using workflows in production
API workflow execution endpoint
Wirings and adapters
Automation
· Feature overview
· Conclusion

hetida designer in a nutshell

hetida designer is a collaboration, development, and runtime environment for data science. It is made for business experts and data scientists and emphasizes production readiness and integration.

  • hetida designer is an open source graphical composition tool for analytical workflows, built on and leveraging the Python data science stack.
  • Source code development and graphical composition are on equal footing.
  • hetida designer produces versioned, production-ready, and easily-integrable data science artifacts.

Why?

In our experience, there are two main factors contributing to successful data science projects.

👫 First, business experts and data scientists need to collaborate closely and interactively: Data scientists need to provide their source code as reusable, easy-to-employ artifacts so that business users can play around with them. Business experts (e.g. maintenance engineers, market experts, portfolio managers) need the ability to interact with those models and functions, recombine them, and actually use them as tools on their problems and use cases.

Today, both groups work in completely different environments. Data scientists typically write code in Python or R. Business experts use a broad range of tools from Excel to highly-specialized business applications but they usually do not interact with source code.

Imagine how much more accessible data science methods would become for business experts, if both groups could collaborate in a shared environment!

🔧 Second, one main hurdle for data science projects is taking them from experimentation to production. This step requires data engineering to make data sources and sinks available in a robust and professional way. Plus, the resulting models must be integrated into existing IT landscapes and applications, which requires software engineering.

Note that many data scientists find the science fun and the engineering boring. They are good at writing modeling code, but they are neither software nor data engineers. For example, data scientists should not be compelled to write a production web service every time they want to deploy their results.

A perfect data science environment needs to decouple IT integration and data engineering from the analytics part and make data science artifacts available for production usage without needing extra deployment steps.

🌱 We realized that an environment addressing both aspects does not exist. Creating hetida designer is our attempt to change that.

What is hetida designer?

To the user, hetida designer is a graphical composition tool for analytical workflows that leverages the Python data science stack and puts source code development and graphical composition on equal footing. hetida designer produces production-ready and easily-integrable data science workflows without requiring extra deployment steps.

Its main goals are:

🎯 Empowering people of different backgrounds and skill levels to collaboratively apply data science methods:

  • Data scientists writing code and providing data science artifacts (models, functions)
  • Expert users graphically combining and employing those artifacts for their business purposes

🎯 Simplifying the transition of data science from experimentation to production

🎯 Decoupling of analytics from data engineering and facilitating the integration of arbitrary data sources and sinks

🎯 Ensuring transparency of data science down to the source code level — no magic black boxes.

To get a clearer picture, we think it is just as important to state what hetida designer is not:

  • hetida designer is not a no-code/low-code data science tool. Its approach is not to hide code behind boxes. On the contrary, it is designed precisely to provide data science code as usable artifacts to the business side. While it enables graphical composition of functions for business experts without coding skills, writing your own Python code is a first-class citizen and considered the primary way data scientists work with hetida designer.
  • hetida designer is not a lock-in ecosystem. Instead, it embraces the open source Python data science stack and tries to make it fully accessible and usable. hetida designer greatly benefits from this massive open ecosystem — it does not try to reinvent the wheel by providing its own algorithm implementations or layers on top of existing libraries. Rather, it makes it easy to employ models and functions from this well-established stack.
  • hetida designer is not a one-off analysis / data exploration tool. Instead, it strives to help create workflows that can then be automated for production use. It makes workflow execution immediately available, e.g. through a web service endpoint or via Kafka. This, together with hetida designer’s flexible adapter system, enables efficient integration and automation in multifaceted IT landscapes.
  • Last but not least, hetida designer is not an ETL tool.

hetida designer is provided as a web application and typically deployed as a small ensemble of containerized services. See the Readme for basic setup and installation instructions.

hetida designer is best understood by using it. Therefore, we want to demonstrate how to work with it using an example.

Using hetida designer — an example

Let us take data from two related sensors, measured at the same time intervals. For this example, it is not important which physical quantities are actually measured; we simply call them x and y.

x values:

[1.49, 1.45, 1.55, 1.78, 1.25, 1.6, 1.63, 1.34, 1.58, 1.51, 1.31, 1.36, 1.3, 1.2, 1.59, 1.46, 1.59, 1.07, 1.43, 1.43, 1.37, 1.36, 1.07, 1.36, 1.43, 1.22, 1.37, 1.2, 1.73, 1.26, 1.05, 1.01, 1.16, 1.47, 1.59, 1.19, 1.27, 1.23, 1.73, 1.59, 1.51, 1.44, 1.71, 1.07, 1.29, 1.65, 1.44, 1.17, 1.53, 1.32, 1.11, 1.4, 1.44, 1.57, 1.18, 1.58, 1.45, 1.41, 1.53, 1.31, 1.35, 1.23, 1.64, 1.28, 1.5, 1.49, 1.37, 1.59, 0.97, 1.64, 1.22, 1.3, 1.7, 1.38, 1.49, 1.51, 1.39, 1.49, 1.46, 1.17, 1.26, 1.62, 1.66, 1.4, 1.53, 1.36, 1.13, 1.27, 1.31, 1.38, 1.64, 1.46, 1.43, 1.58, 1.41, 1.58, 1.45, 1.18, 1.45, 1.28, 0.38, -0.35, 1.13, -0.58, -0.6, 0.15, 0.49, 0.22, -0.26, -0.0, 0.8, -0.39, 0.04, 0.76, 0.28, -0.08, 0.65, 0.68, 0.63, 0.63, 0.25, 0.09, 0.54, -0.36, 0.25, -0.72, -0.26, -0.37, 0.02, -0.27, 0.42, -0.28, 0.05, 0.43, 0.08, -0.15, 0.42, 0.05, 0.05, 0.69, -0.3, 0.07, -0.46, 0.16, 0.4, -0.41, 0.03, 0.18, 0.53, 0.59]

y values:

[2.09, 1.64, 1.89, 2.43, 1.26, 2.57, 1.83, 1.66, 1.78, 2.06, 1.19, 1.39, 1.34, 1.16, 1.94, 2.21, 1.18, 0.67, 1.78, 1.26, 1.26, 1.59, 1.17, 2.08, 1.74, 1.05, 1.51, 1.44, 2.26, 1.4, 1.3, 0.94, 0.71, 1.43, 1.96, 1.26, 1.61, 1.26, 2.28, 1.72, 2.18, 1.7, 2.09, 0.84, 1.4, 2.36, 1.66, 1.37, 1.66, 1.8, 1.2, 2.03, 2.07, 1.66, 1.02, 1.6, 1.64, 1.79, 2.53, 1.57, 1.48, 0.67, 2.33, 1.37, 1.53, 2.47, 1.1, 1.86, 0.41, 2.36, 1.25, 1.43, 2.48, 1.86, 1.56, 1.67, 1.38, 2.26, 1.47, 0.73, 1.35, 1.5, 2.35, 1.65, 2.02, 1.41, 1.06, 1.35, 1.88, 1.53, 2.09, 1.65, 1.75, 1.41, 1.6, 2.18, 1.74, 1.21, 1.75, 0.89, 0.25, 0.11, 0.03, 0.12, 0.09, 0.18, 0.18, 0.07, 0.28, -0.09, 0.07, 0.28, 0.01, 0.34, 0.23, 0.01, 0.31, 0.11, 0.2, 0.0, -0.11, -0.25, -0.0, 0.09, 0.29, 0.46, 0.08, 0.68, -0.12, -0.15, 0.25, -0.34, 0.01, -0.0, 0.1, 0.31, 0.14, 0.26, 0.17, 0.09, 0.03, -0.26, -0.05, -0.08, 0.2, -0.07, 0.17, 0.2, 0.1, 0.12]

Our goal is to train an unsupervised anomaly detection model on them in order to decide for new pairs whether they should be considered okay or not okay. Please note that the focus here is on demonstrating how to use the designer, not on the concrete, extremely simplified data science case.

Creating a workflow

We start by creating a new workflow in hetida designer for training and evaluating such a model.

First, we just want to plot the training data to get a feeling for it. We drag and drop some basic components and a visualization component from the component sidebar into the workflow and connect them accordingly.

Then, we set up the workflow’s interface in its IO dialog.

After that, we execute the workflow, choosing the manual input option to enter our rather small data set.

The result is the desired plot (colored by index position):

We observe that our data appears adequate and we do not need to add further preprocessing steps (in the end this is only an example 😏). Next, we want to use a One Class Support Vector Machine for the anomaly detection. There is no such component in the component set provided with hetida designer. But the scikit-learn Python library provides this algorithm and hetida designer makes it easy to write custom components based on the Python data science stack. Note that scikit-learn is pre-installed together with the most-used libraries of the Python data science stack. You can, of course, install additional packages and use your own modules as well.

Creating a component

From the component sidebar, we start writing our new component. First, we declare its interface using the IO dialog. We decide to only expose the “nu” parameter of the algorithm for now.

Then we enter the relevant code using the code editor.

We actually output decision function values instead of class predictions. Note that the code simply creates and invokes the model from the scikit-learn package.
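Stripped of the designer-specific component boilerplate, the core of the component code might look roughly like the following sketch (the input and output names as well as the exact signature are assumptions for illustration; only the nu parameter is exposed, as decided above):

import pandas as pd
from sklearn.svm import OneClassSVM

def train_and_score(x: pd.Series, y: pd.Series, test_data: pd.DataFrame, nu: float = 0.5):
    """Fit a One-Class SVM on the (x, y) pairs and return the trained model
    together with decision function values for the test data (instead of
    hard class predictions)."""
    train_data = pd.DataFrame({"x": x, "y": y})
    model = OneClassSVM(nu=nu).fit(train_data)
    decision_function_vals = pd.Series(
        model.decision_function(test_data), index=test_data.index
    )
    return model, decision_function_vals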

At this point, we could test-execute the component directly, but as we believe it will work as expected, we just press the “publish” button in order to release this component revision. hetida designer has a built-in revision system for all data science artifacts. Only released component revisions can be used in workflows to ensure stable versions and reproducibility.

Model training workflow

We switch back to the workflow, drag our just-released component revision into the workflow, and connect it accordingly. Our component has an input for test data for evaluation. For this demo, we decide to just display the resulting decision function as a contour plot and therefore push a grid of points in the 2D plane into this input as test data. Furthermore, we write the resulting trained model into a file for later use in an accompanying inference workflow, using the simple object store delivered with hetida designer. Remark: More sophisticated model stores can be integrated via the adapter system mentioned below.
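The two concrete ingredients here, the 2D grid of test points and the model persistence, could be sketched in plain Python as follows (the grid bounds, the file name, and the use of joblib as a stand-in for the built-in object store are assumptions for illustration):

import numpy as np
import pandas as pd
import joblib  # plain file persistence, standing in for the built-in object store
from sklearn.svm import OneClassSVM

# Train on a few (x, y) pairs (shortened here; the workflow uses the full sensor data)
train_data = pd.DataFrame({"x": [1.49, 1.45, 0.38, -0.35], "y": [2.09, 1.64, 0.25, 0.11]})
model = OneClassSVM(nu=0.1).fit(train_data)

# A grid of points covering the 2D plane, pushed into the test data input so that
# the decision function can be rendered as a contour plot
xx, yy = np.meshgrid(np.linspace(-1.0, 2.0, 50), np.linspace(-0.5, 3.0, 50))
grid_test_data = pd.DataFrame({"x": xx.ravel(), "y": yy.ravel()})

# Persist the trained model under a name/tag for the inference workflow
joblib.dump(model, "ocsvm_example_20210406.joblib")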

Executing our workflow with parameters set appropriately results in an additional plot depicting the “learned” decision function. Note that parameters are “stored” in the execution dialog so that you do not have to re-enter them all when you want to test again later.

Inference workflow

Next, we create a simple inference workflow that loads the trained model, applies it to some new data, and outputs the decision function values.
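Its core logic boils down to something like this sketch (again with a plain file standing in for the object store; the new data values match the API example further below):

import pandas as pd
import joblib

# Load the model persisted by the training workflow
model = joblib.load("ocsvm_example_20210406.joblib")

# New (x, y) pairs to score
new_data = pd.DataFrame({"x": [2.0, 3.0, 1.0, 1.0], "y": [0.0, 4.0, 1.0, 1.3]})

# Decision function values; negative values indicate anomalies
decision_function_vals = pd.Series(model.decision_function(new_data))
print(decision_function_vals)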

With the correct parameters and some new data this results in:

At this point, we reflect on what we achieved so far in our example:

  • created a training workflow
  • used plot components to display data
  • created a new component from scratch
  • released a first component revision
  • created an inference workflow loading a trained model
  • tested both workflows with manually inserted data

To keep the demonstration clear and simple, this example does not address many typical data science steps like data cleaning, preprocessing, model validation, or model optimization. Ignoring all this, a typical next step would be using these workflows in production.

Using workflows in production

This is already possible and does not require an extra deployment step:

Let us sketch how to automate the application of the inference workflow to new data. For this, we first release the workflow through the “publish” button. This prevents the workflow revision from being further edited, ensuring all automated applications run the exact same code. Next, we obtain the workflow revision id from its details/edit dialog:

API workflow execution endpoint

The web API URL for workflow execution depends on your designer installation. Assuming you are using the local docker-compose setup described in the Readme, it is

http://localhost:8080/api/workflows/f7ad33b5-e201-412f-9f2a-ded12759c1b5/execute

Sending a POST request to this URL with the JSON body

{
  "inputWirings": [
    {
      "adapterId": "direct_provisioning",
      "filters": {
        "value": "ocsvm_example"
      },
      "workflowInputName": "model_name"
    },
    {
      "adapterId": "direct_provisioning",
      "filters": {
        "value": "20210406"
      },
      "workflowInputName": "model_tag"
    },
    {
      "adapterId": "direct_provisioning",
      "filters": {
        "value": "[2.0, 3.0, 1.0, 1.0]"
      },
      "workflowInputName": "x"
    },
    {
      "adapterId": "direct_provisioning",
      "filters": {
        "value": "[0.0, 4.0, 1.0, 1.3]"
      },
      "workflowInputName": "y"
    }
  ],
  "outputWirings": []
}

is equivalent to what pressing the execution button did before and returns

{
  "result": "ok",
  "output_results_by_output_name": {
    "decision_function_vals": {
      "0": -21.4381604335,
      "1": -28.4018510059,
      "2": -0.3109575766,
      "3": -1.1063304828
    }
  },
  "output_types_by_output_name": {
    "decision_function_vals": "ANY"
  }
}

For more detailed json field descriptions, we refer to the documentation.
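For illustration, such a request could be sent from Python roughly like this (a sketch assuming the local setup and the wiring shown above):

import requests

workflow_revision_id = "f7ad33b5-e201-412f-9f2a-ded12759c1b5"
url = f"http://localhost:8080/api/workflows/{workflow_revision_id}/execute"

wiring = {
    "inputWirings": [
        {"adapterId": "direct_provisioning",
         "filters": {"value": "ocsvm_example"},
         "workflowInputName": "model_name"},
        {"adapterId": "direct_provisioning",
         "filters": {"value": "20210406"},
         "workflowInputName": "model_tag"},
        {"adapterId": "direct_provisioning",
         "filters": {"value": "[2.0, 3.0, 1.0, 1.0]"},
         "workflowInputName": "x"},
        {"adapterId": "direct_provisioning",
         "filters": {"value": "[0.0, 4.0, 1.0, 1.3]"},
         "workflowInputName": "y"},
    ],
    "outputWirings": [],
}

response = requests.post(url, json=wiring)
response.raise_for_status()
print(response.json()["output_results_by_output_name"])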

Wirings and adapters

The request json represents a “wiring” — a data structure that maps data sources via adapters to workflow inputs and workflow outputs via adapters to data sinks.

Wirings, adapters, and workflow execution

An adapter is a piece of software providing data sources and sinks for hetida designer. A typical adapter lets hetida designer read and/or write data from/to a database. Writing an adapter is usually a data engineering task. Please refer to the documentation on this subject for details.

In our example, the built-in “direct_provisioning” adapter is used for all inputs and for all outputs (the latter by default). This means that data for all inputs is provided with the request, and resulting data from outputs is included in the response.

Let us see how this looks with sensor data coming from another adapter. Here is an example using the Python demo adapter included in the hetida designer default setup:

{
  "inputWirings": [
    {
      "adapterId": "direct_provisioning",
      "filters": {
        "value": "ocsvm_example"
      },
      "workflowInputName": "model_name"
    },
    {
      "adapterId": "direct_provisioning",
      "filters": {
        "value": "20210406"
      },
      "workflowInputName": "model_tag"
    },
    {
      "adapterId": "demo-adapter-python",
      "filters": {
        "timestampFrom": "2021-04-09T08:14:24.000000000Z",
        "timestampTo": "2021-04-09T09:14:24.000000000Z"
      },
      "refId": "root.plantA.picklingUnit.influx.temp",
      "refIdType": "SOURCE",
      "type": "timeseries(float)",
      "workflowInputName": "x"
    },
    {
      "adapterId": "demo-adapter-python",
      "filters": {
        "timestampFrom": "2021-04-09T08:14:31.000000000Z",
        "timestampTo": "2021-04-09T09:14:31.000000000Z"
      },
      "refId": "root.plantA.picklingUnit.influx.press",
      "refIdType": "SOURCE",
      "type": "timeseries(float)",
      "workflowInputName": "y"
    }
  ],
  "outputWirings": []
}

So, by simply changing the wiring, one can run a workflow on different data sources without having to change the workflow itself. Wirings decouple the analytics in workflows from data ingestion/egestion.

Automation

This is quite useful for automation. For example, it allows you to run one workflow on all your facilities simply by invoking it for every facility with a facility-specific wiring that refers to that facility’s data. Think of replacing “plantA” with “plantB” in the example above.

Or, if you have several different model workflows which have a common interface (i.e. same inputs and outputs), you can run them all with the exact same wiring just by replacing the workflow revision id in the web API endpoint url.
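A minimal automation sketch along these lines might look as follows (the facility names, the shortened wiring, and the endpoint URL are assumptions based on the examples above):

import requests

URL = "http://localhost:8080/api/workflows/f7ad33b5-e201-412f-9f2a-ded12759c1b5/execute"

def facility_wiring(facility: str) -> dict:
    """Build a facility-specific wiring referring to that facility's sensor data."""
    return {
        "inputWirings": [
            {
                "adapterId": "demo-adapter-python",
                "filters": {
                    "timestampFrom": "2021-04-09T08:14:24.000000000Z",
                    "timestampTo": "2021-04-09T09:14:24.000000000Z",
                },
                "refId": f"root.{facility}.picklingUnit.influx.temp",
                "refIdType": "SOURCE",
                "type": "timeseries(float)",
                "workflowInputName": "x",
            },
            # ... analogous entries for the remaining workflow inputs ...
        ],
        "outputWirings": [],
    }

# Run the same workflow once per facility, each time with its own wiring
for facility in ["plantA", "plantB"]:
    result = requests.post(URL, json=facility_wiring(facility)).json()
    print(facility, result["result"])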

hetida designer itself is not a job management tool. Hence, automation should be implemented using the standard tools for this purpose — from simple CRON job scripts to enterprise job scheduling and automation frameworks.

Feature overview

Source code transparency: You can open every component by double-clicking on it in the sidebar or by right-clicking an instance in a workflow and expanding the preview from there. This lets you view its source code and even create new revisions of the component. You can also copy the component to use as a basis for your own custom component, which is a good starting point for component development.

Using workflows in workflows: You can drag a released workflow into a workflow to use it like a component. Nesting workflows is encouraged and a good method to separate concerns. For instance, you may have a preprocessing pipeline workflow that you use in both a training workflow and a prediction workflow.

Visualization components: The visualization components are ordinary components written in Python using Plotly. This means you can write your own custom plotting components. To get going, we recommend starting by copying an existing visualization component. Hint: Execution of plotting components is by default deactivated for production runs via the execution web API endpoint — there is no need to maintain an extra production workflow revision without plots.
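The plotting logic itself is plain Plotly; how the resulting figure is exposed as a component output follows the existing visualization components. A sketch of such logic (column and function names are made up) might be:

import pandas as pd
import plotly.express as px

def scatter_by_index(x: pd.Series, y: pd.Series):
    """Scatter plot of (x, y) pairs colored by index position, as in the plot above."""
    df = pd.DataFrame({"x": x, "y": y}).reset_index()
    return px.scatter(df, x="x", y="y", color="index")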

Using data from files: Instead of manually inserting data, you can use data from files. First, the manual input supports reading small amounts of CSV or JSON data from disk. Second, the local file adapter allows you to mount local directories into the hetida designer runtime, which is a good option for larger files, e.g. CSV or Excel. You can even add your own file formats/extensions (check the documentation).

Using data from arbitrary data sources: hetida designer comes with a flexible adapter system that allows for integration of data from arbitrary sources. Typically, a data engineer writes an adapter to the organisation’s databases once and from then on they can be accessed in hetida designer. This is also the basis for decoupling analytics from data engineering as mentioned at the beginning.

Writing data to arbitrary data sinks: The same adapter system handles writing result data to target systems. For more details, please consult the documentation.

Running workflows in production: As mentioned above, the created workflows are directly executable through a web service endpoint, and the adapter system allows full control over input data origin and output data destination. Of course, data can also come from the execution web service request itself and be sent back with its response. All of this can be mixed arbitrarily, which allows automated systems to run your workflows with data ingestion/egestion completely different from your test runs, e.g. data coming from a production database and results being written to any target system.

In addition, it is possible to run workflows using Kafka, and more options may be added in the future.

Workflow/component documentation: For both workflows and components you can write some documentation in markdown format with LaTeX support via the “Open documentation” button.

Conclusion

hetida designer has already become a cornerstone of our own data science projects. It allows our consultants to quickly bring their methods to our customers’ business experts. These experts can then play around with the provided components and workflows as well as employ them productively to solve their problems. Plus, it helps in-house data scientists make their results accessible to their business colleagues.

hetida designer is built on an open ecosystem and tries to expand its circle of users. We hope that hetida designer can give something back to this ecosystem and that others find it useful, provide feedback, and start to contribute.

Thanks for reading!
