Hey everyone! How is your day going? Today I bring a recurring topic in the fantastic, wonderful and senseless world of data.
For a long time, every company has been trying to automate its data pipelines (aka ETL) in order to manage and organize every single business term. Because of this, people like me have been able to move between many companies, working on different data-related projects. Some of them involve creating simple workflows, while others focus on optimization mechanisms, debugging legacy code or re-implementing it.
Here is THE BIG PROBLEM: every developer has their own coding style, and as the team grows, more styles emerge. In the past, I had many challenges working with my fellow coders, especially the junior ones. While some developers can condense 40 lines into a higher-order function using super fancy monads and other cool advanced functional things, juniors are still trying to understand how a for-loop works. And the point is, when developers switch projects or leave the company, who wants to maintain and debug their code? Exactly: problems come…
SO! To avoid this, make the team’s life easier and (why not) build cheaper solutions, why not create data pipelines with zero code?
ETL Configuration File #
There is ALWAYS a DSL in the middle. Yes: if we provide a well-defined DSL, along with a well-structured document format in which to write it, we end up with a DSL instruction program or, in other words, an ETL Configuration File. Which means: Big Data with Zero Code (OMG! he said it!).
I won’t dive into the details of DSLs (maybe for another post). For now, what we need is:
- Grammar Specification, which formally defines the syntax and structure of a DSL or any language. It’s written in a meta-language, such as BNF (Backus-Naur Form) or EBNF (Extended BNF), which provides rules for how instructions in the DSL should be constructed.
- Semantics, which describes the behavior and transformations of a DSL using formal methods.
Grammar Specification #
In this case, we are going to rewrite the SQL grammar in a way that fits YAML syntax.
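To make the idea a bit more concrete, here is a minimal sketch of what "SQL reshaped as YAML" could look like, plus a toy grammar check over it. Everything here is an assumption for illustration: the keys `from`, `select` and `where`, the `GRAMMAR` table and the `validate` helper are hypothetical and are not the actual Teckel schema. The YAML is shown in a comment and written out as the Python dict a YAML parser would return, so the snippet runs with the standard library only.

```python
# Hypothetical YAML document (illustrative keys only):
#
#   from: orders
#   select: [customer_id, amount]
#   where: amount > 100
#
# This dict is the structure a YAML parser would hand back for it.
config = {
    "from": "orders",
    "select": ["customer_id", "amount"],
    "where": "amount > 100",
}

# A toy "grammar": each allowed key, its expected type, and whether it is required.
GRAMMAR = {
    "from": (str, True),
    "select": (list, True),
    "where": (str, False),
}


def validate(doc: dict) -> list:
    """Return a list of grammar violations; an empty list means the document is valid."""
    errors = []
    # Reject keys the grammar does not know about.
    for key in doc:
        if key not in GRAMMAR:
            errors.append(f"unknown key: {key}")
    # Check presence and type of every key the grammar defines.
    for key, (expected_type, required) in GRAMMAR.items():
        if key not in doc:
            if required:
                errors.append(f"missing required key: {key}")
        elif not isinstance(doc[key], expected_type):
            errors.append(f"{key} must be of type {expected_type.__name__}")
    return errors


print(validate(config))  # an empty list: the document follows the toy grammar
```

A real grammar specification would be written in BNF or EBNF and cover far more than three keys; the point is only that the configuration file can be checked against rules before anything is executed.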
Why YAML? #
Here is just a taste. Of course, there are other data-serialization languages like TOML, JSON, etc. However, I believe YAML has the most human-friendly syntax.
Semantics #
Because we are talking about big data, YES, we are going to evaluate our DSL expressions into Apache Spark DataFrames.
In the end, our goal is to provide an efficient solution using a common enterprise data stack.
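As a rough sketch of what "evaluating" a DSL expression could mean, the snippet below translates one configuration step into a SQL string. The step layout (`from`, `select`, `where`) and the `to_sql` helper are hypothetical, chosen only for illustration; this is not how Teckel does it, and a real implementation would need proper validation and escaping rather than plain string concatenation.

```python
def to_sql(step: dict) -> str:
    """Translate one hypothetical config step into a SQL query string."""
    # Default to all columns when no projection is given.
    columns = ", ".join(step.get("select", ["*"]))
    query = f"SELECT {columns} FROM {step['from']}"
    # The filter is optional.
    if "where" in step:
        query += f" WHERE {step['where']}"
    return query


step = {
    "from": "orders",
    "select": ["customer_id", "amount"],
    "where": "amount > 100",
}

print(to_sql(step))
# prints: SELECT customer_id, amount FROM orders WHERE amount > 100
```

Assuming a live `SparkSession` named `spark` and a table or temporary view called `orders`, the resulting string could be passed to `spark.sql(...)`, which returns a DataFrame. That is one simple way for a configuration step to end up as a Spark DataFrame without the pipeline author writing any Spark code.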
Approach #
The ultimate goal is to provide a user-friendly mechanism that homogenizes and simplifies the process of generating Apache Spark ETL workflows.
With this approach, one could create a Spark ETL by simply using a YAML configuration file, much like following a cooking recipe.
The Final Result: Introducing the Teckel Framework #
All these concepts, examples, and efforts have been leading toward: Teckel.
Teckel is a framework for building Apache Spark ETL processes using YAML configuration files.
Conclusion #
In this post, we explored the foundational ideas behind building ETL workflows for Apache Spark without writing code. By introducing the concept of a DSL and leveraging YAML as a configuration format, we outlined a vision for simplifying and homogenizing data pipeline creation.
While we only scratched the surface by discussing grammar specifications, semantics, and the use of YAML, this approach holds significant potential. It promises to make ETL workflow generation as straightforward as following a recipe, reducing complexity and improving collaboration within teams.
In the future, we will delve deeper into the implementation details, such as defining transformation rules, optimizing execution, and creating practical examples of this zero-code methodology. Stay tuned as we continue this exciting journey into the world of data automation!