Brandon Rice

Software development with a focus on web technologies.

ETL With Ruby and Rodimus

| Comments

This is the story of how and why I made Rodimus, a minimal ETL framework written in Ruby.

When I started at Optoro (which coincidentally was also my first professional programming experience), one of my first projects was to create an ETL solution that had been sitting on the back burner for a while. The goal was to migrate a series of Mongo collections to equivalent MySQL database tables so that the company’s analysts could easily access the data from their Windows-based GUI SQL clients. In the case of embedded documents, the structure essentially had to be “flattened” into a series of SQL columns. An additional requirement was that the schema should be determined dynamically during each run. If we started to add arbitrary new fields on future Mongo documents, the program should recognize that and adjust the destination SQL schema appropriately during the next run.

My first attempt at a solution was a very crude Ruby implementation. It “worked”, but it used a ludicrous amount of memory and was horrendously slow. When I say slow, I mean it took over a day to process the entire events collection. The collection was admittedly over a terabyte in size, but that was still unacceptable. If the code had been capable of determining the most recently translated record’s timestamp and only pulling events that were created after that point in time, maybe the other limitations would have been acceptable. But that wasn’t the case.

Over the next few months, I spent a great deal of time reading about ETL and data warehouse solutions. I came into this project without any knowledge of these things, and with programming experience that was fairly limited to typical Ruby on Rails applications and other small-scale school projects. After evaluating other ETL solutions in the wild, I settled on a second iteration that used Pentaho’s Kettle. Kettle is an open-source Java ETL engine provided by a company that’s been doing ETL for over a decade.

My thought process behind using Kettle was simple: Why reinvent the wheel when people who are smarter than I am have already figured out these problems? I spent the next few weeks implementing version 2 of my ETL solution using Kettle. Working with Kettle involves a kind of visual programming using a drag-and-drop GUI interface. Transformation components (read from a database, match against a string, etc.) are arranged on a canvas, connected together, and then configured using step-specific menus. In some cases, you can script custom solutions using Javascript, Java, and with help from a third-party plugin even Ruby. There are literally hundreds of pre-defined steps for everything from reading a CSV file to interacting with a Salesforce module.

By the end of this second iteration, I had a working solution that was reliable and performant. On each run, only new data was processed from the source collection and subsequently written to the destination database. From the end users’ perspective (our analysts), it was a success.

Unfortunately, it was not a success from a development perspective. Kettle was designed for non-programmers. The development environment is mouse-click heavy with tons of windows and menus for every minute detail. It’s possible to dive into the Java source in order to manage, create, or extend any component of the system, but that’s not the Kettle Way. They’ve gone above and beyond in order to make sure that anything and everything you could possibly want to do to your data is available in a pre-baked configurable step from inside the GUI. This is great for ensuring that Kettle can handle any conceivable ETL problem, but it also results in a lot of complexity. My Kettle project was massive and completely unmaintainable. Making a change required at minimum a full day of refreshing myself on how things worked. Debugging anything was a nightmare. It was completely untestable, and to make matters worse I was the only developer on our team who had any idea how to work with Kettle.

A few weeks ago, we made some changes to our Mongo infrastructure that resulted in a need to refactor how my ETL project works. I was dreaded even opening the project files. Then, I had an epiphany. Why not do it in Ruby? It’s been a year since my first failed attempt. I’ve spent that year reading about, working with, and generally absorbing a great deal of knowledge about ETL solutions. As the only ETL developer on the team, I could say with certainty that re-implementing this entire thing in Ruby would make developer happiness increase by 1000%.

Enter Rodimus.

One major thing I’ve learned about ETL in the past year or so is that solutions tend to be very targeted to their specific domain. When you try to generalize too much, you end up with a conglomerate of options and a barrier of entry that is far too high. Kettle is, by design, the one stop shop for ETL solutions. Entire books have been written on it because it takes an entire book to even begin to understand everything it has to offer. That’s all well and good if you’re a non-programmer who’s working with ETL, but I wanted something simple, clean, and easily maintainable. I wanted something that would fit nicely into our Ruby ecosystem.

I approached Rodimus with the goal of simplicity. I wanted something very lightweight with minimal dependencies upon which targeted ETL solutions could be implemented. Its forking process approach to concurrency is actually inspired by Kettle’s design. Check out the README for more details.

It only took me a couple of nights to produce the first version of Rodimus. A few days later, I had rewritten our entire ETL stack on top of it.
When I look at the code now, its simplicity still surprises me when compared to the monstrosity that I had previously implemented in Kettle. Approaching the ETL project is no longer an exercise in frustration. I am more confident in my (now testable) solution, and I can easily share an understanding of the project with my coworkers. In all, I am a much happier developer.

I have future plans for Rodimus, but I think I will continue to strive for simplicity at its core. ETL can be complex, but that complexity should live solely in the specific application. It shouldn’t be the concern of the ETL engine.

If you enjoyed this post, please consider subscribing.

Comments