5-minute read
Snowflake has emerged at the forefront of a new era in enterprise data warehousing. Among its most game-changing features is its decoupling of compute resources from storage. This enables speed, scalability, and operational efficiency we data engineers could only dream of in the era of on-prem analytics. With on-demand compute in the cloud, the sky is the limit.
Snowflake solves a great many old data engineering problems, but others remain. The “T” in ELT brings its own set of issues. In particular, scheduling, alerting, and resource provisioning for heavy-duty data transformations are not easily accomplished with SQL alone. The cloud puts a great deal of power at our fingertips. How do we make use of it?
Enter Fivetran. While primarily billing itself as a turn-key solution for data integration from third-party sources, Fivetran also has a handy feature under the hood, simply (and appropriately) called “Transformations.” We should note that Fivetran Transformations support other database technologies too, but for our purposes here, we’ll focus on Snowflake.
In Fivetran, a Transformation is a custom SQL statement coupled with a triggering and alerting mechanism.
If that sounds mind-numbingly simple, that’s because, well, it is. But that’s the beauty of it. Consider the implications: If you can encapsulate a data transformation as a SQL statement, you can harness the database’s compute resources without having to write nor maintain any code for scheduling and fault alerting. With Snowflake, this also means you can dynamically provision the compute horsepower your transformation requires, then idle it when it’s not needed. Fivetran simply acts as a SQL client. Conveniently, Fivetran can automatically trigger a transformation to run upon completion of loading raw data via one of its own connectors to a third-party app. It can also run transformations on scheduled intervals, which means it can operate upon any data in your database, whether or not you used Fivetran to collect it.
Take the following use case: Let’s say you work for Widgets, Inc., manufacturer of IoT widgets that phone home with data on their locations and health. You already have a process in place landing the raw data in your Snowflake instance. Your team of analysts needs reporting the status of these widgets by region; perhaps differences in climate can affect their performance. Millions of these widgets update their status frequently, so these telemetry records pile up in the billions.
This is one of those situations where the operation is simple, but making it perform at scale can get hairy. This is where columnar databases like Snowflake really show their strength: Crunching numbers from big data sets while neatly abstracting away the parallel processing infrastructure required. We might as well let Snowflake do the heavy lifting—because that’s what it’s for. The only ingredient we need to code is the SQL. Something like this:
MERGE INTO WIDGET_WAREHOUSE.STATUS_BY_REGION AS tgt
USING
(
SELECT
REGION_ID
, STATUS_ID
, COUNT(DISTINCT WIDGET_ID) AS WIDGET_COUNT
FROM WIDGET_RAW_DATA.WIDGET_STATUS
GROUP BY REGION_ID, STATUS_ID
) AS src
WHEN MATCHED
THEN UPDATE SET tgt.WIDGET_COUNT = src.WIDGET_COUNT
WHEN NOT MATCHED
THEN INSERT (REGION_ID, STATUS_ID, WIDGET_COUNT)
VALUES (src.REGION_ID, src.STATUS_ID, src.WIDGET_COUNT)
;
In Fivetran, we’ll need to configure our Snowflake database as a destination, using a Snowflake service account. Within Snowflake, we can configure this account to leverage the compute resources we’ll need to perform the operation within our SLA.
Then, we can add a Transformation to the Fivetran destination, add our SQL statement, and set it on a schedule.
And voilà ! We’re done. Fivetran will track the execution history of our transformation and can be configured to send email alerts in case of failure.
Of course, the simplicity of this approach comes with constraints. At this time, Fivetran does not officially provide any mechanism for injecting dynamic parameters into our SQL (although a Data Build Tool transformation feature currently in beta certainly has our attention). Airflow is another popular, open-source alternative where more complex transformation logic is needed.
Sometimes you need a samurai sword; other times, you need a scalpel. If it’s the former, it’s nice to have a tool that introduces only the complexity you need, and for that, Fivetran is great.
Like what you see?
Paul Anderson is a senior data engineer at Logic20/20 as well as a business intelligence developer and tech industry veteran.