Why Your Data Pipelines Need to Be Column-Aware

As I have said many times over the past few years, the data landscape has changed dramatically since I built my first database. Back then, I was managing a few government mailing lists, around 1,000 names and addresses in total, that had formerly been handled by COBOL programs. Not only did I model and migrate those lists to a database, but I also merged them into a single integrated database and eliminated duplicate addresses in the process.

Today we are all dealing with hundreds of millions—if not billions—of rows of data that often add up to hundreds of terabytes, and even petabytes in some cases.

Of course, not all that data is in one table or even one database; rather, it is spread across hundreds or even thousands of tables across multiple databases, flat files, images, document stores, and media files. A single organization may have access to millions of attributes. Translate that into database terms, and it means tens or hundreds of millions of columns whose contents the organization needs to understand.

How are we ever going to manage and get value from the data in all of those columns? How will we know where that data came from, where it went, and how it was changed along the way?

I see no other way forward than automation. Automation of everything in our data estate, especially the data pipelines that move and transform all this data. With all the privacy laws and regulations, which vary from country to country and from state to state, how will we ever be able to trace the data and audit these processes—at massive scale—without automation?

The only reasonable answer is that the tools we use need to be column-aware.


It is no longer sufficient to keep track of just tables and databases. What is a table anyway, but a collection of columns organized for a specific purpose?

Our automation tools must collect and manage metadata at the column level, and that metadata has to go beyond data type and size. We need much more context today if we really want to unlock the power of our data: the origin of the data, how current it is, how many hops it made to get to its current state, what rules and transformations were applied along the way, and more.
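To make that concrete, here is a minimal sketch in Python of what column-level metadata might capture beyond type and size. The class and field names are hypothetical illustrations, not how Coalesce or any particular tool models this internally:

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ColumnLineageStep:
    """One hop in a column's journey: where it came from and what was done to it."""
    source_table: str
    source_column: str
    transformation: str          # e.g. "UPPER(TRIM(value))" or "join + deduplicate"
    applied_at: datetime


@dataclass
class ColumnMetadata:
    """Column-level metadata that goes beyond data type and size."""
    table: str
    name: str
    data_type: str
    origin_system: str                     # where the data was first captured
    last_refreshed: datetime               # how current the data is
    lineage: list[ColumnLineageStep] = field(default_factory=list)

    @property
    def hop_count(self) -> int:
        """How many hops the column made to reach its current state."""
        return len(self.lineage)
```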

From a governance perspective, we need to be able to easily apply security classifications, such as PII and PHI, to individual columns, and make sure those classifications follow the data from column to column and table to table throughout the lifecycle of that data. And with those classifications, we need to be able to easily apply consistent security and governance policies, such as masking, encryption, or obfuscation.
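As a rough illustration only (a sketch of the idea, not any tool's actual implementation), propagating a classification such as PII from a source column to a target column, and deriving the policy it implies, might look something like this:

```python
from dataclasses import dataclass, field

# Hypothetical mapping from classification tag to the policy it triggers.
POLICY_BY_CLASSIFICATION = {
    "PII": "mask",        # e.g. show only the last four characters
    "PHI": "encrypt",     # encrypt at rest and in transit
}


@dataclass
class Column:
    table: str
    name: str
    classifications: set[str] = field(default_factory=set)


def propagate_classifications(source: Column, target: Column) -> None:
    """Carry classifications forward when data flows from source to target."""
    target.classifications |= source.classifications


def required_policies(column: Column) -> set[str]:
    """Derive the governance policies implied by a column's classifications."""
    return {POLICY_BY_CLASSIFICATION[c]
            for c in column.classifications
            if c in POLICY_BY_CLASSIFICATION}


# Usage: an email column tagged PII keeps its tag, and therefore its masking
# policy, as it flows into a downstream reporting table.
src = Column("raw.customers", "email", {"PII"})
tgt = Column("mart.customer_report", "email")
propagate_classifications(src, tgt)
assert required_policies(tgt) == {"mask"}
```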

Column awareness is the next level of innovation needed to achieve the agility, governance, and scalability that today’s data world demands.

Legacy ETL and integration tools won’t cut it anymore; they can’t handle the scale and diversity of data we have in the cloud today. Likewise, hand-coding all of this properly won’t scale either (unless you have a massive army of expert data engineers working around the clock).

Much like Snowflake changed the world of databases with its innovative architecture that separates compute from storage, Coalesce has changed the game for data pipelines and transformations by creating the first built-for-the-cloud, user-friendly automation tool with a truly column-aware architecture. With the capabilities of an innovative and powerful tool like this, imagine what value you can unlock as you build out your modern data landscape.

No need to imagine—you can request a demo or give Coalesce a try today!
