Any data warehouse project normally has three distinct stages: an initial period of requirements gathering, followed by a long build phase, and finally an acceptance phase. One of the main reasons building a data warehouse has traditionally been such a painful exercise is that the technical implementation not only takes a long time but is also very expensive to rework—something you are asked to do more often than you’d like. How many times after your build is complete have you heard “Hmm, that’s not what I asked for” or “Now that I see it, I realize I need something totally different”? Incredibly painful, indeed.
The good news is that a lot of the technology we build is pattern-based, and anything pattern-based can be automated. And so automation software like Coalesce arose from people asking themselves, “Why don’t we just automate all the steps of the time-consuming build phase?” Automating all of that frees up your team to focus on the requirements gathering upfront as well as the acceptance phase at the end, lessening the burden of error-prone, time-consuming manual work on your people and making future data warehouse projects more likely to succeed.
“Anything pattern-based can be automated.”
So what exactly are the parts of the build phase that can be automated? Nearly every step! When breaking it down for my own clients, I like to explain it as “the 6 Ds of automation,” which I outline below:
1. Design automation
The first step where we can rely on automation is design, the phase in which you plan out what you are going to build. For large parts of what we build in a data warehouse, the model constructs follow patterns. A dimension table always follows the same pattern: it has a surrogate key, a business key, some system columns for tracking changes, and things like that. If I’m building a data vault, its hubs and satellites each follow a certain pattern as well: a hub table will have a hash key, a business key, some system columns, and so on. The only thing that really changes is the name of the table and the attributes that go in it.
Automation makes a lot of sense during the design phase because we should not have to reinvent the wheel every time we create a new object, whether that’s a dimension, a satellite, or anything else.
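To make this concrete, here is a minimal sketch in Python of what pattern-based design can look like: given nothing but a table name, a business key, and the business attributes, the rest of the dimension scaffolding falls out of the pattern. The names and conventions are illustrative only, not the output of any particular tool.

```python
# A minimal sketch of pattern-based design (illustrative names, not any
# particular tool's output): the standard dimension scaffolding is derived
# from just the table name, business key, and business attributes.

def dimension_ddl(name: str, business_key: str, attributes: dict[str, str]) -> str:
    """Render CREATE TABLE DDL for a dimension following the standard pattern."""
    columns = [
        f"{name}_sk BIGINT IDENTITY PRIMARY KEY",  # surrogate key
        f"{business_key} VARCHAR NOT NULL",        # business key
    ]
    columns += [f"{col} {dtype}" for col, dtype in attributes.items()]
    columns += [
        "effective_from TIMESTAMP NOT NULL",  # system columns for
        "effective_to TIMESTAMP",             # tracking changes over time
        "is_current BOOLEAN",
    ]
    return f"CREATE TABLE dim_{name} (\n    " + ",\n    ".join(columns) + "\n);"

print(dimension_ddl("customer", "customer_id",
                    {"full_name": "VARCHAR", "region": "VARCHAR"}))
```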
2. Development automation
If design involves planning what you will build, development is the process of creating the physical artifacts that reside in the database (or that will move data from A to B and transform it). As luck would have it, development is another phase that can be fully automated.
The code patterns we generate when we build out a data warehouse—particularly if we’re taking an ELT approach or a database-driven approach—follow a template. It’s either a temporary table, where the code pattern is “insert into table, then select from a set of tables,” or a persistent table, where the code pattern is “merge into table, select from a set of tables.” Because it’s highly pattern-based, we can templatize it. Development automation should also extend to practices such as source code control, which can likewise be handled automatically.
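As a sketch of how templatized development can work, the snippet below renders both code patterns from the same inputs. The SQL dialect, table names, and helper function are illustrative; a real automation tool generates far richer code.

```python
# A minimal sketch of templatized development: the two load patterns from
# the text, rendered from the same inputs. SQL dialect and names are
# illustrative.

INSERT_TEMPLATE = "INSERT INTO {target}\nSELECT {columns}\nFROM {source};"

MERGE_TEMPLATE = (
    "MERGE INTO {target} AS t\n"
    "USING (SELECT {columns} FROM {source}) AS s\n"
    "  ON t.{key} = s.{key}\n"
    "WHEN MATCHED THEN UPDATE SET {updates}\n"
    "WHEN NOT MATCHED THEN INSERT ({columns}) VALUES ({inserts});"
)

def render_load(target, source, key, columns, persistent=False):
    """Generate the load statement: 'insert into' for temporary tables,
    'merge into' for persistent ones."""
    cols = ", ".join(columns)
    if not persistent:
        return INSERT_TEMPLATE.format(target=target, columns=cols, source=source)
    updates = ", ".join(f"{c} = s.{c}" for c in columns if c != key)
    inserts = ", ".join(f"s.{c}" for c in columns)
    return MERGE_TEMPLATE.format(target=target, source=source, key=key,
                                 columns=cols, updates=updates, inserts=inserts)

print(render_load("dim_customer", "stg_customer", "customer_id",
                  ["customer_id", "full_name", "region"], persistent=True))
```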
Automating development means your developers can focus on business decisions, such as “Where is my data coming from?”; “What am I building?”; and “What are the business rules I need to build?” This is much better than them worrying about “What are the 2,000 lines of code I need to write to make this work?”
3. Documentation automation
Documentation has two parts: the first covers what is built; the second explains how it is built. The “what” is usually what you want to describe to your business users—these are the data artifacts available for use in their decision-making. The “how” is more for the technical teams—how does data get from A to B? This includes things like column lineage as well as impact analysis (“If I make a change here, what else do I break?” or “If there’s a change in my incoming source data, what are the downstream impacts of that?”).
Documentation is the biggest bugbear of all IT functions, especially in fast-moving environments. While very large firms often have dedicated documentation teams, most companies do not. And so technical documentation never gets done, or if it does, it’s often done in a rush by the developer, is hardly ever QAed or updated, and quickly becomes stale. While a developer may augment the documentation as they build, they should not be responsible for it.
Documentation should be automatically generated, ideally driven from metadata so that it is always current.
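As a minimal sketch of what metadata-driven documentation enables: if column lineage is captured as a graph, impact analysis is simply a walk downstream through that graph. The lineage entries below are illustrative.

```python
# A minimal sketch of impact analysis over a column lineage graph held as
# metadata. The lineage entries are illustrative.

from collections import defaultdict

lineage = defaultdict(set)  # upstream column -> columns derived from it
lineage["src_orders.amount"].add("stg_orders.amount_usd")
lineage["stg_orders.amount_usd"].update({"dim_sales.total_usd",
                                         "rpt_revenue.monthly_usd"})

def downstream_impact(column: str) -> set[str]:
    """Everything that breaks, directly or transitively, if `column` changes."""
    impacted, stack = set(), [column]
    while stack:
        for child in lineage[stack.pop()]:
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return impacted

print(downstream_impact("src_orders.amount"))
# {'stg_orders.amount_usd', 'dim_sales.total_usd', 'rpt_revenue.monthly_usd'}
```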
4. Deployment automation
The days when you had an army of DBAs and system administrators taking the system down for a weekend to do a deployment should be over. Often belonging to the realm of DevOps or continuous integration/continuous delivery (CI/CD), deployment automation means that when I finish developing and documenting a set of artifacts and need to deploy them to a higher environment—be it QA, test, pre-production, or production—that process should be highly automated. It also needs to be highly auditable, so that an auditor can ask of any artifact in a production environment, “How did this artifact get here, who put it here, when, and why?” I should also be able to roll back to a previous point if I mess up a deployment.
All of this should be automated so that it’s at the touch of a button, or even just on an automatic schedule that says, “Artifacts have been approved—make it so!”
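Here is a minimal sketch of that idea, with illustrative names throughout: every promotion writes an audit entry answering who, what, when, and why, and rolling back is just redeploying the previous recorded version.

```python
# A minimal sketch of auditable deployment: each promotion records who, what,
# when, and why; rollback redeploys the prior recorded version. Illustrative
# names throughout.

import datetime

audit_log: list[dict] = []

def deploy(artifact, version, environment, user, reason):
    """Promote an artifact and write the audit entry that answers
    'how did this get here, who put it here, when, and why?'"""
    audit_log.append({
        "artifact": artifact, "version": version, "environment": environment,
        "user": user, "reason": reason,
        "deployed_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })

def rollback(artifact, environment, user):
    """Redeploy the previous version recorded for this artifact/environment."""
    history = [e for e in audit_log
               if e["artifact"] == artifact and e["environment"] == environment]
    if len(history) < 2:
        raise ValueError("no earlier version to roll back to")
    deploy(artifact, history[-2]["version"], environment, user, "rollback")

deploy("dim_customer", "v12", "production", "ci-bot", "release 2024.3 approved")
```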
5. Daily operations automation
Daily operations automation is really about orchestration or scheduling. In the traditional data warehouse world, data feeds are scheduled to run once a day: at midnight we take yesterday’s data and load it into our data warehouse. In the modern data world, however, this happens much more frequently: from daily to hourly, to what is called “microbatch” (every few minutes), all the way down to real time.
Because of the dependencies between objects (how data flows from the source into the final end-user object), the order in which we load data is very important. Any time I orchestrate or change my scheduling, the dependencies should be automatically maintained, because if things run out of order, the data becomes incorrect. Dependency maintenance automation is critical.
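To illustrate, the load order can be derived from the dependency graph itself rather than maintained by hand; the sketch below uses Python’s standard-library topological sort with illustrative object names.

```python
# A minimal sketch of dependency maintenance: derive the load order from the
# object dependency graph itself with a topological sort, so reordering never
# has to be done by hand. Object names are illustrative.

from graphlib import TopologicalSorter  # standard library, Python 3.9+

# each object maps to the upstream objects it depends on
dependencies = {
    "stg_orders": {"src_orders"},
    "stg_customers": {"src_customers"},
    "dim_customer": {"stg_customers"},
    "fct_sales": {"stg_orders", "dim_customer"},
}

load_order = list(TopologicalSorter(dependencies).static_order())
print(load_order)  # sources first, then staging, then dimensions, then facts
```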
In addition, you should be capturing rich operational metadata as part of your daily operations automation. This includes metadata around what ran; when it started; how long it took; records inserted, updated, deleted, and merged; success messages; error messages; operating system messages—all of this should be automatically captured and maintained so that you can look at the performance of your data warehouse over time and troubleshoot any problems you encounter. Say, for example, your data warehouse experiences a significant slowdown. You might wonder if the volume of data just got too big or if there’s another problem. With access to this metadata, you can determine if there’s something technical going on, such as if an index has gotten out of sync or has become fragmented and needs to be rebuilt.
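As a sketch of the idea, the wrapper below records that kind of operational metadata for every load step it runs; the step name and fields are illustrative.

```python
# A minimal sketch of operational metadata capture: record what ran, when it
# started, how long it took, how many rows it touched, and any error message,
# so performance can be analyzed over time. Names are illustrative.

import time

run_history: list[dict] = []

def run_step(name, step):
    """Execute a load step (a callable returning a row count) and log it."""
    record = {"step": name, "started_at": time.time()}
    try:
        record["rows_affected"] = step()
        record["status"] = "success"
    except Exception as exc:
        record["status"] = "error"
        record["message"] = repr(exc)
    record["duration_s"] = round(time.time() - record["started_at"], 3)
    run_history.append(record)

run_step("load_dim_customer", lambda: 1_542)  # stand-in for a real load
```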
6. Data quality automation
Data quality should be part of the DNA of every project you do. For years we never really focused on data quality; it was a big enough job just moving the data from all the source systems into the data warehouse. If we did do it, it was done manually. But when your end users are making important business decisions, they need to be confident that they are working with high-quality data. Otherwise they could harm the business, lose their job, or worse.
“Data quality should be part of the DNA of every project you do.”
Developers need to think about data quality when they are doing their development. Is the code of good quality? Are they following the right templates? Are they following the standards their architects have laid down? Are they creating Cartesian joins when they join data together? Automated data quality testing can help with all of this.
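As one example of such an automated check, sketched below with illustrative names, you can compare row counts before and after a join to catch accidental fan-out, the telltale symptom of a Cartesian join.

```python
# A minimal sketch of an automated development-time check: a join that
# produces more rows than its driving table has fanned out, the telltale
# sign of a bad join condition. The runner and table names are illustrative.

def assert_no_fanout(run_query, left_table: str, join_sql: str) -> None:
    """Fail the build if a join yields more rows than the driving table."""
    before = run_query(f"SELECT COUNT(*) FROM {left_table}")
    after = run_query(f"SELECT COUNT(*) FROM ({join_sql}) q")
    if after > before:
        raise AssertionError(
            f"join fan-out: {left_table} has {before} rows, result has {after}"
        )

# usage with any callable that runs SQL and returns a scalar, e.g.:
# assert_no_fanout(run_query, "stg_orders",
#     "SELECT * FROM stg_orders o JOIN dim_customer c USING (customer_id)")
```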
Data quality also plays an important role at user acceptance testing (UAT) time. Your users don’t want to rerun every test they’ve ever run before; they just want to test what’s new and rely on an automated testing platform to confirm none of the old stuff is broken.
A third area where data quality automation is useful is monitoring in production. Do you have everything you should have? Are your business rules and data volumes within the range you expect? Has every data quality problem you’ve ever found been checked to confirm it has not recurred in this run? Have any foreseeable data quality issues been checked for as well? Once you can confirm all this, you can assure users that they can have confidence in the data.
The sheer scale of all this testing, which has to be done continuously, means that it is impossible to do it manually, and therefore it requires automation.
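As a simple sketch of one such production check, with illustrative numbers: flag any load whose volume falls outside a tolerance band around its historical average.

```python
# A minimal sketch of a production volume check: flag a load whose row count
# drifts outside a tolerance band around its historical average. The numbers
# are illustrative.

def volume_in_range(today: int, history: list[int], tolerance: float = 0.25) -> bool:
    """True if today's row count is within +/-tolerance of the historical mean."""
    if not history:
        return True  # nothing to compare against yet
    mean = sum(history) / len(history)
    return (1 - tolerance) * mean <= today <= (1 + tolerance) * mean

recent_loads = [10_120, 9_870, 10_340, 10_015]
assert volume_in_range(10_200, recent_loads)      # a normal day passes
assert not volume_in_range(2_500, recent_loads)   # a sudden drop is flagged
```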
Work smarter, not harder
In today’s tightening economy, companies are trying to figure out how to do better. Because a business can only improve on something that it can measure, the value of the data in its data warehouse has become more important than ever. But hiring freezes and even workforce reductions mean there’s a shortage of human brains, so those humans need to work smarter rather than harder. What does smarter look like? Automation. Let those human brains focus on the things you actually need a human to do—everything else we’ve covered here can be done by a machine.
Many companies today are starting to embrace automation, but instead of looking at automation as a whole, too many take a piecemeal approach to their automation efforts, focusing on just one part of the bigger process: “I have a development automation problem, so let me go and buy a development automation tool.” Yes, there are great development automation tools out there, but most companies do not have a development automation problem—they have a data warehouse automation problem.
This is why you should look for an automation solution that focuses on your data warehouse implementation holistically, not on the individual steps. All the individual steps are important, but having the complete view of automation is what will make your data warehouse process a seamless one and, to take it a step further, enable key data governance initiatives that businesses must prioritize.
The future of automation and more
Want to learn more about the top data and technology trends our experts predict for 2023 and beyond? Download our special top trends report.
Michael Tantrum, National Sales Director at Resultant, has over 30 years of data warehouse experience in architectural, design, and selling capacities. He brings a wealth of knowledge to organizations embarking on data warehouse transformation journeys and is a frequent speaker on the topics of data warehouse modernization and automation.