Contrary to what you might hear at data meetups lately, AI isn’t changing what data engineers do; it’s just turning up the pressure. There’s more data flowing through systems, more rapid changes to accommodate, and more urgent requests (“can we ship that today?”) coming from the business.
But here’s the boring truth no one wants to bring up: most of the pain isn’t coming from GPUs or fancy models but from orchestration challenges, schema drift, governance bottlenecks, and the eternal question of “who broke column X again?”
What’s actually happening is that data engineering is becoming something bigger: a framework for standardizing change, so that humans and AI codegen can both move fast without damaging trust. This means prioritizing patterns over one-off projects, establishing contracts at the edges, and implementing column-level lineage and guardrails directly into the data development path through tests, approvals, and clear ownership.
The pain points — and how to fix them
Slow development loops and risky merges
It’s a common scenario: feedback from business stakeholders takes hours, so your engineers start batching changes and shipping risky deployments. Rather than wait all that time to see if one small code change works, they bundle five changes together and cross their fingers.
The solution isn’t complex, but it requires discipline. You need local-ish or ephemeral dev environments, seedable test data, and pre-merge preview runs. Every pull request should compile, run core tests, and show a diff of affected models before a human ever reviews it.
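To make this concrete, here’s a minimal sketch of a pre-merge preview run. It assumes Snowflake-style zero-copy cloning, plus a hypothetical PR (#42) touching a hypothetical orders_daily model; engines without CLONE can substitute a CTAS copy.

```sql
-- Spin up an ephemeral environment for the PR (hypothetical PR #42).
-- Snowflake's zero-copy CLONE makes this cheap and nearly instant.
CREATE SCHEMA analytics_pr_42 CLONE analytics;

-- Rebuild only the changed model inside the ephemeral schema.
CREATE OR REPLACE TABLE analytics_pr_42.orders_daily AS
SELECT order_date, COUNT(*) AS order_count, SUM(amount) AS revenue
FROM raw.orders
GROUP BY order_date;

-- Smoke test: rows that differ from production. An empty result means
-- the change is a no-op; CI can post this diff directly on the PR.
SELECT * FROM analytics_pr_42.orders_daily
EXCEPT
SELECT * FROM analytics.orders_daily;

-- Tear down when the PR closes.
DROP SCHEMA analytics_pr_42;
```

Wire those four statements into a CI job and every PR arrives with a reviewable diff before a human ever looks at it.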
Schema drift that nukes predictability
You can’t code confidently when small changes cause unpredictable ripple effects. This is where standardizing patterns (SCD2, merge-upsert, CDC) becomes critical, along with enforcing contracts at inputs/outputs (names, types, nullability, SLAs). The goal is to fail fast on drift in CI rather than discover problems after deployment, when they are far more time-consuming to fix.
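One way to fail fast is to express the contract as data and let CI run a query that returns violations. Here’s a rough, Postgres-flavored sketch against information_schema, with a hypothetical raw.orders input; any row returned fails the build.

```sql
-- Expected contract for raw.orders: names, types, nullability.
WITH contract (column_name, data_type, is_nullable) AS (
    VALUES ('order_id',   'bigint',  'NO'),
           ('order_date', 'date',    'NO'),
           ('amount',     'numeric', 'YES')
),
actual AS (
    SELECT column_name::text, data_type::text, is_nullable::text
    FROM information_schema.columns
    WHERE table_schema = 'raw' AND table_name = 'orders'
)
-- Any row returned is drift; CI blocks the merge if the result is non-empty.
(SELECT 'missing or retyped' AS violation, * FROM contract
 EXCEPT
 SELECT 'missing or retyped', * FROM actual)
UNION ALL
(SELECT 'unexpected column', * FROM actual
 EXCEPT
 SELECT 'unexpected column', * FROM contract);
```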
Governance becomes your biggest bottleneck
Nothing kills momentum like audits and approvals that land after the code is written. The answer is to make tests, approvals, and change tickets a part of the PR checks — compliance by construction, if you will. Ownership and promotion rules should be clearly defined and enforced through code, not tribal knowledge. Governance should happen in the flow of the work, built into the data development lifecycle.
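As a sketch of what “compliance by construction” can look like: ownership lives in a table, and a PR check refuses to promote anything unowned. This assumes a hypothetical meta.table_owners registry.

```sql
-- Every production table must have a registered owner before promotion.
-- meta.table_owners is a hypothetical registry: (table_schema, table_name, owner).
SELECT t.table_schema, t.table_name
FROM information_schema.tables AS t
LEFT JOIN meta.table_owners AS o
       ON o.table_schema = t.table_schema
      AND o.table_name   = t.table_name
WHERE t.table_schema = 'analytics'
  AND o.owner IS NULL;
-- Wired in as a PR check, a non-empty result blocks the merge,
-- so ownership is enforced through code, not tribal knowledge.
```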
Documentation dies a slow death
If documentation work is a separate chore, it won’t happen. Engineers simply don’t have the time, and their development work takes priority every time. The fix is simple: auto-generate documentation from metadata and diffs, then let humans add the “why” behind it. Most importantly, surface documentation in the same place you develop so it updates with the build.
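The metadata half of that fix can be a one-query job. Here’s a rough sketch that emits Markdown doc lines straight from information_schema, for a hypothetical analytics.orders_daily model:

```sql
-- Generate a Markdown column table for a model's doc page.
SELECT '| ' || column_name || ' | ' || data_type || ' | ' || is_nullable || ' |'
       AS doc_line
FROM information_schema.columns
WHERE table_schema = 'analytics'
  AND table_name   = 'orders_daily'
ORDER BY ordinal_position;
-- A CI step writes the output into the repo on every merge, so the
-- reference section updates with the build; humans add the "why" on top.
```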
AI codegen amplifies chaos
AI can write SQL, but it can’t enforce your standards or understand your business requirements and context. So adopt a division of labor that works: define templates, naming conventions, and contracts first. Let AI draft the scaffolding, but keep humans in charge of the logic, modeling, and reviews. Gate everything with tests and approvals, always keeping a human in the loop.
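In practice, that can mean publishing a skeleton like this one and letting AI fill in only the marked sections. The template below is hypothetical; the naming convention (stg_<source>__<entity>) and column names are illustrative.

```sql
-- stg_<source>__<entity>: a standard staging template (hypothetical convention).
-- AI may draft the column renames; humans own filters and business logic.
CREATE OR REPLACE VIEW staging.stg_shop__orders AS
SELECT
    -- >>> AI-draftable: map raw columns to contract names and types
    order_id::bigint      AS order_id,
    created_at::date      AS order_date,
    total_amount::numeric AS amount
    -- <<<
FROM raw.shop_orders
-- >>> Human-owned: business rules never ship without review
WHERE NOT is_test_order;
-- <<<
```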
Multi-platform becomes a headache
Nobody wants to maintain three toolchains and four CI patterns. The solution is to build one mental model and one CI/CD path that spans the data platforms you use: Snowflake, Databricks, Fabric, or any combination thereof. Platform differences should be addressed with configuration, not rewrites.
Coalesce can help put all these best practices in motion by giving you metadata-driven templates, contracts, preview builds, Git-native reviews, and promotion rules all in one place so that the dev loop is fast and consistent. But if you’re rolling your own, you can still implement the same practices with SQL, GitHub Actions, and your orchestrator of choice.
Best practices: Engineering patterns that scale
- Choose templates over vibes: Publish a minimal set of reusable nodes (merge-upsert, SCD2, CDC stream, dim/fact loaders, materialized view refresh). New work should extend these patterns, not rely on ad hoc SQL — see the merge-upsert sketch after this list.
- Treat change like code: Branches, PRs, preview builds, owners, and checklists should all be automated and enforced. No “side door” deploys — every change should go through review and validation.
- Make impact checks part of the PR: Run smoke tests and dependency checks automatically before review. If a change would break something downstream, the PR should already alert you.
- Ship data products, not loose tables: Curate and certify a small set of trusted data sets with clear owners and SLAs; deprecate or hide the rest.
- Use AI for repetitive, boring tasks — not strategic work: Leverage AI for SQL scaffolding, test skeletons, doc drafts, and commit messages. Rely on humans instead for your dimensional design or business logic.
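Here’s roughly what a reusable merge-upsert node boils down to once table, key, and column names are parameterized. A minimal sketch, using hypothetical raw.orders and analytics.orders tables:

```sql
-- Reusable merge-upsert pattern: new work fills in the table, key, and
-- column names instead of hand-writing ad hoc load logic each time.
MERGE INTO analytics.orders AS tgt
USING raw.orders AS src
   ON tgt.order_id = src.order_id          -- natural/business key
WHEN MATCHED THEN UPDATE SET
     order_date = src.order_date,
     amount     = src.amount
WHEN NOT MATCHED THEN INSERT (order_id, order_date, amount)
     VALUES (src.order_id, src.order_date, src.amount);
```

Publish one blessed version of this, and every new load becomes a fill-in-the-blanks exercise instead of a review argument.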
Your AI-ready engineering scorecard
Ship-ready data doesn’t happen by accident — it comes from repeatable patterns, safe change, and automation that scales. To find out how prepared your pipelines are for the AI era, pick one project and answer each question with a rating from 0–2 (0 = No / 1 = Partial / 2 = Yes).
Question #1: Do producers/consumers agree on names, types, nullability, SLAs — and do builds fail on drift?
Best practices: Contracts live with code and block merges when violated.
Question #2: Can you compile and run a scoped build with seeded data before review?
Best practices: Automated preview with diff of affected models and test results.
Question #3: Are SCD2, merge-upsert, CDC, dims/facts templated?
Best practices: Reusable nodes; new models reference patterns, not copy-paste SQL.
Question #4: Do PRs gate deploys with tests/approvals and clear ownership?
Best practices: Envs + PR checks + owners required; no direct prod edits.
Question #5: Do docs auto-update from metadata/diffs and live where the work is done?
Best practices: Generated docs with human notes layered on; linked from PRs.
Question #6: Are your certified data sets clearly identified, with named owners on the hook for them?
Best practices: Small curated set with owners, SLAs, and usage visibility.
Question #7: Is AI scoped to tasks where it helps without lowering standards?
Best practices: AI drafts scaffolding (SQL/tests/docs/commits); templates enforce shape.
Question #8: Can the same workflow target your cloud engines without rewrites?
Best practices: Single CI/CD path and APIs; platform is a config toggle.
Now add up your scores. How do you rank?
14–16: You’re leading the pack — your data pipelines are AI-ready.
10–13: You’re close — focus on standardization and automation.
Below 10: You’ve got a foundation to build on, but there’s real work ahead.
Move fast, fix less
AI accelerates change, and the trick lies in standardizing the dev loop so change is safe and fast. Start small and boring. Pick one domain, publish a merge/SCD2 template, turn on preview builds with seeded data, require contract checks in PRs, and certify two high-value data sets with clear owners and SLAs. Track lead time to production and break-fix hours for two sprints, then iterate on what actually reduces toil.
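If SCD2 is where you start, its core is two steps: close out current rows whose tracked attributes changed, then insert fresh versions. A rough, Postgres-flavored sketch, assuming a hypothetical dim_customer with valid_from / valid_to / is_current columns:

```sql
-- Step 1: close out current rows whose tracked attributes changed.
UPDATE analytics.dim_customer AS d
SET valid_to   = CURRENT_TIMESTAMP,
    is_current = FALSE
FROM raw.customers AS s
WHERE d.customer_id = s.customer_id
  AND d.is_current
  AND (d.email <> s.email OR d.segment <> s.segment);

-- Step 2: insert a fresh version for new and just-closed customers.
INSERT INTO analytics.dim_customer
    (customer_id, email, segment, valid_from, valid_to, is_current)
SELECT s.customer_id, s.email, s.segment,
       CURRENT_TIMESTAMP, NULL, TRUE
FROM raw.customers AS s
LEFT JOIN analytics.dim_customer AS d
       ON d.customer_id = s.customer_id AND d.is_current
WHERE d.customer_id IS NULL;  -- no current row means new or changed
```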
Whether you stitch this together yourself with SQL + Git + CI — or get the scaffolding out of the box by leaning on a platform like Coalesce — the same principle stands: treat metadata as code and put governance where the work happens. Do that, and you’ll ship AI-ready data faster, with fewer surprises — and everyone will sleep better.