From ETL to AI(E)TL: Rethinking Data Pipelines for the AI Era
If you know data pipelines, you already know 70% of AI pipelines. Here’s the other 30%.
I’ve built data pipelines for 20 years. Extract. Transform. Load. The pattern is carved into muscle memory. Pull data from sources, clean it, reshape it, load it into the warehouse.
Then I started integrating AI into pipelines. Suddenly the rules changed.
Not everything. The fundamentals still apply. But there’s a new step that changes how you think about data quality, testing, and reliability.
The evaluation step.
Traditional ETL becomes AI(E)TL: Extract → Transform → Evaluate → Load.
That “E” is where probabilistic systems meet deterministic infrastructure. Where you need quality gates for outputs that are never exactly the same twice. This isn’t optional—without evaluation, you’re loading hallucinations, errors, and inconsistencies directly into your systems.
Let me show you what changes, what stays the same, and how to think about AI-augmented data pipelines.
The Core Insight: Add an “E” for Evaluation
Traditional ETL: Extract → Transform → Load
AI-augmented ETL (AI(E)TL): Extract → Transform → Evaluate → Load
The evaluation step validates AI-generated outputs before loading them into your warehouse.
Why?
Deterministic transforms don’t need validation. If you write SELECT UPPER(name), you get uppercase names. Every time. Predictably.
AI transforms do. If you send 100 customer reviews to an LLM for sentiment classification, you get 100 classifications. Some accurate. Some hallucinated. Some off-topic.
You can’t trust probabilistic outputs the way you trust SQL transforms.
So you evaluate them. This is where Arbiter fits—the evaluation framework I built specifically for this problem (see “The LLM Evaluation Stack” for the full story). It’s the quality gate between transformation and loading.
Real Example: Customer Feedback Pipeline
Extract: Pull customer reviews from API. Transform: Categorize and analyze sentiment using AI. Evaluate: Validate categorization accuracy with Arbiter. Load: Insert into warehouse.
Quality gate: Only load if evaluation score > 0.8
Without the evaluation step, you’re loading AI hallucinations into your production database.
With it, you catch problems before they propagate.
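Here’s a minimal sketch of that gate in Python. The eval_score field is a placeholder name for whatever score your Evaluate step attaches; it isn’t any specific framework’s schema.

```python
EVAL_THRESHOLD = 0.8  # the quality gate from the example above

def apply_quality_gate(rows, threshold=EVAL_THRESHOLD):
    """Split evaluated rows into load-worthy and rejected."""
    validated = [r for r in rows if r["eval_score"] > threshold]
    rejected = [r for r in rows if r["eval_score"] <= threshold]
    return validated, rejected

# Usage: each row carries the eval_score attached during the Evaluate step.
rows = [
    {"review_id": 1, "sentiment": "positive", "eval_score": 0.93},
    {"review_id": 2, "sentiment": "negative", "eval_score": 0.41},
]
validated, rejected = apply_quality_gate(rows)
print(f"{len(validated)} to load, {len(rejected)} rejected")
```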
What Changes: The Four Pillars
The fundamentals of data engineering don’t disappear. They adapt.
1. Immutability - Now Includes Context
Traditional ETL: Deterministic transforms take a DataFrame, apply a function like converting to uppercase, and return the same result every time. Run it twice. Same input, same output. Immutable.
AI(E)TL: AI-augmented transforms take data plus context, model, and prompt version. For each row, generate output using the LLM, then evaluate it for accuracy, completeness, and relevance. If the evaluation passes, include it in results. If it fails, log the failure and use a fallback or skip. Same input? Maybe different output. The LLM is non-deterministic.
But you can make it semantically immutable: given the same data, context, model, and prompt version, outputs should be semantically similar.
Track these new inputs:
Source data (traditional)
Prompt text (new)
Model version (new - gpt-4o-mini-2024-07-18)
Context version (new - from your context system)
Temperature setting (new)
All of these affect the output. Log them all.
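A sketch of what that looks like in code: the transform takes every one of those inputs explicitly and logs them before touching a row, so the run is reproducible. The llm argument is any callable you inject; nothing here assumes a particular provider SDK, and the evaluation step is covered further down.

```python
import hashlib
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("aietl")

def ai_transform(rows, *, llm, prompt_text, prompt_version, model_version,
                 context_version, temperature=0.0):
    """Transform rows with an LLM, logging every input that affects the output."""
    run_inputs = {
        "prompt_version": prompt_version,
        "prompt_sha": hashlib.sha256(prompt_text.encode()).hexdigest()[:12],
        "model_version": model_version,
        "context_version": context_version,
        "temperature": temperature,
        "started_at": datetime.now(timezone.utc).isoformat(),
    }
    log.info("run inputs: %s", json.dumps(run_inputs))

    outputs = []
    for row in rows:
        # llm is any callable you supply: (prompt_text, row, model_version, temperature) -> str
        outputs.append({**row, "ai_output": llm(prompt_text, row, model_version, temperature)})
    return outputs, run_inputs
```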
2. Idempotency - Semantic, Not Exact
Traditional ETL: Same input produces exact same output. Run the transform twice with identical input, assert equality, and it passes.
AI(E)TL: Same input produces semantically similar output. Run the AI transform twice with identical input, prompt, and model. Exact equality fails because LLMs are non-deterministic. But semantic similarity succeeds—evaluate both outputs for semantic similarity, and if the score is above 0.9, that’s “good enough.”
You can’t guarantee exact idempotency with LLMs.
You can guarantee semantic idempotency: outputs convey the same meaning, even if worded differently.
This is where the evaluation step becomes critical. You’re testing for semantic consistency, not byte-for-byte equality.
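One way to check semantic idempotency, as a sketch: embed both outputs and compare them with cosine similarity, reusing the 0.9 threshold from above. The embed() function is assumed to be whatever embedding model or API you already use.

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantically_idempotent(output_a, output_b, embed, threshold=0.9):
    """Two runs count as idempotent if their outputs convey the same meaning.

    embed() is whatever embedding function you already use (an embeddings API,
    sentence-transformers, etc.); its existence is an assumption of this sketch.
    """
    return cosine_similarity(embed(output_a), embed(output_b)) >= threshold

# Usage: run the same AI transform twice, then
#   assert semantically_idempotent(output_run_1, output_run_2, embed=my_embed_fn)
```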
3. Testing - Distributional, Not Deterministic
Traditional ETL: Test with exact assertions. Input “john”, transform to uppercase, assert output equals “JOHN”. Deterministic, so exact assertions work.
AI(E)TL: Can’t assert exact output—too brittle. Instead, assert distributional properties. Evaluate the AI output against criteria like “sentiment classification accuracy” with a reference. Assert the overall score is above 0.8, confidence is above 0.7, and the output is in an expected set like ["positive", "very positive", "strongly positive"].
You’re testing that outputs are mostly correct, not exactly correct.
Acceptance criteria:
Semantic similarity > 0.8
Confidence level > 0.7
No hallucinations detected
Output format is valid
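Here’s a sketch of what such a test looks like. Both ai_classify and evaluate_output are stand-ins for your own code, including whatever evaluation framework you use; their signatures are assumptions, but the assertions mirror the criteria above.

```python
ALLOWED_LABELS = {"positive", "very positive", "strongly positive"}

def check_classification_quality(ai_classify, evaluate_output):
    """Distributional assertions instead of exact-match assertions.

    ai_classify and evaluate_output are your own callables (the latter standing in
    for whatever evaluation framework you use); their signatures here are assumed.
    """
    review = "Absolutely love this product, works exactly as described."
    result = ai_classify(review)              # e.g. {"label": "positive", "confidence": 0.91}
    evaluation = evaluate_output(
        output=result["label"],
        criteria="sentiment classification accuracy",
        reference="positive",
    )                                          # e.g. {"score": 0.92}

    assert evaluation["score"] > 0.8           # mostly correct, not exactly correct
    assert result["confidence"] > 0.7
    assert result["label"] in ALLOWED_LABELS   # output stays inside the expected set
```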
4. Monitoring - Track Distributions
Traditional ETL: Monitor exact values. Assert average price is between 0 and 10,000. Assert all ages are between 0 and 120. Exact thresholds work.
AI(E)TL: Monitor distributions. Evaluate all AI outputs against references, extract scores, then monitor statistical properties: mean score above 0.75 for average quality, standard deviation below 0.3 for consistency, minimum score above 0.5 for threshold. Also monitor hallucination rate by evaluating outputs for “no hallucination” and ensuring less than 5% fail.
You’re not checking if price == 29.99. You’re checking if the distribution of evaluation scores stays within acceptable ranges.
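As a sketch, the monitoring check becomes a function over the score distribution rather than over individual rows. The thresholds come from above; the inputs come from your evaluator upstream.

```python
import statistics

def check_quality_distribution(eval_scores, hallucination_flags):
    """Alert on the shape of the score distribution, not on individual values.

    eval_scores are per-output evaluation scores; hallucination_flags are booleans
    from a "no hallucination" check. Both come from your evaluator upstream.
    """
    alerts = []
    if statistics.mean(eval_scores) < 0.75:
        alerts.append("mean quality below 0.75")
    if statistics.pstdev(eval_scores) > 0.3:
        alerts.append("quality too inconsistent (stdev above 0.3)")
    if min(eval_scores) < 0.5:
        alerts.append("at least one output below the 0.5 floor")
    if sum(hallucination_flags) / len(hallucination_flags) > 0.05:
        alerts.append("hallucination rate above 5%")
    return alerts

# Usage: alerts = check_quality_distribution(scores, flags); page someone if the list is non-empty.
```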
The New Dependency: Model Versions
Traditional pipeline dependencies:
Code version (git commit sha)
Data schema version
Library versions (requirements.txt)
AI pipeline dependencies add:
Model version (gpt-4o-mini-2024-07-18, claude-3-5-sonnet-20241022)
Model provider (OpenAI, Anthropic, Google, Groq)
Prompt version (prompt_v2.txt)
Context version (context_v1.5)
Temperature (0.0 for deterministic, 0.7 for creative)
Track everything: run_id, code_version (git sha), model, model_version, prompt_version, context_version, temperature, and timestamp. Store this metadata for every pipeline run.
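Here’s one minimal way to capture that metadata per run, assuming the pipeline runs from a git checkout. Where you persist it (a warehouse table, object store, or your orchestrator’s metadata) is up to you.

```python
import json
import subprocess
import uuid
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class RunMetadata:
    run_id: str
    code_version: str      # git sha
    model: str
    model_version: str
    prompt_version: str
    context_version: str
    temperature: float
    timestamp: str

def capture_run_metadata(model, model_version, prompt_version,
                         context_version, temperature) -> RunMetadata:
    # Assumes the pipeline runs from a git checkout so the commit sha is available.
    sha = subprocess.run(["git", "rev-parse", "--short", "HEAD"],
                         capture_output=True, text=True).stdout.strip()
    return RunMetadata(
        run_id=str(uuid.uuid4()),
        code_version=sha,
        model=model,
        model_version=model_version,
        prompt_version=prompt_version,
        context_version=context_version,
        temperature=temperature,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )

meta = capture_run_metadata(
    model="gpt-4o-mini",
    model_version="gpt-4o-mini-2024-07-18",
    prompt_version="prompt_v2.txt",
    context_version="context_v1.5",
    temperature=0.0,
)
print(json.dumps(asdict(meta), indent=2))  # persist this record alongside every run
```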
Model updates can break your pipeline. GPT-4o-mini from July behaves differently than GPT-4o-mini from November.
Pin versions in production. Test before upgrading.
Cost as a First-Class Concern
Traditional ETL: Compute costs are predictable. Run the same query, pay the same cost.
AI(E)TL: Token costs scale with data volume and are variable. This isn’t theoretical—I documented how long context windows can cost you hundreds of dollars when you’re not managing sessions strategically. The same principles apply to pipeline costs: every unnecessary LLM call compounds.
Cost Optimization Patterns
1. Cache LLM Responses
Bad: Re-generate every run. Process each review through the LLM on every pipeline execution.
Good: Cache results. Create a cache key from the review content, model, and prompt version. If the key exists in cache, use the cached sentiment. Otherwise, generate it and cache it. Same review? Use cached result. Don’t pay twice.
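A minimal sketch of that cache, using sqlite3 purely because it ships with Python; any key-value store works, and classify_fn stands in for your actual LLM call.

```python
import hashlib
import json
import sqlite3

# A tiny on-disk cache keyed by (review content, model, prompt version).
conn = sqlite3.connect("llm_cache.db")
conn.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)")
conn.commit()

def cache_key(review_text, model, prompt_version):
    payload = json.dumps({"text": review_text, "model": model, "prompt": prompt_version})
    return hashlib.sha256(payload.encode()).hexdigest()

def classify_with_cache(review_text, model, prompt_version, classify_fn):
    """classify_fn is your actual LLM call, injected so the sketch stays provider-agnostic."""
    key = cache_key(review_text, model, prompt_version)
    row = conn.execute("SELECT value FROM cache WHERE key = ?", (key,)).fetchone()
    if row:
        return row[0]  # same review, same model, same prompt: do not pay twice
    result = classify_fn(review_text)
    conn.execute("INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)", (key, result))
    conn.commit()
    return result
```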
2. Batch Processing
Bad: Individual API calls create high overhead. Processing 1,000 items means 1,000 API calls.
Good: Batch processing amortizes overhead. Process items in batches of 50, and 1,000 items become 20 API calls instead of 1,000.
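A sketch of the batching helper. classify_batch_fn is assumed to send one request covering a whole batch, for example by packing the reviews into a single prompt or using a provider’s batch endpoint.

```python
def batched(items, batch_size=50):
    """Yield successive fixed-size batches from a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def classify_in_batches(reviews, classify_batch_fn, batch_size=50):
    """classify_batch_fn sends one request covering a whole batch, for example by
    packing the reviews into a single prompt or using a provider batch endpoint."""
    results = []
    for batch in batched(reviews, batch_size):
        results.extend(classify_batch_fn(batch))
    return results

# 1,000 reviews at batch_size=50 -> 20 calls instead of 1,000.
```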
3. Model Selection
Bad: Use gpt-4o for everything at $5.00 per 1M tokens.
Good: Use an appropriate model for task complexity. Simple tasks use gpt-4o-mini at $0.15 per 1M tokens (33x cheaper). Reserve gpt-4o for medium tasks and for complex tasks where the quality justifies the cost.
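As a sketch, the routing can be as simple as a lookup table. Prices are the ones quoted above; check current provider pricing.

```python
# Prices per 1M input tokens as quoted above; check current provider pricing.
MODEL_BY_COMPLEXITY = {
    "simple": "gpt-4o-mini",   # ~$0.15 per 1M tokens: classification, tagging, extraction
    "medium": "gpt-4o",        # ~$5.00 per 1M tokens
    "complex": "gpt-4o",       # only where the quality justifies the cost
}

def pick_model(task_complexity):
    """Route each task to the cheapest model that can handle it."""
    return MODEL_BY_COMPLEXITY.get(task_complexity, "gpt-4o-mini")
```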
4. Smart Retry Logic
Bad: Naive retry wastes tokens. Retry immediately on failure, burning through tokens without backoff.
Good: Exponential backoff plus cache. Check cache first—if the prompt and model combination exists, return cached result. Otherwise, use retry config with exponential backoff (initial delay 1 second, exponential base 2.0, max delay 60 seconds). Cache successful results.
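A sketch combining both ideas, with the retry settings quoted above. call_fn stands in for your LLM call, and the cache is in-memory only to keep the example short.

```python
import time

# Retry settings from above: 1 second initial delay, exponential base 2.0, 60 second cap.
INITIAL_DELAY, EXP_BASE, MAX_DELAY, MAX_ATTEMPTS = 1.0, 2.0, 60.0, 5

_cache = {}  # (prompt, model) -> result; in-memory only to keep the sketch short

def call_with_retry(prompt, model, call_fn):
    """call_fn is your LLM call; a cache hit skips the call (and the tokens) entirely."""
    key = (prompt, model)
    if key in _cache:
        return _cache[key]

    delay = INITIAL_DELAY
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            result = call_fn(prompt, model)
            _cache[key] = result  # cache successful results
            return result
        except Exception:
            if attempt == MAX_ATTEMPTS:
                raise
            time.sleep(delay)
            delay = min(delay * EXP_BASE, MAX_DELAY)
```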
Cost Example
Processing 10,000 customer reviews:
Without optimization:
Individual API calls: 10,000 calls
No caching: Re-process duplicates
Model: gpt-4o
Cost: ~$50
With optimization:
Batch processing: 200 calls (50 items per batch)
Cached duplicates: -30% calls
Model: gpt-4o-mini for classification, gpt-4o only for complex reviews
Cost: ~$2
25x cost reduction from basic optimization.
Real AI(E)TL Pipeline Examples
Let me show you three AI(E)TL pipeline patterns.
Example 1: Customer Feedback Pipeline
EXTRACT: Fetch customer reviews from API for the date range.
TRANSFORM (AI): For each review, use an LLM to generate classification with category, sentiment, and priority. Append to categorized reviews list.
EVALUATE: For each categorized review, evaluate the category and sentiment for accuracy and appropriateness. If evaluation passes the quality gate, add to validated reviews. If it fails, log a warning with the review preview.
LOAD: Insert only validated reviews into the warehouse. Print counts: processed, validated, and rejected.
Key insight: The evaluation step prevents hallucinated categories from entering the warehouse.
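Here’s the whole pipeline as a sketch. Every callable is injected (API client, LLM wrapper, evaluator, warehouse writer), so the signatures are assumptions; the shape of the flow is the point.

```python
def run_feedback_pipeline(fetch_reviews, classify_review, evaluate, load_rows,
                          start_date, end_date, threshold=0.8):
    """Extract -> Transform (AI) -> Evaluate -> Load for customer feedback."""
    # EXTRACT
    reviews = fetch_reviews(start_date, end_date)

    # TRANSFORM (AI): classify_review is expected to return
    # {"category": ..., "sentiment": ..., "priority": ...}
    categorized = [{**r, **classify_review(r["text"])} for r in reviews]

    # EVALUATE
    validated, rejected = [], []
    for row in categorized:
        score = evaluate(output=row, criteria="category and sentiment accuracy")
        if score > threshold:
            validated.append(row)
        else:
            rejected.append(row)
            print(f"WARNING: rejected review {row['text'][:60]!r} (score {score:.2f})")

    # LOAD: only rows that passed the gate reach the warehouse
    load_rows(validated)
    print(f"processed={len(categorized)} validated={len(validated)} rejected={len(rejected)}")
```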
Example 2: Document Summarization Pipeline
EXTRACT: Fetch documents from S3 bucket and prefix.
TRANSFORM (AI): For each document, use a powerful model (gpt-4o) to generate a summary. Store doc_id, summary text, original length, and summary length.
EVALUATE: For each summary, evaluate for factual accuracy, key point capture, and absence of hallucinations using custom criteria. If evaluation passes, add to validated summaries. If it fails, use an extractive summary fallback method and mark fallback_used as True.
LOAD: Store validated summaries in vector database.
Quality gate: Reject hallucinated summaries. Use fallback for failed evaluations. This is the pattern: evaluate first, load only what passes. It’s the same principle as setting permission boundaries—contain failures before they propagate.
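A sketch of that evaluate-or-fall-back loop, with the same caveat: summarize, evaluate, and extractive_summary stand in for your own implementations, and the criteria string and threshold mirror the pipeline described above.

```python
def summarize_with_fallback(documents, summarize, evaluate, extractive_summary,
                            threshold=0.8):
    """Evaluate each AI summary; fall back to an extractive summary when it fails."""
    results = []
    for doc in documents:
        summary = summarize(doc["text"])  # LLM-generated summary (e.g. with gpt-4o)
        score = evaluate(
            output=summary,
            reference=doc["text"],
            criteria="factual accuracy, key point coverage, no hallucinations",
        )
        if score > threshold:
            results.append({"doc_id": doc["doc_id"], "summary": summary,
                            "fallback_used": False})
        else:
            # Reject the failed summary; keep a deterministic extractive fallback.
            results.append({"doc_id": doc["doc_id"],
                            "summary": extractive_summary(doc["text"]),
                            "fallback_used": True})
    return results
```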
Example 3: Airflow DAG for AI Pipeline
Create an Airflow DAG with four PythonOperator tasks scheduled daily.
Extract task: fetch customer data from the API.
Transform task: pull the extracted data via XCom, enrich it with AI, and return the transformed data.
Evaluate task: pull the transformed data, evaluate each item for data quality, completeness, and accuracy using custom criteria, add an evaluation score and a passed flag to each item, then filter to only the passing items.
Load task: pull the validated data from the evaluate task and load it into the database.
Chain the tasks: extract → transform → evaluate → load.
Airflow integration: Evaluation becomes a first-class task in the pipeline.
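Here’s roughly what that DAG looks like, assuming Airflow 2.4+ (older versions use schedule_interval instead of schedule). The four helpers at the top are stubs standing in for your API client, LLM wrapper, evaluator, and warehouse writer.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Stubs standing in for your own API client, LLM wrapper, evaluator, and warehouse writer.
def extract_data():
    return [{"customer_id": 1, "text": "Great product"}]

def ai_enrich(item):
    return {**item, "ai_summary": "placeholder"}

def evaluate_item(item, criteria):
    return 0.9

def load_rows(rows):
    print(f"loading {len(rows)} rows")

def extract():
    return extract_data()  # return value is pushed to XCom automatically

def transform(ti):
    data = ti.xcom_pull(task_ids="extract")
    return [ai_enrich(item) for item in data]

def evaluate(ti):
    data = ti.xcom_pull(task_ids="transform")
    for item in data:
        item["eval_score"] = evaluate_item(item, criteria="quality, completeness, accuracy")
        item["passed"] = item["eval_score"] > 0.8
    return [item for item in data if item["passed"]]

def load(ti):
    load_rows(ti.xcom_pull(task_ids="evaluate"))

with DAG("aietl_customer_pipeline", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_evaluate = PythonOperator(task_id="evaluate", python_callable=evaluate)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_evaluate >> t_load
```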
When NOT to Use AI in Pipelines
Be honest about limits. AI isn’t always the answer. This is the same principle I wrote about in “Permissions > Context”—some tasks have too high a blast radius for AI, even with excellent context. The same applies to pipelines.
Bad AI Use Cases:
Deterministic transformations → Use SQL. Don’t use an LLM to convert to uppercase. Do use SELECT UPPER(name).
Simple data cleaning → Use regex/pandas. Don’t use an LLM to remove whitespace. Do use df['column'].str.strip().
Calculations → Use Python/SQL. Don’t use an LLM to calculate totals. Do use df['price'] * df['quantity'].
Structured data reshaping → Use dbt. Don’t use an LLM to join tables. Do write SQL joins.
Good AI Use Cases:
Unstructured data processing - Text classification, entity extraction, sentiment analysis
Classification at scale - Categorize 100,000 customer reviews
Summarization - Condense long documents to key points
Data enrichment - Add context from external knowledge
Anomaly detection - Pattern recognition for fraud/outliers
Decision framework: Can you write a deterministic rule? Use code. Do you need semantic understanding? Use AI (with evaluation). And always set permission boundaries so AI can’t accidentally delete production data or force-push to main.
The dbt Analogy: What’s Missing?
dbt transformed SQL data modeling by making it:
Version controlled (SQL as code)
Testable (data quality tests)
Documented (automatically generated docs)
Modular (reusable models)
What’s the dbt equivalent for AI pipelines? This question matters because dbt solved the exact problem AI pipelines face: how do you make probabilistic transformations reliable, testable, and maintainable at scale?
The answer isn’t clear yet—we’re still in early days. But the pattern is emerging: context systems handle prompt versioning, evaluation frameworks handle quality gates, and measurement practices handle proving value.
Current state: No unified framework exists yet.
What it would look like:
dbt for SQL: SQL as transformation layer, version controlled, testing framework (schema tests, data tests), documentation generator, lineage tracking.
“AI-dbt” for AI pipelines: Prompts as transformation layer (version controlled), model plus prompt versioning, evaluation framework (Arbiter), context documentation (AGENTS.md pattern), semantic lineage tracking.
The space is still early. No dominant framework has emerged.
Arbiter handles the evaluation layer. Context management handles the prompt versioning. But there’s no unified orchestration layer yet.
Someone will build it. Maybe you.
The Takeaway
If you know ETL, you know AI pipelines.
Just add the “E”: Extract → Transform → Evaluate → Load.
That evaluation step is where you validate probabilistic outputs before they enter your warehouse. It’s where you catch hallucinations, track costs, monitor quality distributions, and ensure semantic consistency—what Arbiter aims to handle.
But evaluation alone isn’t enough. You also need proper context systems so AI understands your data architecture, permission boundaries to prevent catastrophic failures, and measurement frameworks to prove the value.
The fundamentals still apply:
Immutability (now includes prompts and models)
Idempotency (semantic, not exact)
Testing (distributional, not deterministic)
Monitoring (track distributions, not exact values)
The new concerns:
Track model versions as dependencies
Treat cost as a first-class metric
Build quality gates with evaluation
Use appropriate models for task complexity
AI doesn’t replace data engineering. It extends it. The fundamentals still matter—immutability, idempotency, testing, monitoring. But now you add evaluation gates, cost tracking, and semantic consistency checks. The craft evolves, but the principles remain.
The teams that succeed will apply data engineering rigor to probabilistic systems—measuring what matters, building proper context, and setting boundaries that prevent catastrophic failures.
That’s the approach for building reliable AI pipelines.
————————
How are you using AI in your pipelines? What patterns have you found that work? Reply with your use case.
Want to dive deeper? Check out “The LLM Evaluation Stack” to see how Arbiter approaches the evaluation problem, or “Measuring AI Effectiveness” to learn how to prove your AI pipelines are working.
This is post #10 in the Ashita AI series on building AI systems.
