DataEngineX unifies data pipelines, ML lifecycle, and AI agents. Config-driven, self-hosted, production-ready. Replaces the Airflow + MLflow + LangChain + FastAPI glue.
pip install dataenginex
— or: uv add dataenginex
data:
  source: s3://my-bucket/raw/
  format: parquet
  quality:
    null_threshold: 0.05
ml:
  backend: mlflow
  training:
    model: xgboost
    target: revenue
ai:
  provider: openai
  retrieval: hybrid
  agents:
    - name: analyst
      tools: [sql, search]
server:
  auth: jwt
  rate_limit: 100/min
observability:
  metrics: prometheus
  tracing: otel
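The quality block above implies a concrete check: compute each column's null ratio and fail when it exceeds null_threshold. A minimal stdlib sketch of that logic (the check_nulls helper and the sample rows are illustrative, not DataEngineX's actual API):

```python
def check_nulls(rows, threshold=0.05):
    """Return columns whose null ratio exceeds the threshold.

    rows: list of dicts (one per record); None marks a null value.
    """
    if not rows:
        return {}
    columns = rows[0].keys()
    failures = {}
    for col in columns:
        nulls = sum(1 for r in rows if r.get(col) is None)
        ratio = nulls / len(rows)
        if ratio > threshold:
            failures[col] = ratio
    return failures

rows = [
    {"clicks": 10, "revenue": 4.2},
    {"clicks": None, "revenue": 3.1},
    {"clicks": 7, "revenue": None},
    {"clicks": None, "revenue": 2.8},
]
print(check_nulls(rows, threshold=0.05))  # → {'clicks': 0.5, 'revenue': 0.25}
```

With the configured 0.05 threshold, both columns above would fail the check; a pipeline would typically halt or quarantine the batch at that point.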
Airflow for orchestration. MLflow for tracking. LangChain for agents. FastAPI wired together by hand. Prometheus bolted on. Each tool: its own config format, auth system, failure mode, oncall rotation. Stop building glue. Start shipping products.
Six domains. One framework. No assembly required.
Data: Connectors, transforms, and quality checks from a single config. DuckDB and Spark backends built in.
ML: Experiment tracking, training, serving, and drift detection. MLflow, W&B, or the built-in backend — your call.
AI: LLM providers, hybrid BM25+dense retrieval, and LangGraph agent runtime — swappable, not locked in.
Server: FastAPI with JWT auth, rate limiting, and health checks. API, background workers, and scheduler under one roof.
Observability: structlog structured logging, Prometheus metrics, and OpenTelemetry tracing — wired up from config, not code.
Infra: K3s, Helm, and Terraform via infradex. From dev to production Kubernetes cluster without writing manifests by hand.
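Hybrid retrieval means fusing a keyword (BM25) ranking with a dense-vector ranking. Whether DataEngineX fuses them this way is not stated here, but reciprocal rank fusion is a common, minimal way to combine the two result lists:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: merge several ranked lists of doc ids.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the constant from the original RRF formulation.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]   # sparse/keyword ranking
dense_hits = ["doc1", "doc5", "doc3"]  # embedding-similarity ranking
print(rrf_fuse([bm25_hits, dense_hits]))  # → ['doc1', 'doc3', 'doc5', 'doc7']
```

RRF needs no score normalization between the two retrievers, which is why it is a popular default for hybrid setups.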
dex.yaml is the single source of truth for your entire platform.
Sources, transforms, quality rules, model config, agent definitions,
API settings, and observability — all in one place.
No more hunting across twelve repos to find why a pipeline broke. No more "it works in dev" because dev and prod share the same config schema.
dex validate dex.yaml
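What a validation pass checks, at minimum, is that the required sections exist and carry sane values. A stdlib-only sketch operating on an already-parsed config dict (the section list and rules here are assumptions for illustration, not the actual `dex validate` rules):

```python
REQUIRED_SECTIONS = ("data", "ml", "ai", "server", "observability")

def validate_config(cfg):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for section in REQUIRED_SECTIONS:
        if section not in cfg:
            errors.append(f"missing section: {section}")
        elif not isinstance(cfg[section], dict):
            errors.append(f"section {section} must be a mapping")
    # Spot-check one nested value: the null threshold must be a ratio.
    threshold = cfg.get("data", {}).get("quality", {}).get("null_threshold")
    if threshold is not None and not 0 <= threshold <= 1:
        errors.append("data.quality.null_threshold must be between 0 and 1")
    return errors

cfg = {"data": {"quality": {"null_threshold": 0.05}}, "ml": {}, "ai": {}}
print(validate_config(cfg))  # → ['missing section: server', 'missing section: observability']
```

In practice you would parse dex.yaml with a YAML loader first; the validation logic stays the same either way.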
# DataEngineX — full stack config
data:
  source: s3://my-bucket/raw/
  format: parquet
  backend: duckdb  # or spark
  quality:
    null_threshold: 0.05
    schema_enforcement: strict
    audit_table: quality.audit
ml:
  backend: mlflow
  tracking_uri: http://mlflow:5000
  training:
    model: xgboost
    target: revenue
    features: [clicks, sessions, region]
  serving:
    endpoint: /api/v1/predict
    drift_detection: true
ai:
  provider: openai
  model: gpt-4o-mini
  retrieval: hybrid  # BM25 + dense
  agents:
    - name: analyst
      tools: [sql, search, python]
server:
  host: 0.0.0.0
  port: 17000
  auth: jwt
  rate_limit: 100/min
observability:
  metrics: prometheus
  tracing: otel
  log_level: info
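The `rate_limit: 100/min` value reads naturally as a token-bucket rate. How DataEngineX enforces it is not shown here; a minimal stdlib sketch of parsing the string and enforcing it could look like:

```python
import time

PERIODS = {"sec": 1.0, "min": 60.0, "hour": 3600.0}

class TokenBucket:
    """Allow `count` requests per period, refilling continuously."""

    def __init__(self, spec, now=time.monotonic):
        count, period = spec.split("/")
        self.capacity = float(count)
        self.rate = self.capacity / PERIODS[period]  # tokens per second
        self.tokens = self.capacity
        self.now = now
        self.last = now()

    def allow(self):
        current = self.now()
        self.tokens = min(self.capacity, self.tokens + (current - self.last) * self.rate)
        self.last = current
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

clock = [0.0]  # injectable clock so the example is deterministic
bucket = TokenBucket("100/min", now=lambda: clock[0])
print(sum(bucket.allow() for _ in range(150)))  # → 100: the burst beyond capacity is throttled
clock[0] += 30.0  # 30 seconds later, ~50 tokens have refilled
print(bucket.allow())  # → True
```

A production limiter would key buckets per client (e.g. per JWT subject), but the accounting is the same.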
Each component is independently useful. Together they cover the full lifecycle.
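`drift_detection: true` in the ml.serving block implies comparing live feature distributions against the training baseline. One common, minimal approach (not necessarily what DataEngineX uses) is a z-test on the live feature mean:

```python
import math
import statistics

def mean_drifted(baseline, live, z_threshold=3.0):
    """Flag drift when the live mean is far from the baseline mean.

    Computes a z-score of the live mean under the baseline distribution;
    z_threshold=3 is a conventional cutoff, not a DataEngineX default.
    """
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    se = sigma / math.sqrt(len(live))  # standard error of the live mean
    z = abs(statistics.mean(live) - mu) / se
    return z > z_threshold

baseline = [10, 11, 9, 10, 12, 10, 11, 9]   # training-time feature values
print(mean_drifted(baseline, [10, 11, 10, 9]))   # → False: same distribution
print(mean_drifted(baseline, [25, 27, 26, 24]))  # → True: mean has shifted
```

Mean-shift tests miss distribution changes that keep the mean fixed; heavier tools use PSI or KS tests per feature, but the wiring into serving is the same idea.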
pip install dataenginex
Core framework — config system, backend registry, CLI, API server, ML lifecycle, AI agents. The engine everything runs on.
Port 7860
Web UI — single pane of glass built on NiceGUI. Monitor pipelines, browse data, inspect ML experiments, and chat with AI agents.
Terraform + Helm
K3s cluster config, Helm charts, and Terraform modules. From blank VPS to production-grade Kubernetes cluster — no manual YAML.
Install the base package or pick the extras you need.
pip install dataenginex
# or
uv add dataenginex
pip install "dataenginex[spark]" # PySpark transforms
pip install "dataenginex[mlflow]" # MLflow backend
pip install "dataenginex[agents]" # LangGraph agents
pip install "dataenginex[all]" # Everything