Why DataEngineX

Why one config file beats stitching N tools together

The Modern Data Stack Is Broken

A typical production data + ML + AI setup looks like this:

| Concern             | Tool you add                           |
| ------------------- | -------------------------------------- |
| Orchestration       | Airflow or Prefect                     |
| Experiment tracking | MLflow or W&B                          |
| AI / LLM agents     | LangChain or LlamaIndex                |
| Data serving        | FastAPI (custom)                       |
| Observability       | Prometheus + Grafana + custom logging  |
| Deployment          | Helm + Terraform + custom CI           |

You are not building a product. You are building glue.

Every new tool is another configuration format, another auth system, another failure mode, another on-call page.


One File, Entire Stack

DataEngineX is the opposite approach. Define once, run anywhere:

# dex.yaml
data:
  source: s3://my-bucket/raw/
  format: parquet
  quality:
    null_threshold: 0.05

ml:
  backend: mlflow          # or built-in — swap without code change
  training:
    model: xgboost
    target: revenue

ai:
  provider: openai
  retrieval: hybrid        # BM25 + dense — built in
  agents:
    - name: analyst
      tools: [sql, search]

server:
  auth: jwt
  rate_limit: 100/min

observability:
  metrics: prometheus
  tracing: otel

One dex serve command starts everything. No glue code.
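The quality block in the config above implies config-driven validation. As a rough sketch of what enforcing a config like this could look like (the function names and rules here are illustrative assumptions, not the actual DataEngineX API; a real loader would read the file with a YAML parser such as PyYAML):

```python
# Illustrative only: validating an already-parsed dex.yaml-style config dict.
# Section names mirror the example config; the validation rules are assumptions.
REQUIRED_SECTIONS = {"data", "ml", "ai", "server", "observability"}

def validate_config(cfg: dict) -> dict:
    """Check that all top-level sections exist and values are sane."""
    missing = REQUIRED_SECTIONS - cfg.keys()
    if missing:
        raise ValueError(f"missing sections: {sorted(missing)}")
    # A data-quality null threshold only makes sense as a fraction in [0, 1].
    threshold = cfg["data"]["quality"]["null_threshold"]
    if not 0 <= threshold <= 1:
        raise ValueError("null_threshold must be a fraction in [0, 1]")
    return cfg
```

Failing fast at load time, rather than at 3 a.m. in production, is the point of putting quality thresholds in the config.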


Swappable Backends

Opinionated defaults, zero lock-in. Every layer is swappable:

pip install "dataenginex[dagster]"    # swap orchestration
pip install "dataenginex[mlflow]"     # swap experiment tracking
pip install "dataenginex[agents]"     # LangGraph agent runtime
pip install "dataenginex[spark]"      # PySpark transforms

The config stays the same. The backend changes.
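One common way "same config, different backend" is implemented is a registry that maps the string in the config to an implementation class, with optional extras registering additional entries at install time. A minimal sketch of that pattern (the class and registry names are hypothetical, not DataEngineX internals):

```python
# Illustrative only: a backend registry keyed by the config's backend string.
# Installing an extra (e.g. dataenginex[mlflow]) would add another entry,
# typically via Python entry points, without touching user code or config.
class BuiltInTracker:
    """Stand-in for a default experiment tracker."""
    def log_metric(self, name: str, value: float) -> None:
        print(f"[built-in] {name}={value}")

TRACKER_BACKENDS = {"builtin": BuiltInTracker}

def get_tracker(name: str):
    """Resolve the 'ml.backend' config value to a tracker instance."""
    try:
        return TRACKER_BACKENDS[name]()
    except KeyError:
        raise ValueError(
            f"unknown ml backend {name!r}; is the matching extra installed?"
        )
```

Because resolution happens by name at startup, swapping `backend: mlflow` for the built-in tracker is a one-line config change, not a code change.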


Self-Hosted

Your data never leaves your infrastructure. No SaaS subscription. No vendor lock-in. Run on a VPS, K3s cluster, or bare metal.

Get started →