Flink in-cluster sync for Onyx

This guide covers the in-cluster Apache Flink topology shipped in the ClawQL Helm chart for real-time Onyx connector sync. It is the deployment path tracked in #119: deploy Flink next to ClawQL and Onyx, keep networking internal by default, and treat new data sources as connector-job configuration rather than ClawQL server code changes.

What this feature is (and is not)

  • Is: optional Helm-managed Flink control plane + workers (flink.enabled=true) in the same namespace as ClawQL.
  • Is: an execution surface for connector jobs that keep the Onyx index fresh with low sync latency.
  • Is: config-as-code in values.yaml for image, ports, replicas, task slots, resources, and connector env wiring.
  • Is not: a ClawQL MCP tool; this does not change search / execute semantics directly.
  • Is not: public-by-default; the service defaults to ClusterIP, with no ingress generated by the chart.

Topology and traffic model

At a high level:

  1. Connector jobs run in Flink TaskManagers.
  2. Jobs pull from source systems (Slack, Confluence, Drive, Jira, GitHub, email, etc.).
  3. Jobs transform and forward records into Onyx indexing/search APIs.
  4. ClawQL’s optional knowledge_search_onyx MCP tool reads from that continuously refreshed index.

Recommended deployment boundaries:

  • Namespace: same namespace as ClawQL for simpler service discovery and shared release lifecycle.
  • Service exposure: keep Flink internal (ClusterIP) unless you have an explicit operator workflow requiring UI exposure.
  • Secrets: place connector credentials in a dedicated secret wired only to Flink (flink.connectorSecret) so they are not inherited by clawql-mcp.
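If your cluster runs a NetworkPolicy-capable CNI, you can enforce the internal-only boundary at the network layer rather than relying on Service type alone. A minimal sketch, assuming the JobManager pods carry the app.kubernetes.io/component=flink-jobmanager label used elsewhere in this guide (verify with kubectl get pods --show-labels):

```yaml
# Restrict ingress to the Flink JobManager to pods in the same namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: flink-internal-only
  namespace: clawql
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/component: flink-jobmanager
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}   # any pod in this namespace; no cross-namespace or external ingress
      ports:
        - port: 8081        # REST/UI
        - port: 6123        # RPC
        - port: 6124        # blob server
        - port: 6125        # query service
```

A similar policy scoped to the TaskManager component label can tighten worker traffic as well; test policies in a non-production namespace first, since an over-broad deny rule can break TaskManager registration.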

The chart exposes these keys in charts/clawql-mcp/values.yaml:

  • flink.enabled: enables all Flink resources in the release.
  • flink.image.repository, flink.image.tag, flink.image.pullPolicy: container image settings for JobManager and TaskManager pods.
  • flink.service.type: Service type for the JobManager endpoint (ClusterIP recommended as the production default).
  • flink.service.restPort: Flink REST/UI port (default 8081).
  • flink.service.rpcPort: RPC port used by Flink internals (default 6123).
  • flink.service.blobPort: blob server port (default 6124).
  • flink.service.queryPort: query service port (default 6125).
  • flink.jobManager.replicas: JobManager replica count (usually 1 unless you run HA patterns externally).
  • flink.jobManager.resources: CPU/memory requests and limits for the JobManager.
  • flink.taskManager.replicas: number of TaskManager pods (horizontal worker scale).
  • flink.taskManager.taskSlots: slots per TaskManager process (an input to parallelism capacity planning).
  • flink.taskManager.resources: CPU/memory requests and limits for TaskManagers.
  • flink.connectorSecret: name of an existing Kubernetes Secret loaded into Flink pods via envFrom.
  • flink.extraEnv: additional env entries for both JobManager and TaskManager containers.

Quick start (minimal)

Create a dedicated connector secret:

kubectl -n clawql create secret generic onyx-connector-env \
  --from-literal=ONYX_API_TOKEN='replace-me'

Install/upgrade with Flink enabled:

helm upgrade --install clawql ./charts/clawql-mcp \
  --namespace clawql \
  --create-namespace \
  --set flink.enabled=true \
  --set flink.connectorSecret=onyx-connector-env \
  --wait

Verify resources:

kubectl -n clawql get deploy,svc | grep -E 'flink|clawql-mcp-http'
kubectl -n clawql get pods -l app.kubernetes.io/component=flink-jobmanager
kubectl -n clawql get pods -l app.kubernetes.io/component=flink-taskmanager

Production-oriented values example

Use a values override file (recommended over long --set chains):

flink:
  enabled: true
  image:
    repository: flink
    tag: '1.19.1-scala_2.12-java17'
    pullPolicy: IfNotPresent
  service:
    type: ClusterIP
    restPort: 8081
    rpcPort: 6123
    blobPort: 6124
    queryPort: 6125
  jobManager:
    replicas: 1
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: '2'
        memory: 2Gi
  taskManager:
    replicas: 4
    taskSlots: 4
    resources:
      requests:
        cpu: '1'
        memory: 2Gi
      limits:
        cpu: '4'
        memory: 4Gi
  connectorSecret: onyx-connector-env
  extraEnv:
    - name: ONYX_BASE_URL
      value: 'http://onyx-api.onyx.svc.cluster.local:8080/api'

Apply:

helm upgrade --install clawql ./charts/clawql-mcp \
  -n clawql \
  -f values.prod.yaml \
  --wait

Security and isolation guide

1) Keep Flink private by default

  • The default flink.service.type=ClusterIP means no public cloud IP is allocated for Flink.
  • Do not add an ingress for Flink UI unless you also add authn/authz and network restrictions.
  • For short-lived diagnostics, prefer kubectl port-forward over permanent exposure.

Temporary local access example:

kubectl -n clawql port-forward svc/clawql-mcp-http-flink-jobmanager 18081:8081

2) Scope credentials to Flink only

  • Put connector/API tokens into one secret dedicated to connector runtime.
  • Reference it with flink.connectorSecret.
  • Avoid storing connector credentials in extraEnv at the chart root, because those env vars target clawql-mcp.

3) Separate ClawQL and connector permissions

  • Grant clawql-mcp only the permissions needed for MCP-facing API operations.
  • Grant Flink connector jobs only the permissions needed to read source systems and write to Onyx.
  • Rotate connector credentials independently from ClawQL application credentials.

Operations runbook

Health checks

kubectl -n clawql get deploy clawql-mcp-http-flink-jobmanager
kubectl -n clawql get deploy clawql-mcp-http-flink-taskmanager
kubectl -n clawql describe deploy clawql-mcp-http-flink-jobmanager
kubectl -n clawql describe deploy clawql-mcp-http-flink-taskmanager

Logs

kubectl -n clawql logs deploy/clawql-mcp-http-flink-jobmanager --tail=200
kubectl -n clawql logs deploy/clawql-mcp-http-flink-taskmanager --tail=200

Scale workers

kubectl -n clawql scale deploy/clawql-mcp-http-flink-taskmanager --replicas=6

Note that kubectl scale is transient: the next helm upgrade resets the replica count to the chart value. For durable config-as-code scaling, update flink.taskManager.replicas and re-run helm upgrade.

Controlled rollouts

Use staged changes:

  1. Increase TaskManager replicas first.
  2. Observe queue lag / indexing freshness.
  3. Increase taskSlots only after validating CPU/memory headroom.
  4. Roll image or config updates via Helm so desired state is explicit and reproducible.

Capacity planning notes

Start with:

  • parallelism ~= taskManager.replicas * taskSlots
  • Resource requests sized for peak source pull and transform bursts.
  • Conservative memory limits to avoid OOM kills during large backfills.
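The first bullet is simple arithmetic. With the production values example above (4 TaskManager replicas, 4 slots each):

```shell
# Parallelism capacity implied by the production values example.
replicas=4     # flink.taskManager.replicas
task_slots=4   # flink.taskManager.taskSlots
echo "total task slots: $((replicas * task_slots))"
# prints: total task slots: 16
```

A job's default parallelism cannot exceed this slot count; raise replicas (more pods) before taskSlots (denser pods) unless you have verified per-pod headroom.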

Then iterate based on:

  • Connector source throughput.
  • Onyx indexing throughput and backpressure.
  • Sync freshness SLO targets (for example, median and p95 lag).

Troubleshooting

TaskManagers not joining JobManager

Check:

  • Service DNS/name resolution to clawql-mcp-http-flink-jobmanager.
  • Port mismatches (rpcPort, blobPort, queryPort).
  • Image compatibility between JobManager and TaskManager.
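A quick in-cluster DNS check using a throwaway busybox pod (the service name below matches the examples in this guide; adjust it if your release name differs):

```shell
# Resolves the JobManager service name from inside the cluster,
# then removes the pod when the command exits.
kubectl -n clawql run dns-check --rm -it --restart=Never \
  --image=busybox:1.36 -- \
  nslookup clawql-mcp-http-flink-jobmanager
```

If resolution fails here, TaskManagers cannot register regardless of port configuration.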

Connectors fail auth

Check:

  • Secret exists in the same namespace.
  • Secret name matches flink.connectorSecret.
  • Key names match what connector runtime expects.
  • Token scopes/roles on source APIs and Onyx write path.
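To confirm the secret exists and carries the expected key names without printing decoded values (using the onyx-connector-env name from the quick start):

```shell
# Prints the secret's data map; values are base64-encoded, so only the
# key names (e.g. ONYX_API_TOKEN) are readable at a glance.
kubectl -n clawql get secret onyx-connector-env -o jsonpath='{.data}'
```

A missing key here, or a key name that differs from what the connector runtime reads, produces auth failures even when the token itself is valid.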

Onyx index freshness does not improve

Check:

  • Flink job status and processing lag.
  • Connector-side retry/error rates.
  • Onyx API health and ingestion acceptance.
  • knowledge_search_onyx responses after known document updates.
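Job state and timing can be read directly from Flink's standard REST API over a temporary port-forward, without exposing the UI:

```shell
# Terminal 1: tunnel to the JobManager REST port.
kubectl -n clawql port-forward svc/clawql-mcp-http-flink-jobmanager 18081:8081

# Terminal 2: list jobs with their state and start/end timestamps.
curl -s http://localhost:18081/jobs/overview
```

A job stuck in RESTARTING, or one whose timestamps show no recent progress, points at the connector job itself rather than Onyx ingestion.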
