# Flink in-cluster sync for Onyx
This guide covers the in-cluster Apache Flink topology shipped in the ClawQL Helm chart for real-time Onyx connector sync. It is the deployment path tracked in #119: deploy Flink next to ClawQL and Onyx, keep networking internal by default, and treat new data sources as connector-job configuration rather than ClawQL server code changes.
## What this feature is (and is not)

- Is: an optional Helm-managed Flink control plane + workers (`flink.enabled=true`) in the same namespace as ClawQL.
- Is: an execution surface for connector jobs that keep the Onyx index fresh with low sync latency.
- Is: config-as-code in `values.yaml` for image, ports, replicas, task slots, resources, and connector env wiring.
- Is not: a ClawQL MCP tool; it does not change `search/execute` semantics directly.
- Is not: public by default; the service defaults to `ClusterIP`, with no ingress generated by the chart.
## Topology and traffic model
At a high level:
- Connector jobs run in Flink TaskManagers.
- Jobs pull from source systems (Slack, Confluence, Drive, Jira, GitHub, email, etc.).
- Jobs transform and forward records into Onyx indexing/search APIs.
- ClawQL’s optional `knowledge_search_onyx` MCP tool reads from that continuously refreshed index.
Recommended deployment boundaries:
- Namespace: same namespace as ClawQL for simpler service discovery and shared release lifecycle.
- Service exposure: keep Flink internal (`ClusterIP`) unless you have an explicit operator workflow that requires UI exposure.
- Secrets: place connector credentials in a dedicated secret wired only to Flink (`flink.connectorSecret`) so they are not inherited by `clawql-mcp`.
## Helm values reference (`flink.*`)

The chart exposes these keys in `charts/clawql-mcp/values.yaml`:
| Key | Purpose |
|---|---|
| `flink.enabled` | Enables all Flink resources in the release. |
| `flink.image.repository`, `flink.image.tag`, `flink.image.pullPolicy` | Container image settings for the JobManager and TaskManager pods. |
| `flink.service.type` | Service type for the JobManager endpoint (`ClusterIP` is the recommended production default). |
| `flink.service.restPort` | Flink REST/UI port (default 8081). |
| `flink.service.rpcPort` | RPC port used by Flink internals (default 6123). |
| `flink.service.blobPort` | Blob server port (default 6124). |
| `flink.service.queryPort` | Query service port (default 6125). |
| `flink.jobManager.replicas` | JobManager replica count (usually 1 unless running HA patterns externally). |
| `flink.jobManager.resources` | CPU/memory requests and limits for the JobManager. |
| `flink.taskManager.replicas` | Number of TaskManager pods (horizontal worker scale). |
| `flink.taskManager.taskSlots` | Slots per TaskManager process (input for parallelism capacity planning). |
| `flink.taskManager.resources` | CPU/memory requests and limits for TaskManagers. |
| `flink.connectorSecret` | Name of an existing Kubernetes Secret loaded into Flink pods via `envFrom`. |
| `flink.extraEnv` | Additional env entries for both JobManager and TaskManager containers. |
## Quick start (minimal)
Create a dedicated connector secret:
```bash
kubectl -n clawql create secret generic onyx-connector-env \
  --from-literal=ONYX_API_TOKEN='replace-me'
```
Install/upgrade with Flink enabled:
```bash
helm upgrade --install clawql ./charts/clawql-mcp \
  --namespace clawql \
  --create-namespace \
  --set flink.enabled=true \
  --set flink.connectorSecret=onyx-connector-env \
  --wait
```
Verify resources:
```bash
kubectl -n clawql get deploy,svc | rg 'flink|clawql-mcp-http'
kubectl -n clawql get pods -l app.kubernetes.io/component=flink-jobmanager
kubectl -n clawql get pods -l app.kubernetes.io/component=flink-taskmanager
```
## Production-oriented values example

Use a values override file (for example `values.prod.yaml`) rather than long `--set` chains:
```yaml
flink:
  enabled: true
  image:
    repository: flink
    tag: '1.19.1-scala_2.12-java17'
    pullPolicy: IfNotPresent
  service:
    type: ClusterIP
    restPort: 8081
    rpcPort: 6123
    blobPort: 6124
    queryPort: 6125
  jobManager:
    replicas: 1
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: '2'
        memory: 2Gi
  taskManager:
    replicas: 4
    taskSlots: 4
    resources:
      requests:
        cpu: '1'
        memory: 2Gi
      limits:
        cpu: '4'
        memory: 4Gi
  connectorSecret: onyx-connector-env
  extraEnv:
    - name: ONYX_BASE_URL
      value: 'http://onyx-api.onyx.svc.cluster.local:8080/api'
```
Apply:
```bash
helm upgrade --install clawql ./charts/clawql-mcp \
  -n clawql \
  -f values.prod.yaml \
  --wait
```
## Security and isolation guide

### 1) Keep Flink private by default

- The default `flink.service.type=ClusterIP` means no public cloud IP is allocated.
- Do not add an ingress for the Flink UI unless you also add authn/authz and network restrictions.
- For short-lived diagnostics, prefer `kubectl port-forward` over permanent exposure.
Temporary local access example:
```bash
kubectl -n clawql port-forward svc/clawql-mcp-http-flink-jobmanager 18081:8081
```
### 2) Scope credentials to Flink only
- Put connector/API tokens into one secret dedicated to connector runtime.
- Reference it with `flink.connectorSecret`.
- Avoid storing connector credentials in `extraEnv` at the chart root, because those env vars target `clawql-mcp`.
### 3) Separate ClawQL and connector permissions

- Grant `clawql-mcp` only the permissions needed for MCP-facing API operations.
- Grant Flink connector jobs only the permissions needed to read source systems and write to Onyx.
- Rotate connector credentials independently from ClawQL application credentials.
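A minimal rotation sketch, reusing the secret and deployment names from the examples in this guide; values loaded via `envFrom` are fixed at container start, so the pods need a restart to pick up the rotated token:

```bash
# Recreate the connector secret in place with the rotated token.
kubectl -n clawql create secret generic onyx-connector-env \
  --from-literal=ONYX_API_TOKEN='new-token' \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart Flink pods so envFrom re-reads the rotated secret.
kubectl -n clawql rollout restart \
  deploy/clawql-mcp-http-flink-jobmanager \
  deploy/clawql-mcp-http-flink-taskmanager
```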
## Operations runbook

### Health checks
```bash
kubectl -n clawql get deploy clawql-mcp-http-flink-jobmanager
kubectl -n clawql get deploy clawql-mcp-http-flink-taskmanager
kubectl -n clawql describe deploy clawql-mcp-http-flink-jobmanager
kubectl -n clawql describe deploy clawql-mcp-http-flink-taskmanager
```
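To block until both rollouts have converged (useful right after a `helm upgrade`):

```bash
kubectl -n clawql rollout status deploy/clawql-mcp-http-flink-jobmanager
kubectl -n clawql rollout status deploy/clawql-mcp-http-flink-taskmanager
```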
### Logs
```bash
kubectl -n clawql logs deploy/clawql-mcp-http-flink-jobmanager --tail=200
kubectl -n clawql logs deploy/clawql-mcp-http-flink-taskmanager --tail=200
```
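Note that `deploy/...` selects a single pod; to follow logs live or to read logs from a container that crashed (for example after an OOM kill), target a specific pod:

```bash
# Stream TaskManager logs.
kubectl -n clawql logs deploy/clawql-mcp-http-flink-taskmanager -f

# Logs from the previous container instance of a specific pod (after a crash/restart).
kubectl -n clawql logs <taskmanager-pod-name> --previous --tail=200
```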
### Scale workers
```bash
kubectl -n clawql scale deploy/clawql-mcp-http-flink-taskmanager --replicas=6
```
For durable config-as-code scaling, update `flink.taskManager.replicas` and re-run `helm upgrade`.
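For example, reusing the `values.prod.yaml` override from above (`--set` here is a shortcut; committing the change to the values file is the durable form):

```bash
helm upgrade --install clawql ./charts/clawql-mcp \
  -n clawql \
  -f values.prod.yaml \
  --set flink.taskManager.replicas=6 \
  --wait
```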
### Controlled rollouts
Use staged changes:
- Increase TaskManager replicas first.
- Observe queue lag / indexing freshness.
- Increase `taskSlots` only after validating CPU/memory headroom.
- Roll image or config updates via Helm so desired state is explicit and reproducible.
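One way to keep a rollout staged is to render the change first and apply only once the output looks right; this sketch assumes the image or config change has already been made in `values.prod.yaml`:

```bash
# Render the upgrade without applying it, to review the changed manifests.
helm upgrade clawql ./charts/clawql-mcp -n clawql -f values.prod.yaml --dry-run

# Apply once the rendered output matches expectations.
helm upgrade clawql ./charts/clawql-mcp -n clawql -f values.prod.yaml --wait
```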
## Capacity planning notes
Start with:
- `parallelism ~= taskManager.replicas * taskSlots` (the production example above yields 4 * 4 = 16 parallel slots).
- Resource requests sized for peak source pull and transform bursts.
- Conservative memory limits to avoid OOM kills during large backfills.
Then iterate based on:
- Connector source throughput.
- Onyx indexing throughput and backpressure.
- Sync freshness SLO targets (for example, median and p95 lag).
## Troubleshooting

### TaskManagers not joining the JobManager
Check:
- Service DNS/name resolution to `clawql-mcp-http-flink-jobmanager` (see the commands below).
- Port mismatches (`rpcPort`, `blobPort`, `queryPort`).
- Image compatibility between the JobManager and TaskManager.
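A quick check of the first two items; the `getent` call assumes the default Flink image, which ships a shell and glibc tools:

```bash
# Confirm the JobManager Service exists and exposes the expected ports.
kubectl -n clawql get svc clawql-mcp-http-flink-jobmanager -o wide

# Resolve the JobManager Service name from inside a TaskManager pod.
kubectl -n clawql exec deploy/clawql-mcp-http-flink-taskmanager -- \
  getent hosts clawql-mcp-http-flink-jobmanager
```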
### Connectors fail auth
Check:
- Secret exists in the same namespace.
- Secret name matches `flink.connectorSecret`.
- Key names match what the connector runtime expects (see the commands below).
- Token scopes/roles on the source APIs and the Onyx write path.
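To list the secret's key names without printing values, and to confirm the env vars actually reached a worker pod (the `ONYX` pattern matches the example key above; `env` output includes live token values, so treat it as sensitive):

```bash
# Show the secret's keys and sizes (values are not displayed).
kubectl -n clawql describe secret onyx-connector-env

# Confirm the expected env vars are present in a TaskManager container.
kubectl -n clawql exec deploy/clawql-mcp-http-flink-taskmanager -- env | rg 'ONYX'
```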
### Onyx index freshness does not improve
Check:
- Flink job status and processing lag.
- Connector-side retry/error rates.
- Onyx API health and ingestion acceptance.
- `knowledge_search_onyx` responses after known document updates.
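To check job status from the Flink side, a minimal sketch using temporary port-forwarding and the Flink REST API (`/jobs/overview` lists job names and states):

```bash
# Forward the JobManager REST port locally; stop the port-forward when finished.
kubectl -n clawql port-forward svc/clawql-mcp-http-flink-jobmanager 18081:8081 &

# List jobs and their current state (RUNNING, RESTARTING, FAILED, ...).
curl -s http://localhost:18081/jobs/overview
```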
## Relationship to other ClawQL docs
- Helm install and values: Helm
- Kubernetes deployment baseline: Kubernetes
- Onyx MCP retrieval tool: Onyx knowledge search
- Tool inventory and feature flags: Tools
- Canonical repo docs: `docs/helm.md`, `docs/deploy-k8s.md`, issue #119
