# Using `ingest_external_knowledge`
`ingest_external_knowledge` imports external knowledge into your Obsidian vault so it participates in the same `memory.db` sync, `_INDEX_`, and `memory_recall` story as `memory_ingest`. Today it supports bulk Markdown (`documents[]`) and an optional single-URL fetch (`source: "url"`). Automatic whole-repo and SaaS workspace pulls are roadmap work on top of the same pipeline — see Knowledge lake roadmap and what is next.
Canonical reference: external-ingest.md. Tool table: Tools · mcp-tools.md. Agent skill (dry-run first): clawql-external-ingest.
## What the tool does today
| Mode | What it does |
|---|---|
| `documents[]` | Write up to 50 vault-relative `.md` files in one call (~2 MiB UTF-8 per body). |
| `source: "url"` + `url` | `fetch()` one HTTPS URL (or `http://localhost` / `127.0.0.1` for tests), normalize the body to Markdown, write one note. Requires `CLAWQL_EXTERNAL_INGEST_FETCH=1`. |
| No payload | Returns `stub: true` roadmap JSON (`roadmap[]`, `relatedIssues`) — useful to probe behavior without a vault. |
Not the same as web search: URL mode archives raw bytes from one URL into the vault (external-ingest.md compares this to search snippets).
## Before you start
- Documents tools enabled — `ingest_external_knowledge` is registered with the document stack by default. `CLAWQL_ENABLE_DOCUMENTS=0` removes it (and related document vendors from the default merge). See Concepts and configuration.
- External ingest flag — set `CLAWQL_EXTERNAL_INGEST=1` (exactly `1`) for non-stub writes on bulk / URL paths (see the config sketch after this list).
- Writable vault — set `CLAWQL_OBSIDIAN_VAULT_PATH` to a real vault for imports; without it, no-payload calls still return roadmap JSON.
- URL fetch — opt in with `CLAWQL_EXTERNAL_INGEST_FETCH=1` when you need `source: "url"`.
- Dry-run discipline — `dryRun` defaults to `true` for `documents[]`; validate with `dryRun: true`, then set `dryRun: false` to write (matches the clawql-external-ingest skill).
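If your MCP client configures servers through a JSON file with an `env` map, the flags above can all be set in one place. A minimal sketch, assuming a client that follows the common `mcpServers` convention; the server name, `command`, and vault path are hypothetical placeholders, and only the variable names and values come from this page:

```json
{
  "mcpServers": {
    "clawql": {
      "command": "clawql-mcp",
      "env": {
        "CLAWQL_OBSIDIAN_VAULT_PATH": "/home/me/vaults/main",
        "CLAWQL_EXTERNAL_INGEST": "1",
        "CLAWQL_EXTERNAL_INGEST_FETCH": "1"
      }
    }
  }
}
```

Leave `CLAWQL_EXTERNAL_INGEST_FETCH` unset until you actually need `source: "url"`; fetch stays off by default.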
## Bulk Markdown documents mode
Shape: `documents: [{ "path": "Memory/imports/note.md", "markdown": "…" }, …]`
- Paths are vault-relative, must end with `.md`, no `..` (same rules as `memory_ingest`).
- Up to 50 files per call; invalid entries surface in `documentErrors`; valid paths can still import.
- Start with `dryRun: true` to validate paths and size before committing, as in the example below.
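Putting those rules together, a first dry-run call might carry arguments like the following; the field names (`documents`, `path`, `markdown`, `dryRun`) come from this page, while the paths and bodies are illustrative:

```json
{
  "dryRun": true,
  "documents": [
    {
      "path": "Memory/imports/deploy-runbook.md",
      "markdown": "# Deploy runbook\n\nSteps extracted from the team wiki."
    },
    {
      "path": "Memory/imports/oncall-faq.md",
      "markdown": "# On-call FAQ\n\nAnswers collected from past incidents."
    }
  ]
}
```

If `documentErrors` comes back empty, repeat the same call with `dryRun: false` to write the notes.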
## URL fetch mode
Shape: `source: "url"`, `url: "https://…"` (or allowed localhost `http`), optional `scope` for the target `.md` path (default under `Memory/external/`). A worked call follows the notes below.
- JSON responses are pretty-printed in a fenced `json` block; HTML is converted with node-html-markdown; other bodies go under `## Raw text`.
- Response size is capped at 2 MiB and requests time out after 60 s. Many public HTML pages return 403 interstitials (for example heavy bot protection) — use a stable API URL when you can (see limitations in external-ingest.md).
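A minimal URL-mode call, assuming `CLAWQL_EXTERNAL_INGEST_FETCH=1` is set; the endpoint is a placeholder for a stable API URL, and the `scope` value is one illustrative way to pick a target path under the default `Memory/external/` layout:

```json
{
  "source": "url",
  "url": "https://api.example.com/v1/changelog",
  "scope": "Memory/external/example-changelog.md"
}
```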
## Roadmap preview calls
Call the tool without `documents` and without `url`: you get `stub: true`, `roadmap[]`, and `relatedIssues` — no vault required. With a vault and `memory.db`, responses may also include optional `merkleSnapshot` / `cuckooMembershipReady` when the sidecar is warm. Use this to confirm the tool is registered and to read the issue-linked roadmap pointers before enabling real ingest.
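Probing is as simple as calling the tool with an empty argument object. A sketch of the stub response, using the field names above with illustrative values (a vaultless call returns only the stub fields; `merkleSnapshot` and `cuckooMembershipReady` may be added when a vault and warm sidecar are present):

```json
{
  "stub": true,
  "roadmap": [
    "source: \"github\" (G1: default-branch code & docs)"
  ],
  "relatedIssues": ["#40", "#24"]
}
```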
## Pipeline memory sync and recall
After a successful write (not dry-run): vault write lock → `syncMemoryDbForVaultScanRoot` → `updateProviderIndexPage` — the same path as `memory_ingest` (external-ingest.md).
Then use `memory_recall` (and graph depth / hybrid options as needed) to query what you imported. For handoff patterns between chats, see Vault memory between chats.
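For example, after importing the runbook note from the bulk sketch above, recall can start as small as this; `query` is a conservative guess at the baseline parameter name, and the exact graph-depth / hybrid option names belong to `memory_recall`'s own reference:

```json
{
  "query": "deploy runbook steps"
}
```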
## Security and limits
- No secrets in chat — keep tokens in env / your secret manager; the ingest module avoids logging full URLs when tool logging is enabled.
- Fetch is off by default — only `CLAWQL_EXTERNAL_INGEST_FETCH=1` enables network `fetch`.
- Tenant safety — one MCP process, one vault root; separate deployments per customer for isolation (same theme as the roadmap).
## Knowledge lake roadmap and what is next
The checked-in knowledge-lake-roadmap.md is the product direction for turning `ingest_external_knowledge` into first-class connectors that normalize SaaS and repo data into Markdown notes + `memory.db` (and optional vectors), with incremental sync and stable frontmatter ids.
Planned sources (summary):
| Source | Direction |
|---|---|
| GitHub repositories | First priority — phased: G1 default-branch code & docs (README*, docs/**, *.md, selected configs via Trees/Contents API); G2 issues & PRs; G3 richer surfaces (releases, wiki, discussions where APIs fit). API-first (optionally clone + API for issues) so metadata and incremental cursors stay tractable. |
| Notion | Pages, databases, blocks → Markdown-like notes; integration token + shared content. |
| Confluence | Spaces and pages → Markdown (HTML or storage format through existing HTML→MD paths); Atlassian auth. |
| Slack workspaces | Conversations, files, and export-friendly surfaces → Markdown under a predictable External/slack/… layout; least-privilege tokens; distinct from outbound Slack notify. |
| Linear / Jira | Issues as Markdown notes with JQL / GraphQL-driven scope (see roadmap for paths). |
Cross-cutting goals from the same doc: completeness (authorized content only), queryable recall (keywords + graph + optional `CLAWQL_VECTOR_BACKEND`), incremental sync (ETag / since / updated_at), and tenant-safe tokens.
Tracking: umbrella #40, hybrid memory epic #24. Next concrete step in the roadmap: `source: "github"` (or a dedicated `ingest_github_repo`) behind the external-ingest flag family, implementing G1 with `dryRun` and documented `GITHUB_TOKEN`.
Until those connectors ship, use today’s modes: bulk Markdown from your own extractors, or URL fetch for stable API endpoints — then `memory_recall` over the vault.
## Related guides and references
- Using search and execute — reach provider APIs when ingest is not the right primitive.
- Onyx knowledge search — enterprise search over an existing Onyx corpus (complementary to vault file ingest).
- OpenClaw + ClawQL — bootstrap flows that mention `ingest_external_knowledge`.
- Repo recipes: memory-and-knowledge-recipes.md.
