Security Monitoring and Observability Architecture: Falco, Wazuh, SIEM Integration, and Telemetry Design

Hello and welcome to Module 19!

Modules 1–18 have given us trusted images, admission control, vetted skills, zero-trust networking, a hardened gateway, egress controls, scoped identities, dynamic secrets, per-request authentication, agent lifecycle, sandboxing, runtime enforcement, input validation, multi-agent trust, data classification, model weights, GPU isolation, and immutable memory. Now we make sure we can actually see what is happening — in real time, with high signal and low noise.

Security observability is not just “monitoring.” Poorly designed telemetry creates three failure modes that are as dangerous as missing controls: alert fatigue, coverage gaps, and telemetry itself becoming an exfiltration channel. In this module we design a complete observability stack that turns every component into a reliable security sensor. By the end you will have a canonical event schema, smart correlation rules, cardinality-safe metrics, tail-based tracing, and a NOC dashboard that tells you exactly what matters.

The Three Failure Modes of Security Observability

Most teams deploy tools and hope for the best. That produces noise instead of signal. The three classic failures are:

Alert fatigue: Too many low-quality alerts train operators to ignore everything — including the real attack.
Coverage gaps: The right events are never emitted, or the context needed for investigation is missing.
Telemetry as exfiltration channel: Sensitive data leaks through metric labels or logs that have weaker access controls than the WORM audit trail.

We solve all three by design, not by accident.

Canonical Security Event Schema

Every security-relevant event across the entire platform conforms to one structured schema. This is the prerequisite for reliable SIEM correlation.

Required fields:

schemaVersion
eventId (unique UUID)
timestamp (UTC, nanosecond precision)
source (component + version + cluster + region)
principal (agent ID, session ID, tenant ID)
event (type, subtype, outcome, severity)
detail (tool, rule ID, claims, etc.)
traceContext (W3C trace + span IDs)
payloadHash (SHA-256 hash of the payload, never the payload itself; a correlation key without data exposure)

Seven event types (used for routing and correlation):

AUTH
TOOL_CALL
MEMORY
SKILL
AGENT
NETWORK
POLICY

Schema version is pinned in SIEM correlation rules. Any schema drift is detected and blocked before deployment.
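
To make the schema concrete, here is a minimal Python sketch of an event record with the required fields listed above. The version string "1.0.0", the helper name payload_hash, and the shape of the nested dictionaries are assumptions made for illustration, not part of the platform's actual schema definition.

```python
import hashlib
import json
import time
import uuid
from dataclasses import asdict, dataclass, field

SCHEMA_VERSION = "1.0.0"  # assumed version string, for illustration only
EVENT_TYPES = {"AUTH", "TOOL_CALL", "MEMORY", "SKILL", "AGENT", "NETWORK", "POLICY"}


def payload_hash(payload: bytes) -> str:
    """Return the SHA-256 hex digest so the payload itself never enters the event."""
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class SecurityEvent:
    source: dict        # component, version, cluster, region
    principal: dict     # agent ID, session ID, tenant ID
    event: dict         # type, subtype, outcome, severity
    detail: dict        # tool, rule ID, claims, etc.
    traceContext: dict  # W3C trace and span IDs
    payloadHash: str
    schemaVersion: str = SCHEMA_VERSION
    eventId: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: int = field(default_factory=time.time_ns)  # ns since the Unix epoch (UTC)

    def __post_init__(self):
        # Reject events that would break routing or leak a raw payload.
        if self.event.get("type") not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.event.get('type')!r}")
        if len(self.payloadHash) != 64:
            raise ValueError("payloadHash must be a SHA-256 hex digest")


evt = SecurityEvent(
    source={"component": "gateway", "version": "0.0.0", "cluster": "c1", "region": "r1"},
    principal={"agentId": "agent-1", "sessionId": "sess-1", "tenantId": "tenant-1"},
    event={"type": "TOOL_CALL", "subtype": "invoke", "outcome": "allowed", "severity": "INFO"},
    detail={"tool": "example_tool"},
    traceContext={"traceId": "4bf92f3577b34da6a3ce929d0e0e4736", "spanId": "00f067aa0ba902b7"},
    payloadHash=payload_hash(b'{"query": "example"}'),
)
line = json.dumps(asdict(evt), sort_keys=True)  # one JSON line per event
```

Validating at construction time means a malformed event fails in the emitting component rather than silently missing a correlation rule downstream.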

Falco Rules for Agentic Platforms

Falco provides runtime syscall monitoring inside every container and Kata VM.

Key rules we ship out of the box:

Gateway process binding to 0.0.0.0 (violates Module 5)
exec from a non-approved binary inside an agent container
Unexpected outbound connection from an agent pod (bypassing egress allowlists)
Sensitive file read outside declared paths
Any process other than approved model-serving binaries accessing /dev/nvidia* (Module 17)

Falco output is shipped via Fluent Bit to both the WORM audit trail and the SIEM in parallel.
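
Before Falco alerts reach the SIEM, they need to be mapped into the canonical schema. The sketch below assumes Falco's JSON output mode (fields such as rule, priority, time, tags, and output_fields). The severity mapping, the tag-based choice between NETWORK and POLICY, and the function name normalize_falco_alert are illustrative assumptions, not the platform's actual pipeline.

```python
import hashlib
import json

# Assumed mapping from Falco priorities to this module's severity tiers.
FALCO_TO_SEVERITY = {
    "Emergency": "CRITICAL", "Alert": "CRITICAL", "Critical": "CRITICAL",
    "Error": "HIGH",
    "Warning": "WARNING",
    "Notice": "INFO", "Informational": "INFO", "Debug": "INFO",
}


def normalize_falco_alert(raw_line: str) -> dict:
    """Convert one Falco JSON alert line into a canonical-schema-shaped dict."""
    alert = json.loads(raw_line)
    fields = alert.get("output_fields") or {}
    tags = alert.get("tags") or []
    return {
        "source": {"component": "falco", "host": alert.get("hostname")},
        "timestamp": alert["time"],
        "event": {
            # Assumption: rules tagged "network" map to NETWORK, the rest to POLICY.
            "type": "NETWORK" if "network" in tags else "POLICY",
            "subtype": alert["rule"],
            "outcome": "detected",
            "severity": FALCO_TO_SEVERITY.get(alert.get("priority"), "WARNING"),
        },
        "detail": {
            "ruleId": alert["rule"],
            "containerId": fields.get("container.id"),
            "process": fields.get("proc.name"),
        },
        # The formatted output can contain command lines, so only its hash is kept.
        "payloadHash": hashlib.sha256(alert.get("output", "").encode("utf-8")).hexdigest(),
    }


sample = json.dumps({
    "output": "Unexpected outbound connection (command=curl example.invalid)",
    "priority": "Critical",
    "rule": "Unexpected outbound connection from agent pod",
    "time": "2024-01-01T00:00:00.000000000Z",
    "tags": ["network"],
    "output_fields": {"container.id": "abc123", "proc.name": "curl"},
})
normalized = normalize_falco_alert(sample)
```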

Wazuh SIEM Correlation

Wazuh correlates events across the entire platform. Three critical rules every deployment must have:

ATR violation followed by an egress block in the same session within 60 seconds → injection + exfiltration attempt.
Agent compromise event followed by a cross-pipeline memory read → lateral movement.
Five authentication failures from the same source within 5 minutes, followed by a successful authentication → credential stuffing.

All correlation rules are pinned to a specific schema version. Any schema upgrade requires the rules to be updated first.
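
The logic of the first rule can be sketched in Python as below. This is not Wazuh rule syntax (Wazuh rules are written in XML); it only illustrates the windowed, same-session matching the rule performs. The flat event shape and the subtype names ATR_VIOLATION and EGRESS_BLOCK are assumptions for the example.

```python
WINDOW_NS = 60 * 1_000_000_000  # 60 seconds, in nanoseconds


def find_injection_exfiltration(events):
    """Return one incident per egress block that follows an ATR violation
    in the same session within the 60-second window."""
    latest_violation = {}  # sessionId -> most recent ATR violation event
    incidents = []
    for evt in sorted(events, key=lambda e: e["timestamp"]):
        session = evt["sessionId"]
        if evt["subtype"] == "ATR_VIOLATION":
            latest_violation[session] = evt
        elif evt["subtype"] == "EGRESS_BLOCK":
            violation = latest_violation.get(session)
            if violation is not None and evt["timestamp"] - violation["timestamp"] <= WINDOW_NS:
                incidents.append({
                    "sessionId": session,
                    "classification": "injection + exfiltration attempt",
                    "eventIds": [violation["eventId"], evt["eventId"]],
                })
    return incidents


second = 1_000_000_000
events = [
    {"eventId": "e1", "sessionId": "s1", "subtype": "ATR_VIOLATION", "timestamp": 0},
    {"eventId": "e2", "sessionId": "s1", "subtype": "EGRESS_BLOCK", "timestamp": 30 * second},
    {"eventId": "e3", "sessionId": "s2", "subtype": "ATR_VIOLATION", "timestamp": 0},
    {"eventId": "e4", "sessionId": "s2", "subtype": "EGRESS_BLOCK", "timestamp": 90 * second},
]
incidents = find_injection_exfiltration(events)
```

In this example only session s1 produces an incident, because the egress block in s2 arrives 90 seconds after the violation.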

Metric Cardinality and Sensitive Data in Labels

Prometheus is excellent for operational metrics but dangerous for security data.

Rules we enforce:

Never use high-cardinality or sensitive labels: sessionId, agentId, userId, tenantId, memoryEntryId.
Allowed labels only: component, decision, ruleCategory, deploymentTier, region, atrRole.

Per-session and per-agent security metrics are emitted as log events to the WORM pipeline (restricted access) instead of Prometheus metrics. Prometheus scrape endpoints are restricted to the monitoring namespace via NetworkPolicy.
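
One way to enforce the label rules is a check that runs in CI or at metric registration time. The following is a minimal sketch using only the label names listed above; the function name check_metric_labels and the metric name in the example are hypothetical.

```python
ALLOWED_LABELS = frozenset(
    {"component", "decision", "ruleCategory", "deploymentTier", "region", "atrRole"}
)
FORBIDDEN_LABELS = frozenset(
    {"sessionId", "agentId", "userId", "tenantId", "memoryEntryId"}
)


def check_metric_labels(metric_name, label_names):
    """Return a list of human-readable problems; an empty list means the metric passes."""
    names = set(label_names)
    problems = []
    for name in sorted(names & FORBIDDEN_LABELS):
        problems.append(f"{metric_name}: label '{name}' is high-cardinality or sensitive")
    for name in sorted(names - ALLOWED_LABELS - FORBIDDEN_LABELS):
        problems.append(f"{metric_name}: label '{name}' is not on the allowlist")
    return problems


ok = check_metric_labels("panguard_decisions_total", ["component", "decision", "region"])
bad = check_metric_labels("panguard_decisions_total", ["decision", "sessionId", "hostname"])
```

Treating unknown labels as failures, not only the explicitly forbidden ones, keeps a newly introduced identifier from slipping into Prometheus unnoticed.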

Distributed Tracing for Security Call Graphs

We use W3C traceparent headers propagated across every inter-component call (gateway ↔ Panguard ↔ tool handler ↔ NATS ↔ subagent, etc.).
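
As a rough illustration of propagation, the sketch below parses a version 00 traceparent header and derives the header a component would send on its outgoing call. It handles only the version 00 format from the W3C Trace Context specification; the function names are hypothetical, and a production service would normally rely on a tracing SDK instead.

```python
import re
import secrets

# version "00", 32-hex trace-id, 16-hex parent-id, 2-hex trace-flags
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return the trace context as a dict, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid; this sketch accepts version 00 only.
    if version != "00" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "traceId": trace_id,
        "spanId": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming_header):
    """Build the header for an outgoing call: same trace ID, fresh span ID."""
    ctx = parse_traceparent(incoming_header)
    if ctx is None:
        raise ValueError("invalid traceparent header")
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['traceId']}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
outgoing = child_traceparent(incoming)
```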

Security-relevant spans include attributes:

Event type
Tool name
ATR claims presented
Panguard decision
Rule ID

Tail-based sampling policy:

100% retention for any trace containing a Panguard block, WORM write, memory integrity failure, or AUTH denial.
Security traces (with ATR claims and session IDs) are accessible only to the security team role.
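
The retention decision can be sketched as a function that runs once a trace is complete. The span attribute key security.event, the marker values, and the 5% baseline rate for non-security traces are assumptions for illustration; the source only fixes the 100% retention rule.

```python
import hashlib

# Assumed attribute values marking the span types that force retention.
ALWAYS_KEEP = {"PANGUARD_BLOCK", "WORM_WRITE", "MEMORY_INTEGRITY_FAILURE", "AUTH_DENIAL"}


def keep_trace(trace_id, spans, baseline_rate=0.05):
    """Tail-based decision: inspect the finished trace, then keep or drop it."""
    for span in spans:
        if span.get("attributes", {}).get("security.event") in ALWAYS_KEEP:
            return True  # security-relevant traces are never sampled out
    # Everything else is sampled deterministically by trace ID, so every
    # collector instance reaches the same decision for the same trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # value in [0, 1)
    return bucket < baseline_rate


blocked = keep_trace(
    "4bf92f3577b34da6a3ce929d0e0e4736",
    [{"name": "panguard.evaluate", "attributes": {"security.event": "PANGUARD_BLOCK"}}],
    baseline_rate=0.0,
)
routine = keep_trace(
    "4bf92f3577b34da6a3ce929d0e0e4736",
    [{"name": "tool.call", "attributes": {}}],
    baseline_rate=0.0,
)
```

The decision has to be made at the tail because the block or denial may occur in the last span of the trace, long after a head-based sampler would have committed to dropping it.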

NOC Dashboard: Three Views

We give operators exactly the information they need at a glance.

Current state (5-second refresh):

Panguard block rate vs 7-day baseline
Active Falco alerts by severity
Vault revocations in the last 60 minutes
Memory integrity check failures (must be zero)
HITL queue depth

Trend (hourly):

7-day rolling ATR violation rate per rule category
Egress anomaly rate per tenant (derived from log events, because tenantId is not an allowed Prometheus label)
Skill quarantine events this week

Investigation (on-demand):

Session timeline linking SIEM events → WORM audit entries → distributed traces for any sessionId
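
A minimal sketch of how the investigation view could assemble its timeline is shown below. It assumes each record from the three stores carries sessionId and a numeric timestamp; the function name and the origin field are hypothetical.

```python
def build_session_timeline(session_id, siem_events, worm_entries, trace_spans):
    """Merge records for one session from all three stores into time order."""
    timeline = []
    stores = (("siem", siem_events), ("worm", worm_entries), ("trace", trace_spans))
    for origin, records in stores:
        for record in records:
            if record.get("sessionId") == session_id:
                # Tag each record with the store it came from.
                timeline.append({**record, "origin": origin})
    timeline.sort(key=lambda record: record["timestamp"])
    return timeline


timeline = build_session_timeline(
    "s1",
    siem_events=[{"sessionId": "s1", "timestamp": 20, "summary": "correlated alert"}],
    worm_entries=[
        {"sessionId": "s1", "timestamp": 10, "summary": "ATR violation"},
        {"sessionId": "s2", "timestamp": 11, "summary": "unrelated session"},
    ],
    trace_spans=[{"sessionId": "s1", "timestamp": 15, "summary": "tool call span"}],
)
```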

Alert Tuning Methodology

We treat alert quality as a measurable engineering metric:

Every new rule starts in observation mode for 14 days (logs only).
Monthly precision measurement: classify each fired alert as true positive, false positive, or indeterminate.
Target precision above 80% before a rule is moved to paging.
Tiered routing: CRITICAL → PagerDuty (15-minute ACK deadline); HIGH → Slack + email; WARNING → Slack; INFO → WORM only.

Never route INFO or WARNING to the same channel as CRITICAL.
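
The precision gate and the tiered routing can be sketched together. The channel identifiers are placeholders, and excluding indeterminate alerts from the precision denominator is an assumption; the methodology above does not say how they are counted.

```python
PAGING_THRESHOLD = 0.80

# Severity tier to channels, mirroring the routing list above (names are placeholders).
ROUTES = {
    "CRITICAL": ["pagerduty"],
    "HIGH": ["slack", "email"],
    "WARNING": ["slack"],
    "INFO": ["worm"],
}


def precision(classifications):
    """Precision = TP / (TP + FP). Indeterminate alerts are excluded (assumption)."""
    true_pos = classifications.count("TP")
    false_pos = classifications.count("FP")
    if true_pos + false_pos == 0:
        return None  # not enough classified alerts to measure
    return true_pos / (true_pos + false_pos)


def ready_for_paging(classifications):
    """A rule may page only once its measured precision exceeds the threshold."""
    value = precision(classifications)
    return value is not None and value > PAGING_THRESHOLD


def route(severity):
    """Return the channels for a severity tier; unknown tiers fall back to WORM only."""
    return ROUTES.get(severity, ["worm"])


month = ["TP"] * 9 + ["FP"] * 1 + ["INDETERMINATE"] * 5
month_precision = precision(month)
promote = ready_for_paging(month)
```

Keeping the routing table as data makes it easy to assert in a test that CRITICAL never shares a channel with WARNING or INFO.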

Key Takeaways (Memorize These!)

A canonical event schema is the prerequisite for SIEM correlation — schema drift silently breaks detection rules.
Prometheus label cardinality is a security issue, not just an operational one — high-cardinality sensitive labels create an unintended data exposure path.
Tail-based sampling for security traces is mandatory — a security event that is sampled out cannot be investigated.
Alert fatigue is a security failure with the same practical outcome as having no alerts — precision measurement and tiered routing are not optional.

You now have observability that is designed for agentic platforms: high-signal, low-noise, tamper-evident, and cardinality-safe. When something goes wrong, the right people know within minutes, with full forensic context already available in the WORM trail. This turns detection from a hope into a reliable, measurable capability.