Splunk → Security Lakehouse: Migration Strategy
Author: RedEye Security | Date: 2026-04-06 | Status: Draft v1.0
Executive Summary
Replace a high-volume Splunk Enterprise deployment (~1B events/bucket, 100+ sourcetypes, multi-region) with a modern AWS-native security data lakehouse. The migration is non-disruptive - Splunk remains operational throughout, and teams migrate workloads at their own pace. Legacy users can keep Splunk as a query frontend via DB Connect while new consumers use Grafana, the direct API, or AI-driven interfaces.
Guiding Principles
- Zero forced cutover - Splunk stays live until each team voluntarily migrates
- Parallel ingest - data flows to both Splunk and the lakehouse simultaneously during transition
- Coexistence tools - Splunk DB Connect lets Splunk query lakehouse data (users see no change)
- AD-native access control - Grafana, API, and AI layer all authenticate via Active Directory (LDAP/SAML)
- Infrastructure as Code - every component managed via GitLab + Ansible + Jenkins pipelines
- Self-contained PKI - internal CA chains up to the Investcloud Certificate Authority
Current State (from Splunk data)
| Dimension | Detail |
|---|---|
| Peak ingest | ~1B events/time bucket |
| Sourcetypes | 100+ (Windows, Linux, Kubernetes, AWS, network, financial apps) |
| Indexes | 200+ (ic_* pattern, multi-region: use1/euw1/aps2/den/tpa) |
| Environments | prd, pre, uat per region |
| Custom apps | FIXTDR, MWM, TFLOW, RECON, NDM (trading platform) |
| Cloud sources | AWS (billing, CloudWatch, S3, metadata), Apigee, O365 |
| K8s | 10+ container sourcetypes (EKS clusters, multiple regions) |
Estimated daily ingest: 500 GB–2 TB/day compressed
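The ingest estimate can be sanity-checked with back-of-envelope arithmetic. The average event size and compression ratio below are assumptions for illustration, not measured values from the Splunk deployment:

```python
# Back-of-envelope ingest sizing. AVG_EVENT_BYTES and COMPRESSION_RATIO
# are assumed values, chosen to show the shape of the calculation.
EVENTS_PER_DAY = 1_000_000_000      # ~1B events per daily bucket
AVG_EVENT_BYTES = 4_000             # assumed average raw event size
COMPRESSION_RATIO = 4               # assumed Parquet/S3 compression factor

raw_gb = EVENTS_PER_DAY * AVG_EVENT_BYTES / 1e9
compressed_gb = raw_gb / COMPRESSION_RATIO

print(f"raw: ~{raw_gb:.0f} GB/day, compressed: ~{compressed_gb:.0f} GB/day")
```

Under these assumptions the result (~1 TB/day compressed) falls inside the 500 GB–2 TB range above; plugging in measured averages per sourcetype would tighten the estimate.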
Target State
    All log sources
          │
          ├────► Splunk (existing - stays live, optional long-term)
          │         └── DB Connect ────► Lakehouse (query passthrough for Splunk users)
          │
          └────► Vector (new parallel ingest)
                    │
                    ▼
          AWS S3 (raw + OCSF normalized)
                    │
                    ▼
          Apache Iceberg (table format, partitioned)
                    │
          ┌─────────┼─────────────┐
          │         │             │
          ▼         ▼             ▼
       Athena    Grafana     AI Query API
       (SQL)   (dashboards)   (NL → SQL)
          │         │             │
          └────┬────┘             │
               │                  │
               ▼                  ▼
       AD/LDAP auth        Jira/Zendesk →
       (per team)          auto dashboards
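The "S3 (raw + OCSF normalized) → Iceberg (partitioned)" layer above implies a partitioned object layout. The sketch below shows one plausible key scheme; the zone/region/env/sourcetype/date convention is an assumption for illustration, and with Iceberg's hidden partitioning the table metadata, not the path, would drive query pruning:

```python
from datetime import datetime, timezone

def s3_key(zone: str, region: str, env: str, sourcetype: str, ts: datetime) -> str:
    """Build a partitioned S3 object prefix for the lakehouse.

    The zone/region/env/sourcetype/date layout is an illustrative
    convention, not a mandated one.
    """
    safe_st = sourcetype.replace(":", "_")   # colons are awkward in S3 keys
    return (
        f"{zone}/region={region}/env={env}/"
        f"sourcetype={safe_st}/dt={ts:%Y-%m-%d}/"
    )

key = s3_key("raw", "use1", "prd", "kube:container:app",
             datetime(2026, 4, 6, tzinfo=timezone.utc))
print(key)  # raw/region=use1/env=prd/sourcetype=kube_container_app/dt=2026-04-06/
```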
Migration Phases
Phase 0 - Foundation (Weeks 1–4)
Goal: AWS infrastructure live, parallel ingest flowing, zero Splunk disruption
Success criteria: Data flowing to lakehouse, Splunk untouched, Grafana showing live data
Phase 1 - Sourcetype Migration (Weeks 5–12)
Goal: High-volume, low-complexity sourcetypes migrated first
Priority order (easiest × highest volume wins):
| Priority | Sourcetype group | Reason |
|---|---|---|
| 1 | aws:*, cloudflare:json | Native JSON, easy OCSF mapping |
| 2 | kube:container:* | Already structured, high volume |
| 3 | linux_audit, linux_secure, nix_logs | Standard formats |
| 4 | XmlWinEventLog*, WinHostMon | High value, OCSF auth events |
| 5 | cisco:asa, F5, nginx, syslog | Network - OCSF network activity |
| 6 | kafka, iis, tomcat | App logs |
| 7 | FIXTDR, MWM, TFLOW, RECON, NDM | Custom - needs schema work |
Each sourcetype migration follows the same steps:
1. Vector transform written + tested (OCSF mapping)
2. Grafana dashboard created (mirrors the existing Splunk equivalent)
3. Team notified: "Your data is in the lakehouse, Grafana dashboard ready"
4. Team optionally keeps Splunk access - no forced change
Phase 2 - Team Onboarding (Weeks 8–16, overlaps Phase 1)
Goal: Teams adopt Grafana/API; AD groups control access
- AD group mapping: sg-splunk-<team> → Grafana org/team
- Self-service: teams request new dashboards via Jira ticket → AI auto-generates
- API access: teams get service account tokens (scoped to their indexes)
- Power users: direct Athena/SQL access via AD group
- Splunk coexistence: DB Connect lets Splunk search lakehouse tables
Phase 3 - AI Layer (Weeks 12–20)
Goal: Natural language queries, auto-generated dashboards from tickets
- Jira/Zendesk webhook → AI pipeline → Grafana dashboard PR → auto-deploy
- NL query API: "Show me auth failures from EKS last 24h" → Athena SQL → result
- Anomaly detection: scheduled AI jobs flag unusual patterns, create Jira tickets
- Alert authoring: describe alert in English → generates Grafana alert rule
Phase 4 - Decommission Planning (Weeks 20–52)
Goal: Identify what (if anything) still needs Splunk
- Monthly report: which teams/searches still hitting Splunk only
- Migrate remaining workloads or document why Splunk stays
- Decision point: keep Splunk for specific use cases or full cutover
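The monthly "still Splunk-only" report reduces to a set difference over usage counts. The input shapes below are assumptions; real numbers would come from Splunk's _audit index and Athena query history:

```python
# Sketch of the Phase 4 monthly report: teams with Splunk searches this
# month but zero lakehouse queries. Input dicts map team → query count.
def splunk_only_teams(splunk_users: dict[str, int],
                      lakehouse_users: dict[str, int]) -> list[str]:
    return sorted(t for t, n in splunk_users.items()
                  if n > 0 and lakehouse_users.get(t, 0) == 0)

report = splunk_only_teams(
    {"fixtdr": 120, "recon": 40, "ndm": 0, "mwm": 15},   # Splunk searches
    {"fixtdr": 300, "mwm": 0, "tflow": 50},              # lakehouse queries
)
# → ["mwm", "recon"]
```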
Coexistence: Splunk + Lakehouse
Key tools that make parallel operation painless:
Splunk DB Connect
- Connects Splunk to Athena via JDBC
- Splunk users run `| dbxquery` to query lakehouse tables
- Zero workflow change for Splunk-dependent teams
- Use case: unified search across Splunk index + Iceberg table
Dual-ingest Vector
- Same log source → Vector → two sinks (Splunk HEC + S3/lakehouse)
- Splunk continues receiving data uninterrupted
- Lakehouse builds historical depth in parallel
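The dual-sink shape Vector implements can be illustrated as one event producing two payloads. The HEC envelope fields (`time`, `sourcetype`, `event`) are standard Splunk HEC; the S3 record layout (NDJSON with the timestamp and sourcetype folded in) is an assumption for illustration:

```python
import json

# One event in, two serialized payloads out: Splunk HEC + S3 raw zone.
def fan_out(event: dict, sourcetype: str, ts: float) -> tuple[str, str]:
    hec = json.dumps({            # → POST to Splunk HEC /services/collector
        "time": ts,
        "sourcetype": sourcetype,
        "event": event,
    })
    s3_line = json.dumps({        # → one NDJSON line in the S3 raw zone
        "ts": ts,
        "sourcetype": sourcetype,
        **event,
    })
    return hec, s3_line

hec, s3_line = fan_out({"msg": "login ok"}, "linux_secure", 1_775_000_000.0)
```

Because both payloads derive from the same in-memory event, the two stores stay consistent by construction, which is what makes the parallel run safe to compare.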
AD-controlled access tiers
| User type | Access method | Auth |
|---|---|---|
| Splunk power user | Splunk UI (no change) | Splunk AD integration |
| Dashboard consumer | Grafana | SAML → AD |
| Developer/analyst | Athena SQL, API | IAM role → AD federation |
| AI pipeline | Service account | Scoped IAM role |
Risk Register
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Splunk licensing cost during parallel run | High | Medium | Negotiate overlap window with Splunk; reduce forwarder count as sourcetypes migrate |
| OCSF mapping errors in custom sourcetypes | Medium | High | Raw data always preserved in S3; can re-process |
| Team resistance to UI change | High | Low | Splunk UI stays available; Grafana is additive |
| AWS cost overrun | Medium | High | Cost alerts at 80% of monthly budget; right-size after 30 days of data |
| PKI chain to Investcloud CA | Medium | High | Early engagement with PKI team; get subordinate CA cert signed in Phase 0 |
Key Decisions Needed
- AWS account structure - single account or landing zone with separate log-archive + security accounts?
- Splunk licensing model - term license or consumption? Affects decommission timing
- Investcloud CA - who owns the subordinate CA cert request? Timeline to get it signed?
- GitLab vs GitHub - using existing GitLab instance or new?
- Athena vs Trino - Athena is easiest (serverless, no infra), Trino gives more control
- Retention - hot (Iceberg/Athena queryable): 90 days? Cold (S3 Glacier): 7 years?
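The retention question translates directly into storage volumes. Assuming ~1 TB/day compressed (the midpoint of the document's estimate) and the 90-day-hot / 7-year-cold split proposed above:

```python
# Back-of-envelope retention sizing; daily volume is an assumed midpoint.
DAILY_COMPRESSED_TB = 1.0      # assumed steady-state compressed ingest
HOT_DAYS = 90                  # Iceberg/Athena-queryable tier
COLD_YEARS = 7                 # S3 Glacier tier

hot_tb = DAILY_COMPRESSED_TB * HOT_DAYS
cold_tb = DAILY_COMPRESSED_TB * 365 * COLD_YEARS - hot_tb

print(f"hot tier: ~{hot_tb:.0f} TB, cold tier: ~{cold_tb:.0f} TB")
```

Roughly 90 TB hot vs. ~2.5 PB in Glacier under these assumptions, which is why the hot-window choice dominates Athena scan cost while the cold tier dominates raw storage cost.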