⚑ InvestCloud Security Lakehouse

Migration Strategy

Parallel ingest, coexistence tools, sourcetype priority order, and risk register for a zero-disruption Splunk migration.

Splunk → Security Lakehouse: Migration Strategy

Author: RedEye Security | Date: 2026-04-06 | Status: Draft v1.0


Executive Summary

Replace a high-volume Splunk Enterprise deployment (~1B events/bucket, 100+ sourcetypes, multi-region) with a modern AWS-native security data lakehouse. The migration is non-disruptive - Splunk remains operational throughout, and teams migrate workloads at their own pace. Legacy users can keep Splunk as a query frontend via DB Connect while new consumers use Grafana, direct API, or AI-driven interfaces.


Guiding Principles

  1. Zero forced cutover - Splunk stays live until each team voluntarily migrates
  2. Parallel ingest - data flows to both Splunk and the lakehouse simultaneously during transition
  3. Coexistence tools - Splunk DB Connect lets Splunk query lakehouse data (users see no change)
  4. AD-native access control - Grafana, API, and AI layer all authenticate via Active Directory (LDAP/SAML)
  5. Infrastructure as Code - every component managed via GitLab + Ansible + Jenkins pipelines
  6. Self-contained PKI - internal CA chains up to the InvestCloud Certificate Authority

Current State (from Splunk data)

| Dimension | Detail |
|---|---|
| Peak ingest | ~1B events/time bucket |
| Sourcetypes | 100+ (Windows, Linux, Kubernetes, AWS, network, financial apps) |
| Indexes | 200+ (ic_* pattern, multi-region: use1/euw1/aps2/den/tpa) |
| Environments | prd, pre, uat per region |
| Custom apps | FIXTDR, MWM, TFLOW, RECON, NDM (trading platform) |
| Cloud sources | AWS (billing, CloudWatch, S3, metadata), Apigee, O365 |
| K8s | 10+ container sourcetypes (EKS clusters, multiple regions) |

Estimated daily ingest: 500 GB–2 TB compressed


Target State

All log sources
      │
      ├──► Splunk (existing - stays live, optional long-term)
      │         └── DB Connect ──► Lakehouse (query passthrough for Splunk users)
      │
      └──► Vector (new parallel ingest)
                │
                ▼
          AWS S3 (raw + OCSF normalized)
                │
                ▼
          Apache Iceberg (table format, partitioned)
                │
        ┌───────┼────────────┐
        │       │            │
        ▼       ▼            ▼
     Athena   Grafana    AI Query API
     (SQL)  (dashboards)  (NL → SQL)
        │       │            │
        └───┬───┘            │
            │                │
            ▼                ▼
      AD/LDAP auth     Jira/Zendesk
      (per team)       → auto dashboards
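To make the query layer concrete, here is an illustrative Athena query against an OCSF-normalized Iceberg table. The database and table names (security_lakehouse, ocsf_authentication) and the column names are assumptions, not the final schema.

```sql
-- Illustrative only: assumes an OCSF-style authentication table
-- registered in the Glue catalog and queryable by Athena.
SELECT activity_name,
       count(*) AS failures
FROM   security_lakehouse.ocsf_authentication
WHERE  status = 'Failure'
  AND  event_time > current_timestamp - interval '24' hour
GROUP  BY activity_name
ORDER  BY failures DESC
LIMIT  20;
```

The same SQL runs unchanged from Grafana's Athena data source, from DB Connect, or from the AI query API, which is what lets one table serve all three consumer paths.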

Migration Phases

Phase 0 - Foundation (Weeks 1–4)

Goal: AWS infrastructure live, parallel ingest flowing, zero Splunk disruption

Success criteria: Data flowing to lakehouse, Splunk untouched, Grafana showing live data


Phase 1 - Sourcetype Migration (Weeks 5–12)

Goal: High-volume, low-complexity sourcetypes migrated first

Priority order (easiest → highest volume wins):

| Priority | Sourcetype group | Reason |
|---|---|---|
| 1 | aws:*, cloudflare:json | Native JSON, easy OCSF mapping |
| 2 | kube:container:* | Already structured, high volume |
| 3 | linux_audit, linux_secure, nix_logs | Standard formats |
| 4 | XmlWinEventLog*, WinHostMon | High value, OCSF auth events |
| 5 | cisco:asa, F5, nginx, syslog | Network - OCSF network activity |
| 6 | kafka, iis, tomcat | App logs |
| 7 | FIXTDR, MWM, TFLOW, RECON, NDM | Custom - needs schema work |

Each sourcetype migration:

  1. Vector transform written + tested (OCSF mapping)
  2. Grafana dashboard created (mirrors existing Splunk equivalent)
  3. Team notified: "Your data is in the lakehouse, Grafana dashboard ready"
  4. Team optionally keeps Splunk access - no forced change
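Step 1 above is a Vector remap transform per sourcetype. The sketch below shows the shape of such a transform for a kube:container:* event; the field names on both sides (kubernetes_logs input, OCSF attribute names, class_uid value) are placeholders to be replaced with the real OCSF class mapping.

```toml
# Sketch only - input name and OCSF field names are assumptions.
[transforms.kube_to_ocsf]
type = "remap"
inputs = ["kubernetes_logs"]
source = '''
.class_uid = 1001                              # assumed OCSF class id
.time = .timestamp                             # carry original event time
.cloud.provider = "AWS"
.container.name = .kubernetes.container_name   # lift k8s metadata
.metadata.original_sourcetype = "kube:container"
'''
```

Because raw events are also landed in S3 untouched, a wrong mapping here can be corrected and the data re-processed (see the OCSF risk in the risk register).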


Phase 2 - Team Onboarding (Weeks 8–16, overlaps Phase 1)

Goal: Teams adopt Grafana/API; AD groups control access

  • AD group mapping: sg-splunk-<team> → Grafana org/team
  • Self-service: teams request new dashboards via Jira ticket → AI auto-generates
  • API access: teams get service account tokens (scoped to their indexes)
  • Power users: direct Athena/SQL access via AD group
  • Splunk coexistence: DB Connect lets Splunk search lakehouse tables
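The AD group mapping in the first bullet can be expressed directly in Grafana's LDAP configuration. A minimal sketch, assuming LDAP rather than SAML and with a placeholder group DN:

```toml
# Grafana ldap.toml fragment - group DN and org_id are placeholders.
[[servers.group_mappings]]
group_dn = "CN=sg-splunk-trading,OU=Groups,DC=investcloud,DC=example"
org_role = "Editor"
org_id   = 1
```

One group_mappings block per team keeps the mapping declarative, so it can live in GitLab and be rolled out by the same Ansible/Jenkins pipelines as the rest of the stack.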

Phase 3 - AI Layer (Weeks 12–20)

Goal: Natural language queries, auto-generated dashboards from tickets

  • Jira/Zendesk webhook → AI pipeline → Grafana dashboard PR → auto-deploy
  • NL query API: "Show me auth failures from EKS last 24h" → Athena SQL → result
  • Anomaly detection: scheduled AI jobs flag unusual patterns, create Jira tickets
  • Alert authoring: describe alert in English → generates Grafana alert rule
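For the NL query API, a safe design is to have the model emit a structured intent and render the SQL in code, so a hallucinated table or column can never reach Athena. A minimal sketch; the table and column names (ocsf_authentication, status, src_platform) are assumptions, not the real schema:

```python
# Sketch: render a guarded Athena SELECT from a structured "query intent"
# such as an LLM might emit for "show me auth failures from EKS last 24h".
from dataclasses import dataclass, field

# Allow-list of queryable tables - illustrative names.
ALLOWED_TABLES = {"ocsf_authentication", "ocsf_network_activity"}

@dataclass
class QueryIntent:
    table: str
    filters: dict = field(default_factory=dict)
    last_hours: int = 24
    limit: int = 100

def intent_to_sql(intent: QueryIntent) -> str:
    """Render SQL, rejecting tables/columns outside the allow-list."""
    if intent.table not in ALLOWED_TABLES:
        raise ValueError(f"table not allowed: {intent.table}")
    where = [
        f"event_time > current_timestamp - interval '{int(intent.last_hours)}' hour"
    ]
    for col, val in intent.filters.items():
        if not col.isidentifier():          # crude column-name sanity check
            raise ValueError(f"bad column: {col}")
        where.append(f"{col} = '{val}'")
    return (
        f"SELECT * FROM {intent.table} "
        f"WHERE {' AND '.join(where)} "
        f"LIMIT {int(intent.limit)}"
    )

sql = intent_to_sql(QueryIntent(table="ocsf_authentication",
                                filters={"status": "Failure",
                                         "src_platform": "eks"}))
print(sql)
```

The rendered string would then be submitted via the Athena API (e.g. boto3's start_query_execution) under the scoped service-account IAM role from Phase 2.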

Phase 4 - Decommission Planning (Weeks 20–52)

Goal: Identify what (if anything) still needs Splunk

  • Monthly report: which teams/searches still hitting Splunk only
  • Migrate remaining workloads or document why Splunk stays
  • Decision point: keep Splunk for specific use cases or full cutover
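The monthly usage report can be driven from Splunk's own audit index. A hedged sketch, assuming the standard _audit fields are populated in this deployment:

```spl
index=_audit action=search info=completed
| stats count AS searches, latest(_time) AS last_search BY user
| sort - searches
```

Cross-referencing the top users against teams already onboarded to Grafana identifies who still depends on Splunk-only workloads.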

Coexistence: Splunk + Lakehouse

Key tools that make parallel operation painless:

Splunk DB Connect

  • Connects Splunk to Athena via JDBC
  • Splunk users run | dbxquery to query lakehouse tables
  • Zero workflow change for Splunk-dependent teams
  • Use case: unified search across Splunk index + Iceberg table
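An illustrative unified search, assuming a DB Connect identity/connection named lakehouse_athena and placeholder table and index names:

```spl
| dbxquery connection="lakehouse_athena"
    query="SELECT src_ip, count(*) AS lakehouse_hits
           FROM security_lakehouse.ocsf_network_activity
           WHERE event_time > current_timestamp - interval '1' hour
           GROUP BY src_ip"
| join type=left src_ip
    [ search index=ic_network_use1_prd | stats count AS splunk_hits BY src_ip ]
```

From the analyst's seat this is still just SPL in the Splunk search bar, which is what keeps the coexistence invisible.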

Dual-ingest Vector

  • Same log source → Vector → two sinks (Splunk HEC + S3/lakehouse)
  • Splunk continues receiving data uninterrupted
  • Lakehouse builds historical depth in parallel
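A sketch of the dual-sink fan-out in Vector. Endpoints, bucket, and source path are placeholders, and sink option names (splunk_hec_logs, default_token, encoding) vary by Vector version, so verify against the installed release:

```toml
[sources.app_logs]
type = "file"
include = ["/var/log/app/*.log"]    # placeholder path

# Sink 1: existing Splunk pipeline, unchanged from the forwarder's view.
[sinks.splunk]
type = "splunk_hec_logs"
inputs = ["app_logs"]
endpoint = "https://splunk-hec.example.com:8088"
default_token = "${SPLUNK_HEC_TOKEN}"
encoding.codec = "json"

# Sink 2: raw landing zone for the lakehouse.
[sinks.lakehouse_raw]
type = "aws_s3"
inputs = ["app_logs"]
bucket = "ic-security-lakehouse-raw"   # placeholder bucket
key_prefix = "raw/%Y/%m/%d/"
encoding.codec = "json"
```

Because both sinks read the same source, dropping the Splunk sink later is a one-line config change per sourcetype, which is how parallel ingest winds down during Phase 4.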

AD-controlled access tiers

| User type | Access method | Auth |
|---|---|---|
| Splunk power user | Splunk UI (no change) | Splunk AD integration |
| Dashboard consumer | Grafana | SAML → AD |
| Developer/analyst | Athena SQL, API | IAM role → AD federation |
| AI pipeline | Service account | Scoped IAM role |

Risk Register

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Splunk licensing cost during parallel run | High | Medium | Negotiate overlap window with Splunk; reduce forwarder count as sourcetypes migrate |
| OCSF mapping errors in custom sourcetypes | Medium | High | Raw data always preserved in S3; can re-process |
| Team resistance to UI change | High | Low | Splunk UI stays available; Grafana is additive |
| AWS cost overrun | Medium | High | Cost alerts at 80% of monthly budget; right-size after 30 days of data |
| PKI chain to InvestCloud CA | Medium | High | Early engagement with PKI team; get subordinate CA cert signed in Phase 0 |

Key Decisions Needed

  1. AWS account structure - single account or landing zone with separate log-archive + security accounts?
  2. Splunk licensing model - term license or consumption? Affects decommission timing
  3. InvestCloud CA - who owns the subordinate CA cert request? Timeline to get it signed?
  4. GitLab vs GitHub - using existing GitLab instance or new?
  5. Athena vs Trino - Athena is easiest (serverless, no infra), Trino gives more control
  6. Retention - hot (Iceberg/Athena queryable): 90 days? Cold (S3 Glacier): 7 years?