Splunk → Security Lakehouse: Migration Strategy
Author: RedEye Security | Date: 2026-04-06 | Status: Draft v1.0
Executive Summary
Replace a high-volume Splunk Enterprise deployment (~1B events/bucket, 100+ sourcetypes, multi-region) with a modern AWS-native security data lakehouse. The migration is non-disruptive - Splunk remains operational throughout, and teams migrate workloads at their own pace. Legacy users can keep Splunk as a query frontend via DB Connect while new consumers use Grafana, the direct API, or AI-driven interfaces.
Guiding Principles
- Zero forced cutover - Splunk stays live until each team voluntarily migrates
- Parallel ingest - data flows to both Splunk and the lakehouse simultaneously during transition
- Coexistence tools - Splunk DB Connect lets Splunk query lakehouse data (users see no change)
- AD-native access control - Grafana, API, and AI layer all authenticate via Active Directory (LDAP/SAML)
- Infrastructure as Code - every component managed via GitLab + Ansible + Jenkins pipelines
- Self-contained PKI - internal CA chains up to the Investcloud Certificate Authority
Current State (from Splunk data)
| Dimension | Detail |
|---|---|
| Peak ingest | ~1B events/time bucket |
| Sourcetypes | 100+ (Windows, Linux, Kubernetes, AWS, network, financial apps) |
| Indexes | 200+ (ic_* pattern, multi-region: use1/euw1/aps2/den/tpa) |
| Environments | prd, pre, uat per region |
| Custom apps | FIXTDR, MWM, TFLOW, RECON, NDM (trading platform) |
| Cloud sources | AWS (billing, CloudWatch, S3, metadata), Apigee, O365 |
| K8s | 10+ container sourcetypes (EKS clusters, multiple regions) |
Estimated daily ingest: 500 GB–2 TB/day compressed
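The ingest estimate can be sanity-checked with back-of-envelope arithmetic. The average event size and compression ratio below are assumptions for illustration, not measured values from the Splunk deployment:

```python
# Back-of-envelope ingest sizing. AVG_EVENT_BYTES and COMPRESSION_RATIO
# are assumed values, chosen to show the shape of the calculation.
EVENTS_PER_DAY = 1_000_000_000      # ~1B events per daily bucket
AVG_EVENT_BYTES = 4_000             # assumed average raw event size
COMPRESSION_RATIO = 4               # assumed Parquet/S3 compression factor

raw_gb = EVENTS_PER_DAY * AVG_EVENT_BYTES / 1e9
compressed_gb = raw_gb / COMPRESSION_RATIO

print(f"raw: ~{raw_gb:.0f} GB/day, compressed: ~{compressed_gb:.0f} GB/day")
```

Under these assumptions the result (~1 TB/day compressed) falls inside the 500 GB–2 TB range above; plugging in measured averages per sourcetype would tighten the estimate.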
Target State
    All log sources
          │
          ├────► Splunk (existing - stays live, optional long-term)
          │         └── DB Connect ────► Lakehouse (query passthrough for Splunk users)
          │
          └────► Vector (new parallel ingest)
                    │
                    ▼
          AWS S3 (raw + OCSF normalized)
                    │
                    ▼
          Apache Iceberg (table format, partitioned)
                    │
          ┌─────────┼─────────────┐
          │         │             │
          ▼         ▼             ▼
       Athena    Grafana     AI Query API
       (SQL)   (dashboards)   (NL → SQL)
          │         │             │
          └────┬────┘             │
               │                  │
               ▼                  ▼
       AD/LDAP auth        Jira/Zendesk →
       (per team)          auto dashboards
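The "S3 (raw + OCSF normalized) → Iceberg (partitioned)" layer above implies a partitioned object layout. The sketch below shows one plausible key scheme; the zone/region/env/sourcetype/date convention is an assumption for illustration, and with Iceberg's hidden partitioning the table metadata, not the path, would drive query pruning:

```python
from datetime import datetime, timezone

def s3_key(zone: str, region: str, env: str, sourcetype: str, ts: datetime) -> str:
    """Build a partitioned S3 object prefix for the lakehouse.

    The zone/region/env/sourcetype/date layout is an illustrative
    convention, not a mandated one.
    """
    safe_st = sourcetype.replace(":", "_")   # colons are awkward in S3 keys
    return (
        f"{zone}/region={region}/env={env}/"
        f"sourcetype={safe_st}/dt={ts:%Y-%m-%d}/"
    )

key = s3_key("raw", "use1", "prd", "kube:container:app",
             datetime(2026, 4, 6, tzinfo=timezone.utc))
print(key)  # raw/region=use1/env=prd/sourcetype=kube_container_app/dt=2026-04-06/
```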
Migration Phases
Phase 0 - Foundation (Weeks 1–4)
Goal: AWS infrastructure live, parallel ingest flowing, zero Splunk disruption
Success criteria: Data flowing to lakehouse, Splunk untouched, Grafana showing live data
Phase 1 - Sourcetype Migration (Weeks 5–12)
Goal: High-volume, low-complexity sourcetypes migrated first
Priority order (easiest × highest volume wins):
| Priority | Sourcetype group | Reason |
|---|---|---|
| 1 | aws:*, cloudflare:json | Native JSON, easy OCSF mapping |
| 2 | kube:container:* | Already structured, high volume |
| 3 | linux_audit, linux_secure, nix_logs | Standard formats |
| 4 | XmlWinEventLog*, WinHostMon | High value, OCSF auth events |
| 5 | cisco:asa, F5, nginx, syslog | Network - OCSF network activity |
| 6 | kafka, iis, tomcat | App logs |
| 7 | FIXTDR, MWM, TFLOW, RECON, NDM | Custom - needs schema work |
Each sourcetype migration follows the same steps:
1. Vector transform written + tested (OCSF mapping)
2. Grafana dashboard created (mirrors the existing Splunk equivalent)
3. Team notified: "Your data is in the lakehouse, Grafana dashboard ready"
4. Team optionally keeps Splunk access - no forced change
Phase 2 - Team Onboarding (Weeks 8–16, overlaps Phase 1)
Goal: Teams adopt Grafana/API; AD groups control access
- AD group mapping: sg-splunk-<team> → Grafana org/team
- Self-service: teams request new dashboards via Jira ticket → AI auto-generates
- API access: teams get service account tokens (scoped to their indexes)
- Power users: direct Athena/SQL access via AD group
- Splunk coexistence: DB Connect lets Splunk search lakehouse tables
Phase 3 - AI Layer (Weeks 12–20)
Goal: Natural language queries, auto-generated dashboards from tickets
- Jira/Zendesk webhook → AI pipeline → Grafana dashboard PR → auto-deploy
- NL query API: "Show me auth failures from EKS last 24h" → Athena SQL → result
- Anomaly detection: scheduled AI jobs flag unusual patterns, create Jira tickets
- Alert authoring: describe alert in English → generates Grafana alert rule
Phase 4 - Decommission Planning (Weeks 20–52)
Goal: Identify what (if anything) still needs Splunk
- Monthly report: which teams/searches still hitting Splunk only
- Migrate remaining workloads or document why Splunk stays
- Decision point: keep Splunk for specific use cases or full cutover
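The monthly "still Splunk-only" report reduces to a set difference over usage counts. The input shapes below are assumptions; real numbers would come from Splunk's _audit index and Athena query history:

```python
# Sketch of the Phase 4 monthly report: teams with Splunk searches this
# month but zero lakehouse queries. Input dicts map team → query count.
def splunk_only_teams(splunk_users: dict[str, int],
                      lakehouse_users: dict[str, int]) -> list[str]:
    return sorted(t for t, n in splunk_users.items()
                  if n > 0 and lakehouse_users.get(t, 0) == 0)

report = splunk_only_teams(
    {"fixtdr": 120, "recon": 40, "ndm": 0, "mwm": 15},   # Splunk searches
    {"fixtdr": 300, "mwm": 0, "tflow": 50},              # lakehouse queries
)
# → ["mwm", "recon"]
```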
Coexistence: Splunk + Lakehouse
Key tools that make parallel operation painless:
Splunk DB Connect
- Connects Splunk to Athena via JDBC
- Splunk users run `| dbxquery` to query lakehouse tables
- Zero workflow change for Splunk-dependent teams
- Use case: unified search across Splunk index + Iceberg table
Dual-ingest Vector
- Same log source → Vector → two sinks (Splunk HEC + S3/lakehouse)
- Splunk continues receiving data uninterrupted
- Lakehouse builds historical depth in parallel
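The dual-sink shape Vector implements can be illustrated as one event producing two payloads. The HEC envelope fields (`time`, `sourcetype`, `event`) are standard Splunk HEC; the S3 record layout (NDJSON with the timestamp and sourcetype folded in) is an assumption for illustration:

```python
import json

# One event in, two serialized payloads out: Splunk HEC + S3 raw zone.
def fan_out(event: dict, sourcetype: str, ts: float) -> tuple[str, str]:
    hec = json.dumps({            # → POST to Splunk HEC /services/collector
        "time": ts,
        "sourcetype": sourcetype,
        "event": event,
    })
    s3_line = json.dumps({        # → one NDJSON line in the S3 raw zone
        "ts": ts,
        "sourcetype": sourcetype,
        **event,
    })
    return hec, s3_line

hec, s3_line = fan_out({"msg": "login ok"}, "linux_secure", 1_775_000_000.0)
```

Because both payloads derive from the same in-memory event, the two stores stay consistent by construction, which is what makes the parallel run safe to compare.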
AD-controlled access tiers
| User type | Access method | Auth |
|---|---|---|
| Splunk power user | Splunk UI (no change) | Splunk AD integration |
| Dashboard consumer | Grafana | SAML → AD |
| Developer/analyst | Athena SQL, API | IAM role → AD federation |
| AI pipeline | Service account | Scoped IAM role |
Risk Register
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Splunk licensing cost during parallel run | High | Medium | Negotiate overlap window with Splunk; reduce forwarder count as sourcetypes migrate |
| OCSF mapping errors in custom sourcetypes | Medium | High | Raw data always preserved in S3; can re-process |
| Team resistance to UI change | High | Low | Splunk UI stays available; Grafana is additive |
| AWS cost overrun | Medium | High | Cost alerts at 80% of monthly budget; right-size after 30 days of data |
| PKI chain to Investcloud CA | Medium | High | Early engagement with PKI team; get subordinate CA cert signed in Phase 0 |
Key Decisions Needed
- AWS account structure - single account or landing zone with separate log-archive + security accounts?
- Splunk licensing model - term license or consumption? Affects decommission timing
- Investcloud CA - who owns the subordinate CA cert request? Timeline to get it signed?
- GitLab vs GitHub - using existing GitLab instance or new?
- Athena vs Trino - Athena is easiest (serverless, no infra), Trino gives more control
- Retention - hot (Iceberg/Athena queryable): 90 days? Cold (S3 Glacier): 7 years?
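The retention question translates directly into storage volumes. Assuming ~1 TB/day compressed (the midpoint of the document's estimate) and the 90-day-hot / 7-year-cold split proposed above:

```python
# Back-of-envelope retention sizing; daily volume is an assumed midpoint.
DAILY_COMPRESSED_TB = 1.0      # assumed steady-state compressed ingest
HOT_DAYS = 90                  # Iceberg/Athena-queryable tier
COLD_YEARS = 7                 # S3 Glacier tier

hot_tb = DAILY_COMPRESSED_TB * HOT_DAYS
cold_tb = DAILY_COMPRESSED_TB * 365 * COLD_YEARS - hot_tb

print(f"hot tier: ~{hot_tb:.0f} TB, cold tier: ~{cold_tb:.0f} TB")
```

Roughly 90 TB hot vs. ~2.5 PB in Glacier under these assumptions, which is why the hot-window choice dominates Athena scan cost while the cold tier dominates raw storage cost.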