Security Data Engineering
From security data deluge to agent-first security: How LLMs+agents and new architectures are reinventing security observability
Intro:
Eighteen months after ChatGPT’s release, with OpenAI’s APIs now reaching enterprise use cases, we have a good sense of what LLMs are best suited for: searching, retrieving and parsing enormous data sets to automate repetitive tasks while leveraging domain-specific knowledge. To find where LLMs will have enterprise use cases, look for the largest data sets (“data gravity”): customer support, domain knowledge, logs and events, customer and operational data, sales and marketing data, and IT infrastructure data are among the biggest.
In response to the rise of foundation models and the vast training data sets powering them, enterprises and technology vendors are increasingly investing in their own observability and telemetry data to build defensibility. By collecting unique, deeply integrated product usage data, they create proprietary data assets that are difficult for foundation models to replicate. As data becomes more commoditized, this truly owned data is becoming a vital competitive advantage. With the proliferation of data sources and cloud applications, security as a field has come to resemble data engineering.
Logs and events are typically the largest-volume data set in the modern enterprise (often petabytes per day), coming from sources such as system and application logs, security events and network traffic. These logs are mission-critical for monitoring, diagnostics and compliance. They grow enormous over time, create major data engineering problems, and are real-time sensitive. Overage costs can be a nightmare for executives, who have limited predictability on data volumes and thus costs. It’s not uncommon for F500 enterprises to spend tens to hundreds of millions on observability data: OpenAI is paying Datadog in excess of $100M, Coinbase paid $65M/yr in 2023, and Capital One pays >$50M. Thus, logs and events stand out as a prime use case for inserting an LLM, given the sheer volume and frequency of data generation across all systems and applications.
The observability market is fragmented, with leading vendors like Datadog, Splunk, and Dynatrace each holding less than 20% market share while together generating over $10B of revenue. Further, the number of Cribl customers sending data to multiple SIEM products increased 45% year-over-year, as different data sources require different destinations and lock-in has weakened with new storage formats [Cribl Report]. Microsoft Sentinel crossed $1B of ARR within 3 years of launch. Total observability spend exceeds $30B and is reasonably fragmented. A Gartner survey found the average F2000 runs between 7 and 10 observability tools, each with its own query language and data model. The number of data sources is growing 32% year-over-year, and over a third of Cribl customers consume data from ten or more sources. Modern distributed systems generate petabytes of telemetry data daily, in heterogeneous formats such as logs, metrics, and traces – all across different tools.
While this works for storing the data, the intelligence layer is fragmented across data silos. The real challenge isn’t just collecting or storing this data but making sense of it quickly enough to drive real business value. Data dimensionality, the rise of OpenTelemetry, and open table formats in storage are key trends reshaping the market. These complex data challenges become far more addressable with AI and LLMs, which can read heterogeneous data formats.
Basics on Data Schema for Security Observability:
Observability is the process of collecting and analyzing data to understand system performance, and it is made up of three pillars. Each pillar has its own data types and formats, historically requiring its own query engine and specialized storage; a minimal sketch of representative records for each pillar follows the list below.
1. Logs
Nature: Semi-structured data capturing events within a system; extremely voluminous, with detailed information on user actions, system errors, and access attempts
Key attributes:
Structured Data: Timestamps, severity level, service and instance information, tracing information, basic user and request info, basic error details, metadata.
Unstructured Data: Detailed event descriptions, detailed user and request info, detailed error messages.
Use cases: Debugging, auditing, and tracking errors; servers and applications capture each request and response
2. Metrics
Nature: Structured, numerical data representing quantitative system state over time; less voluminous than logs since metrics are sampled at regular intervals
Use cases: Monitoring system performance and tracking resource usage like GPU/CPU usage, memory consumption, request rates, error rates
3. Traces
Nature: Semi-structured, moderate volume (much less than logs), since traces capture detailed information about the flow of requests through a system
Structured attributes: Trace ID, span ID, parent span ID, service name, operation name, timestamps, duration, status code, resource info
Unstructured attributes: Annotations like custom messages or status updates, logs, tags/labels like error descriptions that provide additional context, metadata about the operation such as user environment
Use cases: Capturing the flow of a user request through multiple microservices, including time spent in each service; to be used for performance optimization, identifying bottlenecks and understanding end-to-end request flows
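To make the distinction concrete, here is a minimal sketch (in Python, with purely illustrative field values and schema, not any vendor’s format) of what a record from each pillar might look like, separating the structured identifiers from the free-text fields an LLM would parse:

```python
# Illustrative only: representative records for the three observability pillars,
# showing which fields are structured vs. free-text.

log_event = {
    # structured fields
    "timestamp": "2024-05-14T08:31:02Z",
    "severity": "ERROR",
    "service": "payments-api",
    "instance": "payments-api-7d9f",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "user_id": "u-1842",
    # unstructured field: free-text detail an LLM can parse
    "message": "Charge failed for order 99213: upstream card processor timed out after 3 retries",
}

metric_point = {
    # fully structured, numeric sample taken at a regular interval
    "name": "process_cpu_seconds_total",
    "timestamp": "2024-05-14T08:31:00Z",
    "value": 1823.4,
    "labels": {"service": "payments-api", "instance": "payments-api-7d9f"},
}

trace_span = {
    # structured identifiers that tie a request's hops together
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span_id": "00f067aa0ba902b7",
    "parent_span_id": None,
    "service": "checkout-frontend",
    "operation": "POST /checkout",
    "start": "2024-05-14T08:31:01.950Z",
    "duration_ms": 412,
    "status_code": "ERROR",
    # semi-structured annotations
    "tags": {"error.description": "downstream payments-api returned 500"},
}
```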
Data Dimensionality:
In the realm of data processing and storage, particularly for log data, the dimensionality and complexity of the data significantly influence how it is managed. Hydrolix, for example, specializes in handling large, multi-dimensional transaction logs such as CDN logs, which capture entire user sessions, including detailed activities on platforms like Disney+. These logs are not only vast in size but also rich in context, making them critical for long-term storage and analysis without any data loss. Hydrolix’s approach is built to accommodate the high-dimensional nature of these logs, ensuring that every piece of data is retained and accessible for years, which is essential for compliance and in-depth analytics.
On the other hand, traditional application logs, which are smaller and less complex, are often generated by microservices and containerized environments. These logs are typically high in frequency but lower in individual value, leading to the development of solutions like Cribl, which intelligently filters and decimates less valuable logs, metrics, and traces. Cribl’s method involves identifying and discarding low-value logs while forwarding the more relevant data to platforms like Splunk for further analysis. This approach contrasts with Hydrolix’s, where the goal is to preserve the integrity of all data due to the higher dimensionality and value of transaction logs. Together, these strategies highlight the importance of tailoring data processing and storage techniques to the specific nature and dimensionality of the data being managed.
Where are LLMs useful in observability?
Thus, given that logs have the highest data volumes and the most unstructured, text-heavy attributes, LLMs are the most natural fit here. LLMs should be highly effective at parsing and analyzing text-heavy log data. With logs, LLMs can do the following (a minimal sketch follows the list):
Extract Info: Identify and extract key entities, error messages, and patterns from logs
Anomaly Detection: Detect unusual patterns or anomalies that indicate system issues.
Summarization: Generate summaries of log data to highlight critical events and trends.
Log Categorization: Classify and tag log entries to streamline searching and analysis.
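As a sketch of the extraction and categorization use cases above, the snippet below asks an OpenAI-compatible model to turn a raw log line into structured fields. The model name, prompt, and output schema are illustrative assumptions, not a specific vendor’s implementation:

```python
# Minimal sketch (not a production pipeline): extract entities, categorize,
# and summarize a raw log line via an OpenAI-compatible chat model.
import json
from openai import OpenAI

client = OpenAI()  # or point at any OpenAI-compatible endpoint

def enrich_log_line(raw_line: str) -> dict:
    """Return {category, severity, entities, summary} extracted from one log line."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: swap for whichever model you run
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                "You label security/observability logs. Reply as JSON with keys "
                "category, severity, entities (list), summary."
            )},
            {"role": "user", "content": raw_line},
        ],
    )
    return json.loads(resp.choices[0].message.content)

print(enrich_log_line(
    "2024-05-14T08:31:02Z sshd[2217]: Failed password for invalid user admin from 203.0.113.7 port 51234"
))
```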
Metrics have limited usefulness with LLMs as they are typically structured and numerical. Expect incumbents to integrate the few useful advances from LLMs like natural language summaries of metrics trends or anomalies, or predictive analysis based on trends in data.
Traces have medium usefulness, with some excellent and defensible use cases given the structured identifiers mixed with semi-structured annotations at moderate data volume. For traces, LLMs can perform (see the sketch after this list):
Root Cause Analysis: Identify patterns in trace data that may indicate performance bottlenecks or errors.
Trace Analysis: Understand and summarize the flow of requests through various services.
Correlation and Context: Provide contextual insights by correlating trace data with logs and metrics to offer a comprehensive view of system behavior.
Predictive Maintenance: By analyzing logs and patterns, LLMs can predict system failures or performance issues before they occur.
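A hedged sketch of the correlation use case: group spans and logs by trace_id so each failing request carries its own evidence, which is the context an LLM would need for root-cause analysis. The field names are assumptions, not a vendor schema:

```python
# Sketch: correlate spans and logs on trace_id to assemble root-cause context.
from collections import defaultdict

def build_rca_context(spans: list[dict], logs: list[dict]) -> dict[str, dict]:
    """Group spans and logs by trace_id so each failing request carries its own evidence."""
    by_trace: dict[str, dict] = defaultdict(lambda: {"spans": [], "logs": []})
    for span in spans:
        by_trace[span["trace_id"]]["spans"].append(span)
    for log in logs:
        if log.get("trace_id"):
            by_trace[log["trace_id"]]["logs"].append(log)
    # keep only traces containing an error span; these become LLM prompts
    return {
        tid: ctx for tid, ctx in by_trace.items()
        if any(s.get("status_code") == "ERROR" for s in ctx["spans"])
    }
```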
Surprisingly, upon digging in, many of the state-of-the-art techniques for security data engineering are highly rules-based, interpreted from the schema of customers’ event data rather than driven by a trained AI model. The rules layer originally started with rules for cost savings, such as deciding which data to route to cold storage or systems cheaper than Splunk, but opportunities also exist for processing with a focus on escalation. Cribl has built a $200M ARR business, growing 70% and valued at $3.5B, by applying these principles of data engineering (traditionally found in ETL for warehouses) to security data systems. This is their secret sauce, and it is an alternative to custom-configured Kafka rules.
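A minimal sketch of what such a rules layer looks like in practice: per-event routing to the SIEM, cold storage, or the drop bin. The thresholds and field names are illustrative assumptions, not Cribl’s actual rules:

```python
# Sketch of a rules-based routing layer: decide, per event, whether it goes to
# the SIEM (hot, expensive), cheaper cold storage, or is dropped entirely.

HIGH_VALUE_SOURCES = {"auth", "firewall", "edr"}

def route_event(event: dict) -> str:
    source = event.get("source", "")
    severity = event.get("severity", "INFO")
    if severity in {"CRITICAL", "ERROR"} or source in HIGH_VALUE_SOURCES:
        return "siem"           # escalate: keep searchable in Splunk/Sentinel/etc.
    if severity == "WARN":
        return "cold_storage"   # retain cheaply (e.g., object storage) for audits
    return "drop"               # debug/noise events are decimated

assert route_event({"source": "auth", "severity": "INFO"}) == "siem"
assert route_event({"source": "app", "severity": "DEBUG"}) == "drop"
```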
Security Data Engineering (Today):
This diagram outlines the data flow from initial sources like Network Data, Identity Data, Cloud APM Data, CDN Data, and Infrastructure Security Data through data processing and streaming, and finally to storage in SIEM systems or data lakes. Data begins at these source points, where it is generated in huge volumes; as mentioned, these are the largest data sets in most enterprises, which creates plenty of challenges. Cribl’s report shares insights on the most popular data sources it sees, with most of its enterprise customers using more than 10 different data sources. Splunk is the most popular across the board, while other tools like S3 are popular with fast-growing companies, and O365/Windows Event Logs are most popular in the enterprise. You’ll notice the flow looks a lot like traditional data engineering around Snowflake/Databricks.
Historically, data transformation has been limited because the sheer costs are prohibitive. Thus, tools in the “Preprocess, Filter, and Enrich” category (e.g., Cribl and Splunk DSP; Observo, Tarsal and Databahn among the new startups) are relatively new, bringing new techniques to clean, filter, and prepare this data so it is optimized for further use. These preprocessing companies are the most opportune entry point for startups, as they can stay neutral across vendors and have both a clear cost-savings ROI and a clear quality ROI. There is surprisingly limited use of AI in these transformations today.
Once processed, data often moves into “Data Streaming” platforms (e.g., Kafka, Pulsar, Flink) that handle real-time data flow, critical for applications requiring immediate insights. Finally, the data is stored in “Traditional SIEM” systems like Splunk and Elastic or in “Data Lakes” such as Snowflake and Databricks, where it can be analyzed and retained long-term. The map also highlights emerging players in “Next-Gen Data Lake/SIEM” like Hydrolix and Runreveal, which are designed to manage modern data demands with advanced analytics and storage capabilities. This structured flow ensures that data is efficiently managed from generation to storage. Many of these next-gen SIEMs have similar product marketing on cost savings, just through different solutions like storage formats.
Much like with data sources, we see Splunk (the destination for the majority of Cribl customers) and S3 topping the list of most popular destinations. However, we see growing fragmentation away from Splunk’s historical dominance on the destination side, with CrowdStrike’s Falcon SIEM, Azure Logs (via Sentinel) and Google SecOps each growing >250% in data volume across the Cribl user base. There was a 73% increase in companies using multiple SIEM products this year. Destination systems are fragmenting, with 90%+ of Cribl customers sending to 2+ destinations, 12% sending to 4+ destinations, and the overall number of destinations growing 15% YoY.
The Splunk Architecture: Forward, Indexer and Search Head
Splunk’s architecture is designed around three core components: the forwarder, the indexer, and the search head, each playing a critical role in data collection, processing, and querying. The forwarder is responsible for collecting data from various sources, such as sensors, APIs, and firewall appliances, and sending it to the indexer in real time; this is what Cribl has attacked head on. However, the forwarder is agnostic to the data it receives, meaning all data sent to the indexer counts against a user’s data allowance, regardless of its relevance or value.
The indexer ingests the data and builds an index to facilitate efficient querying. This process, however, comes with significant cost and performance challenges, especially in cloud environments. Tools like Cribl offer a solution by pre-processing data before it reaches the indexer, removing unnecessary fields and reducing storage costs. Cribl’s ability to send event data to cold storage can save companies up to 97% of their storage costs, significantly reducing the financial burden of managing large volumes of log data. Additionally, Snowflake and Databricks provide a compelling alternative for data storage and querying. Their cloud-native architectures scale automatically and can offer querying speeds up to 200 times faster than traditional SIEM systems like Splunk, making them powerful tools for organizations needing to analyze large datasets quickly. These advantages allow companies to streamline their data management processes, reduce costs, and improve the speed and accuracy of their security operations. Thus, this data engineering flow around the SIEM is being unbundled.
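A rough sketch of the pre-indexing reduction such tools perform: prune fields the indexer does not need so that less licensed volume ever reaches Splunk. The keep-list below is an assumption; real pipelines are configured per source:

```python
# Sketch of field pruning before data reaches the indexer.

KEEP_FIELDS = {"timestamp", "severity", "service", "trace_id", "message"}

def reduce_event(event: dict) -> dict:
    """Drop fields the index doesn't need and truncate oversized free-text payloads."""
    slim = {k: v for k, v in event.items() if k in KEEP_FIELDS}
    if isinstance(slim.get("message"), str) and len(slim["message"]) > 2048:
        slim["message"] = slim["message"][:2048] + "...[truncated]"
    return slim
```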
Current Challenges:
In Security Data Engineering, buyers have focused on two main problems. One of the primary challenges is the ballooning cost associated with storing the vast and ever-growing volumes of security data generated by today’s distributed systems. This data, ranging from unstructured logs to events and telemetry to complex traces, often reaches petabyte scale, leading to soaring and unpredictable storage expenses. These costs are exacerbated by the lack of a unified data model, which forces companies to rely on specialized storage solutions and tools. This situation has resulted in fragmented systems and data silos, further complicating data management and driving up operational costs. When it comes to managing the dimensionality of data and how that affects the growing volume of security data, Cribl and Hydrolix take different approaches. Cribl focuses on decimating low-value, high-frequency application logs, intelligently filtering out less important data to optimize storage and reduce costs. In contrast, Hydrolix handles high-dimensional transaction logs, ensuring that all data is preserved without any loss, which is crucial for long-term analysis and compliance.
The second challenge is the difficulty of distinguishing critical security signals from the vast amounts of data noise. Retrieval and ranking of relevant information remain major challenges in observability, despite advances in data search capabilities. The ability to efficiently identify what and where to look within massive data volumes continues to be a bottleneck. Mean Time to Resolution (MTTR) for critical incidents still averages 4-5 hours, according to Gartner research. Traditional systems often fail to prioritize and filter effectively, making it hard to detect significant threats. The core issue isn’t just collecting or storing this data but making sense of it quickly enough to provide real business value—a problem that has often been treated as a big data issue rather than an intelligence one.
Further, modern observability tools face a variety of technical challenges, including handling heterogeneous data formats and grappling with the absence of a unified data model. Iceberg and other open table formats are making the storage problem less severe: with these tools, there is less need to duplicate data across different systems because the query engine can read from multiple sources. For example, Cribl’s Search product can search across Splunk, cold storage, time-series DBs and other security tooling thanks to the rise of open table formats. There may be an opportunity for companies like Lakeway to support these search use cases on observability data. Additionally, the diversity of query languages across different tools—such as Lucene for Elasticsearch, PromQL for Prometheus, and various SQL-like languages for tracing—adds another layer of complexity. This diversity makes it difficult for teams to effectively diagnose and resolve issues, as they must navigate multiple systems to get a complete picture of system health.
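To illustrate the query-in-place pattern that open formats enable, the sketch below uses DuckDB (purely as an example engine; the bucket and paths are hypothetical) to run one SQL query over log data sitting in object storage and on local disk, without first duplicating it into a SIEM:

```python
# Sketch: one engine reads Parquet log data in S3 and locally in a single query.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # enable S3/HTTP reads

errors = con.execute("""
    SELECT service, count(*) AS n
    FROM read_parquet([
        's3://acme-security-lake/logs/2024/05/*.parquet',   -- cold storage
        '/var/log/export/edge-*.parquet'                    -- local export
    ])
    WHERE severity = 'ERROR'
    GROUP BY service
    ORDER BY n DESC
""").fetchall()
print(errors)
```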
These challenges open up significant opportunities for companies like Cribl, Tarsal, Databahn and Observo, which are developing solutions to preprocess, filter, and enrich data, making it easier to prune useless data and to identify the most critical security signals – this attacks the core of Splunk’s forwarder technology. LLMs are particularly promising in this space because they offer a unified approach to data analysis. LLMs are well-suited to parsing and interpreting the unstructured, text-heavy data found in logs, and they can also incorporate context from system documentation, code repositories, and historical incident reports. This ability to adapt quickly to new data patterns makes LLMs a powerful tool for tackling the ongoing problem of concept drift in dynamic systems. Further, LLMs can be used as a policy/cost hygiene layer in addition to a semantic understanding layer. Depending on requirements, enterprises will have much more flexibility in addressing these challenges.
Traditionally, using AI for these tasks was seen as too costly, but recent advancements in open-source models are changing this. A new approach using a Llama-based log and event transformation model, hosted within a private environment, offers a scalable and cost-efficient way to enhance logs by identifying patterns, categorizing data, and prioritizing critical events.
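A minimal sketch of that pattern, assuming a Llama-family model served behind an OpenAI-compatible endpoint inside the private environment (vLLM and Ollama both expose such endpoints); the endpoint, model name, and prompt are assumptions:

```python
# Sketch: a privately hosted model scores batches of events so only
# high-priority ones are escalated.
from openai import OpenAI

client = OpenAI(base_url="http://llm.internal:8000/v1", api_key="not-needed")

def prioritize(events: list[str]) -> list[str]:
    """Return one of {escalate, retain, discard} per event, in order."""
    numbered = "\n".join(f"{i}. {e}" for i, e in enumerate(events, 1))
    resp = client.chat.completions.create(
        model="llama-3-8b-instruct",  # assumption: whichever fine-tune is hosted
        messages=[
            {"role": "system", "content":
             "For each numbered security event, answer with its number and one of: "
             "escalate, retain, discard. One per line."},
            {"role": "user", "content": numbered},
        ],
    )
    lines = resp.choices[0].message.content.strip().splitlines()
    return [line.split()[-1].lower() for line in lines]
```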
This approach not only enhances threat detection capabilities but also optimizes resource allocation in secure environments, helping to resolve many of the fragmentation issues currently plaguing the observability market. However, it’s important to recognize that while LLMs are powerful, they come with their own set of challenges, particularly around real-time processing, which can be hindered by current latency and cost limitations. A promising future direction involves combining LLMs with graph databases, which could enable better automated root cause analysis and potentially reduce Mean Time to Resolution (MTTR) by a significant margin.
Emerging Technologies: The Rise of Open Observability
OpenTelemetry: As data in observability becomes far more open, frameworks like OpenTelemetry are transforming the industry by standardizing the collection and transmission of telemetry data across various platforms. OpenTelemetry, an open-source observability framework, enables the seamless integration of different observability tools, allowing for the collection of data such as traces, metrics, and logs in a consistent format. This openness significantly reduces the risk of vendor lock-in and increases fragmentation of destinations, as organizations can now easily switch between back-end systems without the need to replace their existing data collection infrastructure. Read our piece “What is OpenTelemetry” to learn more.
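For reference, a minimal OpenTelemetry tracing setup in Python looks like the following; because the exporter speaks OTLP, the same instrumentation can ship spans to any compatible backend, which is the vendor-neutrality point above (the collector endpoint is an assumption):

```python
# Minimal OpenTelemetry tracing setup using the Python SDK and OTLP exporter.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "payments-api"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("charge-card") as span:
    span.set_attribute("order.id", "99213")
```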
Open Table Formats: Traditionally, observability vendors often locked customers into their ecosystems through proprietary data formats and collection agents, making it difficult and costly to switch providers. However, with OpenTelemetry, organizations gain the flexibility to select the best tools for their specific needs, promoting a more competitive and interoperable market. This shift is further reinforced by the adoption of Cribl Search, which complements OpenTelemetry by offering search capabilities across various destinations like S3 and other security data lakes, making it easier to manage and analyze observability data in a more open and vendor-agnostic environment. Read our piece “What is Apache Iceberg” to learn more.
Open-Source Models: We have yet to see much application of small, domain-specific models for preprocessing and routing events, but expect this to emerge as mentioned above.
Startup Opportunities:
1) AI-First Security Data Engineering or “Security ETL”
Description: Platforms that enhance security data engineering by focusing on efficient data preprocessing, filtering, and enrichment. Cribl reduces vendor lock-in by allowing organizations to route and process observability data across multiple platforms, such as Splunk, Datadog, and Elastic, enabling them to avoid dependency on a single vendor and choose the best tools for their specific needs. This is a big market that is growing faster than traditional ETL players like Fivetran, with Cribl at over $200M of ARR and growing 90%. In particular, Splunk’s forwarder is being unbundled.
Cribl began as a cost-savings story for Splunk (the destination for roughly 75% of the data Cribl routes). Cribl promised to reduce Splunk bills by 30-40%.
Instead of routing all data to Splunk, Cribl emerged to route some of it to cold storage, hot storage, or other systems like a time-series DB. Cold storage can save 97% versus Splunk. [Cribl Docs]
The emergence of open table formats has enabled Cribl to get into the search business, as Cribl Search offers querying across various destinations. Because you can now query across destinations, there is less need to store everything in Splunk.
Their solutions aim to reduce storage costs and improve the relevance of security data retained for analysis. These tools are most commonly deployed on-premise (roughly ⅚ of Cribl deployments) due to cost efficiency.
2) AI-Based Tracing and Root-Cause Analysis, and Event Correlation
Observability tools are evolving beyond traditional metrics, logs, and traces by integrating code-level insights, which addresses significant gaps in root cause analysis. This shift allows for a more comprehensive understanding of system behavior by connecting performance data with the underlying code.
New AI models, including LLMs, are enhancing observability by analyzing code, metrics, and logs within context, which reduces dependency on domain expertise and improves troubleshooting efficiency.
Description: LLMs have the potential to revolutionize tracing by efficiently analyzing and correlating vast amounts of trace data, identifying patterns and anomalies, and providing deeper contextual insights. This capability can help teams more accurately and quickly pinpoint the root causes of issues. Moreover, LLMs can enhance accessibility by enabling natural language interfaces, allowing engineers to query systems using plain language. For example, asking a question like ‘Show me all HTTP 500 errors in the payment service correlated with high CPU usage in the last hour’ could yield immediate and accurate results, simplifying the debugging process and reducing reliance on specialized knowledge of query languages.
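A hedged sketch of such a natural-language interface: an LLM translates the engineer’s question into the backend’s query language before execution (the model name is an assumption, and the returned query would still need validation before running):

```python
# Sketch: translate a plain-language observability question into a backend query.
from openai import OpenAI

client = OpenAI()

def nl_to_query(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption
        messages=[
            {"role": "system", "content":
             "Translate the user's observability question into a single search query "
             "for the logging backend. Return only the query."},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content.strip()

print(nl_to_query(
    "Show me all HTTP 500 errors in the payment service correlated with high CPU usage in the last hour"
))
```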
New tracing techniques can automatically identify and resolve issues in complex distributed systems. Building on the advancements of companies like BigPanda, Moogsoft and Epsagon (acquired by Cisco for $500M), which represent the latest generation of event correlation and advanced tracing solutions, new startups are now exploring opportunities to further enhance the management and troubleshooting of large-scale, event-driven architectures.
3) Affordable and/or Open-Source Telemetry
OpenTelemetry’s impact is extending beyond reducing vendor lock-in. The observability landscape is seeing increased fragmentation as companies utilize multiple tools simultaneously, such as Datadog, Prometheus, and Grafana, which are integrated through open standards like OpenTelemetry.
Description: The high costs of traditional observability and security logging tools create significant opportunities for open-source solutions like Signoz. By providing an affordable, open-source alternative for observability, Signoz can replicate the success that Grafana had with attaching itself to Prometheus (Grafana recently raised at $6B on $270M of ARR). Notably, at the first Prometheus conference just five months after Grafana’s release, 30% of attendees were already using Grafana. Signoz is attaching itself to OpenTelemetry, which is the 2nd most popular CNCF open-source repo after Kubernetes.
4) Next-Gen Orchestration:
Company: Maestro does not yet have a company behind it
Description: Maestro (GitHub) by Netflix is a next-generation orchestration tool that streamlines the management of complex, distributed applications, enabling more efficient and resilient operations. Experts like Bharat at Airbnb have mentioned that Airbnb is moving off Airflow (of note, Airflow was developed at Airbnb and led to Astronomer) towards Maestro, and that the market needs a new solution for orchestration.
5) Next-Gen Storage Formats
Description: Nimble (Github) by Facebook is a new storage technology for columnar data (as an alternative to Iceberg, Hudi and Delta Lake). They claim significant cost savings particularly for ML and analytical workloads on event data. There is perhaps a gap in the market now with Iceberg / Tabular and Delta Lake owned by Databricks for something that is independent. This alone may not be big enough to build a platform company, but is a differentiated attribute that addresses the cost problem. Differentiated storage formats can be a big differentiator for innovation further up the stack, such as with Hydrolix and its cost efficiencies. This may be a differentiated entry point, overcoming the data gravity of incumbents.