Why training LLMs with endpoint data will strengthen cybersecurity

DTN

2 years ago

Capturing weak signals across endpoints and predicting potential intrusion attempt patterns is a perfect challenge for Large Language Models (LLMs) to take on.

The goal is to mine attack data to find new threat patterns and correlations while fine-tuning LLMs and models. Leading endpoint detection and response (EDR) and extended detection and response (XDR) vendors are taking on the challenge.

Nikesh Arora, Palo Alto Networks chairman and CEO, said, “We collect the most amount of endpoint data in the industry from our XDR. We collect almost 200 megabytes per endpoint, which is, in many cases, 10 to 20 times more than most of the industry participants. Why do you do that? Because we take that raw data and cross-correlate or enhance most of our firewalls, we apply attack surface management with applied automation using XDR.”

CrowdStrike co-founder and CEO George Kurtz told the keynote audience at the company’s annual Fal.Con event last year, “One of the areas that we’ve really pioneered is that we can take weak signals from across different endpoints. And we can link these together to find novel detections. We’re now extending that to our third-party partners so that we can look at other weak signals across not only endpoints but across domains and come up with a novel detection.”

XDR has proven successful in delivering less noise and better signals. Leading XDR platform providers include Broadcom, Cisco, CrowdStrike, Fortinet, Microsoft, Palo Alto Networks, SentinelOne, Sophos, TEHTRIS, Trend Micro and VMware.
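The idea of linking weak signals from different endpoints into one detection can be illustrated with a small sketch. This is not any vendor's algorithm. The `Signal` fields, the noisy-OR score combination, and both thresholds are assumptions chosen only to show how events that are individually too weak to alert on can add up when they share an indicator across several machines.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    endpoint: str   # hypothetical endpoint identifier
    indicator: str  # shared artifact, e.g. a remote IP or a file hash
    score: float    # weak per-event suspicion score in [0, 1]


def correlate(signals, alert_threshold=0.8, min_endpoints=2):
    """Group weak signals by indicator and flag those seen on several endpoints.

    Scores are combined with a noisy-OR: 1 - product(1 - score).
    Both thresholds are illustrative assumptions.
    """
    by_indicator = defaultdict(list)
    for signal in signals:
        by_indicator[signal.indicator].append(signal)

    detections = {}
    for indicator, group in by_indicator.items():
        endpoints = {s.endpoint for s in group}
        if len(endpoints) < min_endpoints:
            continue  # a single endpoint is not a cross-endpoint pattern
        miss = 1.0
        for s in group:
            miss *= 1.0 - s.score
        combined = 1.0 - miss
        if combined >= alert_threshold:
            detections[indicator] = (round(combined, 3), sorted(endpoints))
    return detections


signals = [
    Signal("laptop-01", "203.0.113.7", 0.40),
    Signal("server-02", "203.0.113.7", 0.50),
    Signal("laptop-03", "203.0.113.7", 0.45),
    Signal("laptop-01", "198.51.100.9", 0.30),
]
print(correlate(signals))
```

No single event here crosses the alert threshold, but the three that share an address on three different endpoints do once combined. The fourth signal is ignored because it appears on only one endpoint.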

Enhancing LLMs with telemetry and human-annotated data defines the future of endpoint security. In Gartner’s latest Hype Cycle for Endpoint Security, the authors write, “Endpoint security innovations focus on faster, automated detection and prevention, and remediation of threats, powering integrated, extended detection and response (XDR) to correlate data points and telemetry from endpoint, network, web, email and identity solutions.”

Spending on EDR and XDR is growing faster than the broader information security and risk management market.

That’s creating higher levels of competitive intensity across EDR and XDR vendors. Gartner predicts the endpoint protection platform market will grow from $14.45 billion today to $26.95 billion in 2027, achieving a compound annual growth rate (CAGR) of 16.8%. The worldwide information security and risk management market is predicted to grow from $164 billion in 2022 to $287 billion in 2027, achieving an 11% CAGR.

VentureBeat recently sat down (virtually) with Elia Zaitsev, CTO of CrowdStrike, to understand why training LLMs with endpoint data will strengthen cybersecurity. His insights also reflect how quickly LLMs are becoming the new DNA of endpoint security.

Elia Zaitsev: “So when the company was started, one of the reasons why it was created as a cloud-native company is that we wanted to use AI and ML technologies to solve tough customer problems.

Because if you think about the legacy technologies, everything was happening at the edge, right? You were making all the decisions and all the data lived at the edge, but there was this idea we had that if you wanted to use AI technology, you needed to have, especially for those older ML type solutions, which are still by the way, very effective. You need that quantity of information and you can only get that with a cloud technology where you can bring in all the information. We could train these heavy-duty classifiers into the cloud and then we can deploy them at the edge.

So train in the cloud, deploy to the edge, and make smart decisions. The funny thing though, is that’s occurring now that generative AI is coming into the fore and they’re different technologies. Those are less about deciding what’s good and what’s bad and more about empowering human beings like taking a workflow and accelerating it.”

Zaitsev: “It’s not about replacing human beings, it’s about augmenting humans. It’s that AI-assisted human, which I think is such a key concept, and I think too many people in technology, and I’ll say this as a CTO, I’m supposed to be all about the technology, the focus sometimes goes too far on wanting to replace the humans. I think that’s very misguided, especially in cyber.

But when you think about the way the underlying technology works, gen AI, it’s actually not necessarily about quantity. Quality becomes much more important. You need a lot of data to create these models to begin with, but then when it comes time to actually teach it to do something specific, and this is key when you want to go from that general model that can speak English or whatever language, and you want to do what’s called fine-tuning when you want to teach it how to do something like summarize an incident for a security analyst or operate a platform, these are the kinds of things that our generative product Charlotte AI is doing.”

Zaitsev: “Most of these automation technologies, whether it’s LLMs or something like that, they don’t tend to replace humans really. They tend to automate the rote basic tasks and allow the expert humans to take their valuable time and focus on something harder. Usually, people start asking, what about the adversaries using AI? And to me it’s a pretty simple conversation.

In a typical arms race, the adversaries are going to use AI and other technologies to automate some baseline level of threats. Great. You use AI to counteract that.

So you balance that out and then what do you have left? You’ve still got a really savvy, smart human attacker rising above the noise, and that’s why you’re still going to need a really smart, savvy defender.”

Zaitsev: “When we build LLMs, it’s actually easier to train many small LLMs on these specific use cases. So take that Overwatch dataset that Falcon completed, that [threat] intel dataset.

It’s actually easier and less prone to hallucination to take a small purpose-built large language model or maybe call it a small language model if you will. You can actually tune them and get higher accuracy and less hallucinations if you’re working on a smaller purpose-built one than trying to take these big monolithic ones and make them like a jack of all trades. So what we use is a concept called a mixture of experts.

You actually in many cases get better efficacy with these LLM technologies when you’ve got specialization, right? A couple of really purpose-built LLMs working together versus trying to get one super smart one that actually doesn’t do anything particularly well. It does a lot of things poorly versus any one thing particularly well. We also apply validation.
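The specialization idea described above, several purpose-built models instead of one monolith, can be sketched as application-level routing. Note the assumption: this is a dispatcher in front of separate models, which is one plausible reading of the quote, not the in-network mixture-of-experts layer used inside some neural architectures, and not a description of how Charlotte AI is built. The expert functions are hypothetical stand-ins for small fine-tuned models, and the keyword router stands in for what would more likely be a learned classifier.

```python
import re
from typing import Callable, Dict


# Hypothetical stand-ins for small, purpose-built models. They are plain
# functions so the sketch runs without any model weights.
def summarize_incident(text: str) -> str:
    return f"[incident-summary] {text}"


def explain_vulnerability(text: str) -> str:
    return f"[vuln-explainer] {text}"


def general_fallback(text: str) -> str:
    return f"[general] {text}"


EXPERTS: Dict[str, Callable[[str], str]] = {
    "incident": summarize_incident,
    "vulnerability": explain_vulnerability,
    "general": general_fallback,
}

# Assumed keyword sets; a production router would not be this simple.
ROUTES = {
    "incident": {"incident", "alert", "detection", "triage"},
    "vulnerability": {"cve", "vulnerability", "patch", "exploit"},
}


def route(request: str) -> str:
    """Pick the expert whose keywords overlap most with the request."""
    words = set(re.findall(r"[a-z0-9]+", request.lower()))
    best, best_hits = "general", 0
    for name, keywords in ROUTES.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best, best_hits = name, hits
    return best


def answer(request: str) -> str:
    """Send the request to exactly one specialist."""
    return EXPERTS[route(request)](request)


print(answer("Summarize this incident for the analyst"))
print(answer("Which patch fixes this vulnerability?"))
```

Keeping each specialist narrow is what makes the routing step matter: a request that matches no specialist falls back to the general handler instead of being forced through the wrong one.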

We’ll let the LLMs do some things, but then we’ll also check the output. We’ll use it to operate the platform. We’re ultimately basing the responses on our telemetry on our platform API so that there’s some trust in the underlying data.

It’s not just coming out of the ether, out of the LLM’s brain, so to speak, right? It’s rooted in a foundation of truth.”

Zaitsev: When you start to do those types of use cases, you don’t need millions and billions and trillions of examples. What you need is actually in many cases, a couple of thousand, maybe tens of thousands of examples, but they need to be very high quality and ideally what we call human-annotated data sets.

You basically want an expert to say to the AI systems, this is how I would do it, learn from my example. So I won’t take credit and say we knew that the generative AI boom was going to happen 11, 12 years ago, but because we were always passionate believers in this idea of AI assisting humans not replacing humans, we set up all these expert human teams from day one. So as it turns out, because we’ve in many ways uniquely been investing in our human capacity and building up this high-quality human annotated platform data, we now all of a sudden have this goldmine, right, this treasure trove of exactly the right kind of information you need to create these generative AI large language models, specifically fine-tuned to cybersecurity use cases on our platform.

So a little bit of good luck there.

Zaitsev: Our approach, I’ll use the old adage: when all you have is a hammer, everything looks like a nail, right? And this is not true just for AI technology. It is the way we approach data storage layers.

We’ve always been a fan of this concept of using all the technologies because when you don’t constrain yourself to use one thing, you don’t have to. So Charlotte is a multi-modal system. It uses multiple LLMs, but it also uses non-LLM technology.

LLMs are good at instruction following. They’re going to take natural language interfaces and convert them into structured tasks.

Zaitsev: The output that the user sees from Charlotte is almost always based off of some platform data.

For example, vulnerability information from our Spotlight product. We may take that data and then tell Charlotte to summarize it for a layperson. Again, things that LLMs are good at, and we may train it off of our internal data.

That’s not customer-specific, by the way. It’s general information about vulnerabilities, and that’s how we deal with the privacy aspects. The customer-specific data is not trained into Charlotte; it’s the general knowledge of vulnerabilities.

The customer-specific data is powered by the platform. So that’s how we keep that separation of church and state, so to speak. The private data is on the Falcon platform.

The LLMs get trained on and hold general cybersecurity knowledge, and in any case, make sure you’re never exposing that naked LLM to the end user so that we can apply the validation.
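The flow Zaitsev describes (natural language turned into a structured task, answers grounded in platform data, and output checked before the user sees it) can be sketched roughly as follows. Everything here is an assumption for illustration: `PLATFORM_VULNS` is a made-up in-memory store rather than a real platform API, the host names and CVE identifiers are placeholders, and `parse_request` and `draft_summary` are simple functions standing in for the LLM steps.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical stand-in for customer-specific platform data. It is read at
# request time and is never part of any model's training data.
PLATFORM_VULNS = {
    "host-a": ["CVE-2023-0001", "CVE-2023-0002"],
    "host-b": [],
}

CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")


@dataclass
class Task:
    action: str
    host: str


def parse_request(text: str) -> Task:
    """Stand-in for the LLM step: natural language in, structured task out."""
    lowered = text.lower()
    match = re.search(r"host-[a-z0-9]+", lowered)
    if match is None or "vulnerab" not in lowered:
        raise ValueError("unsupported request")
    return Task(action="list_vulnerabilities", host=match.group(0))


def run_task(task: Task) -> List[str]:
    """Fetch facts from the platform store, not from the model's memory."""
    if task.host not in PLATFORM_VULNS:
        raise KeyError(task.host)
    return list(PLATFORM_VULNS[task.host])


def draft_summary(task: Task, facts: List[str]) -> str:
    """Stand-in for the LLM step that summarizes the facts for a layperson."""
    if not facts:
        return f"{task.host} has no known vulnerabilities."
    return f"{task.host} has {len(facts)} known vulnerabilities: {', '.join(facts)}."


def validate(summary: str, facts: List[str]) -> bool:
    """Reject any draft that cites a CVE the platform did not return."""
    return set(CVE_PATTERN.findall(summary)) <= set(facts)


def handle(text: str) -> str:
    """Full pipeline: the raw model output is never returned unchecked."""
    task = parse_request(text)
    facts = run_task(task)
    summary = draft_summary(task, facts)
    if not validate(summary, facts):
        return "Unable to produce a verified answer."
    return summary


print(handle("What vulnerabilities does host-a have?"))
```

The design choice worth noting is that the user only ever calls `handle`. The drafting step cannot reach the user without passing `validate`, which is one simple way to avoid exposing an unchecked model directly.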

From: venturebeat
URL: https://venturebeat.com/security/why-training-llms-with-endpoint-data-will-strengthen-cybersecurity/