Protect Data Security When Deploying LLMs with the Dig Platform

Sharon Farber

Large language models (LLMs) and generative AI are undoubtedly the biggest tech story of 2023. These technologies, which power OpenAI’s headline-dominating ChatGPT, are now broadly available to be used as building blocks for new software. Many product teams are giddy with excitement; at the same time, security professionals have broken into a cold sweat over the potential security hazards of implementing new and often poorly understood tools, especially when these tools are given access to enterprise data.

As is often the case, the best security teams will find ways to enable innovation securely, rather than try to stand in its way. This is also our approach at Dig – which is why we have developed a set of capabilities to support companies looking to train or deploy LLMs while maintaining data security, privacy, and compliance. 

We detail the Dig solution below, but let’s start by understanding the data-related risks of generative AI in the enterprise.

LLMs + Enterprise Data: Use Cases and Potential Risks

AI is moving so fast that making any prediction in this space is fraught. However, we can point to an emerging trend: enterprises are exploring use cases that involve ‘feeding’ the company’s own data to a large language model, rather than relying on the general-purpose chatbots provided by the likes of OpenAI and Google. This trend was exemplified by Databricks’s recent $1.3BN acquisition of MosaicML, which develops software to help enterprises train their own LLMs. 

Companies can choose to incorporate their own data into generative AI systems in multiple ways:

  • Fine-tuning on company data: Existing models (such as OpenAI’s GPT-3) can be fine-tuned on internal datasets to improve accuracy in specific use cases. For example, a company developing an AI support agent might create a fine-tuned model by training GPT-3 on previous support tickets, with the goal of generating responses that are more sensitive to cultural differences between countries.
  • Live access to organizational records: More advanced architectures allow AI models to respond to prompts based on relevant files or documents (such as through the use of embeddings to encode and retrieve relevant text data). For example, a chatbot used for employee onboarding might be given access to company knowledge bases or internal wikis (a minimal sketch of this retrieval pattern follows this list).
  • Training new LLMs: A larger enterprise might choose to train its own generative AI model in order to better leverage proprietary data – for example, by training a coding assistant on its own codebase.
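
To make the retrieval pattern from the second bullet concrete, here is a minimal, self-contained sketch. The toy embed function (a bag-of-words vector), the sample documents, and the prompt template are illustrative stand-ins; a real deployment would call an embedding model and a vector store. The security-relevant point is the last step: whatever sits in the index gets injected into the prompt sent to the LLM.

```python
from collections import Counter
import math

# Toy bag-of-words "embedding", purely for illustration. A real system would
# call an embedding model and store the vectors in a vector database instead.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Internal documents the assistant is allowed to draw on.
documents = [
    "New hires must complete security training within 30 days.",
    "Expense reports are submitted through the finance portal.",
    "VPN access requires manager approval and MFA enrollment.",
]
index = [(doc, embed(doc)) for doc in documents]

def build_prompt(question: str, top_k: int = 2) -> str:
    q_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    context = "\n".join(doc for doc, _ in ranked[:top_k])
    # Whatever the index contains is injected into the prompt sent to the LLM,
    # which is exactly why sensitive records in the index become exposable.
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("How do I get VPN access?"))
```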

Risk 1: LLMs trained on sensitive data 

In all of the examples described above, the LLM incorporates proprietary company information in its responses. If sensitive records were contained in the data used for fine-tuning, training, or embedding, they are now part of the model.

This poses real risks to security and compliance. A malicious actor could extract the data through a ‘prompt attack’ that causes the model to reveal the sensitive records it was trained on (or has access to). The non-deterministic nature of model responses makes defending against such attacks extremely difficult. There could also be compliance complications, such as models violating data residency requirements.

Moreover, there’s no easy way to get a model to forget or delete data it’s been trained on. Once ‘contaminated’ data has gone into the model, the only real recourse is deleting the model completely and retraining a new one; this makes the cost of error exceedingly high.

Risk 2: Shadow data and shadow models

The second category of risks is more practical than technical: LLMs are the shiny new toy, and everyone is racing to find the killer use case and extract the most enterprise value from them. The rush to ship new features or products might lead teams to throw caution to the wind, resulting in sloppy data security practices.

For example, it’s easy to imagine an overeager development team duplicating a customer database to use for LLM training experiments – creating shadow data that’s vulnerable in its own right, in addition to ‘contaminating’ the AI model with sensitive information.

Furthermore, there are now many AI models besides the widely known ones from OpenAI – including proprietary offerings such as Google’s PaLM and Anthropic’s Claude, as well as open-source models such as Meta’s LLaMA. The proliferation of models, some safer than others, could result in another blind spot for security teams, who might struggle to understand which AI model is responsible for a given document or code snippet.

Addressing LLM Security Risks with Dig’s Cloud Data Security Platform

While the security challenges are real, they won’t (and shouldn’t) prevent enterprises from experimenting with LLMs and eventually deploying them in production. Dig’s Data Security Platform allows businesses to do so while maintaining visibility and control over the data being passed to AI models and preventing inadvertent data exposure during model training or deployment.

Here’s how companies can use Dig’s range of cloud data security capabilities to secure their LLM architectures:

Monitoring the data that’s going into a model: Dig’s DSPM scans every database and bucket in a company’s cloud accounts, detects and classifies sensitive data (PII, PCI, etc.), and shows which users and roles have access to the data. This can quickly reveal whether sensitive data is being used to train, fine tune, or inform the responses of AI models. Security teams can then ‘earmark’ models that are at higher risk of leaking sensitive information.
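
For readers who want a feel for what this class of scanning does conceptually, here is a deliberately simplified sketch (not Dig’s implementation or API): it walks one storage bucket with boto3 and flags objects matching a few naive regex detectors. The bucket name, the detectors, and the object-by-object read are illustrative assumptions; production classifiers are far more sophisticated, and agentless scanning works differently under the hood.

```python
import re
import boto3  # assumes AWS credentials are already configured

# Simplistic detectors, purely for illustration.
DETECTORS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def classify_bucket(bucket: str, max_objects: int = 100) -> dict:
    """Return a map of object key -> sensitive data types found in it."""
    s3 = boto3.client("s3")
    findings = {}
    resp = s3.list_objects_v2(Bucket=bucket, MaxKeys=max_objects)
    for obj in resp.get("Contents", []):
        body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
        text = body.decode("utf-8", errors="ignore")
        hits = [name for name, rx in DETECTORS.items() if rx.search(text)]
        if hits:
            findings[obj["Key"]] = hits
    return findings

# Any bucket that feeds model training or retrieval and shows up here
# deserves a closer look before the data reaches the model.
print(classify_bucket("training-data-staging"))  # hypothetical bucket name
```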

Detecting data-related AI risk early (before a model is trained): We’ve mentioned above that after AI models are trained, they’re essentially black boxes: there’s no surefire way to audit what data has gone into a model’s training corpus. This makes it nearly impossible to detect sensitive data that has already gone into a model, or to ‘fix’ a model after the sensitive data is already in it. Dig’s DDR allows you to nip the problem in the bud by identifying data flows that can result in downstream model risk – such as PII being moved into a bucket used for model training.
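
Dig’s DDR does this by monitoring real data flows in the cloud environment. Purely as a conceptual companion (not Dig’s mechanism), the sketch below shows the kind of gate a team could place in front of its own fine-tuning pipeline so that PII is redacted before a record ever reaches the training set; the record format and regex patterns are illustrative assumptions.

```python
import json
import re

# Illustrative PII patterns; a real pipeline would use proper classifiers.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\+?\d[\d\s().-]{8,}\d\b"),
}

def redact(text):
    """Replace PII with placeholders and report which types were found."""
    found = []
    for name, rx in PII_PATTERNS.items():
        if rx.search(text):
            found.append(name)
            text = rx.sub(f"<{name}>", text)
    return text, found

def gate_training_records(records):
    """Yield only sanitized records, flagging the ones that needed redaction."""
    for record in records:
        clean, found = redact(record["text"])
        if found:
            print(f"record contained {found}; storing a redacted copy instead")
        yield {**record, "text": clean}

# Hypothetical fine-tuning examples drawn from support tickets.
raw = [
    {"text": "Customer jane.doe@example.com cannot reset her password."},
    {"text": "The export job fails with error code 500."},
]
for row in gate_training_records(raw):
    print(json.dumps(row))
```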

Mapping all AI actors that have access to sensitive data: Dig’s data access governance capabilities can highlight AI models that have API access to organizational data stores, and which types of sensitive data this gives them access to.
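
How Dig builds this map is internal to the platform; as a rough conceptual illustration only, the sketch below walks AWS IAM roles, treats a hypothetical workload=ai tag as the marker for roles used by AI services, and uses attached policy names as a crude proxy for data-store access. Real access governance evaluates the actual policy documents and resource relationships.

```python
import boto3  # assumes AWS credentials are already configured

iam = boto3.client("iam")

def ai_roles_with_data_access():
    """Flag IAM roles marked as AI workloads that carry data-store policies.

    The 'workload=ai' tag and the policy-name heuristic are assumptions made
    for this sketch, not a recommended governance approach.
    """
    findings = []
    for role in iam.list_roles()["Roles"]:
        tags = iam.list_role_tags(RoleName=role["RoleName"])["Tags"]
        is_ai = any(t["Key"] == "workload" and t["Value"] == "ai" for t in tags)
        if not is_ai:
            continue
        attached = iam.list_attached_role_policies(RoleName=role["RoleName"])
        data_policies = [
            p["PolicyName"]
            for p in attached["AttachedPolicies"]
            if any(s in p["PolicyName"] for s in ("S3", "RDS", "DynamoDB"))
        ]
        if data_policies:
            findings.append((role["RoleName"], data_policies))
    return findings

for role_name, policies in ai_roles_with_data_access():
    print(f"{role_name} can reach data stores via: {policies}")
```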

Identifying shadow data and shadow models running on unmanaged cloud infrastructure: Dig’s agentless solution covers the entire cloud environment – including databases running on unmanaged VMs. Dig alerts security teams when sensitive data is stored in or moved into these databases, and will also detect when a VM is used to deploy an AI model or a vector database (which can store embeddings).
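
Again as a conceptual illustration rather than how Dig works: the heuristic below assumes it runs inside the VPC (so instance private IPs are reachable) and checks EC2 instances for the default ports of a few popular vector databases. A match is a lead worth investigating, not proof that a shadow model or vector store is running there.

```python
import socket
import boto3  # assumes AWS credentials are already configured

# Default ports of popular vector databases; a heuristic, not a guarantee.
VECTOR_DB_PORTS = {6333: "Qdrant", 8000: "Chroma", 8080: "Weaviate", 19530: "Milvus"}

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def find_candidate_vector_dbs():
    """List EC2 instances answering on well-known vector database ports."""
    ec2 = boto3.client("ec2")
    hits = []
    for reservation in ec2.describe_instances()["Reservations"]:
        for instance in reservation["Instances"]:
            ip = instance.get("PrivateIpAddress")
            if not ip:
                continue
            for port, product in VECTOR_DB_PORTS.items():
                if port_open(ip, port):
                    hits.append((instance["InstanceId"], product, port))
    return hits

for instance_id, product, port in find_candidate_vector_dbs():
    print(f"{instance_id} looks like it runs {product} on port {port}")
```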

Finding the Right Balance Between AI Innovation and Data Security

Sleeping on AI is a bad idea, but so is ignoring the very real data security risks. Luckily, it’s possible to continue exploring LLMs and generative AI while minimizing the risk of sensitive data exposure. Navigating this path successfully will be key to driving enterprise value from these exciting new technologies.

To find out whether your LLM deployment is putting your enterprise at risk, request a free risk assessment.
