Safeguarding Enterprise Adoption of LLMs
Governing the unpredictability and inconsistency of LLMs will be key to enterprise adoption.
Overview
The WSJ published a piece a few weeks ago about how enterprises are “seeking out ChatGPT tech for searching and analyzing” their own data. The fascination with ‘ChatGPT for X’ stems from the unlock of intelligent question answering that large language models uniquely enable. Whether it’s conversing with codebases, legal and financial documents, or internal data, stakeholders are able to access and understand information much faster than ever before.
However, amidst the outward enthusiasm, numerous CIOs have expressed concerns about the lack of guardrails around working with LLMs, which is ultimately gating true enterprise adoption. Why are guardrails even needed?
LLMs are inherently non-deterministic, and their outputs and actions can be unpredictable and inconsistent. Without a better story around model governance, overall adoption of “Generative AI” will be slower than expected, even though enterprises stand to benefit from large efficiency gains.
In this piece, I talk about two critical areas for improvement in the governance story:
Data privacy and the importance of role-based access control (RBAC) as a primitive.
The need for factuality, consistency and fairness around LLM outputs.
The Absence of RBAC as a Primitive
One of the big outcomes of this LLM-based AI wave is that enterprises’ unstructured data (PDFs, documents, text, etc.) starts to become very useful. Data that was previously underutilized can now be queried and interacted with in all sorts of ways, thanks to language models and a chat interface. Consider an internal ChatGPT product for legal or HR data, where you can ask questions like “provide information on [legal case name]” or “what is the summary of employee X’s past three performance reviews?” Tremendously useful in one sense.
However, it would be careless to simply “fine-tune ChatGPT” on internal data and release it to users, because in the enterprise setting not every user should have access to the outputs of specific prompts. Business documents contain tons of sensitive information that needs to be subject to granular role-based access controls.
A random employee shouldn’t be able to ask an internal model to summarize another employee’s compensation. What about the model spitting out PII to an unauthorized user? A legal intern shouldn’t be privy to information in all legal cases within the company’s corpus.
The way enterprises are handling this data privacy problem today is by excluding entire swathes of data from fine-tuning, which is a band-aid solution. In the world of structured data, we already have the powerful concept of row-level access control on databases. What is the analogy in the LLM world?
Well, in order for enterprise data to be useful for LLM chat applications, it first needs to be converted to vector embeddings. For the non-technical reader, vectors are the key bridge between language and LLMs. Below, I briefly talk about how I believe RBAC should play a role in the I/O workflow.
Inputs:
Imagine a confidential document that needs to be ingested by an internal ChatGPT application. Either the document itself or subsections of it contain information that should be under role-based access control (example: a company’s financial contracts). Vector databases like Pinecone let you define access control at a project level (i.e. a collection of vectors), which is helpful but not granular enough for this need, especially if the underlying model is global and the application is broadly available to all internal users.
The missing component here is the ability to easily map portions of data within documents, or the document itself, to an RBAC policy, and then have that mapping preserved at the embedding level within the vector store of choice.
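To make this concrete, here’s a minimal sketch of what the ingestion side could look like. The embedding function and vector store client are generic stand-ins rather than any particular vendor’s API, and the `allowed_roles` metadata field is a hypothetical convention, not an existing standard.

```python
# Sketch: tag document chunks with an RBAC policy before they are embedded and upserted.
# embed() and vector_store.upsert() stand in for whatever embedding model and vector
# database are actually in use; the "allowed_roles" metadata schema is hypothetical.
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str
    text: str
    allowed_roles: list[str]  # roles permitted to retrieve this chunk


def ingest(chunks: list[Chunk], embed, vector_store):
    for chunk in chunks:
        vector = embed(chunk.text)  # e.g. an embedding-model API call
        vector_store.upsert(
            id=chunk.chunk_id,
            vector=vector,
            metadata={"allowed_roles": chunk.allowed_roles, "text": chunk.text},
        )


# Example: the compensation section of an HR document is restricted to HR admins,
# while the rest of the document is visible to all employees.
chunks = [
    Chunk("doc-42-sec-1", "Company holiday calendar...", ["all_employees"]),
    Chunk("doc-42-sec-2", "Employee X compensation details...", ["hr_admin"]),
]
```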
Outputs:
Subsequently, when users prompt the application to retrieve information, there needs to be a service that inspects the vectors retrieved at inference time and checks whether the user’s ID is allowed to access those vectors per the RBAC policy.
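A minimal sketch of that inference-time check is below, reusing the hypothetical `allowed_roles` metadata from the ingestion sketch above; the role lookup, vector store, and model calls are all stand-ins.

```python
# Sketch: after the vector store returns the top-k matches for a prompt, drop any
# match the requesting user's roles don't cover before the text ever reaches the LLM.
def filter_by_rbac(matches: list[dict], user_roles: set[str]) -> list[dict]:
    """Keep only matches whose allowed_roles intersect the user's roles."""
    permitted = []
    for match in matches:
        allowed = set(match["metadata"].get("allowed_roles", []))
        if allowed & user_roles:
            permitted.append(match)
    return permitted


def answer(prompt: str, user_id: str, embed, vector_store, llm, get_roles) -> str:
    user_roles = get_roles(user_id)                # e.g. a lookup against the identity provider
    matches = vector_store.query(vector=embed(prompt), top_k=10)
    matches = filter_by_rbac(matches, user_roles)  # enforce RBAC before generation
    context = "\n\n".join(m["metadata"]["text"] for m in matches)
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {prompt}")
```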
Startups like LlamaIndex, Vectara, Glean and Credal are well positioned to add this primitive to their offerings, which, if done right, should assuage CIO concerns around data privacy. The vector database companies (Pinecone, Weaviate, etc.) will certainly be able to introduce vector-level permissions; however, they’ll also have to be creative with the workflow they provide to inspect and permission vectors after the fact, so that admins understand what underlying data each vector represents.
Extending to write permissions in the context of RPA:
Many projects and companies have emerged showing that language models can be very effective at RPA (robotic process automation). They can be used to orchestrate API calls or browser-based actions to complete tasks that would otherwise require manual, human labor. While this is a very exciting direction for software, LLM-based RPA systems that perform actions across enterprise systems also need to be paired with granular access control to ensure safe and secure task execution.
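As a rough illustration, the sketch below gates each action an agent proposes behind an explicit permissions table before anything is executed; the action names and the permissions mapping are entirely hypothetical.

```python
# Sketch: check an LLM-proposed action against the user's write permissions before
# executing it. The actions, roles, and tools dict are placeholders for illustration.
WRITE_PERMISSIONS = {
    "update_crm_record": {"sales_ops"},
    "issue_refund": {"finance_admin"},
    "send_email": {"all_employees"},
}


def execute_action(action: str, args: dict, user_roles: set[str], tools: dict):
    allowed_roles = WRITE_PERMISSIONS.get(action)
    if allowed_roles is None or not (allowed_roles & user_roles):
        raise PermissionError(f"User is not permitted to perform '{action}'")
    return tools[action](**args)  # only runs if the RBAC check passes
```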
Taming Hallucinations, Inconsistencies & Bias
While better access control mechanisms will help solve data privacy concerns, another equally important side of the governance story is the unpredictability of model outputs. This unpredictability is an artifact of LLMs’ inherent non-determinism, and it rears its head in the form of hallucinations, inconsistencies and bias.
Hallucinations
It’s well documented that one of the fundamental issues with LLMs is that they hallucinate, i.e. generate authoritative-sounding outputs that lack factuality. While a regular consumer may be willing to overlook this “bug” in exchange for the other benefits that LLM applications can provide, the stakes are much, much higher for enterprises.
The NYT recently published an article on a high-profile incident in the legal world where a lawyer, working alongside ChatGPT, generated legitimate-sounding but ultimately “bogus” legal cases. He faces sanctions for his actions, with some constituents of the legal industry questioning whether ChatGPT should be allowed at all unless it stops hallucinating.
Combating model hallucinations is one of the most interesting areas of inquiry at the moment, and there are different schools of thought on how it should be addressed: more investment in training data curation, pairing models with external data sources that can be incorporated at inference time, and training domain-specific models that work in unison to resolve prompts.
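To illustrate the second approach, here’s a minimal sketch of incorporating retrieved passages at inference time and asking the model to cite them, which makes unsupported answers easier to catch. The `retrieve` and `llm` calls are stand-ins for whatever search index and model API are actually in use, and the prompt wording is just one possible choice.

```python
# Sketch: ground the model in retrieved passages and require citations,
# so answers that aren't supported by the enterprise's own data stand out.
def grounded_answer(question: str, retrieve, llm) -> str:
    passages = retrieve(question, top_k=5)  # e.g. a vector-store or keyword search
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using ONLY the numbered passages below. "
        "Cite the passage number(s) you relied on. "
        "If the passages do not contain the answer, say 'I don't know.'\n\n"
        f"Passages:\n{numbered}\n\nQuestion: {question}"
    )
    return llm(prompt)
```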
Given this backdrop, LLM-based products that crack the interplay between determinism and the reasoning capabilities of models, packaged in well-thought-out UX, will do well.
Inconsistent and Biased Outputs
Let’s say we want to leverage LLMs to build data pipelines or create automated business workflows. One of the other issues that emerges is the lack of structure around model outputs, which makes them hard for downstream programs and services to consume. We might not always want a blob of text from a model. Sometimes we might want a number, a list, XML, JSON, or more generally output in a very specific format that is consistent and can be operated upon by downstream business logic.
Imagine a database with customer feedback on a product and we want to create a new column that captures the sentiment of each row of feedback. A business may want to do this to quickly filter for “negative” reviews that it can then send to the product team. If we threw an LLM as-is at the problem and asked it to capture the sentiment of each row, it may say “positive” for one example and “the provided feedback is generally positive” for another.
This inconsistency makes the output useless to a business analyst, even though it’s a genuine unlock that LLMs can understand and summarize potentially thousands of rows of free-form customer feedback. Ultimately, what’s needed is the ability to coax the model to adhere to strict guidelines specified by business users. Projects like Guardrails, PredictionGuard and ReLLM are providing the tools to do exactly that.
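As a rough illustration of the kind of constraint these tools enforce (not their actual APIs), the sketch below restricts the model to a fixed label set, validates the raw output, and retries with a corrective instruction when it doesn’t conform. The `llm` call is a stand-in for a real model API, and the prompt wording and retry count are arbitrary choices.

```python
# Sketch: coerce free-form model output into a strict label that downstream
# business logic (e.g. a "filter for negative reviews" query) can rely on.
ALLOWED_LABELS = {"positive", "negative", "neutral"}


def classify_sentiment(feedback: str, llm, max_retries: int = 2) -> str:
    prompt = (
        "Classify the sentiment of the customer feedback below. "
        "Respond with exactly one word: positive, negative, or neutral.\n\n"
        f"Feedback: {feedback}"
    )
    for _ in range(max_retries + 1):
        raw = llm(prompt).strip().lower()
        if raw in ALLOWED_LABELS:
            return raw  # conforms to the schema downstream logic expects
        prompt += f"\n\nYour previous answer ('{raw}') was invalid. Reply with one word only."
    raise ValueError("Model failed to produce a valid label")
```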
Lastly, it’s critical for enterprises to be able to detect and systematically prevent biased and harmful outputs generated by models or applications they’ve deployed in production. Without the tools to do so, enterprises will be slower to buy customer-facing LLM applications, as the potential liability costs outweigh any benefits. Companies such as Woolly and Credo are helping tackle this piece of the puzzle.
Closing Thoughts
In the same way SOC 2 compliance is a must for any B2B product, I wonder if a separate standard will be coined and made a requirement for third-party, LLM-native applications. The pace of business adoption and usage will be slower than preferred until we have a clearer picture of that compliance standard and how to serve it without fully hamstringing the reasoning and creative abilities of LLMs.
Thanks for reading! If you have any comments or thoughts, feel free to tweet at me and I’d be happy to discuss. If you’re generally interested in chatting about the wonderful world of AI, you can email me at aditya@kleinerperkins.com
Special thanks to Dan Robinson, Ethan Chan, Amin Ahmad, Nicolos Ouporov, Alex Mckenzie and Alex George for encouraging me to publish this piece.