The Paradox of Guidelines

Can artificial intelligence replace static guidelines with a generative system that works at scale?

Written by Mike Creighton, Director of AI Research & Development

One of the primary things we do at Instrument is create guidelines. Brand guidelines, campaign guidelines, design system guidelines. There’s every kind of guideline! But how effectively do brands use these? How does someone who isn’t “a creative” or a brand strategist interpret those guidelines when selecting photography or illustration?

There are a lot of challenges when it comes to implementing and adhering to brand guidelines. But time-to-market schedules and the sheer quantity and quality of output expected aren’t easing up. If anything, they’re becoming more demanding.

Many companies — the big incumbents and startups alike — have been rising to these challenges by creating products and services to address them. And over the last year, we’ve started to see generative AI employed in ways that could make these types of solutions much more effective at scale.

But how would it actually work?

Time to Experiment

While I’m sure there’s a tremendous amount of research, sophistication, and proprietary technology behind those solution providers’ implementations, I was curious how far we could get with just off-the-shelf models.

Intuitively, I know that large language models (LLMs) can evaluate written language for adherence to things like brand voice and tone. Even without fine-tuning, sufficient prompting and examples can get most modern LLMs to evaluate copy against brand guidelines successfully.

But I wanted to test a hypothesis: that we can use a state-of-the-art multi-modal LLM to determine whether a visual asset adheres to brand guidelines.

In case you don’t know what “multi-modal LLMs” are, they’re basically language models that have other types of inputs — other modalities — besides text. Google’s Gemini Pro, OpenAI’s GPT-4 Turbo, and Anthropic’s Claude 3 are all multi-modal. They can “see.” Give them an image, and they understand what’s in it. It’s pretty wild. Want to read more? Check out this article.

So now that LLMs can see, can they help us solve the brand consistency problem?

The Approach

I realized over lunch that this was probably something that could be prototyped in about an hour, strictly with some simple Python and the OpenAI API. We’ll specifically use OpenAI’s most recent state-of-the-art model: GPT-4 Turbo.

In case you don’t know, working with an API for an LLM is very different from working with something like ChatGPT. When you use the API, you’re effectively scripting the conversation with the LLM, not chatting with it. And you’re “priming” it with what’s called a “system message.” This is where you describe its role, behavior, and what it’s meant to do.

So, we’d use the system message to tell GPT-4 Turbo (the assistant) that it’s basically the brand police and that it evaluates images against user-supplied brand guidelines. We’d add a user message providing those guidelines, add an assistant message acknowledging that it’s received them, and then add a final user message that has the image to be evaluated.
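As a rough sketch, here’s what that scripted conversation might look like with the OpenAI Python SDK. The SYSTEM_PROMPT and GUIDELINES_MD constants are placeholders for the prompts sketched in the sections below, and the file name and message wording are my own illustrative stand-ins, not the exact prototype:

```python
import base64

# Encode an image file as base64 so it can be passed inline as a data
# URL, which is how the chat completions API accepts local images.
def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

messages = [
    # System message: the assistant's role and behavior (the "brand police").
    {"role": "system", "content": SYSTEM_PROMPT},
    # User message carrying the brand guidelines.
    {"role": "user", "content": GUIDELINES_MD},
    # Scripted assistant acknowledgment of the guidelines.
    {"role": "assistant",
     "content": "Understood. Send me an image and I'll evaluate it against these guidelines."},
    # Final user message carrying the image to be evaluated.
    {"role": "user",
     "content": [
         {"type": "text",
          "text": "Evaluate this image against the brand color guidelines."},
         {"type": "image_url",
          "image_url": {"url": f"data:image/png;base64,{encode_image('candidate.png')}"}},
     ]},
]
```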

More simply, if this were turned into an app (or even just a single function in code), here’s what would happen conceptually:

  1. User supplies an image
  2. The AI tells you whether or not it’s on-brand (with rationale and a score)

Simple.
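In code, that conceptual app collapses to a single function along these lines. This is a minimal sketch that consumes the messages list from above; the model name follows the article’s choice of GPT-4 Turbo, while max_tokens and the free-text response handling are assumptions (in a real system you’d likely want structured output):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def evaluate_image(messages: list) -> str:
    """Send the scripted conversation to GPT-4 Turbo and return its evaluation."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=messages,
        max_tokens=500,
    )
    # The reply contains the on-brand verdict, a score, and the rationale,
    # in whatever format the system prompt asked for.
    return response.choices[0].message.content

print(evaluate_image(messages))
```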

LLM Whispering

Most of the effort with this type of prototype really goes into “prompt engineering”: crafting the system prompt and user prompts that make up the fictional conversation. And believe it or not, prompt engineering is actually a thing (at least as of April 2024; as models advance, it will become less necessary). Understanding how LLMs work, along with all their quirks and limitations, is crucial for getting them to do what you want them to do. It’s only when you’re able to write effective prompts that the true power and magic of LLMs are exposed. But LLMs are weird, and effective prompts can come in the strangest of forms, though some best practices are emerging. Ethan Mollick covers both sides of that spectrum in a recent article.

The Prompts

I want to take a moment to focus on what these prompts look like, so you can get a sense of what it takes to get a large language model to do some pretty remarkable stuff. Here’s a quick look at the general format of the system prompt:

  • What role the AI is playing
  • What it’s supposed to do
  • What its inputs are
  • What its outputs are

Please note: This is what I came up with just to get the idea out of my head so I could test the hypothesis quickly. After re-reading it, I’d probably reorder a few things for the sake of clarity. I’d also be more precise in defining the format of its outputs.

First, we needed to prompt the LLM with the role it’s playing in this exercise and what it’s supposed to do for us.
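The original prompt isn’t reproduced here, but following the four-part format above, a hypothetical reconstruction might read something like this. Every line below is illustrative, not Instrument’s actual prompt:

```python
# Hypothetical system prompt following the role / task / inputs / outputs format.
SYSTEM_PROMPT = """\
# Role
You are a brand compliance reviewer: strict but constructive, the "brand police."

# Task
Evaluate images the user sends against the brand guidelines supplied
earlier in the conversation.

# Inputs
- Brand guidelines, written in Markdown.
- Example images that are known to be on-brand.
- A candidate image to evaluate.

# Outputs
For each candidate image, respond with:
1. A verdict: ON-BRAND or OFF-BRAND.
2. A score from 1 to 10 for adherence to the color guidelines.
3. A short rationale, plus concrete suggestions for improvement.
"""
```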

The First User Prompts

We’ve created a fictional SaaS brand (called Zephyr) with fictional brand color guidelines for this exercise. This is written in Markdown syntax since GPT-4 Turbo handles that type of formatting well.

We aimed to be as descriptive and accessible as possible with our brand color guidelines.
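The actual Zephyr guidelines aren’t shown here either, so as a stand-in, here’s the kind of Markdown document they describe, stored as the GUIDELINES_MD constant used earlier. The palette names, hex values, and rules are all invented for illustration:

```python
# Invented, illustrative color guidelines for the fictional Zephyr brand.
GUIDELINES_MD = """\
# Zephyr Brand Color Guidelines

## Primary Palette
- **Zephyr Blue** (#2B6CF6): the dominant color. It should anchor every composition.
- **Cloud White** (#F7F9FC): the primary background color.

## Accent Palette
- **Signal Coral** (#FF6B5E): used sparingly, for emphasis only.

## Rules
- Imagery should feel cool, airy, and light, dominated by blues and whites.
- Warm colors may appear only as small accents, never as the dominant tone.
- Avoid saturated greens, purples, and dark, moody palettes.
"""
```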

Here are the 4 images that we append to the user prompt for the LLM to reference as “good” examples.

(More examples would make this system more effective.)
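Mechanically, appending those reference images just means adding more image parts to a user message in the scripted conversation. A sketch, reusing encode_image() from above and assuming the four examples live in local files with hypothetical names:

```python
# Build a user message that carries the four "good" reference images.
example_paths = ["example1.png", "example2.png", "example3.png", "example4.png"]

example_message = {
    "role": "user",
    "content": [
        {"type": "text",
         "text": "Here are four images that are known to be on-brand. Use them as reference points."},
        # One image part per reference image.
        *[
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{encode_image(p)}"}}
            for p in example_paths
        ],
    ],
}

# Insert the examples before the final user message that carries the
# candidate image (index 3 in the messages list built earlier).
messages.insert(3, example_message)
```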

The Results

I ran a number of images through this AI-powered image evaluator. All of these images were created with image generation models (like Midjourney and DALL-E), but some were art directed to be more “on brand” and some were intended to deviate.

One big thing to note: this is only intended to evaluate images against the COLOR guidelines, though the model does tend to comment on the content of the images as well. Further iteration on the prompts would likely minimize this.

The Learnings

I sort of thought this might work, but that it would take some poking and prodding to really know whether this hypothesis could be proven with “off-the-shelf” technology. Meaning: to figure out whether this could be done without access to fine-tuning (where you further train a model on custom data to get a good result).

Even with such a simple first-pass prototype, I can pretty confidently say that this is possible. More samples need to be tested, but the added discovery here is that the LLM can give you a detailed rationale for why an image might not be on-brand and how you might improve it.

We could iterate on the prompts further to give a user more specific and actionable advice for ensuring a given image is on-brand. Furthermore, we could tell the LLM to provide this advice in a way that’s intuitive and accessible for a “non-creative” person. The unlock here is that both agency partners and the brand’s employees can get up to speed and start executing effectively thanks to instant feedback. This empowers everyone to take the right actions to keep assets on-brand.

Looking Ahead

As we think about the guidelines we meticulously create for our clients, and about the challenges their organizations face in implementing them, this little experiment represents a possible evolution in how we approach guidelines. And in how we think about our own service offerings.

No longer do guidelines need to be a static Figma doc. Guidelines can be a holistic system — a system that’s supported and made enforceable by this very weird technology. Sufficient examples and supporting descriptions of the rules will be what activates these systems. Yes, that means creating guidelines will become a unique form of prompt engineering, suitable for LLM intake and interpretation. This doesn’t just serve AI; it aids every human reviewing the guidelines, because it leaves things less open to interpretation. And it still exists in plain language.

There are many ways we can extend this system to cover more than color guidelines. It’s a matter of some additional prompting and providing the LLM with more examples. It would mean more time to process each image, but a couple of minutes of processing time would yield so much more efficiency, autonomy, and consistency for a company at large.

This approach could be further extended to include copywriting guidelines, ensuring that voice, tone, and length are adhered to. As I said at the top, LLMs are really good at spotting those kinds of inconsistencies.

All in all, this approach of using off-the-shelf models is surprisingly promising. There’s certainly more experimentation to do, since my gut says that certain types of evaluations simply won’t be possible with this technology. That’s really where I expect the companies focused on this problem space to excel.

But the benefit of a system that gets you 80% of the way there — even without full evaluation coverage — is that it still brings more consistency to output, streamlines the review process, and gives everyone creating brand content more confidence. Moreover, I think there’s merit to a more bespoke system that can be customized to a given brand’s workflow and infrastructure rather than the other way around. This is an area that we’ll continue pushing into.


In Summary

Generative AI has a ton of potential to address the current challenges of brand guidelines. By prompting multi-modal large language models, we can turn static visual guidelines into holistic, AI-powered systems that evaluate content for adherence and provide detailed feedback. The result is a streamlined review process that’s customized to a brand, yielding improved consistency and efficiency that can outperform a one-size-fits-all solution.

Want to learn more about what AI can do for your business?
Get in touch.
