In this post, I’m sharing how we’re developing our quality evaluation framework for AI-generated outputs at Tofu. As a team, we’re aware that consumers are still early in their journey of trusting LLM-generated output, and any successful generative AI application needs to nail quality evaluation to produce a best-in-class product. A couple of reasons why we want to #buildinpublic:
Source: Sequoia Capital, “Generative AI: Act Two”
1. Trust Must Be Established Early: For AI-first companies to reach their full potential, it’s crucial to gain user trust similar to that placed in human assistants. While most AI companies start with automating singular workflows, their biggest opportunities involve automating entire systems. This is only possible if trust is established early on.
For example, at Tofu, our vision begins with AI-assisted content for marketing and evolves toward a fully automated top-of-funnel workflow based on data-driven recommendations.
2. Openness Fosters Accountability and Learning: We’ve been testing quality for months before our launch out of stealth. As we rapidly introduce new features, we’re unwavering in prioritizing quality. Sharing our progress not only holds us accountable but also helps us learn best practices from our broader network.
A Glimpse into Personalization with Tofu
Before delving into our quality evaluation design, here’s a brief overview of what Tofu does.
We enable B2B marketing teams to generate on-brand, omnichannel personalized content, from emails to eBooks. Customers feed their brand collateral and target segment information into our proprietary Playbook, and Tofu crafts countless variations of tailored content from there.
As a simple example, I’ll walk you through how we personalize a generic Tofu landing page for an account, Heap.io, and a Persona (CMO).
In the Tofu app, we select the components of the original that we want to personalize by account and persona, and generate the output in our Factory.
As you can see, the output adds some personalized detail about the industry Heap is in (digital insights) as well as details relevant to the CMO role.
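To make the flow above concrete, here’s a toy sketch of how a component-level personalization prompt could be assembled from account and persona inputs. The function, field names, and wording are all illustrative assumptions, not how our production system actually works:

```python
# Toy sketch of assembling a personalization prompt from Playbook-style
# inputs. All names, fields, and wording here are hypothetical.
def build_prompt(component_text, account, persona, brand_notes):
    """Build an instruction for rewriting one landing-page component."""
    return (
        "Rewrite the following landing-page component so it speaks to "
        f"a {persona} at {account['name']} "
        f"(industry: {account['industry']}). "
        f"Stay on brand: {brand_notes}\n\n"
        f"Component:\n{component_text}"
    )

prompt = build_prompt(
    "Understand your users with powerful analytics.",
    {"name": "Heap.io", "industry": "digital insights"},
    "CMO",
    "confident, concise, no jargon",
)
print(prompt)
```

The key idea is that the same generic component fans out into many variations, one per (account, persona) pair selected in the Factory.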
Our ultimate goal is for marketers to confidently publish Tofu-generated content with minimal oversight.
Quality Evaluation Blueprint
Our CEO, EJ, outlined the foundational guidelines for our testing process a few months back. In the spirit of authenticity, the following key points are directly extracted from the original document:
Designing our Metrics and Scoring System
In our first iteration, we used a 10-point scale. Note that for every criterion besides Personalization, we stuck to a binary pass/fail metric.
We decided from the start that we would recruit third-party evaluators to eliminate bias from our team. We want our criteria and instructions to be simple enough that someone who has never heard of Tofu can follow them. We decided against a purely automated process because we want our human review to mirror that of a real marketing professional evaluating content that Tofu has generated for them.
Scoring Criteria — 10 points possible
We also added a column for Additional Comments, where Evaluators could note any errors that weren’t accounted for in our scoring criteria, or anything they were confused about. This feedback is extremely helpful in early test iterations for clarifying instructions, particularly because some of the criteria, like the ones relating to Alignment, are subjective.
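A minimal sketch of how a 10-point scorecard like this can be tallied is below. The criterion names and point values are illustrative assumptions, not our actual rubric; the structure follows the rule described above: every criterion except Personalization is binary (full points or zero).

```python
# Hypothetical 10-point scorecard: all criteria binary except
# Personalization. Names and point values are assumptions.
BINARY_CRITERIA = {            # criterion -> points awarded on a pass
    "grammar": 1,
    "repetition": 1,
    "format": 1,
    "alignment_original": 2,
    "alignment_brand": 2,
}
PERSONALIZATION_MAX = 3        # graded on a 0-3 scale, not binary

def score(evaluation: dict) -> int:
    """Sum one evaluator's marks into a 0-10 total."""
    total = sum(
        pts for name, pts in BINARY_CRITERIA.items()
        if evaluation.get(name, False)      # binary pass/fail
    )
    personalization = evaluation.get("personalization", 0)
    total += min(max(personalization, 0), PERSONALIZATION_MAX)
    return total

example = {
    "grammar": True, "repetition": True, "format": False,
    "alignment_original": True, "alignment_brand": True,
    "personalization": 2,
}
print(score(example))  # 1 + 1 + 0 + 2 + 2 + 2 = 8
```

Keeping most criteria binary makes the evaluator’s job a simple yes/no judgment, which is easier to explain to someone who has never seen Tofu.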
Running the Test
We trained a handful of evaluators on Upwork, a contractor marketplace. Here’s the exact first job description we posted.
Some of our best advice for the test design process:
Processing Results — First Pass
This was our first-ever quality test results dashboard. The goal was to show which content types and criteria our team should focus on improving, and to surface the quality gap between our live models (at the time, GPT-3.5 and GPT-4), given that the latter is much more expensive to run.
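The aggregation behind a dashboard like this can be sketched in a few lines: group evaluator rows by model, content type, and criterion, then average the points. The row schema and field names below are assumptions for illustration:

```python
from collections import defaultdict
from statistics import mean

# Illustrative evaluator results; each row is one mark from one
# evaluator. Field names and scores are assumed, not real data.
rows = [
    {"model": "gpt-3.5", "content": "landing_page",
     "criterion": "personalization", "points": 1},
    {"model": "gpt-3.5", "content": "landing_page",
     "criterion": "personalization", "points": 2},
    {"model": "gpt-4", "content": "landing_page",
     "criterion": "personalization", "points": 3},
    {"model": "gpt-4", "content": "email",
     "criterion": "format", "points": 1},
]

# Group by (model, content type, criterion) and average the points.
buckets = defaultdict(list)
for r in rows:
    buckets[(r["model"], r["content"], r["criterion"])].append(r["points"])

averages = {key: mean(vals) for key, vals in buckets.items()}
for key, avg in sorted(averages.items()):
    print(key, round(avg, 2))
```

Sorting the averages per criterion is what surfaces both the weakest content types and the GPT-3.5 vs. GPT-4 gap in one view.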
Iteration 1 — Scorecard for Launch
Leading up to our launch, we decided to prioritize a subset of the initial metrics (P0 Criteria below). We also modified the point scales for Alignment-Original, Repetition, Personalization, and Format to account for nuances and points of confusion that our Evaluators flagged the first time around.
Below are our current criteria (changes from our first iteration in bold):
P0 Criteria
P1 Criteria
P2 Criteria
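Once criteria are grouped into P0/P1/P2 tiers, one natural way to roll them up is a priority-weighted average. This is a sketch under assumed weights and tier assignments, not our actual formula:

```python
# Hypothetical priority weighting for an aggregate quality score.
# Tier weights and criterion-to-tier assignments are assumptions.
TIER_WEIGHTS = {"P0": 3.0, "P1": 2.0, "P2": 1.0}

CRITERION_TIER = {
    "personalization": "P0",
    "alignment_original": "P0",
    "repetition": "P1",
    "format": "P1",
    "grammar": "P2",
}

def weighted_score(scores: dict) -> float:
    """Combine per-criterion scores (each 0.0-1.0) into one weighted
    average, where P0 failures hurt the most."""
    num = sum(TIER_WEIGHTS[CRITERION_TIER[c]] * s for c, s in scores.items())
    den = sum(TIER_WEIGHTS[CRITERION_TIER[c]] for c in scores)
    return num / den

print(round(weighted_score({
    "personalization": 1.0, "alignment_original": 0.5,
    "repetition": 1.0, "format": 1.0, "grammar": 0.0,
}), 3))
```

The weighting makes the aggregate number track launch priorities: a miss on a P0 criterion drags the score down three times as hard as a miss on a P2.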
In addition to the changes in criteria, we added two more aggregate criteria:
Reviewing Results
What Comes Next
We’re always looking to refine our existing testing framework. Here are two approaches we’re excited to experiment with next.
We’d love to hear from you!
Whether you’re a fellow AI builder, an automated testing provider, or just have tips for us, we’d love to hear your thoughts. You can reach me at jacqueline@tofuhq.com.