Creating Synthetic Data Using Llama 3.1 405B | NVIDIA Technical Blog (2024)

Synthetic data isn’t about creating new information. It’s about transforming existing information to create different variants. For over a decade, synthetic data has been used to improve model accuracy across the board—whether it is transforming images to improve object detection models, strengthening fraudulent credit card detection, or improving BERT models for QA.

What’s new? With the advent of large language models (LLMs), both the motivation for generating synthetic data and the techniques for generating it have been supercharged.

Enterprises across industries are generating synthetic data to fine-tune foundation LLMs for various use cases, such as improving risk assessment in finance, optimizing supply chains in retail, improving customer service in telecom, and advancing patient care in healthcare.

Today, Meta released Llama 3.1 405B, their most powerful open LLM that can be used for both batch and online inference. It can also serve as a base to do specialized pretraining or fine-tuning for a specific domain. Given the size of the model and the amount of data it was trained on, it’s well-suited for synthetic data generation.

In this blog post, we’ll cover a few application cases for synthetic data generation and dive deep into one of them.

LLM-powered synthetic data for generative AI

Let’s take a look at a few of the high-level application cases for synthetic data in the generative AI space where you can use the Llama 3.1 405B to get started.

Using LLM-generated synthetic data to improve language models

There are broadly two approaches that are considered for generating synthetic data for tuning models—knowledge distillation and self-improvement.

Knowledge distillation is the process of translating the capabilities of a larger model into a smaller model. This isn’t possible by simply training both models on the same dataset as the smaller model may not “learn” the most accurate representation of the underlying data. In this case, we can use the larger model to solve a task and use that data to make the smaller model imitate the larger one.

Self-improvement involves using the same model to criticize its own reasoning and is often used to further hone the model’s capabilities. Both these approaches can be used to leverage the Llama 405B model to improve smaller LLMs.

Let’s take a look at a few ways how this can be achieved. Training an LLM involves a three-step process: pretraining, fine-tuning, and alignment.

Pretraining: This involves using an extremely large corpus of information to train the model on how the general structure of a language is organized. While for a generic LLM, this is typically done with Internet-scale data, for any domain-specific LLMs, we need to imbue the specifics on that domain (think LLM for geometry, LLM for radiology, and LLM for telco). This is called domain adaptive pretraining (DAPT). Another example of the application of synthetic data in the pretraining stage is the popular Phi-1.5 model, where a large model was used to synthesize data for imbuing logical reasoning at the pretraining stage.

Fine-tuning: Once the model is trained for general language structure, the next step is to fine-tune it for following specific instructions. For example, tuning the model to be better at reading comprehension-type extractive questions, improving logical reasoning, achieving better code generation, and function calling fall under this category. Self-Instruct, WizardCoder, Alpaca, and more employ these techniques to create task-specific fine-tuning data.Refer to this example for curating domain-specific data to learn more.

Alignment: Lastly, we want to ensure that the style and tone of an LLM response align with the user’s expectations, such as sounding conversational, having appropriate verbosity, complexity, coherence, and other user-defined attributes. This can be achieved by using a pipeline that has an instruct model and a reward model. The chat model creates multiple responses and the reward model gives feedback about the quality of the response. This technique falls under the umbrella of reinforcement learning from AI feedback (RLAIF). This notebook will walk you through how the new Llama 405B model together with the NVIDIA 340B Reward model can be used to generate synthetic data for model alignment.

Using LLM-generated synthetic data to improve other models and systems

Since the application space for synthetic data is vast, let’s focus this discussion on LLM-adjacent models and LLM-powered pipelines.

Retrieval-augmented generation (RAG) uses both an embedding model to retrieve the relevant information and an LLM to generate the answer. The embedding model generates a mathematical representation for the semantics of text. We can use LLMs to parse through underlying documents and synthesis data for both evaluating and fine-tuning the embedding model.

Synthetic data to evaluate RAG

To crystalize the discussion above, let’s think through a basic pipeline for one of the use cases discussed above—generating evaluation data for retrieval. Follow along with this notebook.

The primary challenges for curating data for evaluating a retrieval pipeline are:

Diversity: The questions shouldn’t focus on a single aspect of information or just have extractive questions.
Complexity: Generated questions should require some reasoning or multiple pieces of evidence to answer the question.

We’ll focus on diversity, but to explore the complexity angle—the key is to find chunks with overlapping points of information. A couple of approaches to finding overlapping information are calculating Jaccard similarity over sentence-level semantics and leveraging long context models to draw correlations across chunks from the same document.

Diversity stems from different perspectives. For example, consider the following passage.

The proposed acquisition of GreenTech Inc. by SolarPower Corporation stands as one of the most notable transactions in the renewable energy sector this year. Valued at $3 billion, the deal aims to combine GreenTech’s cutting-edge battery technology with SolarPower’s extensive solar panel manufacturing and distribution network. The anticipated operational synergies are expected to result in a 20% reduction in production costs and a 15% increase in revenue over the next two years. However, the transaction is under intense scrutiny from regulatory bodies due to potential antitrust concerns. The Federal Trade Commission (FTC) has indicated that the merger could potentially create a monopoly in the renewable energy storage market, potentially stifling competition and innovation.
SolarPower has committed to maintaining GreenTech’s research and development (R&D) center, which employs over 500 scientists and engineers, as an independent entity to preserve its innovative culture. Additionally, all existing employment contracts will be honored, alleviating concerns about potential layoffs. The merger agreement includes a $150 million breakup fee, payable to GreenTech if SolarPower fails to secure the necessary regulatory approvals, thereby mitigating financial risks for GreenTech should the deal fall through.
The agreement includes detailed representations and warranties, specifying the accuracy of financial statements, the absence of undisclosed liabilities, and compliance with applicable laws. It also entails a thorough indemnification process to protect both parties against potential breaches of these representations and warranties. SolarPower and GreenTech have agreed to covenants that restrict GreenTech from incurring new debt, issuing additional shares, or significantly altering business operations without SolarPower’s consent prior to the deal’s closure. These covenants are designed to preserve the value of GreenTech and ensure a smooth transition post-merger. The agreement further outlines a comprehensive due diligence process, including environmental assessments and audits of GreenTech’s intellectual property portfolio, to ensure all assets and liabilities are accurately accounted for before the finalization of the transaction.
The European Commission is also reviewing the merger to assess its impact on the EU market, particularly regarding competition and market dominance. This evaluation involves submitting detailed filings that include market analyses, competitive impact assessments, and economic justifications for the merger. The review process requires both companies to respond promptly to inquiries and provide comprehensive documentation. Additionally, to secure approval, SolarPower and GreenTech may need to make concessions, such as divesting certain business units or assets, to alleviate concerns about reduced competition. Ensuring compliance with the EU Merger Regulation involves not only addressing competitive effects but also ensuring that the merger aligns with broader EU policies on market fairness and consumer protection.

A financial analyst is interested in the financial performance of the two companies before and after the merger. Legal experts may be interested in the legal scrutiny the company has faced from the FTC, EU, and other parties. A journalist would be looking to understand the main points.

All these are valid viewpoints and user personas, and since they approach the same information with different points of view, an evaluation pipeline also needs to accommodate the same. So, let’s design a pipeline that takes in documents and personas and gives out questions in a tone that the persona will ask them in.

Creating Synthetic Data Using Llama 3.1 405B | NVIDIA Technical Blog (1)

Conceptually, this pipeline has three main steps as seen in Figure 1.

Step 1: Generate all possible questions, which would be of interest to the personas.
Step 2: Filter all the generated questions.
Step 3: Induce the persona’s writing style.

Step 1: Generating questions

Before diving into question generation, we need to ingest the document and create chunks out of it. For the rest of this discussion, let’s use Figure 1 as the reference chunk of text.

Creating Synthetic Data Using Llama 3.1 405B | NVIDIA Technical Blog (2)

User persona is just a description of the user who may be asking the question. Refer to the following examples.

Step 2: Filtering questions

Once the questions are generated, the next step is filtering and extracting the most useful subset. The first step is to deduplicate across all the questions that have been generated. We need a deduplication pass, as different points of interest can make use of adjacent points of information and spawn across overlapping questions.

Next, we use an LLM as a judge to determine the relevance of the question to the underlying passage. With this, we are trying to ensure that the question is completely answerable by the information present in the passage. This is followed up by rewriting all relevant questions to have a conversational tone. Lastly, we have another filter to categorize and filter out questions that may be too general.

Creating Synthetic Data Using Llama 3.1 405B | NVIDIA Technical Blog (3)

Step 3: Imbuing persona style

In the first two steps, we created and curated diverse questions. The final step is to imbue the writing style of the personas with all the questions.

Creating Synthetic Data Using Llama 3.1 405B | NVIDIA Technical Blog (4)

Using LLMs, we first formulate the writing styles from the given persona description. Then using these writing styles, the questions are re-written.

Writing style samples

Padma’s writing style is characterized by clarity, precision, and a formal tone. She writes in a direct and assertive manner, using simple and concise language to convey complex ideas. Her sentences are well-structured and logically connected, reflecting her analytical mind and attention to detail. She avoids using emotional language, personal opinions, or rhetorical flourishes, instead focusing on presenting facts and arguments in a clear and objective manner. Her writing is free of ambiguity and vagueness, with each point carefully supported by evidence and reasoning. The overall tone is professional and authoritative, commanding respect and attention from the reader. While her writing may not be engaging or persuasive in a creative sense, it is highly effective in conveying her message and achieving her goals in a corporate litigation context.

Aaron’s writing is marked by a lack of depth and analysis, often skimming the surface of complex issues. His sentences are short and simple, reflecting his limited proficiency in English. Despite his best efforts, errors in grammar, syntax, and word choice are common. To compensate for his lack of confidence, Aaron often resorts to sensationalism, exaggerating or distorting facts to make them more attention-grabbing. His tone is hesitant and uncertain as if he’s not quite sure of himself. Overall, Aaron’s writing style is more akin to a tabloid journalist than a serious news reporter.

At the end of this three-step pipeline, we end up with questions like:

In light of the prevailing regulatory framework, what additional policy directives is it likely that the proposed merger will need to conform to secure approval from the relevant authorities?
What specific aspects of the SolarPower and GreenTech merger are currently under review by the relevant regulatory authorities?
Will GreenTech’s brainiacs get the boot if the R&D center stays solo after the big buyout?

These questions have implicit ground-truth labels to their specific chunks and can thus be used for evaluating various retrieval pipelines. If you are interested in the granular details or want to learn how to improve and customize this pipeline for your use case, refer to this Jupyter Notebook.

Takeaways

Synthetic data generation is a critical workflow for enterprises to fuel their domain-specific generative AI applications. The new Llama 3.1 405B model, when paired with the NVIDIA Nemotron-4 340B reward model, generates synthetic data, enabling enterprises to build more accurate, domain-specific custom models.

RAG pipelines are critical for LLMs to generate grounded responses based on up-to-date information, and the accuracy of these responses depends on the quality of the pipeline. The synthetic data generation workflow described above can help evaluate the RAG for enterprises.

To get started with Llama 3.1 and NVIDIA Nemotron-4 models, visit ai.nvidia.com.