Benefits of Artificial Intelligence: Leveraging GPT and Diffusion Models for Creating Images

The increasing presence of artificial intelligence (AI) in our world, particularly the recent advancements in natural language processing (NLP) and generative AI, has captured global attention for a good reason. These innovative technologies have the capacity to significantly improve our daily productivity across a wide range of tasks. For instance, GitHub Copilot empowers developers to code complete algorithms swiftly, OtterPilot automates the creation of meeting summaries for executives, and Mixo enables entrepreneurs to establish websites quickly.

This article will provide a concise explanation of generative AI, encompassing pertinent examples of AI technology. We’ll then move from theory to practice with a generative AI tutorial, where we will utilize GPT and diffusion models to craft artistic representations.

Six AI-generated images of the author, created using the techniques in this tutorial.

Generative AI: A Brief Introduction

Note: Readers already familiar with the technical underpinnings of generative AI can proceed directly to the tutorial.

The year 2022 witnessed the emergence of numerous foundation model applications, propelling AI advancements across diverse sectors. To better grasp the concept of a foundation model, let’s clarify some key terms:

  • Artificial intelligence is a broad term referring to any software designed to intelligently execute a specific task.
  • Machine learning is a subset of artificial intelligence employing algorithms that improve through data analysis.
  • A neural network falls under machine learning and utilizes interconnected nodes inspired by the structure of the human brain.
  • A deep neural network is a neural network with many layers of learnable parameters.

A foundation model is essentially a deep neural network trained on a vast dataset. To put it simply, it represents a highly effective type of AI capable of easily adapting to and performing a wide array of tasks. Foundation models are central to generative AI: Both text-generating language models like GPT and image-generating diffusion models fall under this category.

Text Generation: NLP Models

Within the realm of generative AI, natural language processing (NLP) models are trained to generate human-like text. [Large language models](https://blogs.nvidia.com/blog/2023/01/26/what-are-large-language-models-used-for/) (LLMs), in particular, play a crucial role in today’s AI systems. These models, distinguished by their ability to process massive datasets, excel at recognizing and generating text and other forms of content.

Practically speaking, these models can function as writing assistants or even coding aids. Applications of natural language processing include restating complex concepts simply, translating text, drafting legal documents, and even creating workout plans (although such implementations come with certain limitations).

[Lex](https://lex.page/) is a prime example of an NLP writing tool, offering a wide range of features, such as suggesting titles, completing sentences, and generating entire paragraphs based on a given topic. Currently, the most widely recognized LLM is GPT, developed by OpenAI. GPT can respond to almost any question or command in a matter of seconds with high accuracy, and OpenAI’s various models are available through a single API. Unlike Lex, GPT can also work with code, generating programming solutions from functional specifications and identifying code-related issues, thereby simplifying developers’ work.

Image Generation: AI Diffusion Models

A diffusion model, a type of deep neural network, possesses latent variables that enable it to learn the underlying structure of an image: during training, noise (i.e., random distortions) is progressively added to an image, and the network learns to reverse the process and remove it. Once a model’s network is trained to “understand” the abstract concept behind an image, it can create novel variations of that image. For instance, by removing the noise from a picture of a cat, the diffusion model effectively “sees” a clear image of the cat, learns its appearance, and applies this knowledge to generate new cat image variations.
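To make the noising idea concrete, here is a minimal NumPy sketch of the forward (noise-adding) step that diffusion models are trained to reverse. The function name and schedule value are illustrative assumptions for demonstration, not part of any particular library.

```python
import numpy as np

def forward_diffuse(x0, alpha_bar, rng):
    """Apply the closed-form forward (noising) process to an image x0:
    x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise
    return xt, noise  # during training, the network learns to predict `noise` from `xt`

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))  # stand-in for a real image
noisy, noise = forward_diffuse(image, alpha_bar=0.5, rng=rng)
```

Sampling then runs this process in reverse: starting from pure noise, the trained network removes a little predicted noise at each step until a clean image emerges.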

Diffusion models have various applications, such as denoising or sharpening images (enhancing their clarity and detail), manipulating facial expressions, or generating face-aging images to visualize a person’s potential appearance over time. The Lexica search engine provides a glimpse into the impressive capabilities of these AI models in generating new images.

Putting It Into Practice: Diffusion Model and GPT Implementation Tutorial

To demonstrate the implementation and utilization of these technologies, we’ll delve into a hands-on exercise where we generate anime-style images using a HuggingFace diffusion model and GPT. This approach requires no intricate infrastructure or software setup. We’ll work with a pre-built model (i.e., one that is pre-trained) that we only need to fine-tune.

Note: This article aims to guide you on how to employ generative AI image and language models for creating high-quality images of yourself in diverse artistic styles. The information provided should not be misused to generate deepfakes that violate Google Colaboratory’s terms of use.

Setting Up and Photo Requirements

Before commencing the tutorial, please register for:

  • A Google account, to use Drive and Colab.
  • An OpenAI account, to make GPT API calls.

You’ll need a minimum of 20 photos of yourself—more photos enhance performance—saved on the device you intend to use. For optimal results, ensure your photos meet the following criteria:

  • Resolution of at least 512 x 512 pixels.
  • Feature only you, with no other individuals present.
  • Share a consistent file extension.
  • Capture a variety of angles.
  • Include a minimum of three to five full-body shots and two to three midbody shots; the rest can be facial close-ups.

It’s worth noting that the photos needn’t be flawless—in fact, observing how deviations from these recommendations impact the output can be insightful.
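If you’d like a quick sanity check before uploading, a script along these lines can flag photos that miss the criteria above. It’s a hypothetical helper (the function name and checks are ours, not part of the tutorial notebook), and it covers only the criteria that can be verified programmatically: photo count, resolution, and a consistent file extension.

```python
from pathlib import Path

MIN_SIDE = 512  # minimum width/height in pixels

def check_photos(photos):
    """photos: list of (filename, width, height) tuples.
    Returns a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if len(photos) < 20:
        problems.append(f"only {len(photos)} photos; at least 20 recommended")
    extensions = {Path(name).suffix.lower() for name, _, _ in photos}
    if len(extensions) > 1:
        problems.append(f"mixed file extensions: {sorted(extensions)}")
    for name, w, h in photos:
        if w < MIN_SIDE or h < MIN_SIDE:
            problems.append(f"{name}: {w}x{h} is below {MIN_SIDE}x{MIN_SIDE}")
    return problems

issues = check_photos([("me1.jpg", 1024, 768), ("me2.png", 400, 400)])
```

Angle variety and framing (full-body versus close-up) still need a human eye, so treat the script as a first pass, not a substitute for reviewing the set yourself.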

Generating AI Images Using the HuggingFace Diffusion Model

To begin, access the accompanying Google Colab notebook for this tutorial, which contains the necessary code.

  1. Execute cell 1 to link Colab with your Google Drive, enabling you to store the model and save the generated images.
  2. Run cell 2 to install the required dependencies.
  3. Execute cell 3 to download the HuggingFace model.
  4. In cell 4, input “How I Look” in the Session_Name field, and then run the cell. The session name typically represents the concept the model will learn.
  5. Run cell 5 to upload your photos.
  6. Proceed to cell 6 to train the model. You can retrain the model multiple times by selecting the Resume_Training option before running the cell. (This step might take approximately an hour to complete.)
  7. Finally, execute cell 7 to test your model and observe it in action. The system will generate a URL, directing you to an interface for image creation. After entering a prompt, click the Generate button to render the images.
A screenshot of the model’s user interface with many configurations, an input text box, a “generate” button, and an output of an animated character.
The User Interface for Image Generation

With a functional model, we can now experiment with various prompts to produce different visual styles (e.g., “me as an animated character” or “me as an impressionist painting”). However, leveraging GPT for character prompts is more effective as it provides greater detail compared to user-generated prompts, thereby maximizing our model’s potential.

Enhancing Diffusion Model Prompts Using GPT

We’ll incorporate GPT into our pipeline through OpenAI, although platforms like Cohere offer similar functionality for our purpose. Begin by registering on the OpenAI platform and obtaining your API key. Next, in the Colab notebook’s “Generating good prompts” section, install the OpenAI library:

pip install openai

Afterward, load the library and input your API key:

import openai
openai.api_key = "YOUR_API_KEY"

We will now generate optimized prompts from GPT to create our image in an anime character style, replacing YOUR_SESSION_NAME with “How I Look,” the session name defined in cell 4 of the notebook:

ASKING_TO_GPT = 'Write a prompt to feed a diffusion model to generate beautiful images '\
                'of YOUR_SESSION_NAME styled as an anime character.' 
response = openai.Completion.create(model="text-davinci-003", prompt=ASKING_TO_GPT,
                                    temperature=0, max_tokens=1000)
print(response["choices"][0].text)

The temperature parameter, ranging from 0 to 2, controls the randomness of the output. Values near 0 make responses focused and deterministic, while values close to 2 encourage more varied, creative results. The max_tokens parameter caps the length of the generated text; as a rule of thumb, one token corresponds to roughly four characters, or about three-quarters of an English word.
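If you want to size max_tokens for a target response length, you can start from the rule of thumb that one token is about three-quarters of a word. The helper below is a rough, illustrative estimate only; exact counts depend on the model’s tokenizer.

```python
def estimate_max_tokens(target_words, words_per_token=0.75, margin=1.2):
    """Estimate a max_tokens budget for a completion of ~target_words words,
    padded by a safety margin. A heuristic, not an exact tokenizer count."""
    return int(target_words / words_per_token * margin)

budget = estimate_max_tokens(150)  # ≈ 150 / 0.75 * 1.2 = 240
```

Leaving some headroom is cheap, since you are billed for tokens actually generated, not for the cap itself.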

In my case, the GPT model output is:

"Juan is styled as an anime character, with large, expressive eyes and a small, delicate mouth.
His hair is spiked up and back, and he wears a simple, yet stylish, outfit. He is the perfect
example of a hero, and he always manages to look his best, no matter the situation."

Finally, by feeding this text as input to the diffusion model, we obtain our final result:

Six AI-generated images of the author, refined with GPT-generated prompts.

By utilizing GPT to generate diffusion model prompts, you can bypass the need to meticulously describe the intricacies of an anime character’s appearance—GPT will handle that for you. Feel free to fine-tune the prompt to your liking. With this tutorial completed, you are equipped to create intricate and creative images of yourself or any concept that piques your interest.

Embracing the Accessibility of AI Advantages

GPT and diffusion models represent two fundamental implementations of modern AI. We’ve explored their individual applications and witnessed how their combined power, using GPT’s output as input for the diffusion model, amplifies their capabilities. This synergy creates a pipeline of two foundation models that enhance each other’s usability.
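At its core, that pipeline is just function composition: the language model writes the prompt, and the diffusion model renders it. The sketch below expresses the chaining with injected callables so the real OpenAI and diffusion calls from the tutorial can be slotted in; all names here are illustrative, not part of either library’s API.

```python
def styled_image_pipeline(write_prompt, render_image, session_name, style):
    """Chain a language model (write_prompt) into a diffusion model (render_image).
    Both stages are passed in as callables, keeping the pipeline backend-agnostic."""
    request = (f"Write a prompt to feed a diffusion model to generate beautiful "
               f"images of {session_name} styled as {style}.")
    diffusion_prompt = write_prompt(request)
    return render_image(diffusion_prompt)

# Example with stand-in stages; replace them with the GPT call and your
# fine-tuned model from the tutorial.
result = styled_image_pipeline(
    write_prompt=lambda req: f"detailed prompt for: {req}",
    render_image=lambda prompt: f"<image rendered from '{prompt}'>",
    session_name="How I Look",
    style="an anime character",
)
```

Structuring the stages this way makes it easy to swap either model, e.g., trying a different LLM for prompt writing without touching the image-generation code.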

These AI technologies are poised to have a profound impact on our lives. Many experts predict that large language models will drastically affect the labor market across various professions, automating specific tasks and reshaping existing roles. While the future remains uncertain, one thing is clear: Early adopters who harness the power of NLP and generative AI to optimize their work will have a distinct advantage over those who don’t.

The editorial team of the Toptal Engineering Blog expresses its sincere gratitude to Federico Albanese for his invaluable review of the code samples and technical content presented in this article.

Licensed under CC BY-NC-SA 4.0