AI Humor Model Caption Generation for Images

Core Process Overview

Image from: arxiv.org

AI humor models generate captions for images through a complex process that mimics human cognitive and creative skills^[2]. This involves several key steps, often including visual detail extraction, humor ideation, narrative extrapolation, and caption ranking^[2]^[3]. By integrating creative, social, and cognitive skills, AI-generated humor aims to produce communication that resonates with people^[2].

Visual Detail Extraction

The initial stage involves a detailed analysis of the input image using visual language models (VLMs) such as GPT-4o^[3]. This component extracts key visual elements, including objects, human expressions, and background settings^[3]. The AI identifies the subject, main action, and background elements to build a foundation for humor^[3]. For example, in an image of a demolition site, the system identifies a large industrial excavator and a person spraying the site with a hose^[3].

Humor Ideation and Angle Selection

After extracting visual details, the system ideates on potential humorous elements^[3]. This involves identifying funny facial expressions or analogous elements within the image^[3]. The system analyzes the image and proposes humorous angles, considering both direct and indirect humorous aspects^[3]. For instance, the visual contrast between an excavator and a person might be interpreted as a David versus Goliath scenario, providing a foundational metaphor for generating humorous captions^[3].

Narrative and Conflict Extrapolation

To add depth and relatability, AI models often extrapolate narratives and conflicts that draw upon common and relatable experiences^[3]. This step connects the visual elements with broader life experiences, making the humor more accessible^[3]. The system chains together the results of the previous steps into a new prompt sent to GPT-4o^[3]. The prompt contains the visual details, the visual humor ideation, and a list of common Gen Z experiences, and the instruction to 'generate narratives that reflect the essence of the image that is set within the framework of the Gen Z experience'^[3]. These narratives are generated based on common experiences such as work, school, family, and relationships^[3]. For example, a demolition site image might generate narratives like 'Tackling student loans' or 'Group Project Disaster,' which are common among Gen Z^[3].

Caption Generation

In this stage, the system generates humorous captions using a fine-tuned language model^[3]. A fine-tuned version of GPT-3.5 trained on humorous Instagram comments is often employed^[3]. The captions are generated through two distinct strategies: focusing on the visual humor of the image and incorporating external narratives^[3]. Image-focused captions comment directly on the image content, while narrative-driven captions introduce external references to add humor^[3]. For example, an image-focused caption might be, "bro out here getting paid $8 an hour to spray some water on some bricks," while a narrative-driven caption could be, "The entitled bro you tried to make the group presentation with"^[3]. Caption generation is segmented into two separate prompts utilizing the fine-tuned GPT-3.5 model^[3].

Caption Ranking and Filtering

The generated captions are then ranked and filtered to select the most effective ones^[3]. A GPT-4o-based agent, fine-tuned to evaluate humor from a Gen Z perspective, assesses the captions based on humor, relatability, and alignment with the image and narrative^[3]. This agent filters out captions that do not meet the humor threshold, ensuring that only the most relevant and relatable captions are presented^[3]. For example, captions like 'Me mopping up my last relationship' might be favored over less relatable ones^[3].

Fine-Tuning and Training Data

Fine-tuning is crucial for tailoring the AI model to generate relevant and engaging humor^[3]. This involves training the model on datasets of humorous comments and captions^[3]. For example, a GPT-3.5 model can be fine-tuned using a dataset of humorous Instagram comments to better capture Gen Z humor^[3]. The quality and quantity of the training data play a significant role in the performance of the model^[1].

Specific AI Techniques

Several AI techniques are utilized in this process, including prompt engineering, fine-tuning, and chain-of-thought processing^[2]. Prompt engineering involves crafting prompts that clearly define the problem and expected output^[2]. Fine-tuning allows the model to learn specific patterns of a target output type^[2]. Implicit in this is the tone, the style, and the vocabulary expected in the humor^[2]. Chain-of-thought processing helps models by explicitly detailing the steps^[2]. Chains are used to separate stages of the humor generation process^[2]. An observation stage makes implicit information in images explicit, similar to the spirit of chain-of-thought and thought experiments^[2].

Utilizing User Preferences & Cultural Nuance

How to Write Comedy Using AI Writing Tools? — Image from: allaboutai.com

AI models can analyze user preferences, interests, and even their sense of humor to generate tailored jokes^[1]. One approach to achieving this personalization is by leveraging collaborative filtering techniques, which are commonly used in recommendation systems^[1]. This involves identifying users with similar tastes and recommending jokes that have been enjoyed by those users^[1]. By combining state-of-the-art AI techniques with a deep understanding of human psychology and humor, AI-generated comedy is revolutionizing the way we create and consume humor^[1]. Cultural context also plays a significant role with Wu et al. (2024) revealing significant differences in humor perception between Western and Eastern cultures^[3].

Challenges and Considerations

When Robots Make Us Laugh: The Emergence of AI-Generated Humor — Image from: riseoftherobots.ai

Can AI help humans be funnier? — Image from: acs.org.au

Despite advancements, AI-generated humor faces challenges, including the need for ethical considerations and the difficulties in replicating human-like social skills^[2]^[1]. Ensuring inclusive and non-offensive humor is critical, as AI models are trained on large datasets that may contain biased or offensive content^[1]. Intellectual property and joke ownership also become complex as AI-generated humor gains prominence^[1]. As AI models improve, they may have the potential to both disingenuously create human bonding and to augment human’s ability to bond, carrying the potential to change the nature of human trust and communication^[2].