Getting Serious about Humor: Crafting Humor Datasets with Unfunny Large Language Models

Published:02/23/2024

Analysis

~14 min read · 19,477 charsThis analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

1. Bibliographic Information

1.1. Title

The central topic of the paper is the use of Large Language Models (LLMs) to generate synthetic datasets for humor detection. Specifically, it investigates the ability of LLMs to edit existing humorous texts to make them non-humorous (a process termed "unfunning"), and evaluates the quality of this synthetic data for training humor classifiers.

1.2. Authors

The authors are Zachary Horvitz, Jingru Chen, Rahul Aditya, Harshvardhan Srivastava, Robert West, Zhou Yu, and Kathleen McKeown.

  • Affiliations: Most authors (1^{1}) are affiliated with Columbia University. Robert West (2^{2}) is affiliated with EPFL (École Polytechnique Fédérale de Lausanne).
  • Research Backgrounds: The authors are associated with natural language processing (NLP) and computational linguistics research groups. Kathleen McKeown is a prominent figure in NLP, particularly in text generation and summarization.

1.3. Journal/Conference

The paper is currently available as a preprint on arXiv (arXiv:2403.00794v2). It has not yet been published in a peer-reviewed journal or conference proceedings based on the provided text. However, the subject matter fits well within top-tier NLP conferences like ACL (Association for Computational Linguistics) or EMNLP (Empirical Methods in Natural Language Processing).

1.4. Publication Year

The paper was posted on arXiv in 2024 (specifically February 23, 2024, according to the timestamp).

1.5. Abstract

The paper addresses the challenge of humor detection, which is complicated by the scarcity of datasets containing paired humorous and non-humorous texts. The authors investigate whether LLMs can generate such synthetic data by editing texts. They benchmark LLMs on an existing human dataset (the Unfun Corpus) and demonstrate that current LLMs have an impressive ability to "unfun" jokes (remove humor), as validated by human judgment and downstream humor detection tasks. Furthermore, they extend this approach to a code-mixed English-Hindi dataset, finding that GPT-4's synthetic data is highly rated by bilingual annotators and creates challenging adversarial examples for classifiers.

The paper is available as a preprint at: https://arxiv.org/abs/2403.00794v2 The PDF link is: https://arxiv.org/pdf/2403.00794v2

2. Executive Summary

2.1. Background & Motivation

Humor is a complex and nuanced aspect of human communication that remains difficult for Artificial Intelligence (AI) to master. While recent advances in NLP have improved many language tasks, humor detection and generation remain significant hurdles. A major bottleneck is the lack of high-quality datasets that pair a humorous text with a semantically similar but non-humorous version (aligned data). Existing datasets like the "Unfun Corpus" are valuable but limited in size due to the high cost and effort required for human annotation. The core problem is the scarcity of these aligned datasets, which are crucial for training models to distinguish between what is funny and what is serious, independent of the topic or vocabulary. The paper's entry point is the observation that while humans (and LLMs) struggle to create humor from scratch, humans are relatively good at editing humor to make it serious. The authors hypothesize that LLMs might share this asymmetrical capability: being "unfunny" might be easier for them than being funny.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  1. Benchmarking LLMs on "Unfunning": The authors evaluate several LLMs (GPT-4, GPT-3.5, Mistral) on their ability to edit satirical headlines into serious news headlines.
  2. Demonstrating Asymmetry: They find that LLMs are significantly better at removing humor ("unfunning") than creating it. GPT-4, in particular, can outperform human crowd-workers in generating convincing "unfunned" text.
  3. Synthetic Data Utility: They show that humor classifiers trained on this synthetic "unfunned" data perform very well, nearly matching the performance of models trained on human-edited data.
  4. Generalization to Code-Mixed Languages: They successfully apply the method to a code-mixed English-Hindi dataset, proving that the "unfunning" capability generalizes across languages and domains.
  5. Adversarial Examples: The synthetic data serves as a challenging test set that reveals existing humor classifiers rely on superficial features rather than deep semantic understanding.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, one must grasp several key concepts:

  • Large Language Models (LLMs): These are deep learning models, typically based on the Transformer architecture, trained on vast amounts of text data. They learn to predict the next word in a sequence, allowing them to generate coherent text, answer questions, and perform various language tasks.
  • Few-Shot Learning (In-Context Learning): A technique where an LLM is provided with a few examples (shots) in the input prompt to guide its behavior without updating the model's internal weights. For example, giving the model three pairs of (satire headline -> serious headline) before asking it to convert a fourth one.
  • Incongruity Theory: A dominant theory in humor psychology suggesting that humor arises from the violation of expectations or the presence of incongruity (a mismatch between what is expected and what occurs). The paper references this to explain why replacing low-probability (surprising) tokens with high-probability (expected) tokens might remove humor.
  • Code-Mixing: The practice of alternating between two or more languages in a single conversation or utterance (e.g., using Hindi and English together in the same sentence). This presents unique challenges for NLP models.
  • Perplexity: A measurement of how well a probability model predicts a sample. Lower perplexity indicates the text is more predictable (less surprising) to the model. Humorous text often has higher perplexity because it contains unexpected twists.

3.2. Previous Works

The paper builds upon and references several key studies:

  • The Unfun Corpus (West and Horvitz, 2019): This is the foundational dataset for the study. It was created by a human game where players edited The Onion headlines to make them serious. The paper uses this as the gold standard for benchmarking LLM performance.
  • Humor Generation Limitations: The authors cite works like Jentzsch and Kersting (2023) and Veselovsky et al. (2023), which established that LLMs struggle to generate novel, funny jokes.
  • Asymmetry in Synthetic Data (Josifoski et al., 2023): Previous research noted that generating data by editing existing text (e.g., for information extraction) is often more effective than generating from scratch. This paper applies that insight to humor.

3.3. Technological Evolution

Historically, computational humor relied on hand-crafted linguistic features or simple statistical models. With the advent of Deep Learning and Transformers, the field shifted toward using representations from models like BERT or RoBERTa. Recently, the focus has moved to generative LLMs (like GPT-4). This paper fits into the latest trend: exploring whether the generative power of massive LLMs can be harnessed to solve the data scarcity problem that plagues the specific sub-field of computational humor.

3.4. Differentiation Analysis

Unlike previous works that focused on generating humor (which LLMs fail at) or detecting humor using static datasets, this paper focuses on the inverse operation: removing humor. The core innovation is leveraging the "asymmetry of difficulty"—it is easier to destroy structure (humor) than to create it. By focusing on "unfunning," the authors unlock a scalable method for creating aligned training data that was previously impossible to generate synthetically.

4. Methodology

4.1. Principles

The core principle of the method is "Unfunning". Instead of asking the LLM to "write a joke," the authors prompt it to "edit this text to make it serious." The theoretical intuition is that humor relies on incongruity (surprise). LLMs, trained via Maximum Likelihood Estimation (MLE), are optimized to predict the most probable, "normal" next word. Therefore, when asked to edit a text, they naturally tend to replace the incongruous, surprising elements of a joke with high-probability, sensible alternatives, effectively stripping away the humor.

4.2. Core Methodology In-depth (Layer by Layer)

The methodology consists of two main approaches: a Generative approach using LLMs and a Lightweight Masked approach using RoBERTa.

4.2.1. Generative "Unfunning" with LLMs

This approach uses the generative capabilities of models like GPT-4 and GPT-3.5.

  1. Prompt Construction: The authors use a few-shot setting. They construct a prompt containing a task description (e.g., "You are a helpful assistant that edits humorous headlines to make them realistic") and a set of input-output exemplar pairs.

  2. Exemplar Selection: To encourage diversity, the input-output pairs (satire headline -> serious headline) are sampled from a subset of the human Unfun dataset that was rated as high quality.

  3. Inference: The target satirical headline is fed into the model. The model autoregressively generates the "unfunned" version token by token.

    The following figure (Figure 1 from the original paper) illustrates the input and output of this process:

    img-0.jpeg 该图像是示意图,展示了通过 GPT-4 模型将幽默文本转化为非幽默文本的过程。图中给出了三个示例,其中每个幽默句子及其相应的非幽默版本进行了匹配,展示了模型在生成幽默检测数据方面的能力。

4.2.2. Lightweight "Unfunning" with ROBERTA-SWAP

This approach is a non-generative, rule-based baseline motivated by the Incongruity Theory. It assumes that humor is associated with low-probability tokens (high surprise). To remove humor, we should replace these low-probability tokens with the highest-probability tokens predicted by a language model.

The algorithm operates iteratively for kk steps (where k=3k=3 in the experiments):

Step 1: Masking and Prediction For every token position ii in the headline, the algorithm temporarily masks (hides) that token. It then uses a pre-trained RoBERTa model to predict the most likely token for that position given the context.

The prediction for the replacement token x^i\hat{x}_i is calculated as: x^i=argmaxxP(xxi,θR o B E R T a) \hat {x} _ {i} = \arg \max _ {x} P (x \mid x _ {\neq i}, \theta_ {\text {R o B E R T a}}) Where:

  • x^i\hat{x}_i: The predicted replacement token at position ii.
  • xx: A candidate token from the vocabulary.
  • P(xxi,θRoBERTa)P(x \mid x_{\neq i}, \theta_{\text{RoBERTa}}): The probability of token xx appearing at position ii, given all other tokens xix_{\neq i} in the sequence and the model parameters θ\theta.
  • xix_{\neq i}: The sequence of all tokens in the headline except the one at position ii.
  • θRoBERTa\theta_{\text{RoBERTa}}: The parameters (weights) of the RoBERTa model.

Step 2: Selecting the Swap Position The algorithm must decide which position to edit. It selects the position where the "surprise" factor is highest—specifically, where the ratio of the predicted token's probability to the original token's probability is the largest. This identifies the most incongruous word.

The swap position is determined by the following formula: swapposition=argmaxiP(x^ixi,θRoBERTa)P(xixi,θRoBERTa) swap position = \arg \max_{i}\left|\frac{P(\hat{x}_{i}\mid x_{\neq i},\theta_{\mathrm{RoBERTa}})}{P(x_{i}\mid x_{\neq i},\theta_{\mathrm{RoBERTa}}})\right| Where:

  • swap position: The index ii of the token selected for replacement.
  • P(x^i)P(\hat{x}_{i}\mid \dots): The probability assigned by RoBERTa to the predicted best token.
  • P(xi)P(x_{i}\mid \dots): The probability assigned by RoBERTa to the original token currently in the text.
  • The ratio P(x^i)P(xi)\frac{P(\hat{x}_{i})}{P(x_{i})} represents how much more "normal" the predicted token is compared to the original. A high ratio suggests the original token was very unlikely (surprising), and thus a candidate for removal.

Step 3: Swapping and Repeating Once the position is identified, the original token xix_i is replaced with the predicted token x^i\hat{x}_i. This process is repeated kk times (3 times in the paper) to progressively remove incongruity from the sentence.

4.2.3. Extension to English-Hindi Code-Mixed Data

The authors generalize the approach to a code-mixed English-Hindi dataset (Khandelwal et al., 2018).

  1. Zero-Shot Initialization: Since no aligned pairs exist for this dataset, they first generate 50 examples in a zero-shot setting (no examples in prompt) to create a small seed set.
  2. Few-Shot Generation: They select 9 high-quality results from the seed set to use as in-context examples for the main generation task.
  3. Filtering: Because the task is harder in a code-mixed setting, they add a filtering step. They prompt GPT-4 to classify its own outputs as "Humorous" or "Non-Humorous." Any output still classified as humorous is discarded.

5. Experimental Setup

5.1. Datasets

The experiments utilize two primary datasets:

  1. The Unfun Corpus:

    • Source: Created by West and Horvitz (2019) via a crowdsourced game.
    • Content: Pairs of satirical headlines from The Onion and their serious counterparts edited by humans.
    • Scale: The paper uses a filtered subset of 11,831 pairs, split into training (3,882), development (186), and test (375) sets.
    • Example:
      • Satire: "Scientists Discover Delicious New Species"
      • Serious: "Scientists Discover New Species"
    • Why Chosen: It provides the gold standard for aligned humor data and allows direct comparison between LLMs and human editors.
  2. English-Hindi Code-Mixed Tweets:

    • Source: Khandelwal et al. (2018).
    • Content: Tweets annotated as humorous or non-humorous, containing a mix of English and Hindi (and often Hinglish).
    • Scale: 2,951 available samples.
    • Example:
      • Humorous: "Husbands should be like Vim bar, gale kam aur chale zyada." (Husbands should be like Vim bar, less talk and more work.)
      • Non-Humorous: "Hrithik Roshan is using Vodafone."
    • Why Chosen: To test the generalizability of the "unfunning" capability to different languages and more informal, noisy text domains.

5.2. Evaluation Metrics

The paper employs a mix of automatic and human evaluation metrics.

5.2.1. Accuracy (Automatic)

  • Conceptual Definition: The percentage of correct predictions made by a classifier. In this context, it measures how well a binary humor classifier (trained on synthetic data) can distinguish between real satire and serious news on a holdout test set.
  • Mathematical Formula: Accuracy=TP+TNTP+TN+FP+FN \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
  • Symbol Explanation:
    • TP: True Positives (Humor correctly identified as humor).
    • TN: True Negatives (Non-humor correctly identified as non-humor).
    • FP: False Positives (Non-humor incorrectly identified as humor).
    • FN: False Negatives (Humor incorrectly identified as non-humor).

5.2.2. Human Ratings (Realness, Funniness, Grammar, Coherence)

  • Conceptual Definition: Annotators rate the outputs on categorical scales.
    • Realness: Is the text a convincing real news headline?
    • Funniness: Is the text funny? (0: No, 1: Slightly, 2: Funny).
    • Grammar/Coherence: Is the text well-formed and logical?
  • Mathematical Formula: These are discrete categorical labels, aggregated by majority vote. No continuous formula is used, but the percentage of annotators agreeing on a label is reported.

5.2.3. Edit Distance

  • Conceptual Definition: A measure of how dissimilar two strings are. It counts the minimum number of operations (insertions, deletions, or substitutions) required to transform one string into the other. Here, it measures how much the LLM altered the original joke.
  • Mathematical Formula (Levenshtein Distance): Lev(a,b)={aif b=0bif a=0Lev(a,b)if a[1]=b[1]1+min(Lev(a,b),Lev(a,b),Lev(a,b))otherwise \text{Lev}(a, b) = \begin{cases} |a| & \text{if } |b| = 0 \\ |b| & \text{if } |a| = 0 \\ \text{Lev}(a', b') & \text{if } a[-1] = b[-1] \\ 1 + \min(\text{Lev}(a, b'), \text{Lev}(a', b), \text{Lev}(a', b')) & \text{otherwise} \end{cases}
  • Symbol Explanation:
    • a, b: The two strings being compared.
    • a|a|: The length of string aa.
    • aa': String aa without its last character.
    • bb': String bb without its last character.

5.2.4. Type-Token Ratio (TTR)

  • Conceptual Definition: A measure of lexical diversity. It is the ratio of unique words (types) to the total number of words (tokens).
  • Mathematical Formula: TTR=Number of Unique TypesTotal Number of Tokens \text{TTR} = \frac{\text{Number of Unique Types}}{\text{Total Number of Tokens}}
  • Symbol Explanation:
    • Types: Distinct vocabulary items in the text.
    • Tokens: Total words in the text.

5.3. Baselines

The paper compares several methods:

  1. Human Players: The original crowd-sourced editors from the Unfun dataset.
  2. Real News Headlines: Using unrelated, actual serious news headlines as negative examples (non-aligned data).
  3. ROBERTA-SWAP: The lightweight token-swapping baseline described in the methodology.
  4. LLMs (Mistral, GPT-3.5, GPT-4): The primary models being evaluated for their "unfunning" capability.
  5. Synthetic Humor: The same LLMs prompted to add humor to serious headlines (the reverse direction), to demonstrate the asymmetry.

6. Results & Analysis

6.1. Core Results Analysis

The experiments yield several key findings:

  1. Synthetic Unfuns are High Quality: Classifiers trained on GPT-4's synthetic "unfunned" data achieved a holdout accuracy of 76.5% (Mistral classifier) and 69.9% (RoBERTa classifier). This is very close to the performance of models trained on human-edited data (80.3% and 72.7%).
  2. Asymmetry Confirmed: LLMs are much better at removing humor than creating it. Classifiers trained on LLM-generated humor performed significantly worse (Δ<10%\Delta < -10\%) than those trained on LLM-generated unfuns.
  3. Human Validation: In human evaluations, GPT-4's "unfunned" headlines were rated as "Real" news 49% of the time, significantly outperforming human players (33%) and rivaling actual news headlines (81%). They were also rated as highly grammatical and coherent.
  4. English-Hindi Success: The approach generalized well. GPT-4's unfunned tweets were rated as non-humorous (84% after filtering) and coherent. They acted as strong adversarial examples; a classifier trained on the original dataset failed miserably (22.6% accuracy) on the synthetic unfuns, indicating it had learned superficial features rather than true humor.

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

Direction Source Data Characteristics Holdout Accuracy
Diversity (TTR) Edit Dist MISTRAL ROBERTA
Unfun ROBERTA-SWAP 0.262 2.7 69.9 (0.9) 62.7 (0.7)
MISTRAL 0.257 2.1 70.7 (0.7) 61.7 (0.3)
MISTRAL INSTRUCT 0.255 2.4 70.9 (0.7) 64.7 (0.5)
GPT-3.5 0.259 4.5 72.9 (0.2) 65.9 (0.4)
GPT-4 0.252 3.8 76.5 (0.2) 69.9 (0.5)
News Headlines 0.306 - 66.3 (0.2) 64.1 (0.2)
Unfun Players 0.271 2.9 80.3 (0.5) 72.7 (0.4)
Humor MISTRAL 0.244 2.8 66.3 (0.7) 56.3 (0.4)
MISTRAL INSTRUCT 0.221 4.5 65.2 (0.8) 58.8 (0.4)
GPT-3.5 0.24 4.6 69.9 (0.5) 58.7 (0.4)
GPT-4 0.246 5.5 69.5 (0.7) 59.7 (0.6)
The Onion 0.262 - - -

The following are the results from Table 2 of the original paper:

Direction Source Rated Real Slightly Funny / Funny Grammatical Coherence
Slightly Funny Funny
Unfun ROBERTA-SWAP 30% 15% 5% 93% 86%
MISTRAL INSTRUCT 21% 50% 14% 100% 96%
GPT-3.5 51% 23% 3% 100% 98%
GPT-4 49% 21% 3% 100% 99%
News Headlines 81% 2% 0% 99% 93%
Human Players 33% 21% 7% 94% 92%
Humor MISTRAL INSTRUCT 21% 34% 9% 99% 93%
GPT-3.5 11% 54% 8% 100% 94%
GPT-4 10% 45% 10% 100% 98%
The Onion 4% 68% 24% 99% 97%

The following are the results from Table 3 of the original paper:

Source Edit Dist Humor Coherence
Non-Humor - 16.8% 92.8%
GPT-4 Unfuns 6.6 16.0% 93.6%
+ GPT-4 Filter 6.9 3.6% 89.3%
Humor - 48.0% 93.6%

The following are the results from Table 4 of the original paper:

Source Unfuns Original Dataset
Balanced Accuracy Humor Non-Humor
Original 22.6 (3.7) 67.9 (0.9) 80.3 (3.5) 56.9 (5.1)
(25%) Synth Unfuns 34.0 (8.4) 67.7 (1.7) 78.4 (3.3) 55.4 (5.9)
(50%) Synth Unfuns 57.7 (6.0) 62.1 (0.6) 68.4 (5.7) 55.9 (4.7)

6.3. Ablation Studies / Parameter Analysis

The paper does not perform a traditional hyperparameter ablation (like varying learning rates or layer counts) for the LLMs themselves, as they are used as black-box APIs. However, it does perform an ablation on the data composition:

  • Synthetic vs. Real Data: It compares training on "Synthetic Unfun + Original Satire" vs. "Human Unfun + Synthetic Satire." This effectively ablates the direction of generation (unfun vs. joke writing), proving that "unfunning" is the superior direction for data synthesis.
  • Filtering (Hindi-English): For the code-mixed dataset, the authors analyze the effect of filtering. Unfiltered GPT-4 outputs had a humor rating of 16.0% (comparable to real non-humor tweets which were rated 16.8% humorous due to subjectivity). After filtering with GPT-4's own classifier classifier, the humor rating dropped to 3.6%, significantly improving quality at the cost of sample size.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper demonstrates that while LLMs are not inherently funny, they are exceptionally good at being "serious." By prompting models like GPT-4 to "unfun" existing jokes, researchers can generate high-quality, aligned datasets of (humorous, non-humorous) pairs. This synthetic data is nearly as effective as human-curated data for training humor classifiers. Furthermore, this capability generalizes to complex, code-mixed languages like English-Hindi, providing a novel way to create datasets for low-resource linguistic scenarios.

7.2. Limitations & Future Work

The authors identify several limitations:

  1. Cultural Scope: The experiments are limited to American satire and English-Hindi code-mixed tweets. Humor is deeply cultural, and the authors suggest future work should investigate other cultures and languages.
  2. Subjectivity: Humor evaluation is inherently subjective. Disagreements among annotators (as seen in the Hindi-English evaluation where 48% agreement is low) pose a challenge for robust evaluation.
  3. Data Contamination: There is a risk that LLMs may have memorized parts of the Unfun corpus during training. The authors checked for this and found minimal overlap, but it remains a general concern for LLM benchmarks.
  4. Ethical Risks: The authors note that while their method helps identify offensive humor, it could theoretically be misused to create pairs of (offensive, safe) text, which could then be used to train models to generate offensive content by reversing the process.

7.3. Personal Insights & Critique

This paper offers a clever and practical solution to the "data scarcity" problem in NLP. Instead of trying to force AI to be creative (a hard problem), it leverages AI's "conservatism" (its tendency to predict probable, normal text) to generate useful training data.

  • Innovation: The insight that "unfunning" is the inverse of humor generation and that LLMs are asymmetrically skilled at the former is profound. It shifts the paradigm from "generating creativity" to "curating normality."
  • Utility: The method is highly practical. It does not require training new models or complex architectures; it simply requires clever prompting of existing GPT-4 access. This makes it accessible to the wider research community immediately.
  • Adversarial Value: The finding that existing classifiers fail on synthetic unfuns is a significant contribution in itself. It serves as a "stress test" for humor detection models, revealing that they likely rely on spurious correlations (like specific keywords or sentence structures) rather than semantic understanding.
  • Potential Improvement: Future work could explore the "filtering" mechanism more deeply. Since the paper uses GPT-4 to filter GPT-4, there might be a bias. Using a distinct model or a human-in-the-loop filter for the seed data might improve the quality further. Additionally, applying this to other nuanced tasks like sarcasm detection or sentiment analysis (e.g., editing a positive review to be negative while keeping the content) seems like a promising avenue.