In a recent study published in the journal Scientific Reports, researchers compared human and artificial intelligence (AI) chatbot creativity using the alternate uses task (AUT) to understand the current boundaries and potential of machine-generated creativity.
Study: Best humans still outperform artificial intelligence in a creative divergent thinking task.
Background
Generative AI tools such as Chat Generative Pre-Trained Transformer (ChatGPT) and MidJourney have stirred debates about their impact on jobs, education, and legal protections for AI-generated content. Creativity has historically been regarded as a uniquely human capacity, traditionally tied to originality and usefulness, but AI's emerging capabilities are now challenging and redefining this belief. Further research is therefore needed to understand the mechanisms underlying human and AI creativity and their implications for society, employment, ethics, and the shifting definition of human identity in the AI era.
About the study
In the present study, human AUT data were sourced from a previous research project that recruited native English speakers through the online platform Prolific. Of the 310 participants who began the study, 279 completed it; after attentiveness was screened with visual tasks, 256 were deemed diligent and included in the analysis. Their average age was 30.4 years, most were full-time employees or students, and they came primarily from the United States of America (USA), the United Kingdom (UK), Canada, and Ireland.
In 2023, three AI chatbots, ChatGPT3.5 (hereafter ChatGPT3), ChatGPT4, and Copy.Ai, were tested on specific dates, with each chatbot completing 11 separate sessions using the four object prompts. This yielded a sample large enough to detect differences when compared against the extensive human data.
For the AUT procedure, participants were presented with four objects: rope, box, pencil, and candle, and were advised to prioritize the originality of their answers over sheer volume. While humans were tested once per session, the AIs underwent 11 distinct sessions, with the instructions slightly modified to fit their design. The primary concern was keeping AI responses comparable to human answers, especially in length.
Before analysis, the responses were spell-checked and ambiguous short answers were discarded. Divergent thinking was gauged using the semantic distance between an object and its AUT response, computed with the SemDis platform. Potential bias in responses, particularly from AIs using jargon such as “Do It Yourself” (DIY), was addressed for consistency.
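SemDis combines several text-embedding models, but the underlying idea can be illustrated as a cosine-based distance between the prompt object and a response. The sketch below is only an illustration and does not reproduce SemDis's actual models or preprocessing; the choice of GloVe embeddings and the helper functions are assumptions made for this example.

```python
# Illustrative sketch of semantic distance scoring, not the SemDis pipeline itself.
import numpy as np
import gensim.downloader as api

# Assumed embedding space for illustration; SemDis uses its own set of models.
vectors = api.load("glove-wiki-gigaword-300")

def phrase_vector(text):
    """Average the word vectors of all in-vocabulary words in a phrase."""
    words = [w for w in text.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

def semantic_distance(obj, response):
    """1 minus the cosine similarity between the object and the response phrase."""
    a, b = phrase_vector(obj), phrase_vector(response)
    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cosine

# A typical use should sit closer to the object than a more remote, creative one.
print(semantic_distance("rope", "tie a knot"))
print(semantic_distance("rope", "weave a hammock"))
```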
The originality of the answers was evaluated by six human raters who were blinded to which responses were AI-generated and rated each answer's originality on a scale of 1 to 5. Clear guidelines were used to keep the ratings objective, and the raters' collective scores demonstrated high inter-rater reliability.
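The article does not specify which reliability statistic the authors used; Cronbach's alpha, computed over a responses-by-raters matrix, is shown below purely as one common way to quantify consistency across raters. The toy ratings are made up for illustration.

```python
# Minimal sketch: consistency of six raters scoring originality on a 1-5 scale.
import numpy as np

def cronbach_alpha(ratings):
    """ratings: 2-D array, rows = responses, columns = raters."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                          # number of raters
    rater_vars = ratings.var(axis=0, ddof=1)      # variance of each rater's scores
    total_var = ratings.sum(axis=1).var(ddof=1)   # variance of summed scores per response
    return (k / (k - 1)) * (1 - rater_vars.sum() / total_var)

# Toy example: 5 responses rated by 6 raters (values invented for illustration).
toy = [[3, 3, 4, 3, 3, 4],
       [1, 2, 1, 1, 2, 1],
       [5, 4, 5, 5, 4, 5],
       [2, 2, 3, 2, 2, 2],
       [4, 4, 4, 5, 4, 4]]
print(round(cronbach_alpha(toy), 2))
```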
Lastly, the data underwent statistical analyses to derive meaningful conclusions. Various models were employed to evaluate the scores, taking into account fixed effects such as group and object as well as potential covariates.
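A linear mixed-effect model of the kind described might be fit as in the sketch below. This is not the authors' exact specification: the column names (score, group, object, fluency, respondent_id) and the input file are assumptions made for illustration.

```python
# Minimal sketch of a linear mixed-effect model, assuming one row per respondent x object.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("aut_scores.csv")  # hypothetical file name

# Fixed effects: group (human vs. chatbot) and object; fluency as a covariate;
# a random intercept per respondent accounts for repeated measures.
model = smf.mixedlm("score ~ group + object + fluency",
                    data=df, groups=df["respondent_id"])
result = model.fit()
print(result.summary())
```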
Study results
The present study analyzed creative divergent thinking in humans and AI chatbots, focusing on their responses to different objects, and observed a moderate correlation between semantic distance and the human raters' subjective ratings. This suggested that while both scoring methods captured similar qualities, they were not identical, so the data were evaluated using both semantic distance and subjective ratings.
Using linear mixed-effect models for a broad comparison between humans and AI, a consistent pattern emerged: the AI chatbots generally outperformed humans, with higher mean and maximum semantic distance scores. Including fluency as a covariate decreased the mean scores but increased the max scores. The same trend appeared in the human subjective ratings of creativity, where AI again scored higher in both mean and max scores. Interestingly, while some human participants offered typical or even illogical uses of the objects, the AI chatbots consistently provided atypical yet logical uses, never scoring below a certain threshold.
The study delved deeper into comparing the responses of humans and individual AI chatbots to specific objects, namely a rope, box, pencil, and candle. The analyses showcased that ChatGPT3 and ChatGPT4, two of the AI models, outperformed humans in terms of mean semantic distance scores. However, when considering max scores, there was no statistically significant difference between human participants and the AI chatbots. It was also observed that responses to the rope were typically rated lower in terms of semantic distance than the other objects.
The human subjective ratings assessing creativity revealed that ChatGPT4 consistently received higher ratings than both humans and the other chatbots, showcasing its clear edge. However, this advantage was not observed when the chatbots were tasked with the object “pencil.” An interesting pattern emerged with the candle, as responses related to it generally received lower ratings compared to other objects. A standout observation was that two AI sessions, one from ChatGPT3 and the other from ChatGPT4, recorded max scores higher than any human for the object “box.”
The data underlined the impressive performance of AI chatbots, particularly ChatGPT4, in creative divergent thinking tasks compared to humans. However, it is worth noting that AI did not uniformly outpace humans across all metrics or objects, underscoring the complexities of creativity and the areas where humans still hold an advantage.