DETECTION OF MACHINE-GENERATED TEXT BY INTEGRATING ROBERTA EMBEDDINGS WITH TOPOLOGICAL FEATURES
Abstract
Language generation models have surged in popularity, driven by rapid advances in Artificial Intelligence (AI) and Natural Language Processing (NLP). As a result, distinguishing human-written text from machine-generated text has become increasingly difficult. The widespread availability of highly capable language models, and hence of machine-generated content, has heightened concerns about the spread of misinformation and of deceptive and plagiarized content. To address this challenge, the proposed approach combines the RoBERTa (Robustly Optimized BERT Pretraining Approach) model with TDA (Topological Data Analysis) features to build a model that can effectively distinguish human-written from machine-generated text. The idea is to capture the semantic differences between the two classes of text, as identified by RoBERTa, and to integrate them with the structural and geometric properties of the associated attention maps, as extracted by TDA, with the aim of a model that outperforms either approach used on its own. Such a model could serve as a valuable tool across various domains, including academia, by enabling the detection of AI-generated content and fostering a safer and more trustworthy digital environment.
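The fusion idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: persistent homology is replaced here by a much simpler stand-in (Betti numbers of the attention graph at a few fixed thresholds), and the function names, the threshold values, the toy attention matrix, and the toy embedding are all assumptions made for illustration. In practice the embedding and the attention maps would come from a RoBERTa model, and the fused vector would be passed to a classifier.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def betti_numbers(attention: Matrix, threshold: float) -> Tuple[int, int]:
    """Return (Betti-0, Betti-1) of the undirected graph that keeps edge
    (i, j) when max(attention[i][j], attention[j][i]) >= threshold."""
    n = len(attention)
    parent = list(range(n))  # union-find forest over the tokens

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edges = 0
    components = n
    for i in range(n):
        for j in range(i + 1, n):
            if max(attention[i][j], attention[j][i]) >= threshold:
                edges += 1
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_i] = root_j
                    components -= 1
    # For a graph, the number of independent cycles is E - V + C.
    cycles = edges - n + components
    return components, cycles


def topological_features(
    attention: Matrix,
    thresholds: Sequence[float] = (0.1, 0.25, 0.5),  # assumed values
) -> List[float]:
    """Concatenate (Betti-0, Betti-1) over all thresholds."""
    features: List[float] = []
    for t in thresholds:
        b0, b1 = betti_numbers(attention, t)
        features.extend([float(b0), float(b1)])
    return features


def fuse(
    embedding: Sequence[float],
    attention: Matrix,
    thresholds: Sequence[float] = (0.1, 0.25, 0.5),
) -> List[float]:
    """Join a sentence embedding with the topological features."""
    return list(embedding) + topological_features(attention, thresholds)


# Toy inputs standing in for real RoBERTa outputs (hypothetical values).
toy_attention = [
    [0.70, 0.20, 0.05, 0.05],
    [0.30, 0.40, 0.20, 0.10],
    [0.05, 0.30, 0.60, 0.05],
    [0.05, 0.05, 0.30, 0.60],
]
toy_embedding = [0.12, -0.40, 0.90]

fused = fuse(toy_embedding, toy_attention)
print(fused)
```

The threshold graph is used only because it keeps the sketch dependency-free; a dedicated TDA library would be needed to compute persistence diagrams over the full range of thresholds.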