# AI-generated-text-detection
The objective of this project was to distinguish AI-generated text from human-written text. The problem statement is as follows: can you predict `ind` (0 = human, 1 = AI) as a function of the 768 document-embedding dimensions, the word count, and the punctuation count?
All text data was converted into embeddings, and this README details the outcomes of that work, along with observations that arose during the project.
Five critical observations emerged from the study on improving classification model performance:
1. Impact of Feature Selection Methods on the F1 Score: Feature selection methods such as Random Forest importance ranking, PCA, Boruta shadow variables, and K-best lowered the F1 score, because the reduced feature set lost information and could not capture the full trend of the dataset. Conversely, using all features increased computational time, highlighting a trade-off between interpretability and computational efficiency.
2. Decision Against SMOTE for Class Balancing: Although SMOTE can address class imbalance, it was not used here because it introduced noise and degraded the model's performance metrics (F1 score, accuracy, and precision), thereby impairing accurate prediction of the target variable.
3. Significance of Word Count as a Feature: Word count emerged as a crucial predictor of the target variable, supported by EDA and permutation-importance analysis. By contrast, punctuation count (punc_num) correlated strongly with word count but did not emerge as a key feature in its own right.
4. Model Interpretability and Predictor Impact: Partial dependence plots for word_count showed that higher word counts increase the predicted likelihood of AI origin, contrary to the initial EDA findings. Conversely, higher values of specific features (feature_512 and feature_386) were associated with a higher likelihood of human-originated content.
5. Ensemble Modeling for Improved Accuracy: No single model achieved an F1 score above 0.68, limited by the complexity of the dataset. A stacking ensemble approach significantly improved accuracy in predicting the target class, though at the expense of interpretability, given the complex nature of ensemble models.
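The trade-off described in observation 1 can be sketched with scikit-learn by comparing F1 on the full feature set against a K-best-reduced set. This is a minimal illustration on synthetic data: the classifier, `k`, and dataset shape are illustrative choices, not the project's actual configuration.

```python
# Sketch of observation 1: F1 with all features vs. a K-best subset
# (synthetic data; model and k are illustrative, not the project's).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=1000, n_features=100,
                           n_informative=40, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Baseline: train on every feature.
full = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Reduced: keep only the 10 highest-scoring features (ANOVA F-test).
kbest = make_pipeline(SelectKBest(f_classif, k=10),
                      LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)

print("all features :", f1_score(y_te, full.predict(X_te)))
print("k-best (k=10):", f1_score(y_te, kbest.predict(X_te)))
```

When many features carry signal, as with 768 embedding dimensions, aggressive selection tends to discard information and lower F1, which mirrors the observation above.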
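The stacking approach in observation 5 can be sketched as follows. This is a minimal example on synthetic data; the base learners and meta-learner shown here are illustrative stand-ins, not the exact models used in the project notebook.

```python
# Minimal stacking-ensemble sketch (synthetic stand-in for the
# embeddings + word_count + punc_num feature matrix).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("dt", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # meta-learner on base predictions
    cv=5,  # out-of-fold predictions feed the meta-learner
)
stack.fit(X_tr, y_tr)
print("stacked F1:", f1_score(y_te, stack.predict(X_te)))
```

The meta-learner combines the base models' out-of-fold predictions, which is what lets the ensemble exceed any single model's score, at the cost of a harder-to-interpret pipeline.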
These observations underscore the complexities and trade-offs involved in model selection, feature importance, class balancing techniques, and the pursuit of higher accuracy versus interpretability in classification tasks.
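For reference, the permutation-importance check behind observation 3 can be sketched as below. The data is synthetic: `word_count` here is a deliberately informative column and the remaining columns are noise stand-ins for embedding dimensions, not the project's real features.

```python
# Permutation-importance sketch: an informative word_count column vs.
# uninformative noise columns (synthetic stand-in data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 800
word_count = rng.normal(300, 80, n)             # informative feature
noise = rng.normal(size=(n, 5))                 # embeddings stand-in
y = (word_count + rng.normal(0, 40, n) > 300).astype(int)
X = np.column_stack([word_count, noise])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)

# Column 0 (word_count) should dominate: shuffling it destroys accuracy,
# while shuffling the noise columns barely moves the score.
print(result.importances_mean)
```

Shuffling one column at a time and measuring the score drop is exactly how permutation importance separates a genuine predictor like word count from a merely correlated one like punc_num.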
## Package contents

AI-generated-text-detection.zip (2 files):
- AI_TextGeneration_Group_5_3_Dec_Final.ipynb (1.52 MB)
- README.md (3 KB)