# Sentiment Analysis on Tweets
![Status badge](https://img.shields.io/badge/Status-Archived-important)
**Update** (21 Sept. 2018): I don't actively maintain this repository. This work was done for a course project, and the dataset cannot be released because I don't own the copyright. However, everything in this repository can be easily modified to work with other datasets. I recommend reading the [sloppily written project report](https://github.com/abdulfatir/twitter-sentiment-analysis/tree/master/docs/report.pdf) for this project, which can be found in `docs/`.
## Dataset Information
We use and compare several methods for sentiment analysis on tweets (a binary classification problem). The training dataset is expected to be a CSV file with rows of the form `tweet_id,sentiment,tweet`, where `tweet_id` is a unique integer identifying the tweet, `sentiment` is either `1` (positive) or `0` (negative), and `tweet` is the tweet text enclosed in `""`. Similarly, the test dataset is a CSV file with rows of the form `tweet_id,tweet`. Please note that CSV headers are not expected and should be removed from the training and test datasets.
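To make the expected row layout concrete, here is a minimal sketch that parses header-less training rows with Python's standard `csv` module. The helper name `read_training_rows` and the sample tweets are made up for illustration; this is not how `preprocess.py` necessarily reads its input.

```python
import csv
import io


def read_training_rows(handle):
    """Yield (tweet_id, sentiment, tweet) tuples from a header-less CSV.

    Hypothetical helper for illustration only.
    """
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        tweet_id, sentiment, tweet = row
        yield int(tweet_id), int(sentiment), tweet


# Two made-up rows in the tweet_id,sentiment,tweet layout.
# The quoted tweet may itself contain commas.
sample = io.StringIO(
    '1,1,"what a great day, loving it"\n'
    '2,0,"missed my bus again"\n'
)
rows = list(read_training_rows(sample))
```

For a real file, the same function can be given an open file handle, for example `open(path, newline="", encoding="utf-8")`.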
## Requirements
There are some general library requirements for the project and some which are specific to individual methods. The general requirements are as follows.
* `numpy`
* `scikit-learn`
* `scipy`
* `nltk`
The library requirements specific to some methods are:
* `keras` with `TensorFlow` backend for Logistic Regression, MLP, RNN (LSTM), and CNN.
* `xgboost` for XGBoost.
**Note**: It is recommended to use the Anaconda distribution of Python.
## Usage
### Preprocessing
1. Run `preprocess.py <raw-csv-path>` on both train and test data. This will generate a preprocessed version of the dataset.
2. Run `stats.py <preprocessed-csv-path>` where `<preprocessed-csv-path>` is the path of the CSV generated by `preprocess.py`. This prints general statistical information about the dataset and generates two pickle files containing the frequency distributions of unigrams and bigrams in the training dataset.
After the above steps, you should have four files in total: `<preprocessed-train-csv>`, `<preprocessed-test-csv>`, `<freqdist>`, and `<freqdist-bi>`, which are the preprocessed train dataset, the preprocessed test dataset, the frequency distribution of unigrams, and the frequency distribution of bigrams, respectively.
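As a rough illustration of what a unigram and bigram frequency distribution looks like, the sketch below counts both with `collections.Counter` and round-trips one of them through `pickle`. The function name `count_ngrams`, the whitespace tokenization, and the use of `Counter` are assumptions for this example; `stats.py` may build and store its distributions differently.

```python
import pickle
from collections import Counter


def count_ngrams(tweets):
    """Count unigrams and bigrams over whitespace-tokenized tweets.

    Hypothetical helper; a stand-in for whatever stats.py computes.
    """
    unigrams, bigrams = Counter(), Counter()
    for tweet in tweets:
        words = tweet.split()
        unigrams.update(words)
        # Pair each word with its successor to form bigrams.
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


tweets = ["good day today", "good day"]  # made-up preprocessed tweets
unigrams, bigrams = count_ngrams(tweets)

# A Counter can be pickled and restored unchanged.
restored = pickle.loads(pickle.dumps(unigrams))
```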
For all the methods that follow, change the values of `TRAIN_PROCESSED_FILE`, `TEST_PROCESSED_FILE`, `FREQ_DIST_FILE`, and `BI_FREQ_DIST_FILE` to your own paths in the respective files. Wherever applicable, the values of `USE_BIGRAMS` and `FEAT_TYPE` can be changed to obtain results using different types of features, as described in the report.
### Baseline
3. Run `baseline.py`. With `TRAIN = True` it will show the accuracy results on the training dataset.
### Naive Bayes
4. Run `naivebayes.py`. With `TRAIN = True` it will show the accuracy results on a 10% validation dataset.
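The 10% validation split mentioned here and in the following steps can be pictured with the sketch below. It is only an assumption about the general idea: the function name `split_train_validation`, the shuffling, and the fixed seed are illustrative, and the scripts in this repository may split the data in another way.

```python
import random


def split_train_validation(items, validation_fraction=0.1, seed=0):
    """Shuffle items and hold out a fraction of them for validation.

    Hypothetical helper for illustration only.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_val = int(len(shuffled) * validation_fraction)
    # The first n_val items are held out; the rest are used for training.
    return shuffled[n_val:], shuffled[:n_val]


train, val = split_train_validation(range(100))
```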
### Maximum Entropy
5. Run `logistic.py` to run the logistic regression model, or run `maxent-nltk.py <>` to run NLTK's MaxEnt model. With `TRAIN = True` it will show the accuracy results on a 10% validation dataset.
### Decision Tree
6. Run `decisiontree.py`. With `TRAIN = True` it will show the accuracy results on a 10% validation dataset.
### Random Forest
7. Run `randomforest.py`. With `TRAIN = True` it will show the accuracy results on a 10% validation dataset.
### XGBoost
8. Run `xgboost.py`. With `TRAIN = True` it will show the accuracy results on a 10% validation dataset.
### SVM
9. Run `svm.py`. With `TRAIN = True` it will show the accuracy results on a 10% validation dataset.
### Multi-Layer Perceptron
10. Run `neuralnet.py`. It validates on 10% of the data and saves the best model to `best_mlp_model.h5`.
### Recurrent Neural Networks
11. Run `lstm.py`. It validates on 10% of the data and saves a model for each epoch in `./models/`. (Please make sure this directory exists before running `lstm.py`.)
### Convolutional Neural Networks
12. Run `cnn.py`. This will run the 4-Conv-NN (a neural network with 4 conv layers) model as described in the report. To run other versions of the CNN, just comment out or remove the lines where the Conv layers are added. It validates on 10% of the data and saves a model for each epoch in `./models/`. (Please make sure this directory exists before running `cnn.py`.)
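Since both `lstm.py` and `cnn.py` expect `./models/` to exist, the directory can be created up front. The sketch below uses `os.makedirs` with `exist_ok=True`, which does nothing if the directory is already there. The helper name `ensure_dir` is made up, and the demo writes to a temporary location rather than to the repository.

```python
import os
import tempfile


def ensure_dir(path):
    """Create the directory (and any parents) if it does not exist yet."""
    os.makedirs(path, exist_ok=True)
    return path


# Demo against a temporary location; "models-demo" is an arbitrary name.
models_dir = ensure_dir(os.path.join(tempfile.gettempdir(), "models-demo"))
```

In the repository itself, the equivalent call would be `ensure_dir("./models")`, run from the directory where the scripts are started.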
### Majority Vote Ensemble
13. To extract penultimate layer features for the training dataset, run `extract-cnn-feats.py <saved-model>`. This will generate 3 files, `train-feats.npy`, `train-labels.txt` and `test-feats.npy`.
14. Run `cnn-feats-svm.py` which uses files from the previous step to perform SVM classification on features extracted from CNN model.
15. Place all prediction CSV files for which you want to take majority vote in `./results/` and run `majority-voting.py`. This will generate `majority-voting.csv`.
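The idea behind majority voting can be sketched as follows. This is not the code of `majority-voting.py`; the function name `majority_vote`, the in-memory dictionaries mapping a tweet ID to a predicted label, and the sample predictions are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(prediction_sets):
    """Return the most common label per tweet across several models.

    prediction_sets: list of dicts mapping tweet_id -> 0/1 label.
    Hypothetical helper for illustration only.
    """
    result = {}
    for tweet_id in prediction_sets[0]:
        votes = Counter(preds[tweet_id] for preds in prediction_sets)
        result[tweet_id] = votes.most_common(1)[0][0]
    return result


# Made-up predictions from three models for three tweets.
model_a = {1: 1, 2: 0, 3: 1}
model_b = {1: 1, 2: 1, 3: 0}
model_c = {1: 0, 2: 0, 3: 1}
combined = majority_vote([model_a, model_b, model_c])
```

With an odd number of binary predictors there is always a strict majority. With an even number, ties are possible, and this sketch would then pick whichever label was counted first.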
## Information about other files
* `dataset/positive-words.txt`: List of positive words.
* `dataset/negative-words.txt`: List of negative words.
* `dataset/glove-seeds.txt`: GloVe word vectors from StanfordNLP which match our dataset, used for seeding word embeddings.
* `Plots.ipynb`: IPython notebook used to generate the plots present in the report.