# A PyTorch-based demo for text classification
## Introduction
Text classification is a core problem in many applications, such as spam filtering, email routing, and book classification.
The task is to train a classifier on a labelled dataset of text documents and their labels, where each document can be
a web page, paper, email, review, etc.
This demo assigns documents to 20 different newsgroups, each corresponding to a different topic, using a
text classification method. The newsgroups data is simply the raw text:
```
From: cmk@athena.mit.edu (Charles M Kozierok)
Subject: Re: bosio's no-hitter
Date: 24 Apr 1993 03:59:58 GMT
Organization: Massachusetts Institute of Technology
I watched the final inning of Bosio's no-hitter with several people at work. After Vizquel made that barehanded
grab of the chopper up the middle, someone remarked that if he had fielded it with his glove, he wouldn't have
had time to throw Riles out. Yet, the throw beat Riles by about two steps. I wonder how many others who watched
the final out think Vizquel had no choice but to make the play with his bare hand.
In this morning's paper (or was it on the radio?), Vizquel was quoted as saying that he could have fielded the
ball with his glove and still easily thrown out Riles, that he barehanded it instead so as to make the final
play more memorable. Seems a litle cocky to me, but he made it work so he's entitled. i guess so.
still, that's kind of a stupid move, IMO. he'd be singing a different tune if he had booted it, and the next
guy up had hit a bloop single. stranger things have happened (hey, i used to be a big Dave Stieb fan...) and
unfortunately, there's no such thing as an "unearned hit". :^)
cheers,
-*-
charles
```
This post is about baseball, so the goal of this demo is to learn to tag it with `rec.sport.baseball`. The labels
comprise 6 major categories and 20 fine-grained categories. See the Dataset section for more information about the
labels.
## Dataset
The 20 Newsgroups data set contains 20000 newsgroup documents collected across 20 different newsgroups. Here is a list
of the 20 newsgroups partitioned (more or less) according to subject matter:
<table border="1" cellpadding="0">
<tbody>
<tr>
<td>
<p align="left">
comp.graphics<br>
comp.os.ms-windows.misc<br>
comp.sys.ibm.pc.hardware<br>
comp.sys.mac.hardware<br>
comp.windows.x
</p>
</td>
<td>
<p align="left">
rec.autos<br>
rec.motorcycles<br>
rec.sport.baseball<br>
rec.sport.hockey
</p>
</td>
<td>
<p align="left">
sci.crypt<br>
sci.electronics<br>
sci.med<br>
sci.space
</p>
</td>
</tr>
<tr>
<td>
<p align="left">
misc.forsale
</p>
</td>
<td>
<p align="left">
talk.politics.misc<br>
talk.politics.guns<br>
talk.politics.mideast
</p>
</td>
<td>
<p align="left">
talk.religion.misc<br>
alt.atheism<br>
soc.religion.christian
</p>
</td>
</tr>
</tbody>
</table>
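
The grouping in the table above can be written down as a small lookup, which is handy when you want to report results per major category as well as per newsgroup. This is only a sketch: the group names on the left (`computers`, `recreation`, and so on) are illustrative labels chosen here, not fields of the dataset or names used by the demo script.

```python
# Fine-grained newsgroup labels grouped by the six cells of the table above.
# The group names (dict keys) are illustrative assumptions, not official names.
MAJOR_CATEGORIES = {
    "computers": [
        "comp.graphics",
        "comp.os.ms-windows.misc",
        "comp.sys.ibm.pc.hardware",
        "comp.sys.mac.hardware",
        "comp.windows.x",
    ],
    "recreation": [
        "rec.autos",
        "rec.motorcycles",
        "rec.sport.baseball",
        "rec.sport.hockey",
    ],
    "science": ["sci.crypt", "sci.electronics", "sci.med", "sci.space"],
    "for sale": ["misc.forsale"],
    "politics": [
        "talk.politics.misc",
        "talk.politics.guns",
        "talk.politics.mideast",
    ],
    "religion": [
        "talk.religion.misc",
        "alt.atheism",
        "soc.religion.christian",
    ],
}

# Invert the grouping so each fine-grained label maps to its major category.
LABEL_TO_MAJOR = {
    label: major
    for major, labels in MAJOR_CATEGORIES.items()
    for label in labels
}


def major_category(label: str) -> str:
    """Return the major category for a fine-grained newsgroup label."""
    return LABEL_TO_MAJOR[label]


print(major_category("rec.sport.baseball"))  # recreation
```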
## Requirements
- [PyTorch](http://pytorch.org/): deep learning library; install it by following the instructions on the official website.
- numpy==1.15.3
- Pillow==5.3.0
- requests==2.20.0
- scikit-learn==0.20.0
- scipy==1.1.0
- six==1.11.0
- sklearn==0.0
- torchtext==0.2.3
- tqdm==4.28.1
- urllib3==1.24
## Runtime requirements
Using the default parameters:
- GPU mode needs about 2 GB of CUDA memory.
- CPU mode has low CPU utilization and uses less than 500 MB of main memory.
## Usage
```
python/python3 text_classification.py --model_name <TextCNN|LSTMSelfAttentionHighway> --batch_size 128 --epochs 32
```
This downloads the following data automatically:
- [Twenty Newsgroups Dataset](https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups) (This dataset consists of 20000 messages taken from 20 newsgroups)
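
For orientation, the three flags in the command above could be parsed along the following lines. This is a hypothetical sketch, not the parser in `text_classification.py`: the flag names and model names come from the usage line, but the defaults shown here (`TextCNN`, 128, 32) are assumptions, and the real script may accept further options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the usage line (sketch only)."""
    parser = argparse.ArgumentParser(description="Text classification demo")
    parser.add_argument(
        "--model_name",
        choices=["TextCNN", "LSTMSelfAttentionHighway"],
        default="TextCNN",  # assumed default
        help="which classifier to train",
    )
    parser.add_argument("--batch_size", type=int, default=128)  # assumed default
    parser.add_argument("--epochs", type=int, default=32)  # assumed default
    return parser


# Parse an explicit argument list so the example runs without a command line.
args = build_parser().parse_args(
    ["--model_name", "LSTMSelfAttentionHighway", "--batch_size", "64"]
)
print(args.model_name, args.batch_size, args.epochs)
```

Restricting `--model_name` with `choices` makes a mistyped model name fail immediately with a usage message instead of partway through training.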
## Results
With the two methods we provide (TextCNN and LSTMSelfAttentionHighway), you should get the following results on the validation set:
```
method                    accuracy  loss
TextCNN                   97%       0.000836
LSTMSelfAttentionHighway  93%       0.003445
```
## FAQ
#### I am getting out-of-memory errors, what is wrong?
You are likely to run into out-of-memory errors with the default parameters if your GPU has less than 2 GB of memory.
Memory usage depends on batch_size, hidden_dim, embed_dim, filters, and similar parameters, so try reducing
these parameters.
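
As a rough illustration of why these parameters matter, the size of a single float32 activation tensor grows linearly with each of its dimensions. The sketch below estimates only the embedded input batch; the sequence length (500) and `embed_dim` (300) are hypothetical values chosen for the arithmetic, not the demo's defaults, and real training uses several times more memory for weights, gradients, and intermediate activations.

```python
BYTES_PER_FLOAT32 = 4


def embedded_batch_bytes(batch_size: int, seq_len: int, embed_dim: int) -> int:
    """Bytes needed for one float32 tensor of shape (batch_size, seq_len, embed_dim)."""
    return batch_size * seq_len * embed_dim * BYTES_PER_FLOAT32


# seq_len and embed_dim are hypothetical values used only for this estimate.
full = embedded_batch_bytes(batch_size=128, seq_len=500, embed_dim=300)
half = embedded_batch_bytes(batch_size=64, seq_len=500, embed_dim=300)

print(f"batch_size=128: {full / 2**20:.1f} MiB")
print(f"batch_size=64:  {half / 2**20:.1f} MiB")
```

Halving `batch_size` halves this tensor, and the same holds for `embed_dim`, which is why reducing either one is usually the quickest way to fit on a smaller GPU.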
## Acknowledgements
* [PyTorch team](https://github.com/pytorch/pytorch#the-team) for the Python library
* [A PyTorch implementation of CNN text classification](https://github.com/Shawn1993/cnn-text-classification-pytorch)
* [A PyTorch implementation of LSTM + self-attention classification](https://github.com/nn116003/self-attention-classification)
## Reference
* [Convolutional Neural Networks for Sentence Classification](https://arxiv.org/abs/1408.5882)
* [A Structured Self-attentive Sentence Embedding](https://arxiv.org/pdf/1703.03130.pdf)
* [Highway Networks](https://arxiv.org/abs/1505.00387)
* [NewsWeeder: Learning to Filter Netnews](http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=F78E1658C9E109F677438D805DF0BF9E?doi=10.1.1.22.6286&rep=rep1&type=pdf)
## License
MIT