# ChatLearner
![](https://img.shields.io/badge/python-3.6.2-brightgreen.svg) ![](https://img.shields.io/badge/tensorflow-1.4.0-yellowgreen.svg?sanitize=true)
A chatbot implemented in TensorFlow based on the new sequence-to-sequence (NMT) model, with certain rules seamlessly integrated.
**For those who are interested in chatbots in Chinese, please check [here](#chinese_chatbots).**
The core of ChatLearner (Papaya) was built on the [NMT model](https://github.com/tensorflow/nmt), which has been adapted here to fit the needs of a chatbot. Due to the changes made to the tf.data API in TensorFlow 1.4 and the many other changes introduced since TensorFlow 1.12, this version of ChatLearner only supports TF versions 1.4 through 1.11.
Easy updates can be made in the tokenizeddata.py file if you need to support TensorFlow 1.12.
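As a starting point, one likely culprit is the migration of tf.contrib.data helpers (such as group_by_window, commonly used for bucketing) into tf.data.experimental in later 1.x releases. The guard below is a hedged sketch of such an update, an assumption about what breaks rather than the repository's actual fix:

```python
import tensorflow as tf

# Hedged compatibility shim (an assumption about what breaks in TF 1.12+):
# tf.contrib.data helpers such as group_by_window moved to
# tf.data.experimental in later 1.x releases.
if hasattr(tf.data, 'experimental'):
    group_by_window = tf.data.experimental.group_by_window
else:
    group_by_window = tf.contrib.data.group_by_window

# Used the same way in both cases, e.g.:
# dataset = dataset.apply(group_by_window(key_func, reduce_func, window_size))
```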
Before starting everything else, you may want to get a feel for how ChatLearner behaves. Take a look at the sample conversation below or [here](https://github.com/bshao001/ChatLearner/blob/master/Data/Test/responses.txt), or if you prefer, try my trained model: download it [here](https://drive.google.com/file/d/1mVWFScBHFeA7oVxQzWb8QbKfTi3TToUr/view?usp=sharing). Unzip the downloaded .rar file, and copy the Result folder into the Data folder under your project root. A vocab.txt file is also included in case I update it in the future without updating the trained model.
![](/Data/Test/chat.png)
## Highlights and Specialties:
Why would you want to spend time checking out this repository? Here are some possible reasons:
1. The Papaya Data Set for training the chatbot. You can easily find tons of training data online, but none of such high quality. See the detailed description of the data set below.
2. The concise code style and clear implementation of the new seq2seq model based on dynamic RNN (a.k.a. the new NMT model). It is customized for chatbots and much easier to understand compared with the official tutorial.
3. The idea of using seamlessly integrated ChatSession to handle basic conversational context.
4. Some rules are integrated to demonstrate how to combine traditional rule-based chatbots with the new deep learning models. No matter how powerful a deep learning model is, it cannot answer questions that require simple arithmetic calculations, among many others. The approach demonstrated here can easily be adapted to retrieve news or other online information. With the rules implemented, it can properly answer many interesting questions, for example:
* "What time is it now?" or "What day is it today?" or "What's the date yesterday?"
* "Read me a story please." or "Tell me a joke." It can then present stories and jokes randomly and not being limited by the sequence length of the decoder.
* "How much is twelve thousand three hundred four plus two hundred fifty six?" or "What is the sum of five and six?" or "How much is twelve thousand three-hundred and four divided by two-hundred-fifty-six?" or "If x=55 and y=19, how much is y - x?" or "How much do you get if you subtract eight from one hundred?" or even "If x = 99 and y = 228 / x, how much is y?"
   If you are not interested in rules, you can easily remove the lines related to knowledgebase.py and functiondata.py. For the rule-hook idea, see the first sketch after this list.
5. A SOAP-based web service (and a REST-API-based alternative, if you prefer not to use SOAP) allows you to present the GUI in Java, while the model is trained and running in Python and TensorFlow.
6. A simple in-graph solution for converting a string tensor to lower case in TensorFlow. It is required if you utilize the new Dataset API (tf.data.TextLineDataset) in TensorFlow to load training data from text files. See the second sketch after this list.
7. The repository also contains a chatbot implementation based on the legacy seq2seq model. In case you are interested in that, please check the Legacy_Chatbot branch at https://github.com/bshao001/ChatLearner/tree/Legacy_Chatbot.
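To make item 4 concrete, here is a minimal sketch of the rule-hook idea. The `predict_fn` parameter is a hypothetical stand-in for the trained model's inference call; the real logic lives in knowledgebase.py, functiondata.py, and botpredictor.py and is considerably more elaborate:

```python
import datetime
import re

# Match simple "what time is it" style questions before invoking the model.
TIME_PATTERN = re.compile(r'\bwhat time is it\b', re.IGNORECASE)

def answer(question, predict_fn):
    """Route a question to a rule if one matches, else to the seq2seq model.

    predict_fn is a hypothetical callable wrapping the trained model.
    """
    if TIME_PATTERN.search(question):
        return datetime.datetime.now().strftime('It is %I:%M %p now.')
    return predict_fn(question)  # no rule matched; fall back to the model
```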
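For item 6, here is a minimal sketch of one way to lower-case a scalar string tensor entirely in-graph with TF 1.x ops. This is illustrative only and handles ASCII letters; the repository's actual implementation in tokenizeddata.py may differ in detail:

```python
import tensorflow as tf

def lower_case(str_tensor):
    """Lower-case a scalar string tensor in-graph (ASCII letters only)."""
    upper = [chr(c) for c in range(ord('A'), ord('Z') + 1)]
    lower = [chr(c) for c in range(ord('a'), ord('z') + 1)]
    # Map each uppercase byte to its alphabet index; -1 for anything else.
    table = tf.contrib.lookup.index_table_from_tensor(
        mapping=tf.constant(upper), default_value=-1)
    # An empty delimiter splits the string into individual bytes.
    chars = tf.string_split([str_tensor], delimiter='').values
    ids = table.lookup(chars)
    lowered = tf.gather(tf.constant(lower), tf.maximum(ids, 0))
    return tf.reduce_join(tf.where(ids >= 0, lowered, chars))

# Usage with the Dataset API (remember to run tf.tables_initializer()):
# dataset = tf.data.TextLineDataset('train.txt').map(lower_case)
```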
## Papaya Conversational Data Set
Papaya Data Set is the cleanest and best-organized free English conversational data you can find on the web for training a chatbot. Here are some details:
1. The data are composed of two sets: the first set was handcrafted; we created the samples in order to maintain a consistent role for the chatbot, who can therefore be trained to be polite, patient, humorous, and philosophical, and to be aware that he is a robot while pretending to be a 9-year-old boy named Papaya. The second set was cleaned from some online resources, including the scenario conversations designed for training robots, the Cornell movie dialogs, and cleaned Reddit data.
2. The training data set is split into three categories: two subsets will be augmented/repeated to different degrees during training, while the third will not. The augmented subsets train the model with rules to follow and some knowledge and common sense, while the third subset just helps train the language model.
3. The scenario conversations were extracted and reorganized from http://www.eslfast.com/robot/. If your model supports conversational context, it will work much better by utilizing these conversations.
4. The original Cornell data set can be found [here](http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html). We cleaned it with a Python script (which can also be found in the Corpus folder), then cleaned it further by manually searching for certain patterns.
5. For the Reddit data, a cleaned subset (about 110K pairs) is included in this repository. The vocab file and model parameters are created and adjusted based on all the included data files. In case you need a larger set, you can also find scripts to parse and clean the Reddit comments in the Corpus/RedditData folder. In order to use those scripts, you need to download a torrent of Reddit comments from a torrent link [here](https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/).
   Normally, a single month of comments is big enough (it can generate roughly 3M pairs of training samples). You can tune the parameters in the scripts based on your needs.
6. The data files in this data set were already preprocessed with the NLTK tokenizer, so they are ready to feed into the model using the new tf.data API in TensorFlow.
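For reference, this is roughly what NLTK tokenization does to a raw line (a small illustration, not the repository's actual preprocessing script):

```python
import nltk  # the tokenizer models are needed once: nltk.download('punkt')

line = "Don't worry, it's ready to train!"
print(' '.join(nltk.word_tokenize(line)))
# Output: Do n't worry , it 's ready to train !
```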
## Before You Proceed
1. Please make sure you have the correct TensorFlow version. It works only with TensorFlow 1.4 through 1.11, not any earlier releases, because the tf.data API used here was newly introduced in TF 1.4.
2. Please make sure you have the PYTHONPATH environment variable set up. It needs to point to the project root directory, which contains the chatbot, Data, and webui folders (e.g., `export PYTHONPATH=/path/to/ChatLearner` on Linux/macOS). If you are running in an IDE, such as PyCharm, it will set that for you, but if you run any Python scripts from a command line, you must set that environment variable yourself, otherwise you will get module import errors.
3. Please make sure you are using the same vocab.txt file for both training and inference/prediction. Keep in mind that your model never sees words the way we do: it's all integers in, integers out, and the words and their order in vocab.txt map between the words and integers. See the sketch after this list.
4. Spend a little time thinking about how big your model should be, what the maximum lengths of the encoder/decoder should be, the size of the vocabulary, and how many pairs of training data you want to use. Be advised that a model has a capacity limit: how much data it can learn or remember. Given a fixed number of layers, number of units, type of RNN cell (such as GRU), and encoder/decoder length, it is mainly the vocabulary size that determines your model's ability to learn, not the number of training samples. If you can manage to keep the vocabulary size from growing when you make use of more training data, it will probably work, but the reality is that with more training samples the vocabulary size also grows very quickly, and you may then notice that your model cannot accommodate that size of data.
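As promised above, here is a minimal sketch of how words typically map to integers via a vocab file in TF 1.x pipelines. It is illustrative only and assumes unknown words map to id 0 (as with an `<unk>` token on the first line of vocab.txt):

```python
import tensorflow as tf

# Each line of vocab.txt holds one token; its line number becomes its id.
vocab_table = tf.contrib.lookup.index_table_from_file(
    'Data/vocab.txt', default_value=0)  # out-of-vocabulary words -> id 0

ids = vocab_table.lookup(tf.constant(['hello', 'papaya', 'qwertyuiop']))

with tf.Session() as sess:
    sess.run(tf.tables_initializer())
    print(sess.run(ids))  # integers in, integers out
```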