# ChatLearner
![](https://img.shields.io/badge/python-3.6.2-brightgreen.svg) ![](https://img.shields.io/badge/tensorflow-1.4.0-yellowgreen.svg?sanitize=true)
A chatbot implemented in TensorFlow based on the new sequence-to-sequence (NMT) model, with certain rules seamlessly integrated.
**For those who are interested in chatbots in Chinese, please check [here](#chinese_chatbots).**
The core of ChatLearner (Papaya) was built on the [NMT model](https://github.com/tensorflow/nmt), which has been adapted here to fit the needs of a chatbot. Due to the changes made to the tf.data API in TensorFlow 1.4 and the many other changes introduced since TensorFlow 1.12, this version of ChatLearner only supports TF versions 1.4 through 1.11.
Easy updates can be made in the tokenizeddata.py file if you need to support TensorFlow 1.12.
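As a starting point, one likely culprit is the migration of tf.contrib.data helpers (such as group_by_window, commonly used for bucketing) into tf.data.experimental in later 1.x releases. The guard below is a hedged sketch of such an update, an assumption about what breaks rather than the repository's actual fix:

```python
import tensorflow as tf

# Hedged compatibility shim (an assumption about what breaks in TF 1.12+):
# tf.contrib.data helpers such as group_by_window moved to
# tf.data.experimental in later 1.x releases.
if hasattr(tf.data, 'experimental'):
    group_by_window = tf.data.experimental.group_by_window
else:
    group_by_window = tf.contrib.data.group_by_window

# Used the same way in both cases, e.g.:
# dataset = dataset.apply(group_by_window(key_func, reduce_func, window_size))
```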
Before starting everything else, you may want to get a feel for how ChatLearner behaves. Take a look at the sample conversation below or [here](https://github.com/bshao001/ChatLearner/blob/master/Data/Test/responses.txt), or if you prefer, try my trained model: download it [here](https://drive.google.com/file/d/1mVWFScBHFeA7oVxQzWb8QbKfTi3TToUr/view?usp=sharing). Unzip the downloaded .rar file, and copy the Result folder into the Data folder under your project root. A vocab.txt file is also included in case I update it in the future without updating the trained model.
![](/Data/Test/chat.png)
## Highlights and Specialties:
Why would you want to spend time checking out this repository? Here are some possible reasons:
1. The Papaya Data Set for training the chatbot. You can easily find tons of training data online, but none of such high quality. See the detailed description of the data set below.
2. The concise code style and clear implementation of the new seq2seq model based on dynamic RNN (a.k.a. the new NMT model). It is customized for chatbots and much easier to understand compared with the official tutorial.
3. The idea of using seamlessly integrated ChatSession to handle basic conversational context.
4. Some rules are integrated to demonstrate how to combine traditional rule-based chatbots with the new deep learning models. No matter how powerful a deep learning model is, it cannot answer questions that require simple arithmetic calculations, among many others. The approach demonstrated here can easily be adapted to retrieve news or other online information. With the rules implemented, it can properly answer many interesting questions, for example:
* "What time is it now?" or "What day is it today?" or "What's the date yesterday?"
* "Read me a story please." or "Tell me a joke." It can then present stories and jokes randomly and not being limited by the sequence length of the decoder.
* "How much is twelve thousand three hundred four plus two hundred fifty six?" or "What is the sum of five and six?" or "How much is twelve thousand three-hundred and four divided by two-hundred-fifty-six?" or "If x=55 and y=19, how much is y - x?" or "How much do you get if you subtract eight from one hundred?" or even "If x = 99 and y = 228 / x, how much is y?"
   If you are not interested in rules, you can easily remove the lines related to knowledgebase.py and functiondata.py. For the rule-hook idea, see the first sketch after this list.
5. A SOAP-based web service (and a REST-API-based alternative, if you prefer not to use SOAP) allows you to present the GUI in Java, while the model is trained and running in Python and TensorFlow.
6. A simple in-graph solution for converting a string tensor to lower case in TensorFlow. It is required if you utilize the new Dataset API (tf.data.TextLineDataset) in TensorFlow to load training data from text files. See the second sketch after this list.
7. The repository also contains a chatbot implementation based on the legacy seq2seq model. In case you are interested in that, please check the Legacy_Chatbot branch at https://github.com/bshao001/ChatLearner/tree/Legacy_Chatbot.
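To make item 4 concrete, here is a minimal sketch of the rule-hook idea. The `predict_fn` parameter is a hypothetical stand-in for the trained model's inference call; the real logic lives in knowledgebase.py, functiondata.py, and botpredictor.py and is considerably more elaborate:

```python
import datetime
import re

# Match simple "what time is it" style questions before invoking the model.
TIME_PATTERN = re.compile(r'\bwhat time is it\b', re.IGNORECASE)

def answer(question, predict_fn):
    """Route a question to a rule if one matches, else to the seq2seq model.

    predict_fn is a hypothetical callable wrapping the trained model.
    """
    if TIME_PATTERN.search(question):
        return datetime.datetime.now().strftime('It is %I:%M %p now.')
    return predict_fn(question)  # no rule matched; fall back to the model
```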
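For item 6, here is a minimal sketch of one way to lower-case a scalar string tensor entirely in-graph with TF 1.x ops. This is illustrative only and handles ASCII letters; the repository's actual implementation in tokenizeddata.py may differ in detail:

```python
import tensorflow as tf

def lower_case(str_tensor):
    """Lower-case a scalar string tensor in-graph (ASCII letters only)."""
    upper = [chr(c) for c in range(ord('A'), ord('Z') + 1)]
    lower = [chr(c) for c in range(ord('a'), ord('z') + 1)]
    # Map each uppercase byte to its alphabet index; -1 for anything else.
    table = tf.contrib.lookup.index_table_from_tensor(
        mapping=tf.constant(upper), default_value=-1)
    # An empty delimiter splits the string into individual bytes.
    chars = tf.string_split([str_tensor], delimiter='').values
    ids = table.lookup(chars)
    lowered = tf.gather(tf.constant(lower), tf.maximum(ids, 0))
    return tf.reduce_join(tf.where(ids >= 0, lowered, chars))

# Usage with the Dataset API (remember to run tf.tables_initializer()):
# dataset = tf.data.TextLineDataset('train.txt').map(lower_case)
```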
## Papaya Conversational Data Set
Papaya Data Set is the cleanest and best-organized free English conversational data you can find on the web for training a chatbot. Here are some details:
1. The data are composed of two sets: the first set was handcrafted; we created the samples in order to maintain a consistent role for the chatbot, who can therefore be trained to be polite, patient, humorous, and philosophical, and to be aware that he is a robot while pretending to be a 9-year-old boy named Papaya. The second set was cleaned from some online resources, including the scenario conversations designed for training robots, the Cornell movie dialogs, and cleaned Reddit data.
2. The training data set is split into three categories: two subsets will be augmented/repeated to different degrees during training, while the third will not. The augmented subsets train the model with rules to follow and some knowledge and common sense, while the third subset just helps train the language model.
3. The scenario conversations were extracted and reorganized from http://www.eslfast.com/robot/. If your model supports conversational context, it will work much better by utilizing these conversations.
4. The original Cornell data set can be found [here](http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html). We cleaned it with a Python script (which can also be found in the Corpus folder), then cleaned it further by manually searching for certain patterns.
5. For the Reddit data, a cleaned subset (about 110K pairs) is included in this repository. The vocab file and model parameters are created and adjusted based on all the included data files. In case you need a larger set, you can also find scripts to parse and clean the Reddit comments in the Corpus/RedditData folder. In order to use those scripts, you need to download a torrent of Reddit comments from a torrent link [here](https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/).
   Normally, a single month of comments is big enough (it can generate roughly 3M pairs of training samples). You can tune the parameters in the scripts based on your needs.
6. The data files in this data set were already preprocessed with the NLTK tokenizer, so they are ready to feed into the model using the new tf.data API in TensorFlow.
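For reference, this is roughly what NLTK tokenization does to a raw line (a small illustration, not the repository's actual preprocessing script):

```python
import nltk  # the tokenizer models are needed once: nltk.download('punkt')

line = "Don't worry, it's ready to train!"
print(' '.join(nltk.word_tokenize(line)))
# Output: Do n't worry , it 's ready to train !
```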
## Before You Proceed
1. Please make sure you have the correct TensorFlow version. It works only with TensorFlow 1.4 through 1.11, not any earlier releases, because the tf.data API used here was newly introduced in TF 1.4.
2. Please make sure you have the PYTHONPATH environment variable set up. It needs to point to the project root directory, which contains the chatbot, Data, and webui folders (e.g., `export PYTHONPATH=/path/to/ChatLearner` on Linux/macOS). If you are running in an IDE, such as PyCharm, it will set that for you, but if you run any Python scripts from a command line, you must set that environment variable yourself, otherwise you will get module import errors.
3. Please make sure you are using the same vocab.txt file for both training and inference/prediction. Keep in mind that your model never sees words the way we do: it's all integers in, integers out, and the words and their order in vocab.txt map between the words and integers. See the sketch after this list.
4. Spend a little time thinking about how big your model should be, what the maximum lengths of the encoder/decoder should be, the size of the vocabulary, and how many pairs of training data you want to use. Be advised that a model has a capacity limit: how much data it can learn or remember. Given a fixed number of layers, number of units, type of RNN cell (such as GRU), and encoder/decoder length, it is mainly the vocabulary size that determines your model's ability to learn, not the number of training samples. If you can manage to keep the vocabulary size from growing when you make use of more training data, it will probably work, but the reality is that with more training samples the vocabulary size also grows very quickly, and you may then notice that your model cannot accommodate that size of data.
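As promised above, here is a minimal sketch of how words typically map to integers via a vocab file in TF 1.x pipelines. It is illustrative only and assumes unknown words map to id 0 (as with an `<unk>` token on the first line of vocab.txt):

```python
import tensorflow as tf

# Each line of vocab.txt holds one token; its line number becomes its id.
vocab_table = tf.contrib.lookup.index_table_from_file(
    'Data/vocab.txt', default_value=0)  # out-of-vocabulary words -> id 0

ids = vocab_table.lookup(tf.constant(['hello', 'papaya', 'qwertyuiop']))

with tf.Session() as sess:
    sess.run(tf.tables_initializer())
    print(sess.run(ids))  # integers in, integers out
```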