The authors demonstrate how treating text as data frames enables you to manipulate, summarize, and visualize characteristics of text. You’ll also learn how to integrate natural language processing (NLP) into effective workflows. Practical code examples and data explorations will help you generate real insights from literature, news, and social media.
Text Mining with R
A Tidy Approach
Julia Silge and David Robinson
Beijing, Boston, Farnham, Sebastopol, Tokyo
978-1-491-98165-8
[LSI]
Text Mining with R
by Julia Silge and David Robinson
Copyright © 2017 Julia Silge, David Robinson. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://oreilly.com/safari). For more information, contact our
corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Nicole Tache
Production Editor: Nicholas Adams
Copyeditor: Sonia Saruba
Proofreader: Charles Roumeliotis
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
June 2017: First Edition
Revision History for the First Edition
2017-06-08: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491981658 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Text Mining with R, the cover image,
and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
Table of Contents

Preface

1. The Tidy Text Format
  Contrasting Tidy Text with Other Data Structures
  The unnest_tokens Function
  Tidying the Works of Jane Austen
  The gutenbergr Package
  Word Frequencies
  Summary

2. Sentiment Analysis with Tidy Data
  The sentiments Dataset
  Sentiment Analysis with Inner Join
  Comparing the Three Sentiment Dictionaries
  Most Common Positive and Negative Words
  Wordclouds
  Looking at Units Beyond Just Words
  Summary

3. Analyzing Word and Document Frequency: tf-idf
  Term Frequency in Jane Austen’s Novels
  Zipf’s Law
  The bind_tf_idf Function
  A Corpus of Physics Texts
  Summary

4. Relationships Between Words: N-grams and Correlations
  Tokenizing by N-gram
  Counting and Filtering N-grams
  Analyzing Bigrams
  Using Bigrams to Provide Context in Sentiment Analysis
  Visualizing a Network of Bigrams with ggraph
  Visualizing Bigrams in Other Texts
  Counting and Correlating Pairs of Words with the widyr Package
  Counting and Correlating Among Sections
  Examining Pairwise Correlation
  Summary

5. Converting to and from Nontidy Formats
  Tidying a Document-Term Matrix
  Tidying DocumentTermMatrix Objects
  Tidying dfm Objects
  Casting Tidy Text Data into a Matrix
  Tidying Corpus Objects with Metadata
  Example: Mining Financial Articles
  Summary

6. Topic Modeling
  Latent Dirichlet Allocation
  Word-Topic Probabilities
  Document-Topic Probabilities
  Example: The Great Library Heist
  LDA on Chapters
  Per-Document Classification
  By-Word Assignments: augment
  Alternative LDA Implementations
  Summary

7. Case Study: Comparing Twitter Archives
  Getting the Data and Distribution of Tweets
  Word Frequencies
  Comparing Word Usage
  Changes in Word Use
  Favorites and Retweets
  Summary

8. Case Study: Mining NASA Metadata
  How Data Is Organized at NASA
  Wrangling and Tidying the Data
  Some Initial Simple Exploration