Text Mining with R: A Tidy Approach [True PDF]

Uploaded 2017-10-06 14:30:18, 9.74 MB, PDF

Julia Silge & David Robinson

Text Mining with R

A Tidy Approach

Table of Contents

Preface vii

1. The Tidy Text Format 1

Contrasting Tidy Text with Other Data Structures 2

The unnest_tokens Function 2

Tidying the Works of Jane Austen 4

The gutenbergr Package 7

Word Frequencies 8

Summary 12

2. Sentiment Analysis with Tidy Data 13

The sentiments Dataset 14

Sentiment Analysis with Inner Join 16

Comparing the Three Sentiment Dictionaries 19

Most Common Positive and Negative Words 22

Wordclouds 25

Looking at Units Beyond Just Words 27

Summary 29

3. Analyzing Word and Document Frequency: tf-idf 31

Term Frequency in Jane Austen’s Novels 32

Zipf’s Law 34

The bind_tf_idf Function 37

A Corpus of Physics Texts 40

Summary 44

4. Relationships Between Words: N-grams and Correlations 45

Tokenizing by N-gram 45

Counting and Filtering N-grams 46

Analyzing Bigrams 48

Using Bigrams to Provide Context in Sentiment Analysis 51

Visualizing a Network of Bigrams with ggraph 54

Visualizing Bigrams in Other Texts 59

Counting and Correlating Pairs of Words with the widyr Package 61

Counting and Correlating Among Sections 62

Examining Pairwise Correlation 63

Summary 67

5. Converting to and from Nontidy Formats 69

Tidying a Document-Term Matrix 70

Tidying DocumentTermMatrix Objects 71

Tidying dfm Objects 74

Casting Tidy Text Data into a Matrix 77

Tidying Corpus Objects with Metadata 79

Example: Mining Financial Articles 81

Summary 87

6. Topic Modeling 89

Latent Dirichlet Allocation 90

Word-Topic Probabilities 91

Document-Topic Probabilities 95

Example: The Great Library Heist 96

LDA on Chapters 97

Per-Document Classification 100

By-Word Assignments: augment 103

Alternative LDA Implementations 107

Summary 108

7. Case Study: Comparing Twitter Archives 109

Getting the Data and Distribution of Tweets 109

Word Frequencies 110

Comparing Word Usage 114

Changes in Word Use 116

Favorites and Retweets 120

Summary 124

8. Case Study: Mining NASA Metadata 125

How Data Is Organized at NASA 126

Wrangling and Tidying the Data 126

Some Initial Simple Exploration 129
