A Practical Introduction to Information Retrieval and Text Mining (resource, CSDN Library)


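Natural language text data has grown dramatically in recent years, spanning web pages, news articles, scientific literature, emails, enterprise documents, and social media content such as blog articles, forum posts, product reviews, and tweets. This growth has created demand for powerful software tools that help people manage and analyze large amounts of text data effectively and efficiently. Unlike data generated by computer systems or sensors, text data is usually produced directly by humans and captures semantically rich content. Text data is therefore especially valuable for discovering knowledge about human opinions and preferences, along with many other kinds of knowledge that we encode in text. Unlike structured data, which conforms to well-defined schemas (and is therefore relatively easy for computers to handle), text data has less explicit structure and requires computer processing to understand the content it encodes. Current natural language processing technology cannot yet enable computers to understand natural language text precisely, but a wide range of statistical and heuristic methods for managing and analyzing text data have been developed over the past few decades. These methods are usually very robust and can be applied to text data in any natural language and on any topic.

This book gives a systematic introduction to these methods, emphasizing the knowledge and skills needed to build a variety of practical text information systems. Because humans understand natural language better than computers do, effective human involvement in a text information system is generally needed, and such systems often serve as intelligent assistants for humans. Depending on how a text information system collaborates with humans, two kinds of systems can be distinguished.

The first kind is information retrieval systems, which include search engines and recommender systems; they help users find, within a large collection of text data, the most relevant data that is actually needed to solve a specific problem. The goal of an information retrieval system is to help users find the information they are looking for quickly and accurately. This involves understanding the user's query intent, assessing which content in a large body of text is relevant to the query, and presenting that content to the user in ranked order. To do so, information retrieval systems draw on a range of techniques and algorithms, including but not limited to Boolean search, the vector space model, probabilistic retrieval, ranking algorithms, and natural language processing techniques. Combined, these techniques help a system cope with the complexity of language and with the user's actual needs.

The second kind, text mining systems, focuses on extracting useful information from raw text collections. This process includes identifying patterns, discovering trends, building models, and making predictions. Text mining can surface knowledge hidden in large amounts of text, which is valuable for market analysis, sentiment analysis, security monitoring, medical diagnosis, and many other fields. Text mining techniques use statistical analysis, machine learning, and natural language processing to turn the words in the data into insights that support decision making.

In this e-book, authors ChengXiang Zhai and Sean Massung provide a systematic introduction to these techniques and methods, focusing on the core knowledge and skills needed to build practical text information systems. Information retrieval and text mining are important branches of data science, and together they offer a more complete view of how to extract knowledge from text data. The book explores both theoretical foundations and practical applications in depth, making it a valuable resource for professionals and students.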
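The retrieval step described in the summary above (judging which texts are relevant to a query and presenting them in ranked order) can be illustrated with a small vector space model. The following is a minimal sketch, not the book's implementation: the function names (`tokenize`, `build_tfidf`, `cosine`, `rank`), the whitespace tokenizer, the smoothed IDF formula, and the sample documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def build_tfidf(docs):
    """Return (per-document TF-IDF vectors, IDF table) for a list of strings."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumed variant) so that a term present in every
    # document still receives a positive weight.
    idf = {term: math.log((n + 1) / (count + 1)) + 1.0 for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (doc_index, score) pairs sorted by descending similarity."""
    vectors, idf = build_tfidf(docs)
    # Query terms that never occur in the collection are ignored.
    q_vec = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    scores = [(i, cosine(q_vec, v)) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical three-document collection.
    docs = [
        "search engines rank web pages",
        "product reviews express opinions",
        "recommender systems and search engines retrieve text",
    ]
    for index, score in rank("search engines", docs):
        print(index, round(score, 3))
```

Cosine similarity is used here because it normalizes for document length, so a short document that matches the query is not outscored by a long one merely for containing more words. A production system would typically add an inverted index, better tokenization, and a tuned ranking function.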


ABOUT ACM BOOKS

ACM Books is a new series of high quality books for the computer science community, published by ACM in collaboration with Morgan & Claypool Publishers. ACM Books publications are widely distributed in both print and digital formats through booksellers and to libraries (and library consortia) and individual ACM members via the ACM Digital Library platform.

BOOKS.ACM.ORG • WWW.MORGANCLAYPOOL.COM

ACM | MORGAN & CLAYPOOL

Text Data Management and Analysis
A Practical Introduction to Information Retrieval and Text Mining

ChengXiang Zhai
Sean Massung

ISBN: 978-1-97000-116-7

Recent years have seen a dramatic growth of natural language text data, including web pages, news articles, scientific literature, emails, enterprise documents, and social media such as blog articles, forum posts, product reviews, and tweets. This has led to an increasing demand for powerful software tools to help people manage and analyze vast amounts of text data effectively and efficiently. Unlike data generated by a computer system or sensors, text data are usually generated directly by humans, and capture semantically rich content. As such, text data are especially valuable for discovering knowledge about human opinions and preferences, in addition to many other kinds of knowledge that we encode in text. In contrast to structured data, which conform to well-defined schemas (thus are relatively easy for computers to handle), text has less explicit structure, requiring computer processing toward understanding of the content encoded in text. The current technology of natural language processing has not yet reached a point to enable a computer to precisely understand natural language text, but a wide range of statistical and heuristic approaches to management and analysis of text data have been developed over the past few decades. They are usually very robust and can be applied to analyze and manage text data in any natural language, and about any topic.

This book provides a systematic introduction to many of these approaches, with an emphasis on covering the most useful knowledge and skills required to build a variety of practically useful text information systems. Because humans can understand natural languages far better than computers can, effective involvement of humans in a text information system is generally needed and text information systems often serve as intelligent assistants for humans. Depending on how a text information system collaborates with humans, we distinguish two kinds of text information systems. The first is information retrieval systems which include search engines and recommender systems; they assist users in finding from a large collection of text data the most relevant text data that are actually needed for solving a specific application problem, thus effectively turning big raw text data into much smaller relevant text data that can be more easily processed by humans. The second is text mining application systems; they can assist users in analyzing patterns in text data to extract and discover useful actionable knowledge directly useful for task completion or decision making, thus providing more direct task support for users.


