news_summarization.rar_news_summarization_textrank_news text processing_automatic summarization

57 files in total

py: 26

sample: 11

xml: 4


textrank

automatic summarization

Uploaded 2022-09-20 10:39:30, 107KB RAR

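Automatic news summarization is a natural language processing technique that helps readers grasp large volumes of news quickly. According to its description, the "news_summarization.rar" archive contains resources for processing news text and generating summaries automatically, with an emphasis on the TextRank and LexRank algorithms. The main ideas are outlined below.

Processing news text. A news article usually carries several kinds of information: a headline, a lead, body paragraphs, and sometimes captions for images or video. Processing it involves several steps, including tokenization, stop-word removal, part-of-speech tagging, and named entity recognition. These preprocessing steps simplify the text so that it is better suited to later analysis.

TextRank. TextRank is a variant of PageRank, the algorithm originally designed for ranking web pages. For summarization it builds a graph over the text in which every sentence is a node, and edges are weighted by the similarity between sentences, typically measured by word overlap. The importance of each node is then computed iteratively until the scores stabilize. The highest-scoring sentences are selected as the body of the summary, which tends to capture the key information of the article.

LexRank. LexRank works on the same principle but measures sentence similarity differently: sentences are represented as TF-IDF vectors and compared with cosine similarity, and edges below a similarity threshold are usually dropped. Variants may use word-vector-based similarity instead. The aim is a summary that reflects the central points of the article rather than only its most frequently repeated words.

Generating a summary. The usual pipeline is to preprocess the article, score every sentence with TextRank, LexRank, or a similar algorithm, and then assemble the top-scoring sentences into a summary. Additional strategies may be applied along the way, such as preserving coherence, encouraging diversity, and keeping the summary to a readable length.

The "news_summarization" folder may contain code implementing these algorithms, datasets, preprocessing tools, and evaluation methods. Studying them shows how the techniques apply to a real news summarization task and where they can be tuned or improved.

In short, automatic news summarization combines natural language processing, machine learning, and information retrieval. TextRank and LexRank are two key graph-based algorithms in this area: by analyzing the structure and content of a text, they extract its main points and make it far quicker to work through a growing volume of news.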
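The TextRank pipeline described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code shipped in the archive: the function names (`tokenize`, `overlap_similarity`, `textrank`, `summarize`), the regular-expression tokenizer, and the parameter defaults are all assumptions made for the example, and no stop-word removal is performed.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def overlap_similarity(a, b):
    """Word-overlap similarity, normalized by the log of both sentence lengths."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # log(1) == 0 would make the denominator vanish
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def textrank(sentences, damping=0.85, tol=1e-6, max_iter=100):
    """Return one score per sentence from a weighted PageRank iteration."""
    n = len(sentences)
    if n == 0:
        return []
    tokens = [tokenize(s) for s in sentences]
    # Symmetric similarity matrix with a zero diagonal (no self loops).
    weights = [[overlap_similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            # Each neighbor j passes on its score in proportion to edge weight.
            rank = sum(weights[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) / n + damping * rank)
        delta = sum(abs(x - y) for x, y in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:  # scores have stabilized
            break
    return scores


def summarize(sentences, k=2):
    """Pick the k highest-scoring sentences and keep them in document order."""
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    doc = [
        "The cat sat on the mat.",
        "The cat chased the mouse on the mat.",
        "Stocks fell sharply today.",
    ]
    print(summarize(doc, k=2))
```

Returning the selected sentences in their original order is a simple way to keep the summary readable; a LexRank-style variant would only need `overlap_similarity` swapped for cosine similarity over TF-IDF vectors.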


news_summarization.rar (57 files)

news_summarization

.git

index 3KB

hooks

fsmonitor-watchman.sample 3KB

pre-push.sample 1KB

prepare-commit-msg.sample 1KB

applypatch-msg.sample 478B

pre-commit.sample 2KB

pre-receive.sample 544B

pre-applypatch.sample 424B

commit-msg.sample 896B

pre-rebase.sample 5KB

update.sample 4KB

post-update.sample 189B

config 312B

description 73B

refs

tags

heads

master 41B

remotes

origin

HEAD 32B

logs

refs

heads

master 193B

remotes

origin

HEAD 193B

packed-refs 172B

objects

info

pack

pack-11460aa070e3cc9017c3053411454be3a87005dc.idx 6KB

pack-11460aa070e3cc9017c3053411454be3a87005dc.pack 54KB

info

exclude 240B

HEAD 23B

tests

test_parsers.py 2KB

test_visualization.py 906B

test_feature_extraction.py 772B

test_summarizers.py 8KB

tuning.py 8KB

.idea

misc.xml 288B

vcs.xml 180B

modules.xml 288B

news_summarization.iml 459B

workspace.xml 7KB

inspectionProfiles

setup.py 776B

newssum

models

core.py 9KB

__init__.py 24B

graph.py 1KB

definitions.py 166B

utils.py 3KB

summarizers

core_rank.py 9KB

info_filter.py 11KB

__init__.py 31B

feature_extraction

matrix.py 3KB

sentence_feature.py 25KB

__init__.py 157B

sentence_importance_detector.py 1KB

__init__.py 0B

evaluation

rouge.py 2KB

__init__.py 24B

parsers

parser.py 2KB

plaintext.py 633B

__init__.py 104B

story.py 2KB

tweets.py 2KB

.gitignore 1KB

README.md 7KB

# Automatic text summarizer for news

Simple library for extracting summaries from the [Deepmind news dataset](https://cs.nyu.edu/~kcho/DMQA/) or plain texts. The package also contains a simple evaluation framework for text summaries.

Inspired by:

- **CoreRank** - [Combining Graph Degeneracy and Submodularity for Unsupervised Extractive Summarization](http://www.aclweb.org/anthology/W17-4507)
- **GoWvis** - [GoWvis: a web application for Graph-of-Words-based text visualization and summarization](http://www.aclweb.org/anthology/P16-4026)
- **RASR** - [A Redundancy-Aware Sentence Regression Framework for Extractive Summarization](http://www.aclweb.org/anthology/C16-1004)
- **Kam-Fai** Supervised method - [Extractive Summarization Using Supervised and Semi-supervised Learning](http://www.aclweb.org/anthology/C08-1124)
- **InfoFilter** - [Detecting (Un)Important Content for Single-Document News Summarization](http://aclweb.org/anthology/E17-2112)

## Installation ##

```bash
python setup.py install
```

## Usage ##

```python
from newssum.parsers import PlaintextParser
from newssum.summarizers import CoreRank

TEXT = "Thomas appeared in 15 games (14 starts) for Cleveland this season, averaging 14.7 points, 4.5 assists and 2.1 rebounds in 27.1 minutes. The two-time NBA All-Star (2015-17) owns career averages of 19.0 points (.441 FG%), 5.1 assists, 2.6 rebounds and 1.0 steals in 456 career games (323 starts). In 2016-17, Thomas earned All-NBA Second Team honors when he averaged a career-high 28.9 points (.463 FG%) per game."

if __name__ == "__main__":
    parser = PlaintextParser(TEXT)
    cr_summarizer = CoreRank(parser)
    summary = cr_summarizer.get_best_sents(w_threshold=25)
    print(summary)
```

## InfoRank Features Introduction ##

#### 1. Surface Features ####

Surface features are based on the structure of documents or sentences.

| Name       | Description                                     |
|------------|-------------------------------------------------|
| Position   | 1/sentence no.                                  |
| Doc_First  | Whether it is the first sentence of a document  |
| Para_First | Whether it is the first sentence of a paragraph |
| Length     | The number of words in a sentence               |
| Quote      | The number of quoted words in a sentence        |

#### 2. Content Features ####

| Name       | Description                                     |
|------------|-------------------------------------------------|
| Position   | 1/sentence no.                                  |
| Doc_First  | Whether it is the first sentence of a document  |
| Para_First | Whether it is the first sentence of a paragraph |
| Length     | The number of words in a sentence               |
| Quote      | The number of quoted words in a sentence        |

#### 3. Relevance Features ####

Relevance features are incorporated to exploit inter-sentence relationships.

| Name       | Description                                     |
|------------|-------------------------------------------------|
| Position   | 1/sentence no.                                  |
| Doc_First  | Whether it is the first sentence of a document  |
| Para_First | Whether it is the first sentence of a paragraph |
| Length     | The number of words in a sentence               |
| Quote      | The number of quoted words in a sentence        |

## CoreRank Notes ##

The CoreRank algorithm needs the core number of each vertex with the weight of each edge taken into account, which NetworkX itself does not do, so the NetworkX source code had to be modified. The modified function is placed at `$news_summarization_INSTALLATION_HOME/newssum/models/core.py`, so you don't need to modify the source code of your NetworkX installation, which could cause errors when using NetworkX for other jobs.

```python
import networkx as nx

# Note: the *_iter methods, G.number_of_selfloops() and G.selfloop_edges()
# used below belong to the NetworkX 1.x API.


def core_number(G, weight=None):
    """Return the core number for each vertex.

    A k-core is a maximal subgraph that contains nodes of degree k or more.

    The core number of a node is the largest value k of a k-core containing
    that node.

    Parameters
    ----------
    G : NetworkX graph
       A graph or directed graph
    weight : string or None, optional
       Edge attribute used to compute the (weighted) degree.

    Returns
    -------
    core_number : dictionary
       A dictionary keyed by node to the core number.

    Raises
    ------
    NetworkXError
        The k-core is not defined for graphs with self loops or parallel edges.

    Notes
    -----
    Not implemented for graphs with parallel edges or self loops.

    For directed graphs the node degree is defined to be the
    in-degree + out-degree.

    References
    ----------
    .. [1] An O(m) Algorithm for Cores Decomposition of Networks
       Vladimir Batagelj and Matjaz Zaversnik, 2003.
       http://arxiv.org/abs/cs.DS/0310049
    """
    if G.is_multigraph():
        raise nx.NetworkXError(
            'MultiGraph and MultiDiGraph types not supported.')

    if G.number_of_selfloops() > 0:
        raise nx.NetworkXError(
            'Input graph has self loops; the core number is not defined. '
            'Consider using G.remove_edges_from(G.selfloop_edges()).')

    if G.is_directed():
        import itertools

        def neighbors(v):
            return itertools.chain.from_iterable([G.predecessors_iter(v),
                                                  G.successors_iter(v)])
    else:
        neighbors = G.neighbors_iter

    # modified start
    degrees = G.degree(weight=weight)
    if weight:
        # truncate weighted degrees to integers so they can index the bins
        for k in degrees:
            degrees[k] = int(degrees[k])
    # modified end

    # sort nodes by degree
    nodes = sorted(degrees, key=degrees.get)
    bin_boundaries = [0]
    curr_degree = 0
    for i, v in enumerate(nodes):
        if degrees[v] > curr_degree:
            bin_boundaries.extend([i] * (degrees[v] - curr_degree))
            curr_degree = degrees[v]
    node_pos = dict((v, pos) for pos, v in enumerate(nodes))
    # initial guesses for core is degree
    core = degrees
    nbrs = dict((v, set(neighbors(v))) for v in G)
    for v in nodes:
        for u in nbrs[v]:
            if core[u] > core[v]:
                nbrs[u].remove(v)
                pos = node_pos[u]
                bin_start = bin_boundaries[core[u]]
                node_pos[u] = bin_start
                node_pos[nodes[bin_start]] = pos
                nodes[bin_start], nodes[pos] = nodes[pos], nodes[bin_start]
                bin_boundaries[core[u]] += 1
                core[u] -= 1
    return core
```

## Complete Project Structure ##

```
├───.idea
├───build
├───data               <- The original, immutable data dump.
├───dist
├───external
├───figures            <- Figures saved by notebooks and scripts.
├───newssum            <- Python package with source code.
│   ├───evaluation
│   ├───feature_extraction
│   ├───models
│   ├───parsers
│   ├───summarizers
├───newssum.egg-info
├───notebooks
├───output             <- Processed data, models, logs, etc.
├───tests              <- Tests for Python package.
├── README.md          <- README with info of the project.
├── server.py          <- Simple server for online demo.
└── setup.py           <- Install and distribute module.
```
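The CoreRank notes describe computing core numbers from weighted degrees. As a point of comparison, here is a dependency-free sketch of a weighted k-core decomposition that repeatedly removes the node with the smallest remaining weighted degree. It is not the project's `core.py`: the function name `weighted_core_numbers` and the edge-triple input format are assumptions made for illustration, and unlike the README snippet, which lowers a neighbor's value by one per removed node, this sketch subtracts the weight of the removed edge.

```python
import heapq


def weighted_core_numbers(edges):
    """Return {node: weighted core number} for an undirected weighted graph.

    `edges` is an iterable of (u, v, weight) triples with positive weights
    and no self loops or parallel edges. Nodes must be mutually comparable
    (for example, all strings) because they are used as heap tie-breakers.
    """
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w

    # Weighted degree of every node with respect to the remaining graph.
    degree = {node: sum(nbrs.values()) for node, nbrs in adj.items()}
    heap = [(d, node) for node, d in degree.items()]
    heapq.heapify(heap)

    core = {}
    current = 0
    while heap:
        d, node = heapq.heappop(heap)
        if node in core or d != degree[node]:
            continue  # stale heap entry left over from an earlier update
        # Core numbers never decrease along the removal order.
        current = max(current, d)
        core[node] = current
        for nbr, w in adj[node].items():
            if nbr not in core:
                degree[nbr] -= w
                heapq.heappush(heap, (degree[nbr], nbr))
    return core


if __name__ == "__main__":
    # A triangle with edge weight 2 plus one pendant node attached by weight 1.
    graph_of_words = [
        ("a", "b", 2), ("b", "c", 2), ("a", "c", 2), ("a", "d", 1),
    ]
    print(weighted_core_numbers(graph_of_words))
```

In a graph-of-words setting the nodes would be terms and the weights co-occurrence counts, so terms inside densely connected, heavily weighted regions end up with the highest core numbers.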
