Download from PyPI | holmes-extractor-2.0.2.tar.gz - CSDN Wenku

Python library

Uploaded 2022-01-12 14:15:42, 92KB GZ

35 files in total

py: 27

txt: 4

pkg-info: 2

holmes-extractor-2.0.2.tar.gz (35 files)

holmes-extractor-2.0.2/
    setup.cfg 1KB
    README.md 97KB
    holmes_extractor/
        structural_matching.py 64KB
        errors.py 980B
        ontology.py 12KB
        tests/
            testing_utils.py 2KB
            test_supervised_topic_classification_EN.py 16KB
            test_ontology.py 9KB
            test_matching_modes.py 4KB
            test_phraselet_production_DE.py 4KB
            test_serialization.py 2KB
            test_structural_matching_with_coreference_EN.py 37KB
            test_topic_matching_EN.py 9KB
            test_topic_matching_DE.py 3KB
            test_errors.py 9KB
            test_structural_matching_EN.py 22KB
            test_word_level_matching.py 11KB
            __init__.py 0B
            test_phraselet_production_EN.py 11KB
            test_supervised_topic_classification_DE.py 14KB
            test_semantics_EN.py 38KB
            test_structural_matching_DE.py 20KB
            test_semantics_DE.py 26KB
        semantics.py 73KB
        extensive_matching.py 41KB
        consoles.py 10KB
        __init__.py 117B
        manager.py 16KB
    PKG-INFO 6KB
    setup.py 309B
    holmes_extractor.egg-info/
        dependency_links.txt 1B
        PKG-INFO 6KB
        SOURCES.txt 1KB
        top_level.txt 17B
        requires.txt 95B

Holmes
======
Author: <a href="mailto:richard.hudson@msg.group">Richard Paul Hudson, msg systems ag</a>

- [1. Introduction](#introduction)
    - [1.1 The basic idea](#the-basic-idea)
    - [1.2 Installation](#installation)
        - [1.2.1 Prerequisites](#prerequisites)
        - [1.2.2 Library installation](#library-installation)
        - [1.2.3 Installing the spaCy models](#installing-the-spacy-models)
        - [1.2.4 Comments about deploying Holmes in an enterprise environment](#comments-about-deploying-holmes-in-an-enterprise-environment)
    - [1.3 Getting started](#getting-started)
- [2. Word-level matching strategies](#word-level-matching-strategies)
    - [2.1 Direct matching](#direct-matching)
    - [2.2 Named entity matching](#named-entity-matching)
    - [2.3 Ontology-based matching](#ontology-based-matching)
    - [2.4 Embedding-based matching](#embedding-based-matching)
- [3. Coreference resolution](#coreference-resolution)
- [4. Writing effective search phrases](#writing-effective-search-phrases)
    - 4.1 [General comments](#general-comments)
        - [4.1.1 Lexical versus grammatical words](#lexical-versus-grammatical-words)
        - [4.1.2 Use of the present active](#use-of-the-present-active)
        - [4.1.3 Generic pronouns](#generic-pronouns)
        - [4.1.4 Prepositions](#prepositions)
    - [4.2 Structures not permitted in search phrases](#structures-not-permitted-in-search-phrases)
        - [4.2.1 Multiple clauses](#multiple-clauses)
        - [4.2.2 Negation](#negation)
        - [4.2.3 Conjunction](#conjunction)
        - [4.2.4 Lack of lexical words](#lack-of-lexical-words)
        - [4.2.5 Coreferring pronouns](#coreferring-pronouns)
    - [4.3 Structures strongly discouraged in search phrases](#structures-strongly-discouraged-in-search-phrases)
        - [4.3.1 Ungrammatical expressions](#ungrammatical-expressions)
        - [4.3.2 Complex verb tenses](#complex-verb-tenses)
        - [4.3.3 Questions](#questions)
    - [4.4 Structures to be used with caution in search phrases](#structures-to-be-used-with-caution-in-search-phrases)
        - [4.4.1 Very complex structures](#very-complex-structures)
        - [4.4.2 Deverbal noun phrases](#deverbal-noun-phrases)
- [5. Use cases and examples](#use-cases-and-examples)
    - [5.1 Chatbot](#chatbot)
    - [5.2 Structural matching](#structural-matching)
    - [5.3 Topic matching](#topic-matching)
    - [5.4 Supervised document classification](#supervised-document-classification)
- [6 Interfaces intended for public use](#interfaces-intended-for-public-use)
    - [6.1 `Manager`](#manager)
    - [6.2 `Ontology`](#ontology)
    - [6.3 `SupervisedTopicTrainingBasis`](#supervised-topic-training-basis) (returned from `Manager.get_supervised_topic_training_basis()`)
    - [6.4 `SupervisedTopicModelTrainer`](#supervised-topic-model-trainer) (returned from `SupervisedTopicTrainingBasis.train()`)
    - [6.5 `SupervisedTopicClassifier`](#supervised-topic-classifier) (returned from `SupervisedTopicModelTrainer.classifier()` and `Manager.deserialize_supervised_topic_classifier()`)
    - [6.6 `Match` (returned from `Manager.match()`)](#match)
    - [6.7 `WordMatch` (returned from `Manager.match().word_matches`)](#wordmatch)
    - [6.8 Dictionary returned from `Manager.match_returning_dictionaries()`](#dictionary)
    - [6.9 `TopicMatch`](#topic-match) (returned from `Manager.topic_match_documents_against()`)
- [7 A note on the license](#a-note-on-the-license)
- [8 Information for developers](#information-for-developers)
    - [8.1 How it works](#how-it-works)
        - [8.1.1 Structural matching](#how-it-works-structural-matching)
        - [8.1.2 Topic matching](#how-it-works-topic-matching)
        - [8.1.3 Supervised document classification](#how-it-works-supervised-document-classification)
    - [8.2 Development and testing guidelines](#development-and-testing-guidelines)
    - [8.3 Areas for further development](#areas-for-further-development)
        - [8.3.1 Incorporation into the spaCy multithreading architecture](#incorporation-into-the-spacy-multithreading-architecture)
        - [8.3.2 Additional languages](#additional-languages)
        - [8.3.3 Use of machine learning to improve matching](#use-of-machine-learning-to-improve-matching)
        - [8.3.4 Upgrade to latest library versions](#upgrade-to-latest-library-versions)
        - [8.3.5 Remove names from supervised document classification models](#remove-names-from-supervised-document-classification-models)
        - [8.3.6 Improve the performance of supervised document classification training](#improve-performance-of-supervised-document-classification-training)
        - [8.3.7 Explore the optimal hyperparameters for topic matching and supervised document classification](#explore-hyperparameters)

<a id="introduction"></a>
### 1. Introduction

<a id="the-basic-idea"></a>
#### 1.1 The basic idea

**Holmes** is a Python 3 library (tested with version 3.7.2) that supports a number of use cases involving information extraction from English and German texts. In all use cases, the information extraction is based on analysing the semantic relationships expressed by the component parts of each sentence:

- In the [chatbot](#getting-started) use case, the system is configured using one or more **search phrases**. Holmes then looks for structures whose meanings correspond to those of these search phrases within a searched **document**, which in this case corresponds to an individual snippet of text or speech entered or uttered by the end user. Within a match, each non-grammatical word in the search phrase corresponds to one or more non-grammatical words in the document, which can then be extracted as structured information.

- The [structural matching](#structural-matching) use case uses exactly the same technological basis as the chatbot use case, but searching takes place with respect to a pre-existing document or documents that are typically much longer than the snippets analysed in the chatbot use case.

- The [topic matching](#topic-matching) use case aims to find passages in a document or documents whose meaning is close to that of another document, which takes on the role of the **query document**, or to that of a **query phrase** entered ad hoc by the user. Holmes extracts a number of small **phraselets** from the query phrase or query document, matches the documents being searched against each phraselet, and conflates the results to find the most relevant passages within the documents. Because there is no strict requirement that every non-grammatical word in the query document match a specific word or words in the searched documents, more matches are found than in the structural matching use case, but the matches do not contain structured information that can be used in subsequent processing.

- The [supervised document classification](#supervised-document-classification) use case uses training data to learn a classifier that assigns one or more **classification labels** to new documents based on what they are about. It classifies a new document by matching it against phraselets that were extracted from the training documents in the same way that phraselets are extracted from the query document in the topic matching use case. The technique is inspired by bag-of-words-based classification algorithms that use n-grams, but aims to derive n-grams whose component words are related semantically rather than words that merely happen to be neighbours in the surface representation of a language.

In all four use cases, the **individual
