Holmes
======
Author: <a href="mailto:richard.hudson@msg.group">Richard Paul Hudson, msg systems ag</a>
- [1. Introduction](#introduction)
- [1.1 The basic idea](#the-basic-idea)
- [1.2 Installation](#installation)
- [1.2.1 Prerequisites](#prerequisites)
- [1.2.2 Library installation](#library-installation)
- [1.2.3 Installing the spaCy models](#installing-the-spacy-models)
- [1.2.4 Comments about deploying Holmes in an enterprise environment](#comments-about-deploying-holmes-in-an-enterprise-environment)
- [1.3 Getting started](#getting-started)
- [2. Word-level matching strategies](#word-level-matching-strategies)
- [2.1 Direct matching](#direct-matching)
- [2.2 Named entity matching](#named-entity-matching)
- [2.3 Ontology-based matching](#ontology-based-matching)
- [2.4 Embedding-based matching](#embedding-based-matching)
- [3. Coreference resolution](#coreference-resolution)
- [4. Writing effective search phrases](#writing-effective-search-phrases)
- [4.1 General comments](#general-comments)
- [4.1.1 Lexical versus grammatical words](#lexical-versus-grammatical-words)
- [4.1.2 Use of the present active](#use-of-the-present-active)
- [4.1.3 Generic pronouns](#generic-pronouns)
- [4.1.4 Prepositions](#prepositions)
- [4.2 Structures not permitted in search phrases](#structures-not-permitted-in-search-phrases)
- [4.2.1 Multiple clauses](#multiple-clauses)
- [4.2.2 Negation](#negation)
- [4.2.3 Conjunction](#conjunction)
- [4.2.4 Lack of lexical words](#lack-of-lexical-words)
- [4.2.5 Coreferring pronouns](#coreferring-pronouns)
- [4.3 Structures strongly discouraged in search phrases](#structures-strongly-discouraged-in-search-phrases)
- [4.3.1 Ungrammatical expressions](#ungrammatical-expressions)
- [4.3.2 Complex verb tenses](#complex-verb-tenses)
- [4.3.3 Questions](#questions)
- [4.4 Structures to be used with caution in search phrases](#structures-to-be-used-with-caution-in-search-phrases)
- [4.4.1 Very complex structures](#very-complex-structures)
- [4.4.2 Deverbal noun phrases](#deverbal-noun-phrases)
- [5. Use cases and examples](#use-cases-and-examples)
- [5.1 Chatbot](#chatbot)
- [5.2 Structural matching](#structural-matching)
- [5.3 Topic matching](#topic-matching)
- [5.4 Supervised document classification](#supervised-document-classification)
- [6 Interfaces intended for public use](#interfaces-intended-for-public-use)
- [6.1 `Manager`](#manager)
- [6.2 `Ontology`](#ontology)
- [6.3 `SupervisedTopicTrainingBasis`](#supervised-topic-training-basis)
(returned from `Manager.get_supervised_topic_training_basis()`)
- [6.4 `SupervisedTopicModelTrainer`](#supervised-topic-model-trainer)
(returned from `SupervisedTopicTrainingBasis.train()`)
- [6.5 `SupervisedTopicClassifier`](#supervised-topic-classifier)
(returned from `SupervisedTopicModelTrainer.classifier()` and
`Manager.deserialize_supervised_topic_classifier()`)
- [6.6 `Match` (returned from `Manager.match()`)](#match)
- [6.7 `WordMatch` (returned from `Manager.match().word_matches`)](#wordmatch)
- [6.8 Dictionary (returned from `Manager.match_returning_dictionaries()`)](#dictionary)
- [6.9 `TopicMatch`](#topic-match)
(returned from `Manager.topic_match_documents_against()`)
- [7 A note on the license](#a-note-on-the-license)
- [8 Information for developers](#information-for-developers)
- [8.1 How it works](#how-it-works)
- [8.1.1 Structural matching](#how-it-works-structural-matching)
- [8.1.2 Topic matching](#how-it-works-topic-matching)
- [8.1.3 Supervised document classification](#how-it-works-supervised-document-classification)
- [8.2 Development and testing guidelines](#development-and-testing-guidelines)
- [8.3 Areas for further development](#areas-for-further-development)
- [8.3.1 Incorporation into the spaCy multithreading architecture](#incorporation-into-the-spacy-multithreading-architecture)
- [8.3.2 Additional languages](#additional-languages)
- [8.3.3 Use of machine learning to improve matching](#use-of-machine-learning-to-improve-matching)
- [8.3.4 Upgrade to latest library versions](#upgrade-to-latest-library-versions)
- [8.3.5 Remove names from supervised document classification models](#remove-names-from-supervised-document-classification-models)
- [8.3.6 Improve the performance of supervised document classification training](#improve-performance-of-supervised-document-classification-training)
- [8.3.7 Explore the optimal hyperparameters for topic matching and supervised document classification](#explore-hyperparameters)
<a id="introduction"></a>
### 1. Introduction
<a id="the-basic-idea"></a>
#### 1.1 The basic idea
**Holmes** is a Python 3 library (tested with version 3.7.2) that supports a number of
use cases involving information extraction from English and German texts. In all use cases, the information extraction
is based on analysing the semantic relationships expressed by the component parts of each sentence:
- In the [chatbot](#getting-started) use case, the system is configured using one or more **search phrases**.
Holmes then looks for structures whose meanings correspond to those of these search phrases within
a searched **document**, which in this case corresponds to an individual snippet of text or speech
entered or uttered by the end user. Within a match, each non-grammatical word in the search phrase
corresponds to one or more non-grammatical words in the document, which can then be extracted as structured information.
- The [structural matching](#structural-matching) use case uses exactly the same technological basis as the chatbot use
case, but searching takes place with respect to a pre-existing document or documents that are typically much
longer than the snippets analysed in the chatbot use case.
- The [topic matching](#topic-matching) use case aims to find passages in a document or documents whose meaning
is close to that of another document, which takes on the role of the **query document**, or to that of a
**query phrase** entered ad hoc by the user. Holmes extracts a number of small **phraselets** from the query phrase or
query document, matches the documents being searched against each phraselet, and conflates the results to find
the most relevant passages within the documents. Because there is no strict requirement that every non-grammatical
word in the query document match a specific word or words in the searched documents, more matches are found
than in the structural matching use case, but the matches do not contain structured information that can be
used in subsequent processing.
- The [supervised document classification](#supervised-document-classification) use case uses training data to
learn a classifier that assigns one or more **classification labels** to new documents based on what they are about.
It classifies a new document by matching it against phraselets that were extracted from the training documents in the
same way that phraselets are extracted from the query document in the topic matching use case. The technique is
inspired by bag-of-words-based classification algorithms that use n-grams, but aims to derive n-grams whose component
words are related semantically rather than words that merely happen to be neighbours in the surface representation of a language.
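
As a rough illustration of the difference between surface n-grams and semantically related word pairs, the sketch below compares two sentences that share a meaning but differ in word order. It does not use Holmes or spaCy: the lemmas and the head/dependent indices are written by hand as an assumption, and the function names `surface_bigrams` and `relation_pairs` are hypothetical helpers, not part of the Holmes API.

```python
from typing import List, Set, Tuple

Pair = Tuple[str, str]


def surface_bigrams(lemmas: List[str]) -> Set[Pair]:
    """Return pairs of lemmas that are adjacent in the surface string."""
    return set(zip(lemmas, lemmas[1:]))


def relation_pairs(lemmas: List[str], edges: List[Tuple[int, int]]) -> Set[Pair]:
    """Return (head, dependent) lemma pairs for hand-supplied relation edges."""
    return {(lemmas[head], lemmas[dep]) for head, dep in edges}


# Assumption: lemmatised content words and relations are hard-coded here.
# A real system would obtain them from a parser.

# "A dog chases a cat."
short_lemmas = ["dog", "chase", "cat"]
short_edges = [(1, 0), (1, 2)]  # chase -> dog, chase -> cat

# "The dog often chases the neighbour's cat."
long_lemmas = ["dog", "often", "chase", "neighbour", "cat"]
long_edges = [(2, 0), (2, 1), (2, 4), (4, 3)]  # chase -> dog/often/cat, cat -> neighbour

# Adjacent-word pairs are broken up by the extra words ...
shared_bigrams = surface_bigrams(short_lemmas) & surface_bigrams(long_lemmas)

# ... whereas the relation-based pairs survive them.
shared_relations = (
    relation_pairs(short_lemmas, short_edges)
    & relation_pairs(long_lemmas, long_edges)
)

print(sorted(shared_bigrams))    # []
print(sorted(shared_relations))  # [('chase', 'cat'), ('chase', 'dog')]
```

In this toy setting the two sentences share no surface bigram, yet both relation-based pairs from the shorter sentence are also found in the longer one.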
In all four use cases, the **individual