fscrawler-2.4 resource - CSDN Library

153 files in total

jar: 148

notice: 1

bat: 1


Uploaded 2018-06-21 17:19:32 · 68.34MB ZIP


fscrawler-2.4 (153 files)

fscrawler.bat 1KB

fscrawler 1KB

elasticsearch-6.0.0-beta1.jar 9.22MB

sqlite-jdbc-3.19.3.jar 5.93MB

poi-ooxml-schemas-3.17-beta1.jar 5.64MB

bcprov-jdk15on-1.54.jar 3.13MB

lucene-core-7.0.0-snapshot-00142c9.jar 2.65MB

xmlbeans-2.6.0.jar 2.6MB

poi-3.17-beta1.jar 2.57MB

pdfbox-2.0.6.jar 2.35MB

guava-16.0.1.jar 2.12MB

elasticsearch-rest-client-6.0.0-beta1.jar 1.91MB

language-detector-0.5.jar 1.79MB

lucene-analyzers-common-7.0.0-snapshot-00142c9.jar 1.44MB

fontbox-2.0.6.jar 1.41MB

poi-ooxml-3.17-beta1.jar 1.41MB

log4j-core-2.8.1.jar 1.34MB

poi-scratchpad-3.17-beta1.jar 1.32MB

jackson-databind-2.8.6.jar 1.18MB

hppc-0.7.1.jar 1.09MB

opennlp-tools-1.6.0.jar 1.04MB

isoparser-1.1.18.jar 1.01MB

tika-parsers-1.16.jar 1MB

jersey-guava-2.25.1.jar 949KB

jersey-server-2.25.1.jar 919KB

grizzly-framework-2.3.28.jar 880KB

jackcess-2.1.8.jar 842KB

commons-collections4-4.1.jar 734KB

javassist-3.20.0-GA.jar 733KB

jersey-common-2.25.1.jar 699KB

bcpkix-jdk15on-1.54.jar 658KB

tika-core-1.16.jar 632KB

sis-utility-0.6.jar 630KB

joda-time-2.9.5.jar 617KB

sis-referencing-0.6.jar 598KB

jai-imageio-core-1.3.1.jar 588KB

sis-metadata-0.6.jar 534KB

commons-compress-1.14.jar 518KB

jna-4.4.0-1.jar 508KB

woodstox-core-5.0.3.jar 501KB

jai-imageio-jpeg2000-1.3.0.jar 450KB

commons-vfs2-2.0.jar 406KB

lucene-queryparser-7.0.0-snapshot-00142c9.jar 375KB

metadata-extractor-2.9.1.jar 369KB

grizzly-http-2.3.28.jar 331KB

apache-mime4j-dom-0.8.1.jar 321KB

grizzly-http-server-2.3.28.jar 281KB

commons-lang-2.6.jar 278KB

commons-codec-1.10.jar 278KB

jackson-core-2.8.6.jar 275KB

jsch-0.1.53.jar 274KB

snakeyaml-1.15.jar 263KB

jsr-275-0.9.3.jar 254KB

plexus-utils-1.5.6.jar 245KB

lucene-suggest-7.0.0-snapshot-00142c9.jar 243KB

rome-1.5.1.jar 237KB

lucene-queries-7.0.0-snapshot-00142c9.jar 237KB

lucene-spatial3d-7.0.0-snapshot-00142c9.jar 232KB

pdfbox-debugger-2.0.6.jar 230KB

jhighlight-1.0.2.jar 229KB

gson-2.8.1.jar 227KB

log4j-api-2.8.1.jar 223KB

juniversalchardet-1.0.3.jar 216KB

geoapi-3.0.0.jar 209KB

commons-io-2.5.jar 204KB

lucene-spatial-extras-7.0.0-snapshot-00142c9.jar 198KB

lucene-highlighter-7.0.0-snapshot-00142c9.jar 193KB

hk2-locator-2.5.0-b32.jar 183KB

hk2-api-2.5.0-b32.jar 181KB

jersey-client-2.25.1.jar 165KB

stax2-api-3.1.4.jar 158KB

jsonic-1.2.11.jar 158KB

lucene-sandbox-7.0.0-snapshot-00142c9.jar 154KB

junrar-0.7.jar 152KB

jansi-1.13.jar 148KB

fscrawler-2.4.jar 146KB

lucene-join-7.0.0-snapshot-00142c9.jar 143KB

levigo-jbig2-imageio-2.0.jar 136KB

hk2-utils-2.5.0-b32.jar 132KB

vorbis-java-core-0.8.jar 118KB

xmpcore-5.1.2.jar 115KB

javax.ws.rs-api-2.0.1.jar 113KB

HdrHistogram-2.1.9.jar 111KB

apache-mime4j-core-0.8.1.jar 101KB

xz-1.6.jar 101KB

bcmail-jdk15on-1.54.jar 100KB

lucene-backward-codecs-7.0.0-snapshot-00142c9.jar 97KB

curvesapi-1.04.jar 96KB

jackson-dataformat-xml-2.8.6.jar 93KB

maven-scm-api-1.4.jar 92KB

boilerpipe-1.1.0.jar 90KB

lucene-misc-7.0.0-snapshot-00142c9.jar 90KB

tagsoup-1.2.1.jar 89KB

jackson-datatype-jsr310-2.8.6.jar 88KB

lucene-grouping-7.0.0-snapshot-00142c9.jar 85KB

jackcess-encrypt-2.1.2.jar 84KB

java-libpst-0.8.1.jar 83KB

jopt-simple-5.0.2.jar 76KB

parent-join-client-6.0.0-beta1.jar 74KB

jackson-dataformat-smile-2.8.6.jar 72KB


# FileSystem Crawler for Elasticsearch

Welcome to the FS Crawler for [Elasticsearch](https://elastic.co/).

This crawler helps to index documents from your local file system and over SSH. It crawls your file system and indexes new files, updates existing ones, and removes old ones.

You need to install a version matching your Elasticsearch version:

| Elasticsearch | FS Crawler   | Released   | Docs |
|---------------|--------------|------------|------|
| 2.x, 5.x, 6.x | 2.4-SNAPSHOT |            | See below |
| 2.x, 5.x, 6.x | **2.3**      | 2017-07-10 | [2.3](https://github.com/dadoonet/fscrawler/blob/fscrawler-2.3/README.md) |
| 1.x, 2.x, 5.x | 2.2          | 2017-02-03 | [2.2](https://github.com/dadoonet/fscrawler/blob/fscrawler-2.2/README.md) |
| 1.x, 2.x, 5.x | 2.1          | 2016-07-26 | [2.1](https://github.com/dadoonet/fscrawler/blob/fscrawler-2.1/README.md) |
| es-2.0        | 2.0.0        | 2015-10-30 | [2.0.0](https://github.com/dadoonet/fscrawler/blob/fscrawler-2.0.0/README.md) |

## Build Status

Thanks to Travis for the [build status](https://travis-ci.org/dadoonet/fscrawler): [![Build Status](https://travis-ci.org/dadoonet/fscrawler.svg)](https://travis-ci.org/dadoonet/fscrawler)

# Table of contents

* [Installation guide](#installation-guide)
* [Download fscrawler](#download-fscrawler)
* [Upgrade fscrawler](#upgrade-fscrawler)
* [User guide](#user-guide)
* [Getting Started](#getting-started)
* [Searching for docs](#searching-for-docs)
* [Crawler options](#crawler-options)
* [Starting with a REST gateway](#starting-with-a-rest-gateway)
* [Supported formats](#supported-formats)
* [Administration guide](#administration-guide)
* [CLI options](#cli-options)
* [JVM Settings](#jvm-settings)
* [Job file specification](#job-file-specification)
* [Local FS settings](#local-fs-settings)
* [SSH settings](#ssh-settings)
* [Elasticsearch settings](#elasticsearch-settings)
* [REST service](#rest-service)
* [Tips and tricks](#tips-and-tricks)
* [License](#license)

# Installation Guide

## Download fscrawler

The FS Crawler binary is available on [Maven Central](https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler/). Just download the latest release (or any other specific version you want to try). The filename ends with `.zip`.

For example, if you wish to download [fscrawler-2.3](https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler/2.3/fscrawler-2.3.zip):

```sh
wget https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler/2.3/fscrawler-2.3.zip
unzip fscrawler-2.3.zip
```

The distribution contains:

```
$ tree
.
├── LICENSE
├── NOTICE
├── README.md
├── bin
│   ├── fscrawler
│   └── fscrawler.bat
└── lib
    ├── ... All needed jars
```

Note that you can also download a SNAPSHOT version [from sonatype](https://oss.sonatype.org/content/repositories/snapshots/fr/pilato/elasticsearch/crawler/fscrawler/2.3-SNAPSHOT/) without needing to build it yourself.

## Upgrade fscrawler

You may need to [upgrade a mapping](#upgrading-an-existing-mapping) before starting fscrawler after a version upgrade. Read the following update instructions carefully.

To update fscrawler, just download the new version, unzip it in another directory and launch it as usual. It will still pick up settings from the configuration directory. Of course, you first need to stop any running instances.

### Upgrade to 2.2

* fscrawler comes with new default mappings for files. They have better defaults as they consume less disk space and CPU at index time. You should remove existing files in `~/.fscrawler/_default/_mappings` before starting the new version so default mappings will be updated. If you manually modified mapping files, reapply your modifications to the new sample files.
* `excludes` is now set by default for new jobs to `["~*"]`. In previous versions, any file or directory containing a `~` was excluded.
This means that if your jobs define any exclusion rule, you need to add `*~*` to get back the exact previous behavior.
* If you were indexing `json` or `xml` documents with the `filename_as_id` option set, the file name suffix was previously removed: `1.json` was indexed as `1`. This version no longer removes the suffix, so the `_id` of your document will now be `1.json`.

### Upgrade to 2.3

* fscrawler comes with a new mapping for folders. The change is really tiny so you can skip this step if you wish. We basically removed the `name` field in the folder mapping as it was unused.
* The way FSCrawler computes `path.virtual` for docs has changed. It now includes the filename. Instead of `/path/to` you will now get `/path/to/file.txt`.
* The way FSCrawler computes `virtual` for folders is now consistent with what you can see for folders.
* `path.encoded` in documents and `encoded` in folders have been removed as FSCrawler does not need them after all.
* [OCR](#ocr-integration) is now properly activated for PDF documents. This can be time, CPU and memory consuming though. You can explicitly disable it by setting `fs.pdf_ocr` to `false`.
* All dates are now indexed in elasticsearch in UTC instead of without any time zone. For example, a date was previously indexed as `2017-05-19T13:24:47.000`, which produced bad results when you were located in a time zone other than UTC. It's now indexed as `2017-05-19T13:24:47.000+0000`.
* In order to be compatible with the coming 6.0 elasticsearch version, we need to get rid of types as only one type per index is still supported. This means that we now create two indices named `job_name` and `job_name_folder` instead of one index `job_name` with two types `doc` and `folder`.
If you are upgrading from FSCrawler 2.2, you need to reindex your existing data, either by deleting the old index and running FSCrawler again or by using the [reindex API](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html) as follows:

```
# Create folder index job_name_folder based on existing folder data
POST _reindex
{
  "source": {
    "index": "job_name",
    "type": "folder"
  },
  "dest": {
    "index": "job_name_folder"
  }
}
# Remove old folder data from job_name index
POST job_name/folder/_delete_by_query
{
  "query": {
    "match_all": {}
  }
}
```

Note that you first need to create the right settings and mappings so you can then run the reindex job. You can do that by launching `bin/fscrawler job_name --loop 0`.

Better, you can run `bin/fscrawler job_name --upgrade` and let FSCrawler do all that for you. Note that this can take a long time.

Also be aware that some APIs used by the upgrade action are only available from elasticsearch 2.3 (reindex) or elasticsearch 5.0 (delete by query). If you are running a version older than 5.0, you first need to upgrade elasticsearch.

This procedure only applies if you did not previously set the `elasticsearch.type` setting (default value was `doc`). If you did, then you also need to reindex the existing documents to the default `doc` type as per elasticsearch 6.0:

```
# Copy old type doc to the default doc type
POST _reindex
{
  "source": {
    "index": "job_name",
    "type": "your_type_here"
  },
  "dest": {
    "index": "job_name",
    "type": "doc"
  }
}
# Remove old type data from job_name index
POST job_name/your_type_here/_delete_by_query
{
  "query": {
    "match_all": {}
  }
}
```

But note that this last step can take a very long time and will generate a lot of IO on your disk. In that case, it might be easier to restart fscrawler from scratch.
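The folder migration requests from the first console example can also be scripted. The following is a minimal sketch and not part of FSCrawler: the helper names `folder_migration_requests` and `send` are hypothetical, and the Elasticsearch address passed to `send` is an assumption you would have to supply. It builds the same two request bodies as that console example and, only if `send` is called, posts them with the Python standard library.

```python
import json
import urllib.request


def folder_migration_requests(job_name):
    """Return (method, path, body) tuples mirroring the two console requests."""
    # Copy documents of the old "folder" type into the new <job_name>_folder index.
    reindex = {
        "source": {"index": job_name, "type": "folder"},
        "dest": {"index": f"{job_name}_folder"},
    }
    # Then remove the old folder documents from the original index.
    delete_old = {"query": {"match_all": {}}}
    return [
        ("POST", "/_reindex", reindex),
        ("POST", f"/{job_name}/folder/_delete_by_query", delete_old),
    ]


def send(base_url, method, path, body):
    """Send one JSON request; base_url is an assumed cluster address."""
    request = urllib.request.Request(
        base_url + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Print the requests instead of sending them, so nothing is modified by accident.
    for method, path, body in folder_migration_requests("job_name"):
        print(method, path, json.dumps(body))
```

Printing the requests first makes it easy to compare them with the console examples before anything is sent. As noted in the text, the target settings and mappings must exist before the reindex request is run.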

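The `excludes` change described under "Upgrade to 2.2" comes down to the difference between the patterns `~*` and `*~*`. As a rough illustration only, the sketch below uses Python's `fnmatch` module as a stand-in for FSCrawler's own wildcard matching (an assumption: FSCrawler's matcher may differ in details). `is_excluded` and the sample file names are hypothetical.

```python
from fnmatch import fnmatchcase


def is_excluded(filename, excludes):
    """Return True if filename matches any wildcard pattern in excludes."""
    return any(fnmatchcase(filename, pattern) for pattern in excludes)


new_default = ["~*"]    # default for new jobs since 2.2: names starting with "~"
old_behavior = ["*~*"]  # pattern to add to get the pre-2.2 behavior: names containing "~"

for name in ["~$budget.xlsx", "notes~backup.txt", "report.pdf"]:
    print(name, is_excluded(name, new_default), is_excluded(name, old_behavior))
```

A name that starts with `~` matches both patterns, while a name that merely contains `~` matches only `*~*`. That is why jobs defining their own exclusion rules need `*~*` added to keep the previous behavior.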