# FileSystem Crawler for Elasticsearch
Welcome to the FS Crawler for [Elasticsearch](https://elastic.co/).
This crawler helps you index documents from your local file system and over SSH.
It crawls your file system, indexes new files, updates existing ones and removes old ones.
You need to install a version matching your Elasticsearch version:
| Elasticsearch | FS Crawler | Released | Docs |
|--------------------|-------------|----------|------------------------------------------------------------------------------|
| 2.x, 5.x, 6.x | 2.4-SNAPSHOT| |See below |
| 2.x, 5.x, 6.x | **2.3** |2017-07-10|[2.3](https://github.com/dadoonet/fscrawler/blob/fscrawler-2.3/README.md) |
| 1.x, 2.x, 5.x | 2.2 |2017-02-03|[2.2](https://github.com/dadoonet/fscrawler/blob/fscrawler-2.2/README.md) |
| 1.x, 2.x, 5.x | 2.1 |2016-07-26|[2.1](https://github.com/dadoonet/fscrawler/blob/fscrawler-2.1/README.md) |
| es-2.0 | 2.0.0 |2015-10-30|[2.0.0](https://github.com/dadoonet/fscrawler/blob/fscrawler-2.0.0/README.md) |
## Build Status
Thanks to Travis for the [build status](https://travis-ci.org/dadoonet/fscrawler):
[![Build Status](https://travis-ci.org/dadoonet/fscrawler.svg)](https://travis-ci.org/dadoonet/fscrawler)
# Table of contents
* [Installation guide](#installation-guide)
* [Download fscrawler](#download-fscrawler)
* [Upgrade fscrawler](#upgrade-fscrawler)
* [User guide](#user-guide)
* [Getting Started](#getting-started)
* [Searching for docs](#searching-for-docs)
* [Crawler options](#crawler-options)
* [Starting with a REST gateway](#starting-with-a-rest-gateway)
* [Supported formats](#supported-formats)
* [Administration guide](#administration-guide)
* [CLI options](#cli-options)
* [JVM Settings](#jvm-settings)
* [Job file specification](#job-file-specification)
* [Local FS settings](#local-fs-settings)
* [SSH settings](#ssh-settings)
* [Elasticsearch settings](#elasticsearch-settings)
* [REST service](#rest-service)
* [Tips and tricks](#tips-and-tricks)
* [License](#license)
# Installation Guide
## Download fscrawler
The FS Crawler binary is available on [Maven Central](https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler/).
Just download the latest release (or any other specific version you want to try).
The filename ends with `.zip`.
For example, if you wish to download [fscrawler-2.3](https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler/2.3/fscrawler-2.3.zip):
```sh
wget https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler/2.3/fscrawler-2.3.zip
unzip fscrawler-2.3.zip
```
The distribution contains:
```
$ tree
.
├── LICENSE
├── NOTICE
├── README.md
├── bin
│   ├── fscrawler
│   └── fscrawler.bat
└── lib
    ├── ... All needed jars
```
Note that you can also download a SNAPSHOT version
[from Sonatype](https://oss.sonatype.org/content/repositories/snapshots/fr/pilato/elasticsearch/crawler/fscrawler/2.3-SNAPSHOT/)
without needing to build it yourself.
## Upgrade fscrawler
You may need to [upgrade a mapping](#upgrading-an-existing-mapping) before starting fscrawler after a
version upgrade.
Read the following update instructions carefully.
To update fscrawler, download the new version, unzip it in another directory and launch it as usual.
It will still pick up settings from the configuration directory. Stop any running instances
before you launch the new version.
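
The update steps can be sketched as a small shell helper. This is only an illustration: `fscrawler_zip_url` and `install_fscrawler` are hypothetical names, not part of fscrawler, and the URL simply follows the pattern of the 2.3 download link shown earlier. It assumes `wget` and `unzip` are installed.

```sh
# Hypothetical helper: prints the Maven Central URL of a given release,
# following the URL pattern of the 2.3 download link.
fscrawler_zip_url() {
  printf 'https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler/%s/fscrawler-%s.zip\n' "$1" "$1"
}

# Hypothetical helper: downloads a release and unzips it into its own
# directory, leaving any previously installed version untouched.
install_fscrawler() {
  version=$1
  target=$2
  mkdir -p "$target" &&
    wget -q -O "$target/fscrawler-$version.zip" "$(fscrawler_zip_url "$version")" &&
    unzip -q "$target/fscrawler-$version.zip" -d "$target"
}

# Usage (stop running instances first); the target directory is an example:
#   install_fscrawler 2.3 "$HOME/fscrawler-new"
```

After unzipping, launch `bin/fscrawler` from the new distribution as usual.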
### Upgrade to 2.2
* fscrawler comes with new default mappings for files. They have better defaults as they consume less disk space
and CPU at index time. Remove the existing files in `~/.fscrawler/_default/_mappings` before starting the new
version so that the default mappings are updated. If you modified the mapping files manually, reapply your changes
to the new sample files.
* `excludes` is now set by default to `["~*"]` for new jobs. In previous versions, any file or directory containing a
`~` was excluded. If your jobs define any exclusion rule, you need to add `*~*` to it
to get back the exact previous behavior.
* If you were indexing `json` or `xml` documents with the `filename_as_id` option set, the suffix of the file name
was previously removed: `1.json` was indexed as `1`. This version no longer removes the
suffix, so the `_id` of that document is now `1.json`.
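
To illustrate the difference between the `~*` and `*~*` exclusion patterns, here is a minimal sketch that uses the shell's own `case` pattern matching as a stand-in for FSCrawler's wildcard matching. That substitution is an assumption made for illustration only, and `is_excluded` is a hypothetical helper, not an FSCrawler command.

```sh
# Hypothetical helper: prints "excluded" when the name matches the pattern.
# Shell glob semantics are used here as a stand-in for FSCrawler's matcher.
is_excluded() {
  pattern=$1
  name=$2
  case "$name" in
    $pattern) echo "excluded" ;;
    *) echo "kept" ;;
  esac
}

is_excluded '~*' '~lock.docx'      # name starts with ~
is_excluded '~*' 'report~1.docx'   # ~ in the middle: not matched by ~*
is_excluded '*~*' 'report~1.docx'  # matched by the broader *~* pattern
```

Under these assumptions, only the broader `*~*` pattern catches a name that has `~` somewhere other than at the start.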
### Upgrade to 2.3
* fscrawler comes with a new mapping for folders. The change is tiny, so you can skip this step if you wish:
the unused `name` field was removed from the folder mapping.
* The way FSCrawler computes `path.virtual` for docs has changed. It now includes the filename.
Instead of `/path/to` you will now get `/path/to/file.txt`.
* The way FSCrawler computes `virtual` for folders is now consistent with what you can see for folders.
* `path.encoded` in documents and `encoded` in folders have been removed, as FSCrawler does not need them after all.
* [OCR](#ocr-integration) is now properly activated for PDF documents. This can consume a lot of time, CPU and memory, though.
You can explicitly disable it by setting `fs.pdf_ocr` to `false`.
* All dates are now indexed in Elasticsearch in UTC instead of without any time zone. For example, a date was
previously indexed as `2017-05-19T13:24:47.000`, which produced bad results if you were located in a time zone
other than UTC. It is now indexed as `2017-05-19T13:24:47.000+0000`.
* To be compatible with the upcoming Elasticsearch 6.0, we need to get rid of types, as only one type
per index is still supported. This means that FSCrawler now creates two indices named `job_name` and `job_name_folder` instead
of one index `job_name` with two types `doc` and `folder`. If you are upgrading from FSCrawler 2.2, you must
reindex your existing data, either by deleting the old index and running FSCrawler again, or by using the
[reindex API](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html) as follows:
```
# Create folder index job_name_folder based on existing folder data
POST _reindex
{
  "source": {
    "index": "job_name",
    "type": "folder"
  },
  "dest": {
    "index": "job_name_folder"
  }
}
# Remove old folder data from job_name index
POST job_name/folder/_delete_by_query
{
  "query": {
    "match_all": {}
  }
}
```
Note that you first need to create the right settings and mappings before you can run the reindex job.
You can do that by launching `bin/fscrawler job_name --loop 0`.
Better, you can run `bin/fscrawler job_name --upgrade` and let FSCrawler do all that for you. Note that this can take
a long time.
Also be aware that some APIs used by the upgrade action are only available from Elasticsearch 2.3 (reindex) or
Elasticsearch 5.0 (delete by query). If you are running a version older than 5.0, you need to upgrade Elasticsearch first.
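
The reindex and delete-by-query requests shown above can also be sent from a shell. The following is a minimal sketch, assuming Elasticsearch is reachable at `http://localhost:9200` without authentication and that `curl` is installed. `ES_URL`, `reindex_folder_body` and `migrate_folders` are hypothetical names introduced for illustration; they are not part of FSCrawler.

```sh
# Assumption: Elasticsearch is reachable here without authentication.
ES_URL=${ES_URL:-http://localhost:9200}

# Hypothetical helper: prints the _reindex body that copies the folder type
# of index "$1" into the new "$1_folder" index.
reindex_folder_body() {
  printf '{"source":{"index":"%s","type":"folder"},"dest":{"index":"%s_folder"}}\n' "$1" "$1"
}

# Hypothetical helper: sends both requests for the given job name.
migrate_folders() {
  job=$1
  curl -s -XPOST "$ES_URL/_reindex" \
    -H 'Content-Type: application/json' \
    -d "$(reindex_folder_body "$job")"
  curl -s -XPOST "$ES_URL/$job/folder/_delete_by_query" \
    -H 'Content-Type: application/json' \
    -d '{"query":{"match_all":{}}}'
}

# Usage (not run here; create the settings and mappings first):
#   migrate_folders job_name
```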
This procedure only applies if you did not previously set the `elasticsearch.type` setting (default value was `doc`).
If you did, then you also need to reindex the existing documents to the default `doc` type, as required by Elasticsearch 6.0:
```
# Copy old type doc to the default doc type
POST _reindex
{
  "source": {
    "index": "job_name",
    "type": "your_type_here"
  },
  "dest": {
    "index": "job_name",
    "type": "doc"
  }
}
# Remove old type data from job_name index
POST job_name/your_type_here/_delete_by_query
{
  "query": {
    "match_all": {}
  }
}
```
But note that this last step can take a very long time and will generate a lot of I/O on your disk.
In that case, it might be easier to restart fscrawler from scratch.