Welcome to the FS Crawler for Elasticsearch
This crawler helps index binary documents such as PDF, Open Office, and MS Office files.
Main features:
- Crawls a local file system (or a mounted drive): indexes new files, updates existing ones, and removes old ones.
- Crawls a remote file system over SSH.
- Provides a REST interface that lets you "upload" your binary documents to Elasticsearch.
You need to install a version matching your Elasticsearch version:
Elasticsearch | FS Crawler | Released | Docs |
---|---|---|---|
2.x, 5.x, 6.x | 2.5-SNAPSHOT | See below | |
2.x, 5.x, 6.x | 2.4 | 2017-08-11 | 2.4 |
2.x, 5.x, 6.x | 2.3 | 2017-07-10 | 2.3 |
1.x, 2.x, 5.x | 2.2 | 2017-02-03 | 2.2 |
1.x, 2.x, 5.x | 2.1 | 2016-07-26 | 2.1 |
es-2.0 | 2.0.0 | 2015-10-30 | 2.0.0 |
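As a quick illustration of how to read the table, the sketch below maps an Elasticsearch version string to the FS Crawler releases listed for its major version. The names `COMPATIBILITY` and `compatible_releases` are hypothetical helpers, not part of FS Crawler. The `es-2.0` row is left out because its label does not follow the `N.x` pattern used by the other rows.

```python
from typing import Dict, List, Set

# Data copied from the compatibility table above, newest release first.
# Keys are FS Crawler releases, values are supported Elasticsearch major versions.
# The "es-2.0" row (FS Crawler 2.0.0) is intentionally omitted, see the note above.
COMPATIBILITY: Dict[str, Set[int]] = {
    "2.5-SNAPSHOT": {2, 5, 6},
    "2.4": {2, 5, 6},
    "2.3": {2, 5, 6},
    "2.2": {1, 2, 5},
    "2.1": {1, 2, 5},
}


def compatible_releases(es_version: str) -> List[str]:
    """Return the FS Crawler releases listed for the given Elasticsearch version."""
    # Only the major version matters for the "N.x" labels in the table.
    major = int(es_version.split(".")[0])
    return [release for release, majors in COMPATIBILITY.items() if major in majors]
```

For example, `compatible_releases("6.2.4")` returns the three releases listed for 6.x, and a version with no matching row returns an empty list.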
The guide has been moved to ReadTheDocs.
This software is licensed under the Apache 2 license, quoted below.
Copyright 2011-2018 David Pilato
Licensed under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.
Some libraries are not compatible with the Apache 2 license. They are therefore not packaged with FSCrawler, so you need
to download them and add them manually to the lib
directory:
- jbig2: com.levigo.jbig2:levigo-jbig2-imageio:2.0
- tiff: com.github.jai-imageio:jai-imageio-core:1.3.1
- JPEG2000: com.github.jai-imageio:jai-imageio-jpeg2000:1.3.0
See pdfbox for more details.
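The coordinates above use the Maven `group:artifact:version` form. A minimal sketch of turning them into download URLs and target file names follows. It assumes the artifacts are published on Maven Central under the standard repository layout, which this document does not confirm. The helper names `jar_url` and `jar_target` are hypothetical, and `lib` is taken as a path relative to your FSCrawler installation.

```python
from pathlib import Path

# Assumption: the artifacts are hosted on Maven Central (standard repository layout).
MAVEN_CENTRAL = "https://repo1.maven.org/maven2"

# Coordinates as listed in this README.
EXTRA_LIBRARIES = [
    "com.levigo.jbig2:levigo-jbig2-imageio:2.0",
    "com.github.jai-imageio:jai-imageio-core:1.3.1",
    "com.github.jai-imageio:jai-imageio-jpeg2000:1.3.0",
]


def jar_url(coordinate: str, base: str = MAVEN_CENTRAL) -> str:
    """Build the jar URL for a 'group:artifact:version' coordinate."""
    group, artifact, version = coordinate.split(":")
    # Dots in the group id become path separators in the repository layout.
    group_path = group.replace(".", "/")
    return f"{base}/{group_path}/{artifact}/{version}/{artifact}-{version}.jar"


def jar_target(coordinate: str, lib_dir: str = "lib") -> Path:
    """Return where the jar would be stored inside the lib directory."""
    _, artifact, version = coordinate.split(":")
    return Path(lib_dir) / f"{artifact}-{version}.jar"
```

To fetch the files, you could loop over `EXTRA_LIBRARIES` and call `urllib.request.urlretrieve(jar_url(c), jar_target(c))` for each coordinate `c`, after checking that each library's license suits your use.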