BIG WARNING: Elasticsearch 2.0.0 no longer supports rivers.
This project has moved to a standalone application which crawls your file system, indexes new files, updates existing ones and removes old ones.
Please use it with caution and don't put it in production without a lot of testing.
Welcome to the FS Crawler for Elasticsearch
This crawler helps to index documents from your local file system and over SSH.
You need to install a version matching your Elasticsearch version:
Elasticsearch | FS Crawler | Docs |
---|---|---|
es-2.0 | 2.0.0 | See below |
You need to have at least Java 1.8.
fscrawler job_name
FS crawler will read a local file (`~/.fscrawler/{job_name}.json` by default).
If the file does not exist, FS crawler will offer to create your first job.
Once the crawler is running, it will write status information and statistics in:
~/.fscrawler/{job_name}.json
~/.fscrawler/{job_name}_status.json
~/.fscrawler/{job_name}_stats.json
This means that if you stop the job at some point, FS crawler will restart it from where it stopped. If needed, you can manually edit or remove those files to restart.
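As a rough illustration of the restart behaviour described above, the following Python sketch lists the files of a job and removes the status and statistics files while keeping the job settings. The helper names `job_files` and `reset_job` are hypothetical and not part of FS crawler; only the file naming comes from the list above.

```python
from pathlib import Path


def job_files(job_name, config_dir=None):
    """Return the settings, status and stats paths of a job (naming from the list above)."""
    base = Path(config_dir) if config_dir else Path.home() / ".fscrawler"
    return {
        "settings": base / f"{job_name}.json",
        "status": base / f"{job_name}_status.json",
        "stats": base / f"{job_name}_stats.json",
    }


def reset_job(job_name, config_dir=None):
    """Delete the status and stats files, keep the settings. Returns the removed paths."""
    removed = []
    files = job_files(job_name, config_dir)
    for key in ("status", "stats"):
        path = files[key]
        if path.exists():
            path.unlink()
            removed.append(path)
    return removed
```

Calling `reset_job("test")` would leave `~/.fscrawler/test.json` untouched. Stop the crawler before removing the files.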
You can also run:
fscrawler
It will give you the list of existing jobs and will allow you to choose one.
- `--help` displays help.
- `--silent` runs in silent mode. No output is generated.
- `--debug` runs in debug mode.
- `--trace` runs in trace mode (more verbose than debug).
- `--config_dir` defines the directory where jobs are stored instead of the default `~/.fscrawler`.
The job file must comply with the following JSON specification:
{
"name" : "job_name",
"fs" : {
"url" : "/path/to/data/dir",
"update_rate" : "5s",
"includes": [
"*.*"
],
"excludes": [
"*.json"
],
"json_support" : false,
"filename_as_id" : false,
"add_filesize" : true,
"remove_deleted" : true,
"store_source" : false,
"indexed_chars" : "10000"
},
"server" : {
"hostname" : null,
"port" : 22,
"username" : null,
"password" : null,
"protocol" : "local",
"pem_path" : null
},
"elasticsearch" : {
"nodes" : [ {
"host" : "127.0.0.1",
"port" : 9300
} ],
"index" : "docs",
"type" : "doc",
"bulk_size" : 100,
"flush_interval" : "5s"
}
}
Here is a full list of existing settings:
Name | Default value | Documentation |
---|---|---|
name | the job name (mandatory field) | |
fs.url | "/tmp/es" | Root directory |
fs.update_rate | "15m" | Update Rate |
fs.includes | null | Includes and Excludes |
fs.excludes | null | Includes and Excludes |
fs.json_support | false | Indexing JSon docs |
fs.filename_as_id | false | Using Filename as _id |
fs.add_filesize | true | Disabling file size field |
fs.remove_deleted | true | Ignore deleted files |
fs.store_source | false | Storing binary source document |
fs.indexed_chars | 0.0 | Extracted characters |
server.hostname | null | Indexing using SSH |
server.port | 22 | Indexing using SSH |
server.username | null | Indexing using SSH |
server.password | null | Indexing using SSH |
server.protocol | "local" | Indexing using SSH |
server.pem_path | null | Using Username / PEM file |
elasticsearch.index | job name | Index Name |
elasticsearch.type | "doc" | Type Name |
elasticsearch.bulk_size | 100 | Bulk settings |
elasticsearch.flush_interval | "5s" | Bulk settings |
elasticsearch.nodes | 127.0.0.1:9300 | Node settings |
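To make the effect of these defaults concrete, here is a minimal Python sketch that fills a partial job definition with the default values from the table. It is only an illustration of the documented defaults, not how FS crawler itself loads settings. The function name `effective_settings` is hypothetical, and `fs.indexed_chars` is deliberately left out.

```python
import copy

# Defaults copied from the settings table; "index" is derived from the job name below.
DEFAULTS = {
    "fs": {
        "url": "/tmp/es",
        "update_rate": "15m",
        "includes": None,
        "excludes": None,
        "json_support": False,
        "filename_as_id": False,
        "add_filesize": True,
        "remove_deleted": True,
        "store_source": False,
    },
    "server": {
        "hostname": None,
        "port": 22,
        "username": None,
        "password": None,
        "protocol": "local",
        "pem_path": None,
    },
    "elasticsearch": {
        "nodes": [{"host": "127.0.0.1", "port": 9300}],
        "type": "doc",
        "bulk_size": 100,
        "flush_interval": "5s",
    },
}


def effective_settings(job):
    """Merge a partial job definition over the documented defaults."""
    if not job.get("name"):
        raise ValueError("name is a mandatory field")
    merged = copy.deepcopy(DEFAULTS)
    for section in DEFAULTS:
        merged[section].update(job.get(section, {}))
    merged["name"] = job["name"]
    # The index name falls back to the job name when it is not set explicitly.
    merged["elasticsearch"].setdefault("index", job["name"])
    return merged
```

For example, `effective_settings({"name": "test"})` yields a job that scans `/tmp/es` every `15m` and indexes into the `test` index.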
You can define the simplest crawler job by writing a `~/.fscrawler/test.json` file as follows:
{
"name" : "test"
}
This will scan every 15 minutes all documents available in the `/tmp/es` dir and will index them into the `test` index with the `doc` type. It will connect to an elasticsearch cluster running on `127.0.0.1`, port `9300`.
Note: `name` is a mandatory field.
Define the `fs.url` property in your `~/.fscrawler/test.json` file:
{
"name" : "test",
"fs" : {
"url" : "/path/to/data/dir"
}
}
For Windows users, use a form like `c:/tmp` or `c:\\tmp`.
Let's say you want to index only docs like `*.doc` and `*.pdf`, except those matching `resume*`. So `resume_david.pdf` won't be indexed.
Define the `fs.includes` and `fs.excludes` properties in your `~/.fscrawler/test.json` file:
{
"name" : "test",
"fs": {
"includes": [
"*.doc",
"*.pdf"
],
"excludes": [
"resume*"
]
}
}
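The filtering rule above can be sketched in a few lines of Python. This is an illustration only. It assumes that patterns are shell-style wildcards matched against the file name, that an exclude always wins over an include, and that an empty include list means "everything". The function name `is_indexable` is hypothetical.

```python
from fnmatch import fnmatchcase


def is_indexable(filename, includes=None, excludes=None):
    """Return True if a file name passes the include and exclude patterns."""
    # Assumption: an exclude match always wins over an include match.
    if excludes and any(fnmatchcase(filename, pattern) for pattern in excludes):
        return False
    # Assumption: no include patterns means every file is a candidate.
    if not includes:
        return True
    return any(fnmatchcase(filename, pattern) for pattern in includes)
```

With `includes=["*.doc", "*.pdf"]` and `excludes=["resume*"]`, `report.pdf` passes while `resume_david.pdf` and `notes.txt` are skipped.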
By default, `update_rate` is set to `15m`. You can modify this value using any compatible time unit.
For example, here is a 15-minute update rate:
{
"name": "test",
"fs": {
"update_rate": "15m"
}
}
Or a 3-hour update rate:
{
"name": "test",
"fs": {
"update_rate": "3h"
}
}
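To show how such values translate into a scan interval, here is a small Python sketch that converts an `update_rate` string into seconds. It only handles the `s`, `m`, `h` and `d` suffixes as an assumption for illustration. The crawler's own parser may accept more units. The name `update_rate_seconds` is hypothetical.

```python
import re

# Assumption: only these suffixes are handled in this sketch.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def update_rate_seconds(value):
    """Convert a value such as "15m" or "3h" into a number of seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", value.strip())
    if match is None:
        raise ValueError(f"unsupported update_rate: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]
```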
You can index files remotely using SSH.
Let's say you want to index from a remote server using SSH:
- FS URL: `/path/to/data/dir/on/server`
- Server: `mynode.mydomain.com`
- Username: `username`
- Password: `password`
- Protocol: `ssh` (defaults to `local`)
- Port: `22` (defaults to `22`)
{
"name" : "test",
"fs" : {
"url" : "/path/to/data/dir/on/server"
},
"server" : {
"hostname" : "mynode.mydomain.com",
"port" : 22,
"username" : "username",
"password" : "password",
"protocol" : "ssh"
}
}
Let's say you want to index from a remote server using SSH, authenticating with a PEM file instead of a password:
- FS URL: `/path/to/data/dir/on/server`
- Server: `mynode.mydomain.com`
- Username: `username`
- PEM File: `/path/to/private_key.pem`
- Protocol: `ssh` (defaults to `local`)
- Port: `22` (defaults to `22`)
{
"name" : "test",
"fs" : {
"url" : "/path/to/data/dir/on/server"
},
"server" : {
"hostname" : "mynode.mydomain.com",
"port" : 22,
"username" : "username",
"protocol" : "ssh",
"pem_path": "/path/to/private_key.pem"
}
}
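The two SSH variants differ only in how the user authenticates. The following Python sketch builds the `server` object for either case. It is a hypothetical helper (`ssh_server_settings` is not part of FS crawler), and it assumes that exactly one of `password` and `pem_path` should be given, which the examples suggest but do not state.

```python
import json


def ssh_server_settings(hostname, username, password=None, pem_path=None, port=22):
    """Build the "server" object of a job for SSH with a password or a PEM file."""
    # Assumption: exactly one authentication method is expected.
    if (password is None) == (pem_path is None):
        raise ValueError("provide either password or pem_path")
    server = {
        "hostname": hostname,
        "port": port,
        "username": username,
        "protocol": "ssh",
    }
    if password is not None:
        server["password"] = password
    else:
        server["pem_path"] = pem_path
    return server


job = {
    "name": "test",
    "fs": {"url": "/path/to/data/dir/on/server"},
    "server": ssh_server_settings(
        "mynode.mydomain.com", "username", pem_path="/path/to/private_key.pem"
    ),
}
print(json.dumps(job, indent=2))
```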
This is a common use case in elasticsearch: we want to search for something! ;-)
GET docs/doc/_search
{
"query" : {
"match" : {
"_all" : "I am searching for something !"
}
}
}
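The same search can be sent from a script. The Python sketch below builds the request body shown above and posts it over HTTP using only the standard library. It assumes the REST API is reachable at `http://127.0.0.1:9200` (the Elasticsearch default HTTP port, which is distinct from the transport port `9300` used by the crawler). The function names are hypothetical.

```python
import json
import urllib.request


def all_field_query(text):
    """Build the match query on the _all field shown above."""
    return {"query": {"match": {"_all": text}}}


def search(text, base_url="http://127.0.0.1:9200", index="docs", doc_type="doc"):
    """Send the query to a running cluster and return the decoded JSON response."""
    body = json.dumps(all_field_query(text)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/{index}/{doc_type}/_search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

`search("I am searching for something !")` needs a running cluster, so it is not called here.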
If you want to index JSON files directly without parsing them with Tika, you can set `json_support` to `true`.
{
"name" : "test",
"fs" : {
"json_support" : true
}
}
Of course, if you did not define a mapping before launching the crawler, Elasticsearch will guess the mapping automatically.
If you have more than one type, create as many crawlers as types:
`~/.fscrawler/test_type1.json`:
{
"name": "test_type1",
"fs": {
"url": "/tmp/type1",
"json_support" : true
},
"elasticsearch": {
"index": "mydocs",
"type": "type1"
}
}
`~/.fscrawler/test_type2.json`:
{
"name": "test_type2",
"fs": {
"url": "/tmp/type2",
"json_support" : true
},
"elasticsearch": {
"index": "mydocs",
"type": "type2"
}
}
You can also index many types from one single dir by using two crawlers scanning the same dir and setting the `includes` parameter:
`~/.fscrawler/test_type1.json`:
{
"name": "test_type1",
"fs": {
"url": "/tmp",
"includes": [ "type1*.json" ],
"json_support" : true
},
"elasticsearch": {
"index": "mydocs",
"type": "type1"
}
}
`~/.fscrawler/test_type2.json`:
{
"name": "test_type2",
"fs": {
"url": "/tmp",
"includes": [ "type2*.json" ],
"json_support" : true
},
"elasticsearch": {
"index": "mydocs",
"type": "type2"
}
}
Please note that the document `_id` is always generated (as a hash value) from the JSON filename to avoid issues with special characters in the filename.
You can force the `_id` to be the filename using the `filename_as_id` attribute:
{
"name" : "test",
"fs" : {
"json_support" : true,
"filename_as_id" : true
}
}
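The difference between the two `_id` strategies can be sketched as follows. The README does not say which hash function is used for the `_id`, so the MD5 digest below is only an assumption chosen for illustration, and `document_id` is a hypothetical helper.

```python
import hashlib


def document_id(filename, filename_as_id=False):
    """Return the _id for a file: the name itself or a hash of the name."""
    if filename_as_id:
        return filename
    # Assumption: MD5 stands in for whatever hash the crawler really uses.
    return hashlib.md5(filename.encode("utf-8")).hexdigest()
```

A hashed id contains only hexadecimal characters, whatever the file name looks like, which is why it avoids trouble with special characters.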
By default, FS crawler will create a field to store the original file size in bytes. You can disable it using the `add_filesize` option:
{
"name" : "test",
"fs" : {
"add_filesize" : false
}
}
If you don't want to remove indexed documents when you remove a file or a directory, you can set `remove_deleted` to `false` (defaults to `true`):
{
"name" : "test",
"fs" : {
"remove_deleted" : false
}
}
If you don't want to extract file content but only index filesystem metadata such as filename, date, size and path, you can set `index_content` to `false` (defaults to `true`):
{
"name" : "test",
"fs" : {
"index_content" : false
}
}
When the FS crawler detects a new type, it automatically creates a mapping for this type:
{
"doc" : {
"properties" : {
"content" : {
"type" : "string",
"store" : "yes"
},
"meta" : {
"properties" : {
"author" : {
"type" : "string",
"store" : "yes"
},
"title" : {
"type" : "string",
"store" : "yes"
},
"date" : {
"type" : "date",
"format" : "dateOptionalTime",
"store" : "yes"
},
"keywords" : {
"type" : "string",
"store" : "yes"
}
}
},
"file" : {
"properties" : {
"content_type" : {
"type" : "string",
"index" : "not_analyzed",
"store" : "yes"
},
"last_modified" : {
"type" : "date",
"format" : "dateOptionalTime",
"store" : "yes"
},
"indexing_date" : {
"type" : "date",
"format" : "dateOptionalTime",
"store" : "yes"
},
"filesize" : {
"type" : "long",
"store" : "yes"
},
"indexed_chars" : {
"type" : "long",
"store" : "yes"
},
"filename" : {
"type" : "string",
"index" : "not_analyzed",
"store" : "yes"
},
"url" : {
"type" : "string",
"store" : "yes",
"index" : "no"
}
}
},
"path" : {
"properties" : {
"encoded" : {
"type" : "string",
"store" : "yes",
"index" : "not_analyzed"
},
"virtual" : {
"type" : "string",
"store" : "yes",
"index" : "not_analyzed"
},
"root" : {
"type" : "string",
"store" : "yes",
"index" : "not_analyzed"
},
"real" : {
"type" : "string",
"store" : "yes",
"index" : "not_analyzed"
}
}
}
}
}
}
If you want to define your own mapping, to set analyzers for example, you can push the mapping before starting the FS crawler.
# Create index
PUT docs
# Create the mapping
PUT docs/doc/_mapping
{
"doc" : {
"properties" : {
"content" : {
"type" : "string",
"store" : "yes",
"analyzer" : "french"
},
"meta" : {
"properties" : {
"author" : {
"type" : "string",
"store" : "yes"
},
"title" : {
"type" : "string",
"store" : "yes"
},
"date" : {
"type" : "date",
"format" : "dateOptionalTime",
"store" : "yes"
},
"keywords" : {
"type" : "string",
"store" : "yes"
}
}
},
"file" : {
"properties" : {
"content_type" : {
"type" : "string",
"index" : "not_analyzed",
"store" : "yes"
},
"last_modified" : {
"type" : "date",
"format" : "dateOptionalTime",
"store" : "yes"
},
"indexing_date" : {
"type" : "date",
"format" : "dateOptionalTime",
"store" : "yes"
},
"filesize" : {
"type" : "long",
"store" : "yes"
},
"indexed_chars" : {
"type" : "long",
"store" : "yes"
},
"filename" : {
"type" : "string",
"index" : "not_analyzed",
"store" : "yes"
},
"url" : {
"type" : "string",
"store" : "yes",
"index" : "no"
}
}
},
"path" : {
"properties" : {
"encoded" : {
"type" : "string",
"store" : "yes",
"index" : "not_analyzed"
},
"virtual" : {
"type" : "string",
"store" : "yes",
"index" : "not_analyzed"
},
"root" : {
"type" : "string",
"store" : "yes",
"index" : "not_analyzed"
},
"real" : {
"type" : "string",
"store" : "yes",
"index" : "not_analyzed"
}
}
}
}
}
}
FS crawler creates the following fields:
Field | Description | Example |
---|---|---|
content | Extracted content | "This is my text!" |
attachment | BASE64 encoded binary file | BASE64 Encoded document |
meta.author | Author if any in document metadata | "David Pilato" |
meta.title | Title if any in document metadata | "My document title" |
meta.date | Document date if any in document metadata | "2013-04-04T15:21:35" |
meta.keywords | Keywords if any in document metadata | ["river","fs","elasticsearch"] |
file.content_type | Content Type | "application/vnd.oasis.opendocument.text" |
file.last_modified | Last modification date | 1386855978000 |
file.indexing_date | Indexing date | "2013-12-12T13:50:58.758Z" |
file.filesize | File size in bytes | 1256362 |
file.indexed_chars | Extracted chars if fs.indexed_chars > 0 | 100000 |
file.filename | Original file name | "mydocument.pdf" |
file.url | Original file url | "file:///tmp/mydir/otherdir/mydocument.pdf" |
path.encoded | MD5 encoded file path (for internal use) | "112aed83738239dbfe4485f024cd4ce1" |
path.virtual | Relative path from root path | "mydir/otherdir" |
path.root | MD5 encoded root path (for internal use) | "112aed83738239dbfe4485f024cd4ce1" |
path.real | Actual real path name | "/tmp/mydir/otherdir/mydocument.pdf" |
Here is a typical JSON document generated by the crawler:
{
"file":{
"filename":"test.odt",
"last_modified":1386855978000,
"indexing_date":"2013-12-12T13:50:58.758Z",
"content_type":"application/vnd.oasis.opendocument.text",
"url":"file:///tmp/testfs_metadata/test.odt",
"indexed_chars":100000,
"filesize":8355
},
"path":{
"encoded":"bceb3913f6d793e915beb70a4735592",
"root":"bceb3913f6d793e915beb70a4735592",
"virtual":"",
"real":"/tmp/testfs_metadata/test.odt"
},
"meta":{
"author":"David Pilato",
"title":"Mon titre",
"date":"2013-04-04T15:21:35",
"keywords":[
"fs",
"elasticsearch",
"crawler"
]
},
"content":"Bonjour David\n\n\n"
}
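In the document above, `file.last_modified` is a plain number while `file.indexing_date` is an ISO string. Assuming the number is a count of milliseconds since the Unix epoch in UTC (which the sample value suggests, but the README does not state), it can be made readable with a short Python sketch. The helper name `millis_to_iso` is hypothetical.

```python
from datetime import datetime, timezone


def millis_to_iso(millis):
    """Convert epoch milliseconds (assumed UTC) to an ISO 8601 string."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc).isoformat()


print(millis_to_iso(1386855978000))  # prints 2013-12-12T13:46:18+00:00
```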
You can search on the meta fields:
GET docs/doc/_search
{
"query" : {
"term" : {
"file.filename" : "mydocument.pdf"
}
}
}
You can store the binary document itself in elasticsearch using the `store_source` option:
{
"name" : "test",
"fs" : {
"store_source" : true
}
}
In that case, a new stored field named `attachment` is added to the generated JSON document.
If you let FS crawler generate the mapping, FS crawler will exclude the `attachment` field from `_source` to save some disk space.
That means you need to ask for the `attachment` field when querying:
GET mydocs/doc/_search
{
"fields" : ["attachment", "_source"],
"query":{
"match_all" : {}
}
}
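Once a hit comes back, the `attachment` value still has to be decoded from BASE64 to get the original bytes. The Python sketch below does this for one hit. It assumes that requested stored fields are returned as lists under a `fields` key of each hit. The helper name `attachment_bytes` and the sample hit are made up for illustration.

```python
import base64


def attachment_bytes(hit):
    """Return the decoded binary document of a search hit, or None if absent."""
    # Assumption: stored fields come back as lists under hit["fields"].
    values = hit.get("fields", {}).get("attachment")
    if not values:
        return None
    return base64.b64decode(values[0])


# Made-up hit, shaped like the assumption above.
sample_hit = {"_id": "1", "fields": {"attachment": ["Qm9uam91ciBEYXZpZA=="]}}
print(attachment_bytes(sample_hit))  # prints b'Bonjour David'
```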
Default generated mapping in this case is:
{
"doc" : {
"_source" : {
"excludes" : [ "attachment" ]
},
"properties" : {
"attachment" : {
"type" : "binary"
}
// ... Other properties here
}
}
}
You can force the `attachment` field not to be stored and keep `attachment` in `_source`:
# Create index
PUT docs
# Create the mapping
PUT docs/doc/_mapping
{
"doc" : {
"properties" : {
"attachment" : {
"type" : "binary",
"store" : "no"
}
// ... Other properties here
}
}
}
By default, FS crawler will extract only the first 100 000 characters.
You can set `indexed_chars` to `5000` in the FS crawler settings to override this default:
{
"name": "test",
"fs": {
"indexed_chars": "5000"
}
}
This number can be either a fixed size (a number of characters) or a percentage using the `%` sign.
The percentage value will be applied to the file size to determine the number of characters the crawler needs to extract.
If you want to index only `80%` of the file size, set `indexed_chars` to `"80%"`.
Of course, if you want to index the full document, you can set this property to `"100%"`. Decimal values are also supported, so `"0.01%"` is also a valid value.
Compressed files: if your file is compressed, you might need to increase `indexed_chars` to more than `"100%"`, for example `"150%"`.
If you want to extract the full content, set `indexed_chars` to `"-1"`.
Note: Tika needs to allocate a data structure in memory to extract text. Setting `indexed_chars` to a high number will require more memory!
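The rules above can be summed up in a small Python sketch that turns an `indexed_chars` value and a file size into a character limit. It is an illustration of the documented behaviour, not the crawler's code. The rounding of percentages is an assumption, `None` is used here to mean "no limit", and the name `chars_to_extract` is hypothetical.

```python
def chars_to_extract(indexed_chars, filesize):
    """Return the number of characters to extract, or None for the full content."""
    value = str(indexed_chars).strip()
    if value.endswith("%"):
        # Percentage of the file size; rounding to the nearest integer is an assumption.
        percent = float(value[:-1])
        return int(round(filesize * percent / 100))
    number = float(value)
    if number < 0:
        # "-1" means: extract the full content.
        return None
    return int(number)
```

For a 1000-byte file, `"80%"` gives 800 characters and `"150%"` gives 1500, while `"5000"` gives 5000 regardless of the file size.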
You can change elasticsearch settings within the `elasticsearch` settings object.
By default, FS crawler will index your data in an index whose name is the same as the crawler name (`name` property).
You can change it by setting the `index` field:
{
"name" : "test",
"elasticsearch" : {
"index" : "docs"
}
}
By default, FS crawler will index your data using `doc` as the type name.
You can change it by setting the `type` field:
{
"name" : "test",
"elasticsearch" : {
"type" : "mydocument"
}
}
FS crawler uses bulk requests to send data to elasticsearch. By default, a bulk is executed every 100 operations or every 5 seconds. You can change the default settings using `bulk_size` and `flush_interval`:
{
"name" : "test",
"elasticsearch" : {
"bulk_size" : 1000,
"flush_interval" : "2s"
}
}
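The interplay of the two settings ("every N operations or every T seconds, whichever comes first") can be sketched in Python as below. This is a simplified model for illustration, not the crawler's implementation. The class `BulkBuffer`, its `send` callback and the injectable `clock` are all hypothetical.

```python
import time


class BulkBuffer:
    """Collect operations and hand them to `send` by size or by elapsed time."""

    def __init__(self, send, bulk_size=100, flush_interval=5.0, clock=time.monotonic):
        self._send = send
        self._bulk_size = bulk_size
        self._flush_interval = flush_interval  # seconds
        self._clock = clock
        self._pending = []
        self._last_flush = clock()

    def add(self, operation):
        self._pending.append(operation)
        self._flush_if_due()

    def tick(self):
        """Call periodically so the time-based flush fires even without new operations."""
        self._flush_if_due()

    def _flush_if_due(self):
        full = len(self._pending) >= self._bulk_size
        stale = self._clock() - self._last_flush >= self._flush_interval
        if self._pending and (full or stale):
            self.flush()

    def flush(self):
        batch, self._pending = self._pending, []
        self._last_flush = self._clock()
        if batch:
            self._send(batch)
```

A larger `bulk_size` means fewer, bigger requests. A shorter `flush_interval` makes documents visible sooner when few files change.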
FS crawler uses the elasticsearch transport client to send data to your running cluster.
By default, it connects to `127.0.0.1` on port `9300`, which are the default settings when running a local node on your machine.
Of course, in production, you would probably change this and connect to a production cluster:
{
"name" : "test",
"elasticsearch" : {
"nodes" : [
{ "host" : "mynode1.mycompany.com", "port" : 9300 }
]
}
}
You can define multiple nodes:
{
"name" : "test",
"elasticsearch" : {
"nodes" : [
{ "host" : "mynode1.mycompany.com", "port" : 9300 },
{ "host" : "mynode2.mycompany.com", "port" : 9300 },
{ "host" : "mynode3.mycompany.com", "port" : 9300 }
]
}
}
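When the node list comes from elsewhere (an inventory file, an environment variable), a few lines of Python can produce the `nodes` array. This is a hypothetical helper, `nodes_setting`, that assumes plain `host` or `host:port` strings and does not handle IPv6 literals.

```python
import json


def nodes_setting(addresses, default_port=9300):
    """Turn "host" or "host:port" strings into the "nodes" array of a job."""
    nodes = []
    for address in addresses:
        host, _, port = address.partition(":")
        nodes.append({"host": host, "port": int(port) if port else default_port})
    return nodes


job = {
    "name": "test",
    "elasticsearch": {
        "nodes": nodes_setting(["mynode1.mycompany.com", "mynode2.mycompany.com:9301"])
    },
}
print(json.dumps(job, indent=2))
```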
Note: the `cluster.name` does not have to be set as it's ignored.
This software is licensed under the Apache 2 license, quoted below.
Copyright 2011-2015 David Pilato
Licensed under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.