Embedded API raises exception during search after restart #18

Open
lsemel opened this issue Jan 15, 2012 · 6 comments

@lsemel
lsemel commented Jan 15, 2012

If I start the embedded API and index some documents, I'm able to query them. If I then stop and restart the embedded API and query for any document that was previously in the index, it throws the IndextankException below. Searching for a term that wasn't previously in the index correctly returns a JSON result with zero matches.

I am using the default sample-engine-config and running on OS X.

Is there something I'm doing wrong here? Do I have to do something to trigger a reload of the previously indexed documents?

/var/www/indextank/indextank-engine$ java -cp target/indextank-engine-1.0.0-jar-with-dependencies.jar com.flaptor.indextank.api.Launcher 
WARN  [main] com.flaptor.indextank.api.EmbeddedIndexEngine - [log4j.properties not found on classpath!] 2012-01-15 12:56:30,351
INFO  [main] com.flaptor.indextank.api.EmbeddedIndexEngine - [Command line option 'environment-prefix' set to TEST] 2012-01-15 12:56:30,359
INFO  [main] com.flaptor.indextank.api.EmbeddedIndexEngine - [Command line option 'facets' set to true] 2012-01-15 12:56:30,359
INFO  [main] com.flaptor.indextank.api.EmbeddedIndexEngine - [Command line option 'index-code' set to dbajo] 2012-01-15 12:56:30,359
INFO  [main] com.flaptor.indextank.api.EmbeddedIndexEngine - [Command line option 'conf-file' set to sample-engine-config] 2012-01-15 12:56:30,365
INFO  [main] com.flaptor.indextank.suggest.NewPopularityIndex - [Loading popularity index terms from disk.] 2012-01-15 12:56:30,724
INFO  [main] com.flaptor.indextank.suggest.NewPopularityIndex - [Terms loaded] 2012-01-15 12:56:30,725
INFO  [main] com.flaptor.indextank.api.EmbeddedIndexEngine - [Index recovery configuration set to recover index from simpleDB] 2012-01-15 12:56:30,725
INFO  [main] com.flaptor.indextank.index.storage.InMemoryStorage - [Starting a new(empty) InMemoryStorage.] 2012-01-15 12:56:30,726
INFO  [main] com.flaptor.indextank.api.EmbeddedIndexEngine - [Using in-memory storage] 2012-01-15 12:56:30,727
INFO  [main] org.eclipse.jetty.util.log - [jetty-7.x.y-SNAPSHOT] 2012-01-15 12:56:30,790
INFO  [main] org.eclipse.jetty.util.log - [started o.e.j.s.ServletContextHandler{/,null}] 2012-01-15 12:56:30,821
INFO  [main] org.eclipse.jetty.util.log - [Started [email protected]:20220 STARTING] 2012-01-15 12:56:30,849
IndextankException(message:null)
    at com.flaptor.indextank.api.IndexEngineApi.search(IndexEngineApi.java:94)
    at com.flaptor.indextank.api.resources.Search.run(Search.java:79)
    at com.ghosthack.turismo.servlet.Servlet.service(Servlet.java:55)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:538)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:478)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:937)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:406)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:183)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:871)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:110)
    at org.eclipse.jetty.server.Server.handle(Server.java:346)
    at org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:589)
    at org.eclipse.jetty.server.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:1048)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:601)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:214)
    at org.eclipse.jetty.server.HttpConnection.handle(HttpConnection.java:411)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:535)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:40)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:529)
    at java.lang.Thread.run(Thread.java:637)
@MikeQG
MikeQG commented Jun 8, 2012

I have a similar problem. After a restart, searching for previously indexed documents with curl "http://localhost:20220/v1/indexes/idx/search?q=ipsum" returns the correct result. But if I try an advanced search, such as curl "http://localhost:20220/v1/indexes/idx/search?q=ipsum&snippet=text", I get "Service unavailable" and the following exception in IndexTank:

com.flaptor.indextank.api.IndexEngineApiException: java.lang.NullPointerException
at com.flaptor.indextank.api.IndexEngineApi.search(IndexEngineApi.java:90)
at com.flaptor.indextank.api.resources.Search.run(Search.java:76)
at com.ghosthack.turismo.servlet.Servlet.service(Servlet.java:55)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:538)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:478)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:937)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:406)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:183)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:871)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:110)
at org.eclipse.jetty.server.Server.handle(Server.java:346)
at org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:589)
at org.eclipse.jetty.server.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:1048)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:601)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:214)
at org.eclipse.jetty.server.HttpConnection.handle(HttpConnection.java:411)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:535)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:40)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:529)
at java.lang.Thread.run(Thread.java:679)
Caused by: java.lang.NullPointerException
at java.io.ByteArrayInputStream.<init>(ByteArrayInputStream.java:106)
at com.flaptor.indextank.index.storage.DocumentBinaryStorage.decompress(DocumentBinaryStorage.java:98)
at com.flaptor.indextank.index.storage.DocumentBinaryStorage.getDocument(DocumentBinaryStorage.java:70)
at com.flaptor.indextank.search.SnippetSearcher.search(SnippetSearcher.java:96)
at com.flaptor.indextank.api.IndexEngineApi.search(IndexEngineApi.java:83)
... 22 more
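
The "Caused by" section suggests that decompress received a null byte array, which is what one would expect if the document storage has no entry for a document the index still returns. A minimal sketch of that failure mode and a defensive guard follows. SnippetStorageSketch and its map-backed, gzip-compressed store are hypothetical stand-ins for illustration, not IndexTank's DocumentBinaryStorage:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-in for a compressed document store; not IndexTank code.
class SnippetStorageSketch {
    private final Map<String, byte[]> store = new HashMap<>();

    // Compress the text and keep it under the document id.
    void put(String docId, String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        store.put(docId, bytes.toByteArray());
    }

    // Unguarded lookup: a missing document yields null bytes, and
    // new ByteArrayInputStream(null) throws NullPointerException.
    String getUnguarded(String docId) {
        return decompress(store.get(docId));
    }

    // Guarded lookup: a missing document becomes an empty Optional.
    Optional<String> get(String docId) {
        byte[] compressed = store.get(docId);
        return compressed == null ? Optional.empty() : Optional.of(decompress(compressed));
    }

    private static String decompress(byte[] compressed) {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a guard like this, a snippet request for a document that is missing from storage could degrade to a result without snippets instead of a "Service unavailable" response.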

@myroslav
myroslav commented Jun 8, 2012

Does anyone have a workaround for this issue? Reindexing the whole dataset after every restart is a waste of resources and hurts availability.

@jasonpolites
I think I have tracked down the cause of this.

The problem occurs because IndexTank expects the InMemoryStorage instance to be in a particular state after startup; however, depending on how the engine is bootstrapped, it may not have been initialized correctly.

When starting an instance of the EmbeddedIndexEngine you MUST specify the parameter:

--load-state true

For example:

// realPath and Configuration come from the surrounding application.
final String base = realPath + "/indextank/";
final String[] params = new String[] {
        "--facets",
        "--rti-size", "500",
        "--conf-file", realPath + "/sample-engine-config",
        "--port", String.valueOf(Configuration.port), // indexer port+1, searcher port+2, suggestor port+3
        "--environment-prefix", "UTOPIO",
        "--recover",
        "--dir", base,
        "--load-state", "true", // the parameter discussed above
        "--snippets",
        "--suggest", "documents",
        "--boosts", "3",
        "--index-code", Configuration.indexCode,
        "--functions", "0:-age",
};
// Create the base directory if it does not exist yet.
new File(base).mkdirs();
engine = EmbeddedIndexEngine.instantiate(params);

However, I did notice that if I had an index that was already in a "bad" state, providing this parameter resulted in a lot of other errors. It seems that if you start from scratch with this parameter, it's all good.

Unfortunately, this seems like an incredibly flaky and unreliable situation. If for any reason the InMemoryStorage instance fails to load successfully, your whole index is basically useless.

I'm still tracing through the code to try to work out how this can be made more robust. Of course, it may be that it IS a robust solution and I'm just missing the point.

@jasonpolites
Follow up...

The default implementation of the EmbeddedIndexEngine seems to only allow the use of this InMemoryStorage instance:

Snippet from EmbeddedIndexEngine

switch (storageValue) {
    case RAM:
        storage = new InMemoryStorage(baseDir, load);
        logger.info("Using in-memory storage");
        break;
    case NO:
        storage = null;
        logger.info("NOT Using storage");
        break;
}

I'm assuming IndexTank uses this document storage to maintain a complete copy of each original document that was indexed, presumably because the underlying Lucene instance has been instructed to only index document fields and not store them. Index-only would seem to be a sensible option; however, I would also assume that in almost all cases the user of IndexTank will already have a document storage system and would not need IndexTank to manage this itself.

Unfortunately, there does not seem to be an easy way to instruct the engine NOT to use storage. Despite the snippet above, the engine also has this:

StorageValues storageValue = StorageValues.RAM;
int bdbCache = 0;
if (line.hasOption("storage")){
    String storageType = line.getOptionValue("storage");
    if ("bdb".equals(storageType)) {
        storageValue = StorageValues.BDB;
        bdbCache = Integer.parseInt(line.getOptionValue("bdb-cache", String.valueOf(DEFAULT_BDB_CACHE)));
    } else if ("cassandra".equals(storageType)) {
        storageValue = StorageValues.CASSANDRA;
    } else if ("ram".equals(storageType)) {
        storageValue = StorageValues.RAM;
    } else {
        throw new IllegalArgumentException("storage has to be 'cassandra', 'bdb' or 'ram'. '" + storageType + "' given.");
    }
}

Of course, none of these other values will ever actually work because of the code in the first snippet.
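
The mismatch between the two snippets can be reproduced in isolation. The following is a minimal sketch, not the actual EmbeddedIndexEngine code: StorageSelectionSketch, the placeholder string standing in for a storage object, and the createStrict variant are all hypothetical, and it assumes the storage variable starts out as null before the switch:

```java
// Hypothetical reduction of the two snippets above; not the actual EmbeddedIndexEngine code.
class StorageSelectionSketch {
    enum StorageValues { RAM, BDB, CASSANDRA, NO }

    // Mirrors the option parsing: 'bdb', 'cassandra' and 'ram' are all accepted.
    static StorageValues parse(String storageType) {
        switch (storageType) {
            case "bdb": return StorageValues.BDB;
            case "cassandra": return StorageValues.CASSANDRA;
            case "ram": return StorageValues.RAM;
            default:
                throw new IllegalArgumentException(
                    "storage has to be 'cassandra', 'bdb' or 'ram'. '" + storageType + "' given.");
        }
    }

    // Mirrors the posted switch: only RAM and NO are handled, so BDB and
    // CASSANDRA fall through and leave the storage unset (null), like NO.
    static String createAsPosted(StorageValues value) {
        String storage = null; // assumption: unset before the switch
        switch (value) {
            case RAM:
                storage = "in-memory"; // placeholder for new InMemoryStorage(...)
                break;
            case NO:
                storage = null;
                break;
        }
        return storage;
    }

    // One possible hardening: fail fast on values the switch does not implement.
    static String createStrict(StorageValues value) {
        switch (value) {
            case RAM: return "in-memory";
            case NO: return null;
            default:
                throw new UnsupportedOperationException(
                    "storage '" + value + "' is accepted by the parser but not implemented");
        }
    }
}
```

Under these assumptions, passing "bdb" or "cassandra" is accepted at parse time but ends up behaving like no storage at all, which a fail-fast default branch would surface at startup.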

Confusing...

@santiagovillegasg
I need help with this issue. I have edited the file and added --load-state true, but when I start the service I receive this message:
Starting a new(empty) InMemoryStorage. Load was requested but no file was found.

I just need to start my index, add documents, stop it, start it again, and be able to search the previously indexed documents.
Help!

@jhandl
jhandl commented Oct 1, 2012

Please keep in mind that IndexTank-engine was created as a way to make IndexTank easy to use when it was open-sourced. Originally it was part of IndexTank-service, where recovery was provided by the "LogStorage" (a component of the service which is absent from the stand-alone engine) and indexes were killed and respawned routinely by the "Nebu" component, transparently for the user. You can still use this setup if you want to venture in that direction.

So the stand-alone engine needs to be fed all the documents again after restart. The recovery time typically depends on the speed of the data source, as the engine can take documents much faster than a normal disk-based source can spew them. But unless you have a really large number of documents, it should take only a few seconds.
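
Re-feeding after a restart can be scripted. The following is a minimal sketch under stated assumptions: it assumes the embedded engine accepts IndexTank-style PUT requests at /v1/indexes/<name>/docs with a JSON body containing docid and fields (only the /search endpoint actually appears in this thread), and it uses a hypothetical in-memory map of document id to text as a stand-in for the real data source. RefeedSketch and its method names are illustrative only:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical re-feed helper; the /docs endpoint and body shape are assumptions.
class RefeedSketch {
    static URI docsUri(String baseUrl, String indexName) {
        return URI.create(baseUrl + "/v1/indexes/" + indexName + "/docs");
    }

    // Minimal JSON string escaping: quotes, backslashes and control characters.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') out.append('\\').append(c);
            else if (c < 0x20) out.append(String.format("\\u%04x", (int) c));
            else out.append(c);
        }
        return out.toString();
    }

    // Builds the assumed request body with a single "text" field.
    static String toJson(String docId, String text) {
        return "{\"docid\":\"" + escape(docId) + "\",\"fields\":{\"text\":\"" + escape(text) + "\"}}";
    }

    // Sends one document and returns the HTTP status code.
    static int putDocument(URI uri, String json) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream body = conn.getOutputStream()) {
            body.write(json.getBytes(StandardCharsets.UTF_8));
        }
        int status = conn.getResponseCode();
        conn.disconnect();
        return status;
    }

    // Feeds every document again; stops at the first non-2xx response.
    static void refeed(String baseUrl, String indexName, Map<String, String> docs) throws IOException {
        URI uri = docsUri(baseUrl, indexName);
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            int status = putDocument(uri, toJson(doc.getKey(), doc.getValue()));
            if (status / 100 != 2) {
                throw new IOException("Indexing " + doc.getKey() + " failed with HTTP " + status);
            }
        }
    }
}
```

With the port and index name used earlier in this thread, a call would look like RefeedSketch.refeed("http://localhost:20220", "idx", docs), run once the engine has finished starting.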
