Tuesday, June 24, 2008

Creating Lucene Index in a Database - Apache Lucene

My previous post, Indexing a database and searching the content using Lucene, shows how to index records (or stored files) in a database. In that case the index is created in the local file system. In real scenarios, however, most applications run in clustered environments, which raises the question of where to create the search index.

Creating the index in the local file system does not work in that situation, because the index has to be synchronized and shared by every node. One solution is clustering the JVM while using a Lucene RAMDirectory instead of an FSDirectory (keeping in mind that a RAMDirectory disappears after a node failure). The Terracotta framework can be used to cluster the JVM; this blog entry shows a code snippet.

Anyway, I thought not to go that far and decided to create the index in the database so that it can be shared by every node. Lucene's storage abstraction for this is the Directory class, but a JDBC-backed implementation is not shipped with Lucene itself. I found a third-party implementation: the Compass project provides JdbcDirectory. (No need to worry about Compass configuration etc.; JdbcDirectory can be used with pure Lucene without bothering about the rest of Compass.)

Here is a simple example.

//you need the Lucene jar, the Compass jar (for JdbcDirectory) and the MySQL JDBC driver on the classpath
import org.apache.lucene.store.jdbc.JdbcDirectory;
import org.apache.lucene.store.jdbc.dialect.MySQLDialect;
import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;

//code snippet to create the index
MysqlDataSource dataSource = new MysqlDataSource();
dataSource.setUser("root");
dataSource.setPassword("password");
dataSource.setDatabaseName("test");
dataSource.setEmulateLocators(true); //important because the index is stored in a BLOB field

JdbcDirectory jdbcDir = new JdbcDirectory(dataSource, new MySQLDialect(), "indexTable");
jdbcDir.create(); //creates indexTable in the "test" DB; no need to create it manually

//code snippet for indexing
StandardAnalyzer analyzer = new StandardAnalyzer();
IndexWriter writer = new IndexWriter(jdbcDir, analyzer, true); //true = create a new index
indexDocs(writer, dataSource.getConnection());
System.out.println("Optimizing...");
writer.optimize();
writer.close();


static void indexDocs(IndexWriter writer, Connection conn)
throws Exception {
    String sql = "select id, name, color from pet";
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery(sql);

    while (rs.next()) {
        Document d = new Document();
        d.add(new Field("id", rs.getString("id"), Field.Store.YES, Field.Index.NO));
        d.add(new Field("name", rs.getString("name"), Field.Store.YES, Field.Index.TOKENIZED));
        d.add(new Field("color", rs.getString("color"), Field.Store.YES,  Field.Index.TOKENIZED));
        writer.addDocument(d);
    }
}

That covers the indexing part; the searching part is the same as in my previous post.
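For completeness, a search against the JDBC-backed index could look like the following. This is a hedged sketch using the Lucene 2.x API of the era (Hits was replaced by TopDocs in later versions); the field name and query text are only examples.

```java
//hedged sketch: searching the same JdbcDirectory with the Lucene 2.x API
IndexSearcher searcher = new IndexSearcher(jdbcDir);
QueryParser parser = new QueryParser("name", new StandardAnalyzer());
Query query = parser.parse("fluffy"); //example query text
Hits hits = searcher.search(query);
for (int i = 0; i < hits.length(); i++) {
    System.out.println(hits.doc(i).get("name") + " - " + hits.doc(i).get("color"));
}
searcher.close();
```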

65 comments:

  1. Hi Kalani,
    This is an interesting post. It's great that we can directly write indexes into a database. I was doing it in a different way though. I create multimedia desktop applications using Adobe Flex and AIR, and Flex/AIR supports the SQLite database. So, for search, we index the content and write it into a MySQL DB, export the MySQL DB, and build a SQLite DB on the command line. This SQLite DB is used for searching from the Flash/Flex/AIR application. The search code is written by me; I didn't follow Lucene's logic for search, and have only implemented single-term, AND, OR and NOT searches. Am I doing something really wrong, or is it a long way to the same result? How can I get the logic to build search code in ActionScript which supports all the search methods Lucene supports? I have been using this for quite a long time, but I am not a Lucene expert. Can this be done?

    ReplyDelete
  2. Hi
    Very interesting. Are there any comparisons about the performance between a jdbc index and a normal file index.

    ReplyDelete
  3. Performance is not comparable. Here the advantage is that you can have fulltext search in a CD or desktop application without having any server installed.

    ReplyDelete
  4. Hi Vipin,
    is there any possibility of using a RAMDirectory (which keeps the index in memory)? I mean, if you can, then you could use only the MySQL DB and a RAMDirectory. You can periodically flush the updates in the RAMDirectory to the MySQL DB (to avoid losses on failures), and when you start the app you load the index into memory. I guess then you will be able to deal directly with the index in RAM, with pure Lucene logic for search.

    If you are further interested you can refer to this post.

    ReplyDelete
  5. Hi Ralph,
    I think a JDBC index has a performance issue if it is used heavily. If the index is updated frequently, it's obvious that writing to a database takes more time than writing to a normal file system. But the advantage of a JDBC index is that it works well in clustered environments.

    To overcome this performance issue I use a RAMDirectory as a synapse :)
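    The idea could be sketched roughly as follows; this is an assumption-laden outline (the copying RAMDirectory constructor and the static Directory.copy helper are from the Lucene 2.x API):

    ```java
    //on startup: pull the shared index out of the database into memory
    RAMDirectory ramDir = new RAMDirectory(jdbcDir); //copies the current contents of jdbcDir

    //...index and search against ramDir with the normal Lucene API...

    //periodically, and on shutdown: flush the in-memory index back to the DB,
    //so a node failure loses at most one flush interval of updates
    Directory.copy(ramDir, jdbcDir, false); //false = do not close ramDir
    ```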

    ReplyDelete
  6. Hi Kalani,

    I cannot deal with a RAM Directory from my desktop application; the Flex/AIR framework doesn't support all that. I got interested in this particular post because with this we can directly save the index into a database, instead of indexing it into a file and then reading it to translate into a database. This helps. I also need to check whether we can write directly into an SQLite database. Will that be possible?

    I cannot load the index into my application while I search, and the search can be a little slower compared to an online Lucene search. Right now I am thinking of rewriting the searching logic of Lucene in ActionScript 3.0. Where should I look for the logic (I am not a Java guy, so reading and finding the logic is a bit difficult for me)? If I get the logic behind it, I can write the classes for that.

    Thanks for this great post!

    ReplyDelete
  7. Vipin,

    I am not sure about making the index in an SQLite DB because it seems the particular dialect is not available. You can find the supported dialects here.

    ReplyDelete
  8. Anonymous10:41 AM

    Nice blog with valuable content.
    Happy blogging.

    ReplyDelete
  9. Hi Kalani,
    I have benefited a lot from this blog... thanks. Do you know any good tutorial where I can start learning Compass? I referred to the Compass home page documentation but somehow didn't get the flow... It would be helpful if the concepts were explained using examples...

    ReplyDelete
  10. Hi Aditya,

    The following may be helpful along with the Compass docs.
    http://www.geocities.com/sja085/community/tutorials/compass/compass01/
    http://www.infoq.com/articles/compass-search-tutorial

    Sorry I can't give a clear guide as I haven't used pure Compass.

    ReplyDelete
  11. Thanks a lot for those links :). I will be following your blogs closely. Keep the good work going.

    ReplyDelete
  12. Thank you for help :) Great post!!!

    ReplyDelete
  13. Hey Kalani, your blog sounds very good. A couple of things:
    1. Can we migrate the index files to DB?
    2. How is the performance?

    Thanks,
    Pavan

    ReplyDelete
  14. Yes Pavan, by following this approach the index files will be created in the DB. There is definitely a performance hit, but it varies with the index size and with how often you read and write from/to the index. (I used a RAMDirectory to reduce the performance hit.)

    ReplyDelete
  15. Hey Kalani,

    My question was: if I have a file-based index, can we migrate that file to a DB index?

    Appreciate your quick response.

    Thanks,
    Pavan

    ReplyDelete
  16. Kalani, please ignore my previous query. I figured it out.

    Thanks,
    Pavan

    ReplyDelete
  17. Nice post.
    But I was looking for C#.
    The post below helped me create one:
    Create Lucene Index

    Anyway, thanks.

    ReplyDelete
  18. Well guys, following is my sample code....

    MysqlDataSource dataSource = new MysqlDataSource();

    dataSource.setUser("root");
    dataSource.setPassword("ncl");
    dataSource.setDatabaseName("userdb");
    dataSource.setEmulateLocators(true); //This is important because we are dealing with a blob type data field

    try{

    JdbcDirectory jdbcDir = new JdbcDirectory(dataSource, new MySQLDialect(), "tttable");

    StandardAnalyzer analyzer = new StandardAnalyzer();

    IndexWriter writer = new IndexWriter(jdbcDir, analyzer,false);


    writer.optimize();
    writer.close();
    }catch(Exception e){

    System.out.print(e);
    }



    I am stuck at this line:

    IndexWriter writer = new IndexWriter(jdbcDir, analyzer, false);

    Every time I try to run this code I receive the following exception:

    "org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: PhantomReadLock[write.lock/tttable]"

    I can't find what's wrong with the code... maybe it's a jar compatibility issue. I am unable to get an IndexWriter object.

    Any help regarding this would be greatly appreciated.

    I am looking forward to your replies.
    Thanks.

    ReplyDelete
  19. Hi,

    This LockObtainFailedException occurs when a previously created IndexWriter was not closed properly. Creating an IndexWriter creates a lock file for the directory, so if you try to create another writer on the same directory, this exception may occur.
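    One way to recover from a stale lock (for example after a crash, when you are sure no other writer is actually running) is to clear it explicitly. A hedged sketch using the static helpers available since Lucene 2.1:

    ```java
    //only do this when no other IndexWriter can be open on the directory;
    //forcibly removing a live writer's lock can corrupt the index
    if (IndexWriter.isLocked(jdbcDir)) {
        IndexWriter.unlock(jdbcDir);
    }
    IndexWriter writer = new IndexWriter(jdbcDir, analyzer, false);
    ```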

    ReplyDelete
  20. Hello... first of all, thanks for your reply.

    I successfully ran this code on Oracle with a small change:
    oracle.jdbc.pool.OracleDataSource dataSource
    = new oracle.jdbc.pool.OracleDataSource();
    dataSource.setDriverType("thin");
    dataSource.setServerName("localhost");
    dataSource.setPortNumber(1521);
    dataSource.setDatabaseName("OraHome"); // Oracle SID
    dataSource.setUser("home");
    dataSource.setPassword("home");


    JdbcDirectory jdbcDir = new JdbcDirectory(dataSource, new OracleDialect(), "indextable");

    ....................

    this time without any error.

    I still want to ask whether there is a version problem with the Compass jar file or with my MySQL version, because the error always comes from this line:

    JdbcDirectory jdbcDir = new JdbcDirectory(dataSource, new MySQLDialect(), "tttable");

    Hope you understood my problem.

    ReplyDelete
  21. I don't think there's a problem with your Compass or MySQL versions. I suspect your DB crashed at some point, so a previously created lock file was never deleted. Did you clear the table and try it again?

    ReplyDelete
  22. Yes... I cleared it up several times, but no luck.

    This error is generated even when I run the code for the very first time, so I don't think it is a matter of clearing lock files; on a first run there is nothing to clear.

    ReplyDelete
  23. Could anyone help me with the same for Plucene search?

    ReplyDelete
  24. Hi, thank you for this good article. I have a question about how to use JdbcDirectory. I am not using the Compass framework; how can I use JdbcDirectory, and which jar file is needed?

    ReplyDelete
  25. JdbcDirectory comes from the Compass project; I couldn't find any framework other than Compass that provides a JDBC-backed Directory. Pure Lucene does not ship that functionality, so you need the Compass jar even when using it with plain Lucene.

    ReplyDelete
  26. Hi Kalani,
    I am trying to implement index and search logic. Did you get a chance to compare Lucene with any other search engines like Sphinx? If you have done any comparison, I would like to know the results and recommendations. Thanks
    Vijay

    ReplyDelete
  27. Sorry Vijay, I didn't get a chance to compare Lucene performance with other search libs.

    ReplyDelete
  28. Thank you very much Kalani ,
    The issue for me is that I need to index 30 million records. It is taking 15 minutes to index 1 million records, and that speed is not good enough for 30 million. Please suggest any inputs for optimization. My mail id is vijaykumar.ravva@gmail.com. Thanks
    Vijay

    ReplyDelete
  29. Hi Vijay, a suggestion I can think of is using RAM as much as possible and reducing frequent disk accesses while indexing. You have control over these factors via the mergeFactor and maxMergeDocs settings of IndexWriter; making those values larger is good for batch indexing scenarios like yours.
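    For a bulk run those knobs are set on the writer before adding documents. A sketch with the Lucene 2.x setters (the values here are purely illustrative):

    ```java
    IndexWriter writer = new IndexWriter(jdbcDir, analyzer, true);

    //buffer more documents in RAM before flushing a segment, and merge
    //segments less often; both trade memory for fewer disk/DB round trips
    writer.setMaxBufferedDocs(10000); //illustrative value
    writer.setMergeFactor(100);       //illustrative value
    writer.setMaxMergeDocs(Integer.MAX_VALUE); //cap on docs in a merged segment
    ```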

    ReplyDelete
  30. Hey Kalani, thanks much. I had already made all those changes; the problem was in my code while accessing the database. The issue is solved, and now we are indexing 1 million records in 6 minutes. Thanks for all your inputs and for sharing your valuable thoughts; your blog is very helpful. Please add my email vijaykumar.ravva@gmail.com.

    Thanks
    Vijay

    ReplyDelete
  31. Very interesting article, and very timely. We are currently trying to see how we can use Lucene to store the index in a DB; your post is a great help.
    thanks
    Rajesh Surve

    ReplyDelete
  32. Hi Kalani,

    Firstly, thanks for the wonderful post. It's an amazing piece of work which has helped a newbie like me to understand Lucene. I just have a small query: I want to store the Lucene indices in a local file system. Could you please provide any idea as to how that could be done?

    Thanks & Regards,

    Amitava

    ReplyDelete
  33. @Amitava: Thanks for the comment. Storing a Lucene index in the file system is an easy thing: just remove all the database-specific code and use FSDirectory instead of JdbcDirectory. Hope this will help.
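    A minimal sketch of that swap (FSDirectory.open is the Lucene 2.9+/3.x idiom; older versions used FSDirectory.getDirectory, and the index path here is just an example):

    ```java
    //same indexing code as in the post, but backed by a local directory
    //instead of a database table
    Directory dir = FSDirectory.open(new File("/path/to/index")); //example path
    IndexWriter writer = new IndexWriter(dir, analyzer, true);
    ```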

    ReplyDelete
  34. Hi Kalani,

    Thanks for your reply. Firstly, I am using Lucene 3.0.1. I had tried storing the data in the local file system using FSDirectory, but the problem is that I have real-time data and the older Lucene indices are getting deleted from the local directory that I am using; only the latest ones persist. I am putting my code down here.

    public class LucenePOC {

    StandardAnalyzer analyzer;
    MMapDirectory index;
    IndexWriter w;

    public void createIndex(JSONObject userInfo) throws IOException,ParseException{
    String text = null;
    long userid = 0;
    String creation_date = null;
    JSONObject userdetails = new JSONObject();
    String user_screen_name = null;
    String profile_image_url= null;
    String location= null;
    String language = null;

    File lucenefile = new File("C:/lucene");
    lucenefile.createNewFile();

    try{

    analyzer = new StandardAnalyzer(Version.LUCENE_30);

    index = new MMapDirectory(lucenefile.getAbsoluteFile());

    w = new IndexWriter(index, analyzer, false,IndexWriter.MaxFieldLength.UNLIMITED);

    userdetails = userInfo.getJSONObject("user");

    if(userInfo.has("text") && userInfo.getString("text")!= null && userInfo.getString("text")!= "" ){
    text = userInfo.getString("text");
    addDoc(w, text);
    }
    if(userInfo.has("id")&& !userInfo.isNull("id")){
    userid = userInfo.getLong("id");
    addDoc(w,String.valueOf(userid));
    }
    if(userInfo.has("created_at") && (!userInfo.isNull("created_at"))){
    creation_date = userInfo.getString("created_at");
    addDoc(w,creation_date);
    }
    if((userdetails.has("screen_name")) && (!userdetails.isNull("screen_name")) ){

    user_screen_name = userdetails.getString("screen_name");
    addDoc(w,user_screen_name);
    }
    if((userdetails.has("profile_image_url")) && (!userdetails.isNull("profile_image_url")) ){
    profile_image_url = userdetails.getString("profile_image_url");
    addDoc(w,profile_image_url);
    }
    if((userdetails.has("location")) && (!userdetails.isNull("location")) ){
    location = userdetails.getString("location");
    addDoc(w,location);
    }
    if((userdetails.has("lang")) && (!userdetails.isNull("lang")) ){
    language = userdetails.getString("lang");
    addDoc(w,language);
    }
    w.close();

    }catch (Exception e) {
    System.out.println("Exception in createIndex of TwitterLucenePOC "+e);
    }

    }

    private static void addDoc(IndexWriter w, String value) throws IOException {
    Document doc = new Document();
    doc.add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED));
    w.addDocument(doc);
    }

    }

    Is there anything wrong that I am doing? Or am I missing something crucial? Any idea or suggestion on this would be of great help.

    Thanks & Regards,

    Amitava.

    ReplyDelete
  35. Hi, I roughly went through the code and was wondering whether you are calling the createIndex method every time you index a record. There is index-creation code inside that method; if you are calling it over and over, the index will be recreated and will lose the old information, I guess.

    As I understand it, the indexing code should be separated from the index-creation code, and index creation should be done only once.

    ReplyDelete
  36. Hi Kalani,

    Thanks for your reply. I am invoking the createIndex method every time I get a new set of JSON data. So as per your observation, every time I invoke the createIndex method, the older data which had been indexed previously are replaced by newly indexed data. So how can this issue be resolved, since I am getting real time data, which needs to be indexed? Also as you had suggested "indexing code should be separated from index creating code and index creation should be done only once" Can you please clarify on this?

    ReplyDelete
  37. Maybe I am not that familiar with your scenario, but I think you can provide two methods instead of one.

    The createIndex method should contain only the index-creation code:

    analyzer = new StandardAnalyzer(Version.LUCENE_30);
    index = new MMapDirectory(new File("C:/lucene"));

    (Note that File.createNewFile() creates a plain file, not a directory, so that call should be dropped.)

    Put the remaining code (the indexing code) into another method, and call that method whenever you get a new set of JSON data. The createIndex method should be called only once.
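    Put together, the shape described above might look like the following sketch. The indexRecord name and the create-flag handling are my assumptions; addDoc and the field checks are as in the earlier comment:

    ```java
    //run once at startup
    public void createIndex() throws IOException {
        analyzer = new StandardAnalyzer(Version.LUCENE_30);
        index = new MMapDirectory(new File("C:/lucene"));
        //create the index only if it does not exist yet, otherwise append to it
        boolean create = !IndexReader.indexExists(index);
        w = new IndexWriter(index, analyzer, create, IndexWriter.MaxFieldLength.UNLIMITED);
    }

    //call for every new set of JSON data; appends documents to the open writer
    public void indexRecord(JSONObject userInfo) throws IOException, JSONException {
        if (userInfo.has("text") && !userInfo.isNull("text")) {
            addDoc(w, userInfo.getString("text"));
        }
        //...the other fields exactly as in the original createIndex...
        w.commit(); //make the new documents durable without closing the writer
    }
    ```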

    ReplyDelete
  38. Hi Kalani,

    Thanks for your inputs. I guess I've got it where I had gone wrong. I'll let you know once I have implementd your suggestion.

    Thanks & Regards,
    Amitava

    ReplyDelete
  39. Hi Kalani,

    thanks for your suggestion. It worked for me.

    Regards,
    Amitava

    ReplyDelete
  40. You are welcome. Happy to hear that.

    ReplyDelete
  41. Hi Kalani,

    bothering you once more with my silly questions. Do you have any idea how we can search a Lucene index within a particular date range? For example, I want to search for documents indexed from April 5 to April 10. I have seen the TermRangeQuery class but have no idea how to use it. Any suggestions would be of great help.

    Thanks & Regards,
    Amitava.

    ReplyDelete
  42. You may get an idea from this thread. I haven't worked with it, but it seems to be just another simple query type.
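    A hedged sketch for Lucene 3.x (the field name "created" and the date format are my assumptions): index the date as a sortable yyyyMMdd keyword string, then range-query that field.

    ```java
    SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd");

    //at index time: store the date as a NOT_ANALYZED keyword field,
    //so lexicographic order equals chronological order
    doc.add(new Field("created", fmt.format(new Date()),
            Field.Store.YES, Field.Index.NOT_ANALYZED));

    //at search time: everything from April 5 to April 10, 2011, inclusive
    TermRangeQuery query = new TermRangeQuery("created", "20110405", "20110410", true, true);
    ```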

    ReplyDelete
  43. Very interesting post.
    My concerns are related to performance. Would storing the index in a non-relational DB like MongoDB improve performance?

    ReplyDelete
  44. Hi Kalani,

    I am maintaining the code of an existing product. It happens to reference lucene-db.jar, but I am not able to find this jar anywhere. Would you know where I can obtain it?

    Your inputs would be appreciated.

    Regards,
    Deepali

    ReplyDelete
  45. Hi Deeps,

    Sorry, I am not aware of a lucene-db.jar. If there is such a jar then it should be available here, I guess.

    ReplyDelete
  46. Mahesh said...
    Hello Kalani, I created the indexing and searching mechanism as you described above, but when I try to fetch the records for any id I get the data related to the first id only.
    What do you think the problem might be, and what is the solution?

    ReplyDelete
  47. Hi Mahesh, I didn't get your question. Did the indexing go OK? And which field are you searching on?

    ReplyDelete
  48. Hi Kalani,
    I'm using Lucene for indexing Arabic text.
    Can you help me with using Tika for parsing different Arabic documents (PDF, Word, Excel)?
    Thank you

    ReplyDelete
  49. Hi Kalani,

    your solutions are useful, thanks a lot, and excuse me for my English.

    I'm interested in Lucene for indexing Arabic text; the problem is how to extract Arabic text from PDF and XML files.

    Any idea about indexing a database in Arabic?

    thanks

    ReplyDelete
  50. Hi ounich,

    Sorry, I haven't worked with Arabic text, but regarding the text extraction, did you try PDFBox? I am not sure though.

    ReplyDelete
  51. Hi Kalani,
    thanks for your answer. I tried Tika, which includes PDFBox, POI, etc.
    It's useful for Microsoft Office documents and HTML documents, but not for PDF, XML, or RTF.
    thanks a lot

    ReplyDelete
  52. Hi Kalani,
    thanks a lot for your solutions about indexing a database with Lucene.
    I tried to execute a query from the console but had no success.
    Do you have any idea?
    thanks a lot

    ReplyDelete
  53. Hi Ounich,

    Are you trying to build a Lucene index of a database? Would you be able to let me know what you did and what error you are getting?

    ReplyDelete
  54. Anonymous3:22 PM

    Thanks for the article. Is this possible with a Derby DB as well? What would the connection look like?

    ReplyDelete
  55. Hi rleuthold,
    Sorry, I worked with a SQL database only.

    ReplyDelete
  56. Hi
    I am a final-year student and I am trying to implement a resume parser that will scan resumes (.docx and .pdf) and store the relevant information in a database using Java/J2EE. Please help me implement this. My email id is mnksinha70@gmail.com
    Thanks in advance.

    ReplyDelete
  57. Hi
    I am a final year student and I am trying to implement resume parser that will scan the resume(.docx and.pdf) and store relevant informayion in a database using java/j2ee.please help me implement this.My email id is mnksinha70@gmail.com
    Thanks in advance.

    ReplyDelete
  58. Hi. I am a final-year student trying to implement a resume parser that will scan resumes (.docx and .pdf) and store the relevant information in a database. I am trying to implement this using Java and J2EE but am not getting how to do it. Please help me implement this feature.
    Thanks
    Reply me on mnksinha70@gmail.com

    ReplyDelete
  59. This comment has been removed by the author.

    ReplyDelete
  60. Hi

    I'm using Lucene version 4.0.0 and Compass 2.2.0 to store a Lucene index in MySQL.

    I am getting the following error and am stuck with it:

    Exception in thread "main" java.lang.NoSuchMethodError: org.apache.lucene.store.IndexInput: method <init>()V not found
    at org.apache.lucene.store.jdbc.handler.NoOpFileEntryHandler$NoOpIndexInput.<init>(NoOpFileEntryHandler.java:34)
    at org.apache.lucene.store.jdbc.handler.NoOpFileEntryHandler$NoOpIndexInput.<init>(NoOpFileEntryHandler.java:34)
    at org.apache.lucene.store.jdbc.handler.NoOpFileEntryHandler.<init>(NoOpFileEntryHandler.java:86)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Unknown Source)
    at org.apache.lucene.store.jdbc.JdbcFileEntrySettings.getSettingAsClass(JdbcFileEntrySettings.java:175)
    at org.apache.lucene.store.jdbc.JdbcFileEntrySettings.getSettingAsClass(JdbcFileEntrySettings.java:157)
    at org.apache.lucene.store.jdbc.JdbcDirectory.initialize(JdbcDirectory.java:173)
    at org.apache.lucene.store.jdbc.JdbcDirectory.<init>(JdbcDirectory.java:122)
    at CreateIndexToDB.createIndex(CreateIndexToDB.java:38)
    at CreateIndexToDB.main(CreateIndexToDB.java:87)

    Also, can you tell me which will perform better:
    storing the index in the FS or in the DB?

    Thanks in advance

    ReplyDelete
  61. hi
    excuse me, I did not use MySQL with Lucene; I only used some indexes created by Lucene. In my opinion, you should try verifying this problem with version 3.0.2.

    ReplyDelete
  62. Dear Kalani Ruwanpathirana,

    I want to create a Lucene index from my MySQL DB values and then search it. Please help me out with this difficulty; I am new to Lucene.

    ReplyDelete
  63. I just found this article and it seems pretty good, but I have some questions:
    1. What are the basic things that we must include every time we want to add Lucene to our Java program?
    2. Is there an updated way to learn and understand Lucene?
    I already posted my question on Stack Overflow: https://stackoverflow.com/questions/58424819/the-updated-way-of-using-apache-lucene
    Please feel free to comment; I am new to Lucene and really want to learn, but don't know the right source :)

    ReplyDelete
  64. Hi Blog St,
    I guess Lucene has evolved a lot by now, as my post is from 2008. If you are new to Lucene you probably want to begin with their documentation: https://lucene.apache.org/core/quickstart.html

    ReplyDelete
  65. Hi
    can you please share an example of storing a Lucene version 8 index in a database?
    A custom Directory class with an implementation of the rename method.

    ReplyDelete