Tuesday, June 24, 2008

Creating Lucene Index in a Database - Apache Lucene

My previous post, Indexing a database and searching the content using Lucene, shows how to index records (or stored files) in a database. In that case the index is created in the local file system. In real scenarios, however, most applications run in clustered environments, which raises the question of where to create the search index.

Creating the index in the local file system does not work in that situation, because the index has to be synchronized and shared by every node. One solution is clustering the JVM while using a Lucene RAMDirectory instead of an FSDirectory (keeping in mind that a RAMDirectory disappears after a node failure). The Terracotta framework can be used to cluster the JVM; this blog entry shows a code snippet.

Anyway, I thought not to go that far and decided to create the index in the database so that it can be shared by every node. Lucene's storage abstraction for this is the Directory class, but a JDBC-backed implementation is not shipped with Lucene itself. I found a third-party implementation: the Compass project provides JdbcDirectory. (No need to worry about Compass configuration etc.; JdbcDirectory can be used with pure Lucene without bothering about the rest of Compass.)

Here is a simple example.

//you need the Lucene jar, the Compass jar (for JdbcDirectory) and the MySQL JDBC driver on the classpath
import org.apache.lucene.store.jdbc.JdbcDirectory;
import org.apache.lucene.store.jdbc.dialect.MySQLDialect;
import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;

//code snippet to create the index
MysqlDataSource dataSource = new MysqlDataSource();
dataSource.setUser("root");
dataSource.setPassword("password");
dataSource.setDatabaseName("test");
dataSource.setEmulateLocators(true); //important because the index is stored in a BLOB field

JdbcDirectory jdbcDir = new JdbcDirectory(dataSource, new MySQLDialect(), "indexTable");
jdbcDir.create(); //creates indexTable in the "test" DB; no need to create it manually

//code snippet for indexing
StandardAnalyzer analyzer = new StandardAnalyzer();
IndexWriter writer = new IndexWriter(jdbcDir, analyzer, true); //true = create a new index
indexDocs(writer, dataSource.getConnection());
System.out.println("Optimizing...");
writer.optimize();
writer.close();


static void indexDocs(IndexWriter writer, Connection conn)
throws Exception {
    String sql = "select id, name, color from pet";
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery(sql);

    while (rs.next()) {
        Document d = new Document();
        d.add(new Field("id", rs.getString("id"), Field.Store.YES, Field.Index.NO));
        d.add(new Field("name", rs.getString("name"), Field.Store.YES, Field.Index.TOKENIZED));
        d.add(new Field("color", rs.getString("color"), Field.Store.YES,  Field.Index.TOKENIZED));
        writer.addDocument(d);
    }
}

That covers the indexing part; the searching part is the same as in my previous post.
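For completeness, a search against the JDBC-backed index could look like the following. This is a hedged sketch using the Lucene 2.x API of the era (Hits was replaced by TopDocs in later versions); the field name and query text are only examples.

```java
//hedged sketch: searching the same JdbcDirectory with the Lucene 2.x API
IndexSearcher searcher = new IndexSearcher(jdbcDir);
QueryParser parser = new QueryParser("name", new StandardAnalyzer());
Query query = parser.parse("fluffy"); //example query text
Hits hits = searcher.search(query);
for (int i = 0; i < hits.length(); i++) {
    System.out.println(hits.doc(i).get("name") + " - " + hits.doc(i).get("color"));
}
searcher.close();
```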

65 comments:

  1. Hi Kalani,
    This is an interesting post. It's great that we can directly write indexes into a database. I was doing it in a different way though. I create multimedia desktop applications using Adobe Flex and AIR, and Flex/AIR supports the SQLite database. So, for search, we index the content and write it into a MySQL DB, export the MySQL DB, and build a SQLite DB on the command line. This SQLite DB is used for searching from the Flash/Flex/AIR application. The search code is written by me; I didn't follow Lucene's logic for search, and have only implemented single-term, AND, OR and NOT searches. Am I doing something really wrong, or is it a long way to the same result? How can I get the logic to build search code in ActionScript which supports all the search methods Lucene supports? I have been using this for quite a long time, but I am not a Lucene expert. Can this be done?

    ReplyDelete
  2. Hi
    Very interesting. Are there any comparisons about the performance between a jdbc index and a normal file index.

    ReplyDelete
  3. Performance is not comparable. Here the advantage is that you can have fulltext search in a CD or desktop application without having any server installed.

    ReplyDelete
  4. Hi Vipin,
    is there any possibility of using a RAMDirectory (which keeps the index in memory)? I mean, if you can, then you could use only the MySQL DB and a RAMDirectory. You can periodically flush the updates in the RAMDirectory to the MySQL DB (to avoid losses on failures), and when you start the app you load the index into memory. I guess then you will be able to deal directly with the index in RAM, with pure Lucene logic for search.

    If you are further interested you can refer to this post.

    ReplyDelete
  5. Hi Ralph,
    I think a JDBC index has a performance issue if it is used heavily. If the index is updated frequently, it's obvious that writing to a database takes more time than writing to a normal file system. But the advantage of a JDBC index is that it works well in clustered environments.

    To overcome this performance issue I use a RAMDirectory as a synapse :)
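    The idea could be sketched roughly as follows; this is an assumption-laden outline (the copying RAMDirectory constructor and the static Directory.copy helper are from the Lucene 2.x API):

    ```java
    //on startup: pull the shared index out of the database into memory
    RAMDirectory ramDir = new RAMDirectory(jdbcDir); //copies the current contents of jdbcDir

    //...index and search against ramDir with the normal Lucene API...

    //periodically, and on shutdown: flush the in-memory index back to the DB,
    //so a node failure loses at most one flush interval of updates
    Directory.copy(ramDir, jdbcDir, false); //false = do not close ramDir
    ```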

    ReplyDelete
  6. Hi Kalani,

    I cannot deal with a RAM Directory from my desktop application; the Flex/AIR framework doesn't support all that. I got interested in this particular post because with this we can directly save the index into a database, instead of indexing it into a file and then reading it to translate into a database. This helps. I also need to check whether we can write directly into an SQLite database. Will that be possible?

    I cannot load the index into my application while I search, and the search can be a little slower compared to an online Lucene search. Right now I am thinking of rewriting the searching logic of Lucene in ActionScript 3.0. Where should I look for the logic (I am not a Java guy, so reading and finding the logic is a bit difficult for me)? If I get the logic behind it, I can write the classes for that.

    Thanks for this great post!

    ReplyDelete
  7. Vipin,

    I am not sure about making the index in an SQLite DB because it seems the particular dialect is not available. You can find the supported dialects here.

    ReplyDelete
  8. Anonymous10:41 AM

    Nice blog with valuable content.
    Happy blogging.

    ReplyDelete
  9. Hi Kalani,
    I have benefited a lot from this blog... thanks. Do you know any good tutorial where I can start learning Compass? I referred to the Compass home page documentation but somehow didn't get the flow... It would be helpful if the concepts were explained using examples...

    ReplyDelete
  10. Hi Aditya,

    The following may be helpful along with the Compass docs.
    http://www.geocities.com/sja085/community/tutorials/compass/compass01/
    http://www.infoq.com/articles/compass-search-tutorial

    Sorry I can't give a clear guide as I haven't used pure Compass.

    ReplyDelete
  11. Thanks a lot for those links :). I will be following your blogs closely. Keep the good work going.

    ReplyDelete
  12. Thank you for help :) Great post!!!

    ReplyDelete
  13. Hey Kalani, your blog sounds very good. A couple of things:
    1. Can we migrate the index files to DB?
    2. How is the performance?

    Thanks,
    Pavan

    ReplyDelete
  14. Yes Pavan, by following this approach the index files will be created in the DB. There is definitely a performance hit, but it varies with the index size and with how often you read and write from/to the index. (I used a RAMDirectory to reduce the performance hit.)

    ReplyDelete
  15. Hey Kalani,

    My question was: if I have a file-based index, can we migrate that file to a DB index?

    Appreciate your quick response.

    Thanks,
    Pavan

    ReplyDelete
  16. Kalani, please ignore my previous query. I figured it out.

    Thanks,
    Pavan

    ReplyDelete
  17. Nice post.
    But I was looking for C#.
    The post below helped me create one:
    Create Lucene Index

    Anyway, thanks.

    ReplyDelete
  18. Well guys, following is my sample code....

    MysqlDataSource dataSource = new MysqlDataSource();

    dataSource.setUser("root");
    dataSource.setPassword("ncl");
    dataSource.setDatabaseName("userdb");
    dataSource.setEmulateLocators(true); //This is important because we are dealing with a blob type data field

    try{

    JdbcDirectory jdbcDir = new JdbcDirectory(dataSource, new MySQLDialect(), "tttable");

    StandardAnalyzer analyzer = new StandardAnalyzer();

    IndexWriter writer = new IndexWriter(jdbcDir, analyzer,false);


    writer.optimize();
    writer.close();
    }catch(Exception e){

    System.out.print(e);
    }



    I am stuck at this line:

    IndexWriter writer = new IndexWriter(jdbcDir, analyzer, false);

    Every time I try to run this code I receive the following exception:

    "org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: PhantomReadLock[write.lock/tttable]"

    I can't find what's wrong with the code... maybe it's a jar compatibility issue. I am unable to get an IndexWriter object.

    Any help regarding this would be greatly appreciated.

    I am looking forward to your replies.
    Thanks.

    ReplyDelete
  19. Hi,

    This LockObtainFailedException occurs when a previously created IndexWriter was not closed properly. Creating an IndexWriter creates a lock file for the directory, so if you try to create another writer on the same directory, this exception may occur.
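    One way to recover from a stale lock (for example after a crash, when you are sure no other writer is actually running) is to clear it explicitly. A hedged sketch using the static helpers available since Lucene 2.1:

    ```java
    //only do this when no other IndexWriter can be open on the directory;
    //forcibly removing a live writer's lock can corrupt the index
    if (IndexWriter.isLocked(jdbcDir)) {
        IndexWriter.unlock(jdbcDir);
    }
    IndexWriter writer = new IndexWriter(jdbcDir, analyzer, false);
    ```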

    ReplyDelete
  20. Hello... first of all, thanks for your reply.

    I successfully ran this code on Oracle with a small change:
    oracle.jdbc.pool.OracleDataSource dataSource
    = new oracle.jdbc.pool.OracleDataSource();
    dataSource.setDriverType("thin");
    dataSource.setServerName("localhost");
    dataSource.setPortNumber(1521);
    dataSource.setDatabaseName("OraHome"); // Oracle SID
    dataSource.setUser("home");
    dataSource.setPassword("home");


    JdbcDirectory jdbcDir = new JdbcDirectory(dataSource, new OracleDialect(), "indextable");

    ....................

    this time without any error.

    I still want to ask whether there is a version problem with the Compass jar file or with my MySQL version, because the error always comes from this line:

    JdbcDirectory jdbcDir = new JdbcDirectory(dataSource, new MySQLDialect(), "tttable");

    Hope you understood my problem.

    ReplyDelete
  21. I don't think there's a problem with your Compass or MySQL versions. I suspect your DB crashed at some point, so a previously created lock file was never deleted. Did you clear the table and try it again?

    ReplyDelete
  22. Yes... I cleared it up several times, but no luck.

    This error is generated even when I run the code for the very first time, so I don't think it is a matter of clearing lock files; on a first run there is nothing to clear.

    ReplyDelete
  23. Could anyone help me with the same for Plucene search?

    ReplyDelete
  24. Hi, thank you for this good article. I have a question about how to use JdbcDirectory. I am not using the Compass framework; how can I use JdbcDirectory, and which jar file is needed?

    ReplyDelete
  25. JdbcDirectory comes from the Compass project; I couldn't find any framework other than Compass that provides a JDBC-backed Directory. Pure Lucene does not ship that functionality, so you need the Compass jar even when using it with plain Lucene.

    ReplyDelete
  26. Hi Kalani,
    I am trying to implement index and search logic. Did you get a chance to compare Lucene with any other search engines like Sphinx? If you have done any comparison, I would like to know the results and recommendations. Thanks
    Vijay

    ReplyDelete
  27. Sorry Vijay, I didn't get a chance to compare Lucene performance with other search libs.

    ReplyDelete
  28. Thank you very much Kalani ,
    The issue for me is that I need to index 30 million records. It is taking 15 minutes to index 1 million records, and that speed is not good enough for 30 million. Please suggest any inputs for optimization. My mail id is vijaykumar.ravva@gmail.com. Thanks
    Vijay

    ReplyDelete
  29. Hi Vijay, a suggestion I can think of is using RAM as much as possible and reducing frequent disk accesses while indexing. You have control over these factors via the mergeFactor and maxMergeDocs settings of IndexWriter; making those values larger is good for batch indexing scenarios like yours.
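    For a bulk run those knobs are set on the writer before adding documents. A sketch with the Lucene 2.x setters (the values here are purely illustrative):

    ```java
    IndexWriter writer = new IndexWriter(jdbcDir, analyzer, true);

    //buffer more documents in RAM before flushing a segment, and merge
    //segments less often; both trade memory for fewer disk/DB round trips
    writer.setMaxBufferedDocs(10000); //illustrative value
    writer.setMergeFactor(100);       //illustrative value
    writer.setMaxMergeDocs(Integer.MAX_VALUE); //cap on docs in a merged segment
    ```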

    ReplyDelete
  30. Hey Kalani, thanks much. I had already made all those changes; the problem was in my code while accessing the database. The issue is solved, and now we are indexing 1 million records in 6 minutes. Thanks for all your inputs and for sharing your valuable thoughts; your blog is very helpful. Please add my email vijaykumar.ravva@gmail.com.

    Thanks
    Vijay

    ReplyDelete
  31. Very interesting article, and very timely. We are currently trying to see how we can use Lucene to store the index in a DB; your post is a great help.
    thanks
    Rajesh Surve

    ReplyDelete
  32. Hi Kalani,

    Firstly, thanks for the wonderful post. It's an amazing piece of work which has helped a newbie like me to understand Lucene. I just have a small query: I want to store the Lucene indices in a local file system. Could you please provide any idea as to how that could be done?

    Thanks & Regards,

    Amitava

    ReplyDelete
  33. @Amitava: Thanks for the comment. Storing a Lucene index in the file system is an easy thing: just remove all the database-specific code and use FSDirectory instead of JdbcDirectory. Hope this will help.
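    A minimal sketch of that swap (FSDirectory.open is the Lucene 2.9+/3.x idiom; older versions used FSDirectory.getDirectory, and the index path here is just an example):

    ```java
    //same indexing code as in the post, but backed by a local directory
    //instead of a database table
    Directory dir = FSDirectory.open(new File("/path/to/index")); //example path
    IndexWriter writer = new IndexWriter(dir, analyzer, true);
    ```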

    ReplyDelete
  34. Hi Kalani,

    Thanks for your reply. Firstly, I am using Lucene 3.0.1. I had tried storing the data in the local file system using FSDirectory, but the problem is that I have real-time data and the older Lucene indices are getting deleted from the local directory that I am using; only the latest ones persist. I am putting my code down here.

    public class LucenePOC {

    StandardAnalyzer analyzer;
    MMapDirectory index;
    IndexWriter w;

    public void createIndex(JSONObject userInfo) throws IOException,ParseException{
    String text = null;
    long userid = 0;
    String creation_date = null;
    JSONObject userdetails = new JSONObject();
    String user_screen_name = null;
    String profile_image_url= null;
    String location= null;
    String language = null;

    File lucenefile = new File("C:/lucene");
    lucenefile.createNewFile();

    try{

    analyzer = new StandardAnalyzer(Version.LUCENE_30);

    index = new MMapDirectory(lucenefile.getAbsoluteFile());

    w = new IndexWriter(index, analyzer, false,IndexWriter.MaxFieldLength.UNLIMITED);

    userdetails = userInfo.getJSONObject("user");

    if(userInfo.has("text") && userInfo.getString("text")!= null && userInfo.getString("text")!= "" ){
    text = userInfo.getString("text");
    addDoc(w, text);
    }
    if(userInfo.has("id")&& !userInfo.isNull("id")){
    userid = userInfo.getLong("id");
    addDoc(w,String.valueOf(userid));
    }
    if(userInfo.has("created_at") && (!userInfo.isNull("created_at"))){
    creation_date = userInfo.getString("created_at");
    addDoc(w,creation_date);
    }
    if((userdetails.has("screen_name")) && (!userdetails.isNull("screen_name")) ){

    user_screen_name = userdetails.getString("screen_name");
    addDoc(w,user_screen_name);
    }
    if((userdetails.has("profile_image_url")) && (!userdetails.isNull("profile_image_url")) ){
    profile_image_url = userdetails.getString("profile_image_url");
    addDoc(w,profile_image_url);
    }
    if((userdetails.has("location")) && (!userdetails.isNull("location")) ){
    location = userdetails.getString("location");
    addDoc(w,location);
    }
    if((userdetails.has("lang")) && (!userdetails.isNull("lang")) ){
    language = userdetails.getString("lang");
    addDoc(w,language);
    }
    w.close();

    }catch (Exception e) {
    System.out.println("Exception in createIndex of TwitterLucenePOC "+e);
    }

    }

    private static void addDoc(IndexWriter w, String value) throws IOException {
    Document doc = new Document();
    doc.add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED));
    w.addDocument(doc);
    }

    }

    Is there anything wrong that I am doing? Or am I missing something crucial? Any idea or suggestion on this would be of great help.

    Thanks & Regards,

    Amitava.

    ReplyDelete
  35. Hi, I roughly went through the code and was wondering whether you are calling the createIndex method every time you index a record. There is index-creation code inside that method; if you are calling it over and over, the index will be recreated and will lose the old information, I guess.

    As I understand it, the indexing code should be separated from the index-creation code, and index creation should be done only once.

    ReplyDelete
  36. Hi Kalani,

    Thanks for your reply. I am invoking the createIndex method every time I get a new set of JSON data. So as per your observation, every time I invoke the createIndex method, the older data which had been indexed previously are replaced by newly indexed data. So how can this issue be resolved, since I am getting real time data, which needs to be indexed? Also as you had suggested "indexing code should be separated from index creating code and index creation should be done only once" Can you please clarify on this?

    ReplyDelete
  37. Maybe I am not that familiar with your scenario, but I think you can provide two methods instead of one.

    The createIndex method should contain only the index-creation code:

    analyzer = new StandardAnalyzer(Version.LUCENE_30);
    index = new MMapDirectory(new File("C:/lucene"));

    (Note that File.createNewFile() creates a plain file, not a directory, so that call should be dropped.)

    Put the remaining code (the indexing code) into another method, and call that method whenever you get a new set of JSON data. The createIndex method should be called only once.
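    Put together, the shape described above might look like the following sketch. The indexRecord name and the create-flag handling are my assumptions; addDoc and the field checks are as in the earlier comment:

    ```java
    //run once at startup
    public void createIndex() throws IOException {
        analyzer = new StandardAnalyzer(Version.LUCENE_30);
        index = new MMapDirectory(new File("C:/lucene"));
        //create the index only if it does not exist yet, otherwise append to it
        boolean create = !IndexReader.indexExists(index);
        w = new IndexWriter(index, analyzer, create, IndexWriter.MaxFieldLength.UNLIMITED);
    }

    //call for every new set of JSON data; appends documents to the open writer
    public void indexRecord(JSONObject userInfo) throws IOException, JSONException {
        if (userInfo.has("text") && !userInfo.isNull("text")) {
            addDoc(w, userInfo.getString("text"));
        }
        //...the other fields exactly as in the original createIndex...
        w.commit(); //make the new documents durable without closing the writer
    }
    ```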

    ReplyDelete
  38. Hi Kalani,

    Thanks for your inputs. I guess I've got it where I had gone wrong. I'll let you know once I have implementd your suggestion.

    Thanks & Regards,
    Amitava

    ReplyDelete
  39. Hi Kalani,

    thanks for your suggestion. It worked for me.

    Regards,
    Amitava

    ReplyDelete
  40. You are welcome. Happy to hear that.

    ReplyDelete
  41. Hi Kalani,

    bothering you once more with my silly questions. Do you have any idea how we can search a Lucene index within a particular date range? For example, I want to search for documents indexed from April 5 to April 10. I have seen the TermRangeQuery class but have no idea how to use it. Any suggestions would be of great help.

    Thanks & Regards,
    Amitava.

    ReplyDelete
  42. You may get an idea from this thread. I haven't worked with it, but it seems to be just another simple query type.
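    A hedged sketch for Lucene 3.x (the field name "created" and the date format are my assumptions): index the date as a sortable yyyyMMdd keyword string, then range-query that field.

    ```java
    SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd");

    //at index time: store the date as a NOT_ANALYZED keyword field,
    //so lexicographic order equals chronological order
    doc.add(new Field("created", fmt.format(new Date()),
            Field.Store.YES, Field.Index.NOT_ANALYZED));

    //at search time: everything from April 5 to April 10, 2011, inclusive
    TermRangeQuery query = new TermRangeQuery("created", "20110405", "20110410", true, true);
    ```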

    ReplyDelete
  43. Very interesting post.
    My concerns are related to performance. Would storing the index in a non-relational DB like MongoDB improve performance?

    ReplyDelete
  44. Hi Kalani,

    I am maintaining the code of an existing product. It happens to reference lucene-db.jar, but I am not able to find this jar anywhere. Would you know where I can obtain it?

    Your inputs would be appreciated.

    Regards,
    Deepali

    ReplyDelete
  45. Hi Deeps,

    Sorry, I am not aware of a lucene-db.jar. If there is such a jar then it should be available here, I guess.

    ReplyDelete
  46. Mahesh said...
    Hello Kalani, I created the indexing and searching mechanism as you described above, but when I try to fetch the records for any id I get the data related to the first id only.
    What do you think the problem might be, and what is the solution?

    ReplyDelete
  47. Hi Mahesh, I didn't get your question. Did the indexing go OK? And which field are you searching on?

    ReplyDelete
  48. Hi Kalani,
    I'm using Lucene for indexing Arabic text.
    Can you help me with using Tika for parsing different Arabic documents (PDF, Word, Excel)?
    Thank you

    ReplyDelete
  49. Hi Kalani,

    your solutions are useful, thanks a lot, and excuse me for my English.

    I'm interested in Lucene for indexing Arabic text; the problem is how to extract Arabic text from PDF and XML files.

    Any idea about indexing a database in Arabic?

    thanks

    ReplyDelete
  50. Hi ounich,

    Sorry, I haven't worked with Arabic text, but regarding the text extraction, did you try PDFBox? I am not sure though.

    ReplyDelete
  51. Hi Kalani,
    thanks for your answer. I tried Tika, which includes PDFBox, POI, etc.
    It's useful for Microsoft Office documents and HTML documents, but not for PDF, XML, or RTF.
    thanks a lot

    ReplyDelete
  52. Hi Kalani,
    thanks a lot for your solutions about indexing a database with Lucene.
    I tried to execute a query from the console but had no success.
    Do you have any idea?
    thanks a lot

    ReplyDelete
  53. Hi Ounich,

    Are you trying to build a Lucene index of a database? Would you be able to let me know what you did and what error you are getting?

    ReplyDelete
  54. Anonymous3:22 PM

    Thanks for the article. Is this possible with a Derby DB as well? What would the connection look like?

    ReplyDelete
  55. Hi rleuthold,
    Sorry, I worked with a SQL database only.

    ReplyDelete
  56. Hi
    I am a final-year student and I am trying to implement a resume parser that will scan resumes (.docx and .pdf) and store the relevant information in a database using Java/J2EE. Please help me implement this. My email id is mnksinha70@gmail.com
    Thanks in advance.

    ReplyDelete
  57. Hi
    I am a final year student and I am trying to implement resume parser that will scan the resume(.docx and.pdf) and store relevant informayion in a database using java/j2ee.please help me implement this.My email id is mnksinha70@gmail.com
    Thanks in advance.

    ReplyDelete
  58. Hi. I am a final-year student trying to implement a resume parser that will scan resumes (.docx and .pdf) and store the relevant information in a database. I am trying to implement this using Java and J2EE but am not getting how to do it. Please help me implement this feature.
    Thanks
    Reply me on mnksinha70@gmail.com

    ReplyDelete
  59. This comment has been removed by the author.

    ReplyDelete
  60. Hi

    I'm using Lucene version 4.0.0 and Compass 2.2.0 to store a Lucene index in MySQL.

    I am getting the following error and am stuck with it:

    Exception in thread "main" java.lang.NoSuchMethodError: org.apache.lucene.store.IndexInput: method <init>()V not found
    at org.apache.lucene.store.jdbc.handler.NoOpFileEntryHandler$NoOpIndexInput.<init>(NoOpFileEntryHandler.java:34)
    at org.apache.lucene.store.jdbc.handler.NoOpFileEntryHandler$NoOpIndexInput.<init>(NoOpFileEntryHandler.java:34)
    at org.apache.lucene.store.jdbc.handler.NoOpFileEntryHandler.<init>(NoOpFileEntryHandler.java:86)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Unknown Source)
    at org.apache.lucene.store.jdbc.JdbcFileEntrySettings.getSettingAsClass(JdbcFileEntrySettings.java:175)
    at org.apache.lucene.store.jdbc.JdbcFileEntrySettings.getSettingAsClass(JdbcFileEntrySettings.java:157)
    at org.apache.lucene.store.jdbc.JdbcDirectory.initialize(JdbcDirectory.java:173)
    at org.apache.lucene.store.jdbc.JdbcDirectory.<init>(JdbcDirectory.java:122)
    at CreateIndexToDB.createIndex(CreateIndexToDB.java:38)
    at CreateIndexToDB.main(CreateIndexToDB.java:87)

    Also, can you tell me which will perform better:
    storing the index in the FS or in the DB?

    Thanks in advance

    ReplyDelete
  61. hi
    excuse me, I did not use MySQL with Lucene; I only used some indexes created by Lucene. In my opinion, you should try verifying this problem with version 3.0.2.

    ReplyDelete
  62. Dear Kalani Ruwanpathirana,

    I want to create a Lucene index from my MySQL DB values and then search it. Please help me out with this difficulty; I am new to Lucene.

    ReplyDelete
  63. I just found this article and it seems pretty good, but I have some questions:
    1. What are the basic things that we must include every time we want to add Lucene to our Java program?
    2. Is there an updated way to learn and understand Lucene?
    I already posted my question on Stack Overflow: https://stackoverflow.com/questions/58424819/the-updated-way-of-using-apache-lucene
    Please feel free to comment; I am new to Lucene and really want to learn, but don't know the right source :)

    ReplyDelete
  64. Hi Blog St,
    I guess Lucene has evolved a lot by now, as my post is from 2008. If you are new to Lucene you probably want to begin with their documentation: https://lucene.apache.org/core/quickstart.html

    ReplyDelete
  65. Hi
    can you please share an example of storing a Lucene version 8 index in a database?
    A custom Directory class with an implementation of the rename method.

    ReplyDelete