Tuesday, June 24, 2008

Creating Lucene Index in a Database - Apache Lucene

My previous post, Indexing a database and searching the content using Lucene, shows how to index records (or stored files) in a database. In that case the index is created in the local file system. However in real scenarios most of the applications run on clustered environments. Then the problem comes where to create the search index.

Creating the index in the local file system is not a solution for the particular situation as the index should be synchronized and shared by every node. One solution is clustering the JVM while using a Lucene RAMDirectory (keep in mind it disappears after a node failure) instead of a FSDirectory. Terracotta framework can be used to cluster the JVM. This blog entry shows a code snippet.

Anyway I thought not to go that far and decided to create the index in the database so that it can be shared by everyone. Lucence contains the JdbcDirectory interface for this purpose. However the implementation of this interface is not shipped with Lucene itself. I found a third party implementation of that. Compass project provides the implementation of JdbcDirectory. (No need to worry about compass configurations etc. JdbcDirectory can be used with pure Lucene without bothering about Compass Lucene stuff).

Here is a simple example
//you need to include lucene and jdbc jars 
import org.apache.lucene.store.jdbc.JdbcDirectory;

import org.apache.lucene.store.jdbc.dialect.MySQLDialect;

import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;
.
//code snippet to create index
MysqlDataSource dataSource = new MysqlDataSource();

dataSource.setUser("root");

dataSource.setPassword("password");

dataSource.setDatabaseName("test");

dataSource.setEmulateLocators(true); //This is important because we are dealing with a blob type data field

JdbcDirectory jdbcDir = new JdbcDirectory(dataSource, new MySQLDialect(), "indexTable");

jdbcDir.create(); // creates the indexTable in the DB (test). No need to create it manually
.

//code snippet for indexing
StandardAnalyzer analyzer = new StandardAnalyzer();

IndexWriter writer = new IndexWriter(jdbcDir, analyzer, true);

indexDocs(writer, dataSource.getConnection());

System.out.println("Optimizing...");

writer.optimize();

writer.close();


static void indexDocs(IndexWriter writer, Connection conn)
throws Exception {
    String sql = "select id, name, color from pet";
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery(sql);

    while (rs.next()) {
        Document d = new Document();
        d.add(new Field("id", rs.getString("id"), Field.Store.YES, Field.Index.NO));
        d.add(new Field("name", rs.getString("name"), Field.Store.YES, Field.Index.TOKENIZED));
        d.add(new Field("color", rs.getString("color"), Field.Store.YES,  Field.Index.TOKENIZED));
        writer.addDocument(d);
    }
}

This is the indexing part. Searching part is same as the one in my previous post.

62 comments:

Vipin said...

Hi Kalani,
This is an interesting post. Its great that we can directly write indexes into a database. I was doing it in a different way though. I create multimedia desktop applications using Adobe Flex and AIR. Flex/AIR supports SQLite database. So, for search, we index the content, read and write into a My SQL DB. Export MySQL db, and make SQLite DB in commandline. This SQLite DB is used to search from the flash/flex/AIR application. The search code is written by me. I didnt follow Lucene logic for search. I have only implemented single term, AND, OR and NOT searches. Am I doing something really wrong, or is it a long way to come to the result. How can I get the logic to build search code in Actionscript, which can support all the search methods which Lucene supports? I have been using this for quiet a long time, but not a Lucene expert. Can these be done?

Ralph said...

Hi
Very interesting. Are there any comparisons about the performance between a jdbc index and a normal file index.

Vipin said...

Performance is not comparable. Here the advantage is that you can have fulltext search in a CD or desktop application without having any server installed.

kalani said...

Hi Vipin,
is there any possibility to deal with a RAMDirectory(which keeps the index in the memory)? I mean if you can, then you can only use the MySql DB and RAMDirectory.You can periodically flush the updates in the RAMDirectory to MySql DB(to avoid losses in failures). When you start the app you should load the index to memory. I guess then you will be able to directly deal with the index in the RAM with pure Lucene logics for search.

If you are further interested you can refer to this post.

kalani said...

Hi Ralph,
I think jdbc index has a performance issue if it is used for a massive purpose. If the index is being frequently updated, it's obvious that writing to a database takes more time than writing to a normal file sys. But the advantage of jdbc index is it is good for clustered environments.

To overcome this performance issue I use a RAMDirectory as a synapse :)

Vipin said...

Hi Kalani,

I cannot deal with RAM Directory from my desktop application. The Flex/AIR framework doesn't support all those. I got interested in this particular post, because using this we can directly save the index into a database instead of indexing it into a file and then read it to translate into a database. This helps. I need to also check whether we can write directly into an SQLite database. Will that be possible?

I cannot load the index into my application while I search. The search can be little slower compared to online lucene search also. Right now I was thinking to re-write the Searching logic of lucene in Actionscript 3.0. Where should I look for the logic (I am not a Java guy, so reading and finding the logic is bit difficult for me)? If I get the logic behind it, I can write the classes for that.

Thanks for this great post!

kalani said...

Vipin,

I am not sure about making the index in a SQLite DB because seems the particular dialect is not available. You can find the supported dialects here.

Baby Blogger said...

Nice blog with valuable content.
Happy blogging.

Aditya.R said...

Hi Kalani,
I have benefited a lot from this blog... thanks.Do you know any good tutorial where I can start learning Compass ...I referred compass home page documentation but somehow didn't get the flow... It would be helpful if the concepts are explained using examples...

kalani said...

Hi Aditya,

Followings may be helpful with Compass docs.
http://www.geocities.com/sja085/community/tutorials/compass/compass01/
http://www.infoq.com/articles/compass-search-tutorial

Sorry I can't give a clear guide as I haven't used pure Compass.

Aditya.R said...

Thanks a lot for those links :). I will be following your blogs closely. Keep the good work going.

maksim said...

Thank you for help :) Great post!!!

Venkat Pavan Kumar said...

Hey Kalani, your blogs sounds very good. Couple of things:
1. Can we migrate the index files to DB?
2. How is the performance?

Thanks,
Pavan

kalani Ruwanpathirana said...

yes Pavan, by following this approach the index files will be created in the DB. Definitely there is a performance hit but it varies on the index size and how often you read and write from/to the index. (I used RamDirectory to reduce the performance hit)

Venkat Pavan Kumar said...

Hey Kalani,

My question was if I have a file based index then can we migrate this file to DB index file?

Appreciate your quick response.

Thanks,
Pavan

Venkat Pavan Kumar said...

Kalani, please ignore my previous query. I figured it out.

Thanks,
Pavan

Yogesh said...

Nice Post.
But i was looking in C#
The below post helped me to create
Create Lucene Index

Anyway thanks.

shontu said...

Well guys, following is my sample code....

MysqlDataSource dataSource = new MysqlDataSource();

dataSource.setUser("root");
dataSource.setPassword("ncl");
dataSource.setDatabaseName("userdb");
dataSource.setEmulateLocators(true); //This is important because we are dealing with a blob type data field

try{

JdbcDirectory jdbcDir = new JdbcDirectory(dataSource, new MySQLDialect(), "tttable");

StandardAnalyzer analyzer = new StandardAnalyzer();

IndexWriter writer = new IndexWriter(jdbcDir, analyzer,false);


writer.optimize();
writer.close();
}catch(Exception e){

System.out.print(e);
}



i am stuck at ------IndexWriter writer = new IndexWriter(jdbcDir, analyzer,false);---------

this line .Everytime i try to run this code i recieve following exception..

------"org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: PhantomReadLock[write.lock/tttable]"------------

i cant find wats wrong with the code....may be its a jar comaptible issue....

i am unable to get IndexWriter object..

any help regading this would be greately appreciated.


i am looking forwad for your replies.
thanx.

kalani Ruwanpathirana said...

Hi,

This LockObtainFailedException occurs if a previously created index writer is not closed properly. Creating an indexWriter creates a lock file for the directory. If you tries to create another writer in the same directory, this exception may occur.

shontu said...

Hello...First of all thanx for your reply....

i succeccfully run this code on oracle with the small change..

as.........
oracle.jdbc.pool.OracleDataSource dataSource
= new oracle.jdbc.pool.OracleDataSource();
dataSource.setDriverType("thin");
dataSource.setServerName("localhost");
dataSource.setPortNumber(1521);
dataSource.setDatabaseName("OraHome"); // Oracle SID
dataSource.setUser("home");
dataSource.setPassword("home");


JdbcDirectory jdbcDir = new JdbcDirectory(dataSource, new OracleDialect(), "indextable");

....................

this time without any error.

i still want to ask is there any version problem with compass jar file.....or with my mysql version..

coz line of the error is this only.


JdbcDirectory jdbcDir = new JdbcDirectory(dataSource, new MySQLDialect(), "tttable");

.....hope you understood my problem

kalani Ruwanpathirana said...

I don't think that there's a problem with your Compass or Mysql versions. I doubt that your db is crashed so that previously created lock file is not deleted. Did you clear the table and try it?

shontu said...

Yes... i cleared up for several time.
but no luck...

this error gets generated even if i run this code for the very first time....

so i don't think there is issue to clear lock files for the first time, when there is nothing generated..

Vijayant said...

could any one help me the same for plucene search

Volkan said...

Hi, thank you for this good article. I have a question how to use JdbcDirectory. I am not using Compass Framework, than how can I use JdbcDirectory? Which jar file needed?

kalani Ruwanpathirana said...

JdbcDirectory is just an interface in Lucene. I couldn't find any framework which implements JdbcDirectory other than Compass. Pure Lucene has not implemented that functionality.

Vijay Kumar said...

Hi Kalani,
I am trying to implement index and search logic. Did you get a chance to compare with any other search engines like Sphinx. If you have done any comparision ..want to know the results and recommendations.Thanks
Vijay

kalani Ruwanpathirana said...

Sorry Vijay, I didn't get a chance to compare Lucene performance with other search libs.

Vijay Kumar said...

Thank you very much Kalani ,
The issue for me i need to index 30 million records. Ita taking 15 million for indexing 1 million records, the speed is not good for indexing 30 million records. Please do suggest any inputs for optimization . My mail id vijaykumar.ravva@gmail.com .Thanks
Vijay

kalani Ruwanpathirana said...

H Vijay, A suggession I can think of is, Using RAM as much as possible and reducing frequent disk accesses while indexing. You have the control for these factors via mergeFactor and maxMergeDocs instance variables of IndexWriter. Making those values larger, is good for batch indexing scenarios like yours.

Vijay Kumar said...

Hey Kalani ..thanks much .. i did all those chanes already.. the problem with my code while accessign the Data base.. solved the issue ..now we are indexing 1 million in 6 mins ..Thanks for all your inputs and sharing your valuable thoughts. your blog is very helpful. Please add my email vijaykumar.ravva@gmail.com .

Thanks
Vijay
.

rajsays said...

Very interesting article and very timely , currently we are trying to see how can we use lucene to store in db, you post is great help
thanks
Rajesh Surve

me_amitava said...

Hi Kalani,

Firstly thanks for the wonderful post. It's an amazing piece of work which has helped a newbie like me to understand lucene. I just have a small query. I want to store the lucene indices in a local file system. could you please provide any idea as to how could that be done?

Thanks & Regards,

Amitava

kalani Ruwanpathirana said...

@Amitava: Thanks for the comment. Storing Lucene index in the file system is an easy thing. You just have to remove all database specific code and use "FSDirectory" instead of JDBCDirectory. Hope this will help.

me_amitava said...

Hi Kalani,

Thanks for your reply. Firstly,I am using Lucene 3.01. I had tried storing the data in local file system using FSDirectory. but the problem is I am having realtime data and the older Lucene indices are getting deleted from the local directory that I am using. Only the latest ones persist. I am putting down my code here.

public class LucenePOC {

StandardAnalyzer analyzer;
MMapDirectory index;
IndexWriter w;

public void createIndex(JSONObject userInfo) throws IOException,ParseException{
String text = null;
long userid = 0;
String creation_date = null;
JSONObject userdetails = new JSONObject();
String user_screen_name = null;
String profile_image_url= null;
String location= null;
String language = null;

File lucenefile = new File("C:/lucene");
lucenefile.createNewFile();

try{

analyzer = new StandardAnalyzer(Version.LUCENE_30);

index = new MMapDirectory(lucenefile.getAbsoluteFile());

w = new IndexWriter(index, analyzer, false,IndexWriter.MaxFieldLength.UNLIMITED);

userdetails = userInfo.getJSONObject("user");

if(userInfo.has("text") && userInfo.getString("text")!= null && userInfo.getString("text")!= "" ){
text = userInfo.getString("text");
addDoc(w, text);
}
if(userInfo.has("id")&& !userInfo.isNull("id")){
userid = userInfo.getLong("id");
addDoc(w,String.valueOf(userid));
}
if(userInfo.has("created_at") && (!userInfo.isNull("created_at"))){
creation_date = userInfo.getString("created_at");
addDoc(w,creation_date);
}
if((userdetails.has("screen_name")) && (!userdetails.isNull("screen_name")) ){

user_screen_name = userdetails.getString("screen_name");
addDoc(w,user_screen_name);
}
if((userdetails.has("profile_image_url")) && (!userdetails.isNull("profile_image_url")) ){
profile_image_url = userdetails.getString("profile_image_url");
addDoc(w,profile_image_url);
}
if((userdetails.has("location")) && (!userdetails.isNull("location")) ){
location = userdetails.getString("location");
addDoc(w,location);
}
if((userdetails.has("lang")) && (!userdetails.isNull("lang")) ){
language = userdetails.getString("lang");
addDoc(w,language);
}
w.close();

}catch (Exception e) {
System.out.println("Exception in createIndex of TwitterLucenePOC "+e);
}

}

private static void addDoc(IndexWriter w, String value) throws IOException {
Document doc = new Document();
doc.add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED));
w.addDocument(doc);
}

}

Is there anything wrong that I am doing? Or am I missing something crucial? Any idea or suggestion on this would be of great help.

Thanks & Regards,

Amitava.

kalani Ruwanpathirana said...

Hi, I roughly went through the code and was wondering whether you are calling createIndex method, every time you are indexing a record. There is "index creation" code inside that method. If you are calling it over and over, the index will be recreated and it will loose old information I guess.

As I understand indexing code should be separated from index creating code and index creation should be done only once.

me_amitava said...

Hi Kalani,

Thanks for your reply. I am invoking the createIndex method every time I get a new set of JSON data. So as per your observation, every time I invoke the createIndex method, the older data which had been indexed previously are replaced by newly indexed data. So how can this issue be resolved, since I am getting real time data, which needs to be indexed? Also as you had suggested "indexing code should be separated from index creating code and index creation should be done only once" Can you please clarify on this?

kalani Ruwanpathirana said...

May be I am not much familiar with your scenario but I think you can provide two methods for that instead of one.

createIndex method should include this code (index creating code)

File lucenefile = new File("C:/lucene");
lucenefile.createNewFile();

try{

analyzer = new StandardAnalyzer(Version.LUCENE_30);

index = new MMapDirectory(lucenefile.getAbsoluteFile());
}

put the remaining code(indexing code)to another method and you can call that method whenever you gets a new set of JSON data. createIndex method should be called only once.

me_amitava said...

Hi Kalani,

Thanks for your inputs. I guess I've got it where I had gone wrong. I'll let you know once I have implementd your suggestion.

Thanks & Regards,
Amitava

me_amitava said...

Hi Kalani,

thanks for your suggestion. It worked for me.

Regards,
Amitava

kalani Ruwanpathirana said...

You are welcome. Happy to hear that.

me_amitava said...

Hi Kalani,

bothering you once more with my silly questions. Do you have any idea as to how we can search in lucene for indexes stored within a particular date range? For example, I want to search for indexes stored from April 5 to April 10. I have seen the "TermRangeQuery" class, but do not have any idea as to how to use it. Any suggestions would be of great help.

Thanks & Regards,
Amitava.

kalani Ruwanpathirana said...

You may get an idea from this thread. I haven't worked with that but seems it is just another simple query type.

db said...

Very interesting post.
My concerns are related to performance. Would storing in non-relational DB, like Mongo, improve performance?

Deeps M said...

Hi Kalani,

I am maintaining code of an existing product. This happens to refer lucene-db.jar. But I am not able to find this jar anywhere. Would you know from where I can obtain this jar.

Your inputs would be appreciated.

Regards,
Deepali

kalani Ruwanpathirana said...

Hi Deeps,

Sorry, I am not aware of a lucene-db.jar. If there is such jar then it should be available here I guess.

Mahesh said...

Mahesh said...
Hello kalani i created the indexing and searching mechanism as mentioned by you above but when i m trying to fetch the records for any id i m getting the data related to the first id only.
According to you what may the problem and solution to that problem...

kalani Ruwanpathirana said...

Hi Mahesh, I didn't get your question. Did indexing went ok? Then you are searching on which field?

ounich said...

Hi Kalani,
I'm using lucene for indexing arabic text.
can you help me about using Tika for parsing different(pdf, word,Exel)arabic documents
Thank youuuu

ounich said...

Hi Kalani,

your solutions are usefull, thanks a lot and excuse me for my English,

I'm interressed by lucene but for indexing arabic text, the problem is hox extract arabic text from pdf file and xml file,

Any idea about indexing data base in arabic??

thanks

kalani Ruwanpathirana said...

Hi ounich,

Sorry, I haven't worked with Arabic text but regarding the text extracting did you try PDFBox? I am not sure though.

ounich said...

Hi Kalani,
thanks for your answer, T tried with Tika witch including pdfbox, poi...
it's usefull for MicrosoftOffice document and html document but not for pdf, xml, rtf,
thanks a lot

ounich said...

Hi kilani,
thanks a lot for your solutions about indexing database with lucene,
I tried to exucute query from console but no suceess,
do you have any idea??
thanks a lot

kalani Ruwanpathirana said...

Hi Ounich,

Are you going to Lucene index a database. Would you be able to let me know what you did and what error are you getting?

rleuthold said...

Thank's for the article. Is this possible with a Derby DB as well? How would the connection look like?

ounich said...

hi rleuthold
Sorry i worked with sql database

monika said...

Hi
I am a final year student and I am trying to implement resume parser that will scan the resume(.docx and.pdf) and store relevant informayion in a database using java/j2ee.please help me implement this.My email id is mnksinha70@gmail.com
Thanks in advance.

monika said...

Hi
I am a final year student and I am trying to implement resume parser that will scan the resume(.docx and.pdf) and store relevant informayion in a database using java/j2ee.please help me implement this.My email id is mnksinha70@gmail.com
Thanks in advance.

monika said...

Hi.I am a final year student and trying to implement resume parser that will scan resume(.docx and .pdf) and store relevant information in database.I am trying to implement this using java and j2ee but not getting how to do.Please help me implement this feature.
Thanks
Reply me on mnksinha70@gmail.com

Udaya Kumar said...
This comment has been removed by the author.
Udaya Kumar said...

Hi

I’m using lucene version 4.0.0 and compass 2.2.0 to store lucene index in Mysql.

Am getting the following error and am stuck with that.

Exception in thread “main” java.lang.NoSuchMethodError: org.apache.lucene.store.IndexInput: method ()V not found
at org.apache.lucene.store.jdbc.handler.NoOpFileEntryHandler$NoOpIndexInput.(NoOpFileEntryHandler.java:34)
at org.apache.lucene.store.jdbc.handler.NoOpFileEntryHandler$NoOpIndexInput.(NoOpFileEntryHandler.java:34)
at org.apache.lucene.store.jdbc.handler.NoOpFileEntryHandler.(NoOpFileEntryHandler.java:86)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Unknown Source)
at org.apache.lucene.store.jdbc.JdbcFileEntrySettings.getSettingAsClass(JdbcFileEntrySettings.java:175)
at org.apache.lucene.store.jdbc.JdbcFileEntrySettings.getSettingAsClass(JdbcFileEntrySettings.java:157)
at org.apache.lucene.store.jdbc.JdbcDirectory.initialize(JdbcDirectory.java:173)
at org.apache.lucene.store.jdbc.JdbcDirectory.(JdbcDirectory.java:122)
at CreateIndexToDB.createIndex(CreateIndexToDB.java:38)
at CreateIndexToDB.main(CreateIndexToDB.java:87)

And also can you tell me which will perform better.
Storing index in FS or DB?

Thanks in advance

ounich said...

hi
excuse me, I didot use Mysql with lucene. I used some indexes created by Lucene. in my opinion, you should verify this problem with the version 3.0.2

Mugeesh Husain said...

Dear Kalani Ruwanpathirana,

i want to create lucene index from my mysql db value then want to search.. please help me out this difficulties i am new one in lucene..

Related Posts with Thumbnails