Chapter 15. Hibernate Full Text Search


Lucene is a popular full text search engine. You can use it to index documents, websites or arbitrary other data, and search the index through an API. Hibernate Search integrates Lucene with Hibernate. Entities can be indexed easily, and with a special session you can perform a full text search for your entities. Many databases already provide their own mechanism for full text search, but those solutions are not portable across databases, and Lucene is probably more powerful and flexible than the proprietary solutions.

Let’s have a first look at a code sample before we go into more detail. You can find the full source code in the project LuceneSearch.

First, we need to configure Hibernate Search. If we use the AnnotationConfiguration to build the Hibernate session factory, only one property has to be defined in the hibernate.cfg.xml: the location where the Lucene index should be stored.

<property name="hibernate.search.default.indexBase">
         /tmp
</property>

If you use a normal Configuration to build the session factory, you have to configure a couple of event listeners. Have a look at the reference documentation of Hibernate Search for more details. As it is possible to use XML mappings with an AnnotationConfiguration as well, I propose to use this kind of configuration even with XML-only mappings.

SessionFactory factory = new AnnotationConfiguration().configure().buildSessionFactory();

In the next step, the entities which should be searched have to be annotated.

@Entity
@Indexed
public class Article {
   @Id
   @DocumentId
   @GeneratedValue(strategy = GenerationType.AUTO)
   private Integer id;

   @Field(index = Index.UN_TOKENIZED, store = Store.YES)
   private String title;

   @Field
   private String content;
   // getters and setters omitted
}

@Indexed marks an entity to be indexed. @DocumentId is required and defines how a document is identified in the Lucene index. @Field specifies that a field should be indexed. In the sample you can find two different settings for @Field.

@Field with default values

The content is indexed using the standard analyser. This analyser splits the text into words, transforms them to lower case, removes characters like ;.' and removes a couple of very frequent English words like a, is, in.

Only the indexed terms are stored in the Lucene index, not the original content itself. If you use a tool like Luke to have a look at your Lucene index, you cannot see the original content.
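To make the analysis steps more tangible, here is a minimal sketch in plain Java. It is not Lucene's StandardAnalyzer, only a simplified imitation of the steps described above. The class name SimpleAnalyzerSketch and the short stop word list are assumptions made for this illustration; Lucene's real rules and word list differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified illustration only; this is NOT Lucene's StandardAnalyzer.
final class SimpleAnalyzerSketch {

    // A tiny, hypothetical stop word list; the real one is longer.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "in", "of", "and"));

    // Turns a text into the list of terms that would end up in the index.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Split on everything that is not a letter or digit; this drops ; . ' as well.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            String term = token.toLowerCase(Locale.ROOT);
            // Skip empty tokens and very frequent words.
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [hibernate, popular, framework, index, stored, tmp]
        System.out.println(analyze(
                "Hibernate is a popular framework; the index is stored in /tmp."));
    }
}
```

The original sentence cannot be rebuilt from the resulting terms, which is why a field indexed this way without being stored does not show its original text in the index.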

@Field(index = Index.UN_TOKENIZED, store = Store.YES)

The title is not tokenized or transformed.

As a consequence, you cannot search for individual words of the title, but you can search for the precise title or do a wildcard search, for example to find all titles starting with Foo.

In contrast to the field content, the title is stored in the index (store = Store.YES) and you can see it if you browse the index using Luke. I will tell you more about Luke at the end.
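The practical difference between the two @Field settings can be sketched in plain Java. This is a rough model under simplifying assumptions, not the way Lucene stores terms internally. The class FieldIndexingSketch and its methods are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Rough model of tokenized versus untokenized fields; not Lucene code.
final class FieldIndexingSketch {

    // Untokenized: the whole value becomes a single, unchanged term.
    static List<String> untokenizedTerms(String value) {
        List<String> terms = new ArrayList<>();
        terms.add(value);
        return terms;
    }

    // Tokenized (simplified): every word becomes its own lower case term.
    static List<String> tokenizedTerms(String value) {
        List<String> terms = new ArrayList<>();
        for (String word : value.split("\\s+")) {
            if (!word.isEmpty()) {
                terms.add(word.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Precise search: the query has to equal a term.
    static boolean matchesExact(List<String> terms, String query) {
        return terms.contains(query);
    }

    // Wildcard search of the form prefix*: a term has to start with the prefix.
    static boolean matchesPrefix(List<String> terms, String prefix) {
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> title = untokenizedTerms("About Hibernate");
        System.out.println(matchesExact(title, "Hibernate"));       // false
        System.out.println(matchesExact(title, "About Hibernate")); // true
        System.out.println(matchesPrefix(title, "About"));          // true

        List<String> content = tokenizedTerms("About Hibernate");
        System.out.println(matchesExact(content, "hibernate"));     // true
    }
}
```

A single word finds the tokenized value but not the untokenized one, which only matches the complete title or a prefix of it.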

So, our entity is indexed and we can start to do full text searches in Hibernate. A Lucene search consists of three steps:

Creating a search session
Creating a Lucene query
Executing the query

Here is a code sample:

Session session = SessionFactoryUtil.getFactory().getCurrentSession();
// create a full text session
FullTextSession fSession = Search.getFullTextSession(session);
fSession.beginTransaction();
// create a Lucene query with a parser; "title" is the default field
QueryParser parser = new QueryParser("title", new StandardAnalyzer());
Query luceneQuery = null;
try {
   // search the field content for the term hibernate
   luceneQuery = parser.parse("content:hibernate");
} catch (ParseException e) {
   throw new RuntimeException("Cannot search with query string", e);
}
// execute the query
List<Article> articles = fSession.createFullTextQuery(luceneQuery, Article.class)
   .list();
for (Article article : articles) {
   System.out.println(article);
}
fSession.getTransaction().commit();

A search session is created from an open Hibernate session. Basically, it is just a wrapper adding the search-specific methods to the session. We use a StandardAnalyzer to analyse the search string, which is the same analyser that was used to index the content field. Finally, we execute the full text query.

The field title was not tokenized, so a search on the title needs a different approach. You can use a precise search:

List<Article> articles = fSession.createFullTextQuery(
   new TermQuery(new Term("title", "About Hibernate")), Article.class).list();
for (Article article : articles) {
   System.out.println(article);
}

or a wildcard search:

List<Article> articles = fSession.createFullTextQuery(
   new WildcardQuery(new Term("title", "About*")), Article.class).list();
for (Article article : articles) {
   System.out.println(article);
}

You can adapt the indexing and the analysis of the search string to your needs. For example, we could specify that the indexing of the field title goes through a toLowerCase filter. Emmanuel Bernard demonstrated a couple of new features at the Devoxx conference. You can use word stemming (run, runner, running) to find words with the same stem, phonetic searches with the Soundex or Metaphone algorithm to find words that sound similar, or approximate searches with n-gram search. I will cover more of these approaches in the next updates.
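To give an idea of why n-grams help with approximate searches, here is a small plain Java sketch. It only illustrates the general idea of comparing character n-grams and makes no claim about how Lucene or Hibernate Search implement it. The class NGramSketch and the Jaccard-style similarity measure are assumptions chosen for this example.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Illustration of the n-gram idea; not Lucene's n-gram implementation.
final class NGramSketch {

    // Returns all character sequences of length n contained in the word.
    static Set<String> ngrams(String word, int n) {
        Set<String> grams = new LinkedHashSet<>();
        String lower = word.toLowerCase(Locale.ROOT);
        for (int i = 0; i + n <= lower.length(); i++) {
            grams.add(lower.substring(i, i + n));
        }
        return grams;
    }

    // Shared n-grams divided by all distinct n-grams of both words (0.0 to 1.0).
    static double similarity(String a, String b, int n) {
        Set<String> gramsA = ngrams(a, n);
        Set<String> gramsB = ngrams(b, n);
        Set<String> union = new HashSet<>(gramsA);
        union.addAll(gramsB);
        if (union.isEmpty()) {
            return 0.0; // both words are shorter than n
        }
        Set<String> shared = new HashSet<>(gramsA);
        shared.retainAll(gramsB);
        return (double) shared.size() / union.size();
    }

    public static void main(String[] args) {
        // prints [hib, ibe, ber, ern, rna, nat, ate]
        System.out.println(ngrams("hibernate", 3));
        // A misspelled word still shares several trigrams with the correct one.
        System.out.println(similarity("hibernate", "hibernte", 3));
        // Unrelated words share none.
        System.out.println(similarity("hibernate", "lucene", 3));
    }
}
```

A misspelled search term shares most of its n-grams with the correctly spelled word, so an index built on n-grams can still find a close match.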

Luke lets you browse your index and perform searches on it. It is very useful for testing your searches or debugging a problem.

http://www.getopt.org/luke/