Full text search in Rails with Sunspot and Solr

[Image: The book you should get to dig deeper into Solr]

A PDF version of this tutorial is also available.

Full source code for this tutorial is available at GitHub.

Everyone wants their database to run as fast as possible. The usual advice is to query less, add more caching and add indexes to the columns being searched, but another option is to skip the database entirely and use a tool better suited to your querying needs.

When querying for text in our databases, we’re often doing “LIKE” searches. LIKE searches only perform well if the column is indexed and the query is written in a way that lets the index be used. Imagine that you have a “name” column that contains the text “Battlestar Galactica”. This query would be able to run and use the index:

SELECT p.* FROM products p WHERE p.name LIKE 'Battlestar%'

The database would be able to optimize this query and use the index to find the expected row. But, what if the query was like this one:

SELECT p.* FROM products p WHERE p.name LIKE '%Galactica'

[Image: Your DBA getting ready to hit you]

Database indexes usually match from left to right, so unless you have a nasty trick up your sleeve, this query will look at ALL the rows in the products table and perform a match on every “name” column before returning a result. And that’s Really Bad News for you, as the DBA will probably come for you holding a Morning Star to beat you badly. So querying with “LIKE” when what you need is full text search isn’t nice.
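To make the difference concrete, here is a rough Ruby sketch that treats a sorted array as a stand-in for a database index. It is a toy model only (real indexes are B-trees or similar, and the product names are made up for the example), but it shows why a prefix can be located with a binary search while a suffix forces a look at every entry.

```ruby
# Toy model of a database index: product names kept in sorted order.
# Illustration only, not how any real database stores its indexes.
NAMES = [
  "Battlestar Galactica",
  "Battlestar Galactica: The Boardgame",
  "Caprica",
  "Firefly"
].sort.freeze

# LIKE 'prefix%': binary search jumps to the first candidate, then we walk
# forward only while entries still start with the prefix.
def prefix_search(sorted_names, prefix)
  start = sorted_names.bsearch_index { |name| name >= prefix }
  return [] if start.nil?
  sorted_names[start..-1].take_while { |name| name.start_with?(prefix) }
end

# LIKE '%suffix': the sort order says nothing about how a name ends,
# so every single entry has to be inspected.
def suffix_search(names, suffix)
  names.select { |name| name.end_with?(suffix) }
end

puts prefix_search(NAMES, "Battlestar").inspect
puts suffix_search(NAMES, "Galactica").inspect
```

Both calls find their rows here, but only the first one gets any help from the ordering.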

That’s where full text search solutions come in. Tools like Solr let you perform optimized text searches, filter input, categorize results and even offer features like Google’s “Did you mean?”.

In this tutorial you’ll learn how to add full text searching capabilities to your Rails application using Sunspot and Solr. We will also delve a little bit into Solr’s configuration and learn how to use specific tokenizers and filters to clean input, perform partial matching of words and facet results.

This project uses Rails 3 and Ruby 1.9.2. You’ll find a Gemfile and an “.rvmrc” with all dependencies declared, so it should be pretty easy to follow along or set up your environment based on them (if you’re not using RVM, this is a GREAT time to learn it).

You can probably follow this tutorial with an earlier Rails version and without Bundler or RVM, given that all models and most of the code will look exactly the same in Rails 2 and Sunspot is compatible with Rails 2 too.

The source code for this example application is available on GitHub.

Starting the engines

Download the Sunspot source code from GitHub.

Enter the project folder and go to “sunspot/solr-1.3”. Inside it you should see a “solr” folder; copy this folder into your project’s folder. This is where the general Solr configuration is going to live. Don’t worry about these files just yet, we’ll get to them later in this tutorial.

Now create a “sunspot.yml” file under your project’s “config” folder, here’s a sample:

Listing 1 – sunspot.yml

development:
  solr:
    hostname: localhost
    port: 8980
    log_level: INFO
  auto_commit_after_delete_request: true

test:
  solr:
    hostname: localhost
    port: 8981
    log_level: OFF

production:
  solr:
    hostname: localhost
    port: 8982
    log_level: WARNING
  auto_commit_after_request: true  

You can have different configurations for every environment you’re running. To see all configuration options, go to the Sunspot source code and head to the “sunspot_rails/lib/sunspot/rails/configuration.rb” file.
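If you want to check what a given environment resolves to, a small Ruby sketch like the one below can help. It is not part of Sunspot: “solr_url_for” is a hypothetical helper, the YAML is a trimmed copy of the sample inlined so the script runs on its own, and the “/solr” path is an assumption based on the admin URL used later in this tutorial.

```ruby
require "yaml"

# A trimmed copy of the sunspot.yml sample, inlined so the sketch runs on
# its own; a real app would read config/sunspot.yml instead.
SUNSPOT_YML = <<~YAML
  development:
    solr:
      hostname: localhost
      port: 8980
  test:
    solr:
      hostname: localhost
      port: 8981
YAML

# Hypothetical helper (not part of Sunspot): builds the Solr base URL
# for one environment from the parsed settings.
def solr_url_for(config, environment)
  solr = config.fetch(environment).fetch("solr")
  "http://#{solr.fetch('hostname')}:#{solr.fetch('port')}/solr"
end

config = YAML.load(SUNSPOT_YML)
puts solr_url_for(config, "development")
```

Using “fetch” instead of “[]” makes a missing environment or key fail loudly, which is usually what you want for configuration.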

Now we’ll create two models, Product and Category, so let’s start by creating the migration that will set them up:

rails g migration create_base_tables

Listing 2 – create_base_tables.rb

class CreateBaseTables < ActiveRecord::Migration

  def self.up
    create_table :categories do |t|
      t.string :name, :null => false
    end

    create_table :products do |t|
      t.string  :name, :null => false
      t.decimal :price, :scale => 2, :precision => 16, :null => false
      t.text    :description
      t.integer :category_id, :null => false
    end

    add_index :products, :category_id

  end

  def self.down
    drop_table :categories
    drop_table :products
  end

end

Now we move on to the basic models, starting with the Category model:

Listing 3 – category.rb

class Category < ActiveRecord::Base

  has_many :products

  validates_presence_of :name
  validates_uniqueness_of :name, :allow_blank => true

  searchable :auto_index => true, :auto_remove => true do
    text :name
  end

  def to_s
    self.name
  end

end

Here in the Category class we see our first reference to Sunspot, the “searchable” method, where we configure the fields that should be indexed by Solr. In the Category class there’s only one field that’s useful at this moment, the “name”, so we tell Sunspot to index it as “text” (you usually don’t want your text indexed as “string”, as it will only be a hit on a full match).

The :auto_index and :auto_remove options tell Sunspot to automatically send your model to Solr when it is created or updated and to remove it from the index when it is destroyed. Sunspot’s documentation lists both as enabled by default, so check the version you are using; spelling them out makes the intent explicit. With either one set to “false” you have to send your data to Solr manually, and unless you really want to do that, you should keep both of these values as “true” in your models.

Now let’s look at the Product class:

Listing 4 – product.rb

class Product < ActiveRecord::Base

  belongs_to :category

  validates_presence_of :name, :description, :category_id, :price
  validates_uniqueness_of :name, :allow_blank => true

  searchable :auto_index => true, :auto_remove => true do
    text :name, :boost => 2.0
    text :description
    float :price
    integer :category_id
  end

  def to_s
    self.name
  end

end

In our Product class things are a little bit different: we have more fields (and more kinds of fields) being indexed. “float” and “integer” are pretty self-explanatory, but the “name” field has some black magic floating around in the “boost” parameter. Boosting a field when indexing means that a match in that specific field has more “relevance” than a match found somewhere else.

Imagine that you’re looking for Iron Maiden’s “Powerslave” album. You go to Iron Maiden’s Online Store and search for “powerslave”, hoping that the album will be the first hit, but then you see “Live After Death” before “Powerslave”. Why did it happen? The “Live After Death” album contains the “Powerslave” song in its track listing, so it’s a match as much as the real “Powerslave” album. What we need here is to tell the search tool that if a match is on an album name, it has higher relevance than if the hit is in the track listing.

Boosting allows you to reduce these issues. Some fields are inherently more important than others and you can tell that to Solr by configuring a “:boost” value for them. When something matches on them, the relevance of that match will be improved and it should come up before the other results in search.
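The effect is easy to see with a toy scorer. The Ruby sketch below is not how Solr computes relevance (its real scoring is much more involved); it just counts matches and multiplies name matches by a boost. The “Item” struct, “score” and “search” are names made up for the example.

```ruby
# Toy relevance scoring, only to illustrate the idea of a field boost.
Item = Struct.new(:name, :description)

ITEMS = [
  Item.new("Live After Death", "Live album, includes the song Powerslave"),
  Item.new("Powerslave", "Studio album from 1984")
].freeze

# Number of case-insensitive occurrences of term in text.
def hits(text, term)
  text.downcase.scan(term.downcase).size
end

# Matches in the name count name_boost times; description matches count once.
def score(item, term, name_boost)
  hits(item.name, term) * name_boost + hits(item.description, term)
end

# Best score first; ties keep the original order of the items.
def search(items, term, name_boost)
  items.each_with_index
       .map { |item, i| [item, score(item, term, name_boost), i] }
       .select { |_, s, _| s > 0 }
       .sort_by { |_, s, i| [-s, i] }
       .map { |item, _, _| item.name }
end

puts search(ITEMS, "powerslave", 1.0).inspect # no boost: both score the same
puts search(ITEMS, "powerslave", 2.0).inspect # name boosted: the album wins
```

Without the boost the two items tie and the live album stays on top simply because it comes first; with a boost of 2.0 on the name, the “Powerslave” album moves ahead.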

Searching

Now let’s take a look at the ProductsController to see how we perform the search:  

Listing 4 – products_controller.rb

class ProductsController < ApplicationController

  def index
    @products = if params[:q].blank?
      Product.all :order => 'name ASC'
    else
      Product.solr_search do |s|
        s.keywords params[:q]
      end
    end
  end

end

As you can see, searching is quite simple: you just call the solr_search method and send in the text to be searched for. One thing that I don’t like about Sunspot is that searches do not return an Array-like object. You get a Sunspot::Search::StandardSearch object whose “results” property holds the array of records returned by the search.

Here’s a simple way to fix this issue (I usually place the contents of this file inside an initializer in “config/initializers”):

Listing 5 – sunspot_hack.rb

::Sunspot::Search::StandardSearch.class_eval do

  include Enumerable

  delegate(
    :current_page,
    :per_page,
    :total_entries,
    :total_pages,
    :offset,
    :previous_page,
    :next_page,
    :out_of_bounds?,
    :each,
    :in_groups_of,
    :blank?,
    :[],
    :to => :results)

end

This simple monkeypatch makes the search object itself behave like an Enumerable/Array and you can use it to navigate directly in the results, without having to call the “results” method. The methods usually used by will_paginate helpers are also included so you can pass this object to a will_paginate call in your view and it’s just going to work.
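If the delegation trick is new to you, here is the same idea in plain Ruby, outside of Rails. “FakeSearch” is a hypothetical stand-in for the search object, and it uses the standard library’s Forwardable instead of ActiveSupport’s “delegate”, so treat it as a sketch of the pattern rather than a copy of the patch.

```ruby
require "forwardable"

# Hypothetical stand-in for a search object that wraps an array of results.
class FakeSearch
  include Enumerable
  extend Forwardable

  attr_reader :results

  # Forward the array-like calls to the wrapped results array.
  # Enumerable builds map, select, first and friends on top of each.
  def_delegators :results, :each, :[], :size, :empty?

  def initialize(results)
    @results = results
  end
end

search = FakeSearch.new(["Battlestar Galactica", "Caprica"])
puts search.map(&:upcase).inspect
puts search[0]
```

Once “each” is forwarded, including Enumerable is what gives the wrapper the rest of the collection methods for free.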

Indexing

Now that all the models are in place, we can start fine tuning the Solr indexing process. The first thing to understand is what happens when you send text to be indexed by Solr. Let’s get into the tool by starting the server:

rake sunspot:solr:run

This rake task starts Solr in the foreground (if you wanted to start it in the background, you’d use “sunspot:solr:start”). With Solr running, you should add some data to the database. This tutorial’s project on GitHub contains a “seed.rb” file with some basic data for testing; just copy it into your project.

Also copy “lib/tasks/db.rake” from the tutorial project to your project. It contains a “db:prepare” task that truncates the database, seeds it and then indexes all items in Solr, which is handy because we’re going to be reindexing data a lot.

With everything copied, run the “db:prepare” task:

rake db:prepare

This will add the categories and products to your database and also index them in Solr. If the task ran successfully, head to the Solr administration interface, at this URL:

http://localhost:8980/solr/admin/schema.jsp

Once there, click on “FIELDS” and then on “NAME_TEXT”. You should see a screen just like the one in Image 1:

Image 1 – Solr schema browser

If you don’t see all the fields that are available in this image, your “rake db:prepare” command has probably failed or Solr wasn’t running when you called it.

What we see here is the information about the fields we’re indexing. This specific field contains all data from the name properties from both Category and Product classes, as you can notice from the top 10 terms.

The name field is not indexed by its full content, as a relational database would usually do. Instead, the text is broken into tokens by the solr.StandardTokenizerFactory class in Solr. This class receives our text, like “Battlestar Galactica: The Boardgame”, and turns it into:

[“Battlestar”, “Galactica”, “The”, “Boardgame”]

This is what gets indexed and, ultimately, searched by Solr. If you open the web application now and try to search for “battle”, you won’t have any matches. If you search for “Battlestar”, you get the two products that match the name.
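A rough Ruby approximation of this step is shown below. It simply splits on anything that is not a letter or digit, which is an assumption made for the sketch; the real StandardTokenizer follows more elaborate rules. It is enough to see why a whole token matches and a fragment does not.

```ruby
# Rough approximation of tokenizing: split on anything that is not a
# letter or digit. The real tokenizer is more sophisticated than this.
def tokenize(text)
  text.split(/[^[:alnum:]]+/).reject(&:empty?)
end

tokens = tokenize("Battlestar Galactica: The Boardgame")
puts tokens.inspect
puts tokens.include?("Battlestar") # a whole token
puts tokens.include?("Battle")     # only a fragment of a token
```

The index holds whole tokens, so a lookup only succeeds when the search term equals one of them.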

Everything about indexing information in Solr revolves around building the best “tokens” for your input. You have to teach Solr to crunch your data in a way that makes sense and makes it easy to search, and you do this by adding filters to the indexing process. While on the same page as Image 1 above, click on the “DETAILS” links as shown in Image 2:

Image 2 – Viewing the analysis and search filters

Each field in Solr has two analyzers, one is the “index” analyzer, that prepares the input to be indexed and the other is the “query” analyzer that prepares the search input to finally perform a search. Unless you have some special need, both of them are usually the same.

In our current configuration, both analyzers use the same tokenizer and the same two filters. The tokenizer already drops punctuation while splitting the text (the “:” in “Battlestar Galactica: The Boardgame” is not in our tokens), StandardFilterFactory tidies up the resulting tokens, and LowerCaseFilterFactory lowercases all input, so we can search with “baTTlestar”, “BATTLESTAR” or “BaTtLeStAr” and they’re all going to work.

Before we move on to add more filters to our analyzers, let’s take a look at the analyzer screen in Solr Admin at - http://localhost:8980/solr/admin/analysis.jsp?highlight=on

In this screen we see how our input is going to be transformed into tokens by the configured analyzers.

Image 3 – Solr analyzer page

In this screen we have selected the “name_text” field in Solr. In “Field value (Index)” you enter the values you’re sending to be indexed, just like you would send from your model property, and in “Field value (Query)” you enter the values you’d use to search.

Once you type and hit “Analyze” you should see the output just below the form as we see in Image 3. This output shows how your input is transformed into tokens by the tokenizer and filters, this way you can easily experiment by adding more filters and seeing if the output really matches the way you’d expect it to. This analysis view is your best friend when debugging search/indexing related issues or trying out ways to improve the way Solr indexes and matches your data.

Customizing fields

Now that you have an idea about how the indexing and searching processes work, let’s start to customize the fields in Solr. Open up the “solr/conf/schema.xml” file and look for this definition:

Listing 6 – solr/conf/schema.xml excerpt

<fieldtype class="solr.TextField" positionIncrementGap="100" name="text">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

If you look at Image 1, where we saw the “name_text” configuration, you’ll see that the field type is “text”. The excerpt above is the configuration for all fields of type “text”, which means that if we add more filters here we’ll affect all fields of this type. This greatly simplifies the way we configure the tool, as we don’t have to define explicit configurations for every single field that our models have; we can just reuse this same “text” config for all fields that are supposed to be indexed as text.

But that’s a lot of talking, let’s get into action!

Let’s start the job by looking at our indexed data from before:

[“battlestar”, “galactica”, “the”, “boardgame”]

The “the” is mostly useless, as it’s going to show up in almost every property and no one is ever going to search for “the” (oh yeah, there might be that ONE guy that does it). In Information Retrieval lingo, “the” is a stop word: it usually doesn’t have meaning by itself and doesn’t represent valuable information for our indexer. Removing stop words from your input improves performance and the relevance of your results.

Given that this is a common operation, Solr already contains a filter that’s capable of removing all stop words from your data, the solr.StopFilterFactory, let’s see how we can add it to our config:

Listing 7 – solr/conf/schema.xml excerpt

<fieldtype class="solr.TextField" positionIncrementGap="100" name="text">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.TrimFilterFactory" />
  </analyzer>
</fieldtype>

If you look at the “solr/conf” folder you’ll see a “stopwords.txt” file that already contains most of the common stop words in English. You can add or remove words there as needed, and if you’re not indexing English text you can just remove the English words and add your language’s stop words. Now make this change in your “solr/conf/schema.xml” file, stop and start Solr again and open the analyzer:

Image 4 – Solr analyzer page

As you can see, in the last step the “the” was removed from both the index input and the query input. We’re keeping only the pieces of information that are really useful, which makes our index smaller and also speeds up searching.

While you were not looking, we also added two other filters. solr.ISOLatin1AccentFilterFactory removes accents from words in Latin-based languages, like Portuguese: if the input is “não”, it becomes “nao”. After that there’s solr.TrimFilterFactory, which removes unnecessary spaces from our tokens.
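To get a feel for what such a chain does as a whole, here is a small Ruby imitation of it: tokenize, lowercase, drop stop words, strip accents and trim. Everything in it is an approximation written for this sketch (the stop word list is a tiny inline sample rather than the contents of “stopwords.txt”, and the accent removal relies on Unicode normalization instead of Solr’s filter), so use the analyzer page to see what Solr really produces.

```ruby
# Tiny inline stop word list for the sketch; Solr reads its own from stopwords.txt.
STOP_WORDS = %w[a an and of the].freeze

# Split on anything that is not a letter, combining mark or digit.
def tokenize(text)
  text.split(/[^\p{L}\p{M}\p{N}]+/).reject(&:empty?)
end

# Decompose accented letters and drop the combining marks.
def strip_accents(token)
  token.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

# Apply the steps in the same order as the filters in the listing.
def analyze(text)
  tokenize(text)
    .map(&:downcase)
    .reject { |token| STOP_WORDS.include?(token) }
    .map { |token| strip_accents(token).strip }
end

puts analyze("Battlestar Galactica: The Boardgame").inspect
puts analyze("n\u00E3o").inspect # "n\u00E3o" is "não" written with an escape
```

The order matters: lowercasing before the stop word check is what lets “The” be recognized as “the”.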

Partial matching

Another pretty common need is to be able to match only a part of a word, usually a prefix. In the beginning of the tutorial, we saw that searching for “battle” doesn’t yield any results, while “battlestar” does. This happens because Solr, by default, only sees a match if it’s a full match. The word you entered must be exactly the same as a token that’s available in the index; if there is no exact match, Solr will tell you that there are no results.

If you look at Lucene’s Query Parser Syntax (Solr is, roughly speaking, a web interface to Lucene) you’ll see that you can use the “*” operator to perform a partial match. We could then search for “battle*” and get the results we expect, but this kind of wildcard matching is slow and could become a bottleneck for your application, so we have to figure out another way to do this.

When all you need is prefix matching, the solr.EdgeNGramFilterFactory is your best friend. It breaks words into pieces that are then added to the index, so it looks like you have partial matching, but in fact the partials are tokens by themselves in the index. Let’s see what our config would look like in this case:

Listing 8 – solr/conf/schema.xml excerpt

<fieldtype class="solr.TextField" positionIncrementGap="100" name="text">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.TrimFilterFactory" />
    <filter class="solr.EdgeNGramFilterFactory"
      minGramSize="3"
      maxGramSize="30"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.TrimFilterFactory" />
  </analyzer>
</fieldtype>

As you can see, we now have two <analyzer> sections in our <fieldtype>: one of the analyzers is for “index” and the other is for “query”. This is needed because we don’t want our search parameters to be transformed for a partial match. If the user is searching for “battle”, it doesn’t make sense to show them results for “bat”, so the generation of pieces of each word should be done only when indexing information.

Now restart your Solr instance and submit the form in the analyzer view again. You should see something like Image 5:

Image 5 – Analyzer output with partial matching enabled

Looking at the output, “battlestar” became:

[“bat”, “batt”, “battl”, “battle”, “battles”, “battlest”, “battlesta”, “battlestar”]

Now, if you search for “battle”, you should find all products that have “battle” as a prefix in any of their words and the search input is not affected by this change.
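The expansion itself is simple enough to reproduce in a few lines of Ruby. The sketch below assumes the same minimum and maximum sizes as the listing (3 and 30); “edge_ngrams” and “index_tokens” are names invented for the example, not Solr or Sunspot APIs.

```ruby
# Prefixes of a token from min_size up to max_size characters,
# mirroring the minGramSize/maxGramSize settings of the filter.
def edge_ngrams(token, min_size = 3, max_size = 30)
  upper = [token.length, max_size].min
  (min_size..upper).map { |size| token[0, size] }
end

# Index side: every token is expanded into its prefixes.
def index_tokens(tokens)
  tokens.flat_map { |token| edge_ngrams(token) }
end

indexed = index_tokens(%w[battlestar galactica])
puts edge_ngrams("battlestar").inspect

# Query side: the search term is left whole and must equal an indexed token.
puts indexed.include?("battle") # a prefix of "battlestar"
puts indexed.include?("attle")  # not a prefix of any token
```

Because only prefixes are generated, a term taken from the middle of a word still finds nothing, and the index grows with the length of each word, which is the price paid for fast lookups.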

Faceting

Faceting of results is YACF (Yet Another Cool Feature) that you get when using Solr and Sunspot. “What does that mean?”, you might ask. It means that Solr is able to organize your results based on one of their properties and tell you how many results matched each property value.

“I still don’t get it”, you might be thinking now. In our Product model we’re indexing the “category_id” property, we’ll tell Sunspot to facet our search based on the “category_id” field and Sunspot will tell us how many matches each category had, even if we’re paginating the results. Let’s see how our searching code would change:  

Listing 9 – products_controller.rb excerpt

def index
  @page = (params[:page] || 1).to_i
  @products = if params[:q].blank?
    Product.paginate :order => 'name ASC', :per_page => 3, :page => @page
  else
    result = Product.solr_search do |s|
      s.keywords params[:q]
      if params[:category_id].blank?
        # No category chosen yet: ask Solr for per-category counts
        s.facet :category_id
      else
        # A category was chosen: restrict the search to it
        s.with( :category_id ).equal_to( params[:category_id].to_i )
      end
      s.paginate :per_page => 3, :page => @page
    end

    if result.facet( :category_id )
      @facet_rows = result.facet(:category_id).rows
    end

    result
  end
end

The search code changed quite a lot. Now, if there’s a “category_id” parameter we use it to filter our search, and if there isn’t we perform faceting with the “s.facet :category_id” call. There’s also a slight change to the “product.rb” class, let’s see it:

Listing 10 – product.rb excerpt

searchable :auto_index => true, :auto_remove => true do
  text :name, :boost => 2.0
  text :description
  float :price
  integer :category_id, :references => ::Category
end

We’ve added “:references => ::Category” to the “:category_id” field configuration so Sunspot knows that this field is, in fact, a foreign key to another object. This allows Sunspot to load the categories in the facets automatically for you.

The “result.facet(:category_id)” call asks the search object for the facets returned for the :category_id field in this search. Each row in this list contains an “instance” (which, in our case, is a Category object) and a “count”, the number of hits in that specific facet. Once you get your hands on the rows, you can use them in your view. Let’s see how we used them:

Listing 11 – products/index.html.haml excerpt

- if !@facet_rows.blank? && @facet_rows.size > 1
  %ul
    - for row in @facet_rows
      %li= link_to( "#{row.instance} (#{row.count})", products_path( :q => params[:q], :category_id => row.instance ) )

If there are facets available, we use them to add links that let the user filter by each specific facet. Each row object has an instance and a count, and we use both in the interface to tell the user which category it is and how many hits it had. Here’s what our user interface looks like:

Image 6 – Faceting information
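If faceting still feels abstract, the following Ruby sketch mimics it on an in-memory list: it counts matches per category over the whole result set and then returns a single page of results next to those counts. “Hit” and “facet_and_paginate” are made up for the illustration; in the real application Solr computes both for you.

```ruby
# Toy faceting: count matches per category over the full result set,
# then return just one page of results alongside those counts.
Hit = Struct.new(:name, :category_id)

def facet_and_paginate(hits, page, per_page)
  counts = hits.group_by(&:category_id)
               .map { |category_id, group| [category_id, group.size] }
               .to_h
  offset = (page - 1) * per_page
  { :facets => counts, :page => hits[offset, per_page] || [] }
end

hits = [
  Hit.new("Battlestar Galactica", 1),
  Hit.new("Battlestar Galactica: The Boardgame", 2),
  Hit.new("Battlestar Galactica: Razor", 1),
  Hit.new("Battlestar Galactica: The Plan", 1)
]

result = facet_and_paginate(hits, 1, 3)
puts result[:facets].inspect
puts result[:page].map(&:name).inspect
```

Note that the counts cover all four hits even though the page only holds three of them, which is exactly the “even if we’re paginating” behaviour described above.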

And now you finally have search functionality added to a Rails project, with partial matching, faceting, pagination and input cleanup. Just forget that you have ever performed a “SELECT p.* FROM products p WHERE p.name LIKE '%battle%'” and be happy to be using a great full text search solution.

Conclusion

Hopefully this tutorial is enough to get you up and running with Solr. For more advanced features I’d recommend searching the Solr wiki and buying “Solr 1.4 – Enterprise Search Server” by David Smiley and Eric Pugh.

Comments or questions? Ping me on Twitter!