#343 Full-Text Search in PostgreSQL pro
In past episodes we’ve covered many solutions for full-text searching, including Thinking Sphinx, Sunspot and ElasticSearch. All of these require an external search engine, separate from the database, but if we use Postgres for our application’s database we can take advantage of its built-in full-text search capability. There are many advantages to keeping the full-text search inside the primary database and we’ll mention some of these as we show you how this works in this episode. The application we’ll be working with in this episode is a simple blogging application that shows a number of articles. We’ve already implemented some of this application’s features, including pagination and a simple search form, similar to the one we demonstrated in episode 37, that finds articles whose name or content contains the search term.
First we’ll walk through the source code for this application, starting with the view template for the page above.
<h1>Articles</h1>

<%= form_tag articles_path, method: :get do %>
  <p>
    <%= text_field_tag :query, params[:query] %>
    <%= submit_tag "Search", name: nil %>
  </p>
<% end %>

<div id="articles">
  <% @articles.each do |article| %>
    <h2>
      <%= link_to article.name, article %>
      <span class="comments">(<%= pluralize(article.comments.size, 'comment') %>)</span>
    </h2>
    <div class="info">
      by <%= article.author.name %>
      on <%= article.published_at.strftime('%b %d, %Y') %>
    </div>
    <div class="content"><%= article.content %></div>
  <% end %>
</div>

<%= will_paginate @articles %>
This template contains the code that renders the list of articles and also the simple search form. When the form is submitted the search term is passed as a query parameter to the articles path, which is handled by the ArticlesController’s index action. Let’s look at that next.
def index
  @articles = Article.text_search(params[:query]).page(params[:page]).per_page(3)
end
Here we fetch the articles we want by calling a custom text_search class method on the Article model and passing the query parameter to it, along with the current page number and number of articles per page. The text_search method looks like this:
def self.text_search(query)
  if query.present?
    where("name ilike :q or content ilike :q", q: "%#{query}%")
  else
    scoped
  end
end
Here we check to see if the query parameter is present and, if it is, we find the articles that match that query using a like comparison (actually ilike so that the search is case-insensitive). We compare the articles’ name and content columns to the query, which is wrapped in percent signs so that the search term is matched anywhere in those fields.
Using Full-Text Searching
This is a poor excuse for full-text searching so let’s take advantage of Postgres’s full-text search feature to improve it. One problem with our current approach is that we’re treating the search term as a single string, so if we search for “superman character” no articles are returned even though we have articles that contain both those words, just not together. Changing from a like condition to a full-text search with Postgres is quite easy. All we have to do is replace ilike with double at signs (@@) and remove the percent signs from the search term like this:
def self.text_search(query)
  if query.present?
    where("name @@ :q or content @@ :q", q: query)
  else
    scoped
  end
end
When we search for “superman character” now we’ll see all the articles that contain both of those words.
By default the full-text search uses an English dictionary with stemming so if we search for “superman characters” (note the extra “s”) we get the same set of results, even though the word “characters” isn’t in the articles. Stop words are also supported which means that common English words will be ignored. If we search for “superman of dc comics”, this will return the “Superman” article even though it doesn’t contain the word “of”.
One of the best reasons for keeping full-text searching in the database is that it works seamlessly with all other SQL clauses. For example our pagination still works without any adjustment from us. The text_search method is still just a simple ActiveRecord scope that we can add other clauses to, such as page and per_page for pagination. Another good reason for using this approach is that everything stays in sync automatically: if we were to create a new article, or edit an article, that content is instantly available in the full-text search.
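Because text_search returns an ordinary ActiveRecord relation, any other query methods can be chained onto it. As a rough sketch (these particular clauses are hypothetical examples rather than code from the application):

# Hypothetical examples: the search scope composes with other relation methods.
Article.text_search("superman").where("published_at > ?", 1.month.ago)
Article.text_search("superman").order("name").limit(10)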
How Full-Text Searching Works
We’ve shown how powerful full-text searching is, but what’s going on when we use it and how can we get the most from it? We’ll use the database console to show what goes on when we make a full-text search. We can use the rails db command to enter the Postgres console and we’ll use the SELECT command with some static text to experiment with various operators and functions.
noonoo:blog eifion$ rails db
psql (9.1.2)
Type "help" for help.

blog_development=# select 'ninja turtles' @@ 'turtles';
 ?column?
----------
 t
(1 row)
The t we get as a response stands for true. Searching for turtle here will return true as well because of stemming, while searching for something else entirely such as green will return false. There’s some typecasting going on behind the scenes here. The phrase on the left side of the @@ is being converted to a tsvector while the phrase on the right is converted to a plain tsquery. We can mimic this by calling the relevant functions.
blog_development=# select to_tsvector('ninja turtles') @@ plainto_tsquery('turtle');
 ?column?
----------
 t
(1 row)
This isn’t the full story, however. We can pass a dictionary as a first argument to these functions, like this:
blog_development=# select to_tsvector('english', 'ninja turtles') @@ plainto_tsquery('english', 'turtle');
 ?column?
----------
 t
(1 row)
English is the default dictionary so this will give us the same results as before. This means that we can change the language or do a more literal search by passing in simple instead of english.
blog_development=# select to_tsvector('simple', 'ninja turtles') @@ plainto_tsquery('simple', 'turtle');
 ?column?
----------
 f
(1 row)
This time we get a false response as we’re searching for turtle, not turtles.
You might be wondering what the difference between a plain tsquery and a normal one is. If we use to_tsquery instead of plainto_tsquery we’ll get the same results as we only have a single word in our query. If we enter multiple search words, though, we’ll get a syntax error.
blog_development=# select to_tsvector('simple', 'ninja turtles') @@ to_tsquery('simple', 'ninja turtles');
ERROR: syntax error in tsquery: "ninja turtles"
A normal tsquery expects boolean operators between each word, so we can use & to search for ninja and turtles, or a pipe to search for ninja or turtles.
blog_development=# select to_tsvector('simple', 'ninja turtles') @@ to_tsquery('simple', 'ninja | turtles');
 ?column?
----------
 t
(1 row)
To exclude a word we use an exclamation mark, so ninja & !turtles would match text that contains ninja but not turtles. If we’re going to use to_tsquery like this we’ll need to translate whatever the user enters as a search term and use the relevant operators.
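As a rough sketch of that translation (the helper below is hypothetical and not part of the application), we could join the words the user types with the & operator before handing them to to_tsquery:

# Hypothetical helper: "superman character" becomes "superman & character".
def self.to_and_tsquery(query)
  query.split.join(" & ")
end

# It could then be used in a condition such as:
# where("to_tsvector('english', content) @@ to_tsquery('english', :q)", q: to_and_tsquery(query))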
When performing a full-text search we often want to sort the results by relevance. We can do this by using the ts_rank function, which returns a float value that we can use for ranking.
blog_development=# select ts_rank(to_tsvector('ninja turtles'), to_tsquery('turtles'));
  ts_rank
-----------
 0.0607927
(1 row)
Now that we know how to sort by relevance let’s apply this to our application. We’ll add some code to the text_search method in the Article model that sets a rank variable to the sum of the ranks for the name and content fields. We can then order the results by that rank in descending order.
def self.text_search(query)
  if query.present?
    rank = <<-RANK
      ts_rank(to_tsvector(name), plainto_tsquery(#{sanitize(query)})) +
      ts_rank(to_tsvector(content), plainto_tsquery(#{sanitize(query)}))
    RANK
    where("name @@ :q or content @@ :q", q: query).order("#{rank} desc")
  else
    scoped
  end
end
When we search for “dc comics” now the Superman article will appear at the top of the results as that’s the article that contains that phrase most frequently.
Texticle and PgSearch
The text_search method is starting to get a little messy and it will only get worse if we search against more columns. Fortunately there’s a gem that can help us here. Texticle makes it easier to do full-text searching in Postgres. To use it we just need to add it to the application’s Gemfile and then run bundle to install it.
gem 'texticle', require: 'texticle/rails'
Now we can clean up the text_search method in our model like this:
def self.text_search(query)
  if query.present?
    search(query)
  else
    scoped
  end
end
This will automatically search all text-based columns on the table and do much the same as we had before. If we search for “dc comics” again now and look at the log file we’ll see the SQL query that was made.
Article Load (2.5ms)  SELECT "articles".*, ts_rank(to_tsvector('english', "articles"."name"::text), to_tsquery('english', 'dc\ comics'::text)) + ts_rank(to_tsvector('english', "articles"."content"::text), to_tsquery('english', 'dc\ comics'::text)) AS "rank0.9491099641574605" FROM "articles" WHERE (to_tsvector('english', "articles"."name"::text) @@ to_tsquery('english', 'dc\ comics'::text) OR to_tsvector('english', "articles"."content"::text) @@ to_tsquery('english', 'dc\ comics'::text)) ORDER BY "rank0.9491099641574605" DESC LIMIT 3 OFFSET 0
  (1.2ms)  SELECT COUNT(*) FROM "articles" WHERE (to_tsvector('english', "articles"."name"::text) @@ to_tsquery('english', 'dc\ comics'::text) OR to_tsvector('english', "articles"."content"::text) @@ to_tsquery('english', 'dc\ comics'::text))
This query is similar to the manual one we had, using ts_rank to generate a rank and @@ to compare a tsvector and a tsquery.
Texticle is great for a simple full-text search like this, but if we want a more customizable solution then another gem called PgSearch is better. This offers a couple of ways to do full-text searching in Postgres. One option, called multi-search, will search multiple models at once. The other option is called pg_search_scope and is used for searching a single model. We’ll use this second option in our application. First we’ll replace the texticle gem in our Gemfile with pg_search and then run bundle again to install it.
# gem 'texticle', require: 'texticle/rails'
gem 'pg_search'
Now in our Article model we need to include PgSearch. We can then use the pg_search_scope method. We need to pass in a name for our scope and we’ll call ours search. We can then use the against option to specify the columns that we want to search. This will generate a search method which we can use in the same way we used it with Texticle.
class Article < ActiveRecord::Base
  attr_accessible :author_id, :author, :content, :name, :published_at
  belongs_to :author
  has_many :comments

  include PgSearch
  pg_search_scope :search, against: [:name, :content]

  def self.text_search(query)
    if query.present?
      search(query)
    else
      scoped
    end
  end
end
We’ll need to restart our Rails application for these changes to be picked up, but after we do we should have full-text searching working with PgSearch. This won’t work exactly as before as it doesn’t use the English dictionary by default, so if we search for “characters” the articles containing “character” won’t be returned. If we look at the log we’ll see that the query passes simple to the tsvector and tsquery instead of the English dictionary.
Article Load (1.5ms) SELECT "articles".*, (ts_rank((to_tsvector('simple', coalesce("articles"."name", '')) || to_tsvector('simple', coalesce("articles"."content", ''))), (to_tsquery('simple', ''' ' || 'characters' || ' ''')), 0)) AS pg_search_rank FROM "articles" WHERE (((to_tsvector('simple', coalesce("articles"."name", '')) || to_tsvector('simple', coalesce("articles"."content", ''))) @@ (to_tsquery('simple', ''' ' || 'characters' || ' ''')))) ORDER BY pg_search_rank DESC, "articles"."id" ASC LIMIT 3 OFFSET 0
To get this behaviour back we’ll need to specify the dictionary like this:
pg_search_scope :search, against: [:name, :content], using: {tsearch: {dictionary: "english"}}
Now when we search for “characters” we’ll get the articles that contain “character” returned too.
Searching Related Attributes
One useful feature of PgSearch is that it makes it easy to search through related attributes. For example in our application an Article belongs to an Author and has many Comments. We can use PgSearch’s associated_against option to include these and we’ll include the author’s name and the comments’ name and content fields.
pg_search_scope :search, against: [:name, :content],
  using: {tsearch: {dictionary: "english"}},
  associated_against: {author: :name, comments: [:name, :content]}
Now a search for “Clark” will return all the articles written by Clark Kent.
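In the Rails console we could try this out with something like the following (a usage sketch rather than output from the episode):

# Matches can now come from the article itself, its author or its comments.
Article.text_search("Clark").map(&:name)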
Another of PgSearch’s useful features is the ability to ignore accents.
pg_search_scope :search, against: [:name, :content],
  using: {tsearch: {dictionary: "english"}},
  associated_against: {author: :name, comments: [:name, :content]},
  ignoring: :accents
This might not work out of the box and if we reload the page now we’ll get an exception complaining about an undefined function called unaccent in the SQL. To fix this we’ll need to add an extension to our Postgres database. We can do this inside a migration so we’ll create one to do that.
$ rails g migration add_unaccent_extension
This migration will just execute a SQL statement.
class AddUnaccentExtension < ActiveRecord::Migration
  def up
    execute "create extension unaccent"
  end

  def down
    execute "drop extension unaccent"
  end
end
We can now run the migration with rake db:migrate and when we reload the page the error is gone. We can search for “süperman” (with an umlaut) now and this will find the articles containing “superman”.
PgSearch provides many other options such as the ability to add weights to certain attributes in the relevance ranking and there’s more information about this in the documentation.
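As a sketch of what weighting might look like, pg_search accepts a hash for against with a weight label per column, so matches in name can count for more than matches in content (check the PgSearch documentation for the exact options):

# A sketch of weighting: 'A'-weighted columns rank higher than 'B'-weighted ones.
pg_search_scope :search, against: {name: 'A', content: 'B'},
  using: {tsearch: {dictionary: "english"}}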
Speeding Up Searches
Let’s shift gears now and focus on performance. We can see from the log that the search query only takes a couple of milliseconds to run, but as we only have a few records in the database this isn’t really a surprise. To see how searching performs under more realistic conditions we’ll need to add more records. The seeds.rb file generates the initial records for our application and we’ll alter this file so it creates each record 1,000 times.
bruce = Author.create! name: "Bruce Wayne"
clark = Author.create! name: "Clark Kent"

1000.times do
  superman = Article.create! name: "Superman", author: clark, published_at: 3.weeks.ago, content: <<-ARTICLE
    Superman is a fictional comic book superhero appearing in publications by DC Comics, widely considered to be an American cultural icon. Created by American writer Jerry Siegel and Canadian-born American artist Joe Shuster in 1932 while both were living in Cleveland, Ohio, and sold to Detective Comics, Inc. (later DC Comics) in 1938, the character first appeared in Action Comics #1 (June 1938) and subsequently appeared in various radio serials, television programs, films, newspaper strips, and video games. (from Wikipedia)
  ARTICLE

  # Some articles and comments omitted.

  Comment.create! name: "Lois Lane", article: superman, content: <<-COMMENT
    I want to meet Superman again, does anyone know where I can find him?
  COMMENT
end
We could use a Rake task to do this but this will work as a quick and dirty way to create the records. We can run the file by running rake db:seed. We’ll also go back to our Article model and change back to the more manual search so that we have a simpler query that we can focus on optimizing.
def self.text_search(query)
  if query.present?
    # search(query)
    rank = <<-RANK
      ts_rank(to_tsvector(name), plainto_tsquery(#{sanitize(query)})) +
      ts_rank(to_tsvector(content), plainto_tsquery(#{sanitize(query)}))
    RANK
    where("name @@ :q or content @@ :q", q: query).order("#{rank} DESC")
  else
    scoped
  end
end
When we make a search now it takes a noticeable time to run. If we look at the log we’ll see that the search took almost a second to complete, long enough for Rails’ auto-explain to kick in and run an EXPLAIN on the query. The main problem is missing indexes and adding them will make a big difference to the speed. We should add an index to whatever’s on the left side of the @@ operators, in this case the name and content fields. Also it’s not good to rely on the auto-typecasting so we’ll expand out the query.
def self.text_search(query)
  if query.present?
    # search(query)
    rank = <<-RANK
      ts_rank(to_tsvector(name), plainto_tsquery(#{sanitize(query)})) +
      ts_rank(to_tsvector(content), plainto_tsquery(#{sanitize(query)}))
    RANK
    where("to_tsvector('english', name) @@ :q or to_tsvector('english', content) @@ :q", q: query).order("#{rank} DESC")
  else
    scoped
  end
end
This is the same query that we had before but now we have something more concrete that we can add an index to. We’ll generate a migration to add the indexes.
$ rails g migration add_search_index_to_articles
The migration will create the two new indexes.
class AddSearchIndexToArticles < ActiveRecord::Migration
  def up
    execute "create index articles_name on articles using gin(to_tsvector('english', name))"
    execute "create index articles_content on articles using gin(to_tsvector('english', content))"
  end

  def down
    execute "drop index articles_name"
    execute "drop index articles_content"
  end
end
We create two indexes here, one for the name column and one for content, and we use the same to_tsvector call that we use in the query to generate each index. Note, though, that we wrap this in gin(), which builds a GIN index, an index type that is efficient for full-text searching. Postgres offers another index type called GiST which is also worth considering. There are more details about these in the documentation but as a rule GIN works best for web applications.
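If we wanted to experiment with GiST instead, the index statements would look almost identical, just swapping gin for gist (shown only as a sketch; we’ll stick with GIN in this application):

# GiST alternative: typically faster to build and update, but slower to query than GIN.
execute "create index articles_name on articles using gist(to_tsvector('english', name))"
execute "create index articles_content on articles using gist(to_tsvector('english', content))"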
After we’ve migrated our database to add the indexes we can try running another search. This time it runs more quickly and if we look at the logs we’ll see that the database query takes about half the time it did before. This is good but it’s not quite the speedup we were hoping for. The primary cause of the remaining slowness is the call to ts_rank as this doesn’t take advantage of the index. If we sort the results by name rather than rank the searches become blazing fast, the log now showing the results returned in a few milliseconds. As a compromise we could omit the content field from the relevance ranking. This way there will be some relevance but not as much and the query still runs quickly.
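A sketch of that compromise, keeping the ranking but calculating it against the name column only, might look like this:

# Ranking on name alone avoids evaluating the expensive to_tsvector(content)
# expression for every matching row when sorting.
rank = "ts_rank(to_tsvector(name), plainto_tsquery(#{sanitize(query)}))"
where("to_tsvector('english', name) @@ :q or to_tsvector('english', content) @@ :q", q: query).order("#{rank} DESC")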
Problems With schema.rb
We’ll finish off this episode by talking about the schema.rb file. This is supposed to be a representation of the full database schema in Ruby but it’s now incomplete. This is because we’ve added some migrations that run SQL code and this is quite common when a Rails application uses Postgres as its database server. If we try to load in the schema file, e.g. when we set up our test database, it will be incomplete. To fix this we can uncomment the following line in our application config file.
# Use SQL instead of Active Record's schema dumper when creating the database.
# This is necessary if your schema can't be completely dumped by the schema dumper,
# like if you have constraints or database-specific column types
config.active_record.schema_format = :sql
If we run rake db:migrate now the schema file will be generated in SQL. A new structure.sql file will be created and this will include everything we need to generate the database. We can now remove the db/schema.rb file. If we ever need to load or dump the schema data now we should use the db:structure:load and db:structure:dump Rake tasks and not the schema tasks.