« SSL and login fixes for rt.cpan.org | Main | Upgrading to RT 4 »

June 27, 2011

What's New in 4: Fast full-text searching

RT is commonly used to manage a lot of data. Unfortunately, prior to RT 4, it didn't support effective indexing of the content of comments or correspondence. RT 4 makes use of the full-text indexing capabilities of PostgreSQL and Oracle, or the external Sphinx library for MySQL, to efficiently index this content, allowing for high-speed searching of textual content. Due to bugs in version 4.0.0, you'll need version 4.0.1 or higher to take advantage of the full-text indexing detailed below.

By default, with no full-text indexes enabled, RT 4 does not allow searching of the "Content" field at all, owing to how resource-intensive unindexed full-text searches are. If you wish to enable non-indexed full-text searching, like previous versions of RT allowed, you will need to add the following to your RT_SiteConfig.pm:

Set(%FullTextSearch,
    Enable  => 1,
    Indexed => 0,
);

The real magic comes from enabling your database's full-text indexes, however. For the section below, we'll be showing the configuration for PostgreSQL; we'll briefly address MySQL, which is slightly more complicated, at the end of this post.

If you intend to enable full-text indexed searches, you should also read the docs/full_text_indexing.pod file which shipped with your RT install, which will provide roughly the same information as below, but is additionally guaranteed to be up-to-date with your version of RT.

PostgreSQL configuration

To start off with, you'll need to configure your database to store the index. On PostgreSQL, this consists of adding a new column and creating an index on it. To do this, run rt-setup-fulltext-index:

$ /opt/rt4/sbin/rt-setup-fulltext-index
Enter the name of a DB table that will be used to store the Pg tsvector.
You may either use the existing Attachments table, or create a new
table.
[Attachments]:

Enter the name of a column that will be used to store the Pg tsvector:
[ContentIndex]:

Generally, you'll want to store the new column in the Attachments table, right next to the data it is indexing. If you want to back up the index separately, or have other storage constraints, you may wish to create a new table to store it, but the default is almost certainly correct. Similarly, the default column name is also most likely correct.

You may choose between GiST or GIN indexes; the former is several times
slower to search, but takes less space on disk and is faster to update.
[GiST]:

You can read more about the differences between the index types in the PostgreSQL manual.

Going to run the following in the DB:
    ALTER TABLE Attachments ADD COLUMN ContentIndex tsvector

Going to run the following in the DB:
    CREATE INDEX ContentIndex_idx ON Attachments USING gist(ContentIndex)

You can now configure RT to use the newly-created full-text index by
adding the following to your RT_SiteConfig.pm:

Set( %FullTextSearch,
    Enable     => 1,
    Indexed    => 1,
    Column     => 'ContentIndex',
    Table      => 'Attachments',
);

At this point, your database has been configured. Ensure that you add the Set(...) block that rt-setup-fulltext-index output (which may be different from the above!) to your RT_SiteConfig.pm before continuing.

Your index now exists, but is empty. In order to fill it, we must find and index every textual attachment in RT using rt-fulltext-indexer --all. This can be a very time-consuming task, so be prepared for it to take a while. It can be interrupted and resumed safely, however.

Once it has finished, you will need to schedule rt-fulltext-indexer to run at regular intervals to pick up new additions to the database. This is easiest done by running ln -s /opt/rt4/sbin/rt-fulltext-indexer /etc/cron.hourly, which will ensure that it runs once an hour, but many other alternatives are possible. Depending on how fast tickets are created in your system, running it as frequently as once every two to five minutes may be possible.

MySQL configuration

The story on MySQL is unfortunately considerably more complicated, as MySQL does not natively support a full-text index format. It integrates with the external Sphinx search library, but this requires recompiling MySQL to include the SphinxSE engine which allows MySQL queries to retrieve data from the Sphinx searchd indexer. While many vendors package the external Sphinx tools, none of them ships a MySQL package which has been compiled with SphinxSE at the time of writing. As such, this is most likely not for the faint of heart, or those only just getting their feet wet with system administration.

Despite that, the process of recompiling MySQL with SphinxSE is fairly straightforward. Full instructions are provided in the Sphinx documentation. Sphinx 2.0.1 has been tested to work with both MySQL 5.0 and 5.1; it is not compatible with MySQL 5.5 at this time.

Once a SphinxSE-enabled MySQL has been installed, the configuration proceeds roughly as for PostgreSQL, above. First, you must run rt-setup-fulltext-index, which will ensure that your MySQL is correctly configured with SphinxSE, then prompt you for additional information:

$ /opt/rt4/sbin/rt-setup-fulltext-index
Enter name of a new MySQL table that will be used to connect to the
Sphinx server:
[AttachmentsIndex]:

Enter URL of the sphinx search server; this should be of the form
sphinx://<server>:<port>/<index name>
[sphinx://localhost:3312/rt]:

In order to communicate with the Sphinx searchd server, we create a virtual SphinxSE table in the database. This table hence needs to know where your Sphinx searchd server is running. The default is usually correct if you are planning to run searchd on your database server on the default port.

Maximum number of matches to return; this is the maximum number of
attachment records returned by the search, not the maximum number
of tickets.  Both your RT_SiteConfig.pm and your sphinx.conf must
agree on this value.  Larger values cause your Sphinx server to
consume more memory and CPU time per query.
[10000]:

This is an unfortunate limitation imposed by the design of Sphinx; 10000 should be sufficient for most purposes, but you may need to increment this number if you have tickets with an extremely large number of transactions which match your common search criteria. If you wish to change this number later, you must alter it in both RT_SiteConfig.pm and sphinx.conf files.

Going to run the following in the DB:
    CREATE TABLE AttachmentsIndex (
        id     INTEGER UNSIGNED NOT NULL,
        weight INTEGER NOT NULL,
        query  VARCHAR(3072) NOT NULL,
        INDEX(query)
    ) ENGINE=SPHINX CONNECTION="sphinx://localhost:3312/rt" CHARACTER SET utf8

You can now configure RT to use the newly-created full-text index by
adding the following to your RT_SiteConfig.pm:

Set( %FullTextSearch,
    Enable     => 1,
    Indexed    => 1,
    Table      => 'AttachmentsIndex',
    MaxMatches => '10000',
);

It will also output a sample sphinx.conf file. At this point, your database has been configured. Ensure that you add the Set(...) block that rt-setup-fulltext-index output (which may be different from the above!) to your RT_SiteConfig.pm before continuing.

Your index now exists, but is empty. In order to fill it, we must run the Sphinx indexer:

$ indexer rt
Sphinx 2.0.1-beta (r2792)
Copyright (c) 2001-2011, Andrew Aksyonoff
Copyright (c) 2008-2011, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '/etc/sphinx.conf'...
indexing index 'rt'...
collected 133809 docs, 1015.6 MB
sorted 137.2 Mhits, 100.0% done
total 133809 docs, 1015645236 bytes
total 144.967 sec, 7006036 bytes/sec, 923.02 docs/sec
total 400 reads, 6.474 sec, 637.6 kb/call avg, 16.1 msec/call avg
total 582 writes, 2.473 sec, 959.3 kb/call avg, 4.2 msec/call avg

Finally, you will need to start the searchd daemon:

$ searchd
Sphinx 2.0.1-beta (r2792)
Copyright (c) 2001-2011, Andrew Aksyonoff
Copyright (c) 2008-2011, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '/etc/sphinx.conf'...
listening on all interfaces, port=3312
precaching index 'rt'
precached 1 indexes in 0.033 sec

You will need to re-index the data using indexer --rotate rt at regular intervals. Since Sphinx does not do incremental indexing, but rather re-indexes all content each time it is run, a "main+delta" split of the data may be necessary to reduce index update time sufficiently to allow close to real-time indexing.

TrackBack

TrackBack URL for this entry:
http://www.typepad.com/services/trackback/6a00d83456074b69e2014e896d805e970d

Listed below are links to weblogs that reference What's New in 4: Fast full-text searching:

Comments

Feed You can follow this conversation by subscribing to the comment feed for this post.

The comments to this entry are closed.