Sphinx is a great open source package for implementing full-text search. Before we can use it to search, we must first load all of our data into it. There are two primary ways of loading that data: directly accessing it via a SQL query, or using the xmlpipe2 format. Although using the database as a direct data source is very fast, it can sometimes be difficult to craft a query that returns normalized data for all the fields you require in an index. The XML option gives us much more flexibility at the cost of some speed (although it is still very fast). This article will show you how to generate that XML. It assumes you have a basic understanding of how Sphinx works; if not, browse the docs first.
An example xmlpipe2 format looks like this:
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>

  <sphinx:schema>
    <sphinx:field name="subject"/>
    <sphinx:field name="content"/>
    <sphinx:attr name="published" type="timestamp"/>
    <sphinx:attr name="author_id" type="int" bits="16" default="1"/>
  </sphinx:schema>

  <sphinx:document id="1234">
    <content>this is the main content <![CDATA[[and this <cdata> entry must be handled properly by xml parser lib]]></content>
    <published>1012325463</published>
    <subject>note how field/attr tags can be in <b class="red">randomized</b> order</subject>
    <misc>some undeclared element</misc>
  </sphinx:document>

  <!-- ... more documents here ... -->

</sphinx:docset>
First we define the schema, which contains fields and attributes. Fields will be processed for fulltext searches, and attributes will be used to help filter those search results. More information about attributes and their options can be found in the docs. Once the schema is defined, we start adding our document data. A document contains elements that will map to the previously defined fields and attributes.
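The sample above also contains a <misc> element that the schema never declares. As far as I know, Sphinx ignores such elements (with a warning), so it can be handy to catch them before indexing. The helper below is a minimal sketch of that check, assuming PHP 7 or later; findUndeclaredKeys() is a hypothetical name used for illustration and is not part of Sphinx or of the class in this article.

```php
<?php

/**
 * Return the keys of $doc that are neither the document id, a declared
 * field, nor the name of a declared attribute.
 * Hypothetical helper, for illustration only.
 */
function findUndeclaredKeys(array $doc, array $fields, array $attributes): array
{
    // Everything a document is allowed to contain: the id, the full-text
    // fields, and the "name" entry of each attribute definition.
    $declared = array_merge(array('id'), $fields, array_column($attributes, 'name'));

    return array_values(array_diff(array_keys($doc), $declared));
}

// Schema and document taken from the xmlpipe2 sample above.
$fields = array('subject', 'content');
$attributes = array(
    array('name' => 'published', 'type' => 'timestamp'),
    array('name' => 'author_id', 'type' => 'int', 'bits' => '16', 'default' => '1'),
);
$doc = array(
    'id'        => 1234,
    'content'   => 'this is the main content',
    'published' => 1012325463,
    'subject'   => 'note how field/attr tags can be in randomized order',
    'misc'      => 'some undeclared element',
);

$undeclared = findUndeclaredKeys($doc, $fields, $attributes);
print_r($undeclared); // lists only "misc"
```

Running a check like this over each document before it is written makes typos in field names visible early, instead of leaving them to surface as silently empty fields in the index.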
Let's try to encapsulate some of that logic into a PHP class:
<?php

class SphinxXMLFeed
{
    private $fields = array();
    private $attributes = array();
    private $documents = array();

    public function setFields($fields)
    {
        $this->fields = $fields;
    }

    public function setAttributes($attributes)
    {
        $this->attributes = $attributes;
    }

    public function addDocument($doc)
    {
        $this->documents[] = $doc;
    }

    public function render()
    {
        // create a new XML document
        $dom = new DOMDocument('1.0');
        $dom->encoding = "utf-8";
        $dom->formatOutput = true;

        // create root node
        $root = $dom->createElement('sphinx:docset');
        $root = $dom->appendChild($root);

        // create the schema
        $schema = $dom->createElement('sphinx:schema');

        // common elements we will be cloning
        $tmp_field = $dom->createElement('sphinx:field');
        $tmp_attr = $dom->createElement('sphinx:attr');

        // add fields to the schema
        foreach ($this->fields as $field) {
            $new_field = clone $tmp_field;
            $new_field->setAttribute('name', $field);
            $schema->appendChild($new_field);
        }

        // add attributes to the schema
        foreach ($this->attributes as $attributes) {
            $new_attr = clone $tmp_attr;
            foreach ($attributes as $key => $value) {
                $new_attr->setAttribute($key, $value);
            }
            // append once, after all XML attributes have been set
            $schema->appendChild($new_attr);
        }

        // add the schema to the document
        $root->appendChild($schema);

        // go through each document
        foreach ($this->documents as $doc) {
            $node = $dom->createElement('sphinx:document');
            $node->setAttribute('id', $doc['id']);

            foreach ($doc as $key => $value) {
                // the id is already stored as an XML attribute
                if ($key == 'id') {
                    continue;
                }

                // createTextNode() escapes special characters in the value
                $tmp = $dom->createElement($key);
                $tmp->appendChild($dom->createTextNode($value));
                $node->appendChild($tmp);
            }

            // add the document to the dom
            $root->appendChild($node);
        }

        // return xml text
        return $dom->saveXML();
    }
}
The previous code uses PHP's DOMDocument interface because it is less error prone than manually echoing XML tags. One downside of using DOMDocument is that we must build the entire XML tree before we can output it. This means every document is kept in memory, so if you are indexing a large amount of data you will probably hit PHP's memory limit. We will fix this in the next article. For now, you can use this class as follows:
// instantiate the class
$doc = new SphinxXMLFeed();

// set the fields we will be indexing
$doc->setFields(array(
    'title',
    'teaser',
    'content',
));

// set any attributes
$doc->setAttributes(array(
    array('name' => 'blog_id', 'type' => 'int', 'bits' => '16', 'default' => '0'),
));

// generate some random documents. These would usually be pulled from a database
// or other data source
foreach (range(1, 3) as $id) {
    $doc->addDocument(array(
        'id' => $id,
        'blog_id' => rand(1, 10),
        'title' => "Article Part {$id}",
        'teaser' => "Article {$id} teaster",
        'content' => "Article {$id} content",
    ));
}

// render the XML and print it; render() only returns the string,
// and the indexer reads whatever this script writes to stdout
echo $doc->render();
That code will generate the following XML:
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
  <sphinx:schema>
    <sphinx:field name="title"/>
    <sphinx:field name="teaser"/>
    <sphinx:field name="content"/>
    <sphinx:attr name="blog_id" type="int" bits="16" default="0"/>
  </sphinx:schema>
  <sphinx:document id="1">
    <blog_id>6</blog_id>
    <title>Article Part 1</title>
    <teaser>Article 1 teaster</teaser>
    <content>Article 1 content</content>
  </sphinx:document>
  ...
</sphinx:docset>
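The sample data above contains no special characters, so it does not show why building text nodes through the DOM is safer than echoing tags by hand. The short sketch below isolates that step: it passes a made-up title containing & and < through createTextNode(), the same call the class uses. The title string is an assumption chosen for illustration.

```php
<?php

// Build a single element the same way SphinxXMLFeed::render() does.
$dom = new DOMDocument('1.0', 'utf-8');
$element = $dom->createElement('title');

// Illustrative value containing characters that are special in XML.
$element->appendChild($dom->createTextNode('Tips & <b>tricks</b>'));
$dom->appendChild($element);

// Serialize just this element (no XML declaration).
$xml = $dom->saveXML($element);
echo $xml, "\n";
// prints: <title>Tips &amp; &lt;b&gt;tricks&lt;/b&gt;</title>
```

Had the value been echoed directly between the tags, the bare & would have made the feed malformed and the <b> markup would have been parsed as a nested element.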
You would set up your data source in sphinx.conf something like this:
source xml_blog_posts
{
    type = xmlpipe2
    xmlpipe_command = /usr/bin/php /home/example.com/lib/tasks/sphinx_blogs.php
}
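Because the indexer simply reads whatever the xmlpipe_command writes to stdout, the feed does not have to exist as one big string before it is printed. The sketch below shows one possible way to write it incrementally with PHP's XMLWriter extension. It is not necessarily the approach the next article takes; streamSphinxFeed() and its parameters are hypothetical names, and it assumes PHP 7.1 or later with XMLWriter available.

```php
<?php

/**
 * Write an xmlpipe2 feed one document at a time.
 * Hypothetical helper, for illustration only.
 *
 * @param array    $fields     names of the full-text fields
 * @param array    $attributes attribute definitions (name, type, ...)
 * @param iterable $documents  arrays with an 'id' key; may be a generator
 * @param string   $uri        where to write; defaults to stdout
 */
function streamSphinxFeed(array $fields, array $attributes, iterable $documents, string $uri = 'php://output'): void
{
    $writer = new XMLWriter();
    $writer->openUri($uri);
    $writer->setIndent(true);
    $writer->startDocument('1.0', 'utf-8');
    $writer->startElement('sphinx:docset');

    // schema: one element per field and per attribute
    $writer->startElement('sphinx:schema');
    foreach ($fields as $field) {
        $writer->startElement('sphinx:field');
        $writer->writeAttribute('name', $field);
        $writer->endElement();
    }
    foreach ($attributes as $attribute) {
        $writer->startElement('sphinx:attr');
        foreach ($attribute as $key => $value) {
            $writer->writeAttribute($key, (string) $value);
        }
        $writer->endElement();
    }
    $writer->endElement(); // sphinx:schema

    foreach ($documents as $doc) {
        $writer->startElement('sphinx:document');
        $writer->writeAttribute('id', (string) $doc['id']);
        foreach ($doc as $key => $value) {
            if ($key === 'id') {
                continue;
            }
            // writeElement() escapes the text content
            $writer->writeElement($key, (string) $value);
        }
        $writer->endElement(); // sphinx:document

        // push this document out instead of buffering the whole feed
        $writer->flush();
    }

    $writer->endElement(); // sphinx:docset
    $writer->endDocument();
    $writer->flush();
}
```

If $documents is a generator that yields one row at a time, for example from a database cursor, only the current document is held in memory while the feed is written.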
Don’t forget to check out the next article, where we optimize this class to handle millions of records!
Continue to next article: Sphinx xmlpipe2 in PHP: Part II
