Sphinx xmlpipe2 in PHP: Part II

words by Brian Racer

In the last article we successfully created a PHP class that outputs XML as input for Sphinx’s indexer. However, it was incredibly inefficient because we had to hold every document in memory. Here is an updated class that extends XMLWriter, a built-in PHP class that is sparsely documented but works great for creating memory-efficient streams of XML data. Rather than keeping each document in memory, XMLWriter allows us to flush each document’s XML elements to standard output immediately.

<?php
/*
  *  SphinxXMLFeed - efficiently generate XML for Sphinx's xmlpipe2 data adapter
  *  (c) 2009 Jetpack LLC http://jetpackweb.com
  */
class SphinxXMLFeed extends XMLWriter
{
  private $fields = array();
  private $attributes = array();
 
  public function __construct($options = array())
  {
    $defaults = array(
      'indent' => false,
    );
    $options = array_merge($defaults, $options);
 
    // Buffer the XML in memory; each method below flushes the buffer after writing
    $this->openMemory();
 
    if($options['indent']) {
      $this->setIndent(true);
    }
  }
 
  public function setFields($fields) {
    $this->fields = $fields;
  }
 
  public function setAttributes($attributes) {
    $this->attributes = $attributes;
  }
 
  public function addDocument($doc) {
    $this->startElement('sphinx:document');
    $this->writeAttribute('id', $doc['id']);
 
    foreach($doc as $key => $value) {
      // Skip the id key since that is an element attribute
      if($key === 'id') continue;
 
      $this->startElement($key);
      $this->text($value);
      $this->endElement();
    }
 
    $this->endElement();
    print $this->outputMemory();
  }
 
  public function beginOutput() {
 
    $this->startDocument('1.0', 'UTF-8');
    $this->startElement('sphinx:docset');
    $this->startElement('sphinx:schema');
 
    // add fields to the schema
    foreach($this->fields as $field) {
      $this->startElement('sphinx:field');
      $this->writeAttribute('name', $field);
      $this->endElement();
    }
 
    // add attributes to the schema
    foreach($this->attributes as $attributes) {
      $this->startElement('sphinx:attr');
      foreach($attributes as $key => $value) {
        $this->writeAttribute($key, $value);
      }
      $this->endElement();
    }
 
    // end sphinx:schema
    $this->endElement();
    print $this->outputMemory();
  }
 
  public function endOutput()
  {
    // end sphinx:docset
    $this->endElement();
    print $this->outputMemory();
  }
}

We can use it as follows:

$doc = new SphinxXMLFeed();
 
$doc->setFields(array(
  'title',
  'teaser',
  'content',
));
 
$doc->setAttributes(array(
  array('name' => 'blog_id', 'type' => 'int', 'bits' => '16', 'default' => '0'),
));
 
$doc->beginOutput();
 
foreach(range(1, 1000) as $id) {
  $doc->addDocument(array(
    'id' => $id,
    'blog_id' => rand(1, 10),
    'title' => "Article Part {$id}",
    'teaser' => "Article {$id} teaser",
    'content' => "Article {$id} content",
  ));
}
 
$doc->endOutput();

As you can see, the first thing we need to do is populate the fields and attributes. Once that is done, we call beginOutput, which creates the head of the XML document. After each document is added, its XML markup is output immediately and the memory buffer is cleared.

Finally we call endOutput, which will close the sphinx:docset element.
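
The flush-as-you-go behavior can be seen in isolation with a bare XMLWriter. This is only a minimal sketch: the element names and the sample text are made up for illustration and are not part of the class above.

```php
<?php
$writer = new XMLWriter();
$writer->openMemory();

// The root element stays open while chunks are flushed.
$writer->startElement('docset');

$writer->startElement('doc');
$writer->writeAttribute('id', '1');
$writer->text('Tom & Jerry <cartoon>'); // text() escapes markup characters
$writer->endElement();

// outputMemory() returns what has been buffered so far and, by default,
// empties the buffer.
$firstChunk = $writer->outputMemory();

// Nothing has been written since the flush, so there is nothing to return.
$secondChunk = $writer->outputMemory();

// Closing the root only emits the closing tag.
$writer->endElement();
$lastChunk = $writer->outputMemory();

echo $firstChunk, "\n", $lastChunk, "\n";
```

Because the buffer is emptied on every flush, peak memory depends on the size of a single document rather than on the size of the whole feed.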

I have used this class in production to index millions of records that take up dozens of gigabytes. Keep in mind that if you are working with that much data, you will probably need to batch your queries so you are not loading all the records at once!
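
One way to batch is to wrap the data source in a generator that fetches a page at a time. The following is only a sketch under stated assumptions: batchedRows and $fetchPage are hypothetical names, the array-backed fetcher stands in for a real LIMIT/OFFSET query, and the type declarations assume PHP 7 or later.

```php
<?php
/**
 * Yield rows one at a time while fetching them from the data source in pages.
 *
 * $fetchPage is a hypothetical callable: function (int $offset, int $limit): array
 */
function batchedRows(callable $fetchPage, int $batchSize = 1000): Generator
{
    $offset = 0;
    do {
        $rows = $fetchPage($offset, $batchSize);
        foreach ($rows as $row) {
            yield $row;
        }
        $offset += $batchSize;
        // A short (or empty) page means the source is exhausted.
    } while (count($rows) === $batchSize);
}

// Stand-in data source: 2,500 fake rows, sliced the way LIMIT/OFFSET would.
$allRows = array();
foreach (range(1, 2500) as $id) {
    $allRows[] = array('id' => $id, 'title' => "Article {$id}");
}
$fetchPage = function (int $offset, int $limit) use ($allRows): array {
    return array_slice($allRows, $offset, $limit);
};

$count = 0;
foreach (batchedRows($fetchPage, 1000) as $row) {
    // In the feed script this is where $doc->addDocument($row) would go.
    $count++;
}
```

Only one page of rows is held at a time, so this pairs naturally with a writer that flushes after every document.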


Sphinx xmlpipe2 in PHP: Part I

words by Brian Racer

Sphinx is a great open source package for implementing full-text search. Before we can use it to search, we first must load all of our data into it. There are two primary ways of loading that data: directly accessing the data via a SQL query, or using the xmlpipe2 format. Although using the database as a direct data source is very fast, it can sometimes be difficult to craft a query that will contain normalized data for all the fields you require in an index. The XML option gives us much more flexibility at the cost of speed (although it is still very fast). This article shows you how to generate that XML. It assumes you have a basic understanding of how Sphinx works; if not, browse the docs first.

An example xmlpipe2 format looks like this:

<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
 
<sphinx:schema>
  <sphinx:field name="subject"/> 
  <sphinx:field name="content"/>
  <sphinx:attr name="published" type="timestamp"/>
  <sphinx:attr name="author_id" type="int" bits="16" default="1"/>
</sphinx:schema>
 
<sphinx:document id="1234">
  <content>this is the main content <![CDATA[[and this <cdata> entry must be handled properly by xml parser lib]]></content>
  <published>1012325463</published>
  <subject>note how field/attr tags can be in <b class="red">randomized</b> order</subject>
  <misc>some undeclared element</misc>
</sphinx:document>
<!-- ... more documents here ... -->
</sphinx:docset>

First we define the schema, which contains fields and attributes. Fields will be processed for fulltext searches, and attributes will be used to help filter those search results. More information about attributes and their options can be found in the docs. Once the schema is defined, we start adding our document data. A document contains elements that will map to the previously defined fields and attributes.

Let’s try to encapsulate some of that logic into a PHP class:

<?php
 
class SphinxXMLFeed
{
  private $fields = array();
  private $attributes = array();
  private $documents = array();
 
  public function setFields($fields) {
    $this->fields = $fields;
  }
 
  public function setAttributes($attributes) {
    $this->attributes = $attributes;
  }
 
  public function addDocument($doc) {
    $this->documents[] = $doc;
  }
 
  public function render() {
 
    // create a new XML document
    $dom = new DomDocument('1.0');
    $dom->encoding = "utf-8";
    $dom->formatOutput = true;
 
    // create root node
    $root = $dom->createElement('sphinx:docset');
    $root = $dom->appendChild($root);
 
    // create the schema
    $schema = $dom->createElement('sphinx:schema');
 
    // common fields we will be cloning
    $tmp_field = $dom->createElement('sphinx:field');
    $tmp_attr  = $dom->createElement('sphinx:attr');
 
    // add fields to the schema
    foreach($this->fields as $field) {
      $new_field = clone($tmp_field);
      $new_field->setAttribute('name', $field);
      $schema->appendChild($new_field);
    }
 
    // add attributes to the schema
    foreach($this->attributes as $attributes) {
      $new_attr = clone($tmp_attr);
      foreach($attributes as $key => $value) {
        $new_attr->setAttribute($key, $value);
      }
      // append once, after all of the attribute's options are set
      $schema->appendChild($new_attr);
    }
 
    // add the schema to the document
    $root->appendChild($schema);
 
    // go through each document
    foreach($this->documents as $doc) {
      $node = $dom->createElement('sphinx:document');
      $node->setAttribute('id', $doc['id']);
 
      foreach($doc as $key => $value) {
        if($key === 'id') continue;
        $tmp = $dom->createElement($key);
        $tmp->appendChild($dom->createTextNode($value));
 
        $node->appendChild($tmp);
      }
 
      // add the document to the dom
      $root->appendChild($node);
    }
 
    // return xml text
    return $dom->saveXML();
  }
}

The previous code uses PHP’s DomDocument interface because that is less error prone than manually echo’ing out XML tags. One downside of using DomDocument is we must build the entire XML tree before we can output it. This means we must keep each document in memory, so if you are indexing a large amount of data you will probably hit PHP’s memory limit. We will fix this in the next article. For now, you can use this class as follows:

// instantiate the class
$doc = new SphinxXMLFeed();
 
// set the fields we will be indexing
$doc->setFields(array(
  'title',
  'teaser',
  'content',
));
 
// set any attributes
$doc->setAttributes(array(
  array('name' => 'blog_id', 'type' => 'int', 'bits' => '16', 'default' => '0'),
));
 
// generate some random document. These would usually be pulled from a database
// or other data source
foreach(range(1, 3) as $id) {
  $doc->addDocument(array(
    'id' => $id,
    'blog_id' => rand(1, 10),
    'title' => "Article Part {$id}",
    'teaser' => "Article {$id} teaster",
    'content' => "Article {$id} content",
  ));
}
 
// Render the XML and print it so the indexer can read it from standard output
echo $doc->render();

That code will generate the following XML:

<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
  <sphinx:schema>
    <sphinx:field name="title"/>
    <sphinx:field name="teaser"/>
    <sphinx:field name="content"/>
    <sphinx:attr name="blog_id" type="int" bits="16" default="0"/>
  </sphinx:schema>
  <sphinx:document id="1">
    <blog_id>6</blog_id>
    <title>Article Part 1</title>
    <teaser>Article 1 teaster</teaser>
    <content>Article 1 content</content>
  </sphinx:document>
  ...
</sphinx:docset>

You would set up your data source in sphinx.conf something like this:

source xml_blog_posts
{
    type = xmlpipe2
    xmlpipe_command = /usr/bin/php /home/example.com/lib/tasks/sphinx_blogs.php
}

Don’t forget to check out the next article, where we optimize this class to handle millions of records!

Continue to next article: Sphinx xmlpipe2 in PHP: Part II


Useful PHP Subversion Commit Hook

words by Brian Racer

Here is a Subversion pre-commit hook script we use on PHP projects. It runs PHP lint on each changed PHP script to make sure it will compile correctly, and then makes sure the developer making the commit is providing a meaningful description.

#!/bin/bash
 
REPOS="$1"
TXN="$2"
 
PHP="/usr/bin/php"
SVNLOOK="/usr/bin/svnlook"
AWK="/usr/bin/awk"
GREP="/bin/egrep"
SED="/bin/sed"
 
CHANGED=`$SVNLOOK changed -t "$TXN" "$REPOS" | $AWK '{print $2}' | $GREP \\.php$`
 
for FILE in $CHANGED
do
    MESSAGE=`$SVNLOOK cat -t "$TXN" "$REPOS" "$FILE" | $PHP -l`
    if [ $? -ne 0 ]
    then
        echo 1>&2
        echo "***********************************" 1>&2
        echo "PHP error in: $FILE:" 1>&2
        echo `echo "$MESSAGE" | $SED "s| -| $FILE|g"` 1>&2
        echo "***********************************" 1>&2
        exit 1
    fi
done
 
# Make sure that the log message contains some text.
SVNLOOKOK=1
SVNLOOK=/usr/bin/svnlook
$SVNLOOK log -t "$TXN" "$REPOS" | \
   grep "[a-zA-Z0-9]" > /dev/null || SVNLOOKOK=0
if [ $SVNLOOKOK = 0 ]; then
  echo Empty log messages are not allowed. Please provide a proper log message. 1>&2
  exit 1
fi
 
# Make sure text might be meaningful
LOGMSGLEN=$($SVNLOOK log -t "$TXN" "$REPOS" | grep "[a-zA-Z0-9]" | wc -c)
if [ "$LOGMSGLEN" -lt 6 ]; then
  echo -e "Please provide a meaningful comment when committing changes." 1>&2
  exit 1
fi
 
# All checks passed, so allow the commit.
exit 0
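
For anyone who wants to try the two log-message rules without committing, they can be mirrored in PHP. This is a sketch only: logMessageProblem is a hypothetical helper and not part of the hook, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
/**
 * Mirror of the hook's two log-message rules:
 *   1. at least one line must contain a letter or digit;
 *   2. those lines must total at least 6 bytes, counting one newline per
 *      line, which is what `grep ... | wc -c` measures in the hook.
 *
 * Returns the rejection message, or null when the log message passes.
 */
function logMessageProblem(string $message): ?string
{
    $lines = preg_split('/\R/', rtrim($message, "\r\n"));
    $meaningful = preg_grep('/[a-zA-Z0-9]/', $lines);

    if (count($meaningful) === 0) {
        return 'Empty log messages are not allowed. Please provide a proper log message.';
    }

    $bytes = 0;
    foreach ($meaningful as $line) {
        $bytes += strlen($line) + 1; // +1 for the newline grep prints after each line
    }

    if ($bytes < 6) {
        return 'Please provide a meaningful comment when committing changes.';
    }

    return null;
}
```

In practice the length rule means a single-line message needs at least five characters.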

Override PHP’s mail() function during development

words by Brian Racer

When doing local development we generally don’t want our test servers sending out mail to the world. It would also be ideal to be able to review the emails our application sends out before deploying the changes. An easy way to achieve this is to override PHP’s sendmail_path config variable. First, let’s install a few packages that will allow us to send mail, along with some useful tools for rewriting it:

sudo apt-get install procmail sendmail

Next, create the following script, which will redirect all mail generated by PHP’s mail() function to the local user of your choice:

vi /usr/local/bin/trapmail
formail -R cc X-original-cc \
-R to X-original-to \
-R bcc X-original-bcc \
-f -A"To: [email protected]" \
| /usr/sbin/sendmail -t -i

Replace [email protected] with your local username or an external email address.

Now update your php.ini file’s sendmail_path:

grep sendmail_path /etc/php5/apache2/conf/php.ini
 
sendmail_path=/usr/local/bin/trapmail

You can then use a mail client like mutt or Thunderbird to review the emails, or just tail your mbox file.
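
Before sending any test mail, it can be worth confirming that PHP is actually using the trap script. The check below is a minimal sketch: isMailTrapped is a hypothetical helper, and the default path is the one used in this post.

```php
<?php
/**
 * Return true when the configured sendmail_path starts with the trap script.
 * The path is passed in as an argument so the check is easy to test.
 */
function isMailTrapped($sendmailPath, string $trapScript = '/usr/local/bin/trapmail'): bool
{
    return is_string($sendmailPath)
        && strpos(trim($sendmailPath), $trapScript) === 0;
}

// ini_get() returns false when the directive is not available.
$trapped = isMailTrapped(ini_get('sendmail_path'));

if (PHP_SAPI === 'cli') {
    echo $trapped
        ? "mail() is routed through the trap script\n"
        : "WARNING: mail() is NOT trapped on this machine\n";
}
```

Keep in mind that the CLI and the web server often load different php.ini files, so the check should run under the same SAPI that serves your application.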