In the last article we successfully created a PHP class that outputs XML as input for Sphinx’s indexer. However it was incredibly inefficient as we had to hold everything in memory. Here is an updated class that extends XMLWriter, which is a built in PHP class that is essentially undocumented and works great for creating memory efficient streams of XML data. Rather than keeping each document in memory, XMLWriter will allow us to immediately flush that document’s XML elements to standard output.
<?php /* * SphinxXMLFeed - efficiently generate XML for Sphinx's xmlpipe2 data adapter * (c) 2009 Jetpack LLC http://jetpackweb.com */ class SphinxXMLFeed extends XMLWriter { private $fields = array(); private $attributes = array(); public function __construct($options = array()) { $defaults = array( 'indent' => false, ); $options = array_merge($defaults, $options); // Store the xml tree in memory $this->openMemory(); if($options['indent']) { $this->setIndent(true); } } public function setFields($fields) { $this->fields = $fields; } public function setAttributes($attributes) { $this->attributes = $attributes; } public function addDocument($doc) { $this->startElement('sphinx:document'); $this->writeAttribute('id', $doc['id']); foreach($doc as $key => $value) { // Skip the id key since that is an element attribute if($key == 'id') continue; $this->startElement($key); $this->text($value); $this->endElement(); } $this->endElement(); print $this->outputMemory(); } public function beginOutput() { $this->startDocument('1.0', 'UTF-8'); $this->startElement('sphinx:docset'); $this->startElement('sphinx:schema'); // add fields to the schema foreach($this->fields as $field) { $this->startElement('sphinx:field'); $this->writeAttribute('name', $field); $this->endElement(); } // add attributes to the schema foreach($this->attributes as $attributes) { $this->startElement('sphinx:attr'); foreach($attributes as $key => $value) { $this->writeAttribute($key, $value); } $this->endElement(); } // end sphinx:schema $this->endElement(); print $this->outputMemory(); } public function endOutput() { // end sphinx:docset $this->endElement(); print $this->outputMemory(); } }
We can use it as follows:
$doc = new SphinxXMLFeed(); $doc->setFields(array( 'title', 'teaser', 'content', )); $doc->setAttributes(array( array('name' => 'blog_id', 'type' => 'int', 'bits' => '16', 'default' => '0'), )); $doc->beginOutput(); foreach(range(1, 1000) as $id) { $doc->addDocument(array( 'id' => $id, 'blog_id' => rand(1, 10), 'title' => "Article Part {$id}", 'teaser' => "Article {$id} teaster", 'content' => "Article {$id} content", )); } $doc->endOutput();
As you can see the first thing we need to do is populate the fields and attributes. Once that is done, we call beginOutput, that will create the head of the XML document. After each document is added, the document’s xml markup is immediately outputted and the memory buffer is cleared.
Finally we call endOutput, which will close the sphinx:docset element.
I have used this class in production to index millions of records that take up dozens of gigabytes. Keep in mind if you are working with that much data, you will probably need to bach your queries so you are not loading all the records at once!

September 21st, 2009 at 9:38 am
This is awesome. Thanks so much for the great tutorial.
You will have to forgive me, all of my php development experience is within the context of the Drupal CMS, which is does not make use of php objects and also adds a layer of database extraction, so there are a couple small things that are keeping me from getting this to work.
Would it be possible for you to post (or send me) the actual file with generic mysql connection information, etc? Thanks again.
June 8th, 2010 at 1:36 am
Nice script!!!
I started using sphinx and impressed with its functionality, but I feel difficulty to understand xmlpipe2, in this case what should I write in sphinx.conf to work above script??
Also I cant understant what xmlpipe_command in sphinx.conf is for and what should I write in xmlpipe_command?.Can anyone explain?Thanks in advance.