Determining Linux distribution

words by Brian Racer

I work with a wide variety of client server deployments, and sometimes it isn’t obvious (via uname) what distribution and version a server is running. Here is a quick list of common files that contain that information:

Debian          /etc/debian_release, /etc/debian_version
Fedora          /etc/fedora-release
Gentoo          /etc/gentoo-release
Mandrake        /etc/mandrake-release
Novell SUSE     /etc/SUSE-release
Red Hat         /etc/redhat-release, /etc/redhat_version
Slackware       /etc/slackware-release, /etc/slackware-version
Solaris/Sparc   /etc/release
Sun JDS         /etc/sun-release
Ubuntu          /etc/lsb-release
UnitedLinux     /etc/UnitedLinux-release
Yellow dog      /etc/yellowdog-release
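
The list above can be turned into a quick check. Here is a minimal sketch, assuming a POSIX shell. The function name print_release_files and its optional root-directory argument are made up for illustration, and the file names are simply the ones from the table:

```shell
#!/bin/sh
# print_release_files [ROOT]
# Prints every release file from the table above that exists under ROOT/etc.
# ROOT is a hypothetical convenience argument (defaults to the live system).
print_release_files() {
  root="${1:-}"
  for name in debian_release debian_version fedora-release gentoo-release \
              mandrake-release SUSE-release redhat-release redhat_version \
              slackware-release slackware-version release sun-release \
              lsb-release UnitedLinux-release yellowdog-release; do
    if [ -f "$root/etc/$name" ]; then
      printf '%s\n' "/etc/$name"
    fi
  done
}

# Scan the machine we are logged in to.
print_release_files
```

Once a file turns up, cat it to see the distribution name and version.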

Sphinx xmlpipe2 in PHP: Part II

words by Brian Racer

In the last article we successfully created a PHP class that outputs XML as input for Sphinx’s indexer. However, it was incredibly inefficient because we had to hold everything in memory. Here is an updated class that extends XMLWriter, a built-in PHP class that is essentially undocumented and works great for creating memory-efficient streams of XML data. Rather than keeping each document in memory, XMLWriter allows us to immediately flush that document’s XML elements to standard output.

<?php
/*
  *  SphinxXMLFeed - efficiently generate XML for Sphinx's xmlpipe2 data adapter
  *  (c) 2009 Jetpack LLC http://jetpackweb.com
  */
class SphinxXMLFeed extends XMLWriter
{
  private $fields = array();
  private $attributes = array();
 
  public function __construct($options = array())
  {
    $defaults = array(
      'indent' => false,
    );
    $options = array_merge($defaults, $options);
 
    // Store the xml tree in memory
    $this->openMemory();
 
    if($options['indent']) {
      $this->setIndent(true);
    }
  }
 
  public function setFields($fields) {
    $this->fields = $fields;
  }
 
  public function setAttributes($attributes) {
    $this->attributes = $attributes;
  }
 
  public function addDocument($doc) {
    $this->startElement('sphinx:document');
    $this->writeAttribute('id', $doc['id']);
 
    foreach($doc as $key => $value) {
      // Skip the id key since that is an element attribute
      if($key == 'id') continue;
 
      $this->startElement($key);
      $this->text($value);
      $this->endElement();
    }
 
    $this->endElement();
    print $this->outputMemory();
  }
 
  public function beginOutput() {
 
    $this->startDocument('1.0', 'UTF-8');
    $this->startElement('sphinx:docset');
    $this->startElement('sphinx:schema');
 
    // add fields to the schema
    foreach($this->fields as $field) {
      $this->startElement('sphinx:field');
      $this->writeAttribute('name', $field);
      $this->endElement();
    }
 
    // add attributes to the schema
    foreach($this->attributes as $attributes) {
      $this->startElement('sphinx:attr');
      foreach($attributes as $key => $value) {
        $this->writeAttribute($key, $value);
      }
      $this->endElement();
    }
 
    // end sphinx:schema
    $this->endElement();
    print $this->outputMemory();
  }
 
  public function endOutput()
  {
    // end sphinx:docset
    $this->endElement();
    print $this->outputMemory();
  }
}

We can use it as follows:

$doc = new SphinxXMLFeed();
 
$doc->setFields(array(
  'title',
  'teaser',
  'content',
));
 
$doc->setAttributes(array(
  array('name' => 'blog_id', 'type' => 'int', 'bits' => '16', 'default' => '0'),
));
 
$doc->beginOutput();
 
foreach(range(1, 1000) as $id) {
  $doc->addDocument(array(
    'id' => $id,
    'blog_id' => rand(1, 10),
    'title' => "Article Part {$id}",
    'teaser' => "Article {$id} teaser",
    'content' => "Article {$id} content",
  ));
}
 
$doc->endOutput();

As you can see, the first thing we need to do is populate the fields and attributes. Once that is done, we call beginOutput, which writes the head of the XML document. After each document is added, its XML markup is immediately output and the memory buffer is cleared.

Finally we call endOutput, which will close the sphinx:docset element.

I have used this class in production to index millions of records that take up dozens of gigabytes. Keep in mind that if you are working with that much data, you will probably need to batch your queries so you are not loading all the records at once!


Sphinx xmlpipe2 in PHP: Part I

words by Brian Racer

Sphinx is a great open source package for implementing full-text search. Before we can use it to search, we first must load all of our data into it. There are two primary ways of loading that data: directly accessing the data via a SQL query, or using the xmlpipe2 format. Although using the database as a direct data source is very fast, it can sometimes be difficult to craft a query that will contain normalized data for all the fields you require in an index. The XML option gives us much more flexibility at the cost of speed (although it is still very fast). This article shows you how to generate that XML. It assumes you have a basic understanding of how Sphinx works; if not, browse the docs first.

An example xmlpipe2 format looks like this:

<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
 
<sphinx:schema>
  <sphinx:field name="subject"/> 
  <sphinx:field name="content"/>
  <sphinx:attr name="published" type="timestamp"/>
  <sphinx:attr name="author_id" type="int" bits="16" default="1"/>
</sphinx:schema>
 
<sphinx:document id="1234">
  <content>this is the main content <![CDATA[[and this <cdata> entry must be handled properly by xml parser lib]]></content>
  <published>1012325463</published>
  <subject>note how field/attr tags can be in <b class="red">randomized</b> order</subject>
  <misc>some undeclared element</misc>
</sphinx:document>
<!-- ... more documents here ... -->
</sphinx:docset>

First we define the schema, which contains fields and attributes. Fields will be processed for fulltext searches, and attributes will be used to help filter those search results. More information about attributes and their options can be found in the docs. Once the schema is defined, we start adding our document data. A document contains elements that will map to the previously defined fields and attributes.

Let’s try to encapsulate some of that logic in a PHP class:

<?php
 
class SphinxXMLFeed
{
  private $fields = array();
  private $attributes = array();
  private $documents = array();
 
  public function setFields($fields) {
    $this->fields = $fields;
  }
 
  public function setAttributes($attributes) {
    $this->attributes = $attributes;
  }
 
  public function addDocument($doc) {
    $this->documents[] = $doc;
  }
 
  public function render() {
 
    // create a new XML document
    $dom = new DomDocument('1.0');
    $dom->encoding = "utf-8";
    $dom->formatOutput = true;
 
    // create root node
    $root = $dom->createElement('sphinx:docset');
    $root = $dom->appendChild($root);
 
    // create the schema
    $schema = $dom->createElement('sphinx:schema');
 
    // common fields we will be cloning
    $tmp_field = $dom->createElement('sphinx:field');
    $tmp_attr  = $dom->createElement('sphinx:attr');
 
    // add fields to the schema
    foreach($this->fields as $field) {
      $new_field = clone($tmp_field);
      $new_field->setAttribute('name', $field);
      $schema->appendChild($new_field);
    }
 
    // add attributes to the schema
    foreach($this->attributes as $attributes) {
      $new_attr = clone($tmp_attr);
      foreach($attributes as $key => $value) {
        $new_attr->setAttribute($key, $value);
      }
      // append once, after all of the element's attributes are set
      $schema->appendChild($new_attr);
    }
 
    // add the schema to the document
    $root->appendChild($schema);
 
    // go through each document
    foreach($this->documents as $doc) {
      $node = $dom->createElement('sphinx:document');
      $node->setAttribute('id', $doc['id']);
 
      foreach($doc as $key => $value) {
        if($key == 'id') continue;
        $tmp = $dom->createElement($key);
        $tmp->appendChild($dom->createTextNode($value));
 
        $node->appendChild($tmp);
      }
 
      // add the document to the dom
      $root->appendChild($node);
    }
 
    // return xml text
    return $dom->saveXML();
  }
}

The previous code uses PHP’s DomDocument interface because that is less error prone than manually echo’ing out XML tags. One downside of using DomDocument is we must build the entire XML tree before we can output it. This means we must keep each document in memory, so if you are indexing a large amount of data you will probably hit PHP’s memory limit. We will fix this in the next article. For now, you can use this class as follows:

// instantiate the class
$doc = new SphinxXMLFeed();
 
// set the fields we will be indexing
$doc->setFields(array(
  'title',
  'teaser',
  'content',
));
 
// set any attributes
$doc->setAttributes(array(
  array('name' => 'blog_id', 'type' => 'int', 'bits' => '16', 'default' => '0'),
));
 
// generate some random documents. These would usually be pulled from a database
// or other data source
foreach(range(1, 3) as $id) {
  $doc->addDocument(array(
    'id' => $id,
    'blog_id' => rand(1, 10),
    'title' => "Article Part {$id}",
    'teaser' => "Article {$id} teaster",
    'content' => "Article {$id} content",
  ));
}
 
// Render the XML and print it so that Sphinx can read it from standard output
print $doc->render();

That code will generate the following XML:

<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
  <sphinx:schema>
    <sphinx:field name="title"/>
    <sphinx:field name="teaser"/>
    <sphinx:field name="content"/>
    <sphinx:attr name="blog_id" type="int" bits="16" default="0"/>
  </sphinx:schema>
  <sphinx:document id="1">
    <blog_id>6</blog_id>
    <title>Article Part 1</title>
    <teaser>Article 1 teaster</teaser>
    <content>Article 1 content</content>
  </sphinx:document>
  ...
</sphinx:docset>

You would set up your data source in sphinx.conf something like this:

source xml_blog_posts
{
    type = xmlpipe2
    xmlpipe_command = /usr/bin/php /home/example.com/lib/tasks/sphinx_blogs.php
}
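
Before pointing the indexer at a new feed, it can be worth running the command by hand to make sure it really emits something that looks like xmlpipe2. Here is a rough sketch of such a smoke test. check_xmlpipe is a hypothetical helper name, the checks are deliberately crude (this is not real XML validation), and it buffers the whole output, so only try it on a small test run:

```shell
#!/bin/sh
# check_xmlpipe COMMAND [ARGS...]
# Hypothetical helper: runs the feed command and checks, very roughly,
# that its output looks like an xmlpipe2 stream.
check_xmlpipe() {
  out=$("$@") || { echo "feed command failed" >&2; return 1; }

  # The stream should open with an XML declaration...
  case "$out" in
    '<?xml'*) ;;
    *) echo "missing XML declaration" >&2; return 1 ;;
  esac

  # ...and contain an opening and a closing sphinx:docset tag.
  for tag in '<sphinx:docset' '</sphinx:docset>'; do
    if ! printf '%s\n' "$out" | grep -qF -- "$tag"; then
      echo "missing $tag" >&2
      return 1
    fi
  done

  echo "looks like xmlpipe2"
}
```

With the paths from the config above, the call would be: check_xmlpipe /usr/bin/php /home/example.com/lib/tasks/sphinx_blogs.php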

Don’t forget to check out the next article, where we optimize this class to handle millions of records!

Continue to next article: Sphinx xmlpipe2 in PHP: Part II


Office 2007 docx, pptx, and xlsx saving as zip files

words by Brian Racer

Recently I had an issue where various browsers on Windows desktops were saving docx, xlsx, and pptx files as zip files when downloaded from our Linux web servers (Apache in this case). The solution was to add an extra MIME type to /etc/mime.types:

echo 'application/vnd.openxmlformats       docx pptx xlsx' >> /etc/mime.types

And then reload Apache:

sudo /etc/init.d/apache2 reload

This also requires mod_mime to be loaded, which it is by default on Debian-based systems. To verify the location of the mime.types file your server is using, the following commands may be helpful (replace the httpd.conf or mods conf directory with your distribution’s location):

grep -n mime.types /etc/apache2/mods-available/*
grep -n mime.types /etc/apache2/httpd.conf

Linux desktops with Open Office had no such problems 🙂


How to make ActionMailer deliver all mail locally

words by Brian Racer

This is similar to my article on forcing PHP to deliver all mail locally, except this will focus on Action Mailer.

When doing local development we generally don’t want our test and development servers sending out mail to the world. And it would be ideal to be able to review the emails our application sends out before deploying the changes to the world. An easy way to achieve this functionality is to create a custom script that rewrites the outgoing message, and then passes that on to the local MTA such as sendmail or exim.

On my Ubuntu development machine I prefer to use exim over sendmail, and mutt to review the emails:

sudo apt-get install exim4 mutt

Next create the following script that will rewrite any messages generated by ActionMailer to the local user of your choice:

vi /usr/local/bin/trapmail
#!/bin/sh
formail -R cc X-original-cc \
-R to X-original-to \
-R bcc X-original-bcc \
-f -A"To: [email protected]" \
| /usr/sbin/sendmail -t -i

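Replace [email protected] with your local username or an external email address, and make the script executable (chmod +x /usr/local/bin/trapmail). Note that formail ships with the procmail package, so install that too if it is missing.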

Now set up your development environment configuration file:

config/environments/development.rb

ActionMailer::Base.delivery_method = :sendmail
ActionMailer::Base.sendmail_settings = {
  :location       => '/usr/local/bin/trapmail',
}

Now all messages will be redirected to the local user.


Ubuntu Tip: Force new windows to start centered on the desktop

words by Brian Racer

I use a pretty generic Gnome + Compiz desktop setup in Ubuntu, but one thing that really irks me is that my applications always seem to start snapped to a corner. What I really want is for them to open centered on my desktop. You can achieve this by doing a little registry modification (I’m pretty sure there is a nice GUI app to adjust these settings, but I don’t believe it is installed by default).

Press Alt+F2 and enter gconf-editor. This will open up Gnome’s registry editor.

Set the following two values:
Key: /apps/metacity/general/focus_new_windows Value: smart
Key: /apps/compiz/plugins/place/screen0/options/mode Value: 1

Now your applications should start up nice and centered 🙂
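
The same two values can also be set from a terminal with gconftool-2, the command-line counterpart to the registry editor. This is only a sketch: the function name center_new_windows is made up, and I am assuming the first key is stored as a string and the second as an int.

```shell
#!/bin/sh
# center_new_windows
# Sets the two keys from this post with gconftool-2 instead of the GUI editor.
# Assumption: the first key is a string and the second an int.
center_new_windows() {
  gconftool-2 --type string --set /apps/metacity/general/focus_new_windows smart &&
  gconftool-2 --type int --set /apps/compiz/plugins/place/screen0/options/mode 1
}

if command -v gconftool-2 >/dev/null 2>&1; then
  center_new_windows
else
  echo "gconftool-2 not found; use the registry editor instead" >&2
fi
```

To read a value back, use gconftool-2 --get followed by the key.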


Using a sane user-agent for Ubuntu’s Firefox 3.5 – Shiretoko

words by Brian Racer

Ubuntu’s current release version of Firefox 3.5 is named Shiretoko and sends a user-agent of Shiretoko/3.5 rather than Firefox/3.5. This broke a number of sites I use that rely on browser sniffing such as Facebook Chat and DailyMotion. There are two ways to adjust this behavior:

1) Type about:config in the address bar. Search for ‘general.useragent.extra.firefox’. Double-click “Shiretoko/3.5” and replace it with “Firefox/3.5”.

2) Use the User Agent Switcher plugin. I prefer this option as it also lets me set IE user agents so I can use a few sites that think they require IE, and also set iPhone headers for development.