Thursday 26 March 2009

Tika and Solr

This is just a quick note to document another experience with Solr.

Background: To index Word, Excel, PDF and other "unstructured" documents, Solr uses Tika, another Apache project. Tika comes bundled in Solr and is ready to run in Solr. However, if you want to run Tika individually (e.g. you don't trust your installation, or you're just curious) you have to copy a few .jar files around (Java experts who can manage class paths will probably tell me there's a better way to do this).

I did
cd [Your path]/apache-solr-nightly/lib
cp commons-io-1.4.jar commons-codec-1.3.jar [Your path]/apache-solr-nightly/example/solr/lib
cp ~/.m2/repository/org/jempbox/jempbox/0.2.0/jempbox-0.2.0.jar [Your path]/apache-solr-nightly/example/solr/lib
(I have no idea where ~/.m2 came from. It may have been when I ran the Tika build.) Then I could run
java -jar tika-0.2.jar
in that directory.
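
If you do this more than once, the copying could be wrapped in a small shell function. This is only a sketch: copy_tika_jars is a name I made up, and it knows nothing about Tika itself. It copies whichever jars you list and stops at the first one that is missing.

```shell
# Hypothetical helper: copy the listed jars into a target lib directory.
# Usage: copy_tika_jars DEST_DIR JAR...
copy_tika_jars() (
    dest=$1
    shift
    mkdir -p "$dest" || exit 1
    for jar in "$@"; do
        if [ -f "$jar" ]; then
            cp "$jar" "$dest/" || exit 1
        else
            # Report the missing jar rather than leaving a half-copied set unexplained.
            echo "missing: $jar" >&2
            exit 1
        fi
    done
)
```

You would call it with the destination directory first, followed by the three jar paths from the commands above.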

Sunday 22 March 2009

Solr and Rails

Well, after some long diversions I have Solr working in some simple test cases with Rails. The long diversion was partly caused by not understanding what was offered by the Rails Solr plug-in, so I'm going to give an overview here, and a link to detailed instructions for Solr in Rails at the end of this post.

The Rails plug-in for Solr from git://github.com/mattmatt/acts_as_solr.git includes a complete installation of Solr. You don't need to install Solr separately. (My "long diversion" is that I rushed off and installed Solr separately, and spent a fair bit of time getting it running due to my ignorance of how it worked.)

If you want to index Word, Excel, PDF, and other types of documents, there is a bit of additional configuration to do. To index those file types you have to get a nightly build of Solr from here, and copy some files and directories as described in the link at the end of this post. You have to add the following lines to example/solr/conf/solrconfig.xml:
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="ext.map.Last-Modified">last_modified</str>
    <bool name="ext.ignore.und.fl">true</bool>
  </lst>
</requestHandler>
The plug-in also includes rake tasks to start and stop instances of the Solr server for development, test and production -- very handy. Just type
rake solr:start RAILS_ENV=test 
to start the test Solr server (default environment is development). It also gives you a yaml file in your environment directory to configure the ports that each instance of Solr will use (as installed: production on 8983, test on 8981 and development on 8982).

One thing I learned on my diversion is that Solr comes with an administration user interface that shows how many documents are in the Solr database, and lets you try ad-hoc queries. It's a good way to test if Solr is actually running. For example, after running the rake task to start Solr for development, you can browse to localhost:8982/solr/admin and you should get the Solr administration page.
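
To save looking up which port goes with which environment, here is a minimal sketch that maps the environment name to the as-installed ports listed above and builds the admin URL. The function names are mine, not part of the plug-in, and the ports will be wrong if you have edited the yaml file.

```shell
# Map a Rails environment name to the as-installed Solr port
# (production 8983, test 8981, development 8982).
solr_port() {
    case "${1:-development}" in
        production)  echo 8983 ;;
        test)        echo 8981 ;;
        development) echo 8982 ;;
        *) echo "unknown environment: $1" >&2; return 1 ;;
    esac
}

# Build the admin page URL for an environment (default: development).
solr_admin_url() {
    port=$(solr_port "${1:-development}") || return 1
    echo "http://localhost:${port}/solr/admin"
}

# Example check, assuming curl is installed:
#   curl -sf "$(solr_admin_url test)" >/dev/null && echo "Solr is up"
```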

So that's the overview. The detailed write up is here. It's good. I just wish I had this overview first so I knew what I was getting and where I was going.

Friday 20 March 2009

Rsync iPhone

If you're using the Cydia installer on your iPhone, there's a package for rsync. Just open Cydia and search for "rsync". You won't find a "BSD subsystem" package, because they say Cydia comes with BSD. It might, but it doesn't come with rsync, hence the extra installation.

Thursday 19 March 2009

Creating Screencasts on Ubuntu

I'm building a web site using Drupal 6 for my son's school's parent advisory council. The idea of the site is to facilitate community participation. Since we're a volunteer group and we aren't at the same workplace every day, I thought screencasts might be a great way to help people learn how to use the site.

This post covers the technical how-to for screencasts with Ubuntu 8.10 on a Lenovo x300. When we get some feedback about whether the screencasts are helpful, I'll post about the social part of the experience.

It took me an afternoon and a morning of thrashing to get everything working. Here's what I did:
  1. Install gtk-recordmydesktop and gnome-alsamixer:
    sudo apt-get install gtk-recordmydesktop gnome-alsamixer
  2. Open Applications-> Sound & Video-> GNOME ALSA Mixer and make sure the microphone is recording and isn't muted. The controls you have available in the ALSA Mixer depend on your sound hardware, so you may have to do some research on your own to find the right settings for your sound card. At this point, you should be able to record sound and video with RecordMyDesktop (Applications-> Sound & Video-> gtk_recordMyDesktop).
  3. If you find that the sound stutters when you play back your screencast, run recordmydesktop from a terminal window:
    recordmydesktop
    If you see "Broken pipe: Overrun occurred.", the problem happens while recording the screencast, not on playback. I installed the Jack audio server and that fixed it.
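
If the terminal output scrolls by too fast to read, one option is to save it to a file and search it afterwards. This is a minimal sketch: has_overrun and the log file name are made up for illustration, and it only looks for the message text quoted above.

```shell
# Hypothetical helper: succeed if a saved recordmydesktop log contains
# the overrun message.
# Usage: has_overrun LOG_FILE
has_overrun() {
    grep -q "Overrun occurred" "$1"
}

# Example, capturing everything the recorder prints:
#   recordmydesktop 2>&1 | tee rmd.log
#   has_overrun rmd.log && echo "overruns while recording"
```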
To install and use Jack:
  1. Install (I think this is right -- I installed using the Synaptic GUI)
    sudo apt-get install jackd libjack-dev
    Note that including the development library (libjack-dev) is very important. If you don't include it, you'll get an error when you start recording with RecordMyDesktop ("dlopen/dlsym error on libjack.so").
  2. Do Applications-> Sound & Video-> JACK Control
  3. Click on Start in the JACK window
  4. Do Applications-> Sound & Video-> gtk_recordMyDesktop
  5. Click on Advanced
  6. Click on the Sound tab and make it look like this
  7. Close the window
  8. Record the screencast (Select Window if you want, then click Record)
  9. When you're done, stop Jack and Quit from the control window before you try to play back the screencast. I found that playback would freeze up if Jack was still running. This is a moderately big nuisance, but I haven't found a way to make it work with Jack open.
recordmydesktop creates a video file in .ogv (Ogg) format, which Windows Media Player and QuickTime (on Windows, at least) don't commonly have a codec installed for. However, if you upload the video to YouTube, it will be converted to a format watchable anywhere, as far as I can tell. You do seem to lose some video quality in the conversion. There are video conversion tools available for Ubuntu, but I haven't tried converting the video myself. I don't know if that would help the YouTube resolution anyway, since video tends to get worse each time it is converted.

Tuesday 17 March 2009

Drupal Set-Up and Administration Tricks

I can't claim to be the world's Drupal expert yet, but I'm learning some interesting tricks that are worth documenting and sharing.

First, my Drupal installations typically include a lot of contributed modules. They're distributed as .tar.gz files, which you upload to your server and unpack in the appropriate place. This can get tedious and error prone (e.g. I forget to unpack one).

I built a template directory with a complete Drupal install on a Linux box (you could do something similar with Windows if you had cygwin or a similar set of Linux-like tools). I unpacked the Drupal version, then went into the modules directory and unpacked all the modules I typically use. That was easy to do with a script like this:
for f in module_directory/*.gz; do tar -xzf "$f"; done
Your mileage may vary with that script, depending on where you put the tar'd modules.
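
A slightly more defensive version of the same loop might look like the following. It is a sketch under my own assumptions: unpack_modules is a made-up name, the archives are assumed to end in .tar.gz, and the destination is passed in rather than being the current directory.

```shell
# Unpack every .tar.gz archive found in SRC_DIR into DEST_DIR.
# Usage: unpack_modules SRC_DIR DEST_DIR
unpack_modules() (
    src=$1
    dest=$2
    mkdir -p "$dest" || exit 1
    for f in "$src"/*.tar.gz; do
        # If nothing matched, the pattern is left unexpanded; skip it.
        [ -e "$f" ] || continue
        # Stop at the first archive that fails, so none is silently missed.
        tar -xzf "$f" -C "$dest" || exit 1
    done
)
```

Stopping on the first failure addresses the "I forget to unpack one" problem: a bad archive is reported instead of skipped.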

This allows a couple of nice things:
  • One module I use, fckeditor, requires that you unpack the basic (non-Drupal) installation for FCKeditor in a subdirectory of the Drupal fckeditor module. With a little playing around, you can easily do this in the template directory once, and then you have the deployment set up correctly
  • I need to set a higher PHP memory limit in Drupal's "settings.php" file. I can do it once in the template directory and deploy many sites reliably (read why here and here)
  • You need to temporarily change the permission of the "settings.php" file for installation, and create the "sites/default/files" folder before navigating to the Drupal site for the first time. I create them both and make them writable in the template directories
Then, tar and gzip the resulting directory tree. There's one big trick to this step. You need to make sure you get the .htaccess file, so I include it explicitly:
tar -czf ../drupal-6.10-template.tar.gz * .htaccess
There's a reason I don't just tar the template directory. That would give me a tar file that would create a directory and expand into it. On my hosting service, it's more convenient for me to expand into the existing "public_html" directory.
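
Put together, packing and deploying could be sketched as two small functions. The names are invented for illustration, and the archive is assumed to live outside the template directory so it doesn't end up inside itself.

```shell
# Archive the contents of TEMPLATE_DIR (not the directory itself),
# naming .htaccess explicitly because * does not match dotfiles.
# Usage: pack_template TEMPLATE_DIR ARCHIVE   (ARCHIVE as an absolute path)
pack_template() (
    cd "$1" || exit 1
    tar -czf "$2" * .htaccess
)

# Expand the archive into an existing directory such as public_html.
# Usage: deploy_template ARCHIVE TARGET_DIR
deploy_template() {
    tar -xzf "$1" -C "$2"
}
```

Because the archive holds the files rather than a wrapping directory, index.php and .htaccess land directly in the target directory.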

The other "trick" isn't so much a trick yet, as it is thoughts on running several Drupal sites from one installation. It seems to be a Drupal pattern to create "multisite" configurations where you have sub-domains running from the same Drupal source code (PHP files), but with different databases, and different configurations and themes.

This sounds good, and you would only have to update the files once for all your sites. However, I think it also causes some problems:
  • The technique for Drupal multisites that I found gives you a circular symbolic link in the sites directory hierarchy. I believe this causes problems when I try to copy the site prior to upgrading it (I copy to have a backup)
  • The recommended Drupal upgrade procedure requires that you take the site offline, disable all modules, and then re-enable all modules after you upgrade. That means all your sites are offline for the entire upgrade, rather than being able to do each site quickly
With the template approach, the pain of uploading the files is gone. The rest of the work you have to do for each site anyway, so it becomes almost the same amount of work to just run each site as a completely independent Drupal installation.

I'm going to try my template based upgrade and will post about the experience then.

Thursday 12 March 2009

NetBeans, Gems, Rails, and Permissions

I've gone from being a shell/make/rcs guy to quite liking IDEs, or at least useful IDEs. I find NetBeans to be a pretty nice, light-weight (in the good sense) IDE, but it has some issues on Ubuntu and other properly secured OSs. Here's how I've got it to work. This applies to NetBeans 6.1 and 6.5, I believe.

First, you have to set up your Ruby platforms so they keep their gems in writable directories. Go to Tools-> Ruby Platforms. On NetBeans 6.5 (at least), the jRuby gems are in a writable, per user path by default. If you click on the "Autodetect Platforms" button and get the native Ruby platform, change the "Gem Home:" and "Gem Path:" directories to somewhere writable, like /home/reid/ruby/gems/1.8.

While you're here, make sure the version of /usr/bin/gem is 1.3.1 or higher. If it isn't, I think you have to upgrade from a shell. I did that upgrade a while ago, so I don't remember how to do it, but you can find out easily through Google. (Ubuntu users may want to look here.)
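
If you'd rather check the version from a shell than by eye, here is a minimal sketch of a dotted-version comparison. version_at_least is my own helper, not something gem provides, and it only understands purely numeric components such as 1.3.1.

```shell
# Succeed if dotted version $1 is at least dotted version $2.
# Missing components count as 0, so 1.3 is treated as 1.3.0.
version_at_least() (
    have=$1
    want=$2
    while [ -n "$have" ] || [ -n "$want" ]; do
        # Take the leading component of each version.
        h=${have%%.*}; w=${want%%.*}
        h=${h:-0}; w=${w:-0}
        [ "$h" -gt "$w" ] && exit 0
        [ "$h" -lt "$w" ] && exit 1
        # Drop the component just compared.
        case $have in *.*) have=${have#*.} ;; *) have= ;; esac
        case $want in *.*) want=${want#*.} ;; *) want= ;; esac
    done
    exit 0
)

# Example:
#   version_at_least "$(gem --version)" 1.3.1 || echo "upgrade gem first"
```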

It should all look like this:

Now go re-install all the gems you need through Tools-> Ruby Gems.

At this point, you still may not be able to install plugins. You'll get the following message: "Missing the Rails 2.2.2 gem" (or whatever version NetBeans installed for you). Rake from within NetBeans seems to look at the system rails executable, and not the one installed through NetBeans' own gem installer. But the environment.rb generated for a new application does use the version of Rails installed by NetBeans. What I did (yuck, because there's some redundancy here) was manually install the appropriate Rails version:
sudo gem install rails
I'm sure there's a better way, but I can't think of it right now and I really want to write some Rails code instead of fighting with NetBeans.

Cheap Hosting Part II

In an earlier post I described how I was running out of memory in PHP using a moderate set of Drupal contributed modules.

It turns out I was able to increase the PHP memory on my HostPapa hosted site. The method is to add the following line to Drupal's "settings.php" file in each site's directory under "sites":
ini_set('memory_limit', '48M');
Or if the line already exists, make sure the amount is '48M'.
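
With several sites to touch, the edit could be scripted. This is a rough sketch, not a tested deployment tool: set_memory_limit is a hypothetical helper, it assumes settings.php is writable and has no closing "?>" tag, and it only recognizes the single-quoted form of the line shown above.

```shell
# Hypothetical helper: make sure FILE sets PHP's memory_limit to LIMIT.
# Rewrites an existing ini_set('memory_limit', ...) line, or appends one.
# Usage: set_memory_limit FILE LIMIT
set_memory_limit() (
    file=$1
    limit=$2
    line="ini_set('memory_limit', '$limit');"
    if grep -q "ini_set('memory_limit'" "$file"; then
        tmp=$(mktemp) || exit 1
        # Replace the whole matching line, then write back in place so the
        # file keeps its existing permissions.
        sed "s/^.*ini_set('memory_limit'.*\$/$line/" "$file" > "$tmp" &&
            cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        echo "$line" >> "$file"
    fi
)
```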

One of the problems I think I had earlier is not changing the right settings.php file. I have multiple sites in multiple sub-domains, because the main purpose of this host is to do proofs-of-concept for clients. I probably thought changing the settings.php file didn't work because I changed the wrong one.

Suppose you installed Drupal under public_html. You have to change settings.php in public_html/sites/default to change the memory limit on your main domain (for example, "example.com"). For a sub-domain "test.example.com", you have to change settings.php in public_html/sites/test.example.com. And so on for any other sub-domains you have.
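
To avoid editing the wrong file again, the rule above can be written down as a tiny function. It is a simplified sketch of the layout just described, not Drupal's full site lookup, and settings_path is a name I invented.

```shell
# Print the settings.php path for HOST under DOCROOT:
# sites/HOST if that directory exists, otherwise sites/default.
# Usage: settings_path DOCROOT HOST
settings_path() {
    if [ -d "$1/sites/$2" ]; then
        echo "$1/sites/$2/settings.php"
    else
        echo "$1/sites/default/settings.php"
    fi
}
```

For example, with a sites/test.example.com directory present, settings_path public_html test.example.com prints public_html/sites/test.example.com/settings.php, while the main domain falls back to sites/default.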

Tuesday 3 March 2009

Securing Healthcare Data with MySQL

As a follow-up to an earlier post, I should mention that part of the reason I had healthcare data on my personal laptop was to do some data analysis with MySQL. Between MySQL and the command line tools, it was very easy for me to load data from other sources and run queries to monitor or predict the amount of medication we were packaging.

When I was done with the data analysis, I wanted to scrub the data off my hard drive. On the version of MySQL installed via Synaptic on Ubuntu 8.10, the default storage engine was MyISAM, which keeps each table in its own files. When a table is dropped, MySQL deletes that table's MyISAM files. There is no need to worry about deleted records lingering in a shared "tablespace" file, as one might have to in other RDBMSs.

Then all I had to do was scrub the hard drive as I described in my earlier post.

Cheap Hosting and PHP Memory

(For an update, read this post.)

I've been working on a few websites: one for my son's school's parent advisory council (PAC) and one for some community health centres. It gives me a chance to get to know what's out there in the open source world for content management, wikis, and other collaboration tools that we're supposed to be using to make ourselves more effective and productive.

The first step was to find a hosting provider that let me run PHP, as the free hosting I get from my ISP is for static pages only. I was pleasantly surprised to find very cheap hosting with PHP, MySQL and everything I needed, many for less than C$10 per month. I chose HostPapa -- despite the somewhat odd name, they seemed to have a good reputation.

There's a catch, it turns out. I'm using Drupal 6, a well-established and widely used content management system. When I build a site with a reasonable set of contributed modules (calendars, translations, FAQs), it needs more than 32MB of memory in PHP (I get a white page with a message saying "memory exhausted"). And I'd like to add a number of other modules.

My hosting provider doesn't let me change the amount of PHP memory, so the whole hosting arrangement is mostly useless. (There are good instructions for how to increase the PHP memory in Drupal here, but not surprisingly the cheap providers don't allow you to take too much memory.)

I also originally considered using Joomla!. Once I ran into the problem with Drupal I did some searching of the Internet and found that it seems to be an issue with Joomla! as well. Neither runs comfortably in 32MB if you have a reasonable selection of add-on modules. (Instructions to increase PHP memory exist with Joomla! with the same caveat that the provider can prevent you from doing it.)

It's rather unfortunate. My idea was to have this space to be able to do proofs of concept or show potential clients what could be done with modern collaboration tools. It would be likely that any serious customer would be able to host their own server, or at least pay for a slightly more expensive hosting provider. I don't need a high volume website, but I do need a space where I can show the capabilities of the tools. I hate to have to pay more money on pure speculation that this will lead to work, but it looks like that's the case.

I don't know whether to be annoyed with the content management systems for being memory hogs, or the hosting providers for setting unreasonably low limits on accounts on which they should fully expect people to use a modern content management system. By the way, I'm not really annoyed at either. I can see their point of view: HostPapa is really cheap and they need to control costs somehow, and serious Drupal users will find the difference of $10 per month for better hosting to be inconsequential, or will be paying for a dedicated server anyway because they get so much traffic.

The obvious lesson: One of the big things to look at when evaluating hosting services is how much PHP memory they give you. It's easy to offer unlimited database space, unlimited sub-domains, and other such goodies when you know that no one can use serious website management tools anyway.