Zend_Search_Lucene: Not enterprise-ready
Posted on November 7th, 2008 in PHP, Zend Framework
Zend Framework has been attracting more and more attention from the PHP community lately, and while it lacks certain things (like code generation) that other frameworks (like Rails) have implemented to great effect, Zend Framework 2.0 is slowly taking shape and it looks like it will be the framework of choice for startups and enterprises alike. (Yes, it will even have code generation.)
But despite having several “enterprise-ready” components, I’ve found that one in particular is not: Zend_Search_Lucene, Zend Framework’s native PHP port of Apache Lucene (which is itself written in Java).
Don’t get me wrong; Zend_Search_Lucene is great for a small site or blog. However, in my extensive personal experience, it is not appropriate for a site with a medium or large index. I think this should be noted upfront in the documentation.
Against my better judgment, the company I work for migrated our previous search solution to Zend_Search_Lucene. On pretty heavy-duty hardware, indexing a million documents took several hours, and searches were relatively slow. The indexing process consumed vast amounts of memory, and the indexes frequently became corrupted (we were using version 1.5.2). A single wildcard search brought the web server to its knees, so we disabled that feature. Memory usage was also very high for searches, which forced us to reduce the number of Apache child processes, and requests per second dropped sharply as a result.
We have since moved to Solr (a Lucene-based Java search server) and the difference is dramatic. Indexing now takes around 10 minutes and searches are lightning fast. What a difference a language makes.
4 Responses
We made the same observation. Because PHP parses the source character by character, it takes up to 3 minutes for a 2 MB ASCII file, or 1000 documents. Way too slow.
[...] of the project as I do—the MVC implementation, the Lucene-derived search engine (update: not so much), the caching component. They want flashy stuff. They want Ajax. And they don’t want to work [...]
Why did I not read this 6 months ago :-(. We see the exact same things. For small indexes (up to 100k objects) things work fine. For larger indexes we get corrupted indexes once every 48 hours.
Even when everything is working it is far from fast.
We have indexed dates in the index, and when we run a Zend_Search_Lucene_Search_Query_Range the system goes into total hibernation. It only works for very small ranges, such as a span of one or at most two days.
[...] I initially thought Solr would be the ideal candidate. However, I’m still not entirely sure yet how well it can cope with large indexes. Perhaps I can optimise the index by only indexing some content and doing full hydration of nodes using MySQL based on the Solr result set (a Solr/MySQL hybrid). Additionally, I don’t really like the idea of having dependencies on non-standard services (Solr is a Java application requiring Tomcat, or similar). That’s when I found out that Zend had the same opinion. They went and built their own Lucene port, entirely in PHP. Zend_Search_Lucene isn’t as good as Solr - it doesn’t provide the number of features I’d like and it’s not scalable. [...]