Brian Beck just announced that he’s beginning work on Merquery, a full-text indexer and search engine specifically designed for developers using RAD frameworks like Django.
I’m so excited about this I can barely contain myself. Right now Ellington ships with a search engine built on top of Swish-E. It’s pretty cool, and I’ve been debating cleaning it up and rolling it into django.contrib. However, it has a number of major flaws that limit its usefulness:
- Swish-E’s Python bindings don’t have any way to return results ordered by date (vital when you’re searching for news stories), so we use a patched version of the bindings. This makes installation super annoying.
- Swish-E doesn’t have any sort of incremental indexing, so we have to re-index the entire contents of the database every time. We’ve got nearly half a million stories in the database, so indexing takes over two hours, meaning we can really only do it once a day. Thus, breaking news stories aren’t in the index. Argh.
- The search query syntax is braindead and buggy. By braindead I mean there’s only “A or B” type searches: no phrase searches or cool operators. As for buggy… let’s just say there are certain groups of characters you can search for which will throw the search engine into an infinite loop that consumes nearly 100% CPU. We filter them out at the view level, but sheesh.
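For readers wondering what "incremental indexing" buys you, here is a minimal sketch in Python. It is a toy in-memory inverted index, not Swish-E's or Merquery's design; the class name `IncrementalIndex` and its methods are hypothetical and exist only for illustration. The point is that adding or changing one story touches only that story's terms instead of rebuilding the whole index.

```python
import re
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index (hypothetical) that updates one document at a time."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.doc_terms = {}               # doc id -> terms currently indexed for it

    @staticmethod
    def tokenize(text):
        # Naive tokenizer: lowercase runs of word characters.
        return set(re.findall(r"\w+", text.lower()))

    def remove(self, doc_id):
        # Drop only this document's postings; everything else is untouched.
        for term in self.doc_terms.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def add(self, doc_id, text):
        # Re-adding an existing id replaces its old terms (an "update").
        self.remove(doc_id)
        terms = self.tokenize(text)
        self.doc_terms[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def search(self, query):
        # AND search: return ids of documents containing every query term.
        terms = self.tokenize(query)
        if not terms:
            return set()
        return set.intersection(*(self.postings.get(t, set()) for t in terms))
```

With something shaped like this, a breaking story could be indexed by calling `add(story_id, story_text)` when it is saved, rather than waiting for a nightly full rebuild. A real engine would also need persistence, ranking, and date ordering, none of which this sketch attempts.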
So, yeah, I’m super-excited about a modern, pure-Python search engine and indexer.
Brian, if you’re “listening”, I’d be thrilled to help you out with this project in any way I can.
Comments:
This makes me think: what about tsearch2 for PostgreSQL? On my last site I used a MySQL fulltext index with good results, and I was planning on using tsearch2 for my next 3 projects in Django.
So even with Swish-E and all the headaches, you prefer that instead of tsearch2... is there any limitation to using it?
Anyway, Merquery seems a promising solution, if it meets Brian's goals... tsearch2 looks very complicated, at least for me :p
Yeah, tsearch2 is quite cool. I've used it -- and MySQL FULLTEXT -- a number of times. However, the coupling to a specific database engine makes them unsuitable for Django at large.
The reason that Ellington's search engine doesn't use tsearch2 actually has nothing to do with that, though -- Swish-E lets us be pretty specific about what fields and what objects get indexed, and is also AMAZINGLY fast.
The idea of a super-simple drop-in indexer for dynamic web apps makes me drool... :)
"The idea of a super-simple drop-in indexer for dynamic web apps makes me drool... :)"
I understand you very well! heh
What's the deal with 'Lucene'? I was planning to use Tsearch2 for my Django web app. I can't imagine 'Merquery' would fit my current application, but it could be extremely useful for smaller apps.
By the way, any word on Django 9.2's date?
Full-text indexing is complicated for a reason. Don't oversimplify it by ignoring the international audience:
* unicode - lots of characters are equivalent
* some character systems don't even have word separators, like Chinese, Japanese, and Korean (you can get away with bigram tokenizers, but it's really just a hack)
* stop words are different in different languages
* how do you detect the language anyway?
The world has suffered long enough from general i18n ignorance. I'd say take a solid system (like Lucene) and focus on making it easy to integrate with. Don't reinvent the wheel http://stabell.org/2006/03/...
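Two of the points in the comment above (equivalent Unicode characters, and bigram tokenizing for scripts without word separators) can be illustrated with a short Python sketch. The function names `normalize` and `bigrams` are hypothetical, and NFKC plus case folding is just one reasonable choice of normalization, not what Lucene or any other engine necessarily does.

```python
import unicodedata


def normalize(text):
    """Fold compatibility-equivalent characters (NFKC), then fold case."""
    return unicodedata.normalize("NFKC", text).casefold()


def bigrams(text):
    """Overlapping character bigrams, for text with no word separators."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # Too short to form a pair: return the single character, if any.
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]
```

For example, `normalize` maps both the precomposed "é" and "e" followed by a combining accent to the same string, and `bigrams("東京都")` yields `["東京", "京都"]`. As the comment says, bigrams are a hack: they index every adjacent pair whether or not it is a real word.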
Mike: Lucene is a stand-alone indexer/search engine (unlike tsearch2 which is a PostgreSQL addon).
Bjorn: good points... However, the "build versus buy" decision (or in open source "build versus adapt") is a lot more complicated than "don't reinvent the wheel." If this particular wheel happens to be octagonal because the original builder thought it was "close enough", it might be easier to start from scratch than to break out the chisels, if you get my drift.
Fact is that Lucene might handle i18n very well, and it might handle stop words excellently, but if I can't figure out how to use the damn thing that does me no good.
Jacob: Although more fun to build your own, wouldn't it be more valuable to the FOSS world in general to focus on making Lucene easier to install, then? Assuming the goal is to have a strong open source full-text search engine, not 10 with different weaknesses.
... or 10 with different strengths.
So you're saying I should instead concentrate on making Linux better instead of using a Mac, right?
Or that I should switch my database to MySQL and concentrate on making that better instead of using PostgreSQL?
Or maybe that I should stop using a computer altogether and instead concentrate on refining the typewriter?
OK, seriously for a moment: the awesome, incredible, amazing, wonderful thing about being a programmer is that I get to choose what tools I want to use! Why, oh why should I compromise on quality when I don't have to?
Also, I don't think my goal is to have "a strong open source full-text search engine"; it's to have one that works the way I want it to work. I highly doubt the Lucene folks want me barging in and telling them how to "fix" their product -- it obviously works very well for them the way it is.
The point is that different people have different needs. The myth is that search (or languages, or web frameworks, or...) is simple and thus we don't need many competitive products. We do, because when I say "search" I could mean something considerably different than what you mean when you say it.
i was trying to find a blosxom plugin. what is this? i'm confused.
Lupy (http://divmod.org/trac/wiki...) was a pure Python implementation of Lucene, so it read and wrote Lucene binary files.
Getting that up and running again would be brilliant - you could then upgrade from Lupy to PyLucene whilst retaining the same indexes.
I don't know if you know about these?...
http://hyperestraier.source...
http://hype.python-hosting....
Matt here at Pollenation has used it/them (he wrote the bindings along with Donovan Preston, I believe) and said some nice things afterwards.. :-)
I wrote something about text indexing with Django here: http://spacedman.blogspot.com/
Not sure if my ideas are along the right lines or way off beam...
I remember finding one of those infinite loop bugs... Remember that, Jacob...
Today I was searching for a full-text search in the Python arena and I discovered this proposal:
http://www.cps-project.org/...
Has anyone already tried it with Django?
what is this site about?