
Ticket #501 (assigned defect)

Opened 6 weeks ago

Last modified 5 weeks ago

Lucene search needs an optional stop words dictionary to cope with very large sites like FM

Reported by: tboutell
Owned by: tboutell
Priority: major
Milestone: 1.5
Component: apostrophePlugin
Version: 1.4
Keywords: search, zend, scalability
Cc: geoffd, dordille
Symfony version: 1.4

Description

Large sites will run out of memory conducting searches for common words due to the high memory usage of Zend Lucene core.

There's a right way to add stop words:

http://framework.zend.com/manual/en/zend.search.lucene.extending.html

You would need to get the current analyzer and add the stop word filter to it just before searching or updating the index.

The stop word dictionary could live right in app.yml or be in a separate file in data/. There could be a folder of them, named by culture code. And it shouldn't be mandatory to use a stop word dictionary.
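To make the "folder of dictionaries, named by culture code" idea concrete, here is a minimal sketch in Python rather than the plugin's PHP. The `stopwords/` folder name, the one-word-per-line file format, the `#` comment convention and the `load_stop_words` helper are all assumptions for illustration, not anything the plugin provides. A missing file returns None, so using a dictionary stays optional.

```python
from pathlib import Path
from typing import Optional, Set


def load_stop_words(data_dir: Path, culture: str) -> Optional[Set[str]]:
    """Return the stop words for a culture, or None if no dictionary exists.

    Assumed (hypothetical) layout: <data_dir>/stopwords/<culture>.txt,
    one word per line, lines starting with '#' are comments.
    """
    path = data_dir / "stopwords" / f"{culture}.txt"
    if not path.is_file():
        # No dictionary for this culture: the caller searches without one.
        return None
    words = set()
    for line in path.read_text(encoding="utf-8").splitlines():
        word = line.strip().lower()
        if word and not word.startswith("#"):
            words.add(word)
    return words
```

A caller would treat None as "no stop word filtering for this culture" and leave the analyzer untouched.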

If you don't have time right now, you could hack this by stripping stop words out of queries in an override of a/search. However, the stop words would still clutter up the index, and you'd be stripping out words that have significance in Zend search, like 'AND' and 'OR'.
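A rough sketch of the query-stripping hack, in Python for illustration only (the real override would be PHP in a/search). The `strip_stop_words` helper and the `OPERATORS` set are hypothetical, and the sketch assumes operators should be recognised case-insensitively. It keeps boolean operators even when they also appear in the stop word list.

```python
from typing import Set

# Words treated as query operators and never stripped (an assumption
# for this sketch; adjust to whatever the query parser actually honours).
OPERATORS = {"and", "or", "not"}


def strip_stop_words(query: str, stop_words: Set[str]) -> str:
    """Drop stop words from a whitespace-separated query, keeping operators."""
    kept = []
    for token in query.split():
        lowered = token.lower()
        if lowered in OPERATORS or lowered not in stop_words:
            kept.append(token)
    return " ".join(kept)
```

This is deliberately naive: it ignores quoted phrases and punctuation, and it can leave a dangling operator (for example "the AND cat" becomes "AND cat"), so it is only a stopgap compared with filtering in the analyzer.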

Change History

Changed 5 weeks ago by tboutell

  • owner changed from agilbert to tboutell
  • status changed from new to assigned
  • milestone changed from 1.4.1 to 1.5

The new app_a_search_hard_limit option greatly ameliorates this problem by limiting the number of result pages that are considered. The top 1000 (for instance) are taken as possible candidates, then filtered by whether you're allowed to look at them and so on. This is not perfect, but it is very good in practice (it's rare for almost all of the top 1000 to be locked pages you can't see).
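The hard-limit behaviour described above can be sketched roughly as follows, in Python for illustration. The `visible_results` function and the `can_view` callback are hypothetical stand-ins, not the plugin's actual code. Only the first `hard_limit` ranked hits are ever materialised, and the permission check runs on that capped set.

```python
from itertools import islice
from typing import Callable, Iterable, List, TypeVar

Hit = TypeVar("Hit")


def visible_results(
    ranked_hits: Iterable[Hit],
    can_view: Callable[[Hit], bool],
    hard_limit: int = 1000,
) -> List[Hit]:
    """Take the top `hard_limit` hits, then keep those the user may view."""
    # Cap first, so memory use is bounded regardless of total hit count.
    candidates = list(islice(ranked_hits, hard_limit))
    # Filter second; hits beyond the cap are never considered at all.
    return [hit for hit in candidates if can_view(hit)]
```

The trade-off is the one noted in the comment: if nearly all of the capped candidates are locked, viewable pages ranked below the cap are missed.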

So this is now a hopefully-in-1.5 ticket assigned to me instead of a stat fix for Alex.
