Introduction
============
While the catalog tool in Zope is immensely useful, we have seen some slowdowns
in large Plone sites with a combination of additional indexes and lots of
content.
The catalog implementation is using BTree set operations like union, multiunion
and intersection. Those operations are fairly fast, especially when everything
is in memory. However, the catalog implementation is rather naive which leads
to lots of set operations on rather big sets.
Query plan
==========
Search engines and databases uses query optimizers to select query plans that
will minimize the result set as early as possible, because working with large
amounts of data is time consuming.
What we want to do is to search against the indexes giving the smallest result
set first. However, for that to be useful, we need to pass that result along
into the indexes to allow the indexes to limit the result set as soon as
possible internally. When calculating a path search, there is no need to look
in all 150000 results if the portal type index has already limited the possible
result to 10000. If we have already limited the result to 10000 results, all
set operations are going to be significantly faster.
We identify different searches by the list of indexes that are searched. If
there are no query plans for a set of indexes, the query is run like normal
while storing the number of results for each index. When all indexes have been
checked, the list is sorted on number of results and stored as a query
plan. Next time a search on the same indexes comes in, the query plan is
looked up.
To get different query plans for similar queries, you can provide additional
bogus index names. They will be ignored by the catalog, but will become part of
the key. For indexes that have only a small number of distinct values the
query value will become part of the key as well. These type of indexes often
have an uneven distribution of indexed keys to values. For example there might
be very few `pending` documents in a site, but many `published` ones.
Testing
=======
To test, import the monkey patch in other tests, like CMFPlone::
import experimental.catalogqueryplan
and run the test.
Development
===========
Development of this project takes place at:
https://github.com/Jarn/experimental.catalogqueryplan
Changelog
=========
3.2.8 - 2013-01-07
------------------
- Port ExtendedPathIndex fix from ExtendedPathIndex 3.0.1.
[mike.rhodes, hannosch]
3.2.7 - 2011-08-23
------------------
- Backport c122666 from ZCatalog, fixing batching restriction in early part
of second half of the batch.
[davisagli, hannosch]
3.2.6 - 2011-08-21
------------------
- Backport from e.btree: Update to Cython 0.15.
[hannosch]
- Backport from e.btree: Correct small/big assignment if only the first
argument is a tree set.
[hannosch]
- Backport from e.btree: Avoid intersection optimizations if both arguments
are non-tree sets.
[hannosch]
3.2.5 - 2011-05-27
------------------
- Backport c50071 from Products.CMFPlone to fix batch handling.
[hannosch]
- Backport c121708 from ZCatalog, fixing the addition of two LazyCat's if any
of them had already been flattened.
[hannosch]
3.2.4 - 2011-04-27
------------------
- Fix possible TypeError in `sortResults` method if only b_start but not b_size
has been provided.
[hannosch]
3.2.3 - 2011-04-10
------------------
- Specify supported Python versions.
[hannosch]
3.2.2 - 2011-04-09
------------------
- Backport c121349 from ZCatalog, optimizing the date range index to add a
floor and ceiling date. In this version it's hardcoded values.
[hannosch]
- Backport c121191 from ZCatalog, which fixes an edge-case in the date range
index optimization.
[hannosch]
3.2.1 - 2011-03-16
------------------
- Specify minimum requirement for the 3.2.x series of Plone >= 4.0.3.
[hannosch]
3.2.0 - 2011-03-08
------------------
* Patched `PloneBatch.__getitem__` to work with new limited batch results.
[hannosch]
* Backported sort/batch improvements from ZCatalog 2.13.4.
[hannosch]
* Update to Cython 0.14.1.
[hannosch]
* Avoid using the relative import syntax introduced in Python 2.5.
[hannosch]
3.1.0 - 2010-12-27
------------------
* Added automatic sorting limit calculation based on batch arguments. If the
query contains a `b_start` and `b_size` argument and no explicit `sort_limit`
is provided, the sort limit will be calculated as `b_start + b_size`.
[hannosch]
* Backported `Products.ZCatalog.Lazy` improvements from its 2.13.2 release.
* Update to Cython 0.14.
[hannosch]
3.0.2 - 2010-09-28
------------------
* Subtract inverse set in daterangeindex as that is usually significantly
smaller than the set of content within range (like effectiveRange).
[tesdal]
3.0.1 - 2010-09-24
------------------
* Update to Cython 0.13.
[hannosch]
* Make sure to patch the intersection function in our own catalog module.
Depending on import time order effects we could end up with the standard
intersection function which performs horribly for the reference catalog.
[hannosch]
3.0 - 2010-05-13
----------------
* Fixed tests to work with latest Zope 2.12 release.
[hannosch]
3.0a3 - 2010-03-08
------------------
* Extended the stored queryplan format to optionally contain the value indexes
set. This also makes it possible to manually influence the set.
[hannosch]
* Changed intersection algorithm to avoid calculating the length of tree sets,
as this would cause a scan of all their buckets. Intersections of sets with
large tree sets are 10x faster while intersections of small sets and small
tree sets are 50% slower.
[hannosch, tesdal]
* Expanded performance tests to check against small tree sets.
[hannosch]
3.0a2 - 2010-02-21
------------------
* Marked this package as a Plone plugin.
[hannosch]
3.0a1 - 2010-02-21
------------------
* Updated to Cython 0.12.1.
[hannosch]
* Reinstated request cache that was removed in 1.6. Catalog.getCounter() is
part of the key.
[tesdal]
* Use optimized intersection instead of unoptimized weightedIntersection
if possible when joining result sets.
[tesdal]
* Merged work from the querytree-cython branch. We have completely optional
C optimizations based on Cython now.
[hannosch]
* Moved the performance tests to the normal tests package and made them
available on test level 2.
[hannosch]
* Moved tests into a sub-package.
[hannosch]
2.1 - 2009-11-19
----------------
* Moved patching into an initialize method and import ZopeTestCase in tests.
This avoids import errors in Zope 2.12.
[hannosch]
2.0 - 2009-11-10
----------------
* Add browser view named ``catalogqueryplan-prioritymap`` for convenient
dumping of the current query plan as a Python module & support for loading
it again using the `CATALOGQUERYPLAN` environment variable.
[witsch]
1.9 - 2009-11-06
----------------
* Yet more fixes for the recent optimization. The ZCatalog API is too flexible.
[hannosch]
1.8 - 2009-11-06
----------------
* Fixed an optimization introduced in the 1.7 release. We also need to look
into the form arguments of a request to look for query restrictions.
[hannosch]
1.7 - 2009-10-17
----------------
* When processing queries, do not ask indexes for their restrictions which
aren't actually part of the query.
[hannosch]
* Added a `DEFAULT_PRIORITYMAP` hook into the catalog module. This allows to
provide a default priority map dictionary to initialize the prioritymap with
a precomputed or manually designed one.
[hannosch]
1.6 - 2009-09-10
----------------
* Removed the per request caching for the date range index. This can lead to
invalid results for subsequent queries during the same request, if the data
is changed.
[hannosch]
1.5 - 2009-07-27
----------------
* Make sure to always include all indexes found in the original query into the
queryplan. Otherwise if we get a query at first which happens to have no
restrictions on a particular index, this index will no longer be queried at
all. Now we at least preserve the index as part of the queryplan. The real
solution is to continuously update the queryplan with result lengths as done
in unimr.catalogqueryplan.
[hannosch]
1.4 - 2009-05-20
----------------
* Added detailed per index time logging to the slow query reporting. You get
the time spent per index in addition to the total now.
[hannosch]
* Speed up the common KeywordIndexes like `portal_type` and
`allowedRolesAndUsers`. Provided with a small passed in result set and an
or-query we intersect it with each set in the index and union them later.
Doing a straight multiunion on these types of indexes most often creates a
set close to the total catalog size.
[hannosch]
* Added optional logging of slow queries, inspired by unimr.catalogqueryplan.
[hannosch]
* Added a mechanism to indicate the types of indexes, whose query values should
be taken into account when building the prioritymap. An opt-in mechanism via
VALUETYPES is provided much like ADVANCEDTYPES. Queries for example for
`review_state` differ depending on the query for `pending` and `published`
items, as those are usually very unevenly distributed.
[hannosch]
1.3 - 2009-03-15
----------------
* Changed the log messages to debug level.
[hannosch]
1.2 - 2009-03-03
----------------
* Don't use request.request as part of the update as it tends to
trigger the browser id (_ZopeId)
[tesdal]
* Make sure UnIndex always returns IISet, not int.
[tesdal]
* Only sort intersection sets if there are more than 2 sets,
otherwise the order is irrelevant
[tesdal]
* Added logging for patches
[swampmonkey]
1.1 - 2009-01-02
----------------
* Made the set monkeypatches temporary to avoid zc.relationship
trying to persist the set methods.
[tesdal]
1.0 - 2009-01-02
----------------
* Removed redundant intersections, added type checking to difference
[tesdal]
* Add alternative weightedIntersection, and reuse BTree tests
[tesdal]
* Don't monkeypatch intersection as zc.relationship will try
to pickle the function.
Added new ExtendedPathIndex code.
[tesdal]
* Optimize UnIndex.apply_index internally, sort sets for AND,
use multiunion for OR.
[tesdal]
* Limit the number of if-statements in intersection,
and added test for fastest way of finding max and min.
[tesdal]
* Monkeypatch difference to handle big/tiny difference in Python
This doesn't belong in queryplan, as it's only a BTree patch,
and should be refactored out.
[tesdal]
* Added performance tests.
[tesdal]
* Fixed a bug with UnIndex return result missing index id
[tesdal]
* Added tests for intersection, fixed a bug with empty second argument set
[tesdal]
* Monkeypatch intersect to handle big/tiny intersects in Python
[tesdal]
* Improved UnIndex query, to avoid redundant intersections
[tesdal]
* Clarified LanguageIndex support. We are missing fallback support right now
and now disable the optimization when fallback is enabled.
[hannosch, mj]
0.9 - 2008-10-18
----------------
* Added support for LinguaPlone's LanguageIndex.
[hannosch]
0.8 - 2008-09-03
----------------
* Let each index patch register itself with the ADVANCEDTYPES list.
This should enable patching of other indexes as well, and remove
the dependency on ExtendedPathIndex.
[tesdal]
0.7 - 2008-08-22
----------------
* Check whether we're supposed to use daterangeindex
at all before retrieving cached data.
[tesdal]
0.6 - 2008-07-03
----------------
* Use a volatile instance variable to store the prioritymap.
[mj]
0.5 - 2008/06/23
----------------
* DateRangeIndex shouldn't overwrite the semi-request passed into the
apply_index method.
[mj]
0.4 - 2008/06/23
----------------
* DateRangeIndex now doesn't assume that REQUEST is available.
[tesdal]
0.3
---
* Handle request being a dictionary.
[tesdal]
0.3
---
* Refactored patches into multiple files.
[tesdal]
* Dynamic query optimization based on result set analysis
from queries against the same indexes.
[tesdal]
* Manual query optimization based on typical usage pattern.
[tesdal]
0.1
---
* Initial release