ftw.crawler
===========
Installation
------------
The easiest way to install ``ftw.crawler`` is to create a buildout that
contains the configuration, pulls in the egg using ``zc.recipe.egg``, and
creates a script in the ``bin/`` directory that directly launches the crawler
with the respective configuration as an argument:
- First, create a configuration file for the crawler. You can base your
configuration on `ftw/crawler/tests/assets/basic_config.py <https://github.com/4teamwork/ftw.crawler/blob/master/ftw/crawler/tests/assets/basic_config.py>`_ by copying
it to your buildout and adapting it as needed.
Make sure to configure at least the ``tika`` and ``solr`` URLs to point to
the correct locations of the respective services, and to adapt the ``sites``
list to your needs (a minimal sketch follows below).
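A minimal configuration might look like the following sketch. The import
path and the ``TIKA`` / ``SOLR`` / ``SITES`` names are assumptions based on
the example config; consult ``basic_config.py`` for the authoritative
structure.

.. code:: python

    # foo_org_config.py -- a minimal sketch of a crawler configuration.
    # NOTE: the import path and constant names here are assumptions;
    # refer to basic_config.py for the authoritative structure.
    from ftw.crawler.configuration import Site

    # Locations of the Tika JAXRS server and the Solr core (these can be
    # overridden with the --tika / --solr command line arguments).
    TIKA = 'http://localhost:9998/'
    SOLR = 'http://localhost:8983/solr'

    SITES = [
        Site('http://www.foo.org/',
             sitemap_urls=['http://www.foo.org/sitemap.xml'],
             crawler_site_id='foo.org'),
    ]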
- Create a buildout config that installs ``ftw.crawler`` using ``zc.recipe.egg``:
``crawler.cfg``
.. code:: ini
[buildout]
parts +=
crawler
crawl-foo-org
[crawler]
recipe = zc.recipe.egg
eggs = ftw.crawler
- Further define a buildout section that creates a ``bin/crawl-foo-org``
script, which will call ``bin/crawl foo_org_config.py`` using absolute paths
(for easier use from cron jobs):
.. code:: ini
[crawl-foo-org]
recipe = collective.recipe.scriptgen
cmd = ${buildout:bin-directory}/crawl
arguments =
${buildout:directory}/foo_org_config.py
--tika http://localhost:9998/
--solr http://localhost:8983/solr
(The ``--tika`` and ``--solr`` command line arguments are optional; they
can also be set in the configuration file. If given, the command line
arguments take precedence over the corresponding parameters in the config
file.)
- Add a buildout config that downloads and configures a Tika JAXRS server:
``tika-server.cfg``
.. code:: ini
[buildout]
parts +=
supervisor
tika-server-download
tika-server
[supervisor]
recipe = collective.recipe.supervisor
plugins =
superlance
port = 8091
user = supervisor
password = admin
programs =
10 tika-server (stopasgroup=true) ${buildout:bin-directory}/tika-server true your_os_user
[tika-server-download]
recipe = hexagonit.recipe.download
url = http://repo1.maven.org/maven2/org/apache/tika/tika-server/1.5/tika-server-1.5.jar
md5sum = 0f70548f233ead7c299bf7bc73bfec26
download-only = true
filename = tika-server.jar
[tika-server]
port = 9998
recipe = collective.recipe.scriptgen
cmd = java
arguments = -jar ${tika-server-download:destination}/${tika-server-download:filename} --port ${:port}
Modify ``your_os_user`` and the supervisor and Tika ports as needed.
- Finally, add a `bootstrap.py <http://downloads.buildout.org/2/bootstrap.py>`_
and create the ``buildout.cfg`` that pulls all of the above together:
``buildout.cfg``
.. code:: ini
[buildout]
extensions = mr.developer
extends =
tika-server.cfg
crawler.cfg
- Bootstrap and run buildout:
.. code:: bash
python bootstrap.py
bin/buildout
Running the crawler
-------------------
If you created the ``bin/crawl-foo-org`` script with the buildout described
above, that's all you need to run the crawler:
- Make sure Tika and Solr are running
- Run ``bin/crawl-foo-org`` *(with either a relative or absolute path; the
  working directory doesn't matter, so it can easily be called from a cron
  job)*
Running ``bin/crawl`` directly
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``bin/crawl-foo-org`` script is just a thin wrapper that calls the
``bin/crawl`` script (generated by ``ftw.crawler``'s setuptools
``console_script`` entry point) with the absolute path to the configuration
file as the only argument. Any other arguments to the ``bin/crawl-foo-org``
script are forwarded to ``bin/crawl``.
Therefore running ``bin/crawl-foo-org [args]`` is equivalent to
``bin/crawl foo_org_config.py [args]``.
Provide known sitemap URLs in site configs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you already know the sitemap URL, you can configure one or more sitemap
URLs statically:
.. code:: python
Site('http://example.org/foo/',
sitemap_urls=['http://example.org/foo/the_sitemap.xml'])
Configure site ID for purging
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In order for purging to work smoothly, it is recommended to configure a
crawler site ID.
Make sure that each site ID is unique per Solr core!
Candidate documents for purging are identified by this crawler site ID.
.. code:: python
Site('http://example.org/',
crawler_site_id='example.org-news')
Be aware that your Solr core must provide a string field named
``crawler_site_id``.
Indexing only a particular URL
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you only want to index a particular URL, pass that URL as the first
argument to ``bin/crawl-foo-org``. The crawler will then only fetch and index
that specific URL.
Slack notifications
-------------------
``ftw.crawler`` supports Slack notifications, which can be used to monitor
the crawler for errors that occur while crawling.
To enable Slack notifications for your environment, do the following:
- Install ``ftw.crawler`` with the ``slack`` extra.
- Set the ``SLACK_TOKEN`` and ``SLACK_CHANNEL`` parameters in your crawler
  config, or use the ``--slacktoken`` and ``--slackchannel`` command line
  arguments when calling the ``bin/crawl`` script.
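For the config-file variant, this could look as follows (the values are
placeholders, and treating the parameters as module-level constants in the
config is an assumption):

.. code:: python

    # In your crawler config, e.g. foo_org_config.py. The parameter names
    # follow this README; the values are placeholders for your own bot.
    SLACK_TOKEN = 'xoxb-0000000000-placeholder'  # token of your Slack bot
    SLACK_CHANNEL = '#crawler-alerts'            # channel to notify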
To generate a valid Slack token for your integration, you have to create a
new bot in your Slack team. Once the bot is created, Slack automatically
generates a valid token for it, which you can then use for your integration.
You can also use a test token to try out your integration, but don't forget
to create a bot before your application goes to production!
Development
-----------
To start hacking on ``ftw.crawler``, use the ``development.cfg`` buildout:
.. code:: bash
ln -s development.cfg buildout.cfg
python bootstrap.py
bin/buildout
This will build a Tika JAXRS server and a Solr instance for you. The Solr
configuration is set up to be compatible with the testing / example
configuration at `ftw/crawler/tests/assets/basic_config.py <https://github.com/4teamwork/ftw.crawler/blob/master/ftw/crawler/tests/assets/basic_config.py>`_.
To run the crawler against the example configuration:
.. code:: bash
bin/tika-server
bin/solr-instance fg
bin/crawl ftw/crawler/tests/assets/basic_config.py
Links
-----
- Github: https://github.com/4teamwork/ftw.crawler
- Issues: https://github.com/4teamwork/ftw.crawler/issues
- PyPI: http://pypi.python.org/pypi/ftw.crawler
- Continuous integration: https://jenkins.4teamwork.ch/search?q=ftw.crawler
Copyright
---------
This package is copyright by `4teamwork <http://www.4teamwork.ch/>`_.
``ftw.crawler`` is licensed under GNU General Public License, version 2.
Changelog
=========
1.4.0 (2017-11-08)
------------------
- Add crawler_site_id option for improving purging. [jone]
1.3.0 (2017-11-03)
------------------
- Fix purging problem.
  Warning: updating "ftw.crawler" to this version breaks your existing
  crawlers if you have set the site URL to a sitemap URL. Please use the
  "sitemap_urls" attribute instead. You also need to purge the Solr index
  manually and reindex.
  [jone]
1.2.1 (2017-10-30)
------------------
- Encode URL in UTF-8 before generating MD5-Hash.
[raphael-s]
1.2.0 (2017-06-22)
------------------
- Support Slack notifications.
[raphael-s]
1.1.0 (2016-10-04)
------------------
- Support configuration of absolute sitemap urls. [jone]
- Slow down on too many requests. [jone]
1.0 (2015-11-09)
----------------
- Initial implementation.
[lgraf]