ftw.crawler-1.4.0



Description

Crawl sites, extract text and metadata, index it in Solr

=================  ==========================================
Field              Value
=================  ==========================================
Operating system   -
Filename           ftw.crawler-1.4.0
Name               ftw.crawler
Library version    1.4.0
Maintainer         []
Maintainer email   []
Author             4teamwork AG
Author email       info@4teamwork.ch
Homepage           https://github.com/4teamwork/ftw.crawler
PyPI URL           https://pypi.org/project/ftw.crawler/
License            GPL2
=================  ==========================================
ftw.crawler
===========

Installation
------------

To install ``ftw.crawler``, the easiest way is to create a buildout that
contains the configuration, pulls in the egg using ``zc.recipe.egg`` and
creates a script in the ``bin/`` directory that directly launches the crawler
with the respective configuration as an argument:

- First, create a configuration file for the crawler. You can base your
  configuration on `ftw/crawler/tests/assets/basic_config.py
  <https://github.com/4teamwork/ftw.crawler/blob/master/ftw/crawler/tests/assets/basic_config.py>`_
  by copying it to your buildout and adapting it as needed. Make sure to
  configure at least the ``tika`` and ``solr`` URLs to point to the correct
  locations of the respective services, and to adapt the ``sites`` list to
  your needs (a minimal sketch of such a configuration follows this list).

- Create a buildout config that installs ``ftw.crawler`` using
  ``zc.recipe.egg``:

  ``crawler.cfg``

  .. code:: ini

      [buildout]
      parts +=
          crawler
          crawl-foo-org

      [crawler]
      recipe = zc.recipe.egg
      eggs = ftw.crawler

- Further define a buildout section that creates a ``bin/crawl-foo-org``
  script, which will call ``bin/crawl foo_org_config.py`` using absolute
  paths (for easier use from cron jobs):

  .. code:: ini

      [crawl-foo-org]
      recipe = collective.recipe.scriptgen
      cmd = ${buildout:bin-directory}/crawl
      arguments = ${buildout:directory}/foo_org_config.py
          --tika http://localhost:9998/
          --solr http://localhost:8983/solr

  (The ``--tika`` and ``--solr`` command line arguments are optional; they
  can also be set in the configuration file. If given, the command line
  arguments take precedence over any parameters in the config file.)

- Add a buildout config that downloads and configures a Tika JAXRS server:

  ``tika-server.cfg``

  .. code:: ini

      [buildout]
      parts +=
          supervisor
          tika-server-download
          tika-server

      [supervisor]
      recipe = collective.recipe.supervisor
      plugins = superlance
      port = 8091
      user = supervisor
      password = admin
      programs =
          10 tika-server (stopasgroup=true) ${buildout:bin-directory}/tika-server true your_os_user

      [tika-server-download]
      recipe = hexagonit.recipe.download
      url = http://repo1.maven.org/maven2/org/apache/tika/tika-server/1.5/tika-server-1.5.jar
      md5sum = 0f70548f233ead7c299bf7bc73bfec26
      download-only = true
      filename = tika-server.jar

      [tika-server]
      port = 9998
      recipe = collective.recipe.scriptgen
      cmd = java
      arguments = -jar ${tika-server-download:destination}/${tika-server-download:filename} --port ${:port}

  Modify ``your_os_user`` and the supervisor and Tika ports as needed.

- Finally, add a `bootstrap.py <http://downloads.buildout.org/2/bootstrap.py>`_
  and create the ``buildout.cfg`` that pulls all of the above together:

  ``buildout.cfg``

  .. code:: ini

      [buildout]
      extensions = mr.developer
      extends =
          tika-server.cfg
          crawler.cfg

- Bootstrap and run buildout:

  .. code:: bash

      python bootstrap.py
      bin/buildout
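
Coming back to the first step above: the following is a minimal sketch of
what such a crawler configuration file might look like, pieced together from
the options this README mentions (the ``tika`` and ``solr`` service URLs, a
``sites`` list, and the optional ``sitemap_urls`` and ``crawler_site_id``
arguments shown in the sections below). The import path and variable names
are assumptions; use the linked ``basic_config.py`` as the authoritative
template.

.. code:: python

    # foo_org_config.py -- illustrative sketch only, not the shipped example.
    from ftw.crawler.configuration import Site  # import path assumed

    # Service endpoints; both can be overridden with --tika / --solr on the
    # command line, which take precedence over the config file.
    tika = 'http://localhost:9998/'
    solr = 'http://localhost:8983/solr'

    # Sites to crawl. sitemap_urls and crawler_site_id are the optional
    # Site() arguments documented in the sections further below.
    sites = [
        Site('http://www.foo.org/',
             sitemap_urls=['http://www.foo.org/sitemap.xml'],
             crawler_site_id='foo.org'),
    ]
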
Running the crawler
-------------------

If you created the ``bin/crawl-foo-org`` script with the buildout described
above, that's all you need to run the crawler:

- Make sure Tika and Solr are running
- Run ``bin/crawl-foo-org`` *(with either a relative or absolute path; the
  working directory doesn't matter, so it can easily be called from a cron
  job)*

Running ``bin/crawl`` directly
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``bin/crawl-foo-org`` script is just a thin wrapper that calls the
``bin/crawl`` script, generated by ``ftw.crawler``'s setuptools
``console_script`` entry point, with the absolute path to the configuration
file as the only argument. Any other arguments to the ``bin/crawl-foo-org``
script will be forwarded to ``bin/crawl``.

Therefore running ``bin/crawl-foo-org [args]`` is equivalent to
``bin/crawl foo_org_config.py [args]``.

Provide known sitemap urls in site configs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you know the sitemap url, you can configure one or many sitemap urls
statically:

.. code:: python

    Site('http://example.org/foo/',
         sitemap_urls=['http://example.org/foo/the_sitemap.xml'])

Configure site ID for purging
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order for purging to work smoothly, it is recommended to configure a
crawler site ID. Make sure that each site ID is unique per Solr core!
Candidate documents for purging will be identified by this crawler site ID.

.. code:: python

    Site('http://example.org/',
         crawler_site_id='example.org-news')

Be aware that your Solr core must provide a string field
``crawler_site_id``.

Indexing only a particular URL
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you only want to index a particular URL, pass that URL as the first
argument to ``bin/crawl-foo-org``. The crawler will then only fetch and
index that specific URL.

Slack notifications
-------------------

``ftw.crawler`` supports Slack notifications, which can be used to monitor
the crawler for errors during crawling. To enable them for your environment:

- Install ``ftw.crawler`` with the ``slack`` extra.
- Set the ``SLACK_TOKEN`` and ``SLACK_CHANNEL`` parameters in your crawler
  config, or
- use the ``--slacktoken`` and ``--slackchannel`` arguments on the command
  line when calling the ``bin/crawl`` script (see the example after this
  section).

To generate a valid Slack token for your integration, create a new bot in
your Slack team. Once the bot is created, Slack automatically generates a
valid token for it, which you can then use for your integration. You can
also generate a test token to try out your integration, but don't forget to
create a proper bot before your application goes to production!
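
For illustration, a crawl with Slack notifications enabled from the command
line might look like the following; the ``--slacktoken`` and
``--slackchannel`` flags are the ones named above, while the token and
channel values are placeholders you would replace with your own:

.. code:: bash

    # Hypothetical invocation; substitute your bot token and channel.
    bin/crawl foo_org_config.py \
        --slacktoken xoxb-your-bot-token \
        --slackchannel '#crawler-alerts'
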
Development
-----------

To start hacking on ``ftw.crawler``, use the ``development.cfg`` buildout:

.. code:: bash

    ln -s development.cfg buildout.cfg
    python bootstrap.py
    bin/buildout

This will build a Tika JAXRS server and a Solr instance for you. The Solr
configuration is set up to be compatible with the testing / example
configuration at `ftw/crawler/tests/assets/basic_config.py
<https://github.com/4teamwork/ftw.crawler/blob/master/ftw/crawler/tests/assets/basic_config.py>`_.

To run the crawler against the example configuration:

.. code:: bash

    bin/tika-server
    bin/solr-instance fg
    bin/crawl ftw/crawler/tests/assets/basic_config.py

Links
-----

- Github: https://github.com/4teamwork/ftw.crawler
- Issues: https://github.com/4teamwork/ftw.crawler/issues
- Pypi: http://pypi.python.org/pypi/ftw.crawler
- Continuous integration: https://jenkins.4teamwork.ch/search?q=ftw.crawler

Copyright
---------

This package is copyright by `4teamwork <http://www.4teamwork.ch/>`_.

``ftw.crawler`` is licensed under GNU General Public License, version 2.

Changelog
=========

1.4.0 (2017-11-08)
------------------

- Add crawler_site_id option for improving purging.
  [jone]

1.3.0 (2017-11-03)
------------------

- Fix purging problem.
  Warning: updating "ftw.crawler" to this version breaks your existing
  crawlers when you set the site url to a sitemap url. Please use the
  "sitemap_urls" attribute instead. You also need to purge the Solr index
  manually and reindex.
  [jone]

1.2.1 (2017-10-30)
------------------

- Encode URL in UTF-8 before generating MD5-Hash.
  [raphael-s]

1.2.0 (2017-06-22)
------------------

- Support Slack notifications.
  [raphael-s]

1.1.0 (2016-10-04)
------------------

- Support configuration of absolute sitemap urls.
  [jone]
- Slow down on too many requests.
  [jone]

1.0 (2015-11-09)
----------------

- Initial implementation.
  [lgraf]


How to install


To install the ftw.crawler-1.4.0 whl package:

    pip install ftw.crawler-1.4.0.whl


To install the ftw.crawler-1.4.0 tar.gz package:

    pip install ftw.crawler-1.4.0.tar.gz
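

To install with the optional "slack" extra (enables the Slack notifications
described above; standard pip extras syntax, assuming the package and extra
are available on PyPI under these names):

    pip install "ftw.crawler[slack]"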