doc_crawler-1.2



Description

Explore a website recursively and download all the wanted documents (PDF, ODT…)
Feature            Value
Operating system   -
File name          doc_crawler-1.2
Name               doc_crawler
Version            1.2
Maintainer         []
Maintainer email   []
Author             Simon Descarpentries
Author email       contact@acoeuro.com
Home page          https://github.com/Siltaar/doc_crawler.py
Package URL        https://pypi.org/project/doc_crawler/
License            -
doc_crawler - explore a website recursively and download all the wanted documents (PDF, ODT…).

== Synopsis

doc_crawler.py [--accept=jpe?g$] [--download] [--single-page] [--verbose] http://…
doc_crawler.py [--wait=3] [--no-random-wait] --download-files url.lst
doc_crawler.py [--wait=0] --download-file http://…
or
python3 -m doc_crawler […] http://…

== Description

_doc_crawler_ can explore a website recursively from a given URL and retrieve, in the descendant pages, the document files it encounters (by default: PDF, ODT, DOC, XLS, ZIP…) based on regular-expression matching (typically against their extension). Documents can be listed on the standard output or downloaded (with the _--download_ argument).

To address real-life situations, activities can be logged (with _--verbose_). The search can also be limited to a single page (with the _--single-page_ argument).

Documents can be downloaded from a given list of URLs, which you may have previously produced using the default options of _doc_crawler_ and an output redirection such as:
`./doc_crawler.py http://… > url.lst`

Documents can also be downloaded one by one if necessary (to finish the work), using the _--download-file_ argument, which makes _doc_crawler_ a tool sufficient by itself to assist you at every step.

By default, the program waits a randomly picked number of seconds, between 1 and 5, before each download, to avoid being rude to the web server it interacts with (and so to avoid being blacklisted). This behaviour can be disabled (with a _--no-random-wait_ and/or a _--wait=0_ argument).

_doc_crawler.py_ works great with Tor: `torsocks doc_crawler.py http://…`

== Options

*--accept*=_jpe?g$_:: Optional case-insensitive regular expression that kept document names must match. Example: _--accept=jpe?g$_ keeps .JPG, .JPEG, .jpg and .jpeg files.
*--download*:: Directly downloads the found documents if set; outputs their URLs if not.
*--single-page*:: Limits the search for documents to the given URL.
*--verbose*:: Creates a log file to keep track of what was done.
*--wait*=x:: Changes the maximum waiting time before each download (page or document). Example: _--wait=3_ waits between 1 and 3 s before each download. The default is 5.
*--no-random-wait*:: Disables the random choice of waiting times; _--wait=_ or its default is used.
*--download-files* url.lst:: Downloads each document whose URL is listed in the given file. Example: _--download-files url.lst_
*--download-file* http://…:: Directly saves the URL-pointed document in the current folder.

== Tests

Around 30 _doctests_ are included in _doc_crawler.py_. You can run them with the following command in the cloned repository root:
`python3 -m doctest doc_crawler.py`

Tests can also be launched one by one using the _--test=XXX_ argument:
`python3 -m doc_crawler --test=download_file`

Tests pass if nothing is output.

== Requirements

- requests
- yaml

One can install them under Debian using the following command:
`apt install python3-requests python3-yaml`

== Author

Simon Descarpentries - https://s.d12s.fr

== Resources

Github repository: https://github.com/Siltaar/doc_crawler.py
Pypi repository: https://pypi.python.org/pypi/doc_crawler

== Support

To support this project, you may consider (even a symbolic) donation via: https://liberapay.com/Siltaar

== Licence

GNU General Public License v3.0. See the LICENCE file for more information.
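
To make the described behaviour concrete, here is a minimal sketch of the recursive crawl-and-filter technique the README outlines (regex matching on document URLs, plus a random 1-5 s pause before each request). It is an illustration built only on the documented behaviour, not the package's actual internals; the names `WANTED_RE`, `LinkExtractor` and `crawl` are hypothetical, and the real tool exposes the filtering idea through its _--accept_ option.

[source,python]
----
import random
import re
import time
from html.parser import HTMLParser
from urllib.parse import urljoin

import requests

# Illustrative pattern in the spirit of the README's default
# extension matching; the package's real pattern may differ.
WANTED_RE = re.compile(r"\.(pdf|odt|docx?|xlsx?|zip)$", re.IGNORECASE)


class LinkExtractor(HTMLParser):
    """Collects href attributes from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, wanted=WANTED_RE, max_wait=5):
    """Visit pages under start_url recursively, yielding matching document URLs."""
    seen = set()
    queue = [start_url]
    while queue:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        # Polite random pause before each request, as the README describes.
        time.sleep(random.uniform(1, max_wait))
        page = requests.get(url)
        parser = LinkExtractor()
        parser.feed(page.text)
        for link in parser.links:
            absolute = urljoin(url, link)
            if wanted.search(absolute):
                yield absolute            # a wanted document
            elif absolute.startswith(start_url):
                queue.append(absolute)    # a descendant page to explore


if __name__ == "__main__":
    for doc_url in crawl("http://example.com/"):
        print(doc_url)
----

Matching against the absolute URL keeps the filter a single regular expression, which is why a pattern like `jpe?g$` is all the _--accept_ option needs.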


How to install


Installing the doc_crawler-1.2 whl package:

    pip install doc_crawler-1.2.whl


Installing the doc_crawler-1.2 tar.gz package:

    pip install doc_crawler-1.2.tar.gz
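

After installation, the crawler can be invoked as a module, following the README's synopsis; the URL below is only a placeholder:

    python3 -m doc_crawler --download http://example.com/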