معرفی شرکت ها

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

توضیحات

Enter something here

ویژگی	مقدار
سیستم عامل	OS Independent
نام فایل	SimpleCrawler-1.0.1
نام	SimpleCrawler
نسخه کتابخانه	1.0.1
نگهدارنده	[]
ایمیل نگهدارنده	[]
نویسنده	jackwardell
ایمیل نویسنده	jack@wardell.xyz
آدرس صفحه اصلی	https://github.com/jackwardell
آدرس اینترنتی	https://pypi.org/project/SimpleCrawler/
مجوز	-

# SimpleCrawler * This web crawler can be used to crawl a website from the command line or code # Install * Python required (3.6+) https://www.python.org/downloads/ * `pip install SimpleCrawler` OR * `git clone https://github.com/jackwardell/SimpleCrawler.git` * `cd SimpleCrawler` * `python3 -m venv venv` * `source venv/bin/activate` * `pip install --upgrade pip` * `pip install -r requirements.txt` * `pip install -e .` * `pytest` * `crawl https://www.example.com` # Rules: This crawler will: * Only crawl text/html mime-types * Only crawl pages that return 200 OK HTTP statuses * Look at /robots.txt and obey by default (but can be overridden) * Add User-Agent, default value = PyWebCrawler (but can be changed) * Ignore ?query=strings and #fragments by default (but can be changed) * Get links from ONLY href value in <a href='/some-link'>click here</a> tags Todo: * Nicer logging * Crawl client errors and server error pages? (Most websites have 404 & 500 handlers which may have links) * Parse more than just <a> tags and href attrs e.g. src='/some-link' * Add a scheduler / wait frequency (Politeness) * Request timeout # Use * just type `crawl <url>` into your command line e.g. `crawl https://www.google.com` ``` $ crawl --help Usage: crawl [OPTIONS] URL Options: -u, --user-agent TEXT -w, --max-workers INTEGER -t, --timeout INTEGER -h, --check-head -d, --disobey-robots -wq, --with-query -wf, --with-fragment --debug / --no-debug --help Show this message and exit. ``` * optional params: - "--user-agent" or "-u" - what the User-Agent header param is - default = 'PyWebCrawler' - "--max-workers" or "-w" - max number of worker threads - default = 1 - "--timeout" or "-t" - how long to wait for new items from work queue before shutting down - default = 10 - "--check-head" or "-t" - whether to send HEAD request before sending GET request - why? some GET responses will be large (e.g. pdf) and sending HEAD request first will allow crawler to see MIME type before it makes a GET request, therefore averting the GET request altogether - default = False - "--disobey-robots" or "-d" - whether to disobey robots.txt file - default = False - "--with-query" or "-wq" - whether to allow query args e.g. https://www.example.com/?hello=world -> https://www.example.com/ if not --with-query - default = False - "--with-fragment" or "-wf" - whether to allow fragments e.g. https://www.example.com/#helloworld -> https://www.example.com/ if not --with-fragment - default = False - "--debug/--no-debug", default=False - whether to run the crawl, if debug on, then it wont crawl but will pump out crawler config - default = False OR from code ``` from simple_crawler import Crawler crawler = Crawler() found_links = crawler.crawl('https://www.example.com/') ```

نیازمندی

مقدار	نام
-	requests
-	click

زبان مورد نیاز

مقدار	نام
>=3.6	Python

نحوه نصب

نصب پکیج whl SimpleCrawler-1.0.1:

pip install SimpleCrawler-1.0.1.whl

نصب پکیج tar.gz SimpleCrawler-1.0.1:

pip install SimpleCrawler-1.0.1.tar.gz