Python爬虫之Scrapy环境搭建
首先要安装Python环境,Python环境搭建见:https://blog.csdn.net/alice_tl/article/details/76793590
接下来讲如何搭建Scrapy环境
1、安装Scrapy,在终端使用pip install Scrapy(注意最好是*环境)
进度提示如下:
alicedeMacBook-Pro:~ alice$ pip install Scrapy
Collecting Scrapy
Using cached https://files.pythonhosted.org/packages/5d/12/a6197eaf97385e96fd8ec56627749a6229a9b3178ad73866a0b1fb377379/Scrapy-1.5.1-py2.py3-none-any.whl
Collecting w3lib>=1.17.0 (from Scrapy)
Using cached https://files.pythonhosted.org/packages/37/94/40c93ad0cadac0f8cb729e1668823c71532fd4a7361b141aec535acb68e3/w3lib-1.19.0-py2.py3-none-any.whl
Collecting six>=1.5.2 (from Scrapy)
Using cached https://files.pythonhosted.org/packages/67/4b/141a581104b1f6397bfa78ac9d43d8ad29a7ca43ea90a2d863fe3056e86a/six-1.11.0-py2.py3-none-any.whl
Collecting cssselect>=0.9 (from Scrapy)
Using cached https://files.pythonhosted.org/packages/7b/44/25b7283e50585f0b4156960691d951b05d061abf4a714078393e51929b30/cssselect-1.0.3-py2.py3-none-any.whl
Collecting parsel>=1.1 (from Scrapy)
Using cached https://files.pythonhosted.org/packages/fd/1a/9642a5ea68763d5e7c419df0873073e54bb23d0a8897d3c78e146dd6f355/parsel-1.5.0-py2.py3-none-any.whl
Requirement already satisfied: pyOpenSSL in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from Scrapy) (0.13.1)
Collecting queuelib (from Scrapy)
Using cached https://files.pythonhosted.org/packages/4c/85/ae64e9145f39dd6d14f8af3fa809a270ef3729f3b90b3c0cf5aa242ab0d4/queuelib-1.5.0-py2.py3-none-any.whl
Collecting PyDispatcher>=2.0.5 (from Scrapy)
Using cached https://files.pythonhosted.org/packages/cd/37/39aca520918ce1935bea9c356bcbb7ed7e52ad4e31bff9b943dfc8e7115b/PyDispatcher-2.0.5.tar.gz
Collecting service-identity (from Scrapy)
Using cached https://files.pythonhosted.org/packages/29/fa/995e364220979e577e7ca232440961db0bf996b6edaf586a7d1bd14d81f1/service_identity-17.0.0-py2.py3-none-any.whl
Collecting Twisted>=13.1.0 (from Scrapy)
Using cached https://files.pythonhosted.org/packages/90/50/4c315ce5d119f67189d1819629cae7908ca0b0a6c572980df5cc6942bc22/Twisted-18.7.0.tar.bz2
Complete output from command python setup.py egg_info:
Download error on https://pypi.python.org/simple/incremental/: [Errno 60] Operation timed out -- Some packages may not be found!
Couldn't find index page for 'incremental' (maybe misspelled?)
No local packages or download links found for incremental>=16.10.1
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-U_6VZF/Twisted/setup.py", line 20, in <module>
setuptools.setup(**_setup["getSetupArgs"]())
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/core.py", line 111, in setup
_setup_distribution = dist = klass(attrs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/setuptools/dist.py", line 268, in __init__
self.fetch_build_eggs(attrs['setup_requires'])
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/setuptools/dist.py", line 313, in fetch_build_eggs
replace_conflicting=True,
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources/__init__.py", line 843, in resolve
dist = best[req.key] = env.best_match(req, ws, installer)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources/__init__.py", line 1088, in best_match
return self.obtain(req, installer)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources/__init__.py", line 1100, in obtain
return installer(requirement)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/setuptools/dist.py", line 380, in fetch_build_egg
return cmd.easy_install(req)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/setuptools/command/easy_install.py", line 632, in easy_install
raise DistutilsError(msg)
distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('incremental>=16.10.1')
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-U_6VZF/Twisted/
出现缺少Twisted的错误提示:
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-U_6VZF/Twisted/
2、安装Twiseted,终端里输入:sudo pip install twisted==13.1.0
alicedeMacBook-Pro:~ alice$ pip install twisted==13.1.0
Collecting twisted==13.1.0
Downloading https://files.pythonhosted.org/packages/10/38/0d1988d53f140ec99d37ac28c04f341060c2f2d00b0a901bf199ca6ad984/Twisted-13.1.0.tar.bz2 (2.7MB)
100% |████████████████████████████████| 2.7MB 398kB/s
Requirement already satisfied: zope.interface>=3.6.0 in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from twisted==13.1.0) (4.1.1)
Requirement already satisfied: setuptools in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from zope.interface>=3.6.0->twisted==13.1.0) (18.5)
Installing collected packages: twisted
Running setup.py install for twisted ... error
Complete output from command /usr/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-inJwZ2/twisted/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-record-OmuVWF/install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
creating build
creating build/lib.macosx-10.13-intel-2.7
creating build/lib.macosx-10.13-intel-2.7/twisted
copying twisted/copyright.py -> build/lib.macosx-10.13-intel-2.7/twisted
copying twisted/_version.py -> build/li
3、再次使用sudo pip install scrapy安装,发现仍然出现错误提示,这次是没有安装lxml的错误提示:
Could not find a version that satisfies the requirement lxml (from Scrapy) (from versions: )
No matching distribution found for lxml (from Scrapy)
alicedeMacBook-Pro:~ alice$ sudo pip install Scrapy
The directory '/Users/alice/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/alice/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting Scrapy
Downloading https://files.pythonhosted.org/packages/5d/12/a6197eaf97385e96fd8ec56627749a6229a9b3178ad73866a0b1fb377379/Scrapy-1.5.1-py2.py3-none-any.whl (249kB)
100% |████████████████████████████████| 256kB 210kB/s
Collecting w3lib>=1.17.0 (from Scrapy)
Downloading https://files.pythonhosted.org/packages/37/94/40c93ad0cadac0f8cb729e1668823c71532fd4a7361b141aec535acb68e3/w3lib-1.19.0-py2.py3-none-any.whl
Requirement already satisfied: six>=1.5.2 in /Library/Python/2.7/site-packages (from Scrapy) (1.11.0)
Collecting cssselect>=0.9 (from Scrapy)
Downloading https://files.pythonhosted.org/packages/7b/44/25b7283e50585f0b4156960691d951b05d061abf4a714078393e51929b30/cssselect-1.0.3-py2.py3-none-any.whl
Collecting parsel>=1.1 (from Scrapy)
Downloading https://files.pythonhosted.org/packages/fd/1a/9642a5ea68763d5e7c419df0873073e54bb23d0a8897d3c78e146dd6f355/parsel-1.5.0-py2.py3-none-any.whl
Requirement already satisfied: pyOpenSSL in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from Scrapy) (0.13.1)
Collecting queuelib (from Scrapy)
Downloading https://files.pythonhosted.org/packages/4c/85/ae64e9145f39dd6d14f8af3fa809a270ef3729f3b90b3c0cf5aa242ab0d4/queuelib-1.5.0-py2.py3-none-any.whl
Collecting PyDispatcher>=2.0.5 (from Scrapy)
Downloading https://files.pythonhosted.org/packages/cd/37/39aca520918ce1935bea9c356bcbb7ed7e52ad4e31bff9b943dfc8e7115b/PyDispatcher-2.0.5.tar.gz
Collecting service-identity (from Scrapy)
Downloading https://files.pythonhosted.org/packages/29/fa/995e364220979e577e7ca232440961db0bf996b6edaf586a7d1bd14d81f1/service_identity-17.0.0-py2.py3-none-any.whl
Collecting Twisted>=13.1.0 (from Scrapy)
Downloading https://files.pythonhosted.org/packages/90/50/4c315ce5d119f67189d1819629cae7908ca0b0a6c572980df5cc6942bc22/Twisted-18.7.0.tar.bz2 (3.1MB)
100% |████████████████████████████████| 3.1MB 59kB/s
Collecting lxml (from Scrapy)
Could not find a version that satisfies the requirement lxml (from Scrapy) (from versions: )
No matching distribution found for lxml (from Scrapy)
4、安装lxml,使用:sudo pip install lxml
alicedeMacBook-Pro:~ alice$ sudo pip install lxml
The directory '/Users/alice/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/alice/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting lxml
Downloading https://files.pythonhosted.org/packages/a1/2c/6b324d1447640eb1dd240e366610f092da98270c057aeb78aa596cda4dab/lxml-4.2.4-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (8.7MB)
100% |████████████████████████████████| 8.7MB 187kB/s
Installing collected packages: lxml
Successfully installed lxml-4.2.4
5、再次安装scrapy,使用sudo pip install scrapy,安装成功
alicedeMacBook-Pro:~ alice$ sudo pip install Scrapy
The directory '/Users/alice/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/alice/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting Scrapy
Downloading https://files.pythonhosted.org/packages/5d/12/a6197eaf97385e96fd8ec56627749a6229a9b3178ad73866a0b1fb377379/Scrapy-1.5.1-py2.py3-none-any.whl (249kB)
100% |████████████████████████████████| 256kB 11.5MB/s
Collecting w3lib>=1.17.0 (from Scrapy)
Downloading https://files.pythonhosted.org/packages/37/94/40c93ad0cadac0f8cb729e1668823c71532fd4a7361b141aec535acb68e3/w3lib-1.19.0-py2.py3-none-any.whl
Requirement already satisfied: six>=1.5.2 in /Library/Python/2.7/site-packages (from Scrapy) (1.11.0)
Collecting cssselect>=0.9 (from Scrapy)
Downloading https://files.pythonhosted.org/packages/7b/44/25b7283e50585f0b4156960691d951b05d061abf4a714078393e51929b30/cssselect-1.0.3-py2.py3-none-any.whl
Collecting parsel>=1.1 (from Scrapy)
Downloading https://files.pythonhosted.org/packages/fd/1a/9642a5ea68763d5e7c419df0873073e54bb23d0a8897d3c78e146dd6f355/parsel-1.5.0-py2.py3-none-any.whl
Requirement already satisfied: pyOpenSSL in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from Scrapy) (0.13.1)
Collecting queuelib (from Scrapy)
Downloading https://files.pythonhosted.org/packages/4c/85/ae64e9145f39dd6d14f8af3fa809a270ef3729f3b90b3c0cf5aa242ab0d4/queuelib-1.5.0-py2.py3-none-any.whl
Collecting PyDispatcher>=2.0.5 (from Scrapy)
Downloading https://files.pythonhosted.org/packages/cd/37/39aca520918ce1935bea9c356bcbb7ed7e52ad4e31bff9b943dfc8e7115b/PyDispatcher-2.0.5.tar.gz
Collecting service-identity (from Scrapy)
Downloading https://files.pythonhosted.org/packages/29/fa/995e364220979e577e7ca232440961db0bf996b6edaf586a7d1bd14d81f1/service_identity-17.0.0-py2.py3-none-any.whl
Collecting Twisted>=13.1.0 (from Scrapy)
Downloading https://files.pythonhosted.org/packages/90/50/4c315ce5d119f67189d1819629cae7908ca0b0a6c572980df5cc6942bc22/Twisted-18.7.0.tar.bz2 (3.1MB)
100% |████████████████████████████████| 3.1MB 41kB/s
Requirement already satisfied: lxml in /Library/Python/2.7/site-packages (from Scrapy) (4.2.4)
Collecting functools32; python_version < "3.0" (from parsel>=1.1->Scrapy)
Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='pypi.org', port=443): Read timed out. (read timeout=15)",)': /simple/functools32/
Downloading https://files.pythonhosted.org/packages/c5/60/6ac26ad05857c601308d8fb9e87fa36d0ebf889423f47c3502ef034365db/functools32-3.2.3-2.tar.gz
Collecting attrs (from service-identity->Scrapy)
Downloading https://files.pythonhosted.org/packages/41/59/cedf87e91ed541be7957c501a92102f9cc6363c623a7666d69d51c78ac5b/attrs-18.1.0-py2.py3-none-any.whl
Requirement already satisfied: pyasn1 in /Library/Python/2.7/site-packages (from service-identity->Scrapy) (0.4.4)
Collecting pyasn1-modules (from service-identity->Scrapy)
Downloading https://files.pythonhosted.org/packages/19/02/fa63f7ba30a0d7b925ca29d034510fc1ffde53264b71b4155022ddf3ab5d/pyasn1_modules-0.2.2-py2.py3-none-any.whl (62kB)
100% |████████████████████████████████| 71kB 35kB/s
Collecting zope.interface>=4.4.2 (from Twisted>=13.1.0->Scrapy)
Downloading https://files.pythonhosted.org/packages/ac/8a/657532df378c2cd2a1fe6b12be3b4097521570769d4852ec02c24bd3594e/zope.interface-4.5.0.tar.gz (151kB)
100% |████████████████████████████████| 153kB 50kB/s
Collecting constantly>=15.1 (from Twisted>=13.1.0->Scrapy)
Downloading https://files.pythonhosted.org/packages/b9/65/48c1909d0c0aeae6c10213340ce682db01b48ea900a7d9fce7a7910ff318/constantly-15.1.0-py2.py3-none-any.whl
Collecting incremental>=16.10.1 (from Twisted>=13.1.0->Scrapy)
Downloading https://files.pythonhosted.org/packages/f5/1d/c98a587dc06e107115cf4a58b49de20b19222c83d75335a192052af4c4b7/incremental-17.5.0-py2.py3-none-any.whl
Collecting Automat>=0.3.0 (from Twisted>=13.1.0->Scrapy)
Downloading https://files.pythonhosted.org/packages/a3/86/14c16bb98a5a3542ed8fed5d74fb064a902de3bdd98d6584b34553353c45/Automat-0.7.0-py2.py3-none-any.whl
Collecting hyperlink>=17.1.1 (from Twisted>=13.1.0->Scrapy)
Downloading https://files.pythonhosted.org/packages/a7/b6/84d0c863ff81e8e7de87cff3bd8fd8f1054c227ce09af1b679a8b17a9274/hyperlink-18.0.0-py2.py3-none-any.whl
Collecting PyHamcrest>=1.9.0 (from Twisted>=13.1.0->Scrapy)
Downloading https://files.pythonhosted.org/packages/9a/d5/d37fd731b7d0e91afcc84577edeccf4638b4f9b82f5ffe2f8b62e2ddc609/PyHamcrest-1.9.0-py2.py3-none-any.whl (52kB)
100% |████████████████████████████████| 61kB 73kB/s
Requirement already satisfied: setuptools in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from zope.interface>=4.4.2->Twisted>=13.1.0->Scrapy) (18.5)
Collecting idna>=2.5 (from hyperlink>=17.1.1->Twisted>=13.1.0->Scrapy)
Downloading https://files.pythonhosted.org/packages/4b/2a/0276479a4b3caeb8a8c1af2f8e4355746a97fab05a372e4a2c6a6b876165/idna-2.7-py2.py3-none-any.whl (58kB)
100% |████████████████████████████████| 61kB 66kB/s
Installing collected packages: w3lib, cssselect, functools32, parsel, queuelib, PyDispatcher, attrs, pyasn1-modules, service-identity, zope.interface, constantly, incremental, Automat, idna, hyperlink, PyHamcrest, Twisted, Scrapy
Running setup.py install for functools32 ... done
Running setup.py install for PyDispatcher ... done
Found existing installation: zope.interface 4.1.1
Uninstalling zope.interface-4.1.1:
Successfully uninstalled zope.interface-4.1.1
Running setup.py install for zope.interface ... done
Running setup.py install for Twisted ... done
Successfully installed Automat-0.7.0 PyDispatcher-2.0.5 PyHamcrest-1.9.0 Scrapy-1.5.1 Twisted-18.7.0 attrs-18.1.0 constantly-15.1.0 cssselect-1.0.3 functools32-3.2.3.post2 hyperlink-18.0.0 idna-2.7 incremental-17.5.0 parsel-1.5.0 pyasn1-modules-0.2.2 queuelib-1.5.0 service-identity-17.0.0 w3lib-1.19.0 zope.interface-4.5.0
6、检查scrapy是否安装成功,输入scrapy --version
出现scrapy的版本信息,比如:Scrapy 1.5.1 - no active project即可。
alicedeMacBook-Pro:~ alice$ scrapy --version
Scrapy 1.5.1 - no active project
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
[ more ] More commands available when run from project directory
Use "scrapy <command> -h" to see more info about a command
PS:如果中途没有能够正常访问org网和使用sudo管理员权限安装,则会出现类似的错误提示
Exception:
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/pip/_internal/basecommand.py", line 141, in main
status = self.run(options, args)
File "/Library/Python/2.7/site-packages/pip/_internal/commands/install.py", line 299, in run
resolver.resolve(requirement_set)
Exception:
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/pip/_internal/basecommand.py", line 141, in main
status = self.run(options, args)
File "/Library/Python/2.7/site-packages/pip/_internal/commands/install.py", line 299, in run
resolver.resolve(requirement_set)
File "/Library/Python/2.7/site-packages/pip/_internal/resolve.py", line 102, in resolve
self._resolve_one(requirement_set, req)
File "/Library/Python/2.7/site-packages/pip/_internal/resolve.py", line 256, in _resolve_one
abstract_dist = self._get_abstract_dist_for(req_to_install)
File "/Library/Python/2.7/site-packages/pip/_internal/resolve.py", line 209, in _get_abstract_dist_for
self.require_hashes
File "/Library/Python/2.7/site-packages/pip/_internal/operations/prepare.py", line 283, in prepare_linked_requirement
progress_bar=self.progress_bar
File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 836, in unpack_url
progress_bar=progress_bar
File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 673, in unpack_http_url
progress_bar)
File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 897, in _download_http_url
_download_url(resp, link, content_file, hashes, progress_bar)
File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 617, in _download_url
hashes.check_against_chunks(downloaded_chunks)
File "/Library/Python/2.7/site-packages/pip/_internal/utils/hashes.py", line 48, in check_against_chunks
for chunk in chunks:
File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 585, in written_chunks
for chunk in chunks:
File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 574, in resp_read
decode_content=False):
File "/Library/Python/2.7/site-packages/pip/_vendor/urllib3/response.py", line 465, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/Library/Python/2.7/site-packages/pip/_vendor/urllib3/response.py", line 430, in read
raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/contextlib.py", line 35, in __exit__
self.gen.throw(type, value, traceback)
File "/Library/Python/2.7/site-packages/pip/_vendor/urllib3/response.py", line 345, in _error_catcher
raise ReadTimeoutError(self._pool, None, 'Read timed out.')
ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.
上一篇: throw和throws的区别