Python Web Scraping and Information Extraction: Reference Notes
Choosing a Python Development Tool
Text editors
- IDLE: bundled with Python; the default choice, best suited to programs of roughly 300 lines or fewer
- Notepad++
- Sublime Text: a professional editing experience
- Vim & Emacs
- Atom
- Komodo Edit
Integrated development environments (IDEs)
- PyCharm: Community edition is free, Professional edition is paid
- Wing: paid; debugging, version control, and synchronization features
- PyDev & Eclipse: open source and open; relatively complex to customize and use
- PTVS & Visual Studio: rich debugging features
- Anaconda & Spyder: open source and free; supports about 800 libraries; led by the same developer as Canopy; geared toward scientific computing
- Canopy: paid; geared toward scientific computing; supports more than 500 libraries
Setting Up the Development Environment
macOS ships with a Python 2.x environment; if you need Python 3.x, you must install it manually.
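One common way to install Python 3.x on macOS is via Homebrew, assuming Homebrew is already installed (the official installer from python.org works just as well):

```
brew install python3
python3 --version  # verify the installation
```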
Common Libraries
- Requests
Install:
pip install requests (or pip3 install requests)
Import:
import requests
Purpose: send HTTP requests and retrieve response content
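A minimal sketch of a request/response round trip (http://www.example.com is a placeholder URL):

```
import requests

r = requests.get('http://www.example.com', timeout=30)  # send a GET request
print(r.status_code)  # 200 on success
print(r.text[:200])   # first 200 characters of the response body
```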
- BeautifulSoup4
Install:
pip install beautifulsoup4 (or pip3 install beautifulsoup4)
Import:
from bs4 import BeautifulSoup
Purpose: parse HTML tags
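A small parsing sketch (the HTML snippet here is made up for illustration):

```
from bs4 import BeautifulSoup

html = '<html><body><p class="title">Hello</p><a href="/a">link</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')  # parse with the built-in parser

print(soup.p.string)           # 'Hello'
print(soup.find('a')['href'])  # '/a'
```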
-
Re(正则表达式)库是Python的标准库
用途: 调用方式:
import re
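For illustration, the pattern used by the stock spider below pulls Shanghai/Shenzhen stock codes out of arbitrary text:

```
import re

text = 'links: sh600019.html and sz000858.html'
codes = re.findall(r'[s][hz]\d{6}', text)  # codes start with sh or sz, then 6 digits
print(codes)  # ['sh600019', 'sz000858']
```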
- Scrapy
Install:
pip3 install scrapy
conda install -c conda-forge scrapy
Upgrade:
pip3 install --upgrade scrapy
Test:
scrapy -h
Usage:
- Initialize a project
scrapy startproject [projectname]
- Generate a spider (a skeleton sketch follows this list)
scrapy genspider [spidername] [url]
* Configure the generated spider
- Run the spider
scrapy crawl [spidername]
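For reference, running scrapy genspider demo example.com produces a skeleton roughly like the following (demo and example.com are hypothetical names used for illustration):

```
import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        # "Configure the generated spider" means filling in
        # the extraction logic here
        pass
```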
Hands-On Practice
General code framework
import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()              # raise an HTTPError for non-200 status codes
        r.encoding = r.apparent_encoding  # guess the encoding from the page content
        return r.text
    except Exception:
        return "An exception occurred"
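Typical usage of the framework above (the URL is a placeholder):

```
if __name__ == '__main__':
    print(getHTMLText('http://www.example.com')[:300])  # show the first 300 characters
```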
Stock Data Scrapy Crawler
- Write the spider
# -*- coding: utf-8 -*-
import scrapy
import re

class StocksSpider(scrapy.Spider):
    name = 'stocks'
    # allowed_domains = ['baidu.com']
    # Stock list page on Eastmoney (东方财富网)
    start_urls = ['http://quote.eastmoney.com/stocklist.html']

    def parse(self, response):
        self.headers = {'user-agent': 'Mozilla/5.0'}
        for href in response.css('a::attr(href)').extract():
            try:
                # match Shanghai/Shenzhen stock codes such as sh600019
                stock = re.findall(r"[s][hz]\d{6}", href)[0]
                url = 'https://gupiao.baidu.com/stock/' + stock + '.html'
                yield scrapy.Request(url, callback=self.parse_stock)
            except:
                continue

    def parse_stock(self, response):
        infoDict = {}
        stockInfo = response.css('.stock-bets')
        name = stockInfo.css('.bets-name').extract()[0]
        keyList = stockInfo.css('dt').extract()
        valueList = stockInfo.css('dd').extract()
        for i in range(len(keyList)):
            # strip the surrounding '>' and '</dt>' to get the field name
            key = re.findall(r'>.*</dt>', keyList[i])[0][1:-5]
            try:
                val = re.findall(r'\d+\.?.*</dd>', valueList[i])[0][0:-5]
            except:
                val = '--'
            infoDict[key] = val
        infoDict.update(
            {'股票名称': re.findall(r'\s.*\(', name)[0].split()[0] +
             re.findall(r'\>.*\<', name)[0][1:-1]})
        yield infoDict
- Write the pipeline
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

class PythondemoPipeline(object):
    def process_item(self, item, spider):
        return item

class BaiduStockPipeline(object):
    def open_spider(self, spider):
        # called when the spider starts; open the output file
        self.f = open('BaiduStockInfo.txt', 'w')

    def close_spider(self, spider):
        # called when the spider finishes; close the output file
        self.f.close()

    def process_item(self, item, spider):
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except:
            pass
        return item
- Configure the pipeline
Locate the settings.py file and update its configuration so that Scrapy runs BaiduStockPipeline.
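A sketch of the relevant entry, assuming the project is named pythondemo to match the PythondemoPipeline class above (the number sets pipeline execution order):

```
# settings.py (project name 'pythondemo' is assumed)
ITEM_PIPELINES = {
    'pythondemo.pipelines.BaiduStockPipeline': 300,
}
```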
- The results are as follows: