
Python Web Scraping: From Getting Started to Giving Up

2017-06-15
WHS

Reference notes on Python web scraping and information extraction

Choosing Python Development Tools

Text editors

  • IDLE — bundled with Python; the default choice, best for programs under roughly 300 lines
  • Notepad++
  • Sublime Text — a professional editing experience
  • Vim & Emacs
  • Atom
  • Komodo Edit

Integrated environments (IDEs)

  • PyCharm — free Community edition; paid Professional edition

  • Wing — commercial; debugging, version control, and synchronization features

  • PyDev & Eclipse — open source; customization is relatively involved

  • PTVS & Visual Studio — rich debugging features

  • Anaconda & Spyder — open source and free; bundles some 800 libraries; led by the same developer as Canopy; geared toward scientific computing

  • Canopy — commercial; geared toward scientific computing; supports more than 500 libraries

Setting Up the Development Environment

macOS ships with a Python 2.x environment; if you need Python 3.x, you have to install it yourself.
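To see what is already on the machine and add Python 3 alongside the system interpreter (this sketch assumes Homebrew is installed):

    python --version      # the interpreter macOS ships with, typically 2.7.x
    brew install python3  # one way to install Python 3 (assumes Homebrew)
    python3 --version     # confirm the new interpreter is on the PATH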

Commonly Used Libraries

  • Requests

    Install:

    pip  install requests
    pip3 install requests
    

    Import: import requests

    Purpose: send HTTP requests and retrieve the response content
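    A minimal fetch, sketched with example.com as a placeholder URL:

        import requests

        r = requests.get('http://example.com', timeout=30)
        print(r.status_code)  # 200 on success
        print(r.text[:200])   # first 200 characters of the body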

  • BeautifulSoup

    Install:

    pip  install beautifulsoup4
    pip3 install beautifulsoup4
    

    Import: from bs4 import BeautifulSoup

    Purpose: parse HTML documents
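    A minimal parsing sketch, using a made-up HTML snippet:

        from bs4 import BeautifulSoup

        html = '<html><body><p class="title">Hello</p></body></html>'
        soup = BeautifulSoup(html, 'html.parser')
        print(soup.p.string)            # Hello
        print(soup.find('p')['class'])  # ['title']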

  • re (regular expressions) — part of the Python standard library

    Import: import re

    Purpose: match and extract text with regular expression patterns
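    For example, the pattern the stock spider below uses to pick out stock codes:

        import re

        # an sh/sz exchange prefix followed by a six-digit code
        codes = re.findall(r'[s][hz]\d{6}', 'sh600000, sz000001, about.html')
        print(codes)  # ['sh600000', 'sz000001']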

  • Scrapy — a crawling framework

    Install: pip3 install scrapy (or, via conda: conda install -c conda-forge scrapy)

    Upgrade: pip3 install --upgrade scrapy

    Verify: scrapy -h

    Usage:

    • Initialize a project (see the layout sketch after these steps)

    scrapy startproject [projectname]

    • Generate a spider

    scrapy genspider [spidername] [domain]

    • Configure the generated spider

    • Run the spider

    scrapy crawl [spidername]
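    For instance, scrapy startproject pythondemo generates a skeleton like the one below (the name pythondemo is assumed here to match the PythondemoPipeline class generated later; newer Scrapy versions also add a middlewares.py):

        pythondemo/
            scrapy.cfg           # deployment configuration
            pythondemo/
                __init__.py
                items.py         # item definitions
                pipelines.py     # item pipelines
                settings.py      # project settings
                spiders/
                    __init__.py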

Hands-On Practice

A General-Purpose Code Framework

    import requests

    def getHTMLText(url):
        """Fetch a URL and return the page text, or an error message."""
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()              # raise HTTPError for 4xx/5xx
            r.encoding = r.apparent_encoding  # infer encoding from the content
            return r.text
        except requests.RequestException:
            return 'An exception occurred'
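Calling it, again with example.com as a placeholder:

    print(getHTMLText('http://example.com')[:100])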

A Scrapy Spider for Stock Data

  • Write the spider
    # -*- coding: utf-8 -*-
    import scrapy
    import re


    class StocksSpider(scrapy.Spider):
        name = 'stocks'
        # allowed_domains = ['baidu.com']
        # stock listing page on East Money (eastmoney.com)
        start_urls = ['http://quote.eastmoney.com/stocklist.html']

        def parse(self, response):
            # send a browser-like User-Agent with each detail request
            headers = {'user-agent': 'Mozilla/5.0'}
            # scan every link for an sh/sz stock code and request its page
            for href in response.css('a::attr(href)').extract():
                try:
                    stock = re.findall(r'[s][hz]\d{6}', href)[0]
                    url = 'https://gupiao.baidu.com/stock/' + stock + '.html'
                    yield scrapy.Request(url, headers=headers,
                                         callback=self.parse_stock)
                except IndexError:
                    continue

        def parse_stock(self, response):
            infoDict = {}
            stockInfo = response.css('.stock-bets')
            name = stockInfo.css('.bets-name').extract()[0]
            keyList = stockInfo.css('dt').extract()
            valueList = stockInfo.css('dd').extract()
            # pair each <dt> label with its <dd> value
            for i in range(len(keyList)):
                key = re.findall(r'>.*</dt>', keyList[i])[0][1:-5]
                try:
                    val = re.findall(r'\d+\.?.*</dd>', valueList[i])[0][0:-5]
                except IndexError:
                    val = '--'
                infoDict[key] = val

            # '股票名称' means 'stock name'
            infoDict.update(
                {'股票名称': re.findall(r'\s.*\(', name)[0].split()[0] +
                 re.findall(r'\>.*\<', name)[0][1:-1]})
            yield infoDict
  
  • Write the pipeline
    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


    class PythondemoPipeline(object):
        def process_item(self, item, spider):
            return item


    class BaiduStockPipeline(object):
        def open_spider(self, spider):
            # called once when the spider opens: create the output file
            self.f = open('BaiduStockInfo.txt', 'w')

        def close_spider(self, spider):
            # called once when the spider closes
            self.f.close()

        def process_item(self, item, spider):
            # write each scraped item as one dict per line
            try:
                line = str(dict(item)) + '\n'
                self.f.write(line)
            except Exception:
                pass
            return item
  • Configure the pipeline

Locate the settings.py file and update the configuration.
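Concretely, the pipeline has to be registered in ITEM_PIPELINES; the dotted path below assumes the project is named pythondemo (matching the generated PythondemoPipeline class above), and the integer sets the execution order:

    # in settings.py
    ITEM_PIPELINES = {
        'pythondemo.pipelines.BaiduStockPipeline': 300,
    }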

  • The results are written to BaiduStockInfo.txt, one item per line.
