Fetching information on all of a GitHub user's repositories with the Scrapy crawling framework

This article is based on the Shiyanlou course "Scrapy crawler framework: basic practice and challenges": https://www.shiyanlou.com/courses/1417
Source code: https://github.com/shiyanlou/louplus-dm/tree/v2/Answers/week1-challenge-05

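How to use:

Download the project source code.
Open the github_repositories_autonext file in the spiders directory and change the GitHub repository owner's name in start_urls, for example change https://github.com/shiyanlou?tab=repositories to https://github.com/huanyouchen?tab=repositories
In the project directory, run the command scrapy crawl github_repositories_autonext
Check the contents of the csv file once the crawl has finished.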

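Basic structure of Scrapy

The Scrapy architecture is shown below:

[Figure: Scrapy architecture]

Scrapy's components include:

Scrapy Engine: the engine that handles the system's data flow and events.
Scheduler and Scheduler Middlewares: schedule the requests sent over by the engine.
Downloader and Downloader Middlewares: the downloader that fetches page content.
Spider: the crawler itself, which handles the domain-specific crawl rules and page parsing.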
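
To make the division of labour between these components concrete, here is a toy model in plain Python. It is only a sketch of the idea, not how Scrapy is implemented: the FAKE_PAGES dictionary, the function names, and the use of a plain string in place of a Request object are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (repo names on the page, next page or None).
FAKE_PAGES = {
    "page1": (["repo-a", "repo-b"], "page2"),
    "page2": (["repo-c"], None),
}


def download(url):
    """Downloader: fetch the 'response' for a URL (here, a dictionary lookup)."""
    return FAKE_PAGES[url]


def parse(response):
    """Spider: yield extracted items, then a follow-up request if there is one."""
    names, next_url = response
    for name in names:
        yield {"repo_name": name}
    if next_url is not None:
        yield next_url  # a plain string stands in for a Request object


def run_engine(start_urls):
    """Engine: move requests and responses between the other components."""
    scheduler = deque(start_urls)  # Scheduler: queue of pending requests
    collected = []                 # stands in for the item pipeline
    while scheduler:
        response = download(scheduler.popleft())
        for result in parse(response):
            if isinstance(result, dict):
                collected.append(result)   # an item: hand it to the pipeline
            else:
                scheduler.append(result)   # a new request: queue it
    return collected


items = run_engine(["page1"])
print(items)
```

The spider never calls the downloader directly; it only yields items and further requests, and the engine decides what happens to each. The real framework follows the same pattern, with far more machinery around it.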

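Basic use of Scrapy consists of the following steps:

Initialize a Scrapy project.
Implement an Item, the container class that stores the extracted information.
Implement a Spider, the crawler class that fetches the data.
Extract data from the HTML pages into the Item.
Implement an Item Pipeline to save the Item data.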

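Crawl target

Target user ID: shiyanlou

Target page: https://github.com/shiyanlou?tab=repositories

Fetch the names of all repositories of the given GitHub user, together with each repository's update time, and save the scraped data as a csv file.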

Initialize the Scrapy project

# Create a directory for the crawler demo
mkdir scrapy-demo
cd scrapy-demo

# Initialize the project
scrapy startproject get_github_repositories
cd get_github_repositories/
scrapy genspider github_repositories github.com

Here the project name is get_github_repositories and the spider name is github_repositories.

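Implement the Item, the container class that stores the extracted information

Since the goal is to get each repository's name and update time, the Item (in items.py) is defined as:

import scrapy


class GetGithubRepositoriesItem(scrapy.Item):
    # define the fields for your item here like:
    repo_name = scrapy.Field()
    update_time = scrapy.Field()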

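Implement the Spider, the crawler class that fetches the data

Analyze the structure of the target page to locate the repository names and update times.

[Figure: GitHub user repositories page]

Each repository's name and update time sit inside a ul->li list. The name is in an a tag whose attribute is itemprop='name codeRepository', and the update time is in a relative-time element. Both can be extracted with XPath.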

def parse(self, response):
    # Each repository is one <li itemprop="owns"> element
    repos = response.xpath('//li[@itemprop="owns"]')
    for repo in repos:
        item = GetGithubRepositoriesItem()
        # default='' keeps .strip() from failing if the link is missing
        item['repo_name'] = repo.xpath(".//a[@itemprop='name codeRepository']/text()").extract_first(default='').strip()
        item['update_time'] = repo.xpath(".//relative-time/@datetime").extract_first()

        yield item
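
The XPath expressions can be tried without network access on a small hand-written snippet. The sketch below uses xml.etree.ElementTree from the standard library as a stand-in for Scrapy's selectors; it supports only a subset of XPath and needs well-formed markup, so it is an approximation rather than the same engine. The sample markup and the function name extract_repos are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified markup that mimics the structure described above.
SAMPLE = """
<ul>
  <li itemprop="owns">
    <a itemprop="name codeRepository" href="/shiyanlou/demo-one">
      demo-one
    </a>
    <relative-time datetime="2019-01-02T03:04:05Z">Jan 2, 2019</relative-time>
  </li>
  <li itemprop="owns">
    <a itemprop="name codeRepository" href="/shiyanlou/demo-two">
      demo-two
    </a>
    <relative-time datetime="2018-06-07T08:09:10Z">Jun 7, 2018</relative-time>
  </li>
</ul>
"""


def extract_repos(markup):
    """Return one dict per repository found in the markup."""
    root = ET.fromstring(markup)
    rows = []
    for repo in root.findall(".//li[@itemprop='owns']"):
        link = repo.find(".//a[@itemprop='name codeRepository']")
        stamp = repo.find(".//relative-time")
        rows.append({
            "repo_name": link.text.strip(),         # link text carries whitespace
            "update_time": stamp.get("datetime"),   # machine-readable timestamp
        })
    return rows


repos = extract_repos(SAMPLE)
print(repos)
```

The name needs strip() because the link text is surrounded by whitespace, while the timestamp is read from the datetime attribute rather than from the human-readable text.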

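After the first page has been fetched, the crawler has to page forward to collect the repositories on the remaining pages. Clicking "Next" and looking at the URL shows:

https://github.com/shiyanlou?after=Y3Vyc29yOnYyOpK5MjAxNC0xMC0xM1QxMToxNTo0NCswODowMM4Bf5tW&tab=repositories

The address of the next page is controlled by the after parameter. The key is to obtain the value of after and put it into the URL, which then returns the next page.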
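
As a small illustration of how the after parameter fits into the URL, the following sketch pulls the cursor out of a next-page URL and rebuilds the URL from it using only urllib.parse. The helper names repositories_url and after_cursor are hypothetical; the URL is the one quoted above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def repositories_url(user, after=None):
    """Build the repository-list URL, adding `after` only when a cursor is given."""
    params = {}
    if after:
        params["after"] = after
    params["tab"] = "repositories"
    return "https://github.com/{}?{}".format(user, urlencode(params))


def after_cursor(url):
    """Return the value of the `after` query parameter, or None if it is absent."""
    values = parse_qs(urlsplit(url).query).get("after")
    return values[0] if values else None


next_url = ("https://github.com/shiyanlou"
            "?after=Y3Vyc29yOnYyOpK5MjAxNC0xMC0xM1QxMToxNTo0NCswODowMM4Bf5tW"
            "&tab=repositories")
cursor = after_cursor(next_url)
print(cursor)
print(repositories_url("shiyanlou", cursor) == next_url)
print(repositories_url("shiyanlou"))
```

The cursor is treated as an opaque string here: it is copied from one URL into the next without being interpreted.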

The first method is to copy the after parameter of every page by hand, put the values in a list, and iterate over it:

@property
def start_urls(self):
    url_temp = 'https://github.com/shiyanlou?after={}&tab=repositories'
    after = [
        "",   # the first page has no after value
        "Y3Vyc29yOnYyOpK5MjAxNy0wNi0wN1QwODozMjo1OCswODowMM4FkpUU",
        "Y3Vyc29yOnYyOpK5MjAxNS0wMi0xMFQxMzowODo0NyswODowMM4B0o4T",
        "Y3Vyc29yOnYyOpK5MjAxNC0xMi0wN1QyMjoxMTozNSswODowMM4Bpo1E",
        "Y3Vyc29yOnYyOpK5MjAxNC0xMC0xM1QxMToxNTo0NCswODowMM4Bf5tW"
    ]

    # One start URL per page
    return (url_temp.format(i) for i in after)

Implement the Item Pipeline to save the Item data

The scraped data is saved to a csv file with Pandas. This requires setting up the corresponding actions when the spider starts and when it closes.

import pandas as pd


class GetGithubRepositoriesPipeline(object):
    def process_item(self, item, spider):
        # Read the fields from the item and keep them as one row.
        # (DataFrame.append was removed in pandas 2.0, so rows are
        # collected in a list and turned into a DataFrame at the end.)
        self.rows.append([item['repo_name'], item['update_time']])

        return item

    # Called when the spider starts
    def open_spider(self, spider):
        # Start with an empty list of rows
        self.rows = []

    # Called when the spider closes
    def close_spider(self, spider):
        # Build the DataFrame, sort newest first, and save it as a csv file
        df = pd.DataFrame(self.rows, columns=['repo_name', 'update_time'])
        df = df.sort_values(by=['update_time'], ascending=False)
        df.to_csv("../shiyanlou_repo.csv")
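
The open_spider / process_item / close_spider lifecycle can also be exercised by hand, outside Scrapy. The sketch below is a hypothetical pandas-free variant that uses the standard csv module instead; the class name CsvRepositoriesPipeline, the path argument, and the sample items are assumptions for illustration. It also assumes all timestamps share one ISO 8601 format and time zone, so sorting them as strings gives chronological order.

```python
import csv
import os
import tempfile


class CsvRepositoriesPipeline:
    """Hypothetical pandas-free variant exposing the same three hooks."""

    def __init__(self, path="shiyanlou_repo.csv"):
        self.path = path

    def open_spider(self, spider):
        # Called once at start: begin with an empty list of rows
        self.rows = []

    def process_item(self, item, spider):
        # Called for every item: remember it and pass it along
        self.rows.append((item["repo_name"], item["update_time"]))
        return item

    def close_spider(self, spider):
        # Called once at the end: sort newest first and write the csv file
        self.rows.sort(key=lambda row: row[1], reverse=True)
        with open(self.path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["repo_name", "update_time"])
            writer.writerows(self.rows)


# Drive the pipeline by hand, the way Scrapy would during a crawl.
path = os.path.join(tempfile.mkdtemp(), "repos.csv")
pipeline = CsvRepositoriesPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"repo_name": "old-repo", "update_time": "2014-10-13T03:15:44Z"}, spider=None)
pipeline.process_item({"repo_name": "new-repo", "update_time": "2019-01-02T03:04:05Z"}, spider=None)
pipeline.close_spider(spider=None)

with open(path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows)
```

Calling the hooks by hand like this is a quick way to check a pipeline's output format before running a full crawl.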

Finally, in settings.py, uncomment the following:

ITEM_PIPELINES = {
    'get_github_repositories.pipelines.GetGithubRepositoriesPipeline': 300,
}

and change ROBOTSTXT_OBEY to False:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

Run the crawler

In the project directory, run:

scrapy crawl github_repositories

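The second method: obtain the after parameter automatically

In the project directory, create another spider that obtains the after parameter automatically:

scrapy genspider github_repositories_autonext github.com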

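Use Chrome's developer tools to inspect the HTML behind the Next button.

When there is a next page (the first block is from the first page, the second from a middle page):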


<div class="BtnGroup" data-test-selector="pagination">
 <button class="btn btn-outline BtnGroup-item" disabled="disabled">Previous</button>
 <a rel="nofollow" class="btn btn-outline BtnGroup-item" href="https://github.com/shiyanlou?after=Y3Vyc29yOnYyOpK5MjAxNy0wNi0wN1QwODozMjo1OCswODowMM4FkpUU&amp;tab=repositories">Next</a>
</div>


<div class="BtnGroup" data-test-selector="pagination">
 <a rel="nofollow" class="btn btn-outline BtnGroup-item" href="https://github.com/shiyanlou?before=Y3Vyc29yOnYyOpK5MjAxNy0wNi0wN1QwODozMDowOSswODowMM4FkpM6&amp;tab=repositories">Previous</a>
 <a rel="nofollow" class="btn btn-outline BtnGroup-item" href="https://github.com/shiyanlou?after=Y3Vyc29yOnYyOpK5MjAxNS0wMi0xMFQxMzowODo0NyswODowMM4B0o4T&amp;tab=repositories">Next</a>
</div>

When there is no next page:

<div class="BtnGroup" data-test-selector="pagination">
   <a rel="nofollow" class="btn btn-outline BtnGroup-item" href="https://github.com/shiyanlou?before=Y3Vyc29yOnYyOpK5MjAxNC0xMC0xMVQwNDowMDoyMiswODowMM4Bgd5Q&amp;tab=repositories">Previous</a>
   <button class="btn btn-outline BtnGroup-item" disabled="disabled">Next</button>
  </div>

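From this we can see:

If the Next button is not disabled, there is a next page, and its after parameter is in the href attribute of the a tag.
If the Next button is disabled, there is no next page; in that case Next is a button element carrying the disabled attribute.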

# -*- coding: utf-8 -*-
import scrapy
# Item class defined earlier in this project's items.py
from get_github_repositories.items import GetGithubRepositoriesItem


class GithubSpider(scrapy.Spider):
    # Must match the name used with "scrapy crawl"
    name = 'github_repositories_autonext'
    allowed_domains = ['github.com']

    @property
    def start_urls(self):
        return ('https://github.com/shiyanlou?tab=repositories', )

    def parse(self, response):
        repos = response.xpath('//li[@itemprop="owns"]')
        for repo in repos:
            item = GetGithubRepositoriesItem()
            item['repo_name'] = repo.xpath(".//a[@itemprop='name codeRepository']/text()").extract_first(default='').strip()
            item['update_time'] = repo.xpath(".//relative-time/@datetime").extract_first()

            yield item

        # If Next is an enabled <a> link, there is a next page. On the last
        # page Next is a disabled <button>, so no href is found and the crawl stops.
        next_url = response.xpath('//div[@data-test-selector="pagination"]/a[text()="Next"]/@href').extract_first()
        if next_url:
            yield response.follow(next_url, callback=self.parse)
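
The pagination rule stated above (Next is a link when there is a next page and a disabled button otherwise) can be checked offline against the HTML snippets shown earlier. This sketch uses xml.etree.ElementTree from the standard library rather than Scrapy's selectors, which works here only because the snippets happen to be well-formed; the function name next_page_href is hypothetical.

```python
import xml.etree.ElementTree as ET


def next_page_href(pagination_markup):
    """Return the href of the Next link, or None when Next is a disabled button."""
    root = ET.fromstring(pagination_markup)
    for link in root.findall("a"):          # only <a> children carry an href
        if (link.text or "").strip() == "Next":
            return link.get("href")
    return None


# Pagination block from the first page (copied from the snippet above).
HAS_NEXT = """<div class="BtnGroup" data-test-selector="pagination">
 <button class="btn btn-outline BtnGroup-item" disabled="disabled">Previous</button>
 <a rel="nofollow" class="btn btn-outline BtnGroup-item" href="https://github.com/shiyanlou?after=Y3Vyc29yOnYyOpK5MjAxNy0wNi0wN1QwODozMjo1OCswODowMM4FkpUU&amp;tab=repositories">Next</a>
</div>"""

# Pagination block from the last page (copied from the snippet above).
LAST_PAGE = """<div class="BtnGroup" data-test-selector="pagination">
 <a rel="nofollow" class="btn btn-outline BtnGroup-item" href="https://github.com/shiyanlou?before=Y3Vyc29yOnYyOpK5MjAxNC0xMC0xMVQwNDowMDoyMiswODowMM4Bgd5Q&amp;tab=repositories">Previous</a>
 <button class="btn btn-outline BtnGroup-item" disabled="disabled">Next</button>
</div>"""

print(next_page_href(HAS_NEXT))
print(next_page_href(LAST_PAGE))
```

Matching on the link text "Next" rather than on position avoids picking up the Previous link on the last page, where it is the only a element left in the block.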

In the project directory, run:

scrapy crawl github_repositories_autonext