Python Web Scraping in Practice: Fetching Genshin Impact COS Images from Miyoushe with requests and Saving Them Locally - 编程实验室

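Web scraping is a widely useful skill for developers. For Python beginners, learning to fetch data through an API and automate downloads builds programming skill and lays the groundwork for later data analysis and machine learning projects. This article walks through building a complete image-scraping pipeline from scratch with Python's requests library. The goal is to fetch Genshin Impact (原神) cosplay (COS) images from Miyoushe (米游社) and save them locally automatically.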

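1. Environment Setup and Basic Concepts

Before writing the scraper, make sure the development environment is set up correctly and that a few key concepts are clear. The project requires Python 3.6 or later, because the code relies on f-strings, which were introduced in that version.

First, install the required libraries: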

pip install requests pillow

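requests: sends HTTP requests and receives the responses
pillow: Python imaging library, used here to verify that downloaded images are intact

Understanding how API endpoints work is essential. Modern websites usually serve their data through APIs that return structured JSON rather than traditional HTML pages. Our scraper will mimic a browser, send requests to these APIs, and parse the JSON they return.

Tip: Before scraping any website, check its robots.txt file and terms of use to make sure your scraper complies with the site's rules.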
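The robots.txt check mentioned in the tip can also be done in code. The following is a minimal sketch using the standard library's urllib.robotparser. The helper name is_allowed and the sample rules are assumptions made for illustration; they are not Miyoushe's actual rules.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt lines permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse() accepts an iterable of robots.txt lines
    return parser.can_fetch(user_agent, url)

# Hypothetical rules used only for this demonstration
SAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
]

print(is_allowed(SAMPLE_RULES, "https://example.com/cos/list"))
print(is_allowed(SAMPLE_RULES, "https://example.com/private/data"))
```

Against a live site, the usual pattern is to call set_url() with the site's robots.txt address and then read() on the parser instead of passing the lines by hand.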

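2. Analyzing the Target Site and Its API

Successful scraping depends on locating where the site's data actually comes from. For a modern site like Miyoushe, the browser developer tools can be used to find the API endpoints.

Open Chrome and visit a Miyoushe COS page
Press F12 to open the developer tools and switch to the "Network" tab
Refresh the page, watch the network requests, and filter by the XHR type
Look for API responses that contain image data

Through this analysis, we may find an API endpoint similar to the following: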

https://api.miyoushe.com/cos/v1/list?game=genshin&page=1&size=20

An endpoint like this typically returns JSON containing the image information, structured roughly as follows:

{
  "code": 0,
  "message": "success",
  "data": {
    "list": [
      {
        "id": "12345",
        "title": "原神甘雨COS",
        "images": [
          "https://upload.miyoushe.com/cos/12345/1.jpg",
          "https://upload.miyoushe.com/cos/12345/2.jpg"
        ],
        "author": {
          "name": "某COSER"
        }
      }
    ],
    "total": 100
  }
}
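To make the structure concrete, here is a small sketch that walks a response of this shape and collects the image URLs. It assumes the field names shown in the sample (code, data, list, images, author); a real response may differ. The function name extract_image_urls is made up for this example.

```python
import json

# A copy of the sample response above, used as offline test data
SAMPLE_RESPONSE = """
{
  "code": 0,
  "message": "success",
  "data": {
    "list": [
      {
        "id": "12345",
        "title": "原神甘雨COS",
        "images": [
          "https://upload.miyoushe.com/cos/12345/1.jpg",
          "https://upload.miyoushe.com/cos/12345/2.jpg"
        ],
        "author": {"name": "某COSER"}
      }
    ],
    "total": 100
  }
}
"""

def extract_image_urls(payload):
    """Return (title, author, url) tuples from a payload shaped like the sample."""
    if payload.get("code") != 0:
        return []  # a non-zero code is treated as an error response
    results = []
    for item in payload.get("data", {}).get("list", []):
        author = item.get("author", {}).get("name", "unknown")
        for url in item.get("images", []):
            results.append((item.get("title", ""), author, url))
    return results

payload = json.loads(SAMPLE_RESPONSE)
for title, author, url in extract_image_urls(payload):
    print(title, author, url)
```

Using .get() with defaults keeps the loop from crashing when a post has no images or no author field.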

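3. Implementing the Basic Scraper

With the API information in hand, we can write the core scraper code. Create a Python file, for example miyoushe_crawler.py, and import the required libraries: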

import json
import os
from io import BytesIO

import requests
from PIL import Image

Next, we define a few key functions:

Function to fetch API data:

def fetch_api_data(page=1, size=20):
    """Fetch one page of results and return the parsed JSON, or None on failure."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    url = f'https://api.miyoushe.com/cos/v1/list?game=genshin&page={page}&size={size}'
    try:
        # A timeout keeps the call from hanging forever on a stalled connection
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None

Function to download an image:

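def download_image(url, save_path):
    """Download one image, check that it is a valid image, and write it to save_path."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # treat any non-2xx status as a failure
        # Verify integrity: Pillow raises an exception if the data is not a valid image
        Image.open(BytesIO(response.content)).verify()
        # Write the original bytes so the file is not re-encoded
        with open(save_path, 'wb') as f:
            f.write(response.content)
        print(f"Image saved: {save_path}")
        return True
    except Exception as e:
        print(f"Download failed {url}: {e}")
        return False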

Main processing function:

def process_cos_item(item, base_dir="downloads"):
    """Download every image of one COS post into its own directory."""
    title = item['title'].replace('/', '_')  # avoid path problems
    author = item['author']['name'].replace('/', '_')
    save_dir = os.path.join(base_dir, f"{author}_{title}")
    # exist_ok=True makes a separate existence check unnecessary
    os.makedirs(save_dir, exist_ok=True)
    for idx, img_url in enumerate(item['images'], 1):
        ext = os.path.splitext(img_url)[1] or '.jpg'
        save_path = os.path.join(save_dir, f"{idx}{ext}")
        download_image(img_url, save_path)
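Replacing only '/' leaves other characters that are invalid in file names on some systems, and taking the extension straight from the URL can go wrong when the URL carries a query string. The sketch below shows one possible way to handle both. The helper names safe_name and extension_from_url are hypothetical and not part of the original scraper.

```python
import os
import re
from urllib.parse import urlparse

# Characters that Windows rejects in file names, plus ASCII control characters
_UNSAFE_CHARS = re.compile(r'[\\/:*?"<>|\x00-\x1f]')

def safe_name(text, fallback="untitled", max_length=80):
    """Replace unsafe characters with '_' and trim the result to a sane length."""
    cleaned = _UNSAFE_CHARS.sub("_", text).strip(" .")
    return cleaned[:max_length] or fallback

def extension_from_url(url, default=".jpg"):
    """Take the extension from the URL path only, ignoring any query string."""
    path = urlparse(url).path
    return os.path.splitext(path)[1] or default

print(safe_name('a/b:c?'))
print(extension_from_url("https://upload.miyoushe.com/cos/12345/1.jpg?w=1.5"))
```

These could stand in for the .replace('/', '_') calls and the os.path.splitext(img_url) call in process_cos_item.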

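4. Advanced Features and Optimization

With the basics working, we can add some more advanced features and optimizations to make the scraper more stable and efficient.

4.1 Pagination and Incremental Crawling

Most APIs paginate their data, so we need to handle multiple pages: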

def crawl_multiple_pages(start_page=1, end_page=5):
    for page in range(start_page, end_page + 1):
        print(f"Processing page {page}...")
        data = fetch_api_data(page=page)
        if data and data.get('code') == 0:
            for item in data['data']['list']:
                process_cos_item(item)
        else:
            print(f"Page {page} failed to load or returned no data")
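The pagination loop above fetches every post on every run. For incremental crawling, one simple approach is to remember which post IDs have already been handled and skip them next time. The sketch below assumes each item has a unique "id" field, as in the sample response. The file format and the function names are illustrative choices.

```python
import json
import os

def load_seen_ids(path):
    """Load the set of already-processed post IDs, or an empty set on first run."""
    if not os.path.exists(path):
        return set()
    with open(path, "r", encoding="utf-8") as f:
        return set(json.load(f))

def save_seen_ids(path, seen_ids):
    """Persist the processed IDs as a sorted JSON list."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sorted(seen_ids), f)

def filter_new_items(items, seen_ids):
    """Return only the items whose id has not been processed yet."""
    return [item for item in items if item["id"] not in seen_ids]
```

Inside the page loop, the caller would run filter_new_items on data['data']['list'], process each returned item, add its id to the set once its downloads have succeeded, and call save_seen_ids at the end of the run.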

4.2 Error Handling and Retries

Network requests can fail for many reasons, so solid error handling and a retry mechanism are essential:

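import time

def safe_fetch(url, max_retries=3, **kwargs):
    """GET a URL, retrying failed requests with exponential backoff."""
    kwargs.setdefault('timeout', 10)
    for attempt in range(max_retries):
        try:
            response = requests.get(url, **kwargs)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as e:
            if attempt == max_retries - 1:
                raise  # out of retries: propagate the last error
            print(f"Request failed (attempt {attempt + 1}/{max_retries}): {e}")
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
    return None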
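The same backoff idea can be separated from requests so that it works with any callable and can be tested without a network. The following is a sketch under those assumptions. The names backoff_delay and retry_call, the delay cap, and the random jitter are additions for illustration and do not appear in the original code.

```python
import random
import time

def backoff_delay(attempt, base=1.0, cap=30.0, jitter=0.5):
    """Delay before the next try: base * 2**attempt, capped, plus a little random jitter."""
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0, jitter)

def retry_call(func, max_retries=3, exceptions=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds or max_retries attempts have failed."""
    for attempt in range(max_retries):
        try:
            return func()
        except exceptions:
            if attempt == max_retries - 1:
                raise  # no attempts left: let the caller see the error
            sleep(backoff_delay(attempt))
```

Passing sleep as a parameter lets a test substitute a stub, and the jitter keeps several workers that failed at the same moment from all retrying at once.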

4.3 Concurrent Downloads

Because image downloads are I/O-bound, multithreading can speed them up considerably:

from concurrent.futures import ThreadPoolExecutor

def concurrent_download(urls, save_dir, max_workers=5):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = []
        for idx, url in enumerate(urls, 1):
            ext = os.path.splitext(url)[1] or '.jpg'
            save_path = os.path.join(save_dir, f"{idx}{ext}")
            futures.append(executor.submit(download_image, url, save_path))
        # Wait for every download; result() re-raises any exception from the worker
        for future in futures:
            try:
                future.result()
            except Exception as e:
                print(f"Error during download: {e}")
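A variation on the function above is to handle results as they finish and collect the URLs that failed, so they can be retried later. This sketch uses concurrent.futures.as_completed. The download_all name is hypothetical, and the download argument is assumed to behave like download_image, returning True on success.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_all(tasks, download, max_workers=5):
    """Run download(url, path) for each (url, path) task and return the failed URLs."""
    failed = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its URL so failures can be reported
        future_to_url = {
            executor.submit(download, url, path): url for url, path in tasks
        }
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                if not future.result():
                    failed.append(url)  # worker reported failure
            except Exception:
                failed.append(url)  # worker raised an exception
    return failed
```

Because the download function is passed in, the helper can be tried out with a stub before pointing it at real URLs.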

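5. Anti-Scraping Measures and Ethical Considerations

When building a scraper, we have to take the target site's anti-scraping mechanisms into account and follow the ethical norms of web scraping.

5.1 Common Anti-Scraping Measures and Responses

User-Agent checks: send a common browser User-Agent string
Rate limiting: add an appropriate delay between requests
IP bans: use a proxy IP pool (only where the site's rules permit it)
CAPTCHAs: consider a CAPTCHA-recognition service or solving them manually

Example code for spacing out requests: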

import random
import time

def delayed_request(url, min_delay=1, max_delay=3):
    """Sleep for a random interval, then send the request."""
    delay = random.uniform(min_delay, max_delay)
    time.sleep(delay)
    return requests.get(url, timeout=10)
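A per-call delay limits the rate only when requests run one after another. With several download threads, each thread sleeps independently and the combined rate can still be high. One possible remedy is a shared throttle that spaces out calls across all threads. The Throttle class below is a sketch written for this article's scenario and is not a library API.

```python
import threading
import time

class Throttle:
    """Space calls at least `min_interval` seconds apart across all threads."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = float("-inf")  # the first call never waits

    def wait(self):
        """Block until this caller's slot arrives; return the seconds waited."""
        with self._lock:
            now = self._clock()
            wait_for = max(0.0, self._next_allowed - now)
            # Reserve the following slot before releasing the lock
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if wait_for > 0:
            self._sleep(wait_for)
        return wait_for
```

Each worker would call throttle.wait() on one shared Throttle instance just before requests.get. The sleep happens outside the lock, so waiting threads do not block one another from reserving their slots.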

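5.2 Scraping Ethics

Respect the site's robots.txt rules
Limit your request rate so you do not put excessive load on the server
Scrape only publicly available data and do not try to bypass authentication
Do not scrape personal or sensitive information
Follow the site's terms of service

Important: In real projects, add clear comments in the code stating where the data comes from, and consider contacting the site administrators for formal permission before scraping.

6. Extensions and Next Steps

Once the basic scraper is working, consider the following directions:

Data storage:
- Store metadata in a database such as MySQL or MongoDB
- Deduplicate to avoid downloading the same image twice
Image processing:
- Generate thumbnails automatically
- Add watermarks or apply simple image enhancement
Distributed crawling:
- Build more complex crawlers with the Scrapy framework
- Use Redis to implement a distributed task queue
Automated deployment:
- Containerize the scraper with Docker
- Set up scheduled jobs to run it automatically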

# Simple deduplication example
import hashlib

def get_file_md5(file_path):
    with open(file_path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

def is_duplicate(image_data, existing_hashes):
    current_hash = hashlib.md5(image_data).hexdigest()
    return current_hash in existing_hashes
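To show how such hashes might be used during a crawl, the sketch below builds a hash index from files already on disk and writes a new image only when its content has not been seen. The function names build_hash_index and save_if_new are hypothetical. MD5 serves here only as a content fingerprint, not for security, and it catches only byte-identical files.

```python
import hashlib
import os

def build_hash_index(directory):
    """Hash every file under directory and return the set of MD5 digests."""
    hashes = set()
    for root, _dirs, files in os.walk(directory):
        for name in files:
            with open(os.path.join(root, name), "rb") as f:
                hashes.add(hashlib.md5(f.read()).hexdigest())
    return hashes

def save_if_new(image_data, save_path, existing_hashes):
    """Write image_data to save_path unless identical content was already saved."""
    digest = hashlib.md5(image_data).hexdigest()
    if digest in existing_hashes:
        return False  # duplicate content: skip writing
    with open(save_path, "wb") as f:
        f.write(image_data)
    existing_hashes.add(digest)
    return True
```

The index would be built once at startup from the downloads directory and then passed to every save_if_new call.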

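In practice, I have found that sensible request headers and request intervals are critical to a scraper's stability. Some sites are very sensitive to frequent requests, while others are more lenient. During development, start testing with longer intervals and adjust gradually based on how the site responds.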