5 Practical Methods for Collecting Xiaohongshu Data with Python from Scratch - 编程实验室 (Programming Lab)


[Free download link] xhs: a request wrapper built on the Xiaohongshu web client. https://reajason.github.io/xhs/ Project address: https://gitcode.com/gh_mirrors/xh/xhs

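Have you ever wanted to analyze popular notes on Xiaohongshu but not known how to get the data? Felt lost in the face of complex API endpoints and frequently changing parameters? Wanted to spot content trends through data analysis, only to be blocked by the technical barrier? This article covers 5 practical methods for collecting Xiaohongshu data, taking you from beginner to competent collector. Using Python and the characteristics of the Xiaohongshu API, we build a complete data collection workflow from scratch so that you can obtain and analyze useful data from the platform.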

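Basic Information Collection: Building a Xiaohongshu API Request Framework

Have you tried calling the API only to keep getting 403 errors? Or received JSON data and not known how to extract the key fields? Basic information collection is the first step of any data analysis, so let's start by setting up a stable API request framework.

Beginner-friendliness: ★★★★☆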

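    # Import the client class from the xhs package
    from xhs import XhsClient

    # Initialize the client
    client = XhsClient(
        cookie="your cookie string",
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
    )

    # Verify the connection
    try:
        # Fetch the recommended notes list.
        # Method name as given in this article; confirm it against the
        # xhs documentation for your installed version.
        notes = client.get_recommended_notes()
        print(f"Fetched the recommended notes list: {len(notes)} notes")
        if notes:
            print(f"First note title: {notes[0]['title']}")
    except Exception as e:
        print(f"Connection failed: {e}")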

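Code Walkthrough

This code creates a basic Xiaohongshu API client. It does the following:

Initializes the client with the XhsClient class, which requires a valid cookie and User-Agent
Calls get_recommended_notes() to fetch the recommended notes list
Uses exception handling so that a failed API call does not crash the program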
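
Because the client depends on a valid cookie, it can help to sanity-check the cookie string before constructing XhsClient. The following is a minimal sketch that parses a browser-exported cookie header into a dict and reports missing keys. The key names a1 and web_session are assumptions about which cookies the web client needs; this article does not confirm them, so adjust the list to whatever the xhs documentation specifies. The helper names are hypothetical.

```python
def parse_cookie_string(cookie: str) -> dict:
    """Parse a 'k1=v1; k2=v2' cookie header string into a dict."""
    pairs = {}
    for part in cookie.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue
        # Split on the first '=' only, since values may contain '='
        key, value = part.split("=", 1)
        pairs[key.strip()] = value.strip()
    return pairs


# Assumed key names; replace with the cookies your client actually requires.
REQUIRED_KEYS = ("a1", "web_session")


def missing_cookie_keys(cookie: str, required=REQUIRED_KEYS) -> list:
    """Return the required keys that are absent or empty in the cookie string."""
    parsed = parse_cookie_string(cookie)
    return [key for key in required if not parsed.get(key)]
```

Calling missing_cookie_keys(cookie) before creating the client and stopping early when the list is non-empty gives a clearer error than a failed request later on.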

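Recommended Tools

Postman: an API debugging tool for testing Xiaohongshu API requests and inspecting response structures
EditThisCookie: a browser extension that makes it easy to export cookies from the Xiaohongshu website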

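User Data Collection: Building a Complete User Profile

Want to know what the authors of popular notes have in common? Or how to analyze a user's follower growth? User data collection helps you understand user behavior and characteristics on Xiaohongshu.

Beginner-friendliness: ★★★☆☆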

    import time

    from xhs import XhsClient


    def get_user_profile(client, user_id):
        """Fetch a user's basic profile information."""
        try:
            profile = client.get_user_profile(user_id)
            return {
                "user_id": user_id,
                "nickname": profile["nickname"],
                "avatar": profile["avatar"],
                "follower_count": profile["follower_count"],
                "following_count": profile["following_count"],
                "note_count": profile["note_count"],
                "signature": profile["signature"],
                "level": profile.get("level", 0)
            }
        except Exception as e:
            print(f"Failed to fetch user profile: {e}")
            return None


    def get_user_notes(client, user_id, max_count=20):
        """Fetch the notes published by a user, following the cursor page by page."""
        notes = []
        cursor = ""

        while len(notes) < max_count:
            try:
                response = client.get_user_notes(user_id, cursor=cursor)
                if not response["items"]:
                    break

                notes.extend(response["items"])
                cursor = response.get("cursor", "")

                # An empty cursor means there are no more pages
                if not cursor:
                    break

                # Throttle requests to avoid being rate-limited
                time.sleep(1)
            except Exception as e:
                print(f"Failed to fetch user notes: {e}")
                break

        return notes[:max_count]


    # Usage example
    client = XhsClient(cookie="your cookie string")
    user_id = "user ID"
    user_profile = get_user_profile(client, user_id)
    user_notes = get_user_notes(client, user_id, max_count=10)

    if user_profile:
        print(f"Nickname: {user_profile['nickname']}")
        print(f"Followers: {user_profile['follower_count']}")
        print(f"Notes fetched: {len(user_notes)}")
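
The cursor loop in get_user_notes is a general pattern, so it may be worth factoring out. The sketch below is one way to do that with a generator. It assumes each page is a dict with an items list and a cursor string, mirroring the response shape used in this article. iter_pages, fake_fetch, and FAKE_PAGES are hypothetical names, and fake_fetch stands in for a real network call.

```python
import time
from typing import Callable, Iterator


def iter_pages(fetch_page: Callable[[str], dict], max_count: int,
               delay: float = 1.0, sleep=time.sleep) -> Iterator[dict]:
    """Yield up to max_count items by following the cursor returned with each page."""
    cursor = ""
    yielded = 0
    while yielded < max_count:
        response = fetch_page(cursor)
        items = response.get("items") or []
        if not items:
            return
        for item in items:
            if yielded >= max_count:
                return
            yield item
            yielded += 1
        cursor = response.get("cursor", "")
        # Stop when there is no next page or enough items were produced
        if not cursor or yielded >= max_count:
            return
        sleep(delay)  # pause between pages to keep the request rate low


# Stand-in for a real API call: three pages of two items each.
FAKE_PAGES = {
    "": {"items": [{"id": 1}, {"id": 2}], "cursor": "p2"},
    "p2": {"items": [{"id": 3}, {"id": 4}], "cursor": "p3"},
    "p3": {"items": [{"id": 5}, {"id": 6}], "cursor": ""},
}


def fake_fetch(cursor: str) -> dict:
    """Return the canned page for the given cursor."""
    return FAKE_PAGES[cursor]


first_five = list(iter_pages(fake_fetch, max_count=5, delay=0))
print([item["id"] for item in first_five])
```

With a real client, fetch_page could be a lambda that wraps the call used above, for example lambda c: client.get_user_notes(user_id, cursor=c). Passing sleep as a parameter keeps the function easy to test without real waiting.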

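Recommended Tools

MongoDB: a document database, well suited to storing user data with a flexible structure
PyCharm Professional: an IDE with built-in database tools and API debugging features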

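Note Content Collection: Retrieving Complete Note Details

How do you batch-fetch the full content of Xiaohongshu notes, including text, images, video, and related statistics? This method goes from a list of notes to the full detail of each one.

Beginner-friendliness: ★★★★☆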

    import json
    import os
    import time

    from xhs import XhsClient


    def save_note_to_json(note, file_path):
        """Save one note's data as a JSON file."""
        # Create the target directory if it does not exist yet
        os.makedirs(os.path.dirname(file_path) or ".", exist_ok=True)
        with open(file_path, 'w', encoding='utf-8') as f:
            json.dump(note, f, ensure_ascii=False, indent=2)


    def get_note_details(client, note_ids):
        """Fetch note details in batch."""
        note_details = []

        for note_id in note_ids:
            try:
                note = client.get_note_by_id(note_id)
                note_details.append({
                    "note_id": note["note_id"],
                    "title": note["title"],
                    "content": note["desc"],
                    "author_id": note["user"]["user_id"],
                    "author_name": note["user"]["nickname"],
                    "like_count": note["stats"]["like_count"],
                    "comment_count": note["stats"]["comment_count"],
                    "collect_count": note["stats"]["collect_count"],
                    "share_count": note["stats"]["share_count"],
                    "create_time": note["create_time"],
                    "tags": [tag["name"] for tag in note.get("tags", [])],
                    "image_urls": [img["url"] for img in note.get("images_list", [])],
                    # Notes without video may carry a missing or null "video" field
                    "video_url": (note.get("video") or {}).get("url", "")
                })

                # Save this note's data to its own file
                save_note_to_json(note_details[-1], f"notes/{note_id}.json")
                print(f"Fetched note: {note['title']}")

                # Throttle requests
                time.sleep(1.5)
            except Exception as e:
                print(f"Failed to fetch note details (note_id: {note_id}): {e}")
                continue

        return note_details


    # Usage example
    client = XhsClient(cookie="your cookie string")

    # First get the IDs of recommended notes
    recommended_notes = client.get_recommended_notes()
    note_ids = [note["note_id"] for note in recommended_notes[:5]]

    # Then fetch the details of each note
    note_details = get_note_details(client, note_ids)
    print(f"Fetched details for {len(note_details)} notes")

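Recommended Tools

JMeter: a load-testing tool that can be used to test how an API handles concurrent requests
Beautiful Soup: an HTML parsing library that can be used to extract specific content from notes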

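Comment Analysis: Mining Real User Feedback

Comments are an important channel for understanding what users really think. How do you collect and analyze the comments on Xiaohongshu notes efficiently? This method implements batch comment collection and a basic text analysis (word frequency) as a starting point for sentiment analysis.

Beginner-friendliness: ★★☆☆☆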

    import re
    import time
    from collections import Counter

    import jieba
    from xhs import XhsClient


    def clean_comment(text):
        """Clean a comment's text."""
        # Remove HTML tags
        text = re.sub(r'<[^>]*>', '', text)
        # Replace everything except Chinese characters, letters, and digits with a space
        text = re.sub(r'[^\u4e00-\u9fa5a-zA-Z0-9]', ' ', text)
        # Collapse repeated whitespace
        return re.sub(r'\s+', ' ', text).strip()


    def get_note_comments(client, note_id, max_count=100):
        """Fetch the comments on a note."""
        comments = []
        cursor = ""

        while len(comments) < max_count:
            try:
                response = client.get_note_comments(note_id, cursor=cursor)
                if not response["comments"]:
                    break

                for comment in response["comments"]:
                    cleaned_content = clean_comment(comment["content"])
                    comments.append({
                        "comment_id": comment["comment_id"],
                        "content": cleaned_content,
                        "user_nickname": comment["user"]["nickname"],
                        "like_count": comment["like_count"],
                        "create_time": comment["create_time"],
                        "reply_count": comment["reply_count"]
                    })

                cursor = response.get("cursor", "")
                if not cursor:
                    break

                # The comment endpoint is more sensitive, so use a longer interval
                time.sleep(3)
            except Exception as e:
                print(f"Failed to fetch comments: {e}")
                break

        return comments[:max_count]


    def analyze_comments(comments):
        """Run a simple word-frequency analysis on the comments."""
        # Join all non-empty comment texts
        all_text = " ".join([comment["content"] for comment in comments if comment["content"]])

        # Segment the text into words
        words = jieba.cut(all_text)

        # Filter out stop words and single-character tokens
        stop_words = {"的", "了", "在", "是", "我", "有", "和", "就", "不", "人", "都", "一", "一个", "上", "也", "很", "到", "说", "要", "去", "你", "会", "着", "没有", "看", "好", "自己", "这"}
        filtered_words = [word for word in words if word not in stop_words and len(word) > 1]

        # Count word frequencies and keep the top 20
        word_counts = Counter(filtered_words).most_common(20)

        return {
            "comment_count": len(comments),
            "top_words": word_counts,
            "avg_likes": sum(comment["like_count"] for comment in comments) / len(comments) if comments else 0
        }


    # Usage example
    client = XhsClient(cookie="your cookie string")
    note_id = "note ID"
    comments = get_note_comments(client, note_id, max_count=50)
    analysis_result = analyze_comments(comments)

    print(f"Fetched {analysis_result['comment_count']} comments")
    print("Top keywords:")
    for word, count in analysis_result["top_words"][:10]:
        print(f"{word}: {count} times")
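
The analysis code above stops at word frequencies. As one possible next step toward sentiment analysis, the sketch below scores a tokenized comment against two small word lists. The lexicons are tiny hypothetical examples, not a real sentiment dictionary, and the threshold value is an arbitrary assumption. The functions take a list of tokens so that they do not depend on any particular tokenizer.

```python
# Tiny illustrative lexicons (hypothetical); a real analysis needs much larger lists.
POSITIVE_WORDS = {"喜欢", "好用", "推荐", "好看", "值得"}
NEGATIVE_WORDS = {"失望", "难用", "踩雷", "后悔", "不好"}


def sentiment_score(tokens):
    """Return (positive hits - negative hits) / total hits, in [-1, 1]; 0.0 if no hits."""
    pos = sum(1 for t in tokens if t in POSITIVE_WORDS)
    neg = sum(1 for t in tokens if t in NEGATIVE_WORDS)
    total = pos + neg
    return (pos - neg) / total if total else 0.0


def label_sentiment(tokens, threshold=0.2):
    """Map a score to 'positive', 'negative', or 'neutral'."""
    score = sentiment_score(tokens)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


print(label_sentiment(["真的", "喜欢", "推荐"]))
```

The tokens for each comment could come from list(jieba.cut(comment["content"])). A lexicon count like this ignores negation and sarcasm, so it is only a rough first signal.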

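Comparison of Strategies for Handling Anti-Scraping Measures

Strategy	Implementation difficulty	Effectiveness	Risk	Suitable scenario
Basic interval control	★☆☆☆☆	★★☆☆☆	★☆☆☆☆	Low-frequency collection, personal use
Dynamic request headers + interval	★★☆☆☆	★★★☆☆	★★☆☆☆	Medium-frequency collection, small-scale data
Proxy pool + account pool + fingerprinting	★★★★★	★★★★★	★★★☆☆	Large-scale data collection, commercial use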
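
The first two rows of the table can be illustrated with a short sketch: a throttle that keeps a randomized minimum interval between calls, and a generator that rotates the User-Agent header. Throttle and header_cycle are hypothetical helpers, the delay values are arbitrary, and the User-Agent strings are placeholders rather than recommended values.

```python
import itertools
import random
import time

# Example User-Agent strings; these specific values are placeholders.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/15.3 Safari/605.1.15",
]


class Throttle:
    """Enforce a minimum, randomly jittered interval between consecutive calls."""

    def __init__(self, base_delay=1.5, jitter=0.5,
                 clock=time.monotonic, sleep=time.sleep):
        self.base_delay = base_delay
        self.jitter = jitter
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep just long enough that calls are at least base_delay (+ jitter) apart."""
        target = self.base_delay + random.uniform(0, self.jitter)
        if self._last_call is not None:
            elapsed = self._clock() - self._last_call
            if elapsed < target:
                self._sleep(target - elapsed)
        self._last_call = self._clock()


def header_cycle(user_agents=USER_AGENTS):
    """Yield request headers forever, rotating through the given User-Agent strings."""
    for ua in itertools.cycle(user_agents):
        yield {"User-Agent": ua}
```

Calling throttle.wait() before each request replaces the fixed time.sleep calls used in the earlier examples, and the random jitter avoids a perfectly regular request rhythm.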

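Recommended Tools

SnowNLP: a Chinese text processing library that can be used for sentiment analysis and text classification
WordCloud: a Python library for generating word clouds, useful for showing comment keywords at a glance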

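Data Application: From Raw Data to Visual Insights

Once the data is collected, how do you turn it into useful insight? This method walks through cleaning the data and visualizing it, so that the data can speak for itself.

Beginner-friendliness: ★★★☆☆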

    from collections import Counter

    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns
    from matplotlib.font_manager import FontProperties
    from wordcloud import WordCloud

    # Font with CJK glyphs, needed to render Chinese tags and labels;
    # SimHei.ttf must exist at this path
    font = FontProperties(fname="SimHei.ttf", size=12)


    def create_dataframe(notes):
        """Convert a list of note dicts into a DataFrame."""
        df = pd.DataFrame(notes)

        # Convert the timestamp (assumed to be epoch seconds) to datetime
        df["create_time"] = pd.to_datetime(df["create_time"], unit="s")

        # Extract the publication date and hour
        df["date"] = df["create_time"].dt.date
        df["hour"] = df["create_time"].dt.hour

        # Interaction ratio as defined in this article:
        # (likes + comments + collects) per share; 0 shares is treated as 1
        # to avoid division by zero
        df["interaction_rate"] = (df["like_count"] + df["comment_count"] + df["collect_count"]) / df["share_count"].replace(0, 1)

        return df


    def visualize_note_data(df):
        """Plot the note data."""
        plt.figure(figsize=(15, 12))

        # 1. Relationship between likes and collects
        plt.subplot(2, 2, 1)
        sns.scatterplot(data=df, x="like_count", y="collect_count")
        plt.title("Likes vs. collects", fontproperties=font)

        # 2. Distribution of publication hour
        plt.subplot(2, 2, 2)
        hour_counts = df["hour"].value_counts().sort_index()
        sns.barplot(x=hour_counts.index, y=hour_counts.values)
        plt.title("Notes by publication hour", fontproperties=font)
        plt.xlabel("Hour", fontproperties=font)
        plt.ylabel("Number of notes", fontproperties=font)

        # 3. Distribution of the interaction ratio
        plt.subplot(2, 2, 3)
        sns.boxplot(data=df, y="interaction_rate")
        plt.title("Interaction ratio distribution", fontproperties=font)

        plt.tight_layout()
        plt.savefig("note_analysis.png")
        plt.show()


    def generate_wordcloud(tags_list):
        """Generate a word cloud from note tags."""
        # Flatten the per-note tag lists and count each tag
        all_tags = [tag for tags in tags_list for tag in tags]
        tag_counts = Counter(all_tags)

        wc = WordCloud(
            font_path="SimHei.ttf",
            background_color="white",
            max_words=50,
            width=800,
            height=400
        ).generate_from_frequencies(tag_counts)

        plt.figure(figsize=(10, 6))
        plt.imshow(wc, interpolation="bilinear")
        plt.axis("off")
        plt.title("Tag word cloud for popular notes", fontproperties=font)
        plt.savefig("tag_wordcloud.png")
        plt.show()


    # Usage example
    # Assuming a list of note dicts named notes has already been collected:
    # df = create_dataframe(notes)
    # visualize_note_data(df)
    # generate_wordcloud(df["tags"].tolist())
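
create_dataframe converts create_time with unit="s", which is only correct if the value is in epoch seconds. If the API returns milliseconds instead (an assumption worth checking against a real response, since this article does not say), the conversion would produce wrong dates or fail. The sketch below normalizes either unit by magnitude before conversion. to_utc_datetime and the threshold constant are hypothetical.

```python
from datetime import datetime, timezone

# Epoch values at or above this are treated as milliseconds.
# 1e11 seconds is roughly the year 5138, so realistic second-based
# timestamps stay below it; millisecond values before March 1973
# would be misread as seconds.
MILLISECOND_THRESHOLD = 100_000_000_000


def to_utc_datetime(timestamp):
    """Convert an epoch timestamp in seconds or milliseconds to an aware UTC datetime."""
    value = float(timestamp)
    if value >= MILLISECOND_THRESHOLD:
        value /= 1000.0
    return datetime.fromtimestamp(value, tz=timezone.utc)


print(to_utc_datetime(1700000000))
print(to_utc_datetime(1700000000000))
```

Both calls print the same moment in time. Applying this function to each note's create_time before building the DataFrame removes the dependence on the unit argument. Note that the result is in UTC, so the hour column would need a timezone shift to reflect local posting times.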

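Recommended Tools

Tableau: a powerful data visualization tool that supports interactive dashboards
Power BI: Microsoft's business analytics tool, suited to data modeling and report generation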

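Applying for Access on the Xiaohongshu Open Platform

Before collecting Xiaohongshu data, it is advisable to apply for API access through the official channel:

Visit the Xiaohongshu Open Platform website and register a developer account
Create an application and fill in its details and intended use
Submit it for review and wait for the platform's approval
Obtain the API key and access token
Read the API documentation to learn the interface limits and calling conventions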

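Q&A

Q1: Why do my API requests always return a 403 error?

A1: A 403 error usually means insufficient permissions or a rejected request. Possible causes include an expired cookie, an unsuitable User-Agent, a request rate that is too high, or a restricted IP address. To fix it, refresh the cookie, use the User-Agent of a real browser, lower the request rate, or try a different network.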
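
When the cause is a request rate that is too high, one common way to lower it automatically is to retry with exponential backoff. The following is a minimal sketch. call_with_backoff is a hypothetical helper, the delay values are arbitrary, and it catches every exception for brevity; in practice you would likely narrow that to the client's own error types. Retrying does not help when the cookie has expired.

```python
import time


def call_with_backoff(func, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call func(); on an exception, wait base_delay * 2**attempt and retry.

    Re-raises the last exception once max_attempts calls have failed.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Waits grow as 2s, 4s, 8s, ... with the default base_delay
            sleep(base_delay * (2 ** attempt))
```

A call such as call_with_backoff(lambda: client.get_note_by_id(note_id)) would then wait progressively longer after each failure instead of failing on the first rejected request.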

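Q2: How should I handle large amounts of data returned by the API?

A2: For large amounts of data, fetch it page by page, using the cursor parameter to load it incrementally. Store the data in a database (such as MySQL or MongoDB) rather than in memory to avoid running out of memory. For very large datasets, consider a message queue and a distributed crawler architecture.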
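
As a small-scale illustration of writing pages to a database instead of holding them in memory, the sketch below uses SQLite from the Python standard library. SQLite is swapped in here only because it needs no server; the answer above names MySQL and MongoDB. The table layout and the function names are assumptions made for this example.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a SQLite database with a single notes table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notes ("
        "note_id TEXT PRIMARY KEY, title TEXT, like_count INTEGER, raw_json TEXT)"
    )
    return conn


def save_notes(conn, notes):
    """Insert or update a batch of note dicts, keyed by note_id."""
    rows = [
        (n["note_id"], n.get("title", ""), n.get("like_count", 0),
         json.dumps(n, ensure_ascii=False))
        for n in notes
    ]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO notes (note_id, title, like_count, raw_json) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )
    return len(rows)
```

Calling save_notes once per fetched page keeps memory use bounded by the page size. Because note_id is the primary key, collecting the same note again updates its row rather than duplicating it.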

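Q3: Is it legal to use Xiaohongshu data for personal purposes?

A3: Personal use must comply with the Cybersecurity Law (《网络安全法》) and the platform's user agreement, and must not infringe on others' privacy or intellectual property. Collect only publicly accessible, non-sensitive data, and limit its use to learning and research. Commercial use requires official authorization from the platform to avoid legal risk.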

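With the 5 practical methods covered in this article, you now have the core skills for collecting Xiaohongshu data. Remember that data collection must always comply with platform rules and applicable laws and regulations, and be used only for lawful purposes and for learning and research. Used sensibly, these tools and techniques let data support your decisions and help you find more useful information.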


Creation statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
