❤️ What are the big names learning? Use a Python crawler to analyze the favorites of CSDN's top bloggers, learn along with them, and you will be the next big name! ❤️
Preface
The computer industry moves so fast that skipping just a few days of study can leave you behind the times. For us programmers, the most important thing is to keep a close eye on industry trends and learn new technologies. But very often we do not know what is worth learning: if a new technology never gets widely adopted and stays niche, it does not help our study or our work much. At times like this we want to know what the top bloggers are learning, because following them makes it far less likely that we take a detour. So let us look at what CSDN's top bloggers have been bookmarking; to decide what to learn, just follow in their footsteps!
Program Overview
The program crawls CSDN for the public favorites folders of the bloggers at the top of the site-wide ranking, writes the data to a csv file, analyzes the learning trends of these top bloggers from the collected data, and presents the results with visualizations.
Data Crawling
The requests library is used to fetch the pages, and the BeautifulSoup4 and json libraries are used to parse them.
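The snippets in the following steps all assume the imports and request headers below; they are the same ones used in the complete crawler code at the end of this post:

    import requests
    from bs4 import BeautifulSoup
    import json

    # a desktop browser User-Agent so CSDN returns the normal JSON responses
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    }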
Getting the CSDN Author Ranking Data
First, we need the bloggers currently on the CSDN ranking and their related information. Since this data is loaded dynamically (for more about dynamic loading, see the earlier post 《渣男,你为什么有那么多小姐姐的照片?因为我Python爬虫学得好啊❤️!》), open the developer tools and look in the Network tab to find the JSON data being requested:
Observe the request URLs:
- https://blog.csdn.net/phoenix/web/blog/all-rank?page=0&pageSize=20
- https://blog.csdn.net/phoenix/web/blog/all-rank?page=1&pageSize=20
- ...
Each request for the JSON data returns 20 records, so to cover the top 100 bloggers the requests are constructed as follows:
    url_rank_pattern = "https://blog.csdn.net/phoenix/web/blog/all-rank?page={}&pageSize=20"
    for i in range(5):
        url = url_rank_pattern.format(i)
        # request the page and declare the encoding explicitly
        response = requests.get(url=url, headers=headers)
        response.encoding = 'utf-8'
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
After the JSON data comes back, parse it with the json module (the re module works too; pick whichever you prefer) to get the user information. For our purposes only the userName of each user is needed, so only userName is parsed; other fields can be extracted the same way if you need them:
    userNames = []
    information = json.loads(str(soup))
    for j in information['data']['allRankListItem']:
        # collect each blogger's userName
        userNames.append(j['userName'])
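If you prefer the re module mentioned above, a minimal sketch of the same extraction could look like the following; it simply matches every "userName" field in the raw JSON text, so it assumes that field does not also appear in unrelated parts of the response:

    import re

    # pull every "userName" value straight out of the raw JSON text
    userNames = re.findall(r'"userName":"(.*?)"', response.text)

Since the endpoint returns JSON, calling response.json() directly would also work and would skip BeautifulSoup entirely; the json.loads(str(soup)) idiom used here is just one of several equivalent ways to read the response.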
Getting the List of Favorites Folders
Once a blogger's userName is known, we can observe from the profile page how the list of favorites folders is requested. This post uses my own profile as the example (a small plug for myself). The analysis is the same as in the previous step: switch to the Favorites tab on the profile page and again use the Network tab of the developer tools:
Observe the URL of the favorites-list request:
- https://blog.csdn.net/community/home-api/v1/get-favorites-created-list?page=1&size=20&noMore=false&blogUsername=LOVEmy134611
Here the userName obtained in the previous step comes into play: by replacing the blogUsername value we can fetch the favorites-folder list of every blogger on the ranking. Likewise, when a blogger has more than 20 folders, the page value can be changed to fetch the complete list (a pagination sketch follows the next snippet):
    collections = "https://blog.csdn.net/community/home-api/v1/get-favorites-created-list?page=1&size=20&noMore=false&blogUsername={}"
    for userName in userNames:
        url = collections.format(userName)
        # request the page and declare the encoding explicitly
        response = requests.get(url=url, headers=headers)
        response.encoding = 'utf-8'
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
After the JSON data comes back, parse it with the json module to get the folder information. For our purposes only each folder's id is needed, so only id is parsed; other fields can be extracted as well if you need them (for example, follower counts, which would let you find the most popular folders):
    file_id_list = []
    information = json.loads(str(soup))
    # total number of favorites folders
    collection_number = information['data']['total']
    # folder ids
    for j in information['data']['list']:
        file_id_list.append(j['id'])
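As mentioned above, when collection_number is greater than 20 the remaining folders have to be fetched page by page. A minimal sketch of that loop (the complete program at the end of this post does the same thing through its get_col_list helper):

    page = 1
    remaining = collection_number
    while remaining > 20:
        # request the next page of the folder list and collect the extra folder ids
        page += 1
        remaining -= 20
        url = ("https://blog.csdn.net/community/home-api/v1/get-favorites-created-list"
               "?page={}&size=20&noMore=false&blogUsername={}").format(page, userName)
        response = requests.get(url=url, headers=headers)
        response.raise_for_status()
        information = json.loads(response.text)
        for j in information['data']['list']:
            file_id_list.append(j['id'])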
At this point you may ask: CSDN currently has both an old and a new profile page, so is the request the same for both? The answer is no. When visited in a browser, the old version uses a different API. However, the new version's request shown above can be used to fetch the data either way, so there is no need to distinguish between the old and new interfaces; the same is true when fetching the favorites data below.
Getting the Favorites Data
Finally, click a folder's expand button to show its contents, and again analyze the request in the Network tab of the developer tools:
Observe the URL of the folder-content request:
- https://blog.csdn.net/community/home-api/v1/get-favorites-item-list?blogUsername=LOVEmy134611&folderId=9406232&page=1&pageSize=200
With the userName and folder id just obtained we can construct the request for the items saved in each folder:
    file_url = "https://blog.csdn.net/community/home-api/v1/get-favorites-item-list?blogUsername={}&folderId={}&page=1&pageSize=200"
    for file_id in file_id_list:
        url = file_url.format(userName, file_id)
        # request the page and declare the encoding explicitly
        response = requests.get(url=url, headers=headers)
        response.encoding = 'utf-8'
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
Finally, parse the result with the re module (the analysis and preprocess helpers used here are shown right after this snippet and again in the complete program):
    user = user_dict[userName]
    user = preprocess(user)
    # title of the favorited post
    title_list = analysis(r'"title":"(.*?)",', str(soup))
    # link
    url_list = analysis(r'"url":"(.*?)"', str(soup))
    # author
    nickname_list = analysis(r'"nickname":"(.*?)",', str(soup))
    # date the item was favorited
    date_list = analysis(r'"dateTime":"(.*?)",', str(soup))
    for i in range(len(title_list)):
        title = preprocess(title_list[i])
        url = preprocess(url_list[i])
        nickname = preprocess(nickname_list[i])
        date = preprocess(date_list[i])
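For reference, the analysis and preprocess helpers used above are defined as follows in the complete program; analysis is a thin wrapper around re.findall, and preprocess strips commas so they do not break the csv lines written later:

    def preprocess(string):
        # commas would break the comma-separated line we write to the csv
        return string.replace(',', ' ')

    def analysis(item, results):
        # return every match of the given regular expression in the response text
        pattern = re.compile(item, re.I | re.M)
        result_list = pattern.findall(results)
        return result_list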
Complete Crawler Code
    import time
    import requests
    from bs4 import BeautifulSoup
    import os
    import json
    import re
    import csv

    if not os.path.exists("col_infor.csv"):
        # create the csv file used to store the data
        file = open('col_infor.csv', "w", encoding="utf-8-sig", newline='')
        csv_head = csv.writer(file)
        # header row
        header = ['userName', 'title', 'url', 'anthor', 'date']
        csv_head.writerow(header)
        file.close()

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    }

    def preprocess(string):
        # commas are the csv delimiter, so replace them inside fields
        return string.replace(',', ' ')

    url_rank_pattern = "https://blog.csdn.net/phoenix/web/blog/all-rank?page={}&pageSize=20"
    userNames = []
    user_dict = {}
    for i in range(5):
        url = url_rank_pattern.format(i)
        # request the page and declare the encoding explicitly
        response = requests.get(url=url, headers=headers)
        response.encoding = 'utf-8'
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        information = json.loads(str(soup))
        for j in information['data']['allRankListItem']:
            # collect each blogger's userName and nickname
            userNames.append(j['userName'])
            user_dict[j['userName']] = j['nickName']

    def get_col_list(page, userName):
        # fetch one page of a blogger's favorites-folder list
        collections = "https://blog.csdn.net/community/home-api/v1/get-favorites-created-list?page={}&size=20&noMore=false&blogUsername={}"
        url = collections.format(page, userName)
        response = requests.get(url=url, headers=headers)
        response.encoding = 'utf-8'
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        information = json.loads(str(soup))
        return information

    def analysis(item, results):
        # extract all matches of a regular expression from the response text
        pattern = re.compile(item, re.I | re.M)
        result_list = pattern.findall(results)
        return result_list

    def get_col(userName, file_id, col_page):
        # fetch one page of the items saved in a folder and append them to the csv
        file_url = "https://blog.csdn.net/community/home-api/v1/get-favorites-item-list?blogUsername={}&folderId={}&page={}&pageSize=200"
        url = file_url.format(userName, file_id, col_page)
        response = requests.get(url=url, headers=headers)
        response.encoding = 'utf-8'
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        # parse the JSON here so the caller can read the item total
        information = json.loads(str(soup))
        user = user_dict[userName]
        user = preprocess(user)
        # title
        title_list = analysis(r'"title":"(.*?)",', str(soup))
        # link
        url_list = analysis(r'"url":"(.*?)"', str(soup))
        # author
        nickname_list = analysis(r'"nickname":"(.*?)",', str(soup))
        # date the item was favorited
        date_list = analysis(r'"dateTime":"(.*?)",', str(soup))
        for i in range(len(title_list)):
            title = preprocess(title_list[i])
            url = preprocess(url_list[i])
            nickname = preprocess(nickname_list[i])
            date = preprocess(date_list[i])
            if title and url and nickname and date:
                with open('col_infor.csv', 'a+', encoding='utf-8-sig') as f:
                    f.write(user + ',' + title + ',' + url + ',' + nickname + ',' + date + '\n')
        return information

    for userName in userNames:
        page = 1
        file_id_list = []
        information = get_col_list(page, userName)
        # total number of favorites folders
        collection_number = information['data']['total']
        # folder ids
        for j in information['data']['list']:
            file_id_list.append(j['id'])
        while collection_number > 20:
            # more than 20 folders: keep paging through the folder list
            page = page + 1
            collection_number = collection_number - 20
            information = get_col_list(page, userName)
            for j in information['data']['list']:
                file_id_list.append(j['id'])
        collection_number = 0
        # fetch the items saved in each folder
        for file_id in file_id_list:
            col_page = 1
            information = get_col(userName, file_id, col_page)
            number_col = information['data']['total']
            while number_col > 200:
                # more than 200 items in a folder: keep paging
                col_page = col_page + 1
                number_col = number_col - 200
                get_col(userName, file_id, col_page)
            number_col = 0
Crawl Results
A portion of the crawled results:
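To take a quick look at the collected data yourself, one simple option (not part of the crawler itself) is to load the csv with pandas and print the first rows:

    import pandas as pd

    # preview the first few collected favorites
    df = pd.read_csv('col_infor.csv', encoding='utf-8-sig')
    print(df.head(10))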
Data Analysis and Visualization
Finally, the wordcloud library is used to draw a word cloud of what the top bloggers have favorited.
    from os import path
    from PIL import Image
    import matplotlib.pyplot as plt
    import jieba
    from wordcloud import WordCloud, STOPWORDS
    import pandas as pd
    import matplotlib.ticker as ticker
    import numpy as np
    import math
    import re
    # 'anthor' matches the (misspelled) column name written by the crawler
    df = pd.read_csv('col_infor.csv', encoding='utf-8-sig', usecols=['userName', 'title', 'url', 'anthor', 'date'])
    place_array = df['title'].values
    place_list = ','.join(place_array)
    with open('text.txt', 'a+') as f:
        f.writelines(place_list)
    # directory of the current file
    d = path.dirname(__file__)
    # read the whole text
    file = open(path.join(d, 'text.txt')).read()
    # word segmentation
    # stopwords
    stopwords = ["的", "与", "和", "建议", "收藏", "使用", "了", "实现", "我", "中", "你", "在", "之"]
    text_split = jieba.cut(file)  # segmentation result before removing stopwords
    # segmentation result after removing stopwords (a list)
    text_split_no = []
    for word in text_split:
        if word not in stopwords:
            text_split_no.append(word)
    # print(text_split_no)
    text = ' '.join(text_split_no)
    # background image used as the word-cloud mask
    picture_mask = np.array(Image.open(path.join(d, "path.jpg")))
    stopwords = set(STOPWORDS)
    stopwords.add("said")
    wc = WordCloud(
        # set the font by specifying a font path
        font_path=r'C:\Windows\Fonts\simsun.ttc',
        # font_path=r'/usr/share/fonts/wps-office/simsun.ttc',
        background_color="white",
        max_words=2000,
        mask=picture_mask,
        stopwords=stopwords)
    # generate the word cloud
    wc.generate(text)
    # save the image
    wc.to_file(path.join(d, "result.jpg"))
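Besides the word cloud, a rough way to see the hottest topics is simply to count the most frequent tokens left after stopword removal. A minimal sketch that can be appended to the script above (it reuses text_split_no and skips single-character tokens):

    from collections import Counter

    # count the most common tokens kept after stopword removal
    word_counts = Counter(w for w in text_split_no if len(w.strip()) > 1)
    for word, count in word_counts.most_common(20):
        print(word, count)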