❤️ What are the big names learning? Use a Python crawler to analyze the favorites of CSDN's top bloggers, learn along with them, and you will be the next big name! ❤️
Preface
The computer industry moves so fast that skipping just a few days of study can leave you behind the times. For us programmers, the most important thing is to keep a close eye on industry trends and learn new technologies. But very often we do not know what is worth learning: if a new technology never gets widely adopted and stays niche, it does not help our study or our work much. At times like this we want to know what the top bloggers are learning, because following them makes it far less likely that we take a detour. So let us look at what CSDN's top bloggers have been bookmarking; to decide what to learn, just follow in their footsteps!
Program Overview
The program crawls CSDN for the public favorites folders of the bloggers at the top of the site-wide ranking, writes the data to a csv file, analyzes the learning trends of these top bloggers from the collected data, and presents the results with visualizations.
Data Crawling
The requests library is used to fetch the pages, and the BeautifulSoup4 and json libraries are used to parse them.
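The snippets in the following steps all assume the imports and request headers below; they are the same ones used in the complete crawler code at the end of this post:

    import requests
    from bs4 import BeautifulSoup
    import json

    # a desktop browser User-Agent so CSDN returns the normal JSON responses
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    }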
Getting the CSDN Author Ranking Data
First, we need the bloggers currently on the CSDN ranking and their related information. Since this data is loaded dynamically (for more about dynamic loading, see the earlier post 《渣男,你为什么有那么多小姐姐的照片?因为我Python爬虫学得好啊❤️!》), open the developer tools and look in the Network tab to find the JSON data being requested:
Observe the request URLs:
- https://blog.csdn.net/phoenix/web/blog/all-rank?page=0&pageSize=20
- https://blog.csdn.net/phoenix/web/blog/all-rank?page=1&pageSize=20
- ...
Each request for the JSON data returns 20 records, so to cover the top 100 bloggers the requests are constructed as follows:
    url_rank_pattern = "https://blog.csdn.net/phoenix/web/blog/all-rank?page={}&pageSize=20"
    for i in range(5):
        url = url_rank_pattern.format(i)
        # request the page and declare the encoding explicitly
        response = requests.get(url=url, headers=headers)
        response.encoding = 'utf-8'
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
After the JSON data comes back, parse it with the json module (the re module works too; pick whichever you prefer) to get the user information. For our purposes only the userName of each user is needed, so only userName is parsed; other fields can be extracted the same way if you need them:
    userNames = []
    information = json.loads(str(soup))
    for j in information['data']['allRankListItem']:
        # collect each blogger's userName
        userNames.append(j['userName'])
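If you prefer the re module mentioned above, a minimal sketch of the same extraction could look like the following; it simply matches every "userName" field in the raw JSON text, so it assumes that field does not also appear in unrelated parts of the response:

    import re

    # pull every "userName" value straight out of the raw JSON text
    userNames = re.findall(r'"userName":"(.*?)"', response.text)

Since the endpoint returns JSON, calling response.json() directly would also work and would skip BeautifulSoup entirely; the json.loads(str(soup)) idiom used here is just one of several equivalent ways to read the response.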
Getting the List of Favorites Folders
Once a blogger's userName is known, we can observe from the profile page how the list of favorites folders is requested. This post uses my own profile as the example (a small plug for myself). The analysis is the same as in the previous step: switch to the Favorites tab on the profile page and again use the Network tab of the developer tools:
Observe the URL of the favorites-list request:
- https://blog.csdn.net/community/home-api/v1/get-favorites-created-list?page=1&size=20&noMore=false&blogUsername=LOVEmy134611
Here the userName obtained in the previous step comes into play: by replacing the blogUsername value we can fetch the favorites-folder list of every blogger on the ranking. Likewise, when a blogger has more than 20 folders, the page value can be changed to fetch the complete list (a pagination sketch follows the next snippet):
    collections = "https://blog.csdn.net/community/home-api/v1/get-favorites-created-list?page=1&size=20&noMore=false&blogUsername={}"
    for userName in userNames:
        url = collections.format(userName)
        # request the page and declare the encoding explicitly
        response = requests.get(url=url, headers=headers)
        response.encoding = 'utf-8'
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
After the JSON data comes back, parse it with the json module to get the folder information. For our purposes only each folder's id is needed, so only id is parsed; other fields can be extracted as well if you need them (for example, follower counts, which would let you find the most popular folders):
    file_id_list = []
    information = json.loads(str(soup))
    # total number of favorites folders
    collection_number = information['data']['total']
    # folder ids
    for j in information['data']['list']:
        file_id_list.append(j['id'])
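As mentioned above, when collection_number is greater than 20 the remaining folders have to be fetched page by page. A minimal sketch of that loop (the complete program at the end of this post does the same thing through its get_col_list helper):

    page = 1
    remaining = collection_number
    while remaining > 20:
        # request the next page of the folder list and collect the extra folder ids
        page += 1
        remaining -= 20
        url = ("https://blog.csdn.net/community/home-api/v1/get-favorites-created-list"
               "?page={}&size=20&noMore=false&blogUsername={}").format(page, userName)
        response = requests.get(url=url, headers=headers)
        response.raise_for_status()
        information = json.loads(response.text)
        for j in information['data']['list']:
            file_id_list.append(j['id'])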
At this point you may ask: CSDN currently has both an old and a new profile page, so is the request the same for both? The answer is no. When visited in a browser, the old version uses a different API. However, the new version's request shown above can be used to fetch the data either way, so there is no need to distinguish between the old and new interfaces; the same is true when fetching the favorites data below.
Getting the Favorites Data
Finally, click a folder's expand button to show its contents, and again analyze the request in the Network tab of the developer tools:
Observe the URL of the folder-content request:
- https://blog.csdn.net/community/home-api/v1/get-favorites-item-list?blogUsername=LOVEmy134611&folderId=9406232&page=1&pageSize=200
With the userName and folder id just obtained we can construct the request for the items saved in each folder:
    file_url = "https://blog.csdn.net/community/home-api/v1/get-favorites-item-list?blogUsername={}&folderId={}&page=1&pageSize=200"
    for file_id in file_id_list:
        url = file_url.format(userName, file_id)
        # request the page and declare the encoding explicitly
        response = requests.get(url=url, headers=headers)
        response.encoding = 'utf-8'
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
Finally, parse the result with the re module (the analysis and preprocess helpers used here are shown right after this snippet and again in the complete program):
    user = user_dict[userName]
    user = preprocess(user)
    # title of the favorited post
    title_list = analysis(r'"title":"(.*?)",', str(soup))
    # link
    url_list = analysis(r'"url":"(.*?)"', str(soup))
    # author
    nickname_list = analysis(r'"nickname":"(.*?)",', str(soup))
    # date the item was favorited
    date_list = analysis(r'"dateTime":"(.*?)",', str(soup))
    for i in range(len(title_list)):
        title = preprocess(title_list[i])
        url = preprocess(url_list[i])
        nickname = preprocess(nickname_list[i])
        date = preprocess(date_list[i])
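For reference, the analysis and preprocess helpers used above are defined as follows in the complete program; analysis is a thin wrapper around re.findall, and preprocess strips commas so they do not break the csv lines written later:

    def preprocess(string):
        # commas would break the comma-separated line we write to the csv
        return string.replace(',', ' ')

    def analysis(item, results):
        # return every match of the given regular expression in the response text
        pattern = re.compile(item, re.I | re.M)
        result_list = pattern.findall(results)
        return result_list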
Complete Crawler Code
    import time
    import requests
    from bs4 import BeautifulSoup
    import os
    import json
    import re
    import csv

    if not os.path.exists("col_infor.csv"):
        # create the csv file used to store the data
        file = open('col_infor.csv', "w", encoding="utf-8-sig", newline='')
        csv_head = csv.writer(file)
        # header row
        header = ['userName', 'title', 'url', 'anthor', 'date']
        csv_head.writerow(header)
        file.close()

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    }

    def preprocess(string):
        # commas are the csv delimiter, so replace them inside fields
        return string.replace(',', ' ')

    url_rank_pattern = "https://blog.csdn.net/phoenix/web/blog/all-rank?page={}&pageSize=20"
    userNames = []
    user_dict = {}
    for i in range(5):
        url = url_rank_pattern.format(i)
        # request the page and declare the encoding explicitly
        response = requests.get(url=url, headers=headers)
        response.encoding = 'utf-8'
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        information = json.loads(str(soup))
        for j in information['data']['allRankListItem']:
            # collect each blogger's userName and nickname
            userNames.append(j['userName'])
            user_dict[j['userName']] = j['nickName']

    def get_col_list(page, userName):
        # fetch one page of a blogger's favorites-folder list
        collections = "https://blog.csdn.net/community/home-api/v1/get-favorites-created-list?page={}&size=20&noMore=false&blogUsername={}"
        url = collections.format(page, userName)
        response = requests.get(url=url, headers=headers)
        response.encoding = 'utf-8'
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        information = json.loads(str(soup))
        return information

    def analysis(item, results):
        # extract all matches of a regular expression from the response text
        pattern = re.compile(item, re.I | re.M)
        result_list = pattern.findall(results)
        return result_list

    def get_col(userName, file_id, col_page):
        # fetch one page of the items saved in a folder and append them to the csv
        file_url = "https://blog.csdn.net/community/home-api/v1/get-favorites-item-list?blogUsername={}&folderId={}&page={}&pageSize=200"
        url = file_url.format(userName, file_id, col_page)
        response = requests.get(url=url, headers=headers)
        response.encoding = 'utf-8'
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        # parse the JSON here so the caller can read the item total
        information = json.loads(str(soup))
        user = user_dict[userName]
        user = preprocess(user)
        # title
        title_list = analysis(r'"title":"(.*?)",', str(soup))
        # link
        url_list = analysis(r'"url":"(.*?)"', str(soup))
        # author
        nickname_list = analysis(r'"nickname":"(.*?)",', str(soup))
        # date the item was favorited
        date_list = analysis(r'"dateTime":"(.*?)",', str(soup))
        for i in range(len(title_list)):
            title = preprocess(title_list[i])
            url = preprocess(url_list[i])
            nickname = preprocess(nickname_list[i])
            date = preprocess(date_list[i])
            if title and url and nickname and date:
                with open('col_infor.csv', 'a+', encoding='utf-8-sig') as f:
                    f.write(user + ',' + title + ',' + url + ',' + nickname + ',' + date + '\n')
        return information

    for userName in userNames:
        page = 1
        file_id_list = []
        information = get_col_list(page, userName)
        # total number of favorites folders
        collection_number = information['data']['total']
        # folder ids
        for j in information['data']['list']:
            file_id_list.append(j['id'])
        while collection_number > 20:
            # more than 20 folders: keep paging through the folder list
            page = page + 1
            collection_number = collection_number - 20
            information = get_col_list(page, userName)
            for j in information['data']['list']:
                file_id_list.append(j['id'])
        collection_number = 0
        # fetch the items saved in each folder
        for file_id in file_id_list:
            col_page = 1
            information = get_col(userName, file_id, col_page)
            number_col = information['data']['total']
            while number_col > 200:
                # more than 200 items in a folder: keep paging
                col_page = col_page + 1
                number_col = number_col - 200
                get_col(userName, file_id, col_page)
            number_col = 0
Crawl Results
A portion of the crawled results:
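To take a quick look at the collected data yourself, one simple option (not part of the crawler itself) is to load the csv with pandas and print the first rows:

    import pandas as pd

    # preview the first few collected favorites
    df = pd.read_csv('col_infor.csv', encoding='utf-8-sig')
    print(df.head(10))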
Data Analysis and Visualization
Finally, the wordcloud library is used to draw a word cloud of what the top bloggers have favorited.
    from os import path
    from PIL import Image
    import matplotlib.pyplot as plt
    import jieba
    from wordcloud import WordCloud, STOPWORDS
    import pandas as pd
    import matplotlib.ticker as ticker
    import numpy as np
    import math
    import re
    # 'anthor' matches the (misspelled) column name written by the crawler
    df = pd.read_csv('col_infor.csv', encoding='utf-8-sig', usecols=['userName', 'title', 'url', 'anthor', 'date'])
    place_array = df['title'].values
    place_list = ','.join(place_array)
    with open('text.txt', 'a+') as f:
        f.writelines(place_list)
    # directory of the current file
    d = path.dirname(__file__)
    # read the whole text
    file = open(path.join(d, 'text.txt')).read()
    # word segmentation
    # stopwords
    stopwords = ["的", "与", "和", "建议", "收藏", "使用", "了", "实现", "我", "中", "你", "在", "之"]
    text_split = jieba.cut(file)  # segmentation result before removing stopwords
    # segmentation result after removing stopwords (a list)
    text_split_no = []
    for word in text_split:
        if word not in stopwords:
            text_split_no.append(word)
    # print(text_split_no)
    text = ' '.join(text_split_no)
    # background image used as the word-cloud mask
    picture_mask = np.array(Image.open(path.join(d, "path.jpg")))
    stopwords = set(STOPWORDS)
    stopwords.add("said")
    wc = WordCloud(
        # set the font by specifying a font path
        font_path=r'C:\Windows\Fonts\simsun.ttc',
        # font_path=r'/usr/share/fonts/wps-office/simsun.ttc',
        background_color="white",
        max_words=2000,
        mask=picture_mask,
        stopwords=stopwords)
    # generate the word cloud
    wc.generate(text)
    # save the image
    wc.to_file(path.join(d, "result.jpg"))
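Besides the word cloud, a rough way to see the hottest topics is simply to count the most frequent tokens left after stopword removal. A minimal sketch that can be appended to the script above (it reuses text_split_no and skips single-character tokens):

    from collections import Counter

    # count the most common tokens kept after stopword removal
    word_counts = Counter(w for w in text_split_no if len(w.strip()) > 1)
    for word, count in word_counts.most_common(20):
        print(word, count)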