Python Crawler Example: Scraping Sina Weibo Content [Using a Proxy IP]

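This article walks through an example of scraping Sina Weibo content with a Python crawler. It is shared here for reference; the details are as follows: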

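We write a crawler in Python to scrape the posts of a Weibo "big V" (verified celebrity) account. This article uses a popular actress's Weibo as the example and crawls Sina's m site.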

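When scraping a website, the m (mobile) site is usually the first choice, the WAP site the second, and the desktop site the last. This is not an absolute rule: sometimes the desktop site has the most complete information, and if you need all of it, the desktop site is the one to choose. The address of an m site usually starts with m followed by the domain name, so the address used in this article is m.weibo.cn.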

1. Proxy IP

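Many free proxy IPs are available online, for example the Xici free proxy list (西刺免费代理IP). Find one that currently works and use it for testing.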
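Free proxies stop working often, so it can help to check a candidate before running the crawler. The following is a minimal sketch of such a check, not part of the original script: the helper names `build_proxy_opener` and `proxy_works`, the default test URL, and the 5-second timeout are all assumptions chosen for illustration.

```python
import http.client
import urllib.error
import urllib.request


def build_proxy_opener(proxy_addr):
    """Return an opener that sends http and https requests through proxy_addr.

    proxy_addr is a "host:port" string; the helper name is hypothetical.
    """
    proxy = urllib.request.ProxyHandler({
        "http": "http://" + proxy_addr,
        "https": "http://" + proxy_addr,
    })
    return urllib.request.build_opener(proxy)


def proxy_works(proxy_addr, test_url="https://m.weibo.cn/", timeout=5):
    """Return True if test_url answers with HTTP 200 through the proxy."""
    opener = build_proxy_opener(proxy_addr)
    try:
        with opener.open(test_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Refused connections, timeouts and HTTP errors all count as "not usable".
        return False
```

A candidate taken from a proxy list can then be filtered with `if proxy_works("122.241.72.191:808"): ...` before it is assigned to `proxy_addr`.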

2. Packet capture analysis

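The address that serves the Weibo content is found by capturing packets. This is not covered in detail here; readers who are unfamiliar with it can look up the relevant material themselves. The complete code is given below.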
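As a rough illustration of what the capture yields, the sketch below builds the `getIndex` address used by the script and picks the containerid of the posts tab out of a response. The helper names `build_index_url` and `find_weibo_containerid` are hypothetical, the sample payload is made up for the example, and the assumption that the posts tab carries `tab_type` equal to `"weibo"` should be checked against a real capture.

```python
import json
from urllib.parse import urlencode

API_BASE = "https://m.weibo.cn/api/container/getIndex"


def build_index_url(uid, containerid=None, page=None):
    """Build the getIndex URL for a user, optionally for one page of a tab."""
    params = {"type": "uid", "value": uid}
    if containerid is not None:
        params["containerid"] = containerid
    if page is not None:
        params["page"] = page
    return API_BASE + "?" + urlencode(params)


def find_weibo_containerid(payload):
    """Return the containerid of the tab whose tab_type is "weibo", or None."""
    data = json.loads(payload).get("data") or {}
    tabs = (data.get("tabsInfo") or {}).get("tabs") or []
    for tab in tabs:
        if tab.get("tab_type") == "weibo":  # assumed tab type of the posts tab
            return tab.get("containerid")
    return None


# Made-up payload that mimics the shape of the captured JSON.
sample = json.dumps({"data": {"tabsInfo": {"tabs": [
    {"tab_type": "profile", "containerid": "DEMO_PROFILE"},
    {"tab_type": "weibo", "containerid": "DEMO_WEIBO"},
]}}})

print(build_index_url("1259110474"))
print(build_index_url("1259110474", find_weibo_containerid(sample), page=2))
```

Using `urlencode` instead of string concatenation keeps the `?` and `&` separators correct even when a parameter is optional.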

    # -*- coding: utf-8 -*-
    import urllib.request
    import json

    # Weibo ID of the big-V account to scrape
    id = '1259110474'
    # Proxy IP to use
    proxy_addr = "122.241.72.191:808"


    # Open a page through the proxy and return the decoded response body
    def use_proxy(url, proxy_addr):
        req = urllib.request.Request(url)
        req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0")
        # The API URLs are https, so the proxy is registered for both schemes
        proxy = urllib.request.ProxyHandler({'http': proxy_addr, 'https': proxy_addr})
        opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
        urllib.request.install_opener(opener)
        data = urllib.request.urlopen(req).read().decode('utf-8', 'ignore')
        return data


    # Get the containerid of the account's Weibo tab; it is needed to fetch the posts
    def get_containerid(url):
        data = use_proxy(url, proxy_addr)
        content = json.loads(data).get('data')
        containerid = None
        for tab in content.get('tabsInfo').get('tabs'):
            if tab.get('tab_type') == 'weibo':
                containerid = tab.get('containerid')
        return containerid


    # Get the basic profile of the account: nickname, homepage URL, avatar,
    # following count, follower count, gender, level and so on
    def get_userInfo(id):
        url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + id
        data = use_proxy(url, proxy_addr)
        content = json.loads(data).get('data')
        profile_image_url = content.get('userInfo').get('profile_image_url')
        description = content.get('userInfo').get('description')
        profile_url = content.get('userInfo').get('profile_url')
        verified = content.get('userInfo').get('verified')
        guanzhu = content.get('userInfo').get('follow_count')
        name = content.get('userInfo').get('screen_name')
        fensi = content.get('userInfo').get('followers_count')
        gender = content.get('userInfo').get('gender')
        urank = content.get('userInfo').get('urank')
        print("Weibo nickname: " + name + "\n" + "Weibo homepage URL: " + profile_url + "\n" + "Avatar URL: " + profile_image_url + "\n" + "Verified: " + str(verified) + "\n" + "Description: " + description + "\n" + "Following: " + str(guanzhu) + "\n" + "Followers: " + str(fensi) + "\n" + "Gender: " + gender + "\n" + "Weibo level: " + str(urank) + "\n")


    # Get the posts and append them to a text file: the content of each post,
    # the URL of its detail page, and its like, comment and repost counts
    def get_weibo(id, file):
        url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + id
        # The containerid does not change between pages, so fetch it once
        containerid = get_containerid(url)
        i = 1
        while True:
            weibo_url = url + '&containerid=' + containerid + '&page=' + str(i)
            try:
                data = use_proxy(weibo_url, proxy_addr)
                content = json.loads(data).get('data')
                cards = content.get('cards')
                if len(cards) > 0:
                    for j in range(len(cards)):
                        print("-----Scraping page " + str(i) + ", post " + str(j) + "------")
                        card_type = cards[j].get('card_type')
                        if card_type == 9:
                            mblog = cards[j].get('mblog')
                            attitudes_count = mblog.get('attitudes_count')
                            comments_count = mblog.get('comments_count')
                            created_at = mblog.get('created_at')
                            reposts_count = mblog.get('reposts_count')
                            scheme = cards[j].get('scheme')
                            text = mblog.get('text')
                            with open(file, 'a', encoding='utf-8') as fh:
                                fh.write("----Page " + str(i) + ", post " + str(j) + "----" + "\n")
                                fh.write("Weibo URL: " + str(scheme) + "\n" + "Posted at: " + str(created_at) + "\n" + "Content: " + text + "\n" + "Likes: " + str(attitudes_count) + "\n" + "Comments: " + str(comments_count) + "\n" + "Reposts: " + str(reposts_count) + "\n")
                    i += 1
                else:
                    break
            except Exception as e:
                print(e)
                # Stop instead of retrying the same page forever
                break


    if __name__ == "__main__":
        file = id + ".txt"
        get_userInfo(id)
        get_weibo(id, file)
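The card handling inside `get_weibo` can also be separated from the network and file code, which makes it easy to try out on a saved response. The sketch below is one possible way to do that, not the article's own code: the function name `extract_posts` and the sample cards are made up for illustration, while the field names (`card_type`, `mblog`, `scheme`, and the count fields) are the ones the script reads.

```python
def extract_posts(cards):
    """Return one dict per post card (card_type == 9) from a list of cards."""
    posts = []
    for card in cards or []:
        if card.get("card_type") != 9:
            continue  # skip cards that are not posts
        mblog = card.get("mblog") or {}
        posts.append({
            "scheme": card.get("scheme"),
            "created_at": mblog.get("created_at"),
            "text": mblog.get("text", ""),
            "attitudes_count": mblog.get("attitudes_count", 0),
            "comments_count": mblog.get("comments_count", 0),
            "reposts_count": mblog.get("reposts_count", 0),
        })
    return posts


# Made-up cards that mimic the shape of content.get('cards').
sample_cards = [
    {"card_type": 11, "card_group": []},
    {"card_type": 9, "scheme": "https://m.weibo.cn/status/DEMO",
     "mblog": {"created_at": "01-01", "text": "hello",
               "attitudes_count": 3, "comments_count": 2, "reposts_count": 1}},
]

for post in extract_posts(sample_cards):
    print(post["created_at"], post["text"], post["attitudes_count"])
```

Returning plain dicts leaves the choice of output open: the same records could be written to the text file as above or dumped with `json.dump`.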