Example Python Code for Downloading Web Novels

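I usually let web novel chapters pile up for a while and then load them onto my Kindle to read. But once a lot has piled up, mechanically hitting Ctrl + C and Ctrl + V over and over is just too tedious, which is how this post came about.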

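I'm actually a Python beginner myself. I picked it mainly for its strong text processing and networking support, and for the many handy libraries that save me from reinventing the wheel. It really is far more convenient than C (something you only appreciate after using it).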

Analyzing the target web page

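There are three main things I want to get from each page: the chapter title, the chapter text, and the URL of the next chapter.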
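To make these three items concrete, here is a minimal sketch that pulls them out of a made-up page snippet using only the standard library. The element ids `title` and `content` and the `var next_page` script line mirror what the source code later in this post looks for; the sample markup, the `TextById` class and the `extract_page` function are hypothetical names for illustration, and the real site's pages may differ.

```python
import re
from html.parser import HTMLParser

# Made-up page snippet; the real site's markup is an assumption.
SAMPLE_HTML = """
<html><body>
<div id="title">Chapter One</div>
<div id="content">First paragraph.<br />Second paragraph.</div>
<script>var next_page = "chapter2.html";</script>
</body></html>
"""


class TextById(HTMLParser):
    """Collect the text found inside elements with the wanted ids."""

    def __init__(self, wanted_ids):
        super().__init__()
        self.texts = {wanted: "" for wanted in wanted_ids}
        self._active = None  # id of the element being read, if any
        self._depth = 0      # nesting depth inside that element

    def handle_starttag(self, tag, attrs):
        if self._active is not None:
            self._depth += 1
            return
        element_id = dict(attrs).get("id")
        if element_id in self.texts:
            self._active = element_id
            self._depth = 1

    def handle_endtag(self, tag):
        if self._active is not None:
            self._depth -= 1
            if self._depth == 0:
                self._active = None

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> separate lines of text.
        if self._active is not None:
            self.texts[self._active] += "\n"

    def handle_data(self, data):
        if self._active is not None:
            self.texts[self._active] += data


def extract_page(html):
    """Return (chapter title, chapter text, next page file name or None)."""
    parser = TextById(["title", "content"])
    parser.feed(html)
    match = re.search(r'var next_page\s*=\s*"(\w+\.html)"', html)
    next_page = match.group(1) if match else None
    return parser.texts["title"], parser.texts["content"], next_page


title, content, next_page = extract_page(SAMPLE_HTML)
print(title)
print(content)
print(next_page)
```

The post itself uses BeautifulSoup for the title and text; the standard-library parser is swapped in here only so the sketch runs without extra packages.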

Also pay attention to the page encoding. This page is encoded in GBK, but when I actually ran the script, decoding with GBK raised an error:

　　
UnicodeDecodeError: 'gbk' codec can't decode bytes in position 2-3: illegal multibyte sequence
　　　　
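So I switched to gb18030 and the problem went away. Xianxia web novels tend to be full of grandiose, rarely used characters (you know the kind), so a larger character set is needed. GB18030 is compatible with GBK but covers many more characters. Take a moment to appreciate how vast and deep character encodings are.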
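The effect can be reproduced without touching the network. The sketch below is only an illustration: it assumes the downloaded bytes contain one rare character (U+3400 is used as a stand-in, since GBK has no code for it), and the variable names are made up for the demo.

```python
# "\u3400" is a rare CJK character that GBK cannot represent but GB18030 can.
text = "大主宰\u3400"

# Stand-in for the bytes returned by resp.read()
raw = text.encode("gb18030")

try:
    raw.decode("gbk")
    gbk_ok = True
except UnicodeDecodeError as err:
    gbk_ok = False
    print("gbk failed:", err)

# The same bytes decode cleanly with the larger, GBK-compatible codec.
decoded = raw.decode("gb18030")
print("gb18030 round trip ok:", decoded == text)

# Ordinary GBK bytes are still read correctly by the gb18030 codec.
plain = "大主宰".encode("gbk")
print(plain.decode("gb18030"))
```

Because gb18030 also reads ordinary GBK bytes, switching codecs costs nothing for pages that contain no rare characters.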
　　

　　
Source code
　　
I knew this is what you all came for, hahaha.
　　
Main function

# Main function (goes at the end of the script, after findNextTextURL is defined)
if __name__ == "__main__":
    global numChapter
    global NOVERL

    NOVERL = '大主宰.txt'
    # NOVERL = '择天记.txt'
    NOVERL = '武动乾坤.txt'

    if (NOVERL == '大主宰.txt'):
        textStartURL = 'http://www.bxwx8.org/b/62/62724/11455540.html'  # URL of chapter 1 of 大主宰
        textStartURL = 'http://www.bxwx8.org/b/62/62724/28019405.html'  # chapter 1237, 鬼大师
    else:
        textStartURL = 'http://www.bxwx8.org/b/98/98289/17069215.html'  # URL of chapter 1 of 择天记
        textStartURL = 'http://www.bxwx8.org/b/98/98289/28088874.html'  # 择天记 chapter 78, 合剑术

    textStartURL = 'http://www.bxwx8.org/b/35/35282/5839471.html'  # chapter 1 of 武动乾坤
    # textStartURL = 'http://www.bxwx8.org/b/35/35282/7620539.html'  # 武动乾坤
    nextURL = textStartURL

    isEnd = False

    # Create (or empty) the output file before any chapter is written
    f = open(NOVERL, "w", encoding="utf-8")
    f.close()

    numChapter = 0
    # Follow the next-chapter links until the index page is reached
    while (not isEnd):
        nextURL, isEnd = findNextTextURL(nextURL)

    print('End of capture!')
    print("Fetched " + str(numChapter) + " chapters")
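The loop at the end of the main function relies on one convention: each call returns a pair, the next URL and a flag saying whether the end has been reached. Here is a minimal sketch of that control flow with the download replaced by a dictionary lookup. `FAKE_SITE`, `fetch_next` and `crawl` are hypothetical names used only for this illustration.

```python
# Hypothetical stand-in for the site: each page name maps to the next one.
FAKE_SITE = {"1.html": "2.html", "2.html": "3.html", "3.html": "index.html"}
END_PAGE = "index.html"  # the last chapter links back to the index page


def fetch_next(page):
    """Pretend to download `page`; return (next page, whether that is the end)."""
    next_page = FAKE_SITE[page]
    return next_page, next_page == END_PAGE


def crawl(start_page):
    """Visit pages in order until the next link points at the index page."""
    visited = []
    next_page, is_end = start_page, False
    while not is_end:
        visited.append(next_page)
        next_page, is_end = fetch_next(next_page)
    return visited


print(crawl("1.html"))
```

Returning the flag together with the URL keeps the stop condition inside the function that knows what the index page looks like.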
Getting the content and the next chapter's URL

# Imports the script needs (bs4 is the third-party beautifulsoup4 package)
import re
import urllib.request

from bs4 import BeautifulSoup


# Find the URL of the next chapter
# Get the novel content
# Note: write2TXT is not shown in this post; it presumably appends text to NOVERL
def findNextTextURL(url):
    global numChapter
    global NOVERL
    # When nextURL == endURL (the index page), the novel is finished

    if (NOVERL == '大主宰.txt'):
        endURL = 'http://www.bxwx8.org/b/62/62724/index.html'  # 大主宰
        headURL = 'http://www.bxwx8.org/b/62/62724/'  # 大主宰
    else:
        endURL = 'http://www.bxwx8.org/b/98/98289/index.html'  # 择天记
        headURL = 'http://www.bxwx8.org/b/98/98289/'  # 择天记

    endURL = 'http://www.bxwx8.org/b/35/35282/index.html'  # 武动乾坤
    headURL = 'http://www.bxwx8.org/b/35/35282/'  # 武动乾坤

    isEnd = False

    resp = urllib.request.urlopen(url)

    # The text really is GBK, but it is mixed with some special characters
    # that GBK lacks and GB18030 has. Decoding those with gbk fails on the
    # unsupported characters. In that case, try a codec that is compatible
    # with the current one (gbk) but covers more characters (gb18030).
    # allHtml = resp.read().decode('gbk')
    allHtml = resp.read().decode('gb18030')

    # Name the parser explicitly to avoid BeautifulSoup's "no parser" warning
    textSoup = BeautifulSoup(allHtml, 'html.parser')

    # Chapter title
    strChapter = textSoup.find(id='title').getText().split(r"【")[0]
    strChapter = strChapter.split(r'(')[0]
    strChapter = strChapter.replace('正文', '') + '\n'
    numChapter = numChapter + 1
    strID = '#' + str(numChapter) + '---'
    strChapter = strID + strChapter

    strChapter = strChapter + '\n------------------------------\n' + url + '\n------------------------------\n'
    # Novel body
    strNovel = textSoup.find(id='content').getText()
    strNovel = strNovel.replace(' ', '\n')

    # Remove the redundant "第XXX章" chapter heading from the body
    strMatch = r"第[\u4e00-\u9fa5]+章"
    list2replace = re.findall(strMatch, strNovel)
    if list2replace:
        str2replace = list2replace[0]
        strNovel = strNovel.replace(str2replace, '')

    # Merge the chapter title and the body
    strNovel = strChapter + strNovel + '\n------------------------------\n------------------------------\n'

    # Write to the txt file
    write2TXT(strNovel)

    # Get the URL of the next chapter; the group captures just the file name,
    # so no further stripping of quotes or of "var next_page = " is needed
    nextURL = re.findall(r'var next_page\s*=\s*"(\w+\.html)"', allHtml)[0]
    nextURL = headURL + nextURL

    print(numChapter)  # chapter count
    print(strChapter)  # chapter name
    print(nextURL)  # URL of the next chapter

    if (endURL == nextURL):
        isEnd = True

    return nextURL, isEnd
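The code above calls `write2TXT`, which this post does not include. A minimal sketch of what such a helper might look like follows, assuming it simply appends each chapter to the file named by `NOVERL`. The optional `path` parameter and the temporary-directory path are additions for this demo, not part of the original script.

```python
import os
import tempfile

# Hypothetical output path for this demo; in the post NOVERL is set in the main function.
NOVERL = os.path.join(tempfile.mkdtemp(), "novel.txt")


def write2TXT(text, path=None):
    """Append text to the novel file as UTF-8; `path` defaults to NOVERL."""
    with open(path or NOVERL, "a", encoding="utf-8") as f:
        f.write(text)


# Start from an empty file, as the main function does, then append two chapters.
open(NOVERL, "w", encoding="utf-8").close()
write2TXT("#1---Chapter one\n")
write2TXT("#2---Chapter two\n")

with open(NOVERL, encoding="utf-8") as f:
    print(f.read())
```

Opening in append mode matters here: the main function empties the file once at startup, and every chapter after that must be added to the end rather than overwrite what is already there.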