My crawler is itching to get going.
## Hands-on Web Scraping with the Requests Library

### Installing Requests

Run `pip install requests` in a terminal.
### A general code framework for fetching pages

```python
import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)  # give up if the server takes longer than 30s
        r.raise_for_status()               # raise HTTPError for non-200 responses
        r.encoding = r.apparent_encoding   # guess the encoding from the page content
        return r.text
    except Exception:
        return "An exception occurred"

if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))
```
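The framework leans on a few attributes of the `Response` object. As a minimal illustration of what each one holds (reusing the same Baidu URL):

```python
import requests

r = requests.get("http://www.baidu.com", timeout=30)
print(r.status_code)        # 200 when the request succeeded
print(r.encoding)           # encoding guessed from the HTTP headers
print(r.apparent_encoding)  # encoding guessed from the page content itself
print(type(r.content))      # <class 'bytes'>: the raw response body
```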
### Simulating a browser when sending HTTP requests

Some sites inspect the request headers and reject obvious crawlers, so we change the `User-Agent` header to `Mozilla/5.0`, a generic token that mainstream browsers send (by default requests identifies itself as `python-requests`).
```python
import requests

url = "http://ip138.com"
try:
    kv = {'user-agent': 'Mozilla/5.0'}  # pretend to be a regular browser
    r = requests.get(url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])
except Exception:
    print("Scraping failed")
```
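To confirm the header actually went out, you can inspect the request that requests built and sent. A quick sketch (`r.request` is the underlying `PreparedRequest`):

```python
import requests

r = requests.get("http://ip138.com", headers={'user-agent': 'Mozilla/5.0'})
print(r.request.headers)  # the headers actually sent, including our User-Agent
```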
### Scraping a JD product page

```python
import requests

url = "https://item.jd.com/100004245954.html"
try:
    r = requests.get(url)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except Exception:
    print("Scraping failed")
```
### Submitting keywords to Baidu/360

Submit a keyword to a search engine programmatically and fetch the results page. Both engines take the query as a URL parameter, which requests builds via the `params` argument.
Baidu search code (query parameter `wd`):

```python
import requests

keyword = "gkdoe"
try:
    kv = {'wd': keyword}
    r = requests.get("https://www.baidu.com/s", params=kv)
    print(r.request.url)  # the full URL that was actually requested
    r.raise_for_status()
    print(len(r.text))
except Exception:
    print("Scraping failed")
```
360 search code (query parameter `q`):

```python
import requests

keyword = "gkdoe"
try:
    kv = {'q': keyword}
    r = requests.get("https://www.so.com/s", params=kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except Exception:
    print("Scraping failed")
```
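The two snippets differ only in the base URL and the parameter name, so they can be folded into one helper. This is my own sketch, not part of the original course code; the `ENGINES` table and `search` function are hypothetical names:

```python
import requests

# Hypothetical lookup table: engine name -> (search URL, query-parameter name)
ENGINES = {
    'baidu': ('https://www.baidu.com/s', 'wd'),
    'so': ('https://www.so.com/s', 'q'),
}

def search(engine, keyword):
    url, param = ENGINES[engine]
    r = requests.get(url, params={param: keyword}, timeout=30)
    r.raise_for_status()
    return r.text

print(len(search('baidu', 'gkdoe')))
```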
### Scraping and storing web images

Image scraping code:
```python
import requests
import os

url = "http://image.ngchina.com.cn/2019/0523/20190523103156143.jpg"
root = "D://picture//"
path = root + url.split('/')[-1]  # name the file after the last URL segment
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        with open(path, 'wb') as f:  # the with block closes the file for us
            f.write(r.content)       # r.content is the image as raw bytes
        print("File saved")
    else:
        print("File already exists")
except Exception:
    print("Scraping failed")
```
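The snippet above buffers the whole image in `r.content` before writing it. For larger files, requests can stream the body in chunks instead; a minimal sketch, assuming the same image URL:

```python
import requests

url = "http://image.ngchina.com.cn/2019/0523/20190523103156143.jpg"
r = requests.get(url, stream=True, timeout=30)  # headers only; body not downloaded yet
r.raise_for_status()
with open(url.split('/')[-1], 'wb') as f:
    for chunk in r.iter_content(chunk_size=8192):  # read the body 8 KB at a time
        f.write(chunk)
```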
### Automatic IP address geolocation lookup

```python
import requests

url = "http://ip138.com/ips138.asp?ip="
try:
    r = requests.get(url + '55.55.55.55')  # append the IP to query
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])
except Exception:
    print("Scraping failed")
```
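String concatenation works, but the same query can be written with the `params` argument used in the keyword-submission examples, which also URL-encodes the value for us:

```python
import requests

r = requests.get("http://ip138.com/ips138.asp", params={'ip': '55.55.55.55'})
print(r.request.url)  # http://ip138.com/ips138.asp?ip=55.55.55.55
```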
## The BeautifulSoup Library

Informally, BeautifulSoup is a library for parsing, traversing, and maintaining a "tag tree" built from data such as HTML or XML.
### Installing BeautifulSoup

Run `pip install beautifulsoup4` in a terminal.
### Parsing in two lines

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>data</p>', 'html.parser')
```
The first argument is the HTML-formatted content for BeautifulSoup to parse, here the literal `<p>data</p>`. The second argument names the parser; we use `html.parser`.
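`html.parser` ships with the Python standard library, but BeautifulSoup also accepts third-party parsers if they are installed. A quick sketch of the options:

```python
from bs4 import BeautifulSoup

html = '<p>data</p>'
soup = BeautifulSoup(html, 'html.parser')  # pure-Python parser, no extra install
# soup = BeautifulSoup(html, 'lxml')       # faster; requires `pip install lxml`
# soup = BeautifulSoup(html, 'html5lib')   # browser-like; `pip install html5lib`
print(soup.p.string)                       # -> data
```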
### A BeautifulSoup demo

```python
import requests
from bs4 import BeautifulSoup

url = 'http://python123.io/ws/demo.html'
r = requests.get(url)
demo = r.text                              # the HTML content to be parsed
soup = BeautifulSoup(demo, "html.parser")  # html.parser is the parser to use
print(soup)
print(soup.prettify())                     # pretty-printed, one tag per line
```
### Basic elements of the BeautifulSoup class

Once you have a BeautifulSoup object, you normally extract content from the HTML through the class's basic elements: Tag, the tag's name, its attrs, its NavigableString content, and Comment.
Extracting information from the HTML:

```python
print(soup.title)                 # the <title> tag
print(soup.a)                     # the first <a> tag in the document
print(soup.a.name)                # its tag name, 'a'
print(soup.a.parent.name)         # name of its parent tag
print(soup.a.parent.parent.name)  # name of its grandparent tag
```
```python
print('type of the a tag:', type(soup.a))
print('attributes of the first a tag:', soup.a.attrs)
print('type of the attrs:', type(soup.a.attrs))
print('class attribute of the a tag:', soup.a.attrs['class'])
print('href attribute of the a tag:', soup.a.attrs['href'])
```
```python
print('content of the first a tag:', soup.a.string)
print('type of the non-attribute string:', type(soup.a.string))
print('content of the first p tag:', soup.p.string)
```
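One subtlety worth illustrating: `.string` returns a Comment when the tag contains an HTML comment, and it prints like ordinary text, so check the type before trusting it. A small self-contained sketch:

```python
from bs4 import BeautifulSoup

newsoup = BeautifulSoup('<b><!--This is a comment--></b><p>This is text</p>',
                        'html.parser')
print(newsoup.b.string)        # This is a comment  (the <!-- --> markers are stripped)
print(type(newsoup.b.string))  # <class 'bs4.element.Comment'>
print(type(newsoup.p.string))  # <class 'bs4.element.NavigableString'>
```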
### The find_all() method

Tag elements are usually located with `<>.find_all(name, attrs, recursive, string, **kwargs)`, which returns a list of the matches.

- name: string matched against tag names
- attrs: string matched against tag attribute values; a specific attribute can be named
- recursive: whether to search all descendants, default True
- string: string matched against the text between <>…</> (string and recursive are demonstrated after the snippets below)
```python
print('all a tags:', soup.find_all('a'))
print('a and b tags:', soup.find_all(['a', 'b']))
```
```python
for t in soup.find_all('a'):
    print('value of t:', t)
    print('type of t:', type(t))
    print('href attribute of the a tag:', t.get('href'))
```
```python
# find_all(True) matches every tag in the document
for i in soup.find_all(True):
    print('tag name:', i.name)
```
```python
print('a tags with the given href:',
      soup.find_all('a', href='http://www.icourse163.org/course/BIT-268001'))
print('tags whose class is "title":', soup.find_all(class_='title'))
print('tags whose id is "link1":', soup.find_all(id='link1'))
```
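The string and recursive parameters from the list above are not exercised by these snippets. A short sketch against the same demo page, assuming it still contains the link text "Basic Python":

```python
import requests
from bs4 import BeautifulSoup

demo = requests.get('http://python123.io/ws/demo.html').text
soup = BeautifulSoup(demo, 'html.parser')
print(soup.find_all(string='Basic Python'))  # matches text nodes, not tags
print(soup.find_all('a', recursive=False))   # [] - <a> is not a direct child of the root
```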
```python
print(soup.head)
print(soup.head.contents)
print(soup.body.contents)
# A tag's children include not only tag nodes but also string nodes,
# e.g. the '\n' entries in the output above.
```
```python
print(len(soup.body.contents))  # .contents is a list, so it can be measured
print(soup.body.contents[1])    # ...and indexed
```
```python
print(type(soup.body.children))  # .children is an iterator, for traversal
for i in soup.body.children:
    print(i.name)
```
## A targeted crawler for the Chinese university rankings

```python
import bs4
import requests
from bs4 import BeautifulSoup

def getHTMLText(url):
    try:
        kv = {'user-agent': 'Mozilla/5.0'}
        r = requests.get(url, headers=kv, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except Exception:
        return "An exception occurred"

def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):  # skip string nodes such as '\n'
            tds = tr('td')                   # shorthand for tr.find_all('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])

def printUnivList(ulist, num):
    # chr(12288) is the full-width space, used as fill to align Chinese text
    tplt = "{0:^8}\t{1:{3}^10}\t{2:^9}"
    # column headers: rank, university name, total score
    print(tplt.format("排名", "学校名称", "总分", chr(12288)))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], chr(12288)))

def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 100)

main()
```