
Commit 68c8f14

勘误表-修改书籍错误 (Errata: fix errors in the book)

qiyeboy committed Jul 3, 2017
1 parent 4bb0aa8 commit 68c8f14
Showing 10 changed files with 191 additions and 135 deletions.
Binary file added 122页.png
Binary file added 123页.png
Binary file added 138页.png
2 changes: 1 addition & 1 deletion ch04/4.3.2.py
@@ -22,7 +22,7 @@
soup.title.name = 'mytitle'
print soup.title
print soup.mytitle
-soup.title.name = 'title'
+soup.mytitle.name = 'title'
print soup.p['class']
print soup.p.get('class')

2 changes: 1 addition & 1 deletion ch05/5.3.py
@@ -15,7 +15,7 @@ def _format_addr(s):
# recipient address
to_addr = '[email protected]'
# NetEase 163 mail server address
-smtp_server = 'smtp.163.com '
+smtp_server = 'smtp.163.com'
# compose the mail message
msg = MIMEText('Python爬虫运行异常,异常信息为遇到HTTP 403', 'plain', 'utf-8')
msg['From'] = _format_addr('一号爬虫 <%s>' % from_addr)
5 changes: 4 additions & 1 deletion ch06/HtmlParser.py
@@ -30,7 +30,10 @@ def _get_new_urls(self,page_url,soup):
'''
new_urls = set()
# extract the <a> tags we want
-links = soup.find_all('a',href=re.compile(r'/view/\d+\.htm'))
+# original code in the book
+# links = soup.find_all('a',href=re.compile(r'/view/\d+\.htm'))
+# updated 2017-07-03: Baidu Baike entry links changed format
+links = soup.find_all('a', href=re.compile(r'/item/.*'))
for link in links:
# extract the href attribute
new_url = link['href']
10 changes: 4 additions & 6 deletions ch06/SpiderMan.py
@@ -1,11 +1,9 @@
#coding:utf-8
-from ch06.URLManager import UrlManager
-from ch06.HtmlDownloader import HtmlDownloader
-from ch06.HtmlParser import HtmlParser
-from ch06.DataOutput import DataOutput
+from URLManager import UrlManager
+from HtmlDownloader import HtmlDownloader
+from HtmlParser import HtmlParser
+from DataOutput import DataOutput


class SpiderMan(object):
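The SpiderMan.py change replaces the package-qualified imports (`from ch06.URLManager import UrlManager`) with plain ones so the scripts also run when launched directly from inside `ch06/`. One pattern that supports both invocation styles, not used in the book itself, is a try/except import fallback. The sketch below illustrates the pattern with stdlib stand-ins, since the book's modules aren't importable here: `nonexistent_pkg` is a deliberately missing package, so the fallback branch runs.

```python
# Fallback-import pattern: try the package-qualified path first, then fall
# back to the plain module name. "nonexistent_pkg" deliberately does not
# exist, so the except branch runs -- mirroring how "from ch06.URLManager
# import UrlManager" fails when the script is run from inside ch06/.
try:
    from nonexistent_pkg.json_tools import loads  # package-style import
except ImportError:
    from json import loads                        # plain-import fallback

assert loads('{"page": 138}') == {"page": 138}
```

The same shape could wrap each of the four imports in SpiderMan.py, at the cost of some noise; the commit's simpler plain-import form is fine as long as the scripts are run from inside `ch06/`.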
274 changes: 150 additions & 124 deletions ch06/baike.html


5 changes: 4 additions & 1 deletion ch07/SpiderNode/HtmlParser.py
@@ -30,7 +30,10 @@ def _get_new_urls(self,page_url,soup):
'''
new_urls = set()
# extract the <a> tags we want
-links = soup.find_all('a',href=re.compile(r'/view/\d+\.htm'))
+# original code in the book
+# links = soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))
+# updated 2017-07-03: Baidu Baike entry links changed format
+links = soup.find_all('a',href=re.compile(r'/item/.*'))
for link in links:
# extract the href attribute
new_url = link['href']
28 changes: 27 additions & 1 deletion 勘误表.md
@@ -14,4 +14,30 @@ for url in ["ImageUrl_"+str(i) for i in range(10)]:
<br>
Change sqlite3 to MySQLdb

Thanks @lg-Cat73

#### 3. Book P122–P123: extra spaces in the CSS selector expressions
Cause: printing error.
<br>
![](122页.png)
<br>
![](123页.png)
<br>
Thanks @Judy0513

#### 4. Book P144 and P156: the regex in soup.find_all('a', href = re.compile(r'/view/\d+.htm')) is incorrect
Cause: the URL structure of Baidu Baike entries changed; this is not a bug in the program.
Fix:
```python
links = soup.find_all('a', href=re.compile(r'/item/.*'))
```
Thanks @Judy0513
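Entry 4's change can be checked directly with `re`: the book's pattern only matches the old `/view/<digits>.htm` links, while the updated pattern matches the newer `/item/...` form. The sample paths below are illustrative stand-ins, not taken from Baidu Baike.

```python
import re

book_pattern = re.compile(r'/view/\d+\.htm')  # pattern printed in the book
fixed_pattern = re.compile(r'/item/.*')       # pattern from this errata entry

old_link = '/view/21087.htm'   # pre-2017 link form (illustrative)
new_link = '/item/Python'      # newer link form (illustrative)

assert book_pattern.search(old_link) is not None   # book pattern matched old links
assert book_pattern.search(new_link) is None       # ...but misses the new form
assert fixed_pattern.search(new_link) is not None  # updated pattern matches it
```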
#### 5. Book P138: code error, an extra space
Cause: typo.
<br>
![](138页.png)
<br>
Fix:
```python
smtp_server = 'smtp.163.com'
```
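The fix in entry 5 matters because `smtplib.SMTP(...)` passes the server string to name resolution verbatim, so `'smtp.163.com '` with a trailing space names a different, invalid host. A defensive `.strip()` on configuration strings, not used in the book's code, catches this class of typo; a minimal check:

```python
# The printed value and the corrected value differ only by a trailing space,
# which is enough to break hostname resolution in smtplib.SMTP(...).
printed_value = 'smtp.163.com '   # value as printed on P138
fixed_value = 'smtp.163.com'

assert printed_value != fixed_value
assert printed_value.strip() == fixed_value   # .strip() repairs the typo
```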
