👉 Project website: https://www.python-office.com/ 👈


Hi everyone, this is 程序员晚枫, currently all in on hands-on AI programming.

Today I'll show you how to scrape web data with Python, so there's no more manual copy-and-paste!

1. Install the required libraries

pip install requests beautifulsoup4

2. Fetch a web page

import requests

# Fetch the page content
url = 'https://www.example.com'
response = requests.get(url)

# Check the status code (200 = success)
print(f'Status code: {response.status_code}')

# Inspect the content
print(response.text[:500])  # show only the first 500 characters
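If the status code is not 200, `response.text` may just be an error page. A minimal sketch of a more defensive fetch, using `raise_for_status()` to turn bad status codes into exceptions (the `fetch` helper name is my own, for illustration):

```python
import requests

def fetch(url, timeout=5):
    """Return the page text, or None if the request fails for any reason."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raise an exception on 4xx/5xx status codes
        return response.text
    except requests.RequestException as e:  # covers timeouts, DNS errors, bad status
        print(f'Request failed: {e}')
        return None
```

Catching `requests.RequestException` (the base class of all requests errors) means one `except` handles timeouts, connection failures, and HTTP errors alike.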

3. Parse HTML and extract data

from bs4 import BeautifulSoup

html = '''
<html>
<body>
<div class="product">
<h2>Product A</h2>
<span class="price">99 yuan</span>
</div>
<div class="product">
<h2>Product B</h2>
<span class="price">199 yuan</span>
</div>
</body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')

# Extract all products
products = soup.find_all('div', class_='product')

for p in products:
    name = p.find('h2').text
    price = p.find('span', class_='price').text
    print(f'{name}: {price}')
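Besides `find_all`, BeautifulSoup also supports CSS selectors via `select()` and `select_one()`, which are often shorter for nested lookups. The same extraction again, on a trimmed-down version of the HTML above:

```python
from bs4 import BeautifulSoup

html = '''
<div class="product"><h2>Product A</h2><span class="price">99 yuan</span></div>
<div class="product"><h2>Product B</h2><span class="price">199 yuan</span></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# 'div.product' is a CSS selector: every <div> with class "product"
for p in soup.select('div.product'):
    name = p.select_one('h2').text
    price = p.select_one('span.price').text
    print(f'{name}: {price}')
```

If you already know CSS from web development, `select()` lets you reuse that knowledge directly.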

4. Hands-on example: scrape weather data

import requests
from bs4 import BeautifulSoup
import office

# Weather forecast page
url = 'https://weather.example.com/chongqing'

try:
    response = requests.get(url, timeout=5)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract the weather info (adjust to the actual page structure)
    temp = soup.find('span', class_='temperature').text
    condition = soup.find('span', class_='condition').text

    print(f'Weather today: {condition}')
    print(f'Temperature: {temp}')

    # Save to Excel
    data = [['Date', 'Weather', 'Temperature'], ['Today', condition, temp]]
    office.excel.write(path='weather_log.xlsx', data=data)

except Exception as e:
    print(f'Scraping failed: {e}')
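One fragile spot in code like this: if the page structure changes, `soup.find(...)` returns `None` and the following `.text` raises `AttributeError`. A small None-safe helper avoids that (the `safe_text` name is my own, not part of BeautifulSoup):

```python
from bs4 import BeautifulSoup

def safe_text(soup, tag, cls, default='N/A'):
    """Return the stripped text of the first matching element, or a default."""
    elem = soup.find(tag, class_=cls)
    return elem.text.strip() if elem else default

soup = BeautifulSoup('<span class="temperature">12°C</span>', 'html.parser')
print(safe_text(soup, 'span', 'temperature'))  # element exists
print(safe_text(soup, 'span', 'condition'))    # element missing, default returned
```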

5. Hands-on example: scrape news headlines

import requests
from bs4 import BeautifulSoup
import office

def crawl_news():
    """Scrape news headlines."""

    # News site URL
    url = 'https://news.example.com'

    try:
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find the news list (adjust to the actual page structure)
        news_items = soup.find_all('a', class_='news-title')

        data = []
        for item in news_items[:20]:  # take only the first 20
            title = item.text.strip()
            link = item.get('href', '')
            data.append([title, link])

        # Save
        office.excel.write(path='news_headlines.xlsx', data=data)
        print(f'Scraped {len(data)} headlines')

    except Exception as e:
        print(f'Scraping failed: {e}')

crawl_news()

6. Hands-on example: batch-scrape product prices

import requests
from bs4 import BeautifulSoup
import office
import time

def crawl_prices(product_list):
    """Scrape prices for a list of products."""

    results = []

    for keyword in product_list:
        # Search URL
        url = f'https://search.example.com?q={keyword}'

        try:
            response = requests.get(url, timeout=10)
            soup = BeautifulSoup(response.text, 'html.parser')

            # Extract the price (adjust to the actual page structure)
            price_elem = soup.find('span', class_='price')
            price = price_elem.text if price_elem else 'not found'

            results.append([keyword, price])
            print(f'{keyword}: {price}')

        except Exception as e:
            results.append([keyword, f'failed: {e}'])

        time.sleep(1)  # avoid sending requests too fast

    # Save the results
    office.excel.write(path='price_comparison.xlsx', data=results)
    print('Price scraping done!')

# Scrape several products
crawl_prices(['iPhone', 'iPad', 'MacBook'])
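One caveat when building the search URL with an f-string: keywords containing spaces or non-ASCII characters must be URL-encoded first. The standard library's `urllib.parse` handles this:

```python
from urllib.parse import quote, urlencode

keyword = 'iPhone 15 Pro'

# quote() percent-encodes a single value for use inside a URL
url = f'https://search.example.com?q={quote(keyword)}'
print(url)  # spaces become %20

# urlencode() builds a complete query string from a dict
print('https://search.example.com?' + urlencode({'q': keyword}))
```

Either form is safe; `urlencode` is the more convenient choice when you have several query parameters.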

7. FAQ

Q: The scrape fails with a 403 error?

A: The site blocks crawlers. Add request headers to masquerade as a browser:

headers = {'User-Agent': 'Mozilla/5.0...'}
response = requests.get(url, headers=headers)
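If you scrape the same site repeatedly, a `requests.Session` lets you set the headers once; the session also reuses the underlying connection across requests. The User-Agent string below is just a truncated example:

```python
import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # masquerade as a browser
})

# Every request made through this session now carries these headers
print(session.headers['User-Agent'])
```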

Q: The data won't extract?

A: The page is probably rendered dynamically with JavaScript, which requires a tool like Selenium. Learn the basics first and scrape what you can.

Q: Scraped too much and got my IP banned?

A: Add a delay with time.sleep(2), or use proxy IPs.
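A common pattern is to combine a delay with a couple of retries, so a single transient failure doesn't kill the whole run. A minimal sketch (the `fetch_with_retry` name and its parameters are my own):

```python
import time
import requests

def fetch_with_retry(url, retries=3, delay=2):
    """Try the request up to `retries` times, sleeping `delay` seconds between tries."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException as e:
            print(f'Attempt {attempt} failed: {e}')
            if attempt < retries:
                time.sleep(delay)
    return None  # all attempts failed
```

For proxies, `requests.get` accepts a `proxies` dict, e.g. `requests.get(url, proxies={'https': 'http://10.0.0.1:8080'})` (the address here is a placeholder).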

8. Coming up next

Now that you can scrape web pages, the next lesson covers clipboard operations: reading and writing the system clipboard.

Stay tuned!


Questions? Add me on WeChat (python-office) to join the group chat~

程序员晚枫 focuses on AI programming training; complete beginners can start building AI projects after working through《30讲 · AI编程训练营》(30 Lessons: AI Programming Bootcamp), his course made with 图灵社区 (Turing Community).

🎓 Hands-on AI Programming Course

Want to learn AI programming systematically? 程序员晚枫's hands-on AI programming course takes you from zero to building real projects!