
👉 Project website: https://www.python-office.com/ 👈

Hi everyone, this is 程序员晚枫, currently going all in on hands-on AI programming.
Today I'll show you how to scrape web data with Python, no more manual copy-and-paste!
1. Install the required libraries

```
pip install requests beautifulsoup4
```
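To confirm the install worked, you can run a quick import check; both version numbers should print without errors:

```python
# Quick sanity check: both libraries expose a __version__ attribute
import requests
import bs4

print(requests.__version__, bs4.__version__)
```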
2. Fetch a web page

```python
import requests

url = 'https://www.example.com'
response = requests.get(url)
print(f'Status code: {response.status_code}')
print(response.text[:500])  # first 500 characters of the HTML
```
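One thing to watch for: if a Chinese page comes back as garbled text, requests may have guessed the wrong encoding from the HTTP headers. A minimal fix using the library's built-in re-detection:

```python
# apparent_encoding re-detects the charset from the response body itself
response.encoding = response.apparent_encoding
print(response.text[:500])  # Chinese characters should now decode correctly
```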
3. Parse HTML and extract data

```python
from bs4 import BeautifulSoup

html = '''
<html>
<body>
    <div class="product">
        <h2>Product A</h2>
        <span class="price">¥99</span>
    </div>
    <div class="product">
        <h2>Product B</h2>
        <span class="price">¥199</span>
    </div>
</body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')
products = soup.find_all('div', class_='product')  # every <div class="product">
for p in products:
    name = p.find('h2').text
    price = p.find('span', class_='price').text
    print(f'{name}: {price}')
```
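If you prefer CSS selectors, BeautifulSoup's select() and select_one() can do the same extraction; here is an equivalent version of the loop above:

```python
# select() takes a CSS selector; select_one() returns just the first match
for p in soup.select('div.product'):
    name = p.select_one('h2').text
    price = p.select_one('span.price').text
    print(f'{name}: {price}')
```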
4. Hands-on example: scrape weather data

```python
import requests
from bs4 import BeautifulSoup
import office

url = 'https://weather.example.com/chongqing'
try:
    response = requests.get(url, timeout=5)
    soup = BeautifulSoup(response.text, 'html.parser')
    temp = soup.find('span', class_='temperature').text
    condition = soup.find('span', class_='condition').text
    print(f"Today's weather: {condition}")
    print(f'Temperature: {temp}')
    data = [['Date', 'Weather', 'Temperature'],
            ['Today', condition, temp]]
    office.excel.write(path='weather_log.xlsx', data=data)
except Exception as e:
    print(f'Scraping failed: {e}')
```
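Note that find() returns None when nothing matches, so calling .text directly raises an AttributeError the moment the site renames a class (the selectors above are placeholders anyway). A slightly more defensive sketch:

```python
# Guard against missing elements instead of letting .text raise AttributeError
temp_elem = soup.find('span', class_='temperature')
temp = temp_elem.text if temp_elem else 'N/A'
```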
5. Hands-on example: scrape news headlines

```python
import requests
from bs4 import BeautifulSoup
import office

def crawl_news():
    """Scrape news headlines and save them to Excel."""
    url = 'https://news.example.com'
    try:
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, 'html.parser')
        news_items = soup.find_all('a', class_='news-title')
        data = []
        for item in news_items[:20]:  # keep the first 20 headlines
            title = item.text.strip()
            link = item.get('href', '')
            data.append([title, link])
        office.excel.write(path='news_titles.xlsx', data=data)
        print(f'Scraped {len(data)} news items')
    except Exception as e:
        print(f'Scraping failed: {e}')

crawl_news()
```
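One caveat: news sites often return relative links like /articles/123, so the href saved to Excel may not be clickable on its own. urllib.parse.urljoin from the standard library rebuilds the absolute URL; a minimal sketch assuming the same base URL as above:

```python
from urllib.parse import urljoin

base_url = 'https://news.example.com'
link = urljoin(base_url, '/articles/123')  # hypothetical relative href
print(link)  # https://news.example.com/articles/123
```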
6. Hands-on example: batch-scrape product prices

```python
import requests
from bs4 import BeautifulSoup
import office
import time

def crawl_prices(product_list):
    """Look up each product's price and save a comparison sheet."""
    results = []
    for keyword in product_list:
        url = f'https://search.example.com?q={keyword}'
        try:
            response = requests.get(url, timeout=10)
            soup = BeautifulSoup(response.text, 'html.parser')
            price_elem = soup.find('span', class_='price')
            price = price_elem.text if price_elem else 'Not found'
            results.append([keyword, price])
            print(f'{keyword}: {price}')
        except Exception as e:
            results.append([keyword, f'Failed: {e}'])
        time.sleep(1)  # pause between requests to avoid hammering the site
    office.excel.write(path='price_comparison.xlsx', data=results)
    print('Price scraping complete!')

crawl_prices(['iPhone', 'iPad', 'MacBook'])
```
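By the way, building the URL with an f-string breaks as soon as the keyword contains spaces or Chinese characters. requests can URL-encode query parameters for you via params=; a sketch using the same placeholder search endpoint:

```python
# params= URL-encodes the query string automatically
response = requests.get('https://search.example.com',
                        params={'q': 'MacBook Pro 14'},
                        timeout=10)
print(response.url)  # https://search.example.com/?q=MacBook+Pro+14
```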
7. FAQ
Q: Scraping fails with a 403 error?
A: The site is blocking crawlers. Add request headers to pose as a browser:

```python
headers = {'User-Agent': 'Mozilla/5.0...'}
response = requests.get(url, headers=headers)
```
Q: Can't extract the data?
A: The page is probably rendered by JavaScript, which calls for a tool like Selenium. Master the basics first and scrape what you can.
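If you do need dynamic content, here is a minimal Selenium sketch (assumes `pip install selenium` and a local Chrome; Selenium 4 fetches the matching driver automatically; the URL and class name are placeholders):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # launches a real browser that executes the page's JS
driver.get('https://news.example.com')
for item in driver.find_elements(By.CSS_SELECTOR, 'a.news-title')[:5]:
    print(item.text)
driver.quit()
```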
Q: Scraped too much and got your IP banned?
A: Add a delay with time.sleep(2), or use proxy IPs.
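Routing requests through a proxy takes only one extra argument; a sketch with a hypothetical local proxy address:

```python
# 127.0.0.1:7890 is a placeholder; substitute your actual proxy address
proxies = {
    'http': 'http://127.0.0.1:7890',
    'https': 'http://127.0.0.1:7890',
}
response = requests.get(url, proxies=proxies, timeout=10)
```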
8. Coming up next
Now that you can scrape web pages, the next lesson covers clipboard operations: reading and writing the system clipboard.
Stay tuned!
If you have questions, add me on WeChat (python-office) to join the group chat~
程序员晚枫 focuses on AI programming training; beginners can start building AI projects after finishing 《30讲 · AI编程训练营》 (30 Lectures: AI Programming Bootcamp), his tutorial co-produced with 图灵社区 (Turing Community).
🎓 Hands-on AI Programming Course
Want to learn AI programming systematically? 程序员晚枫's hands-on AI programming course takes you from zero to building!