
👉 Project website: https://www.python-office.com/ 👈

Hi everyone, this is 程序员晚枫, currently going all in on hands-on AI programming.
Today I'll show you how to scrape web data with Python, no more manual copy-and-paste!
1. Install the required libraries

```
pip install requests beautifulsoup4
```
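To confirm the install worked, you can run a quick import check; both version numbers should print without errors:

```python
# Quick sanity check: both libraries expose a __version__ attribute
import requests
import bs4

print(requests.__version__, bs4.__version__)
```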
2. Fetch a web page

```python
import requests

url = 'https://www.example.com'
response = requests.get(url)
print(f'Status code: {response.status_code}')
print(response.text[:500])  # first 500 characters of the HTML
```
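One thing to watch for: if a Chinese page comes back as garbled text, requests may have guessed the wrong encoding from the HTTP headers. A minimal fix using the library's built-in re-detection:

```python
# apparent_encoding re-detects the charset from the response body itself
response.encoding = response.apparent_encoding
print(response.text[:500])  # Chinese characters should now decode correctly
```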
3. Parse HTML and extract data

```python
from bs4 import BeautifulSoup

html = '''
<html>
<body>
    <div class="product">
        <h2>Product A</h2>
        <span class="price">¥99</span>
    </div>
    <div class="product">
        <h2>Product B</h2>
        <span class="price">¥199</span>
    </div>
</body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')
products = soup.find_all('div', class_='product')  # every <div class="product">
for p in products:
    name = p.find('h2').text
    price = p.find('span', class_='price').text
    print(f'{name}: {price}')
```
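If you prefer CSS selectors, BeautifulSoup's select() and select_one() can do the same extraction; here is an equivalent version of the loop above:

```python
# select() takes a CSS selector; select_one() returns just the first match
for p in soup.select('div.product'):
    name = p.select_one('h2').text
    price = p.select_one('span.price').text
    print(f'{name}: {price}')
```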
4. Hands-on example: scrape weather data

```python
import requests
from bs4 import BeautifulSoup
import office

url = 'https://weather.example.com/chongqing'
try:
    response = requests.get(url, timeout=5)
    soup = BeautifulSoup(response.text, 'html.parser')
    temp = soup.find('span', class_='temperature').text
    condition = soup.find('span', class_='condition').text
    print(f"Today's weather: {condition}")
    print(f'Temperature: {temp}')
    data = [['Date', 'Weather', 'Temperature'],
            ['Today', condition, temp]]
    office.excel.write(path='weather_log.xlsx', data=data)
except Exception as e:
    print(f'Scraping failed: {e}')
```
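Note that find() returns None when nothing matches, so calling .text directly raises an AttributeError the moment the site renames a class (the selectors above are placeholders anyway). A slightly more defensive sketch:

```python
# Guard against missing elements instead of letting .text raise AttributeError
temp_elem = soup.find('span', class_='temperature')
temp = temp_elem.text if temp_elem else 'N/A'
```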
5. Hands-on example: scrape news headlines

```python
import requests
from bs4 import BeautifulSoup
import office

def crawl_news():
    """Scrape news headlines and save them to Excel."""
    url = 'https://news.example.com'
    try:
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, 'html.parser')
        news_items = soup.find_all('a', class_='news-title')
        data = []
        for item in news_items[:20]:  # keep the first 20 headlines
            title = item.text.strip()
            link = item.get('href', '')
            data.append([title, link])
        office.excel.write(path='news_titles.xlsx', data=data)
        print(f'Scraped {len(data)} news items')
    except Exception as e:
        print(f'Scraping failed: {e}')

crawl_news()
```
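One caveat: news sites often return relative links like /articles/123, so the href saved to Excel may not be clickable on its own. urllib.parse.urljoin from the standard library rebuilds the absolute URL; a minimal sketch assuming the same base URL as above:

```python
from urllib.parse import urljoin

base_url = 'https://news.example.com'
link = urljoin(base_url, '/articles/123')  # hypothetical relative href
print(link)  # https://news.example.com/articles/123
```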
6. Hands-on example: batch-scrape product prices

```python
import requests
from bs4 import BeautifulSoup
import office
import time

def crawl_prices(product_list):
    """Look up each product's price and save a comparison sheet."""
    results = []
    for keyword in product_list:
        url = f'https://search.example.com?q={keyword}'
        try:
            response = requests.get(url, timeout=10)
            soup = BeautifulSoup(response.text, 'html.parser')
            price_elem = soup.find('span', class_='price')
            price = price_elem.text if price_elem else 'Not found'
            results.append([keyword, price])
            print(f'{keyword}: {price}')
        except Exception as e:
            results.append([keyword, f'Failed: {e}'])
        time.sleep(1)  # pause between requests to avoid hammering the site
    office.excel.write(path='price_comparison.xlsx', data=results)
    print('Price scraping complete!')

crawl_prices(['iPhone', 'iPad', 'MacBook'])
```
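By the way, building the URL with an f-string breaks as soon as the keyword contains spaces or Chinese characters. requests can URL-encode query parameters for you via params=; a sketch using the same placeholder search endpoint:

```python
# params= URL-encodes the query string automatically
response = requests.get('https://search.example.com',
                        params={'q': 'MacBook Pro 14'},
                        timeout=10)
print(response.url)  # https://search.example.com/?q=MacBook+Pro+14
```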
7. FAQ
Q: Scraping fails with a 403 error?
A: The site is blocking crawlers. Add request headers to pose as a browser:

```python
headers = {'User-Agent': 'Mozilla/5.0...'}
response = requests.get(url, headers=headers)
```
Q: Can't extract the data?
A: The page is probably rendered by JavaScript, which calls for a tool like Selenium. Master the basics first and scrape what you can.
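If you do need dynamic content, here is a minimal Selenium sketch (assumes `pip install selenium` and a local Chrome; Selenium 4 fetches the matching driver automatically; the URL and class name are placeholders):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # launches a real browser that executes the page's JS
driver.get('https://news.example.com')
for item in driver.find_elements(By.CSS_SELECTOR, 'a.news-title')[:5]:
    print(item.text)
driver.quit()
```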
Q: Scraped too much and got your IP banned?
A: Add a delay with time.sleep(2), or use proxy IPs.
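Routing requests through a proxy takes only one extra argument; a sketch with a hypothetical local proxy address:

```python
# 127.0.0.1:7890 is a placeholder; substitute your actual proxy address
proxies = {
    'http': 'http://127.0.0.1:7890',
    'https': 'http://127.0.0.1:7890',
}
response = requests.get(url, proxies=proxies, timeout=10)
```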
8. Coming up next
Now that you can scrape web pages, the next lesson covers clipboard operations: reading and writing the system clipboard.
Stay tuned!
If you have questions, add me on WeChat (python-office) to join the group chat~
程序员晚枫 focuses on AI programming training; beginners can start building AI projects after finishing 《30讲 · AI编程训练营》 (30 Lectures: AI Programming Bootcamp), his tutorial co-produced with 图灵社区 (Turing Community).
🎓 Hands-on AI Programming Course
Want to learn AI programming systematically? 程序员晚枫's hands-on AI programming course takes you from zero to building!