I Built a Web Data Scraping System in Python: From Anti-Bot Workarounds to Structured Storage, with 5 Real-World Cases

Published: 2026/6/18 22:49:34
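This article is for developers and operations staff who need to collect data (articles, products, comments, and so on) from websites in bulk. It builds a complete scraping system with Python, Playwright, and BeautifulSoup, and includes anti-bot workarounds and 5 real-world cases.

Background: why scrape the web

Content creators and operations staff need data from the web every day:

- Collecting competitors' article titles and view counts
- Monitoring product price changes
- Gathering user comments for analysis
- Pulling industry news as input for topic planning

Copying and pasting by hand is extremely slow. Automated with Python, a single day can yield tens of thousands of records.

Technology choices

Option                      Best for                        Anti-bot resilience   Learning curve
requests + BeautifulSoup    Static pages                    Weak                  ⭐
Playwright                  Dynamic pages (JS-rendered)     Strong                ⭐⭐
Scrapy                      Large-scale crawling            Medium                ⭐⭐⭐
Selenium                    Legacy-system compatibility     Medium                ⭐⭐

My stack: Playwright (handles dynamic pages) + BeautifulSoup (parses HTML) + SQLite (storage).

Module 1: Basic scraper

    from playwright.sync_api import sync_playwright
    from bs4 import BeautifulSoup
    import sqlite3
    import time
    import random


    class WebScraper:
        """Web page scraper."""

        def __init__(self, db_path="scraper.db"):
            self.db = sqlite3.connect(db_path)
            self._init_db()

        def _init_db(self):
            self.db.execute("""
                CREATE TABLE IF NOT EXISTS scraped (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    url TEXT,
                    title TEXT,
                    content TEXT,
                    scraped_at DATETIME DEFAULT CURRENT_TIMESTAMP
                )
            """)
            self.db.commit()

        def fetch(self, url, wait_for=None):
            """Fetch the page content, including JS-rendered parts."""
            with sync_playwright() as p:
                browser = p.chromium.launch(headless=True)
                page = browser.new_page()
                # Send a User-Agent header that mimics a real browser
                page.set_extra_http_headers({
                    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
                })
                page.goto(url, wait_until="networkidle")
                if wait_for:
                    page.wait_for_selector(wait_for)
                html = page.content()
                browser.close()
                return html

        def parse(self, html, selectors):
            """Parse the HTML and extract one dict of text fields per container."""
            soup = BeautifulSoup(html, "html.parser")
            results = []
            items = soup.select(selectors["container"])
            for item in items:
                data = {}
                for key, selector in selectors["fields"].items():
                    elem = item.select_one(selector)
                    # Missing elements become an empty string
                    data[key] = elem.get_text(strip=True) if elem else ""
                results.append(data)
            return results

        def save(self, url, data_list):
            """Save to the database (only the title and content fields are stored)."""
            for data in data_list:
                self.db.execute(
                    "INSERT INTO scraped (url, title, content) VALUES (?, ?, ?)",
                    (url, data.get("title", ""), data.get("content", "")),
                )
            self.db.commit()
            print(f"Saved {len(data_list)} records")

        def scrape(self, url, selectors, delay=2):
            """The full scraping pipeline: fetch, parse, save."""
            print(f"Scraping: {url}")
            html = self.fetch(url)
            data = self.parse(html, selectors)
            self.save(url, data)
            # Random delay to avoid getting banned
            time.sleep(delay + random.uniform(0, 2))
            return data

Module 2: Anti-bot workarounds

Approach 1: Random User-Agent

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/119.0.0.0",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Safari/17.2",
    ]


    def get_random_ua():
        return random.choice(USER_AGENTS)

Approach 2: Rotating proxy IPs

    PROXY_LIST = [
        "http://proxy1:8080",
        "http://proxy2:8080",
        "http://proxy3:8080",
    ]


    def fetch_with_proxy(url, proxy_list):
        """Fetch a page through a randomly chosen proxy."""
        proxy = random.choice(proxy_list)
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True, proxy={"server": proxy})
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            html = page.content()
            browser.close()
            return html

Approach 3: Simulating human behavior

    def human_like_scroll(page):
        """Scroll the way a person would: a few uneven steps with pauses."""
        for _ in range(random.randint(3, 8)):
            page.mouse.wheel(0, random.randint(200, 600))
            time.sleep(random.uniform(0.5, 1.5))


    def human_like_click(page, selector):
        """Click at a random offset inside the element instead of dead center."""
        elem = page.locator(selector)
        box = elem.bounding_box()
        if box:
            x = box["x"] + random.uniform(5, box["width"] - 5)
            y = box["y"] + random.uniform(5, box["height"] - 5)
            page.mouse.click(x, y)
            time.sleep(random.uniform(0.3, 0.8))

Approach 4: Cookie persistence

    import json
    import os


    def save_cookies(context, path="cookies.json"):
        """Save the login cookies of a browser context."""
        cookies = context.cookies()
        with open(path, "w") as f:
            json.dump(cookies, f)


    def load_cookies(context, path="cookies.json"):
        """Load previously saved cookies, if the file exists."""
        if os.path.exists(path):
            with open(path) as f:
                cookies = json.load(f)
            context.add_cookies(cookies)

Approach 5: Request rate control

    class RateLimiter:
        """Request rate limiter (sliding one-minute window)."""

        def __init__(self, max_requests_per_minute=20):
            self.max_rpm = max_requests_per_minute
            self.requests = []

        def wait_if_needed(self):
            """Sleep if requests are coming too fast."""
            now = time.time()
            # Drop records older than one minute
            self.requests = [t for t in self.requests if now - t < 60]
            if len(self.requests) >= self.max_rpm:
                wait_time = 60 - (now - self.requests[0])
                print(f"Rate limit reached, waiting {wait_time:.1f} s")
                time.sleep(wait_time)
            self.requests.append(time.time())

Module 3: 5 real-world cases

Case 1: Scraping the CSDN hot list

    def scrape_csdn_hot():
        """Scrape the CSDN artificial intelligence hot list."""
        scraper = WebScraper()
        selectors = {
            "container": ".blog-list-item-top",
            "fields": {
                "title": ".blog-list-item-top a",
                # parse() returns the link's text; reading the href attribute
                # itself would need an extension to parse()
                "link": ".blog-list-item-top a[href]",
                "views": ".blog-list-item-top .view-num",
            },
        }
        url = "https://blog.csdn.net/nav/ai"
        data = scraper.scrape(url, selectors)
        for item in data[:10]:
            print(f"{item['title']} - {item.get('views', 'N/A')} views")
        return data

Case 2: Scraping news

    def scrape_news(url, selectors):
        """Generic news scraper."""
        scraper = WebScraper()
        return scraper.scrape(url, selectors)


    # Example: selectors for a tech blog
    selectors = {
        "container": "article.post",
        "fields": {
            "title": "h2 a",
            "summary": ".post-excerpt",
            "date": ".post-date",
        },
    }

Case 3: Monitoring product prices

    import re


    def monitor_price(url, price_selector, product_name):
        """Record the current price of a product so changes can be tracked."""
        scraper = WebScraper()
        html = scraper.fetch(url)
        soup = BeautifulSoup(html, "html.parser")
        price_elem = soup.select_one(price_selector)
        if price_elem:
            price_text = price_elem.get_text(strip=True)
            # Extract the number, allowing thousands separators and decimals
            price = re.search(r"\d[\d,]*\.?\d*", price_text)
            if price:
                price_val = float(price.group().replace(",", ""))
                print(f"{product_name}: ¥{price_val}")
                # WebScraper only creates the scraped table, so create prices here
                scraper.db.execute(
                    "CREATE TABLE IF NOT EXISTS prices "
                    "(product TEXT, price REAL, recorded_at DATETIME)"
                )
                scraper.db.execute(
                    "INSERT INTO prices (product, price, recorded_at) "
                    "VALUES (?, ?, datetime('now'))",
                    (product_name, price_val),
                )
                scraper.db.commit()
                return price_val
        return None

Case 4: Scraping comments

    def scrape_comments(url, comment_selector, max_pages=5):
        """Scrape comments across multiple pages."""
        scraper = WebScraper()
        all_comments = []
        for page_num in range(1, max_pages + 1):
            page_url = f"{url}?page={page_num}"
            html = scraper.fetch(page_url)
            soup = BeautifulSoup(html, "html.parser")
            comments = soup.select(comment_selector)
            for comment in comments:
                all_comments.append(comment.get_text(strip=True))
            print(f"Page {page_num}: {len(comments)} comments")
            # Random delay between pages
            time.sleep(random.uniform(2, 5))
        print(f"Collected {len(all_comments)} comments in total")
        return all_comments

Case 5: Scheduled scraping with change notifications

    import schedule


    def scheduled_scrape():
        """Scrape on a schedule and check for changes."""
        scraper = WebScraper()
        # Read the previous data first: scrape() saves the new batch, so
        # querying afterwards would compare the batch against itself
        last = scraper.db.execute(
            "SELECT content FROM scraped ORDER BY scraped_at DESC LIMIT 10"
        ).fetchall()
        last_contents = {row[0] for row in last}
        # Scrape the current data (selectors: a dict in the format shown above)
        current = scraper.scrape("https://example.com/data", selectors)
        # Check for new content
        new_items = [item for item in current
                     if item.get("content") not in last_contents]
        if new_items:
            print(f"Found {len(new_items)} new items")
            # Send a notification (email, WeCom, etc.); send_notification is
            # left for you to implement
            send_notification(new_items)


    # Scrape once an hour
    schedule.every(1).hours.do(scheduled_scrape)

    # schedule only runs jobs when polled, so keep a loop running
    while True:
        schedule.run_pending()
        time.sleep(60)

Pitfalls

Pitfall 1: Scraping before the page has finished loading
Symptom: the scraped data is empty or incomplete.
Cause: the page is rendered dynamically by JS. The HTML has been downloaded, but the JS has not finished running.
Fix: use wait_until="networkidle" to wait for the network to go idle, or use wait_for_selector to wait for a specific element to appear.

Pitfall 2: Getting your IP banned
Symptom: after about 100 pages, every request suddenly returns 403.
Cause: the request rate was too high and triggered the site's anti-bot mechanism.
Fix: add a random delay (2-5 seconds), rotate the User-Agent, and use proxy IPs.

Pitfall 3: Expired cookies
Symptom: scraping behind a login suddenly fails and returns the login page.
Cause: the cookies have expired.
Fix: log in again periodically to get fresh cookies, or use the cookie persistence approach above.

Pitfall 4: Encoding problems
Symptom: all the scraped Chinese text is garbled.
Cause: the page encoding is not UTF-8.
Fix: the HTML returned by page.content() already has the encoding handled, so do not decode it manually.

Pitfall 5: Scraping is too slow
Symptom: Playwright needs 3-5 seconds per page, so 1000 pages take about an hour.
Cause: Playwright has to launch a browser, which is expensive.
Fix: reuse one browser instance through a browser context instead of launching a new browser for every page. For simple pages, use requests instead of Playwright.

Summary: 3 key lessons

1. Playwright is the best option for scraping dynamic pages. It handles JS rendering, simulated logins, and simulated human behavior, and it is faster and more stable than Selenium.
2. The core of getting past anti-bot checks is acting like a person. Random delays, rotating User-Agents, and simulated scrolling and clicking make the scraper look like a real visitor.
3. Rate control matters more than anti-bot tricks. Better to go a little slower than to get your IP banned: once banned, you have to switch IPs or wait a long time to recover.

Have you done any web scraping? What anti-bot problems did you run into? Share them in the comments.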
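
A footnote to Pitfall 5: the fix there (reuse one browser instead of launching one per page) is described but not shown, so here is a minimal sketch of what it might look like. The function names `fetch_with_context` and `fetch_many` are illustrative and not part of the scraper above; the sketch assumes Playwright's sync API and a single shared browser context, and it fetches pages one after another rather than in parallel.

```python
from typing import Iterable, List, Optional


def fetch_with_context(context, url: str, wait_for: Optional[str] = None) -> str:
    """Open a tab in an existing browser context, return the rendered HTML,
    then close the tab. `context` is expected to be a Playwright BrowserContext."""
    page = context.new_page()
    try:
        page.goto(url, wait_until="networkidle")
        if wait_for:
            page.wait_for_selector(wait_for)
        return page.content()
    finally:
        # Close only the tab; the browser and context stay alive for the next URL
        page.close()


def fetch_many(urls: Iterable[str], wait_for: Optional[str] = None) -> List[str]:
    """Launch Chromium once and fetch every URL through one shared context."""
    # Imported lazily so fetch_with_context can be used without Playwright installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context()
        try:
            return [fetch_with_context(context, u, wait_for) for u in urls]
        finally:
            context.close()
            browser.close()
```

Calling `fetch_many(["https://example.com/a", "https://example.com/b"])` pays the browser start-up cost once for the whole batch. Cookies are shared across pages in the same context, which suits logged-in scraping; if pages must be isolated from each other, creating a new context per page (still inside one browser) is the usual alternative.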