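Have you ever wondered what shoppers are actually saying in the reviews of the many thousands of products on JD.com? For this project we built a systematic crawling pipeline and collected multi-dimensional review data for 975 best-selling products on the JD platform, 8,546 valid reviews in total. The technical approach and the hands-on process are described below.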
III. Precise Collection: Code Implementation
1. Parsing the review list page
import re
import json
from bs4 import BeautifulSoup


def parse_comment_list(html):
    """Parse one review list page and return (comments, total_comments, has_next)."""
    soup = BeautifulSoup(html, "html.parser")
    script_tag = soup.find("script", id="J-product評論-列表")
    if not script_tag:
        # Keep the return shape identical to the success path (three values)
        return [], 0, False
    # Extract the JSON payload (JD stores the review data in a JS variable)
    match = re.search(r'window\.__INITIAL_STATE__=(.*?);</script>', str(script_tag), re.S)
    if not match:
        return [], 0, False
    data = json.loads(match.group(1))
    comments = []
    for item in data["comments"]:
        comments.append({
            "comment_id": item["id"],
            "content": item["content"],
            "score": item["score"],
            "user_name": item["userName"],
            "creation_time": item["creationTime"],
            "useful_votes": item["usefulVoteCount"],
            "reply_count": item["replyCount"],
            # Image list is optional, so fall back to an empty list
            "images": [img["imgUrl"] for img in item.get("images", [])],
            "user_level": item["userLevelName"],
            # Colour and size are optional; strip so a missing pair yields ""
            "product_model": (item.get("productColor", "") + " " + item.get("productSize", "")).strip(),
        })
    total_comments = data["productCommentSummary"]["commentCount"]
    has_next = data["page"]["pageNo"] < data["page"]["pageTotal"]
    return comments, total_comments, has_next
2. Deep collection loop (with pagination)
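A pagination loop of this kind could be sketched as follows. This is a minimal illustration, not the original implementation: `fetch_page` is a hypothetical callable that downloads one page and returns the `(comments, total_comments, has_next)` tuple produced by a parser such as `parse_comment_list`, and `max_pages` and `delay_seconds` are assumed parameters for capping the crawl and throttling requests.

```python
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical fetcher contract: page number -> (comments, total_comments, has_next)
PageResult = Tuple[List[Dict], int, bool]


def collect_all_comments(
    fetch_page: Callable[[int], PageResult],
    max_pages: int = 100,
    delay_seconds: float = 1.0,
) -> List[Dict]:
    """Walk review pages until there is no next page or max_pages is reached."""
    collected: List[Dict] = []
    seen_ids = set()
    for page_no in range(1, max_pages + 1):
        comments, _total, has_next = fetch_page(page_no)
        for comment in comments:
            # Skip duplicates, which can appear when pages shift between requests
            if comment["comment_id"] not in seen_ids:
                seen_ids.add(comment["comment_id"])
                collected.append(comment)
        if not has_next or not comments:
            break
        time.sleep(delay_seconds)  # throttle to keep the request rate low
    return collected
```

Passing the fetcher in as a parameter keeps the loop independent of how pages are downloaded, so the same loop can be exercised with canned pages before it touches the network.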
Result Object:
---------------------------------------
{
  "items": {
    "totalpage": "100",
    "total_results": 20000,
    "page_size": 10,
    "page": "1",
    "item": [
      {
        "rate_id": "21992238159",
        "rate_content": "物流和產品都不錯,性價比高,贊贊贊 質量非常好,客服態度非常非常贊,有問題及時給解決了購物體驗很棒,商品物美價廉,質量優秀。物流迅速,商家服務貼心,售后無憂。高顏值,高品質,非常好,一分錢一分貨,材質外觀和質量一看就很上檔次,非常喜歡",
        "rate_date": "2024-12-23 13:49:08",
        "pics": [
          "http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/268223/27/1816/23111/6768f9d3F79259578/1f946da747fb3842.jpg.dpg",
          "http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/267912/12/1941/21072/6768f9d3F8603a860/4778a28cfc8bb02d.jpg.dpg"
        ],
        "display_user_nick": "唐***月",
        "videos": [],
        "auction_sku": null,
        "add_feedback": null
      },
      {
        "rate_id": "21940879411",
        "rate_content": "這條充電線質量非常好,線材柔軟,使用壽命長。充電速度快,兼容性強,適用于多種設備。外觀設計簡潔大方,白色外觀顯得干凈整潔。而且價格合理,性價比很高。使用了一段時間,沒有出現任何質量問題,非常滿意。推薦給需要充電線的朋友們,絕對物超所值!",
        "rate_date": "2024-12-13 23:39:55",
        "pics": [
          "http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/228791/35/34183/56412/675c553fFa70cd813/2e1022f9a25e945a.jpg.dpg",
          "http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/195256/23/50656/47240/675c5541F79f5af5e/e4d60651ed0c626f.jpg.dpg"
        ],
        "display_user_nick": "j***j",
        "videos": [],
        "auction_sku": null,
        "add_feedback": null
      },
      {
        "rate_id": "21971245211",
        "rate_content": "快遞很快,質量棒極了,建議購買強烈推薦!商品物超所值,質量可靠。物流快,商家服務熱情,售后服務完善。物流很快, 產品很快就收到了,比想象中還好,不錯不錯!希望能耐用商品質量非常好,外觀設計新穎,物流速度快,商家服務態度好,性價比高。",
        "rate_date": "2024-12-20 06:09:37",
        "pics": [
          "http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/262487/32/530/222631/67649996F59b29dbb/41daef4d63774912.jpg.dpg",
          "http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/254171/36/1720/29817/6764999eF1e3da91c/49f9e4e3c0f7b489.jpg.dpg"
        ],
        "display_user_nick": "j***b",
        "videos": [],
        "auction_sku": null,
        "add_feedback": null
      },
      {
        "rate_id": "22505131588",
        "rate_content": "這款充電線質量真心不錯! 用了兩年,依然如新,充電速度也很快,完全滿足日常需求。非常滿意的一次購物體驗! ",
        "rate_date": "2025-03-05 20:55:47",
        "pics": [
          "http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/261336/2/28436/316471/67c849d2Fd7ca649e/834ef8563c36f823.jpg.dpg"
        ],
        "display_user_nick": "鄭***c",
        "videos": [],
        "auction_sku": null,
        "add_feedback": null
      },
      {
        "rate_id": "21921806331",
        "rate_content": "真的超級喜歡,非常支持,質量非常好,與賣家描述的完全一致,非常滿意,真的很喜歡,完全超出期望值,發貨速度非常快,包裝非常仔細、嚴實,物流公司服務態度很好,運送速度很快,很滿意的一次購物",
        "rate_date": "2024-12-10 14:29:28",
        "pics": [
          "http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/151256/8/50806/62364/6757dfc7Fb0e0d3d1/cfcfcea8f15baeb6.jpg.dpg"
        ],
        "display_user_nick": "j***6",
        "videos": [],
        "auction_sku": null,
        "add_feedback": null
      },
      {
        "rate_id": "22023511937",
        "rate_content": "這個商品的質量真是太好了,用起來非常順手,效果也很滿意。外觀精美,不僅提升了使用體驗,還為家居增添了美感。價格雖然高了一些,但相比其優良的品質和體驗,絕對是物超所值。強烈推薦給追求品質生活的你!",
        "rate_date": "2024-12-29 08:22:00",
        "pics": [],
        "display_user_nick": "馳***生",
        "videos": [],
        "auction_sku": null,
        "add_feedback": null
      },
      {
        "rate_id": "23041323552",
        "rate_content": "沖電器大小適中,沖電非常的快并且不發熱。非常不錯!",
        "rate_date": "2025-04-21 17:06:36",
        "pics": [
          "http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/283321/39/23600/2544182/68060a9bF35b67b24/1ccea6f930ef3986.jpg.dpg",
          "http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/271623/13/24033/2449752/68060a9bFf4aa3b47/184ce7b5a88a25f4.jpg.dpg"
        ],
        "display_user_nick": "j***c",
        "videos": [],
        "auction_sku": null,
        "add_feedback": null
      },
      {
        "rate_id": "22823734129",
        "rate_content": "很好的充電套裝,線足夠長,充電也夠快,非常滿意。",
        "rate_date": "2025-04-04 17:55:47",
        "pics": [
          "http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/281947/15/15093/51962/67efac76Ff463f982/a596114283fc4e3a.jpg.dpg"
        ],
        "display_user_nick": "雪***泳",
        "videos": [],
        "auction_sku": null,
        "add_feedback": null
      },
      {
        "rate_id": "22036250935",
        "rate_content": "東西質量非常好,與賣家描述的完全一致,非常滿意\n做工質感:好\n充電速度:好\n便攜性能:好\n安全性能:好\n其他特色:好",
        "rate_date": "2024-12-31 12:06:34",
        "pics": [
          "http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/262528/8/5909/64887/67736dcaF7461323d/7565029f66830601.jpg.dpg"
        ],
        "display_user_nick": "j***a",
        "videos": [],
        "auction_sku": null,
        "add_feedback": null
      },
      {
        "rate_id": "22666913189",
        "rate_content": "非常不錯,質感很好,充電快",
        "rate_date": "2025-03-22 17:55:18",
        "pics": [
          "http://img30.360buyimg.com/shaidan/s1080x1080_jfs/t1/276088/1/7952/71045/67de8905Fecb7c861/5bf2b9f89b0b7897.jpg.dpg"
        ],
        "display_user_nick": "o***g",
        "videos": [],
        "auction_sku": null,
        "add_feedback": null
      }
    ],
    "_ddf": "fb"
  },
  "secache": "5dc2b1edf5008bcf6577411b1f5fbd16",
  "secache_time": 1749537314,
  "secache_date": "2025-06-10 14:35:14",
  "translate_status": "",
  "translate_time": 0,
  "language": {
    "default_lang": "cn",
    "current_lang": "cn"
  },
  "error": "",
  "reason": "",
  "error_code": "0000",
  "cache": 0,
  "api_info": "today:71 max:10000 all[374=71+49+254];expires:2030-10-30",
  "execution_time": "4.646",
  "server_time": "Beijing/2025-06-10 14:35:14",
  "client_ip": "106.6.46.187",
  "call_args": {
    "num_iid": "10114820943599",
    "data": "1"
  },
  "api_type": "jd",
  "translate_language": "zh-CN",
  "translate_engine": "google_new",
  "server_memory": "3.33MB",
  "request_id": "gw-3.6847d21e3ae75",
  "last_id": "4513851984"
}
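IV. Performance Optimization Suggestions
- Distributed crawler architecture:
┌────────────────┐    ┌────────────────┐    ┌────────────────┐
│   Scheduler    │    │ Crawler nodes  │    │ Data warehouse │
│    (Redis)     │←──→│    (Scrapy)    │←──→│   (MongoDB)    │
└────────────────┘    └────────────────┘    └────────────────┘
        ↑                     ↑                     ↑
        │              ┌──────┴──────┐              │
        └─────────────→│ Proxy pool  │←─────────────┘
                       └──────┬──────┘
                         ┌────┴─────┐
                         │ Cleaning │
                         └──────────┘
- Incremental collection:
Use Redis to record the time and review ID of the last collected review, then fetch only reviews newer than that checkpoint, which cuts down on repeated requests.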
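The checkpoint idea can be sketched as below. This is an illustrative sketch under stated assumptions, not the original code: the key name `jd:checkpoint:<product_id>` is hypothetical, review IDs are assumed to be numeric, and `InMemoryStore` is a stand-in that exposes only `get` and `set` so the example runs without a server. A redis-py client offers the same two methods and could be passed in instead.

```python
import json
from typing import Dict, Iterable, List, Optional


class InMemoryStore:
    """Stand-in for a Redis client; exposes only get/set, like redis-py's Redis."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


def select_new_comments(store, product_id: str, comments: Iterable[Dict]) -> List[Dict]:
    """Return comments newer than the stored checkpoint and advance the checkpoint."""
    key = f"jd:checkpoint:{product_id}"  # hypothetical key naming scheme
    raw = store.get(key)
    if isinstance(raw, bytes):  # redis-py returns bytes unless decode_responses=True
        raw = raw.decode("utf-8")
    # Checkpoint is (creation_time, comment_id); "YYYY-MM-DD HH:MM:SS" sorts as text
    checkpoint = tuple(json.loads(raw)) if raw else ("", 0)
    fresh = [
        c for c in comments
        if (c["creation_time"], int(c["comment_id"])) > checkpoint
    ]
    if fresh:
        newest = max((c["creation_time"], int(c["comment_id"])) for c in fresh)
        store.set(key, json.dumps(list(newest)))
    return fresh
```

Comparing the timestamp first and the ID second means reviews that share the same second are still ordered consistently, provided the numeric-ID assumption holds.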
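V. Notes
- Responding to JD anti-crawling upgrades:
  - Check regularly for page structure changes (for example, review data moving from a JS variable to a JSON endpoint).
  - Use Selenium + undetected-chromedriver to get past the latest anti-crawling detection.
- Code maintenance cost:
  Crawler code needs frequent adaptation to JD page updates; pairing it with an automation tool such as Playwright is recommended to improve robustness.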
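One lightweight way to notice a page structure change early is to classify each fetched response before parsing it. The sketch below is an assumption-laden illustration: the function name and the three labels are hypothetical, and the only marker it relies on is the `window.__INITIAL_STATE__` variable used by the parser in section III.

```python
import re

# Marker taken from the parser's regular expression; everything else is illustrative
STATE_MARKER = re.compile(r"window\.__INITIAL_STATE__\s*=")


def check_page_structure(body: str) -> str:
    """Classify a fetched response so structure drift is caught before parsing."""
    if STATE_MARKER.search(body):
        return "js_variable"  # review data still embedded in a JS variable
    stripped = body.lstrip()
    if stripped.startswith("{") or stripped.startswith("["):
        return "json_response"  # body looks like a raw JSON payload
    return "unknown"  # neither form found; the parser likely needs updating
```

Logging or alerting on anything other than the expected label gives a cheap signal that the parsing code needs attention.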
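With the approach above, JD reviews can be collected precisely while staying compliant. Use a crawler only when it is genuinely necessary, and keep the collection scale under strict control.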