Example: Scraping a Product List from the Sefun Website with BeautifulSoup and Requests

Environment

  • Python 3.12
  • venv (virtual environment)

Install Dependencies

sh
pip install beautifulsoup4
pip install requests
pip install lxml

Finding All CategoryIds in the Navigation Tags

Observe the HTML structure of the navigation bar.

TIP

Grab every link whose href contains CategoryId, then extract the CategoryId and the link title.

HTML Structure

html

<ul class="nav navbar-nav">
    <li class="nav-item dropdown d-none d-md-block"><a
        href="/Goods/Index?Type=0&amp;CategoryId=efc8afab-ce41-470a-aefe-39646f0f464e" class="dropdown-toggle nav-link"
        style="padding: 23px 15px;">
        喜餅禮盒
        <i aria-hidden="true" class="fa fa-angle-down"></i></a>
        <ul class="dropdown-menu dropdown-menu-left">
            <li><a href="/Goods/Index?Type=0&amp;CategoryId=3f09f862-6449-4d8f-90bd-c1f5caaa2eb7" class="dropdown-item">
                西式喜餅
            </a></li>
            <li><a href="/Goods/Index?Type=0&amp;CategoryId=a6917eb9-fd30-4ab9-b768-684e2174d643" class="dropdown-item">
                中式大餅
            </a></li>
            <li><a href="/Goods/Index?Type=0&amp;CategoryId=a11d5a30-e792-4846-88c3-3dea663d9850" class="dropdown-item">
                中西喜餅
            </a></li>
            <li><a href="/Goods/Index?Type=0&amp;CategoryId=79efd7ec-8368-4fc7-b3bb-c947519c7726" class="dropdown-item">
                經典雙層喜餅
            </a></li>
            <li><a href="/Goods/Index?Type=0&amp;CategoryId=3416033d-62fd-4643-80a4-fc7bceda2ff0" class="dropdown-item">
                喜餅宅配試吃
            </a></li>
        </ul>
    </li>
    <!-- ... -->
</ul>

TIP

Below is sample code that uses BeautifulSoup and requests to scrape the category IDs from the Sefun website:

Sample Code
python
import requests
from bs4 import BeautifulSoup
import json

# Fetch the HTML content of the page
url = 'https://www.sefunnet.com/Home/Index'
html_doc = requests.get(url).text

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')

# Find the 'nav' tag
nav_tag = soup.find('nav')

# Find all 'a' tags whose href contains 'CategoryId'
a_tags = nav_tag.find_all('a', href=lambda href: href and 'CategoryId' in href)

# Extract the link text and category_id from each 'a' tag
result = []
for a in a_tags:
    href = a['href']
    title = a.get_text(strip=True)
    category_id = href.split('CategoryId=')[-1]
    result.append({"category_id": category_id, "href": href, "title": title})

# Convert the result to JSON
json_result = json.dumps(result, ensure_ascii=False, indent=2)

print(json_result)
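
As a side note, splitting on 'CategoryId=' works for these links, but the query string could also be parsed with the standard library's urllib.parse, which stays correct even if other parameters follow CategoryId. A minimal sketch (the href value is copied from the navigation HTML above):

python
from urllib.parse import urlparse, parse_qs

# Example href taken from the navigation HTML above
href = '/Goods/Index?Type=0&CategoryId=efc8afab-ce41-470a-aefe-39646f0f464e'

# parse_qs returns a dict of lists, e.g. {'Type': ['0'], 'CategoryId': ['efc8...']}
params = parse_qs(urlparse(href).query)
category_id = params.get('CategoryId', [None])[0]

print(category_id)  # efc8afab-ce41-470a-aefe-39646f0f464e
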
Refactored Code
python

import requests
from bs4 import BeautifulSoup
import json

"""
This script fetches the HTML content from a URL, parses the content using BeautifulSoup,
finds the 'nav' tag, finds all 'a' tags with href containing 'CategoryId', extracts the inner text and category_id
from the 'a' tags, converts the result to JSON, and writes the JSON result to a file.
"""


def fetch_html(url):
    """Fetch the HTML content from the given URL."""
    response = requests.get(url)
    response.raise_for_status()
    return response.text


def parse_html(html):
    """Parse the HTML content using BeautifulSoup."""
    return BeautifulSoup(html, 'lxml')


def extract_categories(nav_tag, base_url):
    """Extract categories from 'nav' tag and return as a list of dictionaries."""
    categories = []
    a_tags = nav_tag.find_all('a', href=lambda href: href and 'CategoryId' in href)
    existing_category_ids = set()
    for a in a_tags:
        href = base_url + a['href']
        title = a.get_text(strip=True)
        category_id = href.split('CategoryId=')[-1]
        if category_id not in existing_category_ids:
            existing_category_ids.add(category_id)
            categories.append({"category_id": category_id, "href": href, "title": title})
    return categories


def save_json(data, filename):
    """Save the given data to a JSON file."""
    with open(filename, 'w', encoding='utf-8') as jsonfile:
        json.dump(data, jsonfile, ensure_ascii=False, indent=2)


def main():
    domain_url = 'https://www.sefunnet.com'
    html_doc = fetch_html(domain_url)
    soup = parse_html(html_doc)
    nav_tag = soup.find('nav')

    if nav_tag:
        category_list = extract_categories(nav_tag, domain_url)
        save_json(category_list, 'category_list.json')
    else:
        print("No 'nav' tag found.")


if __name__ == '__main__':
    main()

Reading CategoryIds from the Existing JSON File and Scraping All Products Under Each Category

TIP

Observe the structure of the existing JSON file and read the CategoryIds from it.

json
[
  {
    "category_id": "efc8afab-ce41-470a-aefe-39646f0f464e",
    "href": "https://www.sefunnet.com/Goods/Index?Type=0&CategoryId=efc8afab-ce41-470a-aefe-39646f0f464e",
    "title": "喜餅禮盒"
  }
]

TIP

Observe the HTML structure of the product list page.

html

<div class="productBox"><a href="javascript:void(0)"
                           onclick="GAOnclickEvent({&quot;GoodsId&quot;:&quot;15fc86b2-9676-423f-a8d9-cd608d7dee5e&quot;,&quot;GoodsDetailId&quot;:&quot;e567e88c-74f1-4d23-bdf2-cd368146c296&quot;,&quot;GoodsName&quot;:&quot;餅乾鐵盒幾米完美小孩 (經典珍藏版)&quot;,&quot;Type&quot;:0,&quot;ShipType&quot;:0,&quot;GoodsImageUrl&quot;:[{&quot;ImageUrl&quot;:&quot;https://sefunnetblob.blob.core.windows.net/product/1698928215650.jpeg&quot;,&quot;Index&quot;:1},{&quot;ImageUrl&quot;:&quot;https://sefunnetblob.blob.core.windows.net/product/1692113543559.jpeg&quot;,&quot;Index&quot;:2},{&quot;ImageUrl&quot;:&quot;https://sefunnetblob.blob.core.windows.net/product/1692113543890.jpeg&quot;,&quot;Index&quot;:3},{&quot;ImageUrl&quot;:&quot;https://sefunnetblob.blob.core.windows.net/product/1692113543336.jpeg&quot;,&quot;Index&quot;:4},{&quot;ImageUrl&quot;:&quot;https://sefunnetblob.blob.core.windows.net/product/1692113543716.jpeg&quot;,&quot;Index&quot;:5}],&quot;OriginalPrice&quot;:630.0,&quot;IsSpecialOffer&quot;:false,&quot;SpecialPrice&quot;:null,&quot;StockStatus&quot;:0,&quot;StockStatusName&quot;:&quot;庫存正常&quot;,&quot;SellingPrice&quot;:630.0,&quot;IsDiscount&quot;:false,&quot;MaximumPurchases&quot;:null,&quot;MinimumPurchases&quot;:1},'3f09f862-6449-4d8f-90bd-c1f5caaa2eb7','西式喜餅')"
                           class="productImage clearfix">
    <div class="image-box"><img src="https://sefunnetblob.blob.core.windows.net/product/1698928215650.jpeg"
                                alt="products-img" class="w-100"></div>
</a>
    <div class="productCaption clearfix"><a href="javascript:void(0)"
                                            onclick="GAOnclickEvent({&quot;GoodsId&quot;:&quot;15fc86b2-9676-423f-a8d9-cd608d7dee5e&quot;,&quot;GoodsDetailId&quot;:&quot;e567e88c-74f1-4d23-bdf2-cd368146c296&quot;,&quot;GoodsName&quot;:&quot;餅乾鐵盒幾米完美小孩 (經典珍藏版)&quot;,&quot;Type&quot;:0,&quot;ShipType&quot;:0,&quot;GoodsImageUrl&quot;:[{&quot;ImageUrl&quot;:&quot;https://sefunnetblob.blob.core.windows.net/product/1698928215650.jpeg&quot;,&quot;Index&quot;:1},{&quot;ImageUrl&quot;:&quot;https://sefunnetblob.blob.core.windows.net/product/1692113543559.jpeg&quot;,&quot;Index&quot;:2},{&quot;ImageUrl&quot;:&quot;https://sefunnetblob.blob.core.windows.net/product/1692113543890.jpeg&quot;,&quot;Index&quot;:3},{&quot;ImageUrl&quot;:&quot;https://sefunnetblob.blob.core.windows.net/product/1692113543336.jpeg&quot;,&quot;Index&quot;:4},{&quot;ImageUrl&quot;:&quot;https://sefunnetblob.blob.core.windows.net/product/1692113543716.jpeg&quot;,&quot;Index&quot;:5}],&quot;OriginalPrice&quot;:630.0,&quot;IsSpecialOffer&quot;:false,&quot;SpecialPrice&quot;:null,&quot;StockStatus&quot;:0,&quot;StockStatusName&quot;:&quot;庫存正常&quot;,&quot;SellingPrice&quot;:630.0,&quot;IsDiscount&quot;:false,&quot;MaximumPurchases&quot;:null,&quot;MinimumPurchases&quot;:1},'3f09f862-6449-4d8f-90bd-c1f5caaa2eb7','西式喜餅')"
                                            class="text-center"><h5 style="font-size: 14px;">餅乾鐵盒.幾米完美小孩
        (經典珍藏版) </h5></a> <h4 class="text-center">$630</h4></div>
</div>

TIP

Extract the product ID and product name from the onclick attribute.
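
Note that the onclick value is a call to GAOnclickEvent whose first argument is a JSON object carrying the whole product record, so instead of matching only the GoodsId you could capture the entire object and parse it with json.loads. A minimal sketch, using a shortened, illustrative onclick value (field names copied from the HTML above):

python
import json
import re

# Shortened, illustrative onclick value; BeautifulSoup already decodes &quot; to "
onclick = (
    'GAOnclickEvent({"GoodsId":"15fc86b2-9676-423f-a8d9-cd608d7dee5e",'
    '"GoodsName":"餅乾鐵盒幾米完美小孩 (經典珍藏版)",'
    '"OriginalPrice":630.0,"SellingPrice":630.0},'
    "'3f09f862-6449-4d8f-90bd-c1f5caaa2eb7','西式喜餅')"
)

# Greedily capture the JSON object passed as the first argument, then parse it
match = re.search(r"GAOnclickEvent\((\{.*\}),\s*'", onclick)
if match:
    goods = json.loads(match.group(1))
    print(goods["GoodsId"], goods["GoodsName"], goods["SellingPrice"])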

Sample Code
python
import json
import requests
import re
from bs4 import BeautifulSoup

"""
This script reads the category_list.json file, extracts product details from the category list,
converts the result to JSON, and writes the JSON result to a file.
"""

# read from the json file
with open('category_list.json', 'r', encoding='utf-8') as jsonfile:
    category_list = json.load(jsonfile)

# result product list
product_list = []

# extract product details from the category list
for category in category_list:
    href = category['href']

    # Fetch the HTML content from the URL
    html_doc = requests.get(href).text

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(html_doc, 'lxml')

    # Find all 'div' tags with class 'productBox'
    product_divs = soup.find_all('div', class_='productBox')

    # Extract product details from the 'div' tags
    for product_div in product_divs:
        product_name = product_div.find('h5').get_text(strip=True)
        product_price = product_div.find('h4', class_='text-center').get_text(strip=True)
        product_a_tag = product_div.find('a', onclick=True)
        match = re.search(r'"GoodsId":"([^"]+)"', product_a_tag['onclick'])
        product_id = match.group(1) if match else None
        product_list.append({
            "category_id": category['category_id'],
            "category_title": category['title'],
            "id": product_id,
            "name": product_name,
            "price": product_price,
            "url": "https://www.sefunnet.com/Goods/Detail?id=" + product_id
        })
# Convert the result to JSON
json_result = json.dumps(product_list, ensure_ascii=False, indent=2)

# write the JSON result to a file
with open('product_list.json', 'w', encoding='utf-8') as jsonfile:
    jsonfile.write(json_result)
Refactored Sample Code
python
import json
import requests
import re
from bs4 import BeautifulSoup


def read_json_file(filename):
    """Read JSON data from a file."""
    with open(filename, 'r', encoding='utf-8') as file:
        return json.load(file)


def fetch_html(url):
    """Fetch the HTML content from the given URL."""
    response = requests.get(url)
    response.raise_for_status()
    return response.text


def parse_html(html):
    """Parse the HTML content using BeautifulSoup."""
    return BeautifulSoup(html, 'lxml')


def extract_product_details(category, base_url='https://www.sefunnet.com'):
    """Extract product details from a category page."""
    html_doc = fetch_html(category['href'])
    soup = parse_html(html_doc)
    product_divs = soup.find_all('div', class_='productBox')

    products = []
    existing_product_ids = set()
    for product_div in product_divs:
        product_name = product_div.find('h5').get_text(strip=True)
        product_price = product_div.find('h4', class_='text-center').get_text(strip=True)
        product_a_tag = product_div.find('a', onclick=True)
        match = re.search(r'"GoodsId":"([^"]+)"', product_a_tag['onclick'])
        product_id = match.group(1) if match else None

        if product_id and product_id not in existing_product_ids:
            existing_product_ids.add(product_id)
            products.append({
                "category_id": category['category_id'],
                "category_title": category['title'],
                "id": product_id,
                "name": product_name,
                "price": product_price,
                "url": f"{base_url}/Goods/Detail?id={product_id}"
            })
    return products


def save_json(data, filename):
    """Save the given data to a JSON file."""
    with open(filename, 'w', encoding='utf-8') as file:
        json.dump(data, file, ensure_ascii=False, indent=2)


def main():
    category_list = read_json_file('category_list.json')
    all_products = []
    all_products_id_set = set()

    for category in category_list:
        products = extract_product_details(category)
        for product in products:
            if product['id'] not in all_products_id_set:
                all_products.append(product)
                all_products_id_set.add(product['id'])

    save_json(all_products, 'product_list.json')


if __name__ == '__main__':
    main()
Example of the resulting product_list.json:

json
[
  {
    "category_id": "efc8afab-ce41-470a-aefe-39646f0f464e",
    "category_title": "喜餅禮盒",
    "id": "15fc86b2-9676-423f-a8d9-cd608d7dee5e",
    "name": "餅乾鐵盒.幾米完美小孩 (經典珍藏版)",
    "price": "$630",
    "url": "https://www.sefunnet.com/Goods/Detail?id=15fc86b2-9676-423f-a8d9-cd608d7dee5e"
  }
]

Lambda Expressions in Python

In Python, lambda creates an anonymous function: a simple, inline function you can define on the spot without formally declaring it with the def keyword. The syntax of a lambda function is:

python
lambda arguments: expression
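
For example, a tiny (purely illustrative) lambda that doubles its argument:

python
# An anonymous function that doubles its argument
double = lambda x: x * 2
print(double(5))  # 10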

In the BeautifulSoup code above:

python
a_tags = soup.find_all('a', href=lambda href: href and 'CategoryId' in href)

Here, the lambda function works as follows:

  • href is the argument passed to the lambda function.
  • href and 'CategoryId' in href is the lambda function's expression.

This lambda function checks that href is not None and that the href attribute contains the string 'CategoryId'. If both conditions hold, the expression evaluates truthy; otherwise it evaluates falsy. This effectively filters for a tags whose href attribute contains the string 'CategoryId'.

To illustrate further, here is an equivalent snippet that uses a named function instead of a lambda:

python
def contains_category_id(href):
    return href and 'CategoryId' in href

a_tags = soup.find_all('a', href=contains_category_id)

The contains_category_id function performs the same check as the lambda and can be passed to find_all in the same way. The lambda is used here only to define the function inline more concisely.