爬蟲 & Python

大綱

一、規格書
 二、Python 環境建置
 三、Python 爬蟲套件
 四、程式邏輯
 五、遇到的問題

一、規格書

爬 3 個網站的「一覽頁」和「店鋪詳細頁」：
- baito
- medley
- relax
第一階段：抓取一覽頁的「店鋪 url」和「所在地區」。
第二階段：依「所在地區」排序，再依序撈取「店鋪詳細頁」的資訊。

二、Python 環境建置

Python 3.10 跟之前的版本有很大的差異，例如：查詢 Python 版本的 CLI 不一樣。

Python 3.9

python3 –version
Python 指令在 3.10 以上才有

python -V

1. PyCharm 編輯器

JetBrains，phpStorm 開發商。

2. 安裝 Python

在 mac 或 windows
- Python 官網下載安裝包
在 Linux
- 安裝 pyenv (管理 Python 版本的工具)，再用 pyenv 安裝 Python。

3. 安裝 pyenv 和 pip

使用 Python 開發時，會安裝 2 個必要工具，pyenv 和 pip，分別對應像是 node 的 nvm 和 npm。

	Python	node
管理版本	pyenv	nvm
管理套件	pip	npm

安裝 pyenv

管理 Python 版本的工具，像 node 的 nvm。
pyenv 安裝教學
查看可以安裝的版本

pyenv install -l
安裝其中一種 Python 的版本

pyenv install -v 3.12.2
列出已安裝過的版本

pyenv versions
切換 Python 的版本，三種模式 local、global 或 shell。
- 在當前目錄下設置 Python 版本，這只對當前目錄及其子目錄有效。
  
  pyenv local 3.12.2
- 設置全局 Python 版本，所有 shell 都有效。
  
  pyenv global 3.12.2
- 在當前 shell 中切換 Python 版本，當 shell 退出再進入，會跳回原本預設的版本。
  
  pyenv shell 3.12.2

創建 python 虛擬環境

創建虛擬環境

python -m venv ./.venv
開通虛擬環境（Linux/macOS）

source .venv/bin/activate
退出虛擬環境

deactivate

安裝 pip

管理 Python 套件的工具，像 node 的 npm install 建立 package.json。
安裝 pip

sudo yum install python3-pip
列出安裝過的套件和版本

pip list -v
安裝/解安裝套件 pip install [套件名稱]，例如：

pip install beautifulsoup4
解安裝套件

pip uninstall beautifulsoup4
套件安裝資訊

pip show beautifulsoup4
建立 requirements.txt，專案裡所需套件的清單

pip freeze > requirements.txt
安裝 requirements.txt 清單的套件

pip install -r requirements.txt

三、Python 爬蟲套件

BeautifulSoup

安裝 beautifulsoup4

pip install beautifulsoup4
使用 BeautifulSoup 抓取資料

from bs4 import BeautifulSoup
import requests

# fetch url
fetchUrl = "https://baito.mynavi.jp/tokyo/ss-1/ssm-1/sss-1/kd-61/"
r = requests.get(fetchUrl)

# new BeautifulSoup
soup = BeautifulSoup(r.text, "html.parser")

# 用 css selector 取得 element
element = soup.select(".searchResult > .researchCount")
print(element)

Selenium + BeautifulSoup

安裝 Selenium

pip install selenium
在 Linux 安裝 google chrome

yum install https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm -y
Selenium 開起 chrome 抓取 dom，再用 BeautifulSoup 抓取資料。

from bs4 import BeautifulSoup
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

# 啟動 driver
option = webdriver.ChromeOptions()

# 背景開啟瀏覽器
# option.add_argument('headless')

# 開啟瀏覽器
driver = webdriver.Chrome(option)

# fetch url
fetchUrl = "https://baito.mynavi.jp/cl-002280403865/job-94305516/?reserve_w_29450608"
driver.get(fetchUrl)

# 等待頁面渲染
time.sleep(1)

# tab 会社情報
button = driver.find_element(By.CSS_SELECTOR, ".tab02")
button.click()
time.sleep(3)

# 取得 dom
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
address = soup.select(".workPlaceAddressDiamondMark")
for a in address:
    print(a.text)

PyMySQL 連線資料庫

import pymysql

db = pymysql.connect(host='',
                     user='',
                     password='',
                     database='')

cursor = db.cursor()

四、程式邏輯

撈取一覽頁的邏輯：
依規格書要求的件數撈取 200 件，從第一頁開始，直到滿足 200 件 break 迴圈。要注意的是，若規格書要求 200 件，但實際上只有 199 件，也要 break 迴圈。

/urlList/melody.py

targetUrl = [
    {'url': 'https://job-medley.com/est/pref13/employment3/', 'number': 200},
    {'url': 'https://job-medley.com/ans/pref14/employment3/', 'number': 132},
    {'url': 'https://job-medley.com/ans/pref27/employment3/', 'number': 126},
    ...
]

fetchMelodyJob.py

from urlList import melody
import requests
from bs4 import BeautifulSoup

for idx, list in enumerate(melody.targetUrl):
    page = 1
    count = 0
    while count < list['number']:
        # 從第 1 頁開始 (https://job-medley.com/est/pref13/employment3/?page=2)
        fetchUrl = list['url'] if page == 1 else list['url'] + "?page=" + str(page)
        
        # request 之前 sleep 1~10 秒
        time.sleep(random.randint(1, 10))
        r = requests.get(fetchUrl)

        # new BeautifulSoup
        soup = BeautifulSoup(r.text, "html.parser")

        # 取得一覽頁總件數
        shopCount = soup.select(".c-search-condition-title__important")[0].text

        # 取得資訊
        shopBlock = soup.select(".c-offers-sort-area .c-offers-sort-area__contents .o-gutter-row .o-gutter-row__item")
        for sb in shopBlock:
            data = {
                "job_url": sb.select(".job-url"), # 店鋪詳細 url
                "station": sb.select(".station")  # 店鋪所在地
            }

            # 儲存資料庫
            insert_mysql(fetchUrl, data)
    
            # 計算筆數
            count += 1

            # 若滿足件數，繼續撈取下一個一覽頁
            if count >= list['number']:
                break

        # 若已達一覽頁總件數，繼續撈取下一個一覽頁
        if count >= int(shopCount):
            break

撈取店鋪詳細頁的邏輯：
依車站排序，撈取店舖 url。 Insert 資料庫，並 update flag 紀錄已經 fetch 過此店舖。

fetchMelodyJobDetail.py

import requests
from bs4 import BeautifulSoup

def select_jobs():
    sql = ("SELECT id, job_url, station, is_requested," +
           "CASE WHEN station LIKE '% 尼崎駅%' THEN 1 " +
           "     WHEN station LIKE '% 明石駅%' THEN 2 " +
           "     WHEN station LIKE '% 高松駅%' THEN 3 " +
           " FROM `melody`" +
           " WHERE is_requested = 0" +
           " ORDER BY ordering, id ASC" +
           " LIMIT 100")
           
def update_sql(id):
    try:
        sql = "UPDATE melody SET is_requested = 1 WHERE id = " + str(id)
        cursor.execute(sql)
        db.commit()
    except:
        write_log('sql update error: melody id - ' + str(id))
        db.rollback()

############################## main ##############################
jobs = select_jobs()
for job in jobs:
    #取得資訊
    fetchUrl = job['job_url']
    r = requests.get(fetchUrl, headers={'user-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'})
    soup = BeautifulSoup(r.text, "html.parser")
    ldJson = soup.select('script[type="application/ld+json"]')
    ...
    
    # 儲存資料庫
    insert_mysql(fetchUrl, data)
    
    # 更新資料表 is_requested = 1
    update_sql(job['id'])

tel 資料：
使用 google map api，但這要收費。
Google 地圖平台計費方式
一個月 $200 美元的抵用金，每個 request $0.003 美元，大約可以 call 66,666 次。

import googlemaps

gmaps = googlemaps.Client(key="[MY_TOKEN]")
search_result = gmaps.places(facilityName)
tel = ""
if len(search_result['results']) > 0:
    # 先找到 <位置ID>
    place_id = search_result['results'][0]['place_id']

    # 用 <位置ID> 找店家詳細
    place_details = gmaps.place(place_id=place_id, fields=['name', 'formatted_phone_number'])
    tel = place_details['result'].get('formatted_phone_number')

    # 印出 tel
    print(tel)

偽裝 google bot: Google 檢索器和擷取程式 (使用者代理程式) 總覽

r = requests.get(fetchUrl, headers={'user-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'})

一覽頁 table 設計：

id	fetch_url	job_url	station	is_request
1	一覽頁 url	店鋪 A url	表参道駅徒歩4分	1
2	一覽頁 url	店鋪 B url	日比谷駅徒歩1分	0

五、遇到的問題

chrome 只有支援 x86 的 Linux。目前沒有支援 arm64 的 chrome，使用 node 的木偶人也是如此，木偶人官網有寫。

uname -m
建立爬蟲中斷機制，以免發生中斷又要從頭執行。我使用了 2 個方法：
- 一覽頁的爬蟲：每次重新執行時，確認每個一覽頁已經 insert 了多少「店鋪件數」，已達件數便不再 fetch。
  
  SELECT count(*) FROM baito WHERE fetch_url LIKE ‘[一覽頁 url]%’
- 店鋪詳細頁的爬蟲：table 新增一個欄位 (is_request)，紀錄已經 fetch 過的 url，每次重新執行時，只 select 沒 fetch 過的 url。
  
  SELECT * FROM baito WHERE is_requested = 0