How to Retrieve TDK (Title, Description, Keywords), Part 2

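This script reads web pages from the URLs or local HTML file paths listed in list.txt, collects their header information (meta tags and so on), and saves the result as an Excel file (xlsx format).
It runs faster than the Excel macro version I wrote earlier.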
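
As a minimal sketch of how one entry from list.txt can be sorted into "web page" or "local file", the snippet below checks the URL scheme with `urllib.parse.urlparse`. The function name `classify_entry` and the sample entries are hypothetical and only serve as illustration.

```python
from urllib.parse import urlparse


def classify_entry(entry):
    """Return "remote" for http/https URLs and "local" for everything else."""
    scheme = urlparse(entry).scheme
    return "remote" if scheme in ("http", "https") else "local"


if __name__ == "__main__":
    # Hypothetical entries, as they might appear in list.txt.
    for entry in ["https://example.com/index.html", r"C:\site\index.html", "pages/about.html"]:
        print(entry, "->", classify_entry(entry))
```

A Windows path such as `C:\site\index.html` parses with the scheme `c`, which is neither `http` nor `https`, so it falls into the local branch.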

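It extracts the specified meta information from URLs or local HTML files and writes it to a formatted Excel file.
Before processing, it asks the user for confirmation and automatically installs any required libraries.
Header information is fetched with HTTP requests, and the HTML is parsed with BeautifulSoup.
For readability, the Excel output gets borders, alternating row colors, and adjusted column widths.
When processing finishes, the file opens automatically, which is convenient.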
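
To illustrate the extraction step without installing anything, here is a rough sketch that pulls the title and meta tags out of an HTML string. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup, so it is not what the script itself does. The class `MetaCollector`, the function `extract_meta`, and the sample HTML are assumptions made for this example.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects the <title> text and <meta name/property ... content> pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Standard meta tags use "name"; Open Graph tags use "property".
            key = attrs.get("name") or attrs.get("property")
            if key and "content" in attrs:
                # Keep the first occurrence of each key.
                self.meta.setdefault(key, attrs["content"] or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_meta(html):
    """Return (title, meta_dict) for the given HTML text."""
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.meta


if __name__ == "__main__":
    sample = (
        '<html><head><title> Sample page </title>'
        '<meta name="description" content="Demo">'
        '<meta property="og:title" content="OG Demo" />'
        '</head></html>'
    )
    print(extract_meta(sample))
```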

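■ Main features

・Shows a confirmation dialog before processing starts (before any installation)
・Automatically checks for and installs the required libraries (requests, openpyxl, bs4)
・Reads URLs and local HTML file paths from "list.txt"
・For HTTP/HTTPS, fetches the headers with HEAD and parses the HTML retrieved with GET
・For local files, reads and parses the first 20 lines
・Extracts the 16 specified meta items and writes them to Excel
・Adds borders and an alternating-row background color (gray) in Excel, and adjusts column widths automatically
・Saves the file as "yyyymmdd_html_header.xlsx"
・Reports completion and opens the Excel file automatically

■ File layout and prerequisites

File name	Purpose
`list.txt`	Lists the URLs and paths to inspect (one per line)
`*.py` (this script)	The script to run

・Save list.txt as UTF-8
・Place the script and list.txt in the same folder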
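
The UTF-8 prerequisite can be made a little more forgiving. The sketch below reads a list file with the `utf-8-sig` codec, which accepts UTF-8 both with and without a byte order mark (some Windows editors add one), and skips blank lines. The function name `read_targets` is a hypothetical helper for this example.

```python
from pathlib import Path


def read_targets(list_path):
    """Return the non-empty, stripped lines of a UTF-8 list file."""
    # "utf-8-sig" drops a leading BOM if present and otherwise behaves like UTF-8.
    text = Path(list_path).read_text(encoding="utf-8-sig")
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Assumes list.txt sits in the current working directory.
    if Path("list.txt").exists():
        print(read_targets("list.txt"))
```

Without this, a BOM would stay attached to the first entry, and a first line such as `https://...` would no longer be recognized as an HTTP URL.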

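# --_header_get.py
import sys
import os
import subprocess
import importlib
import tkinter as tk
from tkinter import messagebox as msgbox
from datetime import datetime
from urllib.parse import urlparse

# --- Confirm before running (asked before anything is installed) ---
root = tk.Tk()
root.withdraw()
if not msgbox.askyesno("Confirm", "Start retrieving web meta information?"):
    print("Cancelled.")
    sys.exit()

# --- Check for and install required libraries ---
# Maps each import name to the package name that pip expects.
required_libraries = {"requests": "requests", "openpyxl": "openpyxl", "bs4": "beautifulsoup4"}

def install_if_missing(packages):
    for module_name, pip_name in packages.items():
        try:
            __import__(module_name)
        except ImportError:
            print(f"{module_name} was not found. Installing...")
            subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name])
            # Make the freshly installed package visible to the import system.
            importlib.invalidate_caches()

install_if_missing(required_libraries)

# Import the third-party libraries only after the check above, so that a
# missing package triggers the installation instead of an ImportError at startup.
from bs4 import BeautifulSoup
import requests
from openpyxl import Workbook
from openpyxl.styles import PatternFill, Border, Side
from openpyxl.utils import get_column_letter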

# --- Load the URL list ---
list_file = "list.txt"
if not os.path.exists(list_file):
    msgbox.showerror("Error", "list.txt was not found.")
    sys.exit(1)

# "utf-8-sig" also accepts UTF-8 files saved with a BOM.
with open(list_file, "r", encoding="utf-8-sig") as f:
    urls = [line.strip() for line in f if line.strip()]

# --- Metadata retrieval function ---
def get_metadata(url):
    # The "拡張子" key means "file extension"; it doubles as the Excel column header.
    result = {
        "URL": url,
        "Last-Modified": "",
        "拡張子": "",
        "title": "",
        "description": "",
        "keywords": "",
        "author": "",
        "robots": "",
        "copyright": "",
        "og:title": "",
        "og:description": "",
        "og:image": "",
        "og:url": "",
        "og:type": "",
        "site_name": "",
        "fb:admins": ""
    }

    # Take the extension from the path part only, ignoring any query string.
    parsed = urlparse(url)
    ext = os.path.splitext(parsed.path)[1]
    result["拡張子"] = ext

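    if parsed.scheme in ["http", "https"]:
        # For HTTP/HTTPS, get Last-Modified with a HEAD request.
        # requests.head() does not follow redirects by default, so enable it
        # to read the header of the same final page that GET retrieves.
        try:
            headers = requests.head(url, timeout=10, allow_redirects=True).headers
            result["Last-Modified"] = headers.get("Last-Modified", "")
        except Exception as e:
            result["Last-Modified"] = f"Retrieval failed: {e}"
        try:
            res = requests.get(url, timeout=10)
            # Pass the raw bytes so BeautifulSoup detects the encoding itself;
            # res.text can come out garbled when the server omits the charset.
            soup = BeautifulSoup(res.content, "html.parser")
        except Exception as e:
            result["title"] = f"Retrieval failed: {e}"
            return result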
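    else:
        # For a local file, read only the first 20 lines.
        if not os.path.exists(url):
            result["title"] = "File does not exist"
            return result
        try:
            with open(url, "r", encoding="utf-8", errors="ignore") as f:
                # zip() stops after 20 lines or at end of file, whichever comes
                # first, so files shorter than 20 lines are read without error.
                lines = [line for _, line in zip(range(20), f)]
            content = "".join(lines)
            soup = BeautifulSoup(content, "html.parser")
        except Exception as e:
            result["title"] = f"Failed to read file: {e}"
            return result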

    def get_meta(name=None, prop=None):
        # Look up a <meta> tag by its name attribute, or by property (Open Graph).
        attrs = {"name": name} if name else {"property": prop}
        tag = soup.find("meta", attrs=attrs)
        return tag["content"] if tag and tag.has_attr("content") else ""

    result["title"] = soup.title.string.strip() if soup.title and soup.title.string else ""

    for name in ["description", "keywords", "author", "robots", "copyright"]:
        result[name] = get_meta(name=name)

    for prop in ["og:title", "og:description", "og:image", "og:url", "og:type", "og:site_name", "fb:admins"]:
        key = prop if prop != "og:site_name" else "site_name"
        result[key] = get_meta(prop=prop)

    return result

# --- Collect results ---
results = []
for url in urls:
    results.append(get_metadata(url))

# --- Excel output ---
wb = Workbook()
ws = wb.active
ws.title = "WebMeta"

headers = [
    "URL", "Last-Modified", "拡張子", "title", "description", "keywords",
    "author", "robots", "copyright", "og:title", "og:description", "og:image",
    "og:url", "og:type", "site_name", "fb:admins"
]
ws.append(headers)

fill_gray = PatternFill(start_color="DDDDDD", end_color="DDDDDD", fill_type="solid")
thin_border = Border(
    left=Side(style="thin"), right=Side(style="thin"),
    top=Side(style="thin"), bottom=Side(style="thin")
)

for row_index, data in enumerate(results, start=2):
    ws.append([data[h] for h in headers])
    for col_index in range(1, len(headers) + 1):
        cell = ws.cell(row=row_index, column=col_index)
        cell.border = thin_border
        if row_index % 2 == 0:
            cell.fill = fill_gray

# Header row styling
for col_index in range(1, len(headers) + 1):
    cell = ws.cell(row=1, column=col_index)
    cell.border = thin_border

# Adjust column widths
for col in ws.columns:
    max_length = 0
    col_letter = get_column_letter(col[0].column)
    for cell in col:
        if cell.value:
            max_length = max(max_length, len(str(cell.value)))
    # Excel allows a column width of at most 255.
    ws.column_dimensions[col_letter].width = min(max_length + 2, 255)

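# --- Save ---
# openpyxl writes a macro-free workbook here, so use the .xlsx extension;
# Excel refuses to open such a file when it is named .xlsm.
filename = f"{datetime.now().strftime('%Y%m%d')}_html_header.xlsx"
wb.save(filename)

# Completion message and opening the file
msgbox.showinfo("Done", f"Processing is complete.\nSaved file: {filename}")

try:
    os.startfile(filename)  # os.startfile is available on Windows only
except Exception as e:
    msgbox.showwarning("Warning", f"Could not open the file: {e}")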

# --- Keep the window open ---
root.deiconify()
root.title("Done")
label = tk.Label(root, text="Processing is complete.\nClose this window to exit.", padx=20, pady=20)
label.pack()
btn = tk.Button(root, text="Exit", command=root.destroy, padx=10, pady=5)
btn.pack()
root.mainloop()
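
The local-file branch of the script only needs the first 20 lines of each file. One way to read "at most N lines" that also copes with shorter files is `itertools.islice`, sketched below. The helper name `read_head` and its `max_lines` parameter are hypothetical and not part of the script above.

```python
from itertools import islice


def read_head(path, max_lines=20):
    """Return up to max_lines lines from the start of a text file as one string."""
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        # islice stops at max_lines or at end of file, whichever comes first.
        return "".join(islice(f, max_lines))
```

Calling `next(f)` a fixed 20 times, by contrast, raises `StopIteration` as soon as the file runs out of lines, so a short HTML file would be reported as a read failure.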

Download 20250614_header_get.py
