Migrating "Tabulator 4.9.3 → 5.0.6", doubling as "a companion to the voice-actor generation chart"

For the record, 5.0.6 is the latest official release available on the CDN at the time of writing.

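Addendum, 2021-11-03: at the very least, do not move to 5.0.6 if you want to use initialHeaderFilter.
It does not work in 5.0.6. As of today it has not been raised on the issue tracker, and I am in no hurry to report it myself for now.
I wrote up the details as a separate post. The code on the page you are reading now is affected by it, so I would appreciate it if you read the two together.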

The 5.0 documentation still seems incomplete (the PDF download example does not work, for instance),
so if that worries you, staying on 4.x may be the better choice. Anyway.

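The backstory: last time I moved from 4.2.7 to 4.9.3. I did try 5.0.6 as well, but it warned me, roughly, "tooltips will stop working one of these days, kid", so I stopped at 4.9.3. Nicely enough, the documentation spells all of this out. Hence: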

ver 4, with minor extensions

  1 # -*- coding: utf-8 -*-
  2 # require: python 3, bs4
  3 """
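  4 Takes a list of Wikipedia voice-actor page names (or similar) as input and generates
  5 an HTML table that makes the rough generational relationships easy to grasp.
  6
  7 Nobody can guarantee that what an artist has chosen to publish and what Wikipedia
  8 reports agree in every respect, nor that correctly published information is true.
  9 Keep that caveat about the source data itself in mind. Also note that this script
 10 ignores Wikipedia's pending markers ("要出典", i.e. "citation needed", and the like)
 11 and other supplementary notes, so the output of this tool alone cannot tell you
 12 whether an item is certain. (When reading Wikipedia itself, it is usually apparent
 13 when a statement may still be incorrect.)
 14
 15 My immediate purpose is a companion to the voice-actor generation chart, so an undisclosed
 16 or unknown date of birth is a problem. I therefore use "活動期間" (years active), which Wikipedia
 17 tries to maintain, as a supplement. Where usable, it allows a guess such as "birth year = career start - 20 or so".
 18
 19 The accuracy of this field naturally varies a great deal, and there is no consistency in
 20 how it is summarized. For example, 田中ちえ美's voice-acting career began no later than 2017
 21 with "サクラクエスト", but the page's author, going by a definition of "start of activity as
 22 a voice-actor artist", wrote "years active in music: 2021-" and did not record the start
 23 of her voice-acting career. Taking this field alone, one could wrongly conclude that her
 24 years active are "2021-". She released character songs for "サクラクエスト" in the first
 25 place, so depending on the definition even "years active in music: 2021-" is wrong. Many
 26 pages also summarize with "2010年代" (the 2010s), unable to pin down a year such as "2011年",
 27 which is also rather hard to use.
 28
 29 So the script also picks up, as supplementary information on years active, the smallest of
 30 the per-year summary headings such as "2012年" found under sections like "テレビアニメ".
 31
 32 Years active can help estimate age, but for a companion to the generation chart the interest
 33 lies in knowing both the date of birth and the years active. Examples: former child actors
 34 such as 黒沢ともよ, 宮本侑芽, 諸星すみれ and 浪川大輔, or conversely actors known as late bloomers.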
 35 """
 36 import io
 37 import os
 38 import sys
 39 import tempfile
 40 import shutil
 41 import ssl
 42 import re
 43 import json
 44 import urllib.request
 45 from urllib.request import urlretrieve as urllib_urlretrieve
 46 from urllib.request import quote as urllib_quote
 47 
 48 
 49 import bs4  # require: beautifulsoup4
 50 
 51 
 52 __MYNAME__, _ = os.path.splitext(
 53     os.path.basename(sys.modules[__name__].__file__))
 54 #
 55 __USER_AGENT__ = "\
 56 Mozilla/5.0 (Windows NT 10.0; Win64; x64) \
 57 AppleWebKit/537.36 (KHTML, like Gecko) \
 58 Chrome/91.0.4472.124 Safari/537.36"
 59 _htctxssl = ssl.create_default_context()
 60 _htctxssl.check_hostname = False
 61 _htctxssl.verify_mode = ssl.CERT_NONE
 62 https_handler = urllib.request.HTTPSHandler(context=_htctxssl)
 63 opener = urllib.request.build_opener(https_handler)
 64 opener.addheaders = [('User-Agent', __USER_AGENT__)]
 65 urllib.request.install_opener(opener)
 66 #
 67 
 68 
 69 _urlretrieved = dict()
 70 
 71 
 72 def _urlretrieve(url):
 73     if url in _urlretrieved:
 74         return _urlretrieved[url]
 75 
 76     def _gettemppath(s):
 77         tmptopdir = os.path.join(tempfile.gettempdir(), __MYNAME__)
 78         if not os.path.exists(tmptopdir):
 79             os.makedirs(tmptopdir)
 80         import hashlib, base64
 81         ep = base64.urlsafe_b64encode(
 82             hashlib.md5(s.encode("utf-8")).digest()
 83         ).partition(b"=")[0].decode()
 84         return os.path.join(tmptopdir, ep)
 85 
 86     cachefn = _gettemppath(url)
 87     if os.path.exists(cachefn):
 88         res = cachefn
 89     else:
 90         res, _ = urllib_urlretrieve(url, filename=cachefn)
 91     _urlretrieved[url] = res
 92     return res
 93 
 94 
 95 def _from_wp(actorpagename):
 96     baseurl = "https://ja.wikipedia.org/wiki/"
 97     pn = urllib_quote(actorpagename, encoding="utf-8")
 98     fn = _urlretrieve(baseurl + pn)
 99 
100     result = {"wikipedia": actorpagename, "名前": [actorpagename.partition("_")[0]]}
101     with io.open(fn, "r", encoding="utf-8") as fi:
102         soup = bs4.BeautifulSoup(fi.read(), features="html.parser")
103         try:
104             categos = [
105                 a.text
106                 for a in soup.find("div", {"id": "mw-normal-catlinks"}).find_all("a")]
107         except Exception:
108             return result
109         try:
110             trecords = iter(
111                 soup.find("table", {"class": "infobox"}).find("tbody").find_all("tr"))
112         except Exception:
113             return result
114         tr = next(trecords)
115         result["名前"] += [
116             re.sub(r"\[[^\[\]]+\]", "", sp.text)
117             for sp in tr.find("th").find_all("span")]
118 
119         actst = float("inf")
120         for tr in trecords:
121             th, td = tr.find("th"), tr.find("td")
122             if not th or not td:
123                 continue
124             k = th.text.replace("\n", "").replace(
125                 "誕生日", "生年月日").replace("生誕", "生年月日")
126             v = ""
127             if td:
128                 v = re.sub(
129                     r"\[[^\[\]]+\]", "", td.text.replace("\n", "")).strip()
130                 if k == "活動期間":
131                     m = re.search(r"(\d+)年\s*\-", v)
132                     if m:
133                         actst = min(float(m.group(1)), actst)
134                     continue
135                 elif k == "生年月日":
136                     m1 = re.search(r"\d+-\d+-\d+", v)
137                     m2 = re.search(r"(\d+)月(\d+)日", v)
138                     if m1:
139                         v = m1.group(0)
140                     elif m2:
141                         v = "0000-{:02d}-{:02d}".format(
142                             *list(map(int, m2.group(1, 2))))
143                     else:
144                         v = ""
145             result[k] = v
146         for dtt in [
147                 dt.text
148                 for dt in soup.find_all("dt") if re.match(r"\d+年", dt.text)]:
149             actst = min(float(re.search(r"(\d+)年", dtt).group(1)), actst)
150         try:
151             result["actst"] = "{:04d}".format(int(actst))
152         except OverflowError:
153             result["actst"] = "0000"
154         if not result.get("性別"):
155             if any([("男性" in c or "男優" in c) for c in categos]):
156                 result["性別"] = "男性"
157             elif any([("女性" in c or "女優" in c) for c in categos]):
158                 result["性別"] = "女性"
159         result["性別"] = result.get("性別", "")[:1]  # no KeyError when gender is unknown
160     return result
161 
162 
163 if __name__ == '__main__':
164     def _yrgrp(ymd, actst):
165         y = 0
166         if ymd:
167             y, _, md = ymd.partition("-")
168             y = int(y)
169             if y and md < "04-02":
170                 y -= 1
171         return list(
172             map(lambda s: s.replace("0000", "????"),
173                 ["{:04d}".format(y), actst, ymd]))
174 
175     actorpages = list(
176         map(lambda s: s.strip(),
177             io.open("wppagenames.txt", encoding="utf-8").read().strip().split("\n")))
178     result = []
179     for a in actorpages:
180         inf = _from_wp(a)
181         g = _yrgrp(inf.get("生年月日", ""), inf.get("actst"))
182         result.append((
183             g[0], g[1], g[2],
184             inf.get("性別", ""),
185             ("<a href='https://ja.wikipedia.org/wiki/{pn}' target=_blank>{pn}</a>".format(
186                 pn=inf["wikipedia"])),
187             ", ".join(inf["名前"]),
188             inf.get("血液型", "-"),
189             inf.get("出生地", "-"),
190             inf.get("出身地", "-"),
191             inf.get("愛称", "-"),
192             inf.get("身長", "-"),
193             inf.get("事務所", "-"),
194         ))
195     result.sort()
196     with io.open("actor_basinf.html", "w", encoding="utf-8") as fo:
197         coln = [
198             "生誕年度", "活動開始年？", "生年月日",
199             "性別", "wikipedia", "名前",
200             "血液型",
201             "出生地", "出身地", "愛称", "身長", "事務所",
202             ]
203         print("""\
204 <html>
205 <head lang="ja">
206 <meta charset="UTF-8">
207 <link href="https://cdnjs.cloudflare.com/ajax/libs/tabulator/5.0.6/css/tabulator_site.min.css" rel="stylesheet">
208 <script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/tabulator/5.0.6/js/tabulator.min.js"></script>
209 
210 <!--
211 {}
212   -->
213 </head>""".format(__doc__), file=fo)
214         print("""\
215 <body>
216 <div id="actor_basinf"></div>
217 <script>
218 function _dt2int(dt) {
219     function _pad(n) {
220         let ns = "" + Math.abs(n);
221         if (ns.length === 1) {
222             ns = "0" + ns;
223         }
224         return ns;
225     }
226     return parseInt(dt.getFullYear() + _pad(dt.getMonth() + 1) + _pad(dt.getDate()));
227 }
228 var nowi = _dt2int(new Date());
229 function _calcage(cell, formatterParams) {
230     var v = cell.getValue().replace(/-/g, "");
231     if (!v || v.startsWith("????")) {
232         return "";
233     }
234     v = parseInt(v);
235     return "" + parseInt((nowi - v) / 10000);
236 }
237 
238 
239 var actor_basinf_data = """, file=fo)
240         json.dump(
241             [dict(zip(coln, row)) for row in result],
242             fo, ensure_ascii=False, indent=4)
243         print("""
244 var table = new Tabulator("#actor_basinf", {
245     "height": "800px",
246     "columnDefaults": {
247         /* Note: not "tooltips"; the option is "tooltip". */
248         "tooltip": true,
249     },
250     "columns": [
251         {
252             "field": "生誕年度",
253             "title": "生誕年度",
254             "headerFilter": "input",
255             "headerFilterFunc": "regex",
256         },
257         {
258             "field": "活動開始年？",
259             "title": "活動開始年？",
260             "headerFilter": "input",
261             "headerFilterFunc": "regex",
262         },
263         {
264             "field": "生年月日",
265             "title": "生年月日",
266         },
267         {
268             "field": "生年月日",
269             "formatter": _calcage,
270             "headerTooltip": "存命の場合は年齢そのもの。亡くなっている方の場合は「生きていれば～歳」。 "
271         },
272         {
273             "field": "性別",
274             "title": "性別",
275             "headerFilter": "input",
276         },
277         {
278             "field": "名前",
279             "title": "名前",
280             "headerFilter": "input",
281             "headerFilterFunc": "regex",
282         },
283         {
284             "field": "wikipedia",
285             "title": "wikipedia",
286             "headerFilter": "input",
287             "headerFilterFunc": "regex",
288             "formatter": "html",
289         },
290         {
291             "field": "血液型",
292             "title": "血液型",
293             "headerFilter": "input",
294             "headerFilterFunc": "regex",
295         },
296         {
297             "field": "出生地",
298             "title": "出生地",
299             "headerFilter": "input",
300             "headerFilterFunc": "regex",
301         },
302         {
303             "field": "出身地",
304             "title": "出身地",
305             "headerFilter": "input",
306             "headerFilterFunc": "regex",
307         },
308         {
309             "field": "愛称",
310             "title": "愛称",
311             "headerFilter": "input",
312             "headerFilterFunc": "regex",
313         },
314         {
315             "field": "身長",
316             "title": "身長",
317             "headerFilter": "input",
318             "headerFilterFunc": "regex",
319         },
320         {
321             "field": "事務所",
322             "title": "事務所",
323             "headerFilter": "input",
324             "headerFilterFunc": "regex",
325         }
326   ], 
327   "layout": "fitColumns",
328   "data": actor_basinf_data
329 });
330 </script>
331 </html>
332 """, file=fo)

In other words, the option structure has been given a hierarchy. I am sure it is a good design change.
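To make the hierarchy change concrete, here is a minimal sketch in Python (the language of the generator script) that rearranges a 4.x-style option dict into the 5.0 shape. The helper name `migrate_options` and the mapping table are my own, hypothetical names. Only the `tooltips` → `columnDefaults.tooltip` pair comes from the listing above; any further entries are an assumption until checked against the official upgrade guide.

```python
import json

# Mapping from a 4.x top-level option name to its column-level name in 5.0.
# Only the "tooltips" -> "tooltip" pair is taken from the listing above;
# extend this table only after checking the official upgrade guide.
MOVED_TO_COLUMN_DEFAULTS = {"tooltips": "tooltip"}


def migrate_options(options_v4):
    """Return a copy of a 4.x option dict rearranged for 5.0 (hypothetical helper)."""
    migrated = {}
    # Start from any columnDefaults that are already present.
    column_defaults = dict(options_v4.get("columnDefaults", {}))
    for key, value in options_v4.items():
        if key == "columnDefaults":
            continue
        if key in MOVED_TO_COLUMN_DEFAULTS:
            # Move the option under columnDefaults with its new name.
            column_defaults[MOVED_TO_COLUMN_DEFAULTS[key]] = value
        else:
            migrated[key] = value
    if column_defaults:
        migrated["columnDefaults"] = column_defaults
    return migrated


old = {"height": "800px", "tooltips": True, "layout": "fitColumns"}
new = migrate_options(old)
print(json.dumps(new, indent=4))
```

The resulting dict can be passed to `json.dump` in the same way the script already writes `actor_basinf_data`.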

With that said:

How hard the migration is will depend on how heavily you use the library's features, but with the migration written up this thoroughly, I do not think you will run into much trouble.

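By the way, and entirely unrelated to the Tabulator version migration, I also wanted to show "age at death". If this is done with a formatter that computes it from the date of death, rather than by storing a precomputed value as redundant data, the computation has to reference two fields.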
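The arithmetic behind such a formatter can be sketched in Python as follows. It uses the same trick as `_calcage` in the listings: treat each date as the integer YYYYMMDD, subtract, and divide by 10000 to get completed years. The function names and the `today` parameter are hypothetical, introduced only for illustration. The field names "生年月日" and "没年月日" and the "歳" / "歳没" suffixes follow the script.

```python
import datetime


def date_to_int(ymd):
    """Turn "YYYY-MM-DD" into the integer YYYYMMDD."""
    return int(ymd.replace("-", ""))


def age_between(birth_ymd, ref_ymd):
    """Completed years from birth_ymd to ref_ymd (both "YYYY-MM-DD")."""
    return (date_to_int(ref_ymd) - date_to_int(birth_ymd)) // 10000


def format_age(row, today=None):
    """Build the cell text from two fields of one row (hypothetical helper)."""
    birth = row.get("生年月日", "")
    if not birth or birth.startswith("????"):
        return ""  # year of birth unknown: nothing to compute
    today = today or datetime.date.today().isoformat()
    text = "{}歳".format(age_between(birth, today))
    death = row.get("没年月日", "")
    if death:
        # The second field is consulted only when a date of death exists.
        text += " ({}歳没)".format(age_between(birth, death))
    return text


row = {"生年月日": "1950-06-15", "没年月日": "2000-06-14"}
print(format_age(row, today="2021-11-03"))
```

The integer subtraction handles the "has the birthday passed yet" question without any calendar logic, which is why it suits a small in-browser formatter.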

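As I recall, when I was playing with Tabulator a few years ago it took a lot more effort to find the features needed for this sort of thing. If that memory is right, the documentation has become considerably easier to use. I do not think a consolidated section like "Component Objects" existed before. The lack of fine-grained permalinks is still a complaint, but with everything this well organized I feel generous enough to hold back on grumbling.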

The only formatter change needed for what I want is in the "function _calcage" part:

ver 5, was it?

  1 # -*- coding: utf-8 -*-
  2 # require: python 3, bs4
  3 """
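  4 Takes a list of Wikipedia voice-actor page names (or similar) as input and generates
  5 an HTML table that makes the rough generational relationships easy to grasp.
  6
  7 Nobody can guarantee that what an artist has chosen to publish and what Wikipedia
  8 reports agree in every respect, nor that correctly published information is true.
  9 Keep that caveat about the source data itself in mind. Also note that this script
 10 ignores Wikipedia's pending markers ("要出典", i.e. "citation needed", and the like)
 11 and other supplementary notes, so the output of this tool alone cannot tell you
 12 whether an item is certain. (When reading Wikipedia itself, it is usually apparent
 13 when a statement may still be incorrect.)
 14
 15 My immediate purpose is a companion to the voice-actor generation chart, so an undisclosed
 16 or unknown date of birth is a problem. I therefore use "活動期間" (years active), which Wikipedia
 17 tries to maintain, as a supplement. Where usable, it allows a guess such as "birth year = career start - 20 or so".
 18
 19 The accuracy of this field naturally varies a great deal, and there is no consistency in
 20 how it is summarized. For example, 田中ちえ美's voice-acting career began no later than 2017
 21 with "サクラクエスト", but the page's author, going by a definition of "start of activity as
 22 a voice-actor artist", wrote "years active in music: 2021-" and did not record the start
 23 of her voice-acting career. Taking this field alone, one could wrongly conclude that her
 24 years active are "2021-". She released character songs for "サクラクエスト" in the first
 25 place, so depending on the definition even "years active in music: 2021-" is wrong. Many
 26 pages also summarize with "2010年代" (the 2010s), unable to pin down a year such as "2011年",
 27 which is also rather hard to use.
 28
 29 So the script also picks up, as supplementary information on years active, the smallest of
 30 the per-year summary headings such as "2012年" found under sections like "テレビアニメ".
 31
 32 Years active can help estimate age, but for a companion to the generation chart the interest
 33 lies in knowing both the date of birth and the years active. Examples: former child actors
 34 such as 黒沢ともよ, 宮本侑芽, 諸星すみれ and 浪川大輔, or conversely actors known as late bloomers.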
 35 """
 36 import io
 37 import os
 38 import sys
 39 import tempfile
 40 import shutil
 41 import ssl
 42 import re
 43 import json
 44 import urllib.request
 45 from urllib.request import urlretrieve as urllib_urlretrieve
 46 from urllib.request import quote as urllib_quote
 47 
 48 
 49 import bs4  # require: beautifulsoup4
 50 
 51 
 52 __MYNAME__, _ = os.path.splitext(
 53     os.path.basename(sys.modules[__name__].__file__))
 54 #
 55 __USER_AGENT__ = "\
 56 Mozilla/5.0 (Windows NT 10.0; Win64; x64) \
 57 AppleWebKit/537.36 (KHTML, like Gecko) \
 58 Chrome/91.0.4472.124 Safari/537.36"
 59 _htctxssl = ssl.create_default_context()
 60 _htctxssl.check_hostname = False
 61 _htctxssl.verify_mode = ssl.CERT_NONE
 62 https_handler = urllib.request.HTTPSHandler(context=_htctxssl)
 63 opener = urllib.request.build_opener(https_handler)
 64 opener.addheaders = [('User-Agent', __USER_AGENT__)]
 65 urllib.request.install_opener(opener)
 66 #
 67 
 68 
 69 _urlretrieved = dict()
 70 
 71 
 72 def _urlretrieve(url):
 73     if url in _urlretrieved:
 74         return _urlretrieved[url]
 75 
 76     def _gettemppath(s):
 77         tmptopdir = os.path.join(tempfile.gettempdir(), __MYNAME__)
 78         if not os.path.exists(tmptopdir):
 79             os.makedirs(tmptopdir)
 80         import hashlib, base64
 81         ep = base64.urlsafe_b64encode(
 82             hashlib.md5(s.encode("utf-8")).digest()
 83         ).partition(b"=")[0].decode()
 84         return os.path.join(tmptopdir, ep)
 85 
 86     cachefn = _gettemppath(url)
 87     if os.path.exists(cachefn):
 88         res = cachefn
 89     else:
 90         res, _ = urllib_urlretrieve(url, filename=cachefn)
 91     _urlretrieved[url] = res
 92     return res
 93 
 94 
 95 def _from_wp(actorpagename):
 96     baseurl = "https://ja.wikipedia.org/wiki/"
 97     pn = urllib_quote(actorpagename, encoding="utf-8")
 98     fn = _urlretrieve(baseurl + pn)
 99 
100     result = {"wikipedia": actorpagename, "名前": [actorpagename.partition("_")[0]]}
101     with io.open(fn, "r", encoding="utf-8") as fi:
102         soup = bs4.BeautifulSoup(fi.read(), features="html.parser")
103         try:
104             categos = [
105                 a.text
106                 for a in soup.find("div", {"id": "mw-normal-catlinks"}).find_all("a")]
107         except Exception:
108             return result
109         try:
110             trecords = iter(
111                 soup.find("table", {"class": "infobox"}).find("tbody").find_all("tr"))
112         except Exception:
113             return result
114         tr = next(trecords)
115         result["名前"] += [
116             re.sub(r"\[[^\[\]]+\]", "", sp.text)
117             for sp in tr.find("th").find_all("span")]
118 
119         actst = float("inf")
120         for tr in trecords:
121             th, td = tr.find("th"), tr.find("td")
122             if not th or not td:
123                 continue
124             k = th.text.replace("\n", "").replace(
125                 "誕生日", "生年月日").replace("生誕", "生年月日")
126             v = ""
127             if td:
128                 v = re.sub(
129                     r"\[[^\[\]]+\]", "", td.text.replace("\n", "")).strip()
130                 if k == "活動期間":
131                     m = re.search(r"(\d+)年\s*\-", v)
132                     if m:
133                         actst = min(float(m.group(1)), actst)
134                     continue
135                 elif k in ("生年月日", "没年月日"):
136                     m1 = re.search(r"\d+-\d+-\d+", v)
137                     m2 = re.search(r"(\d+)月(\d+)日", v)
138                     if m1:
139                         v = m1.group(0)
140                     elif m2:
141                         v = "0000-{:02d}-{:02d}".format(
142                             *list(map(int, m2.group(1, 2))))
143                     else:
144                         v = ""
145             result[k] = v
146         for dtt in [
147                 dt.text
148                 for dt in soup.find_all("dt") if re.match(r"\d+年", dt.text)]:
149             actst = min(float(re.search(r"(\d+)年", dtt).group(1)), actst)
150         try:
151             result["actst"] = "{:04d}".format(int(actst))
152         except OverflowError:
153             result["actst"] = "0000"
154         if not result.get("性別"):
155             if any([("男性" in c or "男優" in c) for c in categos]):
156                 result["性別"] = "男性"
157             elif any([("女性" in c or "女優" in c) for c in categos]):
158                 result["性別"] = "女性"
159         result["性別"] = result.get("性別", "")[:1]  # no KeyError when gender is unknown
160     return result
161 
162 
163 if __name__ == '__main__':
164     def _yrgrp(ymd, actst):
165         y = 0
166         if ymd:
167             y, _, md = ymd.partition("-")
168             y = int(y)
169             if y and md < "04-02":
170                 y -= 1
171         return list(
172             map(lambda s: s.replace("0000", "????"),
173                 ["{:04d}".format(y), actst, ymd]))
174 
175     actorpages = list(
176         map(lambda s: s.strip(),
177             io.open("wppagenames.txt", encoding="utf-8").read().strip().split("\n")))
178     result = []
179     for a in actorpages:
180         inf = _from_wp(a)
181         g = _yrgrp(inf.get("生年月日", ""), inf.get("actst"))
182         result.append((
183             g[0], g[1], g[2],
184             inf.get("没年月日", ""),
185             inf.get("性別", ""),
186             ("<a href='https://ja.wikipedia.org/wiki/{pn}' target=_blank>{pn}</a>".format(
187                 pn=inf["wikipedia"])),
188             ", ".join(inf["名前"]),
189             inf.get("血液型", "-"),
190             inf.get("出生地", "-"),
191             inf.get("出身地", "-"),
192             inf.get("愛称", "-"),
193             inf.get("身長", "-"),
194             inf.get("事務所", "-"),
195         ))
196     result.sort()
197     with io.open("actor_basinf.html", "w", encoding="utf-8") as fo:
198         coln = [
199             "生誕年度", "活動開始年？", "生年月日", "没年月日",
200             "性別", "wikipedia", "名前",
201             "血液型",
202             "出生地", "出身地", "愛称", "身長", "事務所",
203             ]
204         print("""\
205 <html>
206 <head lang="ja">
207 <meta charset="UTF-8">
208 <link href="https://cdnjs.cloudflare.com/ajax/libs/tabulator/5.0.6/css/tabulator_site.min.css" rel="stylesheet">
209 <script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/tabulator/5.0.6/js/tabulator.min.js"></script>
210 
211 <!--
212 {}
213   -->
214 </head>""".format(__doc__), file=fo)
215         print("""\
216 <body>
217 <div id="actor_basinf"></div>
218 <script>
219 function _dt2int(dt) {
220     function _pad(n) {
221         let ns = "" + Math.abs(n);
222         if (ns.length === 1) {
223             ns = "0" + ns;
224         }
225         return ns;
226     }
227     return parseInt(dt.getFullYear() + _pad(dt.getMonth() + 1) + _pad(dt.getDate()));
228 }
229 var nowi = _dt2int(new Date());
230 function _calcage(cell, formatterParams) {
231     var v = cell.getValue().replace(/-/g, "");
232     if (!v || v.startsWith("????")) {
233         return "";
234     }
235     v = parseInt(v);
236     var result = "" + parseInt((nowi - v) / 10000);
237     result += "歳";
238     var d = cell.getRow().getCell("没年月日").getValue().replace(/-/g, "");
239     if (d) {
240         result += " (";
241         result += parseInt((parseInt(d) - v) / 10000);
242         result += "歳没)";
243     }
244     return result;
245 }
246 
247 
248 var actor_basinf_data = """, file=fo)
249         json.dump(
250             [dict(zip(coln, row)) for row in result],
251             fo, ensure_ascii=False, indent=4)
252         print("""
253 var table = new Tabulator("#actor_basinf", {
254     "height": "800px",
255     "columnDefaults": {
256         /* Note: not "tooltips"; the option is "tooltip". */
257         "tooltip": true,
258     },
259     "columns": [
260         {
261             "field": "生誕年度",
262             "title": "生誕年度",
263             "headerFilter": "input",
264             "headerFilterFunc": "regex",
265         },
266         {
267             "field": "活動開始年？",
268             "title": "活動開始年？",
269             "headerFilter": "input",
270             "headerFilterFunc": "regex",
271         },
272         {
273             "field": "生年月日",
274             "title": "生年月日",
275         },
276         {
277             "field": "生年月日",
278             "formatter": _calcage,
279             "headerTooltip": "存命の場合は年齢そのもの。亡くなっている方の場合は「生きていれば～歳」。 "
280         },
281         {
282             "field": "没年月日",
283             "title": "没年月日",
284         },
285         {
286             "field": "性別",
287             "title": "性別",
288             "headerFilter": "input",
289         },
290         {
291             "field": "名前",
292             "title": "名前",
293             "headerFilter": "input",
294             "headerFilterFunc": "regex",
295         },
296         {
297             "field": "wikipedia",
298             "title": "wikipedia",
299             "headerFilter": "input",
300             "headerFilterFunc": "regex",
301             "formatter": "html",
302         },
303         {
304             "field": "血液型",
305             "title": "血液型",
306             "headerFilter": "input",
307             "headerFilterFunc": "regex",
308         },
309         {
310             "field": "出生地",
311             "title": "出生地",
312             "headerFilter": "input",
313             "headerFilterFunc": "regex",
314         },
315         {
316             "field": "出身地",
317             "title": "出身地",
318             "headerFilter": "input",
319             "headerFilterFunc": "regex",
320         },
321         {
322             "field": "愛称",
323             "title": "愛称",
324             "headerFilter": "input",
325             "headerFilterFunc": "regex",
326         },
327         {
328             "field": "身長",
329             "title": "身長",
330             "headerFilter": "input",
331             "headerFilterFunc": "regex",
332         },
333         {
334             "field": "事務所",
335             "title": "事務所",
336             "headerFilter": "input",
337             "headerFilterFunc": "regex",
338         }
339   ], 
340   "layout": "fitColumns",
341   "data": actor_basinf_data
342 });
343 </script>
344 </html>
345 """, file=fo)

And so:

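While making small improvements here and there, I noticed another feature along the way, so I will deal with that too. I have not checked when it became available, but I noticed "headerSortTristate:true". I suspect quite a few people care about this in day-to-day use. If clicking a header means requesting an ascending or descending sort, you need a way to return to the "unspecified" state, but without "headerSortTristate" it becomes "once you have clicked, there is no going back...".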
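As a rough sketch of how this could be wired into the generator, the Python below builds column definitions that carry `headerSortTristate`. The helper `make_columns` and its `tristate` parameter are hypothetical names of mine. Setting the option on every column is also my assumption, not necessarily how the script in this post does it.

```python
import json


def make_columns(fields, tristate=True):
    """Build Tabulator column definitions for the given field names (hypothetical helper)."""
    columns = []
    for field in fields:
        col = {"field": field, "title": field}
        if tristate:
            # With this set, header clicks cycle through ascending,
            # descending and "no sort" rather than only the first two.
            col["headerSortTristate"] = True
        columns.append(col)
    return columns


columns = make_columns(["生誕年度", "名前"])
print(json.dumps(columns, ensure_ascii=False, indent=4))
```

Emitting the definitions through `json.dumps` keeps the Python side as plain data and avoids hand-editing the long JavaScript literal for every column.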

So:

ver 6, with assorted small fixes, was it?

  1 # -*- coding: utf-8 -*-
  2 # require: python 3, bs4
  3 """
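  4 Takes a list of Wikipedia voice-actor page names (or similar) as input and generates
  5 an HTML table that makes the rough generational relationships easy to grasp.
  6
  7 Nobody can guarantee that what an artist has chosen to publish and what Wikipedia
  8 reports agree in every respect, nor that correctly published information is true.
  9 Keep that caveat about the source data itself in mind. Also note that this script
 10 ignores Wikipedia's pending markers ("要出典", i.e. "citation needed", and the like)
 11 and other supplementary notes, so the output of this tool alone cannot tell you
 12 whether an item is certain. (When reading Wikipedia itself, it is usually apparent
 13 when a statement may still be incorrect.)
 14
 15 My immediate purpose is a companion to the voice-actor generation chart, so an undisclosed
 16 or unknown date of birth is a problem. I therefore use "活動期間" (years active), which Wikipedia
 17 tries to maintain, as a supplement. Where usable, it allows a guess such as "birth year = career start - 20 or so".
 18
 19 The accuracy of this field naturally varies a great deal, and there is no consistency in
 20 how it is summarized. For example, 田中ちえ美's voice-acting career began no later than 2017
 21 with "サクラクエスト", but the page's author, going by a definition of "start of activity as
 22 a voice-actor artist", wrote "years active in music: 2021-" and did not record the start
 23 of her voice-acting career. Taking this field alone, one could wrongly conclude that her
 24 years active are "2021-". She released character songs for "サクラクエスト" in the first
 25 place, so depending on the definition even "years active in music: 2021-" is wrong. Many
 26 pages also summarize with "2010年代" (the 2010s), unable to pin down a year such as "2011年",
 27 which is also rather hard to use.
 28
 29 So the script also picks up, as supplementary information on years active, the smallest of
 30 the per-year summary headings such as "2012年" found under sections like "テレビアニメ".
 31
 32 Years active can help estimate age, but for a companion to the generation chart the interest
 33 lies in knowing both the date of birth and the years active. Examples: former child actors
 34 such as 黒沢ともよ, 宮本侑芽, 諸星すみれ and 浪川大輔, or conversely actors known as late bloomers.
 35
 36 Note also that the English edition sometimes records what the Japanese edition lists as
 37 unknown, and this script picks that up too. Be aware that this may stray further from the
 38 artist's own disclosure policy (for example, the Japanese edition withholds known information
 39 out of respect for the person's wishes, while the English edition has not followed suit).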
 40 """
 41 import io
 42 import os
 43 import sys
 44 import tempfile
 45 import shutil
 46 import ssl
 47 import re
 48 import json
 49 import urllib
 50 import urllib.request
 51 from urllib.request import urlretrieve as urllib_urlretrieve
 52 from urllib.request import quote as urllib_quote
 53 
 54 
 55 import bs4  # require: beautifulsoup4
 56 
 57 
 58 __MYNAME__, _ = os.path.splitext(
 59     os.path.basename(sys.modules[__name__].__file__))
 60 #
 61 __USER_AGENT__ = "\
 62 Mozilla/5.0 (Windows NT 10.0; Win64; x64) \
 63 AppleWebKit/537.36 (KHTML, like Gecko) \
 64 Chrome/91.0.4472.124 Safari/537.36"
 65 _htctxssl = ssl.create_default_context()
 66 _htctxssl.check_hostname = False
 67 _htctxssl.verify_mode = ssl.CERT_NONE
 68 https_handler = urllib.request.HTTPSHandler(context=_htctxssl)
 69 opener = urllib.request.build_opener(https_handler)
 70 opener.addheaders = [('User-Agent', __USER_AGENT__)]
 71 urllib.request.install_opener(opener)
 72 #
 73 
 74 
 75 _urlretrieved = dict()
 76 
 77 
 78 def _urlretrieve(url):
 79     if url in _urlretrieved:
 80         return _urlretrieved[url]
 81 
 82     def _gettemppath(s):
 83         tmptopdir = os.path.join(tempfile.gettempdir(), __MYNAME__)
 84         if not os.path.exists(tmptopdir):
 85             os.makedirs(tmptopdir)
 86         import hashlib, base64
 87         ep = base64.urlsafe_b64encode(
 88             hashlib.md5(s.encode("utf-8")).digest()
 89         ).partition(b"=")[0].decode()
 90         return os.path.join(tmptopdir, ep)
 91 
 92     cachefn = _gettemppath(url)
 93     if os.path.exists(cachefn):
 94         res = cachefn
 95     else:
 96         try:
 97             res, _ = urllib_urlretrieve(url, filename=cachefn)
 98         except Exception:
 99             from urllib.request import unquote as urllib_unquote
100             print(url, repr(urllib_unquote(url)))
101             raise
102     _urlretrieved[url] = res
103     return res
104 
105 
106 def _from_wp_jp(actorpagename):
107     baseurl = "https://ja.wikipedia.org/wiki/"
108     pn = urllib_quote(actorpagename, encoding="utf-8")
109     fn = _urlretrieve(baseurl + pn)
110 
111     result = {"wikipedia": actorpagename, "名前": [actorpagename.partition("_")[0]]}
112     with io.open(fn, "r", encoding="utf-8") as fi:
113         soup = bs4.BeautifulSoup(fi.read(), features="html.parser")
114         entr = soup.find("a", {"class": "interlanguage-link-target", "lang": "en"})
115         if entr:
116             result["wikipedia_en"] = entr.attrs["href"]
117         try:
118             categos = [
119                 a.text
120                 for a in soup.find("div", {"id": "mw-normal-catlinks"}).find_all("a")]
121         except Exception:
122             return result
123         try:
124             trecords = iter(
125                 soup.find("table", {"class": "infobox"}).find("tbody").find_all("tr"))
126         except Exception:
127             return result
128         tr = next(trecords)
129         result["名前"] += [
130             re.sub(r"\[[^\[\]]+\]", "", sp.text)
131             for sp in tr.find("th").find_all("span")]
132 
133         actst = float("inf")
134         actst_rough = None
135         for tr in trecords:
136             th, td = tr.find("th"), tr.find("td")
137             if not th or not td:
138                 continue
139             k = th.text.replace("\n", "").replace(
140                 "誕生日", "生年月日").replace("生誕", "生年月日")
141             k = re.sub("身長.*$", "身長/体重", k)
142             if k in ("事務所", "レーベル"):
143                 k = "事務所・レーベル"
144             v = ""
145             if td:
146                 if td.find("li"):
147                     td = "\n".join([li.text for li in td.find_all("li")])
148                 elif td.find("br"):
149                     td = bs4.BeautifulSoup(
150                         re.sub(r"<br\s*/?>", r"\n", str(td)), features="html.parser").text.strip()
151                 else:
152                     td = td.text
153                 v = re.sub(
154                     r"\[[^\[\]]+\]", "\n", td).strip()
155                 if k == "活動期間":
156                     m = re.match(r"(\d+)年(代)?", v)
157                     if m:
158                         if not m.group(2):
159                             actst = min(float(m.group(1)), actst)
160                         else:
161                             actst_rough = m.group(0)
162                     continue
163                 elif k in ("生年月日", "没年月日"):
164                     m1 = re.search(r"\d+-\d+-\d+", v)
165                     m2 = re.search(r"(\d+)月(\d+)日", v)
166                     if m1:
167                         v = m1.group(0)
168                     elif m2:
169                         v = "0000-{:02d}-{:02d}".format(
170                             *list(map(int, m2.group(1, 2))))
171                     else:
172                         v = ""
173             v = re.sub(
174                 r"(、\s*)+", "、",
175                 "、".join(re.sub(r"\n+", r"\n", v).split("\n")))
176             if k in ("デビュー作", "事務所・レーベル", "共同作業者", "ジャンル",):
177                 v = "、".join(list(filter(None, [result.get(k), v])))
178             if k == "身長/体重":
179                 v = re.sub(r"\s*、\s*cm", " cm", v)  # why is this needed??
180             result[k] = v
181         for stt, sta in (("dt", {}), ("b", {})):
182             for dtt in [
183                     dt.text
184                     for dt in soup.find_all(stt, sta) if re.match(r"\d+年$", dt.text)]:
185                 actst = min(float(re.search(r"(\d+)年", dtt).group(1)), actst)
186         try:
187             result["actst"] = "{:04d}".format(int(actst))
188         except OverflowError:
189             if actst_rough:
190                 result["actst"] = actst_rough
191             else:
192                 result["actst"] = "0000"
193         if not result.get("生年月日"):
194             result["生年月日"] = "0000-??-??"
195         if not result.get("性別"):
196             if any([("男性" in c or "男優" in c) for c in categos]):
197                 result["性別"] = "男性"
198             elif any([("女性" in c or "女優" in c) for c in categos]):
199                 result["性別"] = "女性"
200             else:
201                 #print(result["名前"])
202                 result["性別"] = "　"
203         result["性別"] = result["性別"][0]
204     return result
205 
206 
207 def _from_wp_en(result):
208     # Trust the Japanese edition by default; rely on the English edition only for missing items.
209     if "0000" not in result["生年月日"]:
210         return result
211     try:
212         fn = _urlretrieve(result["wikipedia_en"])
213     except urllib.error.HTTPError:
214         return result
215     with io.open(fn, "r", encoding="utf-8") as fi:
216         soup = bs4.BeautifulSoup(fi.read(), features="html.parser")
217         try:
218             trecords = iter(
219                 soup.find("table", {"class": "infobox"}).find("tbody").find_all("tr"))
220         except Exception:
221             return result
222         for tr in trecords:
223             th, td = tr.find("th"), tr.find("td")
224             if not th or not td:
225                 continue
226             k_en = th.text.strip()
227             if k_en == "Born":
228                 if "0000" in result["生年月日"]:
229                     bd = td.find("span", {"class": "bday"})
230                     if bd:
231                         bd = re.sub(r"\[[^\[\]]+\]", "", bd.text)
232                         result["生年月日"] = bd
233     return result
234 
235 
236 def _from_wp(actorpagename):
237     result = _from_wp_jp(actorpagename)
238     if "wikipedia_en" in result:
239         _from_wp_en(result)
240     return result
241 
242 
243 if __name__ == '__main__':
244     def _yrgrp(ymd, actst):
245         y = 0
246         if ymd:
247             y, _, md = ymd.partition("-")
248             y = int(y)
249             if y and md < "04-02":
250                 y -= 1
251         return list(
252             map(lambda s: s.replace("0000", "????"),
253                 ["{:04d}".format(y), actst, ymd]))
254 
255     actorpages = list(set(
256         map(lambda s: s.strip(),
257             io.open("wppagenames.txt", encoding="utf-8").read().strip().split("\n"))))
258     result = []
259     for a in filter(None, actorpages):
260         inf = _from_wp(a)
261         g = _yrgrp(inf.get("生年月日", ""), inf.get("actst", "0000"))
262         result.append((
263             g[0], g[1], g[2],
264             inf.get("没年月日", ""),
265             inf.get("性別", ""),
266             inf["wikipedia"],
267             ", ".join(inf["名前"]),
268             inf.get("血液型", "-"),
269             "、".join(list(filter(None, [inf.get("出生地", ""), inf.get("出身地", "")]))),
270             inf.get("愛称", "-"),
271             inf.get("身長/体重", "-"),
272             inf.get("事務所・レーベル", "-"),
273             inf.get("デビュー作", ""),
274             inf.get("共同作業者", ""),
275         ))
276     result.sort()
277     with io.open("actor_basinf.html", "w", encoding="utf-8") as fo:
278         coln = [
279             "by",  # school year of birth
280             "as",  # first active year (a guess)
281             "bymd",  # date of birth
282             "dymd",  # date of death
283             "gen",  # gender
284             "wp",  # wikipedia
285             "nm",  # name
286             "bld",  # blood type
287             "bor",  # birthplace / hometown
288             "nn",  # nickname
289             "tw",  # height / weight
290             "bel",  # agency / label
291             "fst",  # debut work
292             "tea",  # collaborators
293             ]
294         print("""\
295 <html lang="ja">
296 <head>
297 <meta charset="UTF-8">
298 <link href="https://cdnjs.cloudflare.com/ajax/libs/tabulator/5.0.6/css/tabulator_site.min.css" rel="stylesheet">
299 <script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/tabulator/5.0.6/js/tabulator.min.js"></script>
300 
301 <!--
302 {}
303   -->
304 </head>""".format(__doc__), file=fo)
305         print("""\
306 <body>
307 <div style="margin: 1em 0em">
308 <select id="filter-gender">
309   <option value=""></option>
310   <option value="男">男性のみ</option>
311   <option value="女">女性のみ</option>
312 </select>
313 <select id="filter-alive">
314   <option></option>
315   <option>存命者のみ</option>
316   <option>死没者のみ</option>
317 </select>
318 </div>
319 <div id="actor_basinf"></div>
320 <script>
321 var filter_gender = document.getElementById("filter-gender");
322 function _update_genderfilter() {
323     let i = filter_gender.selectedIndex;
324     let v = filter_gender.options[i].value;
325     let eflts = table.getFilters(true);
326     for (let efi in eflts) {
327         let f = eflts[efi];
328         if (f["field"] == "gen") {
329             table.removeFilter(f["field"], f["type"], f["value"]);
330             break;
331         }
332     }
333     if (v) {
334         table.addFilter("gen", "regex", v);
335     }
336 }
337 filter_gender.addEventListener("change", _update_genderfilter);
338 var filter_alive = document.getElementById("filter-alive");
339 function _update_alivefilter() {
340     let i = filter_alive.selectedIndex;
341     let eflts = table.getFilters(true);
342     for (let efi in eflts) {
343         let f = eflts[efi];
344         if (f["field"] == "dymd") {
345             table.removeFilter(f["field"], f["type"], f["value"]);
346             break;
347         }
348     }
349     if (i == 1) {
350         table.addFilter("dymd", "=", "");
351     } else if (i == 2) {
352         table.addFilter("dymd", "!=", "");
353     }
354 }
355 filter_alive.addEventListener("change", _update_alivefilter);
356 
357 function _dt2int(dt) {
358     function _pad(n) {
359         let ns = "" + Math.abs(n);
360         if (ns.length === 1) {
361             ns = "0" + ns;
362         }
363         return ns;
364     }
365     return parseInt(dt.getFullYear() + _pad(dt.getMonth() + 1) + _pad(dt.getDate()));
366 }
367 var nowi = _dt2int(new Date());
368 function _calcage(cell, formatterParams) {
369     var v = cell.getValue().replace(/-/g, "");
370     if (v.startsWith("????")) {
371         return "";
372     }
373     v = parseInt(v, 10);
374     var result = "" + Math.floor((nowi - v) / 10000);
375     result += "歳";
376     var d = cell.getRow().getCell("dymd").getValue().replace(/-/g, "");
377     if (d) {
378         result += " (";
379         result += Math.floor((parseInt(d, 10) - v) / 10000);
380         result += "歳没)";
381     }
382     return result;
383 }
384 
385 
386 var actor_basinf_data = """, file=fo)
387         json.dump(
388             [dict(zip(coln, row)) for row in result],
389             fo, ensure_ascii=False, indent=4)
390         print("""
391 var table = new Tabulator("#actor_basinf", {
392     "height": "800px",
393     "columnDefaults": {
394         /* Note: this is "tooltip", not "tooltips". */
395         "tooltip": true,
396 
397         /* Not sure since when this has been usable here. It used to live outside "columnDefaults". */
398         "headerSortTristate": true,
399     },
400     "columns": [
401         {
402             "field": "by",
403             "title": "生誕年度",
404             "headerFilter": "input",
405             "headerFilterFunc": "regex",
406         },
407         {
408             "field": "as",
409             "title": "活動開始年？",
410             "headerFilter": "input",
411             "headerFilterFunc": "regex",
412         },
413         {
414             "field": "bymd",
415             "title": "生年月日",
416         },
417         {
418             "field": "bymd",
419             "formatter": _calcage,
420             "headerTooltip": "存命の場合は年齢そのもの。亡くなっている方の場合は「生きていれば～歳」。 "
421         },
422         {
423             "field": "dymd",
424             "title": "没年月日",
425         },
426         {
427             "field": "gen",
428             "title": "性別",
429         },
430         {
431             "field": "nm",
432             "title": "名前",
433             "headerFilter": "input",
434             "headerFilterFunc": "regex",
435         },
436         {
437             "field": "wp",
438             "title": "wikipedia",
439             "headerFilter": "input",
440             "headerFilterFunc": "regex",
441             "formatter": function (cell, formatterParams, onRendered) {
442                 let pn = cell.getValue();
443                 return "<a href='https://ja.wikipedia.org/wiki/" +
444                     pn + "' target=_blank>" + pn + "</a>";
445             },
446         },
447         {
448             "field": "bld",
449             "title": "血液型",
450             "headerFilter": "input",
451             "headerFilterFunc": "regex",
452         },
453         {
454             "field": "bor",
455             "title": "出生地・出身地",
456             "headerFilter": "input",
457             "headerFilterFunc": "regex",
458         },
459         {
460             "field": "nn",
461             "title": "愛称",
462             "headerFilter": "input",
463             "headerFilterFunc": "regex",
464         },
465         {
466             "field": "tw",
467             "title": "身長/体重",
468             "headerFilter": "input",
469             "headerFilterFunc": "regex",
470         },
471         {
472             "field": "bel",
473             "title": "事務所・レーベル",
474             "headerFilter": "input",
475             "headerFilterFunc": "regex",
476         },
477         {
478             "field": "fst",
479             "title": "デビュー作",
480             "headerFilter": "input",
481             "headerFilterFunc": "regex",
482         },
483         {
484             "field": "tea",
485             "title": "共同作業者",
486             "headerFilter": "input",
487             "headerFilterFunc": "regex",
488         },
489     ],
490     "layout": "fitColumns",
491     "data": actor_basinf_data
492 });
493 </script>
494 </body></html>
495 """, file=fo)
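Two date rules in the listing above are easy to miss, so here is a minimal sketch of them in isolation. `_yrgrp` files anyone born before April 2 under the previous year (the Japanese school-year cutoff), and `_calcage` gets an age by subtracting dates written as YYYYMMDD integers and dropping the last four digits. The function names `school_year` and `age_on` are made up for this illustration and do not appear in the script; `age_on` is a Python restatement of what the JavaScript side does, not a copy of it.

```python
def school_year(ymd):
    """Return the school year for a 'YYYY-MM-DD' birth date.

    Hypothetical helper mirroring the `md < "04-02"` test in _yrgrp:
    births from January 1 through April 1 count toward the previous year.
    A year of 0 (unknown) is left untouched.
    """
    y, _, md = ymd.partition("-")
    y = int(y)
    if y and md < "04-02":
        y -= 1
    return y


def age_on(birth_ymd, today_ymd):
    """Full years between two 'YYYY-MM-DD' dates.

    Hypothetical helper: both dates become YYYYMMDD integers, and the
    difference floor-divided by 10000 is the age, because the month-day
    part only borrows from the year part when the birthday has not yet
    come around. Assumes today_ymd is not earlier than birth_ymd.
    """
    b = int(birth_ymd.replace("-", ""))
    t = int(today_ymd.replace("-", ""))
    return (t - b) // 10000


print(school_year("1996-04-01"))  # 1995
print(school_year("1996-04-02"))  # 1996
print(age_on("1996-04-02", "2021-04-01"))  # 24
print(age_on("1996-04-02", "2021-04-02"))  # 25
```

The string comparison in `school_year` works only because month and day are zero-padded, which is what the scraper side produces.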