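When PHP's html_entity_decode alone just won't cut it

It's not like I'm doing PHP because I enjoy it.

For one reason or another, typically "because a visual editor such as TinyMCE converted it", suppose you have text that has already been encoded into HTML entities: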

<pre>
#include &lt;iostream&gt;
static const char* s = &quot;abcde&quot;;

int main()
{
&nbsp;&nbsp;&nbsp;&nbsp;if (s[1] == &#39;b&#39;) {
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;std::cout &lt;&lt; &quot;It makes me huge frustration.&quot; &lt;&lt; std::endl;
&nbsp;&nbsp;&nbsp;&nbsp;}
&nbsp;&nbsp;&nbsp;&nbsp;return 0;
}
</pre>

If "I were the browser", this is how I would display it:

#include <iostream>
static const char* s = "abcde";

int main()
{
    if (s[1] == 'b') {
        std::cout << "It makes me huge frustration." << std::endl;
    }
    return 0;
}

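Right. Nobody is bothered when HTML entities are left in text that is headed for a browser, because turning them into the proper display is the browser's job.

But that doesn't mean html_entity_decode gets to stop at "good enough as long as it's for a browser". Come on.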

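In other words: whenever you have to write "browser-like" processing, you have to simulate "this is how a browser would interpret it". Pygments, for one, is pointless unless you hand it not "text that only looks right once a browser has rendered it" but "the text the writer meant (that is, the text exactly as you see it in the browser)", right? So in this kind of processing, everything, including

&nbsp;

has to be converted to the corresponding character.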

The html_entity_decode manual has this to say:

Something about the Japanese translation of it felt off to me, so here is the English original:

More precisely, this function decodes all the entities (including all numeric entities) that a) are necessarily valid for the chosen document type — i.e., for XML, this function does not decode named entities that might be defined in some DTD — and b) whose character or characters are in the coded character set associated with the chosen encoding and are permitted in the chosen document type. All other entities are left as is.

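I get the feeling that "that are necessarily valid for the chosen document type" means "things that would become invalid (would violate the DTD) unless they are converted", and if so, the translation is odd. Am I misreading it? (As a rendering, I'd have thought something like "those whose validity is indispensable" fits here.)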

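So. It says what it will not convert, but it doesn't say what to do when you do want it converted. In the HTML case we're talking about named entity references such as

&nbsp;
&lt;
&amp;
&gt;

and not all of them come out as the plain character you'd expect. Let me say it again: if "it's for a browser anyway", that's fine. But what am I supposed to do when the consumer is not a browser? (In other words, when the output isn't HTML.)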
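To make "which ones survive" a bit more concrete, here's a small sketch that runs the same string through three flag combinations. The input string and the variable names are mine, not from the manual, and it assumes PHP 5.4 or later (where ENT_HTML401 and ENT_HTML5 exist) with UTF-8 as the target. As far as I understand the manual, the basic references get decoded in every case, while the single-quote forms depend on the flags.

```php
<?php
// Made-up sample input containing the usual suspects.
$input = "&lt;b&gt; &amp; &quot;q&quot; &apos;a&apos; &#39;n&#39;";

// Double quotes only, HTML 4.01 table: single-quote forms are left alone.
$compat = html_entity_decode($input, ENT_COMPAT | ENT_HTML401, 'UTF-8');

// Both quote styles, still the HTML 4.01 table, which has no &apos;.
$html401 = html_entity_decode($input, ENT_QUOTES | ENT_HTML401, 'UTF-8');

// Both quote styles with the HTML5 table, which does define &apos;.
$html5 = html_entity_decode($input, ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump($compat, $html401, $html5);
```

Run it from the command line rather than through a browser, so that nothing re-interprets the output.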

As usual the answer was on StackOverflow, but before getting to it, let me quote this from the very StackOverflow page where I found the answer:

Example array for the HTML 4.01 named entities as specified by W3C as used above follows. It contains 252 entities. If you want to support XHTML, then there is one more (I put it at the end):

$HTML401NamedToNumeric = array(
    '&nbsp;'     => '&#160;',  # no-break space = non-breaking space, U+00A0 ISOnum
    '&iexcl;'    => '&#161;',  # inverted exclamation mark, U+00A1 ISOnum
    '&cent;'     => '&#162;',  # cent sign, U+00A2 ISOnum
    '&pound;'    => '&#163;',  # pound sign, U+00A3 ISOnum
    '&curren;'   => '&#164;',  # currency sign, U+00A4 ISOnum
    '&yen;'      => '&#165;',  # yen sign = yuan sign, U+00A5 ISOnum
    '&brvbar;'   => '&#166;',  # broken bar = broken vertical bar, U+00A6 ISOnum
    '&sect;'     => '&#167;',  # section sign, U+00A7 ISOnum
    '&uml;'      => '&#168;',  # diaeresis = spacing diaeresis, U+00A8 ISOdia
    '&copy;'     => '&#169;',  # copyright sign, U+00A9 ISOnum
    '&ordf;'     => '&#170;',  # feminine ordinal indicator, U+00AA ISOnum
    '&laquo;'    => '&#171;',  # left-pointing double angle quotation mark = left pointing guillemet, U+00AB ISOnum
    '&not;'      => '&#172;',  # not sign, U+00AC ISOnum
    '&shy;'      => '&#173;',  # soft hyphen = discretionary hyphen, U+00AD ISOnum
    '&reg;'      => '&#174;',  # registered sign = registered trade mark sign, U+00AE ISOnum
    '&macr;'     => '&#175;',  # macron = spacing macron = overline = APL overbar, U+00AF ISOdia
    '&deg;'      => '&#176;',  # degree sign, U+00B0 ISOnum
    '&plusmn;'   => '&#177;',  # plus-minus sign = plus-or-minus sign, U+00B1 ISOnum
    '&sup2;'     => '&#178;',  # superscript two = superscript digit two = squared, U+00B2 ISOnum
    '&sup3;'     => '&#179;',  # superscript three = superscript digit three = cubed, U+00B3 ISOnum
    '&acute;'    => '&#180;',  # acute accent = spacing acute, U+00B4 ISOdia
    '&micro;'    => '&#181;',  # micro sign, U+00B5 ISOnum
    '&para;'     => '&#182;',  # pilcrow sign = paragraph sign, U+00B6 ISOnum
    '&middot;'   => '&#183;',  # middle dot = Georgian comma = Greek middle dot, U+00B7 ISOnum
    '&cedil;'    => '&#184;',  # cedilla = spacing cedilla, U+00B8 ISOdia
    '&sup1;'     => '&#185;',  # superscript one = superscript digit one, U+00B9 ISOnum
    '&ordm;'     => '&#186;',  # masculine ordinal indicator, U+00BA ISOnum
    '&raquo;'    => '&#187;',  # right-pointing double angle quotation mark = right pointing guillemet, U+00BB ISOnum
    '&frac14;'   => '&#188;',  # vulgar fraction one quarter = fraction one quarter, U+00BC ISOnum
    '&frac12;'   => '&#189;',  # vulgar fraction one half = fraction one half, U+00BD ISOnum
    '&frac34;'   => '&#190;',  # vulgar fraction three quarters = fraction three quarters, U+00BE ISOnum
    '&iquest;'   => '&#191;',  # inverted question mark = turned question mark, U+00BF ISOnum
    '&Agrave;'   => '&#192;',  # latin capital letter A with grave = latin capital letter A grave, U+00C0 ISOlat1
    '&Aacute;'   => '&#193;',  # latin capital letter A with acute, U+00C1 ISOlat1
    '&Acirc;'    => '&#194;',  # latin capital letter A with circumflex, U+00C2 ISOlat1
    '&Atilde;'   => '&#195;',  # latin capital letter A with tilde, U+00C3 ISOlat1
    '&Auml;'     => '&#196;',  # latin capital letter A with diaeresis, U+00C4 ISOlat1
    '&Aring;'    => '&#197;',  # latin capital letter A with ring above = latin capital letter A ring, U+00C5 ISOlat1
    '&AElig;'    => '&#198;',  # latin capital letter AE = latin capital ligature AE, U+00C6 ISOlat1
    '&Ccedil;'   => '&#199;',  # latin capital letter C with cedilla, U+00C7 ISOlat1
    '&Egrave;'   => '&#200;',  # latin capital letter E with grave, U+00C8 ISOlat1
    '&Eacute;'   => '&#201;',  # latin capital letter E with acute, U+00C9 ISOlat1
    '&Ecirc;'    => '&#202;',  # latin capital letter E with circumflex, U+00CA ISOlat1
    '&Euml;'     => '&#203;',  # latin capital letter E with diaeresis, U+00CB ISOlat1
    '&Igrave;'   => '&#204;',  # latin capital letter I with grave, U+00CC ISOlat1
    '&Iacute;'   => '&#205;',  # latin capital letter I with acute, U+00CD ISOlat1
    '&Icirc;'    => '&#206;',  # latin capital letter I with circumflex, U+00CE ISOlat1
    '&Iuml;'     => '&#207;',  # latin capital letter I with diaeresis, U+00CF ISOlat1
    '&ETH;'      => '&#208;',  # latin capital letter ETH, U+00D0 ISOlat1
    '&Ntilde;'   => '&#209;',  # latin capital letter N with tilde, U+00D1 ISOlat1
    '&Ograve;'   => '&#210;',  # latin capital letter O with grave, U+00D2 ISOlat1
    '&Oacute;'   => '&#211;',  # latin capital letter O with acute, U+00D3 ISOlat1
    '&Ocirc;'    => '&#212;',  # latin capital letter O with circumflex, U+00D4 ISOlat1
    '&Otilde;'   => '&#213;',  # latin capital letter O with tilde, U+00D5 ISOlat1
    '&Ouml;'     => '&#214;',  # latin capital letter O with diaeresis, U+00D6 ISOlat1
    '&times;'    => '&#215;',  # multiplication sign, U+00D7 ISOnum
    '&Oslash;'   => '&#216;',  # latin capital letter O with stroke = latin capital letter O slash, U+00D8 ISOlat1
    '&Ugrave;'   => '&#217;',  # latin capital letter U with grave, U+00D9 ISOlat1
    '&Uacute;'   => '&#218;',  # latin capital letter U with acute, U+00DA ISOlat1
    '&Ucirc;'    => '&#219;',  # latin capital letter U with circumflex, U+00DB ISOlat1
    '&Uuml;'     => '&#220;',  # latin capital letter U with diaeresis, U+00DC ISOlat1
    '&Yacute;'   => '&#221;',  # latin capital letter Y with acute, U+00DD ISOlat1
    '&THORN;'    => '&#222;',  # latin capital letter THORN, U+00DE ISOlat1
    '&szlig;'    => '&#223;',  # latin small letter sharp s = ess-zed, U+00DF ISOlat1
    '&agrave;'   => '&#224;',  # latin small letter a with grave = latin small letter a grave, U+00E0 ISOlat1
    '&aacute;'   => '&#225;',  # latin small letter a with acute, U+00E1 ISOlat1
    '&acirc;'    => '&#226;',  # latin small letter a with circumflex, U+00E2 ISOlat1
    '&atilde;'   => '&#227;',  # latin small letter a with tilde, U+00E3 ISOlat1
    '&auml;'     => '&#228;',  # latin small letter a with diaeresis, U+00E4 ISOlat1
    '&aring;'    => '&#229;',  # latin small letter a with ring above = latin small letter a ring, U+00E5 ISOlat1
    '&aelig;'    => '&#230;',  # latin small letter ae = latin small ligature ae, U+00E6 ISOlat1
    '&ccedil;'   => '&#231;',  # latin small letter c with cedilla, U+00E7 ISOlat1
    '&egrave;'   => '&#232;',  # latin small letter e with grave, U+00E8 ISOlat1
    '&eacute;'   => '&#233;',  # latin small letter e with acute, U+00E9 ISOlat1
    '&ecirc;'    => '&#234;',  # latin small letter e with circumflex, U+00EA ISOlat1
    '&euml;'     => '&#235;',  # latin small letter e with diaeresis, U+00EB ISOlat1
    '&igrave;'   => '&#236;',  # latin small letter i with grave, U+00EC ISOlat1
    '&iacute;'   => '&#237;',  # latin small letter i with acute, U+00ED ISOlat1
    '&icirc;'    => '&#238;',  # latin small letter i with circumflex, U+00EE ISOlat1
    '&iuml;'     => '&#239;',  # latin small letter i with diaeresis, U+00EF ISOlat1
    '&eth;'      => '&#240;',  # latin small letter eth, U+00F0 ISOlat1
    '&ntilde;'   => '&#241;',  # latin small letter n with tilde, U+00F1 ISOlat1
    '&ograve;'   => '&#242;',  # latin small letter o with grave, U+00F2 ISOlat1
    '&oacute;'   => '&#243;',  # latin small letter o with acute, U+00F3 ISOlat1
    '&ocirc;'    => '&#244;',  # latin small letter o with circumflex, U+00F4 ISOlat1
    '&otilde;'   => '&#245;',  # latin small letter o with tilde, U+00F5 ISOlat1
    '&ouml;'     => '&#246;',  # latin small letter o with diaeresis, U+00F6 ISOlat1
    '&divide;'   => '&#247;',  # division sign, U+00F7 ISOnum
    '&oslash;'   => '&#248;',  # latin small letter o with stroke, = latin small letter o slash, U+00F8 ISOlat1
    '&ugrave;'   => '&#249;',  # latin small letter u with grave, U+00F9 ISOlat1
    '&uacute;'   => '&#250;',  # latin small letter u with acute, U+00FA ISOlat1
    '&ucirc;'    => '&#251;',  # latin small letter u with circumflex, U+00FB ISOlat1
    '&uuml;'     => '&#252;',  # latin small letter u with diaeresis, U+00FC ISOlat1
    '&yacute;'   => '&#253;',  # latin small letter y with acute, U+00FD ISOlat1
    '&thorn;'    => '&#254;',  # latin small letter thorn, U+00FE ISOlat1
    '&yuml;'     => '&#255;',  # latin small letter y with diaeresis, U+00FF ISOlat1
    '&fnof;'     => '&#402;',  # latin small f with hook = function = florin, U+0192 ISOtech
    '&Alpha;'    => '&#913;',  # greek capital letter alpha, U+0391
    '&Beta;'     => '&#914;',  # greek capital letter beta, U+0392
    '&Gamma;'    => '&#915;',  # greek capital letter gamma, U+0393 ISOgrk3
    '&Delta;'    => '&#916;',  # greek capital letter delta, U+0394 ISOgrk3
    '&Epsilon;'  => '&#917;',  # greek capital letter epsilon, U+0395
    '&Zeta;'     => '&#918;',  # greek capital letter zeta, U+0396
    '&Eta;'      => '&#919;',  # greek capital letter eta, U+0397
    '&Theta;'    => '&#920;',  # greek capital letter theta, U+0398 ISOgrk3
    '&Iota;'     => '&#921;',  # greek capital letter iota, U+0399
    '&Kappa;'    => '&#922;',  # greek capital letter kappa, U+039A
    '&Lambda;'   => '&#923;',  # greek capital letter lambda, U+039B ISOgrk3
    '&Mu;'       => '&#924;',  # greek capital letter mu, U+039C
    '&Nu;'       => '&#925;',  # greek capital letter nu, U+039D
    '&Xi;'       => '&#926;',  # greek capital letter xi, U+039E ISOgrk3
    '&Omicron;'  => '&#927;',  # greek capital letter omicron, U+039F
    '&Pi;'       => '&#928;',  # greek capital letter pi, U+03A0 ISOgrk3
    '&Rho;'      => '&#929;',  # greek capital letter rho, U+03A1
    '&Sigma;'    => '&#931;',  # greek capital letter sigma, U+03A3 ISOgrk3
    '&Tau;'      => '&#932;',  # greek capital letter tau, U+03A4
    '&Upsilon;'  => '&#933;',  # greek capital letter upsilon, U+03A5 ISOgrk3
    '&Phi;'      => '&#934;',  # greek capital letter phi, U+03A6 ISOgrk3
    '&Chi;'      => '&#935;',  # greek capital letter chi, U+03A7
    '&Psi;'      => '&#936;',  # greek capital letter psi, U+03A8 ISOgrk3
    '&Omega;'    => '&#937;',  # greek capital letter omega, U+03A9 ISOgrk3
    '&alpha;'    => '&#945;',  # greek small letter alpha, U+03B1 ISOgrk3
    '&beta;'     => '&#946;',  # greek small letter beta, U+03B2 ISOgrk3
    '&gamma;'    => '&#947;',  # greek small letter gamma, U+03B3 ISOgrk3
    '&delta;'    => '&#948;',  # greek small letter delta, U+03B4 ISOgrk3
    '&epsilon;'  => '&#949;',  # greek small letter epsilon, U+03B5 ISOgrk3
    '&zeta;'     => '&#950;',  # greek small letter zeta, U+03B6 ISOgrk3
    '&eta;'      => '&#951;',  # greek small letter eta, U+03B7 ISOgrk3
    '&theta;'    => '&#952;',  # greek small letter theta, U+03B8 ISOgrk3
    '&iota;'     => '&#953;',  # greek small letter iota, U+03B9 ISOgrk3
    '&kappa;'    => '&#954;',  # greek small letter kappa, U+03BA ISOgrk3
    '&lambda;'   => '&#955;',  # greek small letter lambda, U+03BB ISOgrk3
    '&mu;'       => '&#956;',  # greek small letter mu, U+03BC ISOgrk3
    '&nu;'       => '&#957;',  # greek small letter nu, U+03BD ISOgrk3
    '&xi;'       => '&#958;',  # greek small letter xi, U+03BE ISOgrk3
    '&omicron;'  => '&#959;',  # greek small letter omicron, U+03BF NEW
    '&pi;'       => '&#960;',  # greek small letter pi, U+03C0 ISOgrk3
    '&rho;'      => '&#961;',  # greek small letter rho, U+03C1 ISOgrk3
    '&sigmaf;'   => '&#962;',  # greek small letter final sigma, U+03C2 ISOgrk3
    '&sigma;'    => '&#963;',  # greek small letter sigma, U+03C3 ISOgrk3
    '&tau;'      => '&#964;',  # greek small letter tau, U+03C4 ISOgrk3
    '&upsilon;'  => '&#965;',  # greek small letter upsilon, U+03C5 ISOgrk3
    '&phi;'      => '&#966;',  # greek small letter phi, U+03C6 ISOgrk3
    '&chi;'      => '&#967;',  # greek small letter chi, U+03C7 ISOgrk3
    '&psi;'      => '&#968;',  # greek small letter psi, U+03C8 ISOgrk3
    '&omega;'    => '&#969;',  # greek small letter omega, U+03C9 ISOgrk3
    '&thetasym;' => '&#977;',  # greek small letter theta symbol, U+03D1 NEW
    '&upsih;'    => '&#978;',  # greek upsilon with hook symbol, U+03D2 NEW
    '&piv;'      => '&#982;',  # greek pi symbol, U+03D6 ISOgrk3
    '&bull;'     => '&#8226;', # bullet = black small circle, U+2022 ISOpub
    '&hellip;'   => '&#8230;', # horizontal ellipsis = three dot leader, U+2026 ISOpub
    '&prime;'    => '&#8242;', # prime = minutes = feet, U+2032 ISOtech
    '&Prime;'    => '&#8243;', # double prime = seconds = inches, U+2033 ISOtech
    '&oline;'    => '&#8254;', # overline = spacing overscore, U+203E NEW
    '&frasl;'    => '&#8260;', # fraction slash, U+2044 NEW
    '&weierp;'   => '&#8472;', # script capital P = power set = Weierstrass p, U+2118 ISOamso
    '&image;'    => '&#8465;', # blackletter capital I = imaginary part, U+2111 ISOamso
    '&real;'     => '&#8476;', # blackletter capital R = real part symbol, U+211C ISOamso
    '&trade;'    => '&#8482;', # trade mark sign, U+2122 ISOnum
    '&alefsym;'  => '&#8501;', # alef symbol = first transfinite cardinal, U+2135 NEW
    '&larr;'     => '&#8592;', # leftwards arrow, U+2190 ISOnum
    '&uarr;'     => '&#8593;', # upwards arrow, U+2191 ISOnum
    '&rarr;'     => '&#8594;', # rightwards arrow, U+2192 ISOnum
    '&darr;'     => '&#8595;', # downwards arrow, U+2193 ISOnum
    '&harr;'     => '&#8596;', # left right arrow, U+2194 ISOamsa
    '&crarr;'    => '&#8629;', # downwards arrow with corner leftwards = carriage return, U+21B5 NEW
    '&lArr;'     => '&#8656;', # leftwards double arrow, U+21D0 ISOtech
    '&uArr;'     => '&#8657;', # upwards double arrow, U+21D1 ISOamsa
    '&rArr;'     => '&#8658;', # rightwards double arrow, U+21D2 ISOtech
    '&dArr;'     => '&#8659;', # downwards double arrow, U+21D3 ISOamsa
    '&hArr;'     => '&#8660;', # left right double arrow, U+21D4 ISOamsa
    '&forall;'   => '&#8704;', # for all, U+2200 ISOtech
    '&part;'     => '&#8706;', # partial differential, U+2202 ISOtech
    '&exist;'    => '&#8707;', # there exists, U+2203 ISOtech
    '&empty;'    => '&#8709;', # empty set = null set = diameter, U+2205 ISOamso
    '&nabla;'    => '&#8711;', # nabla = backward difference, U+2207 ISOtech
    '&isin;'     => '&#8712;', # element of, U+2208 ISOtech
    '&notin;'    => '&#8713;', # not an element of, U+2209 ISOtech
    '&ni;'       => '&#8715;', # contains as member, U+220B ISOtech
    '&prod;'     => '&#8719;', # n-ary product = product sign, U+220F ISOamsb
    '&sum;'      => '&#8721;', # n-ary sumation, U+2211 ISOamsb
    '&minus;'    => '&#8722;', # minus sign, U+2212 ISOtech
    '&lowast;'   => '&#8727;', # asterisk operator, U+2217 ISOtech
    '&radic;'    => '&#8730;', # square root = radical sign, U+221A ISOtech
    '&prop;'     => '&#8733;', # proportional to, U+221D ISOtech
    '&infin;'    => '&#8734;', # infinity, U+221E ISOtech
    '&ang;'      => '&#8736;', # angle, U+2220 ISOamso
    '&and;'      => '&#8743;', # logical and = wedge, U+2227 ISOtech
    '&or;'       => '&#8744;', # logical or = vee, U+2228 ISOtech
    '&cap;'      => '&#8745;', # intersection = cap, U+2229 ISOtech
    '&cup;'      => '&#8746;', # union = cup, U+222A ISOtech
    '&int;'      => '&#8747;', # integral, U+222B ISOtech
    '&there4;'   => '&#8756;', # therefore, U+2234 ISOtech
    '&sim;'      => '&#8764;', # tilde operator = varies with = similar to, U+223C ISOtech
    '&cong;'     => '&#8773;', # approximately equal to, U+2245 ISOtech
    '&asymp;'    => '&#8776;', # almost equal to = asymptotic to, U+2248 ISOamsr
    '&ne;'       => '&#8800;', # not equal to, U+2260 ISOtech
    '&equiv;'    => '&#8801;', # identical to, U+2261 ISOtech
    '&le;'       => '&#8804;', # less-than or equal to, U+2264 ISOtech
    '&ge;'       => '&#8805;', # greater-than or equal to, U+2265 ISOtech
    '&sub;'      => '&#8834;', # subset of, U+2282 ISOtech
    '&sup;'      => '&#8835;', # superset of, U+2283 ISOtech
    '&nsub;'     => '&#8836;', # not a subset of, U+2284 ISOamsn
    '&sube;'     => '&#8838;', # subset of or equal to, U+2286 ISOtech
    '&supe;'     => '&#8839;', # superset of or equal to, U+2287 ISOtech
    '&oplus;'    => '&#8853;', # circled plus = direct sum, U+2295 ISOamsb
    '&otimes;'   => '&#8855;', # circled times = vector product, U+2297 ISOamsb
    '&perp;'     => '&#8869;', # up tack = orthogonal to = perpendicular, U+22A5 ISOtech
    '&sdot;'     => '&#8901;', # dot operator, U+22C5 ISOamsb
    '&lceil;'    => '&#8968;', # left ceiling = apl upstile, U+2308 ISOamsc
    '&rceil;'    => '&#8969;', # right ceiling, U+2309 ISOamsc
    '&lfloor;'   => '&#8970;', # left floor = apl downstile, U+230A ISOamsc
    '&rfloor;'   => '&#8971;', # right floor, U+230B ISOamsc
    '&lang;'     => '&#9001;', # left-pointing angle bracket = bra, U+2329 ISOtech
    '&rang;'     => '&#9002;', # right-pointing angle bracket = ket, U+232A ISOtech
    '&loz;'      => '&#9674;', # lozenge, U+25CA ISOpub
    '&spades;'   => '&#9824;', # black spade suit, U+2660 ISOpub
    '&clubs;'    => '&#9827;', # black club suit = shamrock, U+2663 ISOpub
    '&hearts;'   => '&#9829;', # black heart suit = valentine, U+2665 ISOpub
    '&diams;'    => '&#9830;', # black diamond suit, U+2666 ISOpub
    '&quot;'     => '&#34;',   # quotation mark = APL quote, U+0022 ISOnum
    '&amp;'      => '&#38;',   # ampersand, U+0026 ISOnum
    '&lt;'       => '&#60;',   # less-than sign, U+003C ISOnum
    '&gt;'       => '&#62;',   # greater-than sign, U+003E ISOnum
    '&OElig;'    => '&#338;',  # latin capital ligature OE, U+0152 ISOlat2
    '&oelig;'    => '&#339;',  # latin small ligature oe, U+0153 ISOlat2
    '&Scaron;'   => '&#352;',  # latin capital letter S with caron, U+0160 ISOlat2
    '&scaron;'   => '&#353;',  # latin small letter s with caron, U+0161 ISOlat2
    '&Yuml;'     => '&#376;',  # latin capital letter Y with diaeresis, U+0178 ISOlat2
    '&circ;'     => '&#710;',  # modifier letter circumflex accent, U+02C6 ISOpub
    '&tilde;'    => '&#732;',  # small tilde, U+02DC ISOdia
    '&ensp;'     => '&#8194;', # en space, U+2002 ISOpub
    '&emsp;'     => '&#8195;', # em space, U+2003 ISOpub
    '&thinsp;'   => '&#8201;', # thin space, U+2009 ISOpub
    '&zwnj;'     => '&#8204;', # zero width non-joiner, U+200C NEW RFC 2070
    '&zwj;'      => '&#8205;', # zero width joiner, U+200D NEW RFC 2070
    '&lrm;'      => '&#8206;', # left-to-right mark, U+200E NEW RFC 2070
    '&rlm;'      => '&#8207;', # right-to-left mark, U+200F NEW RFC 2070
    '&ndash;'    => '&#8211;', # en dash, U+2013 ISOpub
    '&mdash;'    => '&#8212;', # em dash, U+2014 ISOpub
    '&lsquo;'    => '&#8216;', # left single quotation mark, U+2018 ISOnum
    '&rsquo;'    => '&#8217;', # right single quotation mark, U+2019 ISOnum
    '&sbquo;'    => '&#8218;', # single low-9 quotation mark, U+201A NEW
    '&ldquo;'    => '&#8220;', # left double quotation mark, U+201C ISOnum
    '&rdquo;'    => '&#8221;', # right double quotation mark, U+201D ISOnum
    '&bdquo;'    => '&#8222;', # double low-9 quotation mark, U+201E NEW
    '&dagger;'   => '&#8224;', # dagger, U+2020 ISOpub
    '&Dagger;'   => '&#8225;', # double dagger, U+2021 ISOpub
    '&permil;'   => '&#8240;', # per mille sign, U+2030 ISOtech
    '&lsaquo;'   => '&#8249;', # single left-pointing angle quotation mark, U+2039 ISO proposed
    '&rsaquo;'   => '&#8250;', # single right-pointing angle quotation mark, U+203A ISO proposed
    '&euro;'     => '&#8364;', # euro sign, U+20AC NEW
);

And the one for XHTML:

    '&apos;'     => '&#39;',   # apostrophe = APL quote, U+0027 ISOnum

Good grief. In short, this is how many things html_entity_decode can end up leaving behind, depending on the flags and the target encoding.
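I assume a table like this is meant to be applied by first rewriting the named references into numeric ones and then decoding those. Here's a minimal sketch of that idea with only a four-entry excerpt of the table. The helper name named_to_utf8 and the sample string are made up for illustration, and it assumes PHP 7 or later with UTF-8 as the target.

```php
<?php
// Excerpt of the quoted table, plus the XHTML-only &apos;.
$namedToNumeric = [
    '&nbsp;' => '&#160;',
    '&larr;' => '&#8592;',
    '&euro;' => '&#8364;',
    '&apos;' => '&#39;',
];

// Hypothetical helper: named references -> numeric references -> UTF-8 text.
function named_to_utf8(string $str, array $map): string
{
    // strtr() with an array replaces each key by its value in one pass.
    $numeric = strtr($str, $map);
    // ENT_QUOTES is needed so that &#39; is decoded as well.
    return html_entity_decode($numeric, ENT_QUOTES | ENT_HTML401, 'UTF-8');
}

$out = named_to_utf8('5&nbsp;&euro; &larr; it&apos;s', $namedToNumeric);
echo $out, "\n";
```

Note that the `&nbsp;` still ends up as U+00A0 here, not as an ordinary space.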

Seeing is believing, so here is a PHP script to check it:

<style type="text/css">
.container { width: 800px; height: 200px; }
</style>
<form method="POST">
<div class="container">
<textarea name='txt' style='width: 100%; height: 100%;'>
<?php
  if ($_POST) {
    echo $_POST['txt'];
  }
?>
</textarea>
</div>
<input type="submit">
</form>

<pre style="border: 1px solid #777;">
<?php
  if ($_POST) {
    echo html_entity_decode($_POST['txt']);
  }
?>
</pre>
<pre style="border: 1px solid #777;">
<?php
  if ($_POST) {
    echo html_entity_decode($_POST['txt'], ENT_QUOTES | ENT_COMPAT | ENT_HTML401);
  }
?>
</pre>

<pre style="border: 1px solid #777;">
<?php
  if ($_POST) {
    echo htmlentities($_POST['txt']);
  }
?>
</pre>
<pre style="border: 1px solid #777;">
<?php
  if ($_POST) {
    echo htmlentities($_POST['txt'], ENT_QUOTES | ENT_COMPAT | ENT_HTML401);
  }
?>
</pre>

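(Just looking at the page in the browser is not enough to verify this. Check it with "View Source".)

As for the results: people who are "stuck on the same thing right now" won't need them, and people who "might get stuck and want to know" are better off trying it themselves. Every variant gives a disappointing result. The most disappointing ones are

&nbsp;
&apos;

As for apos, the StackOverflow quote above should explain it. As for nbsp, the PHP manual says this: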

You might wonder why trim(html_entity_decode('&nbsp;')); doesn't reduce the string to an empty string, that's because the '&nbsp;' entity is not ASCII code 32 (which is stripped by trim()) but ASCII code 160 (0xa0) in the default ISO 8859-1 encoding.
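The same thing can be checked from the command line in a few lines. This is only a sketch, assuming PHP 7 or later (for the `\u{...}` escape) and UTF-8 as the target encoding, in which case the no-break space comes out as the two bytes C2 A0 instead of the single byte A0. The variable names are mine.

```php
<?php
// Decode a lone &nbsp; into UTF-8.
$decoded = html_entity_decode('&nbsp;', ENT_QUOTES | ENT_HTML401, 'UTF-8');

// Show the raw bytes: U+00A0 encoded as UTF-8.
echo bin2hex($decoded), "\n";

// trim() only strips ASCII whitespace by default, so this is not empty.
$trimmedIsEmpty = (trim($decoded) === '');
var_dump($trimmedIsEmpty);

// Mapping U+00A0 to an ordinary space by hand makes trim() work as hoped.
$plain = str_replace("\u{00A0}", ' ', $decoded);
$plainTrimmedIsEmpty = (trim($plain) === '');
var_dump($plainTrimmedIsEmpty);
```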

The answer had been sitting on that StackOverflow page all along and it took me a while to notice it, but it goes roughly like this:

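// Decode named and numeric entities to UTF-8, then mop up any decimal
// numeric references that are still left over.
function entities_to_unicode($str, $flags) {
    $str = html_entity_decode(
        $str, $flags,
        'UTF-8');
    // Note: mbstring's "HTML-ENTITIES" encoding is deprecated as of PHP 8.2.
    $str = preg_replace_callback(
        "/(&#[0-9]+;)/", function($m) {
            return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES");
        }, $str);
    return $str;
}

// ENT_QUOTES already includes ENT_COMPAT.
$flags = ENT_QUOTES | ENT_COMPAT | ENT_HTML401;
// $some_input is the entity-encoded text. &nbsp; and &apos; are replaced by
// hand first: the former would become U+00A0, the latter is not in HTML 4.01.
$result = entities_to_unicode(
    preg_replace(
        array('/&nbsp;/', '/&apos;/'),
        array(' ', "'"),
        $some_input), $flags);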

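UTF-8 is specified explicitly because, as you can tell from the quote above (the one that wasn't the main point), many of the HTML 4.01 named entity references stand for non-ASCII characters, which means the conversion target has to be Unicode. A reference that can't be represented in LATIN-1, for example, can't be converted to LATIN-1. Obviously. Ones that are easy to grasp even for Japanese speakers would be

&larr; (←)
&uarr; (↑)
&rarr; (→)
&darr; (↓)
&harr; (↔)
&crarr; (↵)
&lArr; (⇐)
&uArr; (⇑)
&rArr; (⇒)
&dArr; (⇓)
&hArr; (⇔)

and the like.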
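The effect of the target encoding is easy to see with one arrow and one accented letter. A small sketch, assuming PHP 7 or later; the sample string and variable names are mine. With ISO-8859-1 as the target, the arrow has nowhere to go and, as I understand the manual's "left as is", stays a reference, while `&eacute;` fits into a single byte.

```php
<?php
$flags = ENT_QUOTES | ENT_HTML401;
$input = '&larr; &eacute;';

// Latin-1 target: U+2190 is not representable, U+00E9 is (byte 0xE9).
$latin1 = html_entity_decode($input, $flags, 'ISO-8859-1');

// UTF-8 target: both references can be decoded.
$utf8 = html_entity_decode($input, $flags, 'UTF-8');

// Print as hex so the terminal encoding doesn't get in the way.
echo bin2hex($latin1), "\n";
echo bin2hex($utf8), "\n";
```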

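If it feels wrong that you have to drag in mb_convert_encoding to get this done, that's surely because html_entity_decode doesn't live up to its name. Anyone would want to believe that something that goes this far "is in html_entity_decode".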
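For what it's worth, on newer PHP it looks possible to get close to this without mb_convert_encoding at all. The following is only a sketch under my own assumptions: PHP 7 or later, UTF-8 everywhere, and the HTML5 entity table (ENT_HTML5, available since PHP 5.4) being acceptable for the input. The function name decode_for_plain_text is hypothetical and not part of PHP or of the StackOverflow answer.

```php
<?php
// Hypothetical helper: turn entity-encoded text into the plain text
// a reader would see in the browser.
function decode_for_plain_text(string $html): string
{
    // ENT_QUOTES decodes both quote styles; ENT_HTML5 selects the HTML5
    // named-reference table, which includes &apos;.
    $text = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // &nbsp; decodes to U+00A0, so map it to an ordinary space by hand
    // for consumers that expect plain ASCII spaces.
    return str_replace("\u{00A0}", ' ', $text);
}

$encoded = '&nbsp;&nbsp;if (s[1] == &#39;b&#39;) std::cout &lt;&lt; &quot;x&quot;; // it&apos;s';
$plain = decode_for_plain_text($encoded);
echo $plain, "\n";
```

Whether replacing every U+00A0 with a space is right depends on the input; for source code pasted through a visual editor it is probably what you want.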