2014/02/07 (Fri) Handling strings that contain malformed UTF-8 characters in Perl, and utf-8-strict
I felt I did not understand this topic deeply enough, so I wrote it up as code.
Preparation:
perl -e 'print "a\x{FFFF_FFFF}b"' > malformed_utf8.txt
perl -e 'print "\x{FFFE}"' > FFFE.txt
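To see which octets the first one-liner actually writes, the following is a minimal sketch that builds the same string and dumps its bytes in hex. It assumes the file receives Perl's internal (extended) UTF-8 representation, which is what `print` emits when no encoding layer is set; the variable names are only for illustration.

```perl
use strict;
use warnings;
no warnings 'utf8';    # some perls warn about code points above U+10FFFF

# Same character string that the first one-liner prints.
my $octets = "a" . chr(0xFFFF_FFFF) . "b";

# Convert the character string into its (Perl-extended) UTF-8 octets in place.
utf8::encode($octets);

# Dump every octet as two hex digits.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $octets;
print "$hex\n";
```

On the perls I am aware of this should give `61 FE 83 BF BF BF BF BF 62`: one `a`, a seven-byte sequence starting with FE, and one `b`. That is consistent with the FE, 83 and BF bytes named in the warnings under Output, and with the seven U+FFFD lines in the FB_DEFAULT run.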
Code:
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw/decode_utf8/;
my $IN_FILE = 'malformed_utf8.txt';
my $IN_FILE2 = 'FFFE.txt';
{
open(my $fh, '<', $IN_FILE) or die $!;
chomp(my $text = <$fh>);
close($fh);
# On malformed input, substitute the replacement character (U+FFFD)
print "Encode::DEFAULT\n";
printf( "U+%04X\n", ord decode_utf8($_, Encode::FB_DEFAULT) ) for split(//, $text);
# On malformed input, die immediately
print "Encode::FB_CROAK\n";
eval { printf( "U+%04X\n", ord decode_utf8($_, Encode::FB_CROAK) ) for split(//, $text) };
print "Encode::FB_CROAK: $@" if $@;
# On malformed input, return only the part of the data decoded so far
print "Encode::FB_QUIET\n";
printf( "U+%04X\n", ord decode_utf8($_, Encode::FB_QUIET) ) for split(//, $text);
# FB_QUIET plus a warning (handy when debugging)
print "Encode::FB_WARN\n";
printf( "U+%04X\n", ord decode_utf8($_, Encode::FB_WARN) ) for split(//, $text);
print "\n";
}
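## Telling utf8 and utf-8-strict apart
# It depends on whether there is a hyphen (or an underscore '_') between "utf" and "8" (case does not matter): with one, the name means utf-8-strict
## What is utf-8-strict?
# It imposes the following restrictions:
# The non-character code points U+FDD0 .. U+FDEF are not allowed
# The last two code points of each Unicode plane, which are non-characters, are not allowed (U+XXFFFE, U+XXFFFF, XX = 0x00 .. 0x10)
# Non-shortest encodings are not allowed
# Allowing those would let, for example, a non-shortest-form slash slip past validation and become a vulnerability
# So always use utf-8-strict for untrusted input from outside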
{
open(my $fh, '<:encoding(utf-8)', $IN_FILE) or die $!;
chomp(my $text = <$fh>);
close($fh);
printf("U+%04X\n", ord) for split(//, $text);
print "$text\n";
print "\n";
}
{
open(my $fh, '<:utf8', $IN_FILE) or die $!;
chomp(my $text = <$fh>);
close($fh);
printf("U+%04X\n", ord) for split(//, $text);
print "\n";
}
# From here on: U+FFFE
{
open(my $fh, '<:encoding(utf-8)', $IN_FILE2) or die $!;
chomp(my $text = <$fh>);
close($fh);
printf("U+%04X\n", ord) for split(//, $text);
print "$text\n";
print "\n";
}
{
open(my $fh, '<:utf8', $IN_FILE2) or die $!;
chomp(my $text = <$fh>);
close($fh);
printf("U+%04X\n", ord) for split(//, $text);
print "\n";
}
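The naming rule and the non-shortest-form restriction described in the comments above can be checked directly. The following is a minimal sketch, assuming only the documented `find_encoding` and `decode` functions of Encode; the labels tried and the variable names are my own choices for illustration.

```perl
use strict;
use warnings;
use Encode qw/decode find_encoding/;

# Resolve several spellings to the canonical name of the encoding Encode uses.
my %resolved;
for my $label (qw/utf8 UTF8 utf-8 UTF-8 utf_8/) {
    $resolved{$label} = find_encoding($label)->name;
    printf "%-5s => %s\n", $label, $resolved{$label};
}

# "\xC0\xAF" is a two-byte, non-shortest encoding of '/' (U+002F).
my $overlong_slash = "\xC0\xAF";

# FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps the source intact.
my $decoded = eval {
    decode('UTF-8', $overlong_slash, Encode::FB_CROAK | Encode::LEAVE_SRC);
};
my $verdict = defined $decoded ? 'accepted' : 'rejected';
print "overlong slash under UTF-8 (strict): $verdict\n";
```

According to the Encode documentation, the spellings without a hyphen or underscore resolve to `utf8` and the others to `utf-8-strict`, and the strict decoder should refuse the overlong slash.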
Output:
Encode::DEFAULT
U+0061
U+FFFD
U+FFFD
U+FFFD
U+FFFD
U+FFFD
U+FFFD
U+FFFD
U+0062
Encode::FB_CROAK
U+0061
Encode::FB_CROAK: utf8 "\xFE" does not map to Unicode at /home/***/.plenv/versions/5.18.2/lib/perl5/site_perl/5.18.2/x86_64-linux/Encode.pm line 215.
Encode::FB_QUIET
U+0061
U+0000
U+0000
U+0000
U+0000
U+0000
U+0000
U+0000
U+0062
Encode::FB_WARN
U+0061
utf8 "\xFE" does not map to Unicode at /home/***/.plenv/versions/5.18.2/lib/perl5/site_perl/5.18.2/x86_64-linux/Encode.pm line 215.
U+0000
utf8 "\x83" does not map to Unicode at /home/***/.plenv/versions/5.18.2/lib/perl5/site_perl/5.18.2/x86_64-linux/Encode.pm line 215.
U+0000
utf8 "\xBF" does not map to Unicode at /home/***/.plenv/versions/5.18.2/lib/perl5/site_perl/5.18.2/x86_64-linux/Encode.pm line 215.
U+0000
utf8 "\xBF" does not map to Unicode at /home/***/.plenv/versions/5.18.2/lib/perl5/site_perl/5.18.2/x86_64-linux/Encode.pm line 215.
U+0000
utf8 "\xBF" does not map to Unicode at /home/***/.plenv/versions/5.18.2/lib/perl5/site_perl/5.18.2/x86_64-linux/Encode.pm line 215.
U+0000
utf8 "\xBF" does not map to Unicode at /home/***/.plenv/versions/5.18.2/lib/perl5/site_perl/5.18.2/x86_64-linux/Encode.pm line 215.
U+0000
utf8 "\xBF" does not map to Unicode at /home/***/.plenv/versions/5.18.2/lib/perl5/site_perl/5.18.2/x86_64-linux/Encode.pm line 215.
U+0000
U+0062
utf8 "\xFFFFFFFF" does not map to Unicode at read_malformed_utf8.pl line 48.
U+0061
U+005C
U+0078
U+007B
U+0046
U+0046
U+0046
U+0046
U+0046
U+0046
U+0046
U+0046
U+007D
U+0062
a\x{FFFFFFFF}b
U+0061
U+FFFFFFFF
U+0062
utf8 "\xFFFE" does not map to Unicode at read_malformed_utf8.pl line 69.
U+005C
U+0078
U+007B
U+0046
U+0046
U+0046
U+0045
U+007D
\x{FFFE}
U+FFFE
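The script above decodes one byte at a time to show each fallback mode. For real untrusted input you would more likely decode the whole octet string at once and reject it on the first error. The following is a minimal sketch of that, assuming the strict `UTF-8` encoding with `FB_CROAK`; `decode_untrusted` is a hypothetical helper name, not something Encode provides.

```perl
use strict;
use warnings;
use Encode qw/decode/;

# Hypothetical helper: returns the decoded character string,
# or undef when the octets are not valid strict UTF-8.
sub decode_untrusted {
    my ($octets) = @_;
    return scalar eval {
        decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
}

# Well-formed input: 'a', U+3042 (three octets), 'b'.
my $ok = decode_untrusted("a\xE3\x81\x82b");

# The same octets that malformed_utf8.txt held in the author's run.
my $bad = decode_untrusted("a\xFE\x83\xBF\xBF\xBF\xBF\xBFb");

printf "ok:  %s\n", join ' ', map { sprintf 'U+%04X', ord } split //, $ok;
print "bad: ", ( defined $bad ? "decoded" : "rejected" ), "\n";
```

Returning undef instead of a partially decoded string keeps the caller from accidentally processing data that failed validation.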