
Matz's Diary (Matzにっき)

June 22, 2006

_ [Ruby] Unicode

A library for handling Unicode in Ruby, announced in [ruby-talk:197946].

Download it from <URL:ftp://ftp.mars.org/pub/ruby/Unicode.tar.bz2>.

Usage looks like this.

Unicode strings can be obtained by applying the + unary operator to native strings, e.g. +"Hello" (where the native string is encoded in the default encoding).

% irb -I. -runicode -Ku
irb(main):001:0> ustr = +"π is pi"
=> +"π is pi"

Native strings are obtained from Unicode strings by calling to_s, which accepts an optional argument to indicate the desired encoding.

irb(main):002:0> str = ustr.to_s
=> "π is pi"
irb(main):003:0> str.encoding
=> Unicode::Encoding::UTF8

Individual characters can be indexed from Unicode strings, returning a Unicode::Character object.

irb(main):004:0> ustr[0]
=> U+03C0 GREEK SMALL LETTER PI

Case conversion is handled as with native strings.

irb(main):005:0> ustr.upcase
=> +"Π IS PI"

Normalization is accomplished with the ~ unary operator.

irb(main):006:0> ustr = +"mí"
=> +"mí"
irb(main):007:0> ustr.to_a
=> [U+006D LATIN SMALL LETTER M, U+00ED LATIN SMALL LETTER I WITH ACUTE]
irb(main):008:0> (~ustr).each_char { |ch| p ch }
U+006D LATIN SMALL LETTER M
U+0069 LATIN SMALL LETTER I
U+0301 COMBINING ACUTE ACCENT
=> +"mí"

Really interesting.
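For comparison, the decomposition shown in the last transcript can be reproduced without the library. This is only a sketch and assumes a later Ruby (2.2 or newer) that ships String#unicode_normalize; it is not the API of the library above, and the helper name decomposed_codepoints is made up for illustration.

```ruby
# Hypothetical helper (not part of the Unicode library described above):
# decompose a string to NFD and list each code point as "U+XXXX".
def decomposed_codepoints(str)
  str.unicode_normalize(:nfd).codepoints.map { |cp| format("U+%04X", cp) }
end

# "m" followed by U+00ED (LATIN SMALL LETTER I WITH ACUTE), written with an
# escape so the example does not depend on the source file's encoding.
p decomposed_codepoints("m\u00ED")
```

The precomposed U+00ED splits into U+0069 and U+0301, matching what the ~ operator produced in the transcript.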

_ [Ruby] auto conversion

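Ruby M17N is designed primarily to process multiple encodings with (as far as possible) no conversion. Setting aside the one-program, one-encoding case, as in C's locale model, I had been thinking that when multiple encodings are mixed you would ultimately have to convert everything to a unified internal character set (Universal Character Set, UCS) before processing it.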

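Or rather, the truth is that I have not paid much attention to conversion. This is where Ruby differs from other languages (Perl, Python, and so on), whose basic approach is to convert to Unicode.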

That said, for practical use conversion has to happen somewhere, and I had assumed it would surely be done in IO.
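To make "conversion in IO" concrete, here is a minimal sketch of transcoding at the read boundary. It assumes a later Ruby (1.9 or newer) where File.open accepts an "external:internal" encoding pair in the mode string; the helper names read_transcoded and demo_io_conversion are hypothetical and do not come from this post.

```ruby
require "tempfile"

# Hypothetical helper: read a file whose bytes are in `external` and return
# its contents transcoded to `internal` as they pass through IO.
def read_transcoded(path, external:, internal: "UTF-8")
  File.open(path, "r:#{external}:#{internal}") { |f| f.read }
end

# Write three Japanese characters as EUC-JP bytes, then read them back as UTF-8.
def demo_io_conversion
  Tempfile.create("m17n") do |tmp|
    # \u escapes keep the example independent of the source file's encoding.
    File.binwrite(tmp.path, "\u65E5\u672C\u8A9E".encode("EUC-JP"))
    read_transcoded(tmp.path, external: "EUC-JP")
  end
end

result = demo_io_conversion
p result.encoding
```

With this arrangement the rest of the program only ever sees one encoding, and the conversion happens at a single, visible place.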

However, someone has now argued strongly in favor of automatic conversion (coercing). [ruby-talk:198475]

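I have shied away from automatic conversion for reasons such as

two encodings are not necessarily convertible to each other
conversion can lose information without you noticing
when an error occurs, it tends to be hard to tell where the data came from

but this proposal is fairly concrete.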

#
# NOTES:
# a) String#recode!(new_encoding) replaces current
#    internal byte representation with new byte sequence,
#    that is recoded current. must raise IncompatibleCharError, if
#    can't convert char to destination encoding
# b) downgrading string from some stated encoding to "none"  tag must
#    be done only explicitly.
#    it is not an option for implicit conversion
# c) $APPLICATION_UNIVERSAL_ENCODING is a global var, allowed to be
#    set once and only once per application run.
#    Intent: we want all strings which aren't raw bytes to be in one
#    single predefined encoding,
#    so all operations on string must return string in conformant encoding.
#    Desired encoding is value of $APPLICATION_UNIVERSAL_ENCODING.
#    If $APPLICATION_UNIVERSAL_ENCODING is nil, we go in "democracy
#    mode", see below.
#
def coerce_encodings(str1, str2)
  enc1 = str1.encoding
  enc2 = str2.encoding

  # simple case, same encodings, will return fast in most cases
  return if enc1 == enc2

  # another simple but rare case, totally incompatible encodings, as
  # they represent incompatible charsets
  if fully_incompatible_charsets?(enc1, enc2)
    raise IncompatibleCharError,
          format("incompatible charsets %s and %s", enc1, enc2)
  end

  # uncertainty, handling "none" and preset encoding
  if enc1 == "none" || enc2 == "none"
    raise UnknownIntentEncodingError,
          format("can't implicitly coerce encodings %s and %s, use explicit conversion", enc1, enc2)
  end

  # Tyranny mode:
  # we want all strings which aren't raw bytes to be in one single
  # predefined encoding
  if $APPLICATION_UNIVERSAL_ENCODING
    str1.recode!($APPLICATION_UNIVERSAL_ENCODING)
    str2.recode!($APPLICATION_UNIVERSAL_ENCODING)
    return
  end

  # Democracy mode:
  # first try to perform non-loss conversion from one encoding to another:
  # 1) direct conversion, without loss, to another encoding, e.g. UTF8 + UTF16
  if exists_direct_non_loss_conversion?(enc1, enc2)
    if exists_direct_non_loss_conversion?(enc2, enc1)
      # performance hint if both available
      if str1.byte_length < str2.byte_length
        str1.recode!(enc2)
      else
        str2.recode!(enc1)
      end
    else
      str1.recode!(enc2)
    end
    return
  end
  if exists_direct_non_loss_conversion?(enc2, enc1)
    str2.recode!(enc1)
    return
  end

  # 2) non-loss conversion to superset
  # (I see no reason to raise exception on KOI8R + CP1251,
  # returning string in Unicode will be OK)
  if superset_encoding = find_superset_non_loss_conversion?(enc1, enc2)
    str1.recode!(superset_encoding)
    str2.recode!(superset_encoding)
    return
  end

  # A case for incomplete compatibility:
  # Check if subset of enc1 is also subset of enc2,
  # so some strings in enc1 can be safely recoded to enc2,
  # e.g. two pure ASCII strings, whatever ASCII-compatible encoding
  # they have
  if exists_partial_loss_conversion?(enc1, enc2)
    if exists_partial_loss_conversion?(enc2, enc1)
      # performance hint if both available
      if str1.byte_length < str2.byte_length
        str1.recode!(enc2)
      else
        str2.recode!(enc1)
      end
    else
      str1.recode!(enc2)
    end
    return
  end

  # the last thing we can try
  str2.recode!(enc1)
end

Hmm, interesting (that's all I ever say).
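Note (a) of the proposal requires recode! to raise when a character cannot be represented in the destination encoding. As a small sketch of that distinction, the following uses String#encode from later Ruby versions. That API is an assumption here and is not part of the proposal, and strict_recode and lossy_recode are hypothetical names. It contrasts a strict conversion with one that silently drops information.

```ruby
# Hypothetical strict conversion: returns a recoded copy, or nil when some
# character has no representation in the target encoding.
def strict_recode(str, enc)
  str.encode(enc)
rescue Encoding::UndefinedConversionError
  nil
end

# Hypothetical lossy conversion: unrepresentable characters become "?".
def lossy_recode(str, enc)
  str.encode(enc, undef: :replace, replace: "?")
end

pi = "\u03C0 is pi" # GREEK SMALL LETTER PI does not exist in ISO-8859-1
p strict_recode(pi, "ISO-8859-1")
p lossy_recode(pi, "ISO-8859-1")
```

The strict variant makes the failure visible at the point of conversion, while the lossy one is exactly the kind of quiet information loss that makes implicit coercion worrying.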

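Indeed, the usual application model is almost always one of

one program, one encoding (though it can be switched)
one program, one internal encoding (probably Unicode)

so with that in mind, this approach may not be so bad. Still, having the contents of a string swapped out without my noticing is a little scary.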
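One way to soften that last worry would be for the coercion to return recoded copies instead of rewriting its arguments. The following is a simplified sketch under stated assumptions: it uses String#encode and the EncodingError hierarchy from later Ruby versions rather than the proposal's recode!, the name coerce_pair and the universal: keyword are hypothetical, and the fallback to UTF-8 stands in for the proposal's superset search.

```ruby
# Hypothetical, non-mutating variant of the proposal's idea: returns copies
# of both strings in a common encoding and leaves the originals untouched.
def coerce_pair(str1, str2, universal: nil)
  return [str1, str2] if str1.encoding == str2.encoding

  # Raw bytes (the proposal's "none") are never coerced implicitly.
  if [str1, str2].any? { |s| s.encoding == Encoding::BINARY }
    raise ArgumentError, "can't implicitly coerce binary strings"
  end

  # "Tyranny mode": an application-wide encoding, if given, always wins.
  return [str1.encode(universal), str2.encode(universal)] if universal

  # "Democracy mode": try each string's own encoding, then a Unicode superset.
  [str1.encoding, str2.encoding, Encoding::UTF_8].each do |enc|
    begin
      return [str1.encode(enc), str2.encode(enc)]
    rescue EncodingError
      next # some character has no representation in enc; try the next one
    end
  end

  raise Encoding::CompatibilityError,
        "cannot coerce #{str1.encoding} and #{str2.encoding}"
end

greek = "\u03C0".encode("ISO-8859-7")  # pi in a Greek single-byte encoding
cyrillic = "\u0416".encode("KOI8-R")   # ZHE in a Cyrillic single-byte encoding
a, b = coerce_pair(greek, cyrillic)
p [a.encoding, b.encoding, greek.encoding, cyrillic.encoding]
```

Because the inputs are never modified, the caller can always tell which encoding a given string started in, at the cost of allocating new strings on every mixed-encoding operation.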

