GBString: a string class that is aware of the GB18030 encoding

1^# bob.yang posted on 2007-07-08 12:47

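GBString is a string class that is aware of the GB18030 encoding. It overrides some methods of the String class so that strings whose internal encoding is GB18030, GBK, or GB2312 can be handled conveniently.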

Project homepage
http://rubyforge.org/projects/gbstring/

License
====================
GBString, a Ruby class similar to the String class but aware of the GB18030 encoding.

Copyright (C) <2007> <Bob Yang>

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

Email: bob.yang.dev at gmail.com
====================

I. Quick start
====================
1. Constructing a GBString object
require "g_b_string"
gbstr = GBString.new("这是一个GB18030编码的字符串！") # Note: the source file itself must be saved in GB18030

Or use the shorthand below:

gbstr = _c("这也是一个GB18030编码的字符串！")

2. Measuring string length in characters rather than bytes
gbstr = _c("中文串")
puts( gbstr.size ) # => 3, not 6
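
For comparison, the same distinction between bytes and characters can be sketched with the encoding support built into Ruby 1.9 and later. This is an assumption about the reader's interpreter and does not use GBString: String#encode, bytesize and size are core Ruby methods, while the helper name gb18030_sizes is made up for illustration.

```ruby
# Sketch only: plain Ruby 1.9+, no GBString involved.
# gb18030_sizes is a hypothetical helper name.
def gb18030_sizes(text)
  gb = text.encode("GB18030")   # transcode the text to GB18030 bytes
  [gb.bytesize, gb.size]        # [byte count, character count]
end

bytes, chars = gb18030_sizes("中文串")
puts "bytes: #{bytes}, characters: #{chars}"   # bytes: 6, characters: 3
```
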

3. Iterating over a Chinese string
_c("中文串abc").each do |char|
  puts char
end
# => 中
# => 文
# => 串
# => a
# => b
# => c

II. More usage
====================
See test/tc_g_b_string.rb for reference.

1. each, each_with_index
Each element yielded is a single character, of type String.

2. split

cstr = _c("第一 第二 第三")
tokens = cstr.split(" ")
tokens[0] # => 第一
tokens[1] # => 第二
tokens[2] # => 第三

3. The index operator '[ ]'. Note that the object returned is a GBString.

cstr = _c"甲A"
puts( cstr[0] ) # => 甲
puts( cstr[1] ) # => A
puts( cstr[0].class ) # => GBString
puts( cstr[1].class ) # => GBString

4. Type conversion: to_a, range
# Convert to an array with to_a
gbstr = _c("类型map")
array = gbstr.to_a
puts array.size # => 5
puts array[0] # => 类
puts array[3] # => a

# Indexing with a range; the object returned is a GBString
gbstr = _c"This is a \"中文字符串\"！"
puts( gbstr[10..16] ) # => "中文字符串"
puts( gbstr[-8..-1] ) # => "中文字符串"！
puts( gbstr[10, 7] ) # => "中文字符串"
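
The index and range results above can be cross-checked against the character-based indexing that Ruby 1.9 and later apply to encoded strings. The following is a sketch under that assumption; slice_gb18030 is a hypothetical helper, and the results come back as plain String objects, not GBString.

```ruby
# Sketch only: plain Ruby 1.9+; slice_gb18030 is a hypothetical helper.
def slice_gb18030(text, *index)
  gb = text.encode("GB18030")
  gb[*index].encode("UTF-8")   # index by character, then convert back for display
end

sample = "This is a \"中文字符串\"！"
puts slice_gb18030(sample, 10..16)   # "中文字符串"
puts slice_gb18030(sample, -8..-1)   # "中文字符串"！
puts slice_gb18030(sample, 10, 7)    # "中文字符串"
puts slice_gb18030("甲A", 0)         # 甲
```
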

5. Use as a Hash key. A GBString is equal to a String object with the same content.
hash = {_c("中国")=>1, _c("贵州省")=>2, _c("贵阳市")=>3}
puts( hash[_c("贵阳市")] ) # => 3
puts( hash["贵阳市"] ) # => 3
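
One way such interchangeability with String keys can arise is through subclassing. The sketch below is an illustration only and makes no claim about how GBString is actually implemented: CharString is a hypothetical class that inherits String#hash and String#eql? unchanged.

```ruby
# Sketch only: CharString is a hypothetical stand-in, not GBString's real code.
# A String subclass inherits String#hash and String#eql?, so it can be
# looked up in a Hash with a plain String holding the same bytes.
class CharString < String
end

places = {
  CharString.new("中国") => 1,
  CharString.new("贵州省") => 2,
  CharString.new("贵阳市") => 3
}

puts places[CharString.new("贵阳市")]   # 3
puts places["贵阳市"]                   # 3
```
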

III. Running the unit tests
====================
Change into the test directory and run ruby tc_g_b_string.rb.
If everything works, the output ends with:
7 tests, 56 assertions, 0 failures, 0 errors

2^# admin posted on 2007-07-08 14:52

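The first thing that comes to mind for getting a string's length is String.length, but that method only returns the correct character count for pure ASCII strings. It does not work for Chinese or mixed Chinese/English strings. Consider the following example.

Copy the code below and save it as kcode.rb.
text="y呀"
chars = text.split(//)
puts "the length of the Array: ",chars.length
puts "the length of the String: ",text.length #will be 3

The string text mixes English and Chinese, and text.length returns 3, because the Chinese character "呀" is counted as two characters, one per byte.

Running ruby kcode.rb prints: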
the length of the Array:
3
the length of the String:
3

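Here we can see that Ruby treats the Chinese character as two characters.

But what if we want Ruby to treat a Chinese character as a single character? Try running the file with ruby -KU kcode.rb, which prints: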
the length of the Array:
2
the length of the String:
3

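The 2 in the output is the result we want. The only difference here is the -KU option.

-K is an option of ruby.exe. Its documentation says:
Specifies the code set to be used. This option is useful mainly when Ruby is used for Japanese-language processing. kcode may be one of: e, E for EUC; s, S for SJIS; u, U for UTF-8; or a, A, n, N for ASCII. (Author's note: Ruby was created by a Japanese developer, so the documentation describes this option as mainly intended for Japanese-language processing.)

Here -KU selects UTF-8. With this option set, text.split(//) splits text into an Array of characters according to the specified encoding, so the Chinese character "呀" is treated as a single character and chars.length is 2. text.length, however, is still 3. The contrast shows that, with -KU in effect, the way to count the characters in a mixed Chinese/English string is to split it into an array of single characters with split(//) and take the length of that array, rather than calling String.length.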
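
The contrast between byte count and character count can also be reproduced on Ruby 1.9 or later, where every string carries its own encoding. This is a sketch under that assumption; the post itself targets the older interpreter. It also assumes that kcode.rb was saved in a two-byte Chinese encoding such as GBK, which the length of 3 suggests. count_bytes_and_chars is a hypothetical helper.

```ruby
# Sketch only: Ruby 1.9+; count_bytes_and_chars is a hypothetical helper.
def count_bytes_and_chars(text, encoding)
  encoded = text.encode(encoding)
  # bytesize counts raw bytes; split(//) splits on character boundaries
  # according to the string's encoding.
  [encoded.bytesize, encoded.split(//).length]
end

p count_bytes_and_chars("y呀", "GBK")     # [3, 2]
p count_bytes_and_chars("y呀", "UTF-8")   # [4, 2]
```
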
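A typical use of the character count is truncating strings. When extracting a fixed number of characters from a Chinese or mixed Chinese/English string, it is easy to end up with garbled text if -KU is not used. The author ran into this problem while building a Ruby on Rails application. To use -KU in a Rails application, simply start the WEBrick server like this: ruby -KU script\server.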
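
The truncation problem described above can be sketched as follows, assuming Ruby 1.9.3 or later and a UTF-8 source file. first_chars is a hypothetical helper that applies the split(//) technique from the post; the byteslice call shows how cutting by bytes can land in the middle of a character.

```ruby
# Sketch only: Ruby 1.9.3+; first_chars is a hypothetical helper.
def first_chars(text, count)
  # Split into single characters, keep the first `count`, and rejoin.
  text.split(//).first(count).join
end

puts first_chars("y呀abc", 2)          # y呀

# Cutting by bytes instead can stop in the middle of a character,
# leaving a string that is no longer valid in its encoding.
broken = "y呀abc".byteslice(0, 2)
puts broken.valid_encoding?            # false
```
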

Author: Thomas Yung, 200506. Contact the author: earoc@126.com
Please keep this author notice when reposting.

3^# admin posted on 2007-07-09 11:19

How can Ruby be made to support Chinese or UTF-8 natively?

In other words:

4^# axgle posted on 2007-07-09 11:55

@skyover: the split(//) approach you used produces results like this:

5^# admin posted on 2007-07-09 11:58

No wonder. That explains why "中华人民共和国".split(//).length equals 10.

This may be because Japanese has the characters "中" and "文" but not the character "社".

6^# axgle posted on 2007-07-09 12:00

Results with g_b_string:

7^# admin posted on 2007-07-09 12:01

Quote:

Originally posted by axgle on 2007-7-9 12:00
Results with g_b_string:

require 'g_b_string'
str = _c '中文测试'

str.each do |c|
  puts c
end

8^# axgle posted on 2007-07-09 12:29

In Rails I usually use UTF-8 encoding.
As a result, validates_length_of only checks an approximate length, so it is not very strict.
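
Why a length check can be only approximate with UTF-8 text can be illustrated in plain Ruby (1.9 or later assumed). The sketch does not call Rails. byte_length_ok? and char_length_ok? are hypothetical helpers that contrast a limit applied to bytes, which is how String#length counted on Ruby 1.8, with a limit applied to characters. Whether the validation in a given Rails version counts bytes is an assumption here.

```ruby
# Sketch only: plain Ruby 1.9+, not the Rails validation API.
# Both helper names are hypothetical.
def byte_length_ok?(text, max)
  text.bytesize <= max   # limit measured in bytes
end

def char_length_ok?(text, max)
  text.length <= max     # limit measured in characters
end

title = "中文测试"                  # 4 characters, 12 bytes in UTF-8
puts byte_length_ok?(title, 10)    # false
puts char_length_ok?(title, 10)    # true
```
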

9^# axgle posted on 2007-07-09 12:32

For needs like the one below, extracting a substring from Chinese text, the g_b_string library discussed here can be used.

10^# drive2me posted on 2007-07-19 20:57

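Japanese does have the character "社", and uses it more than Chinese does. The Japanese word for a company is 株式会社, which shows how often it appears. It is most common on business cards.
That cannot be the cause of your problem.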