botvector.net domain for sale $2k usd (negotiable). Em@il me t0: alex DOT creopolis at gmail DOT com

My Git activity

19 November 2007

Encoding problems

First dump initial, non-utf database:


mysqldump -uUSER -pPASSWORD --default-character-set=latin1 --skip-set-charset dbname > dumpfile.sql

then (linux) change all latin1 in dumpfile.sql to utf8

sed 's/latin1/utf8/g' dumpfile.sql > dumpfile_collate_utf8.sql

Optionally gzip and send to localpc
gzip -d dumpfile_collate_utf8.sql.gz


Recreate same database with utf8 collation and import dumpfile_collate_utf8:

mysql -uUSER -pPASSWORD --execute="DROP DATABASE dbname;
CREATE DATABASE dbname CHARACTER SET utf8 COLLATE utf8_general_ci;"
mysql -uUSER dbname < dumpfile_collate_utf8.sql


Now the next part, get on Rails


I've set RAILS_GEM_VERSION = '1.1.6' to be sure we are in non unicode.
1. Create controller with:

require 'iconv'
require 'singleton'
class ConverterController < ApplicationController
end

Inside create two actions:

def convertme
rows = Model.find(:all)
for row in rows
row.title = iconvert(row.title, "latin1", "utf8")
# replace latin1 with appropriate encoding
row.save!
end
end

def iconvert(str, encoding_from, encoding_to = "utf8")
i = Iconv.new encoding_to, encoding_from
utf_str = ""
begin
utf_str << i.iconv(str)
rescue Exception => e
utf_str << e.success
ch, str = e.failed.split(//, 2)
utf_str << "?"
logger.info "FAIL !! char:#{ch}"
retry
end
return utf_str
end


Point browser/console to /converter !

Thats IT !
What doing iconvert: its taking string, trying to convert it to utf-8, if conversion fails (on some japanese/korean) it put "?" instead. Thats because some non-utf encoding cannot be converted to utf-8(needs to be utf-16 or utf-32).

Also in case of large database conversion, i suggest to add column to database table, like "utf_converted" and do conditionally find(:all) and then row.utf_converted = true before save, just to be safe if something fail, to do not convert twice.