botvector.net domain for sale $2k usd (negotiable). Em@il me t0: alex DOT creopolis at gmail DOT com

My Git activity

19 November 2007

Encoding problems

First dump initial, non-utf database:


mysqldump -uUSER -pPASSWORD --default-character-set=latin1 --skip-set-charset dbname > dumpfile.sql

then (linux) change all latin1 in dumpfile.sql to utf8

sed 's/latin1/utf8/g' dumpfile.sql > dumpfile_collate_utf8.sql

Optionally gzip and send to localpc
gzip -d dumpfile_collate_utf8.sql.gz


Recreate same database with utf8 collation and import dumpfile_collate_utf8:

mysql -uUSER -pPASSWORD --execute="DROP DATABASE dbname;
CREATE DATABASE dbname CHARACTER SET utf8 COLLATE utf8_general_ci;"
mysql -uUSER dbname < dumpfile_collate_utf8.sql


Now the next part, get on Rails


I've set RAILS_GEM_VERSION = '1.1.6' to be sure we are in non unicode.
1. Create controller with:

require 'iconv'
require 'singleton'
class ConverterController < ApplicationController
end

Inside create two actions:

def convertme
rows = Model.find(:all)
for row in rows
row.title = iconvert(row.title, "latin1", "utf8")
# replace latin1 with appropriate encoding
row.save!
end
end

def iconvert(str, encoding_from, encoding_to = "utf8")
i = Iconv.new encoding_to, encoding_from
utf_str = ""
begin
utf_str << i.iconv(str)
rescue Exception => e
utf_str << e.success
ch, str = e.failed.split(//, 2)
utf_str << "?"
logger.info "FAIL !! char:#{ch}"
retry
end
return utf_str
end


Point browser/console to /converter !

Thats IT !
What doing iconvert: its taking string, trying to convert it to utf-8, if conversion fails (on some japanese/korean) it put "?" instead. Thats because some non-utf encoding cannot be converted to utf-8(needs to be utf-16 or utf-32).

Also in case of large database conversion, i suggest to add column to database table, like "utf_converted" and do conditionally find(:all) and then row.utf_converted = true before save, just to be safe if something fail, to do not convert twice.

5 comments:

TPol said...

I attempted your approach and found that data in fields defined as Text were truncated. Do you have any suggestions?

Alexey said...

What exactly you mean ?
Exception ?

TPol said...

Never mind. I was expanding what you wrote with a SET NAMES command and that seems to be what created the truncation issue.

Have you come across the ???? question marks caused by inserting data using a UTF8 client connection to a latin1 table. I haven't found a way to fix data stored in this fashion, do you know if it can be fixed?

Alexey said...

I had "???" when i converted (from latin1 to UTF8) of japanese and other languages that should be encoded to UTF16/32 becausethey didnt have "symbols" in UTF8 table.

In case of UTF16/32 you need to do everything from beginning and replace UTF8 to UTF16 (or UTF32) respectively.

Unknown said...

i had a similar problem with storing utf strings that were encrypted into a mysql database from ror, they kept being truncated. setting the mysql field datatype to text/binary allowed me to store and retrieve these complex strings automatically.