Character Set Encoding/Decoding
Hey guys... long time no see :) how is everyone? I see a lot of the same faces around... and a few new ones
Well, here's my question. I wrote a spider in Perl that grabs 230K pages of content and puts them into a database for later use on our own site... and like a... you know... I forgot to make sure the database was set to the same charset. The original content was in UTF-8 but the database was in latin-1.
After realizing my mistake I told the database it should be in UTF-8, but that didn't fix it (it's MySQL, btw). Actually, some Perl programmers think that might have even compounded the problem.
Basically, how do I get back to my original data? I have PHP, Perl, C#, C++, Python and a few others under my belt, so tools aren't much of a problem, but the ones I've tried are getting me nowhere... and I think I have a faulty view of encoding and decoding.
I looked at the hex and know it definitely converted the UTF-8 to latin-1 on the way in, and I think I only need to go one step backwards. Any thoughts?
ex. i had the char Ω on the site and it comes out Ω in the db.
hex: ce a9 -> e2 84 7c
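For reference, here is a minimal Python sketch of the "one step backwards" idea. It assumes the common failure mode where the UTF-8 bytes were read as latin-1 and then stored again as UTF-8; that assumption is not confirmed for this database. The function name `undo_double_encoding` and the `wrong_codec` parameter are made up for illustration. MySQL's latin1 is really closer to cp1252, so `"cp1252"` may be worth trying as well.

```python
def undo_double_encoding(text: str, wrong_codec: str = "latin-1") -> str:
    """Reverse one round of 'UTF-8 bytes misread as wrong_codec' (hypothetical helper)."""
    try:
        # Turn the misread characters back into the raw bytes,
        # then decode those bytes as the UTF-8 they originally were.
        return text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not fit the assumed pattern; leave it untouched.
        return text


original = "\u03a9"  # GREEK CAPITAL LETTER OMEGA, UTF-8 bytes ce a9

# Simulate the assumed mistake: UTF-8 bytes interpreted as latin-1.
stored = original.encode("utf-8").decode("latin-1")
stored_hex = stored.encode("utf-8").hex()  # bytes as they would sit in a UTF-8 column

recovered = undo_double_encoding(stored)
print(stored_hex, recovered == original)
```

Under this assumption the damaged bytes for that character would be c3 8e c2 a9, which is not what the hex above shows (e2 84 7c), so the data here may have gone through a different or additional conversion. It is probably safest to run something like this on a copy of a few rows and compare the hex before touching all 230K pages.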