logilab-mtconverter #183450 guess_encoding sometimes returns gb2312 instead of utf8 [in-progress]
priority | normal |
---|---|
type | bug |
done in | <not specified> |
closed by | <not specified> |
patch | Use icu instead of chardet for text encoding detection [reviewed] |
Comments
-
2014/04/02 23:19, written by fcayre
I made some tests with python-magic, a ctypes binding of the standard libmagic (used at least by the Linux file command).
I would suggest using it instead of icu, because it is also good at detecting MIME types (much better than Python's current mimetypes module, which only looks at the file's name). We would thus improve both encoding and MIME type detection with a single dependency.
What do you think?
-
2014/04/03 07:06, written by jcristau
I have a fairly strong distrust of ctypes-based bindings, so -1 here.
-
2014/04/03 07:45, written by fcayre
I knew, and I agree. I will try to work on a cython binding for libmagic soon.
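
Note: the name-only behaviour of the mimetypes module mentioned in the comments above can be checked with a minimal sketch. It uses only the standard library; the helper name guess_from_name and the file names are hypothetical, chosen for illustration.

```python
import mimetypes


def guess_from_name(filename):
    """Return the MIME type that mimetypes derives from the file name alone."""
    # guess_type() maps the extension to a MIME type; it never opens the file,
    # so the actual content plays no role in the answer.
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


# Hypothetical file names; no such files need to exist on disk.
print(guess_from_name("report.txt"))  # decided purely by the ".txt" extension
print(guess_from_name("report"))      # no extension, so there is nothing to go on
```

A content-based detector such as libmagic would instead inspect the bytes of the file, which is the improvement the suggestion is after.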