xgettext: rerun with UTF-8 encoding and/or properly process failures #14

@armijnhemel

Something to consider is to rerun xgettext with different parameters in case it fails. The xgettext manual says:

By default the input files are assumed to be in ASCII.

Sometimes this leads to incorrect results (or no results at all), and xgettext might need to be rerun with a different option. One example where this fails is util-linux/fdisk.c from a recent BusyBox:

$ xgettext --omit-header --extract-all --no-wrap fdisk.c
xgettext: Non-ASCII string at fdisk.c:333.
          Please specify the source encoding through --from-code.

The culprit here is actually this sequence:

    "\x80" "Old Minix",        /* Minix 1.4a and earlier */

where xgettext treats the 0x80 byte as a possible UTF-8 character (but, of course, on its own it is not a valid sequence). No output file is generated in this case.
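To illustrate why that byte is a problem, here is a minimal Python sketch. It assumes the C literal evaluates to the byte 0x80 followed by "Old Minix"; the names `SAMPLE` and `decodes_as` are made up for this example and have nothing to do with xgettext itself.

```python
# Bytes that the C literal "\x80" "Old Minix" is assumed to evaluate to.
SAMPLE = b"\x80Old Minix"


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data can be decoded with the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


for encoding in ("ascii", "utf-8", "latin-1"):
    print(encoding, decodes_as(SAMPLE, encoding))
# ascii False
# utf-8 False
# latin-1 True
```

0x80 is outside ASCII, and in UTF-8 it is a continuation byte that cannot start a character, so both decodings fail. A single-byte encoding such as Latin-1 accepts it, but that says nothing about what the original author intended.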

https://git.busybox.net/busybox/tree/util-linux/fdisk.c?h=1_35_stable

Another example is the attached file (lineedit.c from BusyBox, zipped) where I have replaced a string on line 893.

$ xgettext --omit-header --extract-all --no-wrap lineedit.c
xgettext: Non-ASCII string at lineedit.c:893.
          Please specify the source encoding through --from-code.

and no output file will be created.

With the --from-code parameter the string is still not extracted correctly, but an output file is created:

$ xgettext --omit-header --extract-all --no-wrap --from-code=UTF-8 lineedit.c
lineedit.c:442: warning: internationalized messages should not contain the '\r' escape sequence
lineedit.c:893: warning: The following msgid contains non-ASCII characters.
                         This will cause problems to translators who use a character encoding
                         different from yours. Consider using a pure ASCII msgid instead.
                         ë
lineedit.c:893: invalid multibyte sequence
lineedit.c:893: invalid multibyte sequence
lineedit.c:893: invalid multibyte sequence
lineedit.c:893: invalid multibyte sequence

It is not ideal, but better than getting no data at all. This could use some refinement.

Please note that --from-code does not apply to all languages, according to the xgettext manual:

       --from-code=NAME
              encoding of input files (except for Python, Tcl, Glade)
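A rough sketch of what such a rerun could look like in Python. The helper names (`build_commands`, `extract_strings`) and the injectable `run` parameter are made up for this example, and it assumes that "no output file" is a good enough signal for "try again". It does not handle the Python/Tcl/Glade exception mentioned above, and it would also retry for files that simply contain no strings.

```python
import subprocess
from pathlib import Path

# Flags taken from the invocations shown in this issue.
BASE_ARGS = ["xgettext", "--omit-header", "--extract-all", "--no-wrap"]


def build_commands(source, output):
    """Return the xgettext invocations to try, in order."""
    base = BASE_ARGS + ["-o", str(output)]
    return [
        base + [str(source)],                       # default: input assumed ASCII
        base + ["--from-code=UTF-8", str(source)],  # fallback: declare UTF-8
    ]


def extract_strings(source, output, run=subprocess.run):
    """Run xgettext, retrying with --from-code=UTF-8 if no file is produced.

    Returns (command, stderr) for the first attempt that created the
    output file, or None if every attempt failed.
    """
    output = Path(output)
    for cmd in build_commands(source, output):
        output.unlink(missing_ok=True)  # do not mistake a stale file for success
        # errors="replace" keeps undecodable bytes in xgettext's messages
        # from raising while the captured stderr is decoded.
        result = run(cmd, capture_output=True, errors="replace")
        if output.exists():
            return cmd, result.stderr
    return None
```

Usage would be something like `extract_strings("lineedit.c", "lineedit.po")`. Keeping the returned stderr around seems useful, since the "invalid multibyte sequence" warnings indicate that the extracted data is only partially trustworthy.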
