RFC: changing RPM's default scriptlet locale

jan matejek-4
Fellow mortals,

more and more packages need their locale to be set to something more sensible than C.
This hit me while switching packages over to Python 3. Python gets its encoding from locale, so by
default, it won't decode UTF-8 unless the appropriate encoding is set. Right now, when gtk-doc is
switched to use Python 3, it won't build UTF-8 documentation.
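
A minimal sketch of that failure, assuming a build environment where Python falls back to ASCII as it would in a plain C locale. The helper `try_decode` is a made-up name for illustration, and newer Python releases may choose UTF-8 on their own:

```python
import locale

# Encoding that open() would use when no encoding is passed explicitly.
# The value depends entirely on the environment the script runs in.
default_encoding = locale.getpreferredencoding(False)


def try_decode(raw, encoding):
    """Return the decoded text, or None if the bytes do not fit the encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


utf8_bytes = "föhn".encode("utf-8")             # b'f\xc3\xb6hn'

ascii_result = try_decode(utf8_bytes, "ascii")  # 0xc3 is not ASCII
utf8_result = try_decode(utf8_bytes, "utf-8")

print(default_encoding, ascii_result, ascii(utf8_result))
```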

This could be changed in gtk-doc itself (although that is impractical), but perhaps a better way is
to change the default locale for spec scriptlets. Right now the macros set it to C. We could switch
that to "C.UTF-8", "en_US.UTF-8", or export a special RPM variable that could be overridden in your
spec file. Still, the default should be something with UTF-8 in it.

I'm now trying to build a Ring 1 staging project with this change. So far I have seen one failure
related to it: with en_US.UTF-8, bash ranges (like [a-z]) are case-insensitive and match more than
intended. That could be solved by changing the expression, or by setting locale to C.UTF-8.
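
A toy model of why the range widens, not a reproduction of what bash or glibc actually do. It assumes a dictionary-style collation order (aAbB...zZ) as a stand-in for en_US.UTF-8; the names `DICTIONARY_ORDER` and `range_members` are invented for this sketch:

```python
import string

# Plain code point order, as in the C locale: A-Z sort before a-z.
CODEPOINT_ORDER = string.ascii_uppercase + string.ascii_lowercase

# Assumed dictionary-style order (aAbB...zZ); a simplification of how a
# locale such as en_US.UTF-8 may collate letters.
DICTIONARY_ORDER = "".join(
    lower + upper
    for lower, upper in zip(string.ascii_lowercase, string.ascii_uppercase)
)


def range_members(first, last, order):
    """Characters matched by the range [first-last] under a collation order."""
    return set(order[order.index(first):order.index(last) + 1])


c_match = range_members("a", "z", CODEPOINT_ORDER)
dict_match = range_members("a", "z", DICTIONARY_ORDER)

print(sorted(dict_match - c_match))  # the extra characters the range picks up
```

Spelling the class out, for example with [[:lower:]], avoids depending on the collation order at all.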

Thoughts, comments?

regards
m.


Re: RFC: changing RPM's default scriptlet locale

Dominique Leuenberger / DimStar
On Fri, 2017-10-27 at 16:13 +0200, jan matejek wrote:
> This could be changed in gtk-doc itself (although that is
> impractical), but perhaps a better way is
> to change the default locale for spec scriptlets. Right now the
> macros set it to C. We could switch
> that to "C.UTF-8", "en_US.UTF-8", or export a special RPM variable
> that could be overridden in your
> spec file. Still, the default should be something with UTF-8 in it.

I support this overall approach of getting it fixed 'high up'.

One thing that strikes me as generally 'odd' (looking at gtk-doc) is that
the python2 variant seemed not to have trouble parsing the same files -
but switching to py3 causes trouble.

Is python3 itself expecting a .UTF8 locale to be reliably usable?

Cheers
Dominique

Re: RFC: changing RPM's default scriptlet locale

Neal Gompa
On Fri, Oct 27, 2017 at 11:03 AM, Dominique Leuenberger / DimStar
<[hidden email]> wrote:

> On Fri, 2017-10-27 at 16:13 +0200, jan matejek wrote:
>> This could be changed in gtk-doc itself (although that is
>> impractical), but perhaps a better way is
>> to change the default locale for spec scriptlets. Right now the
>> macros set it to C. We could switch
>> that to "C.UTF-8", "en_US.UTF-8", or export a special RPM variable
>> that could be overridden in your
>> spec file. Still, the default should be something with UTF-8 in it.
>
> I support this overall approach of getting it fixed 'high up'.
>
> One thing that strikes generally as 'odd' (looking at gtk-doc) is that
> the python2 variant seemed not to have trouble parsing the same files -
> but switching to py3 causes trouble.
>
> Is python3 itself expecting a .UTF8 locale to be reliably usable?
>

Yes, it is.

Actually, the main reason for RPM itself not automatically exporting
C.UTF-8 by default is that the actual support for this hasn't been
merged into glibc. It's a patch that's carried only by Fedora,
openSUSE, and Debian. Upstream, we've been having this discussion in
one of the pull requests for RPM:
https://github.com/rpm-software-management/rpm/pull/227

Personally speaking, I really wish someone would champion this to get
into glibc properly so that we can reliably depend on it.
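
Until then, a script can only probe for the locale at run time. A rough sketch, where `locale_available` and `pick_utf8_locale` are hypothetical helper names and the candidate list is just the two locales named in this thread:

```python
import locale


def locale_available(name):
    """Return True if setlocale() accepts `name` on this system."""
    saved = locale.setlocale(locale.LC_CTYPE)     # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return True
    except locale.Error:
        return False
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # restore the previous state


def pick_utf8_locale(candidates=("C.UTF-8", "en_US.UTF-8")):
    """Return the first usable candidate, or None if none is installed."""
    for name in candidates:
        if locale_available(name):
            return name
    return None


print(pick_utf8_locale())
```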

--
真実はいつも一つ!/ Always, there's only one truth!
--
To unsubscribe, e-mail: [hidden email]
To contact the owner, e-mail: [hidden email]

Re: RFC: changing RPM's default scriptlet locale

Dominique Leuenberger / DimStar
In reply to this post by jan matejek-4
On Fri, 2017-10-27 at 16:13 +0200, jan matejek wrote:
>
> > That could be solved by changing the expression, or by setting
> locale to C.UTF-8.

For the record, just to have it mentioned: %meson exports LANG=C.UTF-8

As meson is also a python3 app, it 'suffered' from the same issue that
it needs a UTF-8 locale configured, but it was 'simple enough' to make
this in the globally used macros. If we get RPM's own macros to use
C.UTF-8, we would of course strip this from the meson-macros again.

Cheers
Dominique

Re: RFC: changing RPM's default scriptlet locale

Jan Engelhardt-4
In reply to this post by jan matejek-4

On Friday 2017-10-27 16:13, jan matejek wrote:
>
>more and more packages need their locale to be set to something more
>sensible than C. This hit me while switching packages over to Python
>3. Python gets its encoding from locale, so by default, it won't
>decode UTF-8 unless the appropriate encoding is set.

I am not convinced that changing the rpm default helps. FWIW, a large
portion of source files could be ISO-8859-1 (Windows still is a
thing, and a popular one at that) so that LC=UTF-8 on a global scale
would not help.

The only real fix is to use an in-file marker such that the file
becomes self-describing, and there are sufficient examples in history
of how to pull that off:

 - Byte Order Mark to determine UTF-8, UTF-{16,32}{BE,LE}
 - <?xml version="1.0" encoding="..."?> in XML
 - <meta> in HTML likewise
 - "use utf8" in Perl (or something similar to it)
 - "# -*- coding: utf-8 -*-" in Python (PEP-0263)
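
For the Python case, the standard library can read such a marker itself. A small sketch using `tokenize.detect_encoding`; the wrapper name `declared_source_encoding` and the sample byte strings are made up for illustration:

```python
import io
import tokenize


def declared_source_encoding(source_bytes):
    """Return the encoding a Python source file declares via BOM or PEP 263."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


with_cookie = b"# -*- coding: iso-8859-1 -*-\nprint('f\xf6hn')\n"
with_bom = b"\xef\xbb\xbfprint('hi')\n"
unmarked = b"print('hi')\n"

for sample in (with_cookie, with_bom, unmarked):
    print(declared_source_encoding(sample))
```

An unmarked file is reported as UTF-8 here because that is the Python 3 default for source code.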

So there you have it. If Python falls over on UTF-8 files (I know
Perl would), then those source files should say they are UTF-8. And
those that are ISO-8859-1 should say they are iso-8859-1.

Keeping the locale at C would at least identify the important
spots as python would stop execution.

Re: RFC: changing RPM's default scriptlet locale

Jan Engelhardt-4
In reply to this post by Dominique Leuenberger / DimStar

On Friday 2017-10-27 17:03, Dominique Leuenberger / DimStar wrote:

>On Fri, 2017-10-27 at 16:13 +0200, jan matejek wrote:
>> This could be changed in gtk-doc itself (although that is
>> impractical), but perhaps a better way is
>> to change the default locale for spec scriptlets. Right now the
>> macros set it to C. We could switch
>> that to "C.UTF-8", "en_US.UTF-8", or export a special RPM variable
>> that could be overridden in your
>> spec file. Still, the default should be something with UTF-8 in it.
>
>One thing that strikes generally as 'odd' (looking at gtk-doc) is that
>the python2 variant seemed not to have trouble parsing the same files -

I see no deviation in python2 behavior.
Without PEP-0263 markers to give a 1st class decision,
the parser only has 2nd-class choices to make —
and Python decides not to take 2nd choices, unlike Perl.

# leap 42.2
17:42 zap:/dev/shm > python2 x.py
  File "x.py", line 2
SyntaxError: Non-ASCII character '\xc3' in file x.py on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
17:42 zap:/dev/shm > cat x.py
#!/usr/bin/python
print "föhn\n"



Re: RFC: changing RPM's default scriptlet locale

jan matejek-4
In reply to this post by Jan Engelhardt-4
On 27.10.2017 17:41, Jan Engelhardt wrote:
> I am not convinced that changing the rpm default helps. FWIW, a large
> portion of source files could be ISO-8859-1 (Windows still is a
> thing, and a popular one at that) so that LC=UTF-8 on a global scale
> would not help.

Source files are not the issue here though. (and FWIW Python 3 compatible source files tend to be
modern enough to either a) be UTF8 or b) have the PEP263 encoding header)

The issue is that Python 3 needs to know the encoding of *all* external inputs (e.g., documentation
files to be parsed by a doc generator) because it stores strings as Unicode internally. So you'd
need a BOM on *every* file potentially touched by Python, ever, and also somehow mark the
encoding of stdin, stdout and such. (so maybe an environment variable? ;) )
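
To make the "every stream needs an encoding" point concrete, here is a rough sketch that wraps raw byte streams explicitly instead of relying on the locale. The in-memory streams stand in for stdin and stdout, and `transcode` is a hypothetical helper name. For the real standard streams, Python also honours the PYTHONIOENCODING environment variable:

```python
import io


def transcode(src, dst, src_encoding, dst_encoding):
    """Copy text between two binary streams with explicit encodings."""
    reader = io.TextIOWrapper(src, encoding=src_encoding, newline="")
    writer = io.TextIOWrapper(dst, encoding=dst_encoding, newline="")
    writer.write(reader.read())
    writer.flush()
    # Detach so the underlying byte streams stay open for the caller.
    writer.detach()
    reader.detach()


src = io.BytesIO("föhn\n".encode("utf-8"))
dst = io.BytesIO()
transcode(src, dst, "utf-8", "iso-8859-1")

print(dst.getvalue())
```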

> So there you have it. If Python falls over on UTF-8 files (I know
> Perl would), then those source files should say they are UTF-8. And
> those that are ISO-8859-1 should say they are iso-8859-1.

Alternatively, we could say that UTF-8 is the distro default, and only non-default files must be marked.
Which is what I'm proposing.


Re: RFC: changing RPM's default scriptlet locale

Dominique Leuenberger / DimStar
In reply to this post by Jan Engelhardt-4
On Fri, 2017-10-27 at 17:43 +0200, Jan Engelhardt wrote:

>
> # leap 42.2
> 17:42 zap:/dev/shm > python2 x.py
>   File "x.py", line 2
> SyntaxError: Non-ASCII character '\xc3' in file x.py on line 2, but
> no encoding declared; see http://python.org/dev/peps/pep-0263/ for
> details
> 17:42 zap:/dev/shm > cat x.py
> #!/usr/bin/python
> print "föhn\n"

Not exactly the issue we are seeing...

It's not about the python file itself, but the files a python script
might be opening and inspecting.

Cheers
Dominique

Re: RFC: changing RPM's default scriptlet locale

Brüns, Stefan
In reply to this post by Dominique Leuenberger / DimStar
On Freitag, 27. Oktober 2017 17:03:58 CEST Dominique Leuenberger / DimStar
wrote:

> On Fri, 2017-10-27 at 16:13 +0200, jan matejek wrote:
> > This could be changed in gtk-doc itself (although that is
> > impractical), but perhaps a better way is
> > to change the default locale for spec scriptlets. Right now the
> > macros set it to C. We could switch
> > that to "C.UTF-8", "en_US.UTF-8", or export a special RPM variable
> > that could be overridden in your
> > spec file. Still, the default should be something with UTF-8 in it.
>
> I support this overall approach of getting it fixed 'high up'.
>
> One thing that strikes generally as 'odd' (looking at gtk-doc) is that
> the python2 variant seemed not to have trouble parsing the same files -
> but switching to py3 causes trouble.
>
> Is python3 itself expecting a .UTF8 locale to be reliably usable?

Reading external string data without an explicit encoding is just plain wrong,
both in python2 and in python3, and in any other language.

If the program knows some input is UTF-8 (by specification), it should set the
encoding to 'utf-8', otherwise it should determine the encoding (e.g. XML
allows the specification of the encoding).
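
The XML case can be sketched with the standard library parser, which reads the declared encoding from the document itself. The two sample documents and the helper name `doc_text` are invented for this illustration:

```python
import xml.etree.ElementTree as ET

# Same text, two different byte encodings, each declared in the prolog.
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><doc>f\xf6hn</doc>'
utf8_doc = '<?xml version="1.0" encoding="UTF-8"?><doc>föhn</doc>'.encode("utf-8")


def doc_text(xml_bytes):
    """Parse raw bytes; the parser picks the encoding from the declaration."""
    return ET.fromstring(xml_bytes).text


print(ascii(doc_text(latin1_doc)), ascii(doc_text(utf8_doc)))
```

Both calls yield the same string, independent of the locale the script runs under.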

When communicating with external programs (e.g. using pipes), the calling
program should tell the called program which encoding to use (a no-op if the
other program uses a fixed encoding), either via a command-line switch or by
explicitly setting the locale.
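
A sketch of that pattern for a Python child process. It assumes the child is itself a Python interpreter, so the encoding can be passed through the PYTHONIOENCODING environment variable; other programs would need their own switch or a locale setting. `run_child_utf8` is a hypothetical helper name:

```python
import os
import subprocess
import sys


def run_child_utf8(code):
    """Run a Python snippet in a child process and read its output as UTF-8."""
    # Tell the child which encoding to use for its standard streams.
    env = dict(os.environ, PYTHONIOENCODING="utf-8")
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    # Decode with the same encoding the child was told to use.
    return result.stdout.decode("utf-8")


output = run_child_utf8("print('f\\u00f6hn')")
print(ascii(output.strip()))
```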

Kind regards,

Stefan

Re: RFC: changing RPM's default scriptlet locale

Brüns, Stefan
In reply to this post by jan matejek-4
On Freitag, 27. Oktober 2017 16:13:25 CEST jan matejek wrote:

> Fellow mortals,
>
> more and more packages need their locale to be set to something more
> sensible than C. This hit me while switching packages over to Python 3.
> Python gets its encoding from locale, so by default, it won't decode UTF-8
> unless the appropriate encoding is set. Right now, when gtk-doc is switched
> to use Python 3, it won't build UTF-8 documentation.
>
> This could be changed in gtk-doc itself (although that is impractical), but
> perhaps a better way is to change the default locale for spec scriptlets.

I think the gtk-doc problem is solved properly:
https://github.com/GNOME/gtk-doc/commit/1eeec38a9a06a9956cdab9789cbd2ea1

Kind regards,

Stefan

Re: RFC: changing RPM's default scriptlet locale

Brüns, Stefan
In reply to this post by Jan Engelhardt-4
On Freitag, 27. Oktober 2017 17:41:23 CEST Jan Engelhardt wrote:

> On Friday 2017-10-27 16:13, jan matejek wrote:
> >more and more packages need their locale to be set to something more
> >sensible than C. This hit me while switching packages over to Python
> >3. Python gets its encoding from locale, so by default, it won't
> >decode UTF-8 unless the appropriate encoding is set.
>
> I am not convinced that changing the rpm default helps. FWIW, a large
> portion of source files could be ISO-8859-1 (Windows still is a
> thing, and a popular one at that) so that LC=UTF-8 on a global scale
> would not help.
>
> The only real fix is to use an in-file marker such that the file
> becomes self-describing, and there are sufficient examples in history
> how to pull that off:
>
>  - Byte Order Mark to determine UTF-8, UTF-{16,32}{BE,LE}

The byte order mark is ambiguous: in UTF-8 it is three bytes, each of which is
a valid character in e.g. ISO-8859-1. Granted, it is unlikely, but ...

>  - <?xml encoding="..." ?> in XML
>  - <meta> in HTML likewise
>  - "use utf8" in Perl (or something similar to it)
>  - "# -*- coding: utf-8 -*-" in Python (PEP-0263)

All these guarantee the content up to and including the encoding specification
is plain ASCII, so this is completely unambiguous and sane.

> So there you have it. If Python falls over on UTF-8 files (I know
> Perl would), then those source files should say they are UTF-8. And
> those that are ISO-8859-1 should say they are iso-8859-1.
>
> Keeping the locale at C would at least identify the important
> spots as python would stop execution.

Seconded, most utf-8 documents are also valid when interpreted as iso-8859-x,
so guessing is a bad idea.
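
A short sketch of why guessing goes wrong: decoding UTF-8 bytes as ISO-8859-1 raises no error, it just produces different text. The helper name `looks_valid` is made up for illustration:

```python
def looks_valid(raw, encoding):
    """Return True if the bytes decode without error in the given encoding."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


utf8_bytes = "föhn".encode("utf-8")         # b'f\xc3\xb6hn'
latin1_bytes = "föhn".encode("iso-8859-1")  # b'f\xf6hn'

# UTF-8 data passes as ISO-8859-1, but the text is silently wrong.
misread = utf8_bytes.decode("iso-8859-1")

print(looks_valid(utf8_bytes, "iso-8859-1"), ascii(misread))
print(looks_valid(latin1_bytes, "utf-8"))
```

The reverse direction does fail loudly in this example, which is why a successful decode as ISO-8859-x says nothing about the real encoding.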

Kind regards,

Stefan


Re: RFC: changing RPM's default scriptlet locale

Jan Engelhardt-4

On Friday 2017-10-27 18:23, Brüns, Stefan wrote:
>>
>> The only real fix is to use an in-file marker such that the file
>> becomes self-describing, and there are sufficient examples in history
>> how to pull that off:
>>
>>  - Byte Order Mark to determine UTF-8, UTF-{16,32}{BE,LE}
>
>Byte order mark is ambiguous - it is three bytes, which are valid codepoints
>in e.g. ISO-8859-1. Granted, it is unlikely, but ...

The byte order mark is not ambiguous for what it was meant to do, since
there is a bijective mapping between the domain of (defined) bit
patterns and the codomain of (defined) encodings. ISO-8859-* is just not
within the set. Understandably so, since ISO-8859-* does not __have__ a
byte __order__ to begin with — it is a single-byte encoding.

Re: RFC: changing RPM's default scriptlet locale

Michal Kubecek
On Tuesday, 31 October 2017 8:11 Jan Engelhardt wrote:

> On Friday 2017-10-27 18:23, Brüns, Stefan wrote:
> >
> >Byte order mark is ambiguous - it is three bytes, which are valid
> >codepoints in e.g. ISO-8859-1. Granted, it is unlikely, but ...
>
> The byte order mark is not ambiguous for what it was meant to do,
> since there is a bijective mapping between the domain of (defined)
> bit patterns and the codomain of (defined) encodings. ISO-8859-* is
> just not within the set. Understandably so, since ISO-8859-* does not
> __have__ a byte __order__ to begin with -- it is a single-byte
> encoding.

Neither does UTF-8, that's why I always considered putting a "BOM" into
a UTF-8 text a sign that someone doesn't know what they are doing and
why.

But I believe Stefan wanted to point out that UTF-8 "BOM" consists of
three bytes which are valid ISO-8859-1 characters so that a document in
ISO-8859-1 could, in theory, start with them (however unlikely that is).
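
That theoretical case is easy to show. In this sketch the sample text is contrived and `starts_with_utf8_bom` is a hypothetical naive detector:

```python
import codecs

bom = codecs.BOM_UTF8                  # b'\xef\xbb\xbf'
as_latin1 = bom.decode("iso-8859-1")   # three ordinary ISO-8859-1 characters


def starts_with_utf8_bom(raw):
    """Naive check: does the data begin with the UTF-8 'BOM' bytes?"""
    return raw.startswith(codecs.BOM_UTF8)


# A legitimate (if unlikely) ISO-8859-1 document starting with those characters.
latin1_text = (as_latin1 + " rare, but legal").encode("iso-8859-1")

print(ascii(as_latin1), starts_with_utf8_bom(latin1_text))
```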

Michal Kubeček

Re: RFC: changing RPM's default scriptlet locale

Jan Engelhardt-4

On Tuesday 2017-10-31 08:42, Michal Kubecek wrote:

>> ISO-8859-* does not __have__ a byte __order__ to begin with -- it
>> is a single-byte encoding.
>
>Neither does UTF-8

As a multibyte encoding, it *does* have an order, even if just a
single defined one. Swapping two octets does not necessarily produce
the same character value (U+xxxx). In ISO-8859, you can do this
switch and you will get the same character values.
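
A small sketch of that difference, using one two-byte UTF-8 character and one two-character ISO-8859-1 string as arbitrary examples. `swap_first_two` and `decode_or_none` are invented helper names:

```python
def swap_first_two(raw):
    """Exchange the first two octets of a byte string."""
    return raw[1:2] + raw[0:1] + raw[2:]


def decode_or_none(raw, encoding):
    """Return the decoded text, or None if the bytes are not valid."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


utf8 = "é".encode("utf-8")           # b'\xc3\xa9', one character in two octets
latin1 = "éa".encode("iso-8859-1")   # b'\xe9a', two characters, one octet each

swapped_utf8 = decode_or_none(swap_first_two(utf8), "utf-8")
swapped_latin1 = decode_or_none(swap_first_two(latin1), "iso-8859-1")

print(swapped_utf8, ascii(swapped_latin1))
```

In the single-byte case the same characters come back in swapped positions; in the UTF-8 case the swapped octets no longer form a valid sequence.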

>But I believe Stefan wanted to point out that UTF-8 "BOM" consists of
>three bytes which are valid ISO-8859-1 characters so that a document in
>ISO-8859-1 could, in theory, start with them (however unlikely that is).

The BOM certainly is no gold standard (it only knows 5 character
"sets" anyway), but it at least indicates to an EBCDIC system that
some UTF is coming, as it would otherwise have no chance to see the
<meta> tag due to the different codepage assumption. Though that's
mostly speculation.

Re: RFC: changing RPM's default scriptlet locale

Michal Kubecek
On Tuesday, 31 October 2017 9:48 Jan Engelhardt wrote:

> On Tuesday 2017-10-31 08:42, Michal Kubecek wrote:
> >> ISO-8859-* does not __have__ a byte __order__ to begin with -- it
> >> is a single-byte encoding.
> >
> >Neither does UTF-8
>
> As a multibyte encoding, it *does* have an order, even if just a
> single defined one. Swapping two octets does not necessarily produce
> the same character value (U+xxxx). In ISO-8859, you can do this
> switch and you will get the same character values.

If you wish to call it byte order, you certainly can. But unlike for
e.g. UTF-16, there is no actual need for a "BOM" in UTF-8 text. It's just
a way some editors have of saying "this is UTF-8" - but it doesn't actually
work in general and confuses various parsers.

The very idea of putting the information about encoding into the text
itself is IMHO completely wrong as it can only work for a very limited
set of encodings and only for a limited number of applications processing
the text documents. Such information must be specified outside the
document itself, explicitly or implicitly.

Michal Kubecek


Re: RFC: changing RPM's default scriptlet locale

Michael Matz
In reply to this post by jan matejek-4
Hi,

On Fri, 27 Oct 2017, jan matejek wrote:

> more and more packages need their locale to be set to something more
> sensible than C. This hit me while switching packages over to Python 3.
> Python gets its encoding from locale, so by default, it won't decode
> UTF-8 unless the appropriate encoding is set. Right now, when gtk-doc is
> switched to use Python 3, it won't build UTF-8 documentation.
>
> This could be changed in gtk-doc itself (although that is impractical),
> but perhaps a better way is to change the default locale for spec
> scriptlets. Right now the macros set it to C. We could switch that to
> "C.UTF-8", "en_US.UTF-8", or export a special RPM variable that could be
> overridden in your spec file. Still, the default should be something with
> UTF-8 in it.
>
> I'm now trying to build a Ring 1 staging project with this change. So
> far I have seen one failure related to it: with en_US.UTF-8, bash ranges
> (like [a-z]) are case-insensitive and match more than intended. That
> could be solved by changing the expression, or by setting locale to
> C.UTF-8.
>
> Thoughts, comments?

As others have said already: this merely indicates a deeper problem in
gtk-doc, for which a global change for all scriptlets is a crude
work-around at best.  As you found it has real ramifications.

To see why it is a deeper problem in gtk-doc it's enough to think about
the situation that some input files are in UTF-8 and some are in, let's
say, SHIFT_JISX0213.  No setting of $LANG will make it work, so setting
$LANG is not the solution.  Whatever the solution is, it must be in
gtk-doc itself; it must somehow determine (even if by external knowledge)
which files are in which encoding.
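
A rough sketch of that situation, not of how gtk-doc works. The file names, their contents, and the `ENCODINGS` table are all invented here; the table stands in for whatever external knowledge the tool would have:

```python
# In-memory stand-ins for input files in two different encodings.
FILES = {
    "intro.txt": "föhn".encode("utf-8"),
    "notes.txt": "日本語".encode("shift_jisx0213"),
}

# External knowledge: which file uses which encoding.
ENCODINGS = {"intro.txt": "utf-8", "notes.txt": "shift_jisx0213"}


def load_all(files, encodings):
    """Decode every file with its own, explicitly recorded encoding."""
    return {name: raw.decode(encodings[name]) for name, raw in files.items()}


def load_with_single_encoding(files, encoding):
    """What a single global setting amounts to; None if any file fails."""
    try:
        return {name: raw.decode(encoding) for name, raw in files.items()}
    except UnicodeDecodeError:
        return None


texts = load_all(FILES, ENCODINGS)
global_attempt = load_with_single_encoding(FILES, "utf-8")

print(sorted(texts), global_attempt)
```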

Setting LANG to en_US.UTF-8 is a horrible idea for scripts (as you found
out, collation order and character classes get in the way, as the locale is
not ASCII compatible).  Setting it to C.UTF-8 is not supported by upstream
glibc (and yes, it'd be nice if it were).

You need to find some other solution unfortunately.


Ciao,
Michael.

Re: RFC: changing RPM's default scriptlet locale

Michael Matz
In reply to this post by Michal Kubecek
Hi,

On Tue, 31 Oct 2017, Michal Kubecek wrote:

> If you wish to call it byte order, you certainly can. But unlike for
> e.g. UTF-16, there is no actual need for "BOM" in UTF-8 text. It's just
> a way some editors do to say "this is UTF-8" - but it doesn't actually
> work in general and confuses various parsers.

Indeed.  And Unicode actually discourages using a BOM with UTF-8 encoded
text streams (and forbids it for UTF-16LE and UTF-16BE tagged ones).

Any solution involving a BOM is a non-solution (if perhaps mildly
intellectually entertaining) as it's useless for UTF-8 text streams which
predominantly exist on Linux and doesn't help with all the other encodings
that are still in use, except the two encodings for which it was designed
but which aren't usually used on our systems: UTF-16 and UTF-32.
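
How Python's codecs treat the mark illustrates the same distinction. A brief sketch, with variable names chosen only for this example:

```python
import codecs

generic = "a".encode("utf-16")       # BOM followed by native byte order
tagged_le = "a".encode("utf-16-le")  # byte order is in the label, so no BOM

with_mark = codecs.BOM_UTF8 + b"a"
kept = with_mark.decode("utf-8")          # mark survives as U+FEFF
stripped = with_mark.decode("utf-8-sig")  # codec that drops a leading mark

print(generic[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
print(tagged_le, ascii(kept), stripped)
```

A parser that decodes as plain UTF-8 sees the extra U+FEFF character at the start of the text.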


Ciao,
Michael.