Need new source for unix utils -- gnu has broken another.


L A Walsh
I've used grep to search for strings across all my mailboxes for
decades.  Found out today, it randomly doesn't work based on whether
or not the file contains any text that doesn't comply with
POSIX.

So if one user has UTF-8 encoding and another, ISO-8859-1, and they
are in the same mailbox, according to POSIX, that's a binary file.
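
A minimal sketch of that situation, assuming a made-up two-message file and that iconv is installed. The file name and contents are hypothetical; the point is only that a file mixing the two encodings is not valid in any single multi-byte encoding:

```shell
# Build a tiny stand-in for a mixed-encoding mbox (hypothetical data).
mbox=$(mktemp)
printf 'From: a\n\ncaf\303\251\n' >  "$mbox"   # "café" encoded as UTF-8
printf 'From: b\n\ncaf\351\n'     >> "$mbox"   # "café" encoded as ISO-8859-1

# iconv exits non-zero when the input is not valid in the source encoding.
if iconv -f UTF-8 -t UTF-8 "$mbox" >/dev/null 2>&1; then
    valid_utf8=yes
else
    valid_utf8=no
fi
echo "valid UTF-8: $valid_utf8"
rm -f "$mbox"
```

The lone 0xE9 byte from the ISO-8859-1 message is what makes the whole file fail the UTF-8 check.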

You have to tell grep to search (and potentially display) binary
data -- which can easily throw a terminal into weird modes, making
it unreadable (see the attached example for the results of a random
binary being listed).

Note that the last line is the prompt, the same text you can see at the top of the window.

mbox files don't do this when you search for strings because when
I search for strings I'm looking for something in the text of an
email.

While I want grep to skip things like compressed files and coredumps,
I don't want it judging the quality of "text" that I'm searching
through -- but that's what many of the utilities have been modified
to do -- if it doesn't fit the POSIX definition of text, then
some text utils won't process it.  Technically, if the last
line of the file doesn't end with a newline, it's also binary
(though grep still displays it).
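
One possible way to get close to that behaviour, sketched under the assumption of GNU grep (the -I option is not in POSIX, and the directory layout here is made up): run grep in the C locale, where every byte counts as a valid character, and use -I so that files containing NUL bytes are treated as non-matching.

```shell
# Hypothetical mail tree: one mixed-encoding mailbox, one file with a NUL byte.
dir=$(mktemp -d)
printf 'needle caf\303\251\nneedle caf\351\n' > "$dir/inbox"
printf 'needle\000\n' > "$dir/core"

# LC_ALL=C: bytes are single-byte characters, so there are no encoding errors.
# -I: treat binary files (here, the one with a NUL byte) as non-matching.
hits=$(LC_ALL=C grep -rIl 'needle' "$dir")
echo "$hits"
rm -rf "$dir"
```

Only the mailbox should be listed. The trade-off is that patterns containing non-ASCII characters are then matched byte by byte rather than as characters.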

Many text utils used to be generally useful -- but now they are
having functionality removed to have them only work with POSIX.

I suppose no one else really does a quick search through all their
email this way any more.  Though is this what you'd expect?

Sigh.



Attachment: binary-output-on-tty.jpg (37K)

Re: Need new source for unix utils -- gnu has broken another.

Felix Miata-3
Linda Walsh composed on 2018-02-02 12:29 (UTC-0800):

> I suppose no one else really does a quick search through all their
> email this way any more.

I do such searching with mc or filecommander, which is largely why I still use POP.
--
"Wisdom is supreme; therefore get wisdom. Whatever else you
get, get wisdom." Proverbs 4:7 (New Living Translation)

 Team OS/2 ** Reg. Linux User #211409 ** a11y rocks!

Felix Miata  ***  http://fm.no-ip.com/

--
To unsubscribe, e-mail: [hidden email]
To contact the owner, e-mail: [hidden email]


Re: Need new source for unix utils -- gnu has broken another.

Carlos E. R.-2
On 2018-02-02 21:42, Felix Miata wrote:
> Linda Walsh composed on 2018-02-02 12:29 (UTC-0800):
>
>> I suppose no one else really does a quick search through all their
>> email this way any more.
>
> I do such searching with mc or filecommander, which is largely why I still use POP.

I also do it with 'mc', no matter whether I used pop3 or imap to retrieve
the posts. But I remember using grepmail or mailgrep in the past - I'm
unsure of the exact name. It would generate another mbox file with the
hits, IIRC.

Right, I just found old notes:

cd Mail
grepmail -b -i -M -m -R -u -e "RBL" file/* > busqueda

grepmail -b -i -M -m -R -u -e "griego" busqueda > busqueda_paso2
grepmail -b -i -M -m -R -u -e "greek" busqueda >> busqueda_paso2

grepmail -h -m -R -e "mail.id@host" lists/* > busqueda


Notice that the mbox "busqueda" (search in Spanish) can itself fall in
the search recursively with nasty results.
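
A small sketch of one way to avoid that, using plain grep -r rather than grepmail and a made-up directory layout: keep the result file outside the tree being searched, so a later recursive run never reads its own earlier output.

```shell
# Hypothetical layout: Mail/ holds the mailboxes to be searched.
top=$(mktemp -d)
mkdir "$top/Mail"
printf 'Subject: RBL report\n' > "$top/Mail/inbox"

# Results go next to, not inside, the searched tree.
LC_ALL=C grep -r 'RBL' "$top/Mail" >  "$top/busqueda"
LC_ALL=C grep -r 'RBL' "$top/Mail" >> "$top/busqueda"

# Each run contributes exactly one hit; nothing is picked up from busqueda itself.
lines=$(wc -l < "$top/busqueda" | tr -d ' ')
echo "$lines"
rm -rf "$top"
```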

       -b      Asserts that the pattern must match in the body of the
               email. (Not compatible with -B.)

       -B      Asserts that the pattern must match in the body of the
               email, but not the signature. The signature consists of
               everything after a line consisting of "-- ".
               (Not compatible with -b.)

       -i      Make the search case-insensitive (by analogy to grep -i).

       -M      Causes grepmail to ignore non-text MIME attachments. This
               removes false positives resulting from binaries encoded
               as ASCII attachments.

       -m      Append "X-Mailfolder: <folder>" to all email headers,
               indicating which folder contained the matched email.

       -R      Causes grepmail to recurse any directories encountered.

       -u      Output only unique emails, by analogy to sort -u.
               Grepmail determines email uniqueness by the Message-ID
               header.

       -e      Explicitly specify the search pattern. This is useful for
               specifying patterns that begin with "-", which would
               otherwise be interpreted as a flag.

       -h      Asserts that the pattern must match in the header of the
               email.



I don't remember now why I stopped using it. Perhaps because Thunderbird
can search in several folders.


--
Cheers / Saludos,

                Carlos E. R.

  (from 42.3 x86_64 "Malachite" (Minas Tirith))



Re: Need new source for unix utils -- gnu has broken another.

L A Walsh
Carlos E. R. wrote:
> It would generate another mbox file with the
> hits, IIRC.
...
> I don't remember now why I stopped using it. Perhaps because Thunderbird
> can search in several folders.
---
        I used grep for a few reasons:
        1) (a later reason) it added perl-compatible REs,
        2) speed: I have about 6.4G of mail, though some of it is compressed,
        3) it's recursive,
        4) I didn't search in Tbird, as things would get messy trying
           to keep even archives in IMAP.
       
Would usually try to find which folders had references.  From there would
either look at the file in an editor if it was old, or if the file was
in IMAP, I'd search for what I wanted via Tbird+IMAP.

        Still takes a while to do text searches through several gigabytes
of text.  

        It's mostly a narrowing down step to find where something is.

        I just needed something to search through files for given strings,
and grep used to be general enough that it would search through just
about anything.  Apparently not anymore.

Tried to report problem to gnu-grep bug list, and was told that grep only works on text files as defined by POSIX, ... wonderful...

Now if I can only get all email sources(authors)
to follow POSIX standards for their email texts.  Hahahaha...like that's
gonna happen.  Does anyone else think it's more than a bit odd
to POSIXify people-interfaces?  Programs, sure, but people?

So much for userfriendly...


Re: Need new source for unix utils -- gnu has broken another.

Carlos E. R.-2
On 2018-02-03 00:10, L A Walsh wrote:

> Carlos E. R. wrote:
>> It would generate another mbox file with the
>> hits, IIRC.
> ...
>> I don't remember now why I stopped using it. Perhaps because Thunderbird
>> can search in several folders.
> ---
>     I used grep for a few reasons -- 1) (later reason), it added
> perl-compat RE's,     2) speed.  have about 6.4G though some of those
> are compressed.
>     3) recursive
>     4) didn't search in Tbird as things would get messy trying
> to keep even archives in Imap.
Well, try grepmail - as long as they are not compressed.

--
Cheers / Saludos,

                Carlos E. R.

  (from 42.3 x86_64 "Malachite" (Minas Tirith))



Re: Need new source for unix utils -- gnu has broken another.

Dave Howorth-3
In reply to this post by L A Walsh
On Fri, 02 Feb 2018 15:10:41 -0800
L A Walsh <[hidden email]> wrote:

> Carlos E. R. wrote:
> > It would generate another mbox file with the
> > hits, IIRC.  
> ...
> > I don't remember now why I stopped using it. Perhaps because
> > Thunderbird can search in several folders.  
> ---
> I used grep for a few reasons -- 1) (later reason), it added
> perl-compat RE's, 2) speed.  have about 6.4G though some of
> those are compressed. 3) recursive
> 4) didn't search in Tbird as things would get messy trying
> to keep even archives in Imap.
>
> Would usually try to find which folders had references.  From there
> would either look at the file in an editor if it was old, or if the
> file was in IMAP, I'd search for what I wanted via Tbird+IMAP.
>
> Still takes a while to do text searches through several
> gigabytes of text.   It's mostly a narrowing down step to find
> where something is.
>
> Just needed something to search through files for given
> strings. and grep used to be general case enough that it would search
> through just about anything.  Apparently not anymore.  Tried to
> report problem to gnu-grep bug list, and was told that grep only
> works on text files as defined by POSIX, ... wonderful...
>
> Now if I can only get all email sources(authors)
> to follow POSIX standards for their email texts.  Hahahaha...like
> that's gonna happen.  Does anyone else think it's more than a bit odd
> to POSIXify people-interfaces?  Programs, sure, but people?
>
> So much for userfriendly...

Get an old version of grep from wherever it is still skulking about?
Rename it as 'grep-that-works' or 'grep-jfdi' or somesuch :)


Re: Need new source for unix utils -- gnu has broken another.

Bernhard Voelker
In reply to this post by L A Walsh
On 02/02/2018 09:29 PM, Linda Walsh wrote:
> I've used grep to search for strings across all my mailboxes for
> decades.  Found out today, it randomly doesn't work based on whether
> or not the file contains any text that doesn't comply with
> POSIX.
>
> So if one user has UTF-8 encoding and another, ISO-8859-1, and they
> are in the same mailbox, according to POSIX, that's a binary file.

yes, IMO Eric Blake explained that quite well:
   https://lists.gnu.org/r/bug-grep/2018-02/msg00001.html

> You have to tell grep to search (and potentially display) binary
> data [...].

I don't think so - but you have to tell it to process the file
single-byte-wise instead of trying to conform to a certain single
locale (which is impossible in that case).
Again, Eric showed you the way:
   $ LC_ALL=C grep ...
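
A minimal illustration of that suggestion, with a made-up file holding one UTF-8 line and one ISO-8859-1 line. In the C locale both lines are searched as plain bytes, so an ASCII pattern is found in each:

```shell
f=$(mktemp)
printf 'caf\303\251 au lait\n' >  "$f"   # UTF-8 encoded line
printf 'caf\351 au lait\n'     >> "$f"   # ISO-8859-1 encoded line

# Count matching lines with every byte treated as one character.
count=$(LC_ALL=C grep -c 'au lait' "$f")
echo "$count"
rm -f "$f"
```

In a UTF-8 locale, recent GNU grep versions would be expected to report a binary file match for the same input instead of printing the second line.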

> Note the last line is the prompt same text as you can see at top of window.

I don't see what "head -3 /bin/bash" has to do with the output
of grep or your $SUBJECT at all (apart from the term "binary").

Apropos standards: "head -NUM" is obsolete and non-portable syntax:
   https://www.gnu.org/software/coreutils/head
Use "head -n NUM" instead. ;-)
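
For example, with a throwaway four-line input in place of a real file:

```shell
# Portable form: the line count is passed as an argument to -n.
first3=$(printf 'one\ntwo\nthree\nfour\n' | head -n 3)
echo "$first3"
```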

Have a nice day,
Berny


Re: Need new source for unix utils -- gnu has broken another.

L A Walsh
Bernhard Voelker wrote:

> On 02/02/2018 09:29 PM, Linda Walsh wrote:
>> I've used grep to search for strings across all my mailboxes for
>> decades.  Found out today, it randomly doesn't work based on whether
>> or not the file contains any text that doesn't comply with
>> POSIX.
>>
>> So if one user has UTF-8 encoding and another, ISO-8859-1, and they
>> are in the same mailbox, according to POSIX, that's a binary file.
>
> yes, IMO Eric Blake explained that quite well:
>    https://lists.gnu.org/r/bug-grep/2018-02/msg00001.html
>
>> You have to tell grep to search (and potentially display) binary
>> data [...].
>
> I don't think so - but you have to tell it to process the file
> single-byte-wise
----
        If data is processed as "binary", wouldn't that
mean processing it with no encoding or decoding -- as a stream
of bytes?

        How do you interpret binary?

> locale (which is impossible in that case).
> Again, Eric showed you the way:
>    $ LC_ALL=C grep ...
---
        Wouldn't LC_CTYPE suffice if you went that route?

        However, since you say it would be impossible
to process the file as some encoding, then instead of
throwing some error, or skipping the file, wouldn't it
be more useful to default to such processing upon
encountering a file that might appear "binary" (as in my
case: "Non-ISO extended-ASCII text, with very long lines")
in the case that "POSIXLY_CORRECT" was not set?

>> Note the last line is the prompt same text as you can see at top of window.
>
> I don't see what "head -3 /bin/bash" has to do with the output
> of grep or your $SUBJECT at all (apart from the term "binary").
---
        It was showing the output of a real binary file
instead of an "mbox" that would contain multiple encodings
and why one still doesn't want unrestrained display of
binary, even if one wants to process text with multiple
or incorrect encodings.

        The original example had:

'grep -a string /bin/bash|head -3', but for purposes of
showing tty-corruption, the "head -3 /bin/bash" was
sufficient.




Re: Need new source for unix utils -- gnu has broken another.

Bernhard Voelker
On 02/03/2018 04:03 AM, L A Walsh wrote:
> Bernhard Voelker wrote:
> If data is processed as "binary", wouldn't that
> mean processing it with no encoding or decoding -- as a stream
> of bytes?

well, it tries to treat each byte as one character - as opposed to
multi-byte encodings where up to 4 bytes are used, e.g. the character
"SMILING FACE WITH SUNGLASSES" is encoded in UTF-8 as 4 bytes:
0xF0 0x9F 0x98 0x8E.
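
Those four bytes can be checked from the shell. This sketch writes them with octal escapes, which printf accepts portably, and dumps them with od:

```shell
# 0xF0 0x9F 0x98 0x8E written as octal escapes.
bytes=$(printf '\360\237\230\216' | od -An -tx1 | tr -d ' \n')
nbytes=$(printf '\360\237\230\216' | wc -c | tr -d ' ')
echo "$bytes ($nbytes bytes)"
```

A tool working byte-wise sees four separate units here, while a UTF-8 aware tool sees a single character.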

Furthermore, grep is a line-based tool, and binary files could have
extremely long lines, control characters (which grep doesn't care
about, but the terminal it writes to does), and finally the NUL
character, which traditionally marks the end of a string.

> How do you interpret binary?
>
>> locale (which is impossible in that case).
>> Again, Eric showed you the way:
>>     $ LC_ALL=C grep ...
> ---
> Wouldn't LC_CTYPE suffice if you went that route?

Depends what you want to match.  LC_COLLATE may also influence
the matching. See 'info grep' or 'man grep'.  LC_ALL does it
all in one go.
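
The reason LC_ALL covers everything is the lookup order for each locale category: LC_ALL first, then the category's own variable, then LANG. A rough sketch of that order for LC_CTYPE, where the function name effective_ctype is made up and the final fallback to "C" is an assumption (POSIX leaves the default to the implementation):

```shell
effective_ctype() {
    # Order: LC_ALL, then LC_CTYPE, then LANG; empty values count as unset.
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

# "C" and "POSIX" stand in for real locale names such as en_US.UTF-8.
a=$(LC_ALL=C LC_CTYPE=POSIX LANG=POSIX effective_ctype)   # LC_ALL wins
b=$(LC_ALL=  LC_CTYPE=POSIX LANG=C     effective_ctype)   # LC_CTYPE wins
c=$(LC_ALL=  LC_CTYPE=      LANG=C     effective_ctype)   # LANG is the fallback
echo "$a $b $c"
```

The same order applies to LC_COLLATE, so setting only LC_CTYPE leaves collation under whatever LANG says.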

> However, since you say it would be impossible
> to process the file as some encoding, then instead of
> throwing some error, or skipping the file, wouldn't it
> be more useful to default to such processing upon
> encountering a file that might appear "binary" (as in my
> case: "Non-ISO extended-ASCII text, with very long lines")
> in the case that "POSIXLY_CORRECT" was not set?

IMO no: your environment tells grep to treat input e.g. as UTF-8,
but the actual input might be some ISO encoding.  There is simply
no way to get the regular expression for such a mixture.  It's like
one is a vegetarian, and wants to get some carrots out of a bag someone
came back with from a butcher; in the bag might be a duck stuffed
with some vegetables and even a carrot, but the search is just a
fail.

> The original example had:
>
> 'grep -a string /bin/bash|head -3', but for purposes of
> showing tty-corruption, the "head -3 /bin/bash" was
> sufficient.

To search for some strings in executables, you're much better off
with "strings /bin/bash | grep string".  Your attempt is not a problem
for grep at all, but the terminal it writes to might interpret certain
control characters ... this is like someone speaking Chinese and an
English listener might hear the word "bye" somewhere in the sentence
... and leave.
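
If the output might still contain control bytes, one possible safeguard (not something suggested above, just a sketch) is to filter it before it reaches the terminal. Here every byte that is not printable ASCII or a newline is replaced with '?', and the input is a made-up line with an escape sequence and a bell character. The -a option and the padding of the short second tr set are behaviour of GNU and BSD tools, not guaranteed by POSIX:

```shell
# ESC (\033) and BEL (\007) would normally be interpreted by the terminal.
safe=$(printf 'ok \033[31mstring\007\n' \
        | LC_ALL=C grep -a 'string' \
        | LC_ALL=C tr -c '[:print:]\n' '?')
echo "$safe"
```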

Have a nice day,
Berny
