Discussion:
URL_REGEXP and URLs ending with ')'
Dennis Preiser
2014-04-18 15:31:19 UTC
Hello,

Wikipedia uses many URLs that end with ')', e.g.

<http://en.wikipedia.org/wiki/Tin_(newsreader)>

Tin's regex does not recognize the ')' as part of the URL. The attached
patch fixes this for me but I'm not sure if this is the right solution.

In the diff it's hard to see what I've changed. It is this round bracket
at the end of the regex, which I have removed:

... (?:/[^)\\]\\>\"\\s]*|$|(?=[)\\]\\>\"\\s]))"
          ^

Dennis
Urs Janßen
2014-04-18 17:59:24 UTC
Post by Dennis Preiser
Wikipedia uses many URLs that end with ')', e.g.
<http://en.wikipedia.org/wiki/Tin_(newsreader)>
Tin's regex does not recognize the ')' as part of the URL. The attached
this was intentional, to avoid capturing a ) when it does not belong to
the url.

for a de.* full feed with ~14 days retention 3466 out of 88146 matches would
differ (3.93%). out of 3466 differences

242 matches of urls which also match

(\.(txt|html?|jpg|png|pdf|flv|zip|php|aspx)\)|\)[\.,:\?=]|/\))\s*$

and are very likely to be an error, and 1688 matches which also had
an opening ( in it and thus are likely to be correct.
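
The "very likely to be an error" check with the pattern above can be tried outside tin. What follows is only a rough sketch: it assumes POSIX extended regular expressions from <regex.h> rather than the Perl-style syntax used in the thread, so \s is written as [[:space:]] and the bracket expression carries no backslashes. The function name trailing_paren_suspicious is made up for this example and is not part of tin.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of tin.
 * Returns 1 if a match ending in ')' looks like a false capture
 * according to the heuristic pattern quoted above, 0 if it does not,
 * and -1 if the pattern could not be compiled.
 */
int
trailing_paren_suspicious(const char *url)
{
	/* POSIX ERE rewrite of the pattern from the post (assumption) */
	static const char pat[] =
		"(\\.(txt|html?|jpg|png|pdf|flv|zip|php|aspx)\\)"
		"|\\)[.,:?=]"
		"|/\\))[[:space:]]*$";
	regex_t re;
	int rc;

	if (regcomp(&re, pat, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;

	rc = regexec(&re, url, 0, NULL, 0);
	regfree(&re);

	return rc == 0;
}
```

The pattern is compiled on every call to keep the sketch short; a real implementation would presumably compile it once.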

for the remaining 1541 differences, most of them look like errors, but I
didn't check the links, e.g.

http://de.wikipedia.org/wiki/Liste_von_Katastrophen_der_Luftfahrt)
https://addons.mozilla.org/de/seamonkey/addon/bugmail/?src=api)
http://xnews.newsguy.com/xnews_de.chm)
http://www.vba-tutorial.de/cgi-bin/mailto.pl)

http://www.deutschepost.de/dpag?xmlFile=link1016009_901)
http://bit.ly/ekg72U)
https://twitter.com/zugschlusine/status/442945994121293824/photo/1)
[...]
Post by Dennis Preiser
patch fixes this for me but I'm not sure if this is the right solution.
IMHO the closing ) at the end of the url should only be captured if there
was an opening ( before, otherwise it should not be treated as part of
the url to avoid wrong captures. have fun with putting that into a regexp ,-)

urs
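
One way to get the "only if there was an opening ( before" behaviour without encoding it in the regex is a post-processing step on the match. The following is a minimal sketch under two assumptions that are not from the thread: the regex has already been relaxed so that ')' can be part of a match, and the match is available as a pointer plus length. The name url_len_balanced is hypothetical and is not a tin function.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of tin.
 * Given a candidate URL match of length len, return the length that
 * should be kept: trailing ')' characters are dropped for as long as
 * the candidate contains more ')' than '('.
 */
size_t
url_len_balanced(const char *url, size_t len)
{
	size_t i;
	long depth = 0;	/* number of '(' minus number of ')' */

	for (i = 0; i < len; i++) {
		if (url[i] == '(')
			depth++;
		else if (url[i] == ')')
			depth--;
	}

	/* strip surplus closing brackets from the end only */
	while (len > 0 && url[len - 1] == ')' && depth < 0) {
		len--;
		depth++;
	}

	return len;
}
```

With this, the Wikipedia example keeps its closing bracket, while a match such as http://bit.ly/ekg72U) loses it. It only counts brackets, so it cannot tell whether a balanced ')' really belongs to the URL.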
Martin Klaiber
2014-04-20 11:16:46 UTC
Post by Urs Janßen
IMHO the closing ) at the end of the url should only be captured if there
was an opening ( before, otherwise it should not be treated as part of
the url to avoid wrong captures. have fun with putting that into a regexp ,-)
This would be a nice solution. But the main problem is IMHO that it is
not possible to add characters manually when editing the URL because
tin seems to limit the length of the URL-string to the original length
without the ")".

So, it is not possible to add a ")" manually but it is always possible
to delete a trailing ")". At least with my version of tin (2.1.1 from
Debian wheezy).

Therefore and as a handy solution I would allow the closing ")" rather
than forbidding it.

Martin
Dennis Preiser
2014-04-20 19:30:36 UTC
Post by Martin Klaiber
This would be a nice solution. But the main problem is IMHO that it is
not possible to add characters manually when editing the URL because
tin seems to limit the length of the URL-string to the original length
without the ")".
You are right, this is a fixed length and not a dynamic one. It would be
possible to increase the length by a fixed amount. But what would be a
reasonable value: length of the URL + 1 or + 2 or + 5 or + 10 ...

Here is an example with URL + 5:

diff -urp tin-2.2.1_r4/src/page.c tin-2.2.1_r5/src/page.c
--- tin-2.2.1_r4/src/page.c 2014-04-19 21:29:40.000000000 +0200
+++ tin-2.2.1_r5/src/page.c 2014-04-20 16:44:03.000000000 +0200
@@ -2582,7 +2582,7 @@ process_url(
 	t_url *lptr;

 	lptr = find_url(n);
-	len = strlen(lptr->url);
+	len = strlen(lptr->url) + 5; /* +5: additional space to allow for completion of truncated url by hand */
 	url = my_malloc(len + 1);
 	if (prompt_default_string("URL:", url, len, lptr->url, HIST_URL)) {
 		if (!*url) { /* Don't try and open nothing */

The question is, what is less confusing for the user: No additional
input is possible or e.g. only 5 additional characters are possible.

Dennis
Urs Janßen
2014-04-21 09:57:09 UTC
Post by Dennis Preiser
Post by Martin Klaiber
This would be a nice solution. But the main problem is IMHO that it is
not possible to add characters manually when editing the URL because
tin seems to limit the length of the URL-string to the original length
without the ")".
You are right, this is a fixed length and not a dynamic one. It would be
possible to increase the length by a fixed amount.
or a dynamic one like

len = strlen(lptr->url) << 1;
Post by Dennis Preiser
But what would be a reasonable value: length of the URL + 1 or + 2 or + 5 or + 10 ...
as the memory is freed ASAP there is no need to be that minimalistic,
something like + 20 should cover most cases (who will add 20 extra chars
to an url?)

urs
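
To compare the two sizing ideas side by side, here is a small sketch. The helper names and the URL_EDIT_SLACK macro are invented for illustration and do not exist in tin; the value 20 is simply the figure suggested above. In both cases the caller would still allocate one extra byte for the terminating NUL, as the existing my_malloc(len + 1) call does.

```c
#include <stddef.h>
#include <string.h>

/* hypothetical name; 20 is the value suggested in the post above */
#define URL_EDIT_SLACK 20

/* fixed amount of extra room for manual completion of the URL */
size_t
url_edit_len_fixed(const char *url)
{
	return strlen(url) + URL_EDIT_SLACK;
}

/* "dynamic" variant: twice the length of the detected URL */
size_t
url_edit_len_doubled(const char *url)
{
	return strlen(url) << 1;
}
```

The doubled variant scales with the URL, so a long URL gets a lot of room and a very short one gets very little; the fixed variant gives the same room to every URL.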
Martin Klaiber
2014-04-21 10:11:22 UTC
Post by Dennis Preiser
Post by Martin Klaiber
This would be a nice solution. But the main problem is IMHO that it is
not possible to add characters manually when editing the URL because
tin seems to limit the length of the URL-string to the original length
without the ")".
You are right, this is a fixed length and not a dynamic one. It would be
possible to increase the length by a fixed amount. But what would be a
reasonable value: length of the URL + 1 or + 2 or + 5 or + 10 ...
I don't know. The trailing ")" is only one problem. There are others,
like wrapped URLs of broken newsreaders. Just an example:

| Message-ID: <***@wp-schulz.de>
| From: "Werner P. Schulz" <***@wp-schulz.de>
| Newsgroups: de.comm.provider.t-online
| Subject: Re: =?iso-8859-1?Q?Aprilst=F6rung=2C?= schluckauf oder...?
| Date: 1 Apr 2014 07:22:26 GMT
| User-Agent: Pan/0.139 (Sexual Chocolate; GIT bf56508
| git://git.gnome.org/pan2)
|
| [...]
|
| http://www.heise.de/newsticker/meldung/Kabelverstopfung-Telekom-aendert-
| Strategie-gegen-Skin-Effekt-2157959.html?wt_mc=nl.ho

The last two lines are one URL, destroyed by a line break. tin recognizes
only the first line as the URL, so the user has to add the second line
manually, which would be about 50 characters here.
Post by Dennis Preiser
The question is, what is less confusing for the user: No additional
input is possible or e.g. only 5 additional characters are possible.
I think both are confusing. From a user's point of view it would be nice
to have no limit at all (IMHO).

But on the other hand, broken URLs are rare, the few cases can be
handled by dragging the URL to a browser in a second window/terminal.

To be honest: I don't know what is best.

Sorry,
Martin
