04-19-2008, 07:01 PM
shakahshakah
Re: Escaping of RegEx?

On Feb 6, 2:54 am, Torsten Zühlsdorff <f...@meisterderspiele.de>
wrote:
> shakahshakah wrote:
>
> > On Feb 5, 5:11 am, Torsten Zühlsdorff <f...@meisterderspiele.de>
> > wrote:
> >> Hello,
> >>
> >> I have a list of URLs from the HTTP Referer header, and I select all URLs
> >> that contain "google". Now I want to extract the search string. For example,
> >> "http://www.google.de/search?hl=de&q=porenbeton+planbauplatten+abmessu.. ."
> >> should return "porenbeton+planbauplatten+abmessungen".
> >>
> >> For that I use this regex:
> >> (?:\?|&|as_)q=(.*?)(?:&|\s)
> >>
> >> In SQL it looks like this:
> >> SELECT
> >> substring('http://www.google.de/search?hl=de&q=porenbeton+planbauplatten+abmessu.. .
> >> from '(?:\?|&|as_)q=(.*?)(?:&|\s)');
> >>
> >> But I get this error message:
> >> quantifier operand invalid
> >>
> >> (complete error message, translated from German:
> >> WARNING: nonstandard use of escape in a string literal
> >> LINE 1: ...porenbeton+planbauplatten+abmessungen&meta=' from '(?:\?|&|a...
> >> ^
> >> HINT: Use the escape string syntax for escapes, e.g., E'\r\n'.
> >> ERROR: invalid regular expression: quantifier operand invalid)
> >>
> >> How do I need to escape the regex?
> >>
> >> Thanks for your help, and greetings from Germany,
> >> Torsten
> >
> > Does '^.*q=([^&=]*).*$' work for you?
>
> It works in most cases, but not on strings like this: http://www.google.de/search?q=Versag...&oe=utf-8&aq=t...
>
> The result is always "t":
> crawler=# SELECT
> SUBSTRING('http://www.google.de/search?q=Lampen+anbringen&ie=utf-8&oe=utf-8&aq=t...
> FROM '^.*q=([^&=]*).*$');
> substring
> -----------
> t
> (1 row)
>
> I cannot figure out why it doesn't work, because I do not completely
> understand the regex.
> But every user who uses the "firefox-google"
> (http://de.start2.mozilla.com/firefox...g.mozilla....)
> produces a referer that cannot be parsed by your regex :/


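Looks like it gets tripped up by query params whose names end in "q", like "aq"
(in addition to "q" itself): the greedy ".*" runs to the last "q=" in the
URL. How about requiring a "?" or "&" right before the "q":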
'^.*[?&]q=([^&=]*).*$'

reporting=> SELECT SUBSTRING('http://www.google.de/search?q=Versagen+der+Teilungsgenehmigung&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:de:official&client=firefox-a' FROM '^.*[?&]q=([^&=]*).*$');
substring
----------------------------------
Versagen+der+Teilungsgenehmigung
(1 row)
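
To see the difference in one go, here is a minimal sketch that runs both patterns against the same string. The URL is a shortened variant of the one above (cut off after "aq=t" purely for illustration), and the column aliases and the "sample" alias are my own names, not anything from the original queries. The inline VALUES list assumes PostgreSQL 8.2 or later.

```sql
-- Shortened sample URL (an assumption for illustration), ending in "aq=t".
SELECT
    -- The greedy ".*" runs up to the LAST "q=" in the string, which is the
    -- tail of "aq=", so the capture group picks up the value of "aq".
    SUBSTRING(url FROM '^.*q=([^&=]*).*$')     AS without_anchor,
    -- Requiring "?" or "&" directly before "q" rules out "aq=", leaving
    -- only the real "q" parameter as a candidate.
    SUBSTRING(url FROM '^.*[?&]q=([^&=]*).*$') AS with_anchor
FROM (VALUES
    ('http://www.google.de/search?q=Versagen+der+Teilungsgenehmigung&ie=utf-8&oe=utf-8&aq=t'::text)
) AS sample(url);
```

Going by the two results earlier in the thread, without_anchor should come back as "t" and with_anchor as "Versagen+der+Teilungsgenehmigung". The character class [?&] also contains no backslashes, so the string-literal escape warning from the original question does not come up with this pattern.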