Unix Technical Forum

Extracting hostname from URI column

This is a discussion on Extracting hostname from URI column within the pgsql Sql forums, part of the PostgreSQL category; --> Hi, I'm trying to use substr() and position() functions to extract the full host name (and later a domain) ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > pgsql Sql

Register FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-19-2008, 05:54 PM
ogjunk-pgjedan@yahoo.com
 
Posts: n/a
Default Extracting hostname from URI column

Hi,

I'm trying to use substr() and position() functions to extract the full host name (and later a domain) from a column that holds URLs.
This is what I'm trying, but it clearly doesn't do the job.

=> select substr(href, position('://' in href)+3, position('://' in href)+3+position('/' in href)), href from url where id <10;
substr | href
----------------+----------------------------------------------------------
texturizer.net | http://texturizer.net/firebird/extensions/
texturizer.net | http://texturizer.net/firebird/themes/
forums.mozilla | http://forums.mozillazine.org/index.php?c=4
www.mozillazin | http://www.mozillazine.org/
devedge.netsca | http://devedge.netscape.com/viewsource/2002/bookmarks/
www.google.com | http://www.google.com/search?&q=%s
groups.google. | http://groups.google.com/groups?scoring=d&q=%s
www.google.com | http://www.google.com/search?q=%s&btnI=I'm+Feeling+Lucky
dictionary.ref | http://dictionary.reference.com/search?q=%s

The 3rd param to the substr function is clearly wrong. Is it even doable without writing a procedure?

Finally, is this the fastest way to get this data, or is there regex-based function that might be faster?

Thanks,
Otis



---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-19-2008, 05:54 PM
chester c young
 
Posts: n/a
Default Re: Extracting hostname from URI column

> I'm trying to use substr() and position() functions to extract the
> full host name (and later a domain) from a column that holds URLs.


substring( href from '.*://\([^/]*)' );




__________________________________________________ __________________________________
Pinpoint customers who are looking for what you sell.
http://searchmarketing.yahoo.com/

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-19-2008, 05:54 PM
Paul Lambert
 
Posts: n/a
Default Re: Extracting hostname from URI column

ogjunk-pgjedan@yahoo.com wrote:
> Hi,
>
> I'm trying to use substr() and position() functions to extract the full host name (and later a domain) from a column that holds URLs.
> This is what I'm trying, but it clearly doesn't do the job.
>
> => select substr(href, position('://' in href)+3, position('://' in href)+3+position('/' in href)), href from url where id <10;
> substr | href
> ----------------+----------------------------------------------------------
> texturizer.net | http://texturizer.net/firebird/extensions/
> texturizer.net | http://texturizer.net/firebird/themes/
> forums.mozilla | http://forums.mozillazine.org/index.php?c=4
> www.mozillazin | http://www.mozillazine.org/
> devedge.netsca | http://devedge.netscape.com/viewsource/2002/bookmarks/
> www.google.com | http://www.google.com/search?&q=%s
> groups.google. | http://groups.google.com/groups?scoring=d&q=%s
> www.google.com | http://www.google.com/search?q=%s&btnI=I'm+Feeling+Lucky
> dictionary.ref | http://dictionary.reference.com/search?q=%s
>
> The 3rd param to the substr function is clearly wrong. Is it even doable without writing a procedure?
>
> Finally, is this the fastest way to get this data, or is there regex-based function that might be faster?
>
> Thanks,
> Otis
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings
>
>


Try this:

test=# select substr(href, position('://' in href)+3, position('/' in
substr(
href,position('://' in href)+3))-1), href from url;
substr | href

--------------------------+-----------------------------------------------------
-----
devedge.netscape.com |
http://devedge.netscape.com/viewsource/2002/bookmark
s/
texturizer.net | http://texturizer.net/firebird/extensions/
texturizer.net | http://texturizer.net/firebird/themes/
forums.mozillazine.org | http://forums.mozillazine.org/index.php?c=4
www.mozillazine.org | http://www.mozillazine.org/
devedge.netscape.com |
http://devedge.netscape.com/viewsource/2002/bookmark
s/
www.google.com | http://www.google.com/search?&q=%s
groups.google.com | http://groups.google.com/groups?scoring=d&q=%s
www.google.com |
http://www.google.com/search?q=%s&btnI=I'm+Feeling+L
ucky
dictionary.reference.com | http://dictionary.reference.com/search?q=%s
(10 rows)

--
Paul Lambert
Database Administrator
AutoLedgers

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-19-2008, 05:54 PM
Paul Lambert
 
Posts: n/a
Default Re: Extracting hostname from URI column

chester c young wrote:
>> I'm trying to use substr() and position() functions to extract the
>> full host name (and later a domain) from a column that holds URLs.

>
> substring( href from '.*://\([^/]*)' );
>


Ok, your solution looks better than mine... but I have no idea how to
interpret that, time to consult some manuals.

--
Paul Lambert
Database Administrator
AutoLedgers


---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-19-2008, 05:54 PM
Andrej Ricnik-Bay
 
Posts: n/a
Default Re: Extracting hostname from URI column

On 9/12/07, Paul Lambert <paul.lambert@autoledgers.com.au> wrote:
> > substring( href from '.*://\([^/]*)' );


> Ok, your solution looks better than mine... but I have no idea how to
> interpret that, time to consult some manuals.

Plain regex.... The key are the parenthesis () ...
basically it will omit ANYTHING + two slashes at the beginning
of a string. Then it will match everything BUT a slash, and as
much of that as possible since regex are greedy by default
(hence the host name he was looking for) ... and everything
AFTER a slash will be omitted.



Cheers,
Andrej


--
Please don't top post, and don't use HTML e-Mail :} Make your quotes concise.

http://www.american.edu/econ/notes/htmlmail.htm

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-19-2008, 05:54 PM
Paul Lambert
 
Posts: n/a
Default Re: Extracting hostname from URI column

Andrej Ricnik-Bay wrote:

> Plain regex.... The key are the parenthesis () ...
> basically it will omit ANYTHING + two slashes at the beginning
> of a string. Then it will match everything BUT a slash, and as
> much of that as possible since regex are greedy by default
> (hence the host name he was looking for) ... and everything
> AFTER a slash will be omitted.
>
>
>
> Cheers,
> Andrej
>
>

Thanks - that makes a bit more sense. I'm in the middle of reading
chapter 9.3.7 of the manual - POSIX Regular Expressions - which I'm
assuming is dealing with this, so it's looking clearer.

--
Paul Lambert
Database Administrator
AutoLedgers

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-19-2008, 05:54 PM
chester c young
 
Posts: n/a
Default Re: Extracting hostname from URI column

> >> I'm trying to use substr() and position() functions to extract the
> >> full host name (and later a domain) from a column that holds URLs.

> >
> > substring( href from '.*://\([^/]*)' );
> >


typo: no backslash in front of left paren
substring( href from '.*://([^/]*)' )

match up thru //
within parens, match anything except /

return match within parens



__________________________________________________ __________________________________
Building a website is a piece of cake. Yahoo! Small Business gives you all the tools to get online.
http://smallbusiness.yahoo.com/webhosting

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-19-2008, 05:54 PM
Alvaro Herrera
 
Posts: n/a
Default Re: Extracting hostname from URI column

Paul Lambert wrote:
> Andrej Ricnik-Bay wrote:
>
>> Plain regex.... The key are the parenthesis () ...
>> basically it will omit ANYTHING + two slashes at the beginning
>> of a string. Then it will match everything BUT a slash, and as
>> much of that as possible since regex are greedy by default
>> (hence the host name he was looking for) ... and everything
>> AFTER a slash will be omitted.
>> Cheers,
>> Andrej

> Thanks - that makes a bit more sense. I'm in the middle of reading chapter
> 9.3.7 of the manual - POSIX Regular Expressions - which I'm assuming is
> dealing with this, so it's looking clearer.


A very good resource on regular expressions is "Mastering Regular
Expressions" by Jeffrey Friedl, now in its third edition:
http://www.oreilly.com/catalog/regex3/

I read the first ed. years ago and it was very illuminating.

--
Alvaro Herrera Developer, http://www.PostgreSQL.org/
"Acepta los honores y aplausos y perderás tu libertad"

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 04-19-2008, 05:54 PM
Paul Lambert
 
Posts: n/a
Default Re: Extracting hostname from URI column

Paul Lambert wrote:
> chester c young wrote:
>>> I'm trying to use substr() and position() functions to extract the
>>> full host name (and later a domain) from a column that holds URLs.

>>
>> substring( href from '.*://\([^/]*)' );
>>

>
> Ok, your solution looks better than mine... but I have no idea how to
> interpret that, time to consult some manuals.
>


OK - following on from this, I'm doing a similar thing to the OP to
analyze my proxy's traffic (never occured to me to do it in a db until
that post)

I've done the above regex to pull out the domain part of the URL and am
left with results such as:
"acvs.mediaonenetwork.net"
"profile.ak.facebook.com"
"www.bankwest.com.au"

What I want to do next is pull out the major domain part of the URL
I.e. for the above three records I should end up with
"mediaonenetwork.net"
"facebook.com"
"bankwest.com.au"

What would be the best way to do something like that? I assume it won't
be a simple regex like the above due to the country codes on the end of
some domains. My thought is look at the last portion of the domain, if
it's 2 characters long then assume it's a country code and grab the last
three sections, if it's not three characters long then assume it's an
international domain and grab the last two... but that sounds a bit dodgy.

--
Paul Lambert
Database Administrator
AutoLedgers


---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 04-19-2008, 05:54 PM
Gregory Stark
 
Posts: n/a
Default Re: Extracting hostname from URI column

"Paul Lambert" <paul.lambert@autoledgers.com.au> writes:

> What would be the best way to do something like that? I assume it won't be a
> simple regex like the above due to the country codes on the end of some
> domains. My thought is look at the last portion of the domain, if it's 2
> characters long then assume it's a country code and grab the last three
> sections, if it's not three characters long then assume it's an international
> domain and grab the last two... but that sounds a bit dodgy.


Not all countries break up their national tld space into sections like .co.uk
or .com.au. Canadian domains can be bought directly under .ca like amazon.ca.

I think you'll have to store a specific list of tlds and how deep you want to
look.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 06:13 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
www.UnixAdminTalk.com