This is a discussion on Extracting hostname from URI column within the pgsql Sql forums, part of the PostgreSQL category; --> Hi, I'm trying to use substr() and position() functions to extract the full host name (and later a domain) ...
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Hi, I'm trying to use substr() and position() functions to extract the full host name (and later a domain) from a column that holds URLs. This is what I'm trying, but it clearly doesn't do the job. => select substr(href, position('://' in href)+3, position('://' in href)+3+position('/' in href)), href from url where id <10; substr | href ----------------+---------------------------------------------------------- texturizer.net | http://texturizer.net/firebird/extensions/ texturizer.net | http://texturizer.net/firebird/themes/ forums.mozilla | http://forums.mozillazine.org/index.php?c=4 www.mozillazin | http://www.mozillazine.org/ devedge.netsca | http://devedge.netscape.com/viewsource/2002/bookmarks/ www.google.com | http://www.google.com/search?&q=%s groups.google. | http://groups.google.com/groups?scoring=d&q=%s www.google.com | http://www.google.com/search?q=%s&btnI=I'm+Feeling+Lucky dictionary.ref | http://dictionary.reference.com/search?q=%s The 3rd param to the substr function is clearly wrong. Is it even doable without writing a procedure? Finally, is this the fastest way to get this data, or is there regex-based function that might be faster? Thanks, Otis ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| |||
| > I'm trying to use substr() and position() functions to extract the > full host name (and later a domain) from a column that holds URLs. substring( href from '.*://\([^/]*)' ); __________________________________________________ __________________________________ Pinpoint customers who are looking for what you sell. http://searchmarketing.yahoo.com/ ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| ogjunk-pgjedan@yahoo.com wrote: > Hi, > > I'm trying to use substr() and position() functions to extract the full host name (and later a domain) from a column that holds URLs. > This is what I'm trying, but it clearly doesn't do the job. > > => select substr(href, position('://' in href)+3, position('://' in href)+3+position('/' in href)), href from url where id <10; > substr | href > ----------------+---------------------------------------------------------- > texturizer.net | http://texturizer.net/firebird/extensions/ > texturizer.net | http://texturizer.net/firebird/themes/ > forums.mozilla | http://forums.mozillazine.org/index.php?c=4 > www.mozillazin | http://www.mozillazine.org/ > devedge.netsca | http://devedge.netscape.com/viewsource/2002/bookmarks/ > www.google.com | http://www.google.com/search?&q=%s > groups.google. | http://groups.google.com/groups?scoring=d&q=%s > www.google.com | http://www.google.com/search?q=%s&btnI=I'm+Feeling+Lucky > dictionary.ref | http://dictionary.reference.com/search?q=%s > > The 3rd param to the substr function is clearly wrong. Is it even doable without writing a procedure? > > Finally, is this the fastest way to get this data, or is there regex-based function that might be faster? > > Thanks, > Otis > > > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings > > Try this: test=# select substr(href, position('://' in href)+3, position('/' in substr( href,position('://' in href)+3))-1), href from url; substr | href --------------------------+----------------------------------------------------- ----- devedge.netscape.com | http://devedge.netscape.com/viewsource/2002/bookmark s/ texturizer.net | http://texturizer.net/firebird/extensions/ texturizer.net | http://texturizer.net/firebird/themes/ forums.mozillazine.org | http://forums.mozillazine.org/index.php?c=4 www.mozillazine.org | http://www.mozillazine.org/ devedge.netscape.com | http://devedge.netscape.com/viewsource/2002/bookmark s/ www.google.com | http://www.google.com/search?&q=%s groups.google.com | http://groups.google.com/groups?scoring=d&q=%s www.google.com | http://www.google.com/search?q=%s&btnI=I'm+Feeling+L ucky dictionary.reference.com | http://dictionary.reference.com/search?q=%s (10 rows) -- Paul Lambert Database Administrator AutoLedgers ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate |
| |||
| chester c young wrote: >> I'm trying to use substr() and position() functions to extract the >> full host name (and later a domain) from a column that holds URLs. > > substring( href from '.*://\([^/]*)' ); > Ok, your solution looks better than mine... but I have no idea how to interpret that, time to consult some manuals. -- Paul Lambert Database Administrator AutoLedgers ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| On 9/12/07, Paul Lambert <paul.lambert@autoledgers.com.au> wrote: > > substring( href from '.*://\([^/]*)' ); > Ok, your solution looks better than mine... but I have no idea how to > interpret that, time to consult some manuals. Plain regex.... The key are the parenthesis () ... basically it will omit ANYTHING + two slashes at the beginning of a string. Then it will match everything BUT a slash, and as much of that as possible since regex are greedy by default (hence the host name he was looking for) ... and everything AFTER a slash will be omitted. Cheers, Andrej -- Please don't top post, and don't use HTML e-Mail :} Make your quotes concise. http://www.american.edu/econ/notes/htmlmail.htm ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| Andrej Ricnik-Bay wrote: > Plain regex.... The key are the parenthesis () ... > basically it will omit ANYTHING + two slashes at the beginning > of a string. Then it will match everything BUT a slash, and as > much of that as possible since regex are greedy by default > (hence the host name he was looking for) ... and everything > AFTER a slash will be omitted. > > > > Cheers, > Andrej > > Thanks - that makes a bit more sense. I'm in the middle of reading chapter 9.3.7 of the manual - POSIX Regular Expressions - which I'm assuming is dealing with this, so it's looking clearer. -- Paul Lambert Database Administrator AutoLedgers ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| |||
| > >> I'm trying to use substr() and position() functions to extract the > >> full host name (and later a domain) from a column that holds URLs. > > > > substring( href from '.*://\([^/]*)' ); > > typo: no backslash in front of left paren substring( href from '.*://([^/]*)' ) match up thru // within parens, match anything except / return match within parens __________________________________________________ __________________________________ Building a website is a piece of cake. Yahoo! Small Business gives you all the tools to get online. http://smallbusiness.yahoo.com/webhosting ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| Paul Lambert wrote: > Andrej Ricnik-Bay wrote: > >> Plain regex.... The key are the parenthesis () ... >> basically it will omit ANYTHING + two slashes at the beginning >> of a string. Then it will match everything BUT a slash, and as >> much of that as possible since regex are greedy by default >> (hence the host name he was looking for) ... and everything >> AFTER a slash will be omitted. >> Cheers, >> Andrej > Thanks - that makes a bit more sense. I'm in the middle of reading chapter > 9.3.7 of the manual - POSIX Regular Expressions - which I'm assuming is > dealing with this, so it's looking clearer. A very good resource on regular expressions is "Mastering Regular Expressions" by Jeffrey Friedl, now in its third edition: http://www.oreilly.com/catalog/regex3/ I read the first ed. years ago and it was very illuminating. -- Alvaro Herrera Developer, http://www.PostgreSQL.org/ "Acepta los honores y aplausos y perderás tu libertad" ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate |
| |||
| Paul Lambert wrote: > chester c young wrote: >>> I'm trying to use substr() and position() functions to extract the >>> full host name (and later a domain) from a column that holds URLs. >> >> substring( href from '.*://\([^/]*)' ); >> > > Ok, your solution looks better than mine... but I have no idea how to > interpret that, time to consult some manuals. > OK - following on from this, I'm doing a similar thing to the OP to analyze my proxy's traffic (never occured to me to do it in a db until that post) I've done the above regex to pull out the domain part of the URL and am left with results such as: "acvs.mediaonenetwork.net" "profile.ak.facebook.com" "www.bankwest.com.au" What I want to do next is pull out the major domain part of the URL I.e. for the above three records I should end up with "mediaonenetwork.net" "facebook.com" "bankwest.com.au" What would be the best way to do something like that? I assume it won't be a simple regex like the above due to the country codes on the end of some domains. My thought is look at the last portion of the domain, if it's 2 characters long then assume it's a country code and grab the last three sections, if it's not three characters long then assume it's an international domain and grab the last two... but that sounds a bit dodgy. -- Paul Lambert Database Administrator AutoLedgers ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| ||||
| "Paul Lambert" <paul.lambert@autoledgers.com.au> writes: > What would be the best way to do something like that? I assume it won't be a > simple regex like the above due to the country codes on the end of some > domains. My thought is look at the last portion of the domain, if it's 2 > characters long then assume it's a country code and grab the last three > sections, if it's not three characters long then assume it's an international > domain and grab the last two... but that sounds a bit dodgy. Not all countries break up their national tld space into sections like .co.uk or .com.au. Canadian domains can be bought directly under .ca like amazon.ca. I think you'll have to store a specific list of tlds and how deep you want to look. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |