Unix Technical Forum

wget problem



Unix Technical Forum > Unix Operating Systems > Slackware Linux Support

  #1 (permalink)  
Old 02-19-2008, 06:20 PM
guruteck@gmail.com
 

Hi everyone,
Everybody knows we can use wget for recursive download of
web pages.
But some sites prohibit recursive downloads. Is there any way
to download these sites recursively? Someone told me that by changing
the port we can download from those sites.
Is that possible?
If so, how can it be done?

Please reply.

Thanks in advance.

  #2 (permalink)  
Old 02-19-2008, 06:20 PM
notbob
 

On 2004-12-16, guruteck@gmail.com <guruteck@gmail.com> wrote:
> Hi everyone,
> Everybody knows we can use wget for recursive download of
> web pages.


You might try httrack.

http://www.httrack.com/

Worked great for me. You can get it as a package at:

http://www.linuxpackages.net/

nb

  #3 (permalink)  
Old 02-19-2008, 06:20 PM
Mark Hill
 

On 15 Dec 2004 23:49:43 -0800,
guruteck@gmail.com <guruteck@gmail.com> wrote:

> Everybody knows we can use wget for recursive download of
> web pages.
> But some sites prohibit recursive downloads. Is there any way
> to download these sites recursively?


What errors are you getting when downloading a site? If you think it's
a robots.txt issue, you can tell wget to ignore it.

wget -erobots=off http://example.com

If you suspect that the site is blocking wget, you can change wget's
user-agent string with the -U option.

> Someone told me that by changing
> the port we can download from those sites.


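That doesn't seem likely with a typical website/webserver. Restrictions
such as robots.txt or User-Agent checks have nothing to do with the port
you connect to.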

--
Mark Hill
  #4 (permalink)  
Old 02-19-2008, 06:20 PM
guruteck@gmail.com
 

Thank you very much. It works great for me.
But I don't completely understand why it is working for me now.
I couldn't get the complete information from the man pages.
Thanks in advance.
Regards

  #5 (permalink)  
Old 02-19-2008, 06:20 PM
Mark Hill
 

On 16 Dec 2004 03:58:21 -0800,
guruteck@gmail.com <guruteck@gmail.com> wrote:
> Thank you very much. It works great for me.
> But I don't completely understand why it is working for me now.
> I couldn't get the complete information from the man pages.


The man page doesn't seem to cover robots.txt very much. (Perhaps this
is intentional.) It's worth reading /etc/wgetrc as that will give you
some more ideas as to what wget can do.
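
As a rough illustration of what such a file can hold, here is a small
sketch of a per-user ~/.wgetrc. The three settings are standard wgetrc
commands, but the values are only assumptions chosen for this example.
The -e option on the command line executes one such wgetrc command,
which is why '-erobots=off' has the same effect as the first setting.

```
# ~/.wgetrc  (per-user settings; /etc/wgetrc holds the system-wide defaults)

# Same effect as passing -e robots=off on the command line
robots = off

# Same effect as -U "..."; the string below is only an example value
user_agent = Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.7.5) Gecko/20041110 Firefox/1.0

# Pause one second between retrievals to go easy on the server
wait = 1
```

Settings placed here apply to every wget run by that user, so a one-off
download is probably better served by the command-line options.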

If the '-erobots=off' option worked for you, then this option told wget
to ignore the http://some.example.com/robots.txt file on the website.
robots.txt is part of the Robots Exclusion Standard that well-behaved
web robots (like wget) will follow. It tells web robots what part(s) of
the site should not be downloaded or indexed. There is more information
on the robotstxt site:
<http://www.robotstxt.org/>
<http://www.robotstxt.org/wc/norobots.html#introduction>
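
To make the idea concrete, here is a minimal sketch that writes a
made-up robots.txt to a temporary file and lists the paths it disallows
for all robots. The file contents and the variable names are
hypothetical, and the awk filter is a simplification: it only handles
records with a single User-agent line, which is enough to show the
format but is not a complete parser.

```shell
# Write a hypothetical robots.txt to a temporary file.
robots_file=$(mktemp)
cat > "$robots_file" <<'EOF'
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/

User-agent: Wget
Disallow: /
EOF

# Collect the Disallow paths from the record that applies to all
# robots ("User-agent: *"). A new User-agent line starts a new record.
disallowed=$(awk '
    tolower($1) == "user-agent:" { active = ($2 == "*") }
    active && tolower($1) == "disallow:" { print $2 }
' "$robots_file")

printf '%s\n' "$disallowed"
rm -f "$robots_file"
```

In this made-up file the second record asks any robot calling itself
"Wget" to stay out of the whole site, which is the kind of rule that
'-erobots=off' tells wget to disregard.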

If the '-U' option worked for you, the website you're downloading from
is blocking requests whose User-Agent header names the client as
"Wget/1.9.1". The -U option allows wget to look like another client,
like Firefox for instance:
wget -U \
"Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.7.5) Gecko/20041110 Firefox/1.0" \
http://some.example.com
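
If you want to confirm that the User-Agent is what the site objects to,
one possible check is to compare the status line the server returns for
each string. This is only a sketch: check_user_agent is a hypothetical
helper name, not part of wget, and the URL in the usage lines is a
placeholder.

```shell
# Hypothetical helper (not part of wget): print the first HTTP status
# line the server returns for a given URL and User-Agent string.
check_user_agent() {
    url=$1
    agent=$2
    # --spider: request the page without saving it
    # -S:       print the server's response headers (wget writes them to stderr)
    # -U:       send the given User-Agent string
    wget --spider -S -U "$agent" "$url" 2>&1 | grep 'HTTP/' | head -n 1
}

# Usage (placeholder URL; uncomment to run against a real site):
# check_user_agent http://some.example.com/ "Wget/1.9.1"
# check_user_agent http://some.example.com/ "Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.7.5) Gecko/20041110 Firefox/1.0"
```

A different status line for the two calls would suggest the server is
filtering on User-Agent; identical results point to some other cause.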

--
Mark Hill
  #6 (permalink)  
Old 02-19-2008, 06:21 PM
guruteck@gmail.com
 


Thanks a lot for your information.
This is what I was looking for.
Actually, the -U option worked perfectly for me.

Regards

  #7 (permalink)  
Old 02-19-2008, 06:21 PM
Keith Keller
 

On 2004-12-16, guruteck@gmail.com <guruteck@gmail.com> wrote:
>
> Thanks a lot for your information.
> This is what I was looking for.
> Actually, the -U option worked perfectly for me.


If you are using this option (which is officially discouraged in the man
page), you might also consider asking the webmaster of the remote site
why they are blocking on User-Agent; if doing so might raise their ire,
perhaps you might reconsider why you need to mirror their site.

--keith

--
kkeller-usenet@wombat.san-francisco.ca.us
(try just my userid to email me)
AOLSFAQ=http://wombat.san-francisco.ca.us/cgi-bin/fom

www.UnixAdminTalk.com