This is a discussion on "wget problem" within the Slackware Linux Support forums, part of the Unix Operating Systems category.
| Hi everyone. Everybody knows we can use wget for recursive download of web pages. But some sites won't allow recursive download. Is there any way to download these sites recursively? Someone told me that by changing the port we can download from those sites. Is that possible? If so, how can it be done? Please reply. Thanks in advance. |
| |||
| On 2004-12-16, guruteck@gmail.com <guruteck@gmail.com> wrote:
> Hi everyone.
> Everybody knows we can use wget for recursive download of
> web pages.

You might try httrack: http://www.httrack.com/

It worked great for me. You can get it as a package at http://www.linuxpackages.net/

nb |
| |||
| On 15 Dec 2004 23:49:43 -0800, guruteck@gmail.com <guruteck@gmail.com> wrote:
> Everybody knows we can use wget for recursive download of
> web pages.
> But some sites won't allow recursive download. Is there any way
> to download these sites recursively?

What errors are you getting when downloading a site?

If you think it's a robots.txt issue, you can tell wget to ignore it:

    wget -erobots=off http://example.com

If you suspect that the site is blocking wget, you can change wget's user-agent string with the -U option.

> Someone told me that by changing the
> port we can download from those sites.

That doesn't seem likely with a typical website/webserver.

--
Mark Hill |
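The two options mentioned above can be combined with wget's recursive flag (-r) in a small shell function. This is only a sketch: the function name mirror_site and the WGET override variable are hypothetical names introduced here for illustration, and the URL is a placeholder.

```shell
# mirror_site URL [USER_AGENT]
# Recursively download URL while ignoring robots.txt.
# If USER_AGENT is given, it is sent instead of wget's default string.
# WGET is a hypothetical override (not part of wget) that lets you
# substitute another command to preview the arguments.
mirror_site() {
    url=$1
    ua=${2:-}
    if [ -n "$ua" ]; then
        # -r: recurse, -e robots=off: ignore robots.txt, -U: set User-Agent
        "${WGET:-wget}" -r -e robots=off -U "$ua" "$url"
    else
        "${WGET:-wget}" -r -e robots=off "$url"
    fi
}
```

Calling mirror_site http://example.com/ would run the download with the default user-agent; pass a second argument only if the site appears to reject wget's own string.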
| |||
| On 16 Dec 2004 03:58:21 -0800, guruteck@gmail.com <guruteck@gmail.com> wrote:
> Thank you very much. It works great for me.
> But I don't completely understand why it is working for me now.
> I didn't get complete information from the man pages.

The man page doesn't seem to cover robots.txt very much. (Perhaps this is intentional.) It's worth reading /etc/wgetrc, as that will give you some more ideas as to what wget can do.

If the '-erobots=off' option worked for you, then this option told wget to ignore the http://some.example.com/robots.txt file on the website. robots.txt is part of the Robots Exclusion Standard that well-behaved web robots (like wget) will follow. It tells web robots what part(s) of the site should not be downloaded or indexed. There is more information on the robotstxt site:

<http://www.robotstxt.org/>
<http://www.robotstxt.org/wc/norobots.html#introduction>

If the '-U' option worked for you, the website you're downloading from is blocking requests from any client that identifies itself as "Wget/1.9.1". The -U option allows wget to look like another client, like Firefox for instance:

    wget -U \
      "Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.7.5) Gecko/20041110 Firefox/1.0" \
      http://some.example.com

--
Mark Hill |
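To see what a site's robots.txt asks robots to stay away from, a rough sketch like the following may help. The function name list_disallowed is hypothetical, and the filter is deliberately simplified: it prints every Disallow path it finds and does not work out which User-agent group each rule belongs to.

```shell
# list_disallowed: read a robots.txt file on standard input and print
# the path of every Disallow rule, one per line.
# Simplification: rules are not matched to User-agent groups.
list_disallowed() {
    sed -n 's/^[Dd]isallow:[[:space:]]*//p'
}
```

One possible use is to pipe the file straight from wget, for example: wget -q -O - http://some.example.com/robots.txt | list_disallowed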
| ||||
| On 2004-12-16, guruteck@gmail.com <guruteck@gmail.com> wrote:
>
> Thanks a lot for your information.
> This is what I was looking for.
> Actually, the -U option worked perfectly for me.

If you are using this option (which is officially discouraged in the man page), you might also consider asking the webmaster of the remote site why they are blocking on User-Agent; if doing so might raise their ire, perhaps you might reconsider why you need to mirror their site.

--keith

--
kkeller-usenet@wombat.san-francisco.ca.us
(try just my userid to email me)
AOLSFAQ=http://wombat.san-francisco.ca.us/cgi-bin/fom |