----------------------
BlueLeech README Notes
----------------------
===============================================================================
Preface
===============================================================================
The BlueLeech README/help documentation is released under the GNU Free
Documentation License (see FDL.txt for details).
Copyright (c) 2004 Benjamin Winters.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.2
or any later version published by the Free Software Foundation;
with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
A copy of the license is included in the section entitled "GNU Free Documentation License".
*** For the GNU Free Documentation License see the file FDL.txt
===============================================================================
The Basics of Running/Using BlueLeech:
===============================================================================
Run the BlueLeech.exe file, or open a command prompt (Linux/Windows), navigate
to the main "BlueLeech" directory, and enter the command:
java blueleech.BlueLeech
For example:
cd C:\BlueLeech\
java blueleech.BlueLeech
Once the main window displays, type in the URL of the page you wish to
start leeching from (for example penny-arcade.com), press the "Browse"
button to select where to save the leeched files, and set the maximum
number of simultaneous connections to the server. Too few connections
tends to be less bandwidth-efficient, while too many can make downloading
very slow; 4 to 6 tends to be a good number.
*** Also note that some servers will only accept 1 or 2 simultaneous
*** connections from a single client and will deny any further connections.
*** If you are getting a lot of URLs showing up as "failed" then this may be why.
===============================================================================
Using the Allow/Deny lists:
===============================================================================
It is generally best to first scout over the website you are about to
leech to figure out which settings you want to use and what allow/deny
list configuration will work best. For the example of penny-arcade.com
we don't want the leech spending forever trawling through the forum pages,
so press the "Deny" button under the "Allow/Deny URL Prefixes" list and
type in "http://penny-arcade.com/forums/".
Then, for our example, assuming that we only want to download the comics and
other images, we can remove the "*.*" entry in the "Allow/Deny File Extensions"
list and then add the value "*.jpg".
Allow/Deny URL Prefixes:
By URL prefix this means the beginning of any URL that BlueLeech
encounters. For example, the URL "http://penny-arcade.com/forums/"
has a prefix of "http://penny-arcade.com"; another prefix would be "http://"
Allow/Deny Domain Suffixes:
This list is less useful than the other two allow/deny lists; it
checks the DOMAIN PORTION ONLY of a URL. So if the full URL is
"http://images.penny-arcade.com/blah/" and there is an allow rule of
"penny-arcade.com", the check runs against the "images.penny-arcade.com"
part of the URL and sees that it ends with the allowable domain suffix.
Allow/Deny File Extensions:
A file extension is assumed to be *ANY* suffix of a URL path.
For example "http://penny-arcade.com/forums/image1.jpg" has a
suffix of "image1.jpg"; it also has a suffix of ".jpg" and a
suffix of "/forums/image1.jpg".
*** Note that by default the URL that you type into the top textbox
*** is automatically added as an allowed URL prefix. In our example
*** this means that "http://penny-arcade.com" is allowed automatically
*** and it does not have to be entered manually as an allow entry.
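The three matching rules above can be sketched as follows. This is an
illustrative sketch only; the class and method names are made up for this
example and are not BlueLeech's actual code.

```java
public class MatchRules {
    // URL prefix rule: the rule must match the start of the full URL.
    static boolean matchesUrlPrefix(String url, String prefix) {
        return url.startsWith(prefix);
    }

    // Domain suffix rule: only the host portion of the URL is checked.
    static boolean matchesDomainSuffix(String url, String suffix) {
        String host = url.replaceFirst("^[a-z]+://", "").split("/")[0];
        return host.endsWith(suffix);
    }

    // File extension rule: "*.jpg" is treated as the suffix ".jpg"
    // and checked against the end of the URL path.
    static boolean matchesExtension(String url, String ext) {
        return url.endsWith(ext.replace("*", ""));
    }
}
```

So with the defaults from our example, "http://penny-arcade.com/forums/image1.jpg"
passes the prefix rule for "http://penny-arcade.com" and the extension rule
for "*.jpg".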
===============================================================================
Checkbox Options:
===============================================================================
FIRST OPTION:
Now it's time to set the checkbox options. If the first option is left
unticked, the folder hierarchy of all files/folders on the site will be
preserved when saved to disk. So a file of:
"http://penny-arcade.com/tpl/latestcomic_tab.gif"
will get saved into "<leeching-directory>\penny-arcade.com\tpl\latestcomic_tab.gif"
whereas if the "Save all files to same folder" option is ticked then all files
will simply be saved into the same folder for that site, so our example file
would now be saved into "<leeching-directory>\penny-arcade.com\latestcomic_tab.gif"
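The two save modes can be sketched like this. The names are illustrative,
not BlueLeech's actual code, and "/" is used as the path separator for
simplicity (the README examples above show Windows "\" paths).

```java
public class SavePath {
    // Map a URL to a local save path under baseDir.
    static String localPath(String baseDir, String url, boolean flatten) {
        String rest = url.replaceFirst("^[a-z]+://", ""); // "host/dirs/file"
        String[] parts = rest.split("/");
        if (flatten) {
            // "Save all files to same folder": keep only host + file name
            return baseDir + "/" + parts[0] + "/" + parts[parts.length - 1];
        }
        // Default: mirror the site's folder hierarchy under the host folder
        return baseDir + "/" + String.join("/", parts);
    }
}
```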
SECOND OPTION:
The second option, "Do not spider through child URL links", restricts the
leech to downloading only those pages that are *directly* linked to by the
URL specified in the top textbox. For our example that would mean only
pages/files that are directly linked from http://penny-arcade.com.
Once it has downloaded those files it will *not* go searching the
downloaded pages/files for further links to follow.
THIRD OPTION:
The third option, "Only download files with a valid domain suffix OR a valid
URL prefix", is the most difficult to explain. By default BlueLeech will
download *ALL* URLs that it finds links to on any page that it is processing,
but it will only process those pages it has downloaded that match the
allow/deny lists. By "process" I mean actually search through the downloaded
page looking for other URLs to download. If the third option is enabled then
BlueLeech will *NOT* download URL links it finds that are not explicitly in
the allow/deny lists. This is where it gets tricky: this option is great for
stopping BlueLeech from downloading heaps of useless advertisements and the
like, but it can render the leech useless if any of the files that you are
after are hosted on external sites. (See the note on "External File Hosting
Problems" for more details.)
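The download/process split described above reduces to two small decisions.
A sketch, with made-up names (this is not BlueLeech's actual code):

```java
public class FetchPolicy {
    // Should a discovered link be downloaded at all?
    static boolean shouldDownload(boolean matchesAllowLists, boolean thirdOptionOn) {
        // Default: everything linked to is downloaded.
        // Third option on: only URLs passing the allow/deny lists are fetched.
        return !thirdOptionOn || matchesAllowLists;
    }

    // Should a downloaded page be searched for further links?
    static boolean shouldProcess(boolean matchesAllowLists) {
        // Only pages matching the allow/deny lists are ever processed,
        // regardless of the third option.
        return matchesAllowLists;
    }
}
```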
FOURTH OPTION:
The fourth option, "Overwrite existing files", means that if it is disabled
and a URL is found to be already stored locally on disk, BlueLeech will
assume it has already been downloaded previously; rather than downloading
and processing the URL again, it will simply read the local copy from disk
and process it for URL links. You can force BlueLeech to *always* re-download
all URL pages/files by enabling this option.
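The cache check behind this option amounts to one test. A minimal sketch
with an illustrative name (not BlueLeech's actual code):

```java
import java.io.File;

public class CachePolicy {
    // Decide whether a URL must actually be fetched from the network.
    static boolean needsDownload(String localPath, boolean overwriteExisting) {
        // With overwrite off, an existing local copy is reused and only
        // re-scanned for links; otherwise the URL is downloaded (again).
        return overwriteExisting || !new File(localPath).exists();
    }
}
```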
===============================================================================
Extra Notices and Bugs
===============================================================================
Default leeching directory:
By default all leeched files are saved into the "leeched" directory of the
"BlueLeech" folder. This default resets every time the program is restarted
or the "New Leech" button is pressed.
External File Hosting Problems:
Sometimes a site may store most of its pages locally but host some of its
larger files/images on other web sites. For example "www.blah.com" might
store all its image files on "www.imageserver.somedomain.com". In such a
case, if the third option ("Only download files with a valid domain suffix
OR a valid URL prefix") is enabled, then *all* of the images hosted on
"www.imageserver.somedomain.com" will be skipped by BlueLeech. There are
two ways around this. Firstly, you could add an explicit allow entry for
the domain suffix "imageserver.somedomain.com", or an allow entry for
the URL prefix "www.imageserver.somedomain.com", which would allow all
images hosted externally by "www.imageserver.somedomain.com" to be
downloaded by BlueLeech. The downside to this is that if there are many
files hosted in many different external locations you will have to add an
awful lot of allow entries.