<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>Arale User Manual</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
<style>
body {
background-color : #FFFFFF;
font-family: Verdana, Geneva, Arial, Helvetica, sans-serif;
font-size: x-small;
color: #000000;
}
td, p, li, a {
font-family: Verdana, Geneva, Arial, Helvetica, sans-serif;
font-size: x-small;
}
code, pre {
font-family: monospace;
font-size: x-small;
}
</style>
</head>
<body>
<p style="font-size:small"><b>Arale User Manual</b>
<p>
author: Flavio Tordini<br>
email: <a href="mailto:flaviotordini@tiscali.it">flaviotordini@tiscali.it</a><br>
web: <a href="http://web.tiscali.it/_flat">http://web.tiscali.it/_flat</a>
<p>
<a href="#intro">Introduction</a><br>
<a href="#get">Getting Arale</a><br>
<a href="#sys">System Requirements</a><br>
<a href="#install">Installing Arale</a><br>
<a href="#run">Running Arale</a><br>
<a href="#settings">Arale settings</a><br>
<a href="#build">Building Arale</a><br>
<p><a name="intro"></a><b>Introduction</b>
<p>
Arale is a multithreaded web spider written in Java. While many bots are focused on page indexing, Arale is primarily designed for personal use. It fits the needs of advanced web surfers and web developers.
<p>
With Arale you can download entire web sites or specific resources from the web. Some real-life cases are:<br>
<li>You want to download only images, videos, MP3 or ZIP files from a site.</li>
<li>Manuals, articles or ebooks are split across many files to discourage download.</li>
<li>User-unfriendly sites: popups, banners and tricky JavaScript get in your way before you can download a resource.</li>
<p>
<i>Multithreaded</i> means that Arale can download more than one file simultaneously. Arale can easily saturate your bandwidth, thus providing the fastest possible download speed for your internet connection.
<p>
If you're developing dynamic sites using technologies such as JSP, PHP or ASP, you may be interested in rendering dynamic pages to static files.
Arale supports URL renaming: the query string is encoded in the static filename and an .html extension is appended.
For example:
<p>
original URL: <code>mypage.jsp?myparam=myvalue</code><br>
static filename: <code>mypage.jsp!myparam=myvalue.html</code><br>
<p>
Existing links to renamed URLs are substituted with modified links. This preserves navigation among static files.
Once a dynamic site is transformed into a set of static files, it can be deployed on a server that does not support dynamic pages. For example, you may deploy a JSP site on free web space.
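<p>
The renaming rule shown in the example above can be sketched in a few lines of Java. This is only an illustration, not Arale's actual code: the class <code>UrlRenamer</code> and its methods are hypothetical names, and the sketch assumes that URLs without a query string keep their name and that the query string is copied into the filename unchanged.

```java
public class UrlRenamer {

    // Hypothetical helper (not Arale's real API): maps a dynamic URL to a
    // static filename by replacing '?' with '!' and appending ".html".
    static String toStaticFilename(String url) {
        int q = url.indexOf('?');
        if (q == -1) {
            // Assumption: a URL without a query string is left unchanged.
            return url;
        }
        String path = url.substring(0, q);
        String query = url.substring(q + 1);
        return path + "!" + query + ".html";
    }

    // Substitutes every occurrence of the original link in a page with the
    // renamed link, so navigation among the static files keeps working.
    static String rewriteLinks(String page, String originalUrl) {
        return page.replace(originalUrl, toStaticFilename(originalUrl));
    }

    public static void main(String[] args) {
        String url = "mypage.jsp?myparam=myvalue";
        System.out.println(toStaticFilename(url));
        System.out.println(rewriteLinks("href=\"" + url + "\"", url));
    }
}
```

<p>
With the URL from the example, <code>toStaticFilename</code> returns <code>mypage.jsp!myparam=myvalue.html</code>, and <code>rewriteLinks</code> points existing links at that file.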
<p>
Currently Arale is a command-line tool. It would be nice to develop a GUI for it. I'd like to have some feedback from users, so if you think it's worth it, send me an email and tell me what you think. ;)
<p><a name="get"></a><b>Getting Arale</b>
<p>
The latest version of Arale can be downloaded from <a href="http://web.tiscali.it/_flat">http://web.tiscali.it/_flat</a>.
The distribution includes the Arale sources along with build scripts (see <a href="#build">Building Arale</a>).
<p><a name="sys"></a><b>System Requirements</b>
<p>
In order to run Arale, you need the Java Development Kit (JDK) or the Java Runtime Environment (JRE) installed on your system.
Arale requires Java 2. The recommended Java version for running Arale is Java 2 version 1.3 or later.
<li><a href="http://java.sun.com/j2se/">Java Development Kit</a></li>
<li><a href="http://java.sun.com/j2se/">Java Runtime Environment</a></li>
<p><a name="install"></a><b>Installing Arale</b>
<p>
Simply extract the Arale distribution archive to a directory. Make sure you have the JAVA_HOME environment variable pointing to the Java Development Kit installation directory.
As an option you may set an ARALE_OPTS environment variable. The value of ARALE_OPTS contains command line arguments that should be passed to the Java Virtual Machine when starting Arale. For example, you can define properties or set the maximum Java heap size.
The following sets up the environment on Windows:
<pre>
set JAVA_HOME=c:\jdk1.3.1
set ARALE_HOME=c:\arale
set ARALE_OPTS=-mx32m
</pre>
To complete the Arale installation, run <code>windows/setup.bat</code> in the Arale installation directory. This will create shortcuts to Arale and integrate Arale with Internet Explorer. Cool!
<p>
On Unix (bash):
<pre>
export JAVA_HOME=/usr/local/jdk-1.3.1
export ARALE_HOME=/usr/local/arale
export ARALE_OPTS=-mx32m
</pre>
<p><a name="run"></a><b>Running Arale</b>
<p>
Running Arale is simple once you have installed it as described in the previous section. Just type <code>arale</code> followed by a URL.
<pre>arale http://web.tiscali.it/_flat</pre>
By default Arale reads its settings from the <code>arale.properties</code> file. You can override this behaviour by typing:
<pre>arale http://web.tiscali.it/_flat -settings mysettings.properties</pre>
Command-line option summary:
<pre>
Usage: arale [&lt;URL&gt;] [&lt;options&gt;]
-settings &lt;file&gt;: Use specified property file
-output &lt;dir&gt;: Use specified output directory
-version: Print Arale version and exit
-help: Print this message and exit
</pre>
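<p>
The <code>.properties</code> extension suggests that the settings file uses the standard Java properties format (one <code>key=value</code> pair per line). Assuming that is the case, the following sketch shows how such a file can be parsed with <code>java.util.Properties</code>. The class <code>SettingsSketch</code> is a hypothetical name, the values are illustrative rather than Arale's defaults, and this is not how Arale itself is necessarily implemented.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class SettingsSketch {

    // Parses settings text written in the java.util.Properties format.
    static Properties parse(String text) {
        Properties settings = new Properties();
        try {
            settings.load(new StringReader(text));
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new IllegalStateException(e);
        }
        return settings;
    }

    public static void main(String[] args) {
        // Illustrative values only; the keys are taken from this manual.
        String text = "output.directory=download\n"
                    + "download.tokens=.html .gif .jpg .css\n"
                    + "thread.count=4\n";
        Properties settings = parse(text);

        // Tokens are separated by spaces, as described for download.tokens.
        String[] tokens = settings.getProperty("download.tokens").split(" ");
        int threads = Integer.parseInt(settings.getProperty("thread.count"));

        System.out.println("output directory: " + settings.getProperty("output.directory"));
        System.out.println("download tokens: " + tokens.length);
        System.out.println("threads: " + threads);
    }
}
```

<p>
A file passed with <code>-settings</code> would contain the same kind of lines as the string in this sketch.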
<p><a name="settings"></a><b>Arale settings</b>
<ul>
<li><b>URL</b>: start URL</li>
<li><b>output.directory</b>: the Arale output directory. It may be a relative or an absolute path. Arale will put all downloaded files in subdirectories by recreating the directory structure found on the remote server.</li>
<li><b>download.tokens</b>: Arale will download URLs that contain these tokens. Tokens are separated by spaces, like this: <code>.html .gif .jpg .css</code>. What does <i>token</i> mean? A token is a series of characters Arale searches for when scanning files. When Arale finds a token specified by this parameter, it searches for the left and right limits of the hypothetical link. Then Arale tries to connect to that URL. If the resource is found, it is immediately downloaded to disk; otherwise Arale just keeps going.</li>
<li><b>scan.tokens</b>: Arale will scan URLs that contain these tokens. Tokens are separated by spaces. URLs containing these tokens should all have a text/html content type. Resources found with these tokens will be scanned for new links. They will not be downloaded if they are not in the download.tokens list.</li>
<li><b>force.html.scanning</b>: Force scanning of resources having a text/html content type, even if they are not listed in scan.tokens.</li>
<li><b>ensure.html.scanning</b>: Ensure that only resources having a text/html content type will be scanned. For example a dynamic resource (.jsp, .asp ...) may return any content type, not only text/html.</li>
<li><b>domain.depth</b>: How many domain levels deep Arale should follow links. 1 means no domain change. Increasing this value will dramatically increase the number of followed links. For example, 2 means Arale will crawl the starting domain plus all domains linked from the starting domain's pages.</li>
<li><b>file.minsize</b>: Minimum downloaded file size. All files smaller than this value will be discarded. -1 means Arale will ignore this setting.</li>
<li><b>file.maxsize</b>: Maximum downloaded file size. All files bigger than this value will be discarded. -1 means Arale will ignore this setting.</li>
<li><b>file.download.unknown.size</b>: Tells Arale whether to download files whose size cannot be determined in advance. Sometimes the web server does not report the file size; in that case Arale uses this setting to decide what to do. This value may be true or false.</li>
<li><b>thread.count</b>: This is the number of threads Arale will allocate. In practice this is the number of simultaneous HTTP connections. Choosing a higher value may increase processing speed, but may also st