Scoopi Installation and Quick Start
The easiest way to get started with Scoopi is to pull the image from DockerHub and run it straight away. The Scoopi Docker image comes pre-configured with JRE 8. If you are not using Docker, download the release from GitHub instead. Both options are explained here.
Install Scoopi from Docker Image
Scoopi releases are available as a Docker image on DockerHub. To run the image, you need Docker installed on the system. The total download size of the Scoopi Docker image is about 120MB. The following command downloads the Scoopi image, then creates and runs a container named scoopi.
docker run --name scoopi codetab/scoopi
It executes the quick-start example, which writes a single record to an output file. However, we can neither view the output file nor modify the conf files, as they are inside the container. We need to externalize these folders with the following commands.
mkdir scoopi
cd scoopi
docker cp scoopi:/scoopi/conf .
docker cp scoopi:/scoopi/output .
docker cp scoopi:/scoopi/docker .
docker cp scoopi:/scoopi/defs .
docker cp scoopi:/scoopi/logs .
Here, we make a folder named scoopi and then copy the conf, output, docker, defs and logs folders from the container into it. Now we can modify the conf and defs files, and also view the output file, without logging into the container. Next, remove the container, as we are going to recreate it with a new set of parameters.
docker rm scoopi
Let’s run example 10 to output more data. To do that, edit the conf/scoopi.properties file, change the defs directory property to scoopi.defs.dir=/defs/examples/fin/jsoup/ex-10, and run Scoopi with the following docker command.
docker run --name scoopi --rm -p 9010:9010 -v "$PWD"/defs:/scoopi/defs -v "$PWD"/conf:/scoopi/conf -v "$PWD"/output:/scoopi/output codetab/scoopi
The above command mounts the externalized folders using the -v option. When the container runs, it uses the definitions from jsoup/ex-10. On completion, we should have a new data.txt file in the output folder with 281 lines of data.
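To confirm the run produced what the text describes, the line count of the output file can be checked from the host. The following is a minimal sketch; the helper names count_lines and check_output are hypothetical and not part of Scoopi, and the figure 281 is simply the count quoted above.

```shell
# Hypothetical helpers (not shipped with Scoopi) to check an output file.

# count_lines FILE: print the number of newline-terminated lines in FILE.
count_lines() {
    # tr strips the padding that some wc implementations add
    wc -l < "$1" | tr -d ' '
}

# check_output FILE EXPECTED: succeed only if FILE has EXPECTED lines.
check_output() {
    co_actual=$(count_lines "$1")
    if [ "$co_actual" -eq "$2" ]; then
        echo "ok: $1 has $co_actual lines"
    else
        echo "mismatch: $1 has $co_actual lines, expected $2"
        return 1
    fi
}

# Usage, from the scoopi folder created earlier:
#   check_output output/data.txt 281
```

Note that wc -l counts newline characters, so a file whose last line lacks a trailing newline would report one line fewer.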
Scoopi comes with an Angular dashboard that displays internal metrics of the app. It can be accessed at http://localhost:9010 while Scoopi is running.
Install Scoopi from GitHub
If you are not using Docker, install Scoopi either by downloading the release package, which contains all dependencies, or by building the source code with Maven. To run Scoopi, you need JRE 8 or above installed.
Download and install the Release package
Download the latest release zip file, scoopi-x.x.x-production.zip, from GitHub Scoopi Releases and extract it to some location.
Download and build the Source
Alternatively, you can download the Scoopi source code zip from GitHub. To build it, extract it somewhere and from the project root folder run
mvn package -DskipTests
Maven compiles the source, downloads the dependencies and packages the app as scoopi-x.x.x-production.zip in the target folder. Extract target/scoopi-x.x.x-production.zip to some location.
Quick start
Go to the extracted folder of scoopi-x.x.x-production.zip. The directory structure is as below.
scoopi-x.x.x/
├── conf
│   ├── scoopi.properties
│   └── log4j2.xml
├── defs
│   └── examples
│       ├── fin
│       └── book
├── scoopi.bat
├── scoopi.sh
└── lib
    ├── scoopi-x.x.x.jar
    └── ....
The application jar file scoopi-x.x.x.jar is in the lib folder along with the other dependencies. The conf folder holds the configuration files; the main configuration file is conf/scoopi.properties. By default, the following two properties are defined.
scoopi.defs.dir=/defs/examples/fin/jsoup/quickstart
scoopi.datastore.enable=false
The property scoopi.defs.dir points to jsoup/quickstart, which is loaded when we run Scoopi. The other property, scoopi.datastore.enable, is set to false, which runs Scoopi without persistence. A later chapter shows how to configure the datastore and use it to persist Scoopi objects. Till then, leave it set to false.
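Before running Scoopi, it can help to check what these two properties are currently set to. The sketch below reads a value from a properties file with awk; get_prop is a hypothetical helper, not part of Scoopi, and it assumes plain key=value lines with no leading whitespace or line continuations.

```shell
# Hypothetical helper (not shipped with Scoopi).
# get_prop FILE KEY: print the value of KEY from a key=value properties file.
get_prop() {
    # Split on '=', match the key exactly, and print everything after
    # the first '=' so that values containing '=' are kept whole.
    awk -F= -v k="$2" '$1 == k { print substr($0, length(k) + 2); exit }' "$1"
}

# Usage, from the extracted scoopi-x.x.x folder:
#   get_prop conf/scoopi.properties scoopi.defs.dir
#   get_prop conf/scoopi.properties scoopi.datastore.enable
```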
Let’s run Scoopi and check the installation.
cd scoopi-x.x.x
./scoopi.sh    # on Windows, run scoopi.bat instead
It starts ScoopiEngine, loads the files defined in the defs/examples/fin/jsoup/quickstart folder and writes the data to the output/data.txt file.
Examples
Now a word about the examples. Scoopi comes with a set of example definition files, located in the defs/examples directory. The examples cover all aspects of definition files in a step-by-step approach.
The examples directory contains three folders: fin, book and quote.
The defs/examples/fin folder contains examples that scrape a company's financial data, such as Balance Sheet, Profit and Loss Account and Share price, from HTML pages located in the fin/page folder.
The defs/examples/book folder contains examples that scrape book details from a bookstore.
The defs/examples/quote folder contains examples that scrape quotes by famous personalities from a website that uses JavaScript and Ajax to load its pages.
Examples come in two flavors: JSoup, which uses Selectors to query data, and HtmlUnit, which uses XPath. This guide focuses on the JSoup examples, as JSoup is easy to use and light on memory. The HtmlUnit examples are the same as the JSoup ones but use XPath for queries.
As we progress through the guide, we cover the examples one by one. To load and run a particular example, modify the scoopi.defs.dir property in conf/scoopi.properties and point it to the required example folder.
While running the examples, you can disable persistence by setting scoopi.datastore.enable=false in the conf/scoopi.properties file.
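Since switching examples means editing the same two properties repeatedly, the edit can be scripted. The following is a minimal sketch under stated assumptions: set_prop is a hypothetical helper, not part of Scoopi, and it handles only plain key=value lines. It rewrites the file through a temporary copy rather than relying on sed -i, whose syntax differs between platforms.

```shell
# Hypothetical helper (not shipped with Scoopi).
# set_prop FILE KEY VALUE: set KEY to VALUE in a key=value properties file,
# replacing an existing line for KEY or appending one if KEY is absent.
# Note: awk -v interprets backslash escapes in VALUE; paths that use
# forward slashes, as in this guide, are unaffected.
set_prop() {
    sp_tmp="$1.tmp"
    awk -F= -v k="$2" -v v="$3" '
        $1 == k { print k "=" v; found = 1; next }   # replace matching line
        { print }                                    # keep all other lines
        END { if (!found) print k "=" v }            # append if not present
    ' "$1" > "$sp_tmp" && mv "$sp_tmp" "$1"
}

# Usage, from the folder that contains conf/:
#   set_prop conf/scoopi.properties scoopi.defs.dir /defs/examples/fin/jsoup/ex-10
#   set_prop conf/scoopi.properties scoopi.datastore.enable false
```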
In the next chapter, we start with the QuickStart example.