Scoopi Installation and Quick Start

The easiest way to get started with Scoopi is to pull the image from DockerHub and run it straight away. The Scoopi Docker image comes pre-configured with JRE 8 and MariaDB. If you are not using Docker, download the release from GitHub instead. Both options are explained here.

Install Scoopi from Docker Image

Scoopi releases are available as a Docker image on DockerHub. To run the image you need Docker installed on your system; to run it with a MariaDB database, you also need Docker Compose. The download size is about 120MB for the Scoopi image and about 130MB for the MariaDB image.

The following command downloads the Scoopi image, then creates and runs a container named scoopi.

docker run --name scoopi codetab/scoopi

It executes the quick-start example, which writes a single record to an output file. However, we can neither view the output file nor modify the conf files, because they live inside the container. We need to externalize these folders with the following commands.

mkdir scoopi
cd scoopi
docker cp scoopi:/scoopi/conf .
docker cp scoopi:/scoopi/output .
docker cp scoopi:/scoopi/docker .
docker cp scoopi:/scoopi/defs .
docker cp scoopi:/scoopi/logs .

Here, we make a folder named scoopi and then copy the conf, output, docker, defs and logs folders from the container into it. Now we can modify the conf and defs files, and view the output file, without logging into the container. Next, remove the container, as we are going to recreate it with a new set of parameters.

docker rm scoopi

Let’s run example 10 to output more data. To do that, edit the conf/scoopi.properties file, change the defs directory property to scoopi.defs.dir=/defs/examples/fin/jsoup/ex-10, and run Scoopi with the following docker command.

docker run --name scoopi --rm -p 9010:9010 -v "$PWD"/defs:/scoopi/defs -v "$PWD"/conf:/scoopi/conf -v "$PWD"/output:/scoopi/output codetab/scoopi

The above command mounts the externalized folders using the -v option. When the container runs, it uses the definitions from jsoup/ex-10. On completion, we should have a new data.txt file in the output folder with 281 lines of data.
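To confirm that a run produced the expected output, the line count can be checked from the shell. The following is a minimal sketch: check_lines is a hypothetical helper, not part of Scoopi, and it is demonstrated on a scratch file so that it runs without a container.

```shell
#!/bin/sh
# check_lines FILE EXPECTED
# Hypothetical helper (not part of Scoopi): succeeds only when FILE exists
# and contains exactly EXPECTED newline-terminated lines.
check_lines() {
    if [ ! -f "$1" ]; then
        echo "missing: $1" >&2
        return 1
    fi
    # wc -l may pad its output with spaces on some systems; strip them.
    actual=$(wc -l < "$1" | tr -d ' ')
    if [ "$actual" -eq "$2" ]; then
        echo "ok: $1 has $actual lines"
    else
        echo "mismatch: $1 has $actual lines, expected $2" >&2
        return 1
    fi
}

# Demonstration on a scratch file with three lines.
scratch=$(mktemp)
printf 'a\nb\nc\n' > "$scratch"
check_lines "$scratch" 3
```

After the docker run above, the equivalent call from the scoopi folder would be check_lines output/data.txt 281.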

Scoopi comes with an Angular dashboard that displays internal metrics of the app. It can be accessed at http://localhost:9010 while Scoopi is running.

Scoopi with MariaDB

To use MariaDB as the datastore, we need Docker Compose, which runs Scoopi and MariaDB in two separate containers. To do that, first move docker/docker-compose.yml to the scoopi folder.

cd scoopi
mv docker/docker-compose.yml .

Next, edit conf/scoopi.properties and change the property scoopi.useDatastore=false to scoopi.useDatastore=true. Once the configuration is ready, start the database.

docker-compose up scoopi-db

Docker downloads the latest MariaDB image and runs it as a container. On first run, it creates and initializes the database and adds users and privileges. It also creates a new folder named data, which contains the MariaDB data files.

If a MySQL client is installed on your system, you can log into the database with

mysql -pbar -u foo -h 127.0.0.1 -P 3306 scoopi
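On first run the database can take a while to initialize, so a script that connects right after starting the container may need to wait. A minimal sketch of a retry loop follows. Here retry is a hypothetical helper, and the one-second delay and the attempt counts are arbitrary assumptions.

```shell
#!/bin/sh
# retry MAX_ATTEMPTS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds, sleeping one second
# between attempts, and gives up after MAX_ATTEMPTS failed attempts.
retry() {
    max=$1
    shift
    attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep 1
    done
}

# Demonstration with a command that always succeeds.
retry 3 true && echo "ready"
```

Against the database started above, the call might look like retry 30 mysql -pbar -u foo -h 127.0.0.1 -P 3306 scoopi -e 'SELECT 1', where -e makes the client run a single statement and exit.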

Now stop the database container with Ctrl+C. With that, the one-time setup is complete. From now on, we start Scoopi with MariaDB using the following command.

docker-compose up --abort-on-container-exit

The above command brings up MariaDB and then Scoopi in separate containers. Scoopi stores locators, documents and parsed data in the database.

Install Scoopi from GitHub

If you are not using Docker, install Scoopi either by downloading the release package, which contains all dependencies, or by building the source code with Maven. To run Scoopi, you need JRE 8 or above. For persistence, if enabled, you also need to install, set up and configure a database such as MariaDB or HSQLDB. We explain the HSQLDB installation in a later chapter on persistence.

Download and install the Release package

Download the latest release zip file scoopi-x.x.x-production.zip from GitHub Scoopi Releases and extract the zip file to some location.

Download and build the Source

Alternatively, you can download the Scoopi source code zip from GitHub. To build it, extract it somewhere and from the project root folder run

mvn package -DskipTests

Maven compiles the source, downloads the dependencies and packages the app as scoopi-x.x.x-production.zip in the target folder. Extract target/scoopi-x.x.x-production.zip to some location.

Quick start

Go to the extracted folder of scoopi-x.x.x-production.zip. The directory structure is as below.

scoopi-x.x.x/
├── conf
│   ├── scoopi.properties
│   ├── jdoconfig.properties
│   ├── log4j.properties
│   └── logback.xml
├── defs
│   └── examples
│       ├── jsoup
│       ├── htmlunit
│       └── page
├── scoopi.bat
├── scoopi.sh
└── lib
    ├── scoopi-x.x.x.jar
    └── ....

The application jar file scoopi-x.x.x.jar is in the lib folder along with other dependencies. The conf folder holds the configuration files; the main one is conf/scoopi.properties. By default, the following two properties are defined.

scoopi.defs.dir=/defs/examples/fin/jsoup/quickstart
scoopi.useDatastore=false

The property scoopi.defs.dir points to jsoup/quickstart, which is loaded when we run Scoopi. The other property, scoopi.useDatastore, is set to false, which allows us to run Scoopi without setting up a database. In a later chapter, we show how to set up a database and use it to persist Scoopi objects. Until then, leave it set to false.

Let’s run Scoopi and check the installation.

cd scoopi-x.x.x
./scoopi.sh               # scoopi.bat on Windows

It starts ScoopiEngine, loads the files defined in the defs/examples/fin/jsoup/quickstart folder and outputs data to the output/data.txt file.

Examples

Now a word about the examples. Scoopi comes with a set of example definition files, located in the defs/examples directory. The examples cover all aspects of definition files in a step-by-step approach.

The examples directory contains three folders: fin, book and quote.

The defs/examples/fin folder contains examples that scrape a company's financial data, such as Balance Sheet, Profit and Loss Account and Share price, from HTML pages located in the fin/page folder.

The defs/examples/book folder contains examples that scrape book details from a bookstore.

The defs/examples/quote folder contains examples that scrape quotes by famous personalities from a website that uses JavaScript and Ajax to load its pages.

Examples come in two flavors: JSoup, which uses Selectors to query data, and HtmlUnit, which uses XPath. This guide focuses on the JSoup examples, as JSoup is easy to use and light on memory. The HtmlUnit examples are the same as the JSoup ones but use XPath for queries.

As we progress through the guide, we cover the examples one by one. To load and run a particular example, modify the scoopi.defs.dir property in conf/scoopi.properties to point to the required example folder.
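Switching between examples means editing the same property repeatedly, so the change can be scripted. The following is a minimal sketch, assuming the file holds plain key=value lines. set_prop is a hypothetical helper, not part of Scoopi, and it is demonstrated on a scratch copy so that it runs without an installation.

```shell
#!/bin/sh
# set_prop FILE KEY VALUE
# Hypothetical helper (not part of Scoopi): replaces the "KEY=..." line in a
# properties file, or appends the line when the key is absent.
# Assumes VALUE contains no "|", "&" or "\" (special in the sed replacement).
set_prop() {
    file=$1
    key=$2
    value=$3
    # Escape dots so "scoopi.defs.dir" is matched literally, not as a regex.
    key_re=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -q "^${key_re}=" "$file"; then
        sed "s|^${key_re}=.*|${key}=${value}|" "$file" > "$file.tmp" &&
            mv "$file.tmp" "$file"
    else
        printf '%s\n' "${key}=${value}" >> "$file"
    fi
}

# Demonstration on a scratch copy holding the two default properties.
demo=$(mktemp -d)
printf '%s\n' \
    'scoopi.defs.dir=/defs/examples/fin/jsoup/quickstart' \
    'scoopi.useDatastore=false' > "$demo/scoopi.properties"

set_prop "$demo/scoopi.properties" scoopi.defs.dir /defs/examples/fin/jsoup/ex-10
cat "$demo/scoopi.properties"
```

For a real installation the first argument would be conf/scoopi.properties. The "|" delimiter is used in the sed expression because the property value contains "/".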

While running the examples, you can disable persistence by setting scoopi.useDatastore=false in the conf/scoopi.properties file. There is no harm in running the examples with useDatastore set to true, but ensure that the database is up and running; otherwise Scoopi throws a database not found error.

In the next chapter, we start with the QuickStart example.