Persistence
With persistence enabled, Scoopi offers the following benefits:
- reduces network usage by reusing downloaded pages
- recovers from an aborted run by skipping tasks that have already completed
- avoids expensive parse operations by reusing persisted data
- allows an expiry to be set for each page
Enable Persistence
By default, persistence is disabled in Scoopi through the setting scoopi.datastore.enable=false in the conf/scoopi.properties file. To enable persistence, either set scoopi.datastore.enable=true or comment the line out with a hash (#). By default, Scoopi objects are persisted in the data/ directory; the location is configurable through the scoopi.datastore.path property.
Now run Example 10, and the Scoopi objects are stored in the data directory.
Live
When we first run Example 10 with persistence enabled, the fetched pages (acme-snapshot-links.html, acme-bs.html and acme-pl.html) and the parsed data are compressed and stored in the data folder. If we run Scoopi again, the pages are re-fetched and new document and data objects are created. On every run, Scoopi fetches fresh pages and parses them to create data, even though persistence is enabled. This is because the live setting, which controls the expiry of a page, defaults to zero days (P0D). Use the live property in a task group to alter this default behavior.
Suppose we want to fetch a new snapshot page once a week. Example 11 sets the live property to P1W for snapshotGroup, as shown below:
taskGroups:
  snapshotGroup:
    priceTask:
      dataDef: price
    snapshotTask:
      dataDef: snapshot
    live: P1W
P0D and P1W are ISO 8601 duration representations (zero days and one week, respectively).
During the initial run, Scoopi fetches fresh pages and persists them. In subsequent runs, it reuses the persisted pages and data until the pages expire. This makes subsequent runs many times faster.
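The effect of the live setting can be illustrated with a small sketch. It is not Scoopi's implementation: the helper names parse_duration and is_expired are hypothetical, the parser handles only a subset of ISO 8601 durations (weeks, days and time parts, no years or months), and it assumes a page expires once the live duration has elapsed since it was fetched.

```python
import re
from datetime import datetime, timedelta

# Subset of ISO 8601 durations: weeks, days and time parts.
# Years and months are left out because their length is not fixed.
_DURATION = re.compile(
    r"^P(?:(?P<weeks>\d+)W)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(text: str) -> timedelta:
    """Convert a duration such as 'P0D' or 'P1W' to a timedelta (hypothetical helper)."""
    match = _DURATION.match(text)
    # Reject strings with no components, such as 'P', 'PT' or 'P1DT'.
    if match is None or text == "P" or text.endswith("T"):
        raise ValueError(f"unsupported duration: {text!r}")
    parts = {name: int(value) for name, value in match.groupdict().items() if value}
    return timedelta(**parts)


def is_expired(fetched_at: datetime, live: str, now: datetime) -> bool:
    """Assumed rule: a page expires once 'live' has elapsed since it was fetched."""
    return now >= fetched_at + parse_duration(live)


if __name__ == "__main__":
    fetched = datetime(2024, 1, 1)
    # With P0D the page is already expired at fetch time, so every run re-fetches.
    print(is_expired(fetched, "P0D", fetched))
    # With P1W the persisted page is still usable four days later.
    print(is_expired(fetched, "P1W", datetime(2024, 1, 5)))
```

Under this assumed rule, P0D leaves no window in which a persisted page can be reused, which matches the re-fetch on every run described above, while P1W allows reuse for seven days.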
Persist Control
We can further control the persistence of locators and data using the persist properties. In conf/scoopi.properties, we can set the following properties:
scoopi.datastore.enable=true|false
scoopi.persist.locator=true|false
scoopi.persist.data=true|false
If scoopi.datastore.enable is true, then:
- if scoopi.persist.locator is false, locators and their documents are not stored
- if scoopi.persist.data is false, no data is stored
If scoopi.persist.data is true, we can additionally control whether to persist the data of each task using the persist/data property in the task definition. Example 12 does this as shown below:
taskGroups:
  snapshotGroup:
    priceTask:
      dataDef: price
      persist:
        data: true
    snapshotTask:
      dataDef: snapshot
      persist:
        data: false
    live: P1W
The data parsed by priceTask is stored, but the data parsed by snapshotTask is not, because persist/data is false for that task.
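The way the global properties and the task-level persist/data setting combine can be sketched as a pair of small functions. This is an illustration of the rules described above, not Scoopi's code: the function names are hypothetical, and the sketch assumes that any property or task setting left unspecified is treated as true.

```python
def _flag(props: dict, key: str) -> bool:
    """Read a boolean property; a missing key is assumed to mean true."""
    return str(props.get(key, "true")).strip().lower() == "true"


def persists_locator(props: dict) -> bool:
    """Locators and their documents are stored only if both flags are on."""
    return _flag(props, "scoopi.datastore.enable") and _flag(
        props, "scoopi.persist.locator"
    )


def persists_data(props: dict, task: dict) -> bool:
    """Data is stored only if the global flags and the task's persist/data allow it."""
    if not _flag(props, "scoopi.datastore.enable"):
        return False
    if not _flag(props, "scoopi.persist.data"):
        return False
    # Task-level override; assumed to default to true when absent.
    return bool(task.get("persist", {}).get("data", True))


if __name__ == "__main__":
    props = {"scoopi.datastore.enable": "true", "scoopi.persist.data": "true"}
    # Task definitions as they might look after loading the YAML above.
    price_task = {"dataDef": "price", "persist": {"data": True}}
    snapshot_task = {"dataDef": "snapshot", "persist": {"data": False}}
    print(persists_data(props, price_task))
    print(persists_data(props, snapshot_task))
```

The global switches act as gates that are checked first, so a task-level persist/data: true cannot turn storage back on when scoopi.persist.data or scoopi.datastore.enable is false.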
The next chapter describes how to split lengthy definitions into multiple files for easier maintenance.