Defs, Locators and Tasks
Scoopi uses YML definition files to extract data from HTML pages. To help you
learn the YML elements used in definition files, the Scoopi distribution
ships with a set of examples in the defs/examples folder.
Scoopi Definition Files
Scoopi creates the data model based on YML definition files. The directory
that holds the definition files is specified with the scoopi.defs.dir
configuration property, which is normally set in the scoopi.properties file
located in the conf folder. By default, it is set to
defs/examples/fin/jsoup/quickstart, which loads the quickstart example.
As we progress through the examples, edit the conf/scoopi.properties file
and set the scoopi.defs.dir property to the directory of the example you
want to run.
Def file
The def file holds the definitions required to run Scoopi. In the examples,
the definition file is named job.yml, but it can be named anything as long
as the file extension is yml. In other words, Scoopi loads any file in the
defs directory with the yml extension as a definition file.
The top-level elements in job.yml are
- locatorGroups
- taskGroups
- dataDefs
In this chapter, we go through the Quickstart job.yml and explain the locatorGroups and taskGroups elements. Refer to Scoopi Installation to learn how to run Scoopi and the examples.
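As a rough sketch of the overall layout, the three top-level elements sit side by side at the root of the file. The empty {} values are placeholders added here only to show the shape; they are not a runnable definition, and the real content of each element is covered in this chapter and the next.

```yaml
# Layout sketch only: {} stands for content left out here,
# it is not meant as a valid empty definition.
locatorGroups: {}   # named groups of locators (pages to load)
taskGroups: {}      # named groups of tasks to run on those pages
dataDefs: {}        # data definitions, described in the next chapter
```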
LocatorGroups
LocatorGroups defines a list of locators. Each locator specifies the name and URL of an HTML page to fetch from the web or the local file system.
In the example job.yml, locatorGroups is defined as
defs/examples/fin/jsoup/quickstart/job.yml
locatorGroups:
  snapshotGroup:
    locators: [
      { name: acme, url: "/defs/examples/fin/page/acme-snapshot.html" }
    ]
It defines a locatorGroup named snapshotGroup, which in turn defines one
locator. The locator name is acme, and its url points to the local HTML
file acme-snapshot.html in the defs/examples/fin/page folder.
Here is one more example with two groups
locatorGroups:
  groupA:
    locators: [
      { name: acme, url: "/defs/examples/fin/page/acme-snapshot.html" },
      { name: exPage, url: "http://example.org" }
    ]
  groupB:
    locators: [
      { name: acme, url: "/defs/examples/fin/page/acme-bs.html" }
    ]
It defines two locatorGroups named groupA and groupB. The first group defines two locators and the second group defines one locator. To scrape pages from a website, specify the actual address of the page, such as http://example.org.
Note that the above examples use the JSON-style array construct with [ ] and { }, which lets us define one locator per line. You are free to use the slightly lengthier YML array construct instead, as shown below
locatorGroups:
  groupA:
    locators:
      - name: acme
        url: "/defs/examples/fin/page/acme-snapshot.html"
      - name: exPage
        url: "http://example.org"
TaskGroups
Once a page is loaded through a locator, Scoopi has to run one or more tasks on it. The taskGroups property defines the tasks to execute on the pages loaded by the locators.
The taskGroups snippet from the example job.yml is
defs/examples/fin/jsoup/quickstart/job.yml
taskGroups:
  snapshotGroup:
    priceTask:
      dataDef: price
The taskGroups element defines a task group named snapshotGroup. The task group has a task named priceTask with a property named dataDef whose value is price.
Scoopi executes this task for all locators defined for snapshotGroup in locatorGroups.
At this point, Scoopi knows
- which pages to download or load
- which tasks to execute for which page
- which dataDef to use for a task
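Putting the pieces together, the sketch below annotates a job.yml with the three points above. The locatorGroups and taskGroups entries follow the quickstart snippets. The second locator (the name beta and its file name) is hypothetical, added only to show that one task runs for every locator in the group. The dataDefs element, which the file also needs, is left out here.

```yaml
locatorGroups:
  snapshotGroup:            # group name, matched by the task group below
    locators:
      # which pages to download or load
      - name: acme
        url: "/defs/examples/fin/page/acme-snapshot.html"
      # hypothetical second page, not part of the quickstart example
      - name: beta
        url: "/defs/examples/fin/page/beta-snapshot.html"

taskGroups:
  snapshotGroup:            # same name, so its tasks run for acme and beta
    priceTask:              # which task to execute for those pages
      dataDef: price        # which dataDef the task uses
```

Because the task group shares its name with the locator group, adding another locator to snapshotGroup should be enough for priceTask to run on that page too, without any change to taskGroups.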
In the next chapter, we describe dataDefs, which are used to parse the data from the page.