Create Locators from Links
Defining each and every link in job.xml makes the definitions lengthy. Instead, Scoopi can scrape links from any page and create locators dynamically, which lets you scrape web pages recursively. Let’s see how to create locators from scraped links.
Link Scrape Step
Example 9 scrapes the Balance Sheet and Profit & Loss links from the acme-snapshot-links.html page. The links snippet in the HTML page is shown below.
defs/examples/fin/page/acme-snapshot-links.html
<!-- links to other pages -->
<div id="page_links">
  <li><strong>Financial</strong></li>
  <li><a href="acme-bs.html">Balance Sheet</a></li>
  <li><a href="acme-pl.html">Profit & Loss</a></li>
</div>
In job.xml, instead of defining locators for bs and pl, we define just one locator for acme-snapshot-links.html and a task named linkTask to scrape the links and convert them to locators.
locatorGroups:

  snapshotGroup:
    locators: [
      { name: acme, url: "/defs/examples/fin/page/acme-snapshot-links.html" }
    ]

taskGroups:

  snapshotGroup:
    priceTask:
      dataDef: price

    linkTask:
      dataDef: links
      steps:
        jsoupDefault:
          process:
            class: "org.codetab.scoopi.step.process.LocatorCreator"
            previous: parser
            next: seeder
In the jsoupDefault steps, the parser step hands over data to the filter step, which in turn hands over the filtered data to the appender. The workflow is
seeder -> loader -> parser -> filter -> appender
But the linkTask override inserts a new process step, with the class org.codetab.scoopi.step.process.LocatorCreator, between parser and seeder. It creates a new locator from the parsed link and hands it over to the seeder step, so the workflow becomes
seeder -> loader -> parser -> process (locator creator) -> seeder -> loader -> parser -> filter -> appender
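The rewiring described above can be pictured with a toy model. This is only a sketch in Python, not Scoopi's actual scheduler: the dictionary of "next step" links and the helper names with_override and walk are assumptions made for illustration.

```python
# Default jsoupDefault chain, modelled as "step -> next step" (illustrative only).
DEFAULT_NEXT = {
    "seeder": "loader",
    "loader": "parser",
    "parser": "filter",
    "filter": "appender",
    "appender": None,
}


def with_override(steps, name, previous, next_step):
    """Return a copy of steps with step `name` inserted after `previous`."""
    rewired = dict(steps)
    rewired[previous] = name
    rewired[name] = next_step
    return rewired


def walk(steps, start="seeder", stop_at=None):
    """Follow the chain from `start` until it ends or `stop_at` is reached."""
    path = []
    step = start
    while step is not None:
        path.append(step)
        if step == stop_at:
            break
        step = steps[step]
    return path


# linkTask overrides the chain: process sits between parser and seeder.
link_task_steps = with_override(DEFAULT_NEXT, "process",
                                previous="parser", next_step="seeder")

# First pass ends at the process step, which hands a new locator to seeder.
first_pass = walk(link_task_steps, stop_at="process")
# The new locator then runs through the unmodified default chain.
second_pass = walk(DEFAULT_NEXT)

print(" -> ".join(first_pass + second_pass))
```

The printed chain has the same shape as the workflow above: the first four steps belong to linkTask, and the remaining five to the task that handles the newly created locator.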
DataDef for Links
The linkTask uses the dataDef named links, which defines the link to scrape as follows
links:
  query:
    block: "#page_links > table > tbody > tr > td:nth-child(4) > ul"
  items: [
    item: { name: "link", linkGroup: bsGroup, index: 2,
            selector: "li:nth-child(%{index}) > a attribute: href",
            prefix: [ "/defs/examples/fin/page/" ] },
  ]
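To make the selector in the item easier to read, here is a minimal Python sketch of how the %{index} placeholder and the trailing attribute part might be resolved. It is not Scoopi's code: the function expand_selector and the way the string is split on " attribute: " are assumptions for illustration.

```python
def expand_selector(template, item):
    """Substitute %{key} placeholders from the item, then split off the attribute.

    Returns a (css_selector, attribute_name) pair. The attribute name is
    an empty string when the template has no " attribute: " part.
    """
    selector = template
    for key, value in item.items():
        selector = selector.replace("%{" + key + "}", str(value))
    css, _, attribute = selector.partition(" attribute: ")
    return css, attribute


item = {"name": "link", "linkGroup": "bsGroup", "index": 2}
css, attribute = expand_selector(
    "li:nth-child(%{index}) > a attribute: href", item)

print(css)        # li:nth-child(2) > a
print(attribute)  # href
```

With index 2, the selector picks the anchor in the second list entry, which is the Balance Sheet link in the snippet shown earlier, and reads its href attribute.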
The item defines the data item that holds the scraped link. The linkGroup property is the name of the task group to assign to the newly created locator. Let’s clarify this in detail. The task group of linkTask is snapshotGroup, so the parsed link initially belongs to snapshotGroup. But none of the dataDefs defined in snapshotGroup can parse the acme-bs.html page; only tasks in bsGroup can parse it, as they use the bs dataDef. So we need to assign the newly created bs locator to bsGroup, which is specified with the linkGroup property of the item. The change of groups in the workflow is shown below.
|------------------- snapshotGroup -------------------|    |------------------ bsGroup -------------------|
seeder -> loader -> parser -> process (locator creator) -> seeder -> loader -> parser -> filter -> appender
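The group switch can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Scoopi's implementation: the task name bsTask, the dictionary layout, and the choice to keep the parent locator's name are all hypothetical.

```python
# Hypothetical task groups; bsTask is an illustrative name, not from the example.
TASK_GROUPS = {
    "snapshotGroup": ["priceTask", "linkTask"],
    "bsGroup": ["bsTask"],
}


def create_locator(parent, item, url):
    """Build a new locator for a scraped link.

    The new locator is placed in the item's linkGroup rather than in the
    parent's group. Reusing the parent's name is an assumption.
    """
    return {"name": parent["name"], "group": item["linkGroup"], "url": url}


parent = {"name": "acme", "group": "snapshotGroup",
          "url": "/defs/examples/fin/page/acme-snapshot-links.html"}
item = {"name": "link", "linkGroup": "bsGroup", "index": 2}

new_locator = create_locator(
    parent, item, "/defs/examples/fin/page/acme-bs.html")

# Tasks are looked up by the locator's group, so bsGroup tasks now apply.
print(new_locator["group"], TASK_GROUPS[new_locator["group"]])
```

Because tasks are selected by group, the new locator is handled by the tasks of bsGroup and never by priceTask or linkTask.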
Prefix
Web pages use absolute or relative links. The example page acme-snapshot-links.html uses relative links, as shown below.
defs/examples/fin/page/acme-snapshot-links.html
<!-- links to other pages -->
<div id="page_links">
  <li><strong>Financial</strong></li>
  <li><a href="acme-bs.html">Balance Sheet</a></li>
  <li><a href="acme-pl.html">Profit & Loss</a></li>
</div>
We have to prefix the path /defs/examples/fin/page/ to the scraped link value acme-bs.html; otherwise the loader is not able to load the page. Use the prefix property of the item to add any prefix to the item value.
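The effect of the prefix property on the scraped value can be sketched as below. This is a minimal Python illustration, not Scoopi's code: the helper apply_prefixes is hypothetical, and prepending the prefixes in list order is an assumption, since the example uses a single prefix.

```python
def apply_prefixes(value, prefixes):
    """Prepend each prefix, in list order, to the scraped value."""
    for prefix in prefixes:
        value = prefix + value
    return value


# Relative link scraped from the page, and the prefix from the item.
url = apply_prefixes("acme-bs.html", ["/defs/examples/fin/page/"])
print(url)  # /defs/examples/fin/page/acme-bs.html
```

The resulting path is what the loader needs in order to find the page.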
Example 10 combines all the definitions we have used so far (Examples 1 to 9) - links, price, snapshot, bs and pl - into a single job, which outputs all the data to the data.txt file.
The next chapter shows how to flip through pages with pagination and scrape data.