Multiple Dimensions
In this chapter, we use a bookstore listing page as an example to explain multiple dimensions.
The page defs/examples/book/page/page-1.html lists the details of 20 books, much like an e-commerce site. Example 1 extracts one book and its related data.
The HTML snippet for one book item is:
<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img src="media/cache/cc35693c.jpg" alt="A Light in the Attic" class="thumbnail"></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
In stock
</p>
</div>
</article>
....
The datadef used to extract data from this page is:
defs/examples/book/jsoup/ex-1/job.yml
dataDefs:
  bookData:
    query:
      block: "section ol > li:nth-child(%{item.index}) > article"
      selector: "h3 > a attribute: title"
    items: [
      item: { name: "book", index: 1, value: "book"},
    ]
    dims: [
      item: { name: "url", selector: "h3 > a attribute: href" },
      item: { name: "price", selector: "p[class='price_color']" },
    ]
It defines one item (book) with two dims, url and price. The data item and its axes are shown below.
Axis Name | Item Name | Axis Selector | Value |
---|---|---|---|
dim | price | p[class='price_color'] | £51.77 |
dim | url | h3 > a attribute: href | catalogue/a-light-in-the-attic_1000/index.html |
item | book | book | |
fact | fact | h3 > a attribute: title | A Light in the Attic |
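
To make the table concrete, the following is a minimal sketch of what the fact and the two dims pick out of the book snippet. It is not how the tool itself works (the datadef lives under a jsoup directory, so real selection is presumably done by jsoup); it uses only Python's standard `html.parser` as a stand-in, and the names `SNIPPET`, `BookParser` and `extract_book` are hypothetical. The "h3 > a" relationship is approximated with a simple flag rather than a real CSS selector engine.

```python
from html.parser import HTMLParser

# A trimmed copy of the book item shown earlier in this chapter.
SNIPPET = """
<article class="product_pod">
  <div class="image_container">
    <a href="catalogue/a-light-in-the-attic_1000/index.html"><img src="media/cache/cc35693c.jpg" alt="A Light in the Attic" class="thumbnail"></a>
  </div>
  <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
  <div class="product_price">
    <p class="price_color">£51.77</p>
  </div>
</article>
"""


class BookParser(HTMLParser):
    """Hypothetical helper: collects the fact and the two dims of one book."""

    def __init__(self):
        super().__init__()
        self.record = {"item": "book"}
        self._in_h3 = False
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h3":
            self._in_h3 = True
        elif tag == "a" and self._in_h3:
            # "h3 > a attribute: title" is the fact,
            # "h3 > a attribute: href" is the url dim.
            self.record["fact"] = attrs.get("title")
            self.record["url"] = attrs.get("href")
        elif tag == "p" and attrs.get("class") == "price_color":
            # "p[class='price_color']" is the price dim.
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False
        elif tag == "p":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.record["price"] = data.strip()


def extract_book(html: str) -> dict:
    """Parse one book block and return its item, fact and dims."""
    parser = BookParser()
    parser.feed(html)
    return parser.record


if __name__ == "__main__":
    print(extract_book(SNIPPET))
```

Note that the link inside `image_container` is ignored because it is not inside the `h3`, which mirrors why the selectors in the datadef are anchored on `h3 > a`.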
Setting index: 1 in items/item selects a single book; to select multiple books, use indexRange instead. Example 2 extracts all 20 books and their related information.
defs/examples/book/jsoup/ex-2/job.yml
dataDefs:
  bookData:
    query:
      block: "section ol > li:nth-child(%{item.index}) > article"
      selector: "h3 > a attribute: title"
    items: [
      item: { name: "book", indexRange: 1-20, value: "book"},
    ]
    dims: [
      item: { name: "url", selector: "h3 > a attribute: href" },
      item: { name: "price", selector: "p[class='price_color']" },
    ]
The examples in earlier chapters used variables in selectors but never in block, so a single block was selected for the page and all selectors were applied to that block. In the bookstore example, however, we use the %{item.index} variable in block. The item defines indexRange 1-20, and each time the index increments, query/block selects a new subtree of nodes (one book item) and caches it; the other selectors are then applied to that subtree.
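
The effect of the variable in block can be illustrated with a small sketch. It only shows the string substitution: how one block template and an indexRange of 1-20 yield 20 distinct block queries, one per book. The function names `parse_index_range` and `expand_block` are hypothetical and are not part of the tool; the caching and the actual node selection are left out.

```python
from typing import List

# The block query from the datadef, with the %{item.index} variable.
BLOCK = "section ol > li:nth-child(%{item.index}) > article"


def parse_index_range(index_range: str) -> range:
    """Turn an inclusive range such as "1-20" into a Python range."""
    start, end = (int(part) for part in index_range.split("-"))
    return range(start, end + 1)


def expand_block(template: str, index_range: str) -> List[str]:
    """Substitute each index of the range into the block template."""
    return [
        template.replace("%{item.index}", str(index))
        for index in parse_index_range(index_range)
    ]


if __name__ == "__main__":
    for query in expand_block(BLOCK, "1-20"):
        # Each query addresses one <article>, i.e. one book item.
        print(query)
```

Each resulting query addresses exactly one `li` in the listing, so the fact and dim selectors that follow are evaluated against a single book at a time.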
Example 3 adds some more dimensions, such as image URL, availability, and date:
defs/examples/book/jsoup/ex-3/job.yml
dims: [
item: { name: "url", selector: "h3 > a attribute: href" },
item: { name: "imgUrl", selector: "img attribute: src" },
item: { name: "price", selector: "p[class='price_color']" },
item: { name: "avail", selector: "p[class='instock availability']" },
item: { name: "date", script: "document.getFromDate()" },
]
Apart from multiple dimensions, the important takeaway from the bookstore example is that by using variables in query/block, we can select multiple blocks in a page and extract the data within each block as a separate item.