Steps
So far, we have explored locators, tasks and dataDef to scrape data from pages, but we haven’t explained how Scoopi executes tasks and scrapes data.
Scoopi is designed to execute tasks as a workflow, normally referred to as steps, which in turn consists of multiple individual steps. Scoopi ships with two built-in default steps, jsoupDefault and htmlUnitDefault, which are defined in steps-default.yml, a file packaged inside the Scoopi distribution jar. The contents of steps-default.yml can be viewed in the source.
Steps
Let’s go through the jsoupDefault to understand the workflow design.
steps-default.yml

    steps:
      jsoupDefault:
        seeder:
          class: "org.codetab.scoopi.step.extract.LocatorSeeder"
          previous: start
          next: loader
        loader:
          class: "org.codetab.scoopi.step.extract.PageLoader"
          previous: seeder
          next: parser
        parser:
          class: "org.codetab.scoopi.step.parse.jsoup.Parser"
          previous: loader
          next: filter
        filter:
          class: "org.codetab.scoopi.step.process.DataFilter"
          previous: parser
          next: appender
        appender:
          class: "org.codetab.scoopi.step.load.DataAppender"
          previous: filter
          next: end
          plugins: [
            plugin: {
              name: dataFile,
              class: "org.codetab.scoopi.plugin.appender.FileAppender",
              file: "output/data.txt",
              plugins: [
                plugin: {
                  name: csv,
                  delimiter: "|",
                  class: "org.codetab.scoopi.plugin.encoder.CsvEncoder"
                }
              ]
            }
          ]

The sequence of steps is
- seeder
- loader
- parser
- filter
- appender
Each step specifies three properties
- class that has to be executed for the step
- name of the previous step
- name of the next step
For the first step, previous is set to start, and for the last step, next is set to end.
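Because every step names its previous and next step, the execution order can be recovered by starting at the step whose previous is start and following the next links until end is reached. The following is a minimal Python sketch of that idea, not Scoopi's actual implementation (Scoopi itself is written in Java); the `STEPS` dict layout and the `ordered_steps` helper are assumptions made purely for illustration.

```python
# Illustrative only: a dict mirroring the previous/next links of jsoupDefault.
STEPS = {
    "seeder": {"previous": "start", "next": "loader"},
    "loader": {"previous": "seeder", "next": "parser"},
    "parser": {"previous": "loader", "next": "filter"},
    "filter": {"previous": "parser", "next": "appender"},
    "appender": {"previous": "filter", "next": "end"},
}


def ordered_steps(steps):
    """Return step names in execution order by following the next links."""
    # The first step is the one whose previous is the start marker.
    firsts = [name for name, s in steps.items() if s["previous"] == "start"]
    if len(firsts) != 1:
        raise ValueError("expected exactly one step with previous: start")
    order = []
    current = firsts[0]
    while current != "end":
        if current not in steps:
            raise ValueError(f"undefined step: {current}")
        if current in order:
            raise ValueError(f"cycle detected at step: {current}")
        order.append(current)
        current = steps[current]["next"]
    return order


print(ordered_steps(STEPS))
```

The sketch also rejects a chain that points to an undefined step or loops back on itself, which are the two ways a hand-edited previous/next chain is most likely to break.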
The built-in step classes are
Step type | Description | Class |
---|---|---|
seeder | create and seed locators | org.codetab.scoopi.step.extract.LocatorSeeder |
loader | load HTML page | org.codetab.scoopi.step.extract.URLLoader |
parser | parse using JSoup | org.codetab.scoopi.step.parse.jsoup.Parser |
parser | parse with HtmlUnit | org.codetab.scoopi.step.parse.htmlunit.Parser |
filter | filter parsed data | org.codetab.scoopi.step.process.DataFilter |
appender | encode and append data as output | org.codetab.scoopi.step.load.DataAppender |
The class name must be fully qualified, including the package; otherwise, a class not found error is thrown.
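The need for a fully qualified name can be illustrated with a rough Python analogy: a loader that receives only a bare class name has no package to search in. How Scoopi loads step classes internally is not shown here; the `resolve_class` helper below is hypothetical and uses a Python standard library class in place of a Scoopi step class.

```python
import importlib


def resolve_class(qualified_name):
    """Load a class from its fully qualified dotted name."""
    module_name, _, class_name = qualified_name.rpartition(".")
    if not module_name:
        # No package/module part, so there is nowhere to look for the class.
        raise LookupError(f"class not found: {qualified_name}")
    try:
        module = importlib.import_module(module_name)
        return getattr(module, class_name)
    except (ImportError, AttributeError) as exc:
        raise LookupError(f"class not found: {qualified_name}") from exc


# A qualified name resolves to the class.
print(resolve_class("collections.OrderedDict").__name__)

# A bare name fails because the package part is missing.
try:
    resolve_class("OrderedDict")
except LookupError as err:
    print(err)
```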
Plugins
The last step, appender, is a bit more interesting. It uses the FileAppender plugin to append data to the output file, which in turn uses another plugin, CsvEncoder, to encode the data into a string delimited with the | character before the output is sent to the file.

    appender:
      class: "org.codetab.scoopi.step.load.DataAppender"
      previous: filter
      next: end
      plugins: [
        plugin: {
          name: dataFile,
          class: "org.codetab.scoopi.plugin.appender.FileAppender",
          file: "output/data.txt",
          plugins: [
            plugin: {
              name: csv,
              delimiter: "|",
              class: "org.codetab.scoopi.plugin.encoder.CsvEncoder"
            }
          ]
        }
      ]

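
The nesting in this definition, an appender plugin that owns an encoder plugin, can be sketched as two small cooperating objects. This is a Python illustration of the arrangement only, not Scoopi's Java code: the class names `CsvEncoderSketch` and `FileAppenderSketch`, their methods, and the sample rows are all hypothetical, and the output goes to a temporary directory rather than output/data.txt.

```python
import csv
import io
import os
import tempfile


class CsvEncoderSketch:
    """Hypothetical stand-in for an encoder plugin: joins fields with a delimiter."""

    def __init__(self, delimiter="|"):
        self.delimiter = delimiter

    def encode(self, row):
        buffer = io.StringIO()
        csv.writer(buffer, delimiter=self.delimiter).writerow(row)
        # Drop the line terminator; the appender decides how lines are ended.
        return buffer.getvalue().rstrip("\r\n")


class FileAppenderSketch:
    """Hypothetical stand-in for an appender plugin that owns an encoder."""

    def __init__(self, path, encoder):
        self.path = path
        self.encoder = encoder

    def append(self, row):
        # Encode first, then append the resulting line to the output file.
        with open(self.path, "a", encoding="utf-8") as out:
            out.write(self.encoder.encode(row) + "\n")


# Made-up sample rows, written to a temporary file.
path = os.path.join(tempfile.mkdtemp(), "data.txt")
appender = FileAppenderSketch(path, CsvEncoderSketch(delimiter="|"))
appender.append(["Price", "2016", "120.5"])
appender.append(["Price", "2017", "131.0"])

with open(path, encoding="utf-8") as f:
    print(f.read())
```

Keeping the encoder separate from the appender means the same appender could write a differently delimited format just by being handed a different encoder, which mirrors how the inner plugin is configured independently in the definition.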
The plugin framework allows Scoopi to read configuration from the definition file and execute any plugin class without modifying the source code. Scoopi ships with the following plugins.
Plugin type | Description | Plugin class |
---|---|---|
encoder | encodes data as csv | org.codetab.scoopi.plugin.encoder.CsvEncoder |
appender | appends data to file | org.codetab.scoopi.plugin.appender.FileAppender |
appender | appends data to ListArray | org.codetab.scoopi.plugin.appender.ListAppender |
converter | change date format | org.codetab.scoopi.plugin.converter.DateFormater |
converter | roll date and change format | org.codetab.scoopi.plugin.converter.DateRoller |
script | run JavaScript to modify data | org.codetab.scoopi.plugin.script.DataScript |
In subsequent chapters, we explain how to override the default steps or add new steps and plugins. The next chapter uses a converter plugin to format dates.