CodeTab Gotz ETL
Gotz ETL is a tool to extract data from HTML pages. In Java, it’s easy to scrape web pages with libraries such as JSoup and HtmlUnit, but the task becomes daunting when we try to scrape data from a huge set of pages.
Some of the challenges in extracting large sets of data from unstructured sources such as HTML pages are:
A single web page may hold multiple types of data, so queries should be able to handle different types of data dynamically.
The network connection may go down in the middle of a run, and the scraper should be able to recover from the failed state.
Some source pages change frequently and others rarely; the scraper should avoid parsing pages that have not changed, otherwise a run may take a very long time to complete.
Data in unstructured sources such as HTML pages may not be in the desired format, or some of it may be unwanted, so the scraper should be able to filter and transform the values.
Scraping libraries such as JSoup and HtmlUnit scrape data well, but they are not meant to handle the situations listed above.
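To make the "skip unchanged pages" challenge concrete, here is a minimal Java sketch of one possible approach: keep a SHA-256 digest per URL and parse a page only when its digest differs from the one seen before. The class PageChangeTracker and its methods are hypothetical names used for illustration. This is not how Gotz itself is implemented, and a real tool would keep the digests in a database rather than in an in-memory map so that they survive a restart.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of Gotz): remembers one content digest per URL. */
final class PageChangeTracker {

    // URL -> hex SHA-256 digest of the content seen on the previous call.
    // Assumption: in-memory only; a real scraper would persist this.
    private final Map<String, String> digests = new HashMap<>();

    /** Returns true if the page is new or its content differs from the last call. */
    boolean needsParsing(String url, String html) {
        String digest = sha256Hex(html);
        String previous = digests.put(url, digest); // null on the first visit
        return !digest.equals(previous);
    }

    /** Hex-encoded SHA-256 digest of the UTF-8 bytes of the text. */
    static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PageChangeTracker tracker = new PageChangeTracker();
        String url = "http://example.org/quotes"; // placeholder URL
        System.out.println(tracker.needsParsing(url, "<p>price: 10</p>")); // true: first visit
        System.out.println(tracker.needsParsing(url, "<p>price: 10</p>")); // false: unchanged
        System.out.println(tracker.needsParsing(url, "<p>price: 12</p>")); // true: changed
    }
}
```

Hashing the whole page is the simplest choice, but any dynamic fragment such as a timestamp would make every page look changed, so a practical scraper might hash only the part of the page it queries.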
Gotz is developed with these aspects in mind. It is built upon JSoup and HtmlUnit. The functionality Gotz offers over and above the scraping libraries is:
Gotz is completely model driven, like a real ETL tool. The data structure, task workflow and pages to scrape are defined with a set of XML definition files, and no coding is required.
It can be configured to use either JSoup or HtmlUnit as the scraper.
Queries can be written using either Selectors with JSoup or XPath with HtmlUnit.
Gotz persists pages and data to a database so that it can recover from a failed state without repeating the tasks already completed.
For transparent persistence, Gotz uses the JDO standard and DataNucleus AccessPlatform, and you can choose your datastore from a very wide range.
Gotz is a multi-threaded application that processes pages in parallel for maximum throughput. The number of threads allotted to each task pool is configurable based on workload.
It allows you to transform, filter and sort the data.
It comes with built-in appenders such as FileAppender, DBAppender and ListAppender.
GotzEngine can be embedded in other programs, which can then access the scraped data through ListAppender.
The flexible workflow allows you to change the sequence of steps.
Gotz is extensible. Developers can extend the predefined base steps, or even create new ones with different functionality, and weave them into the workflow.
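As an illustration of the parallel processing and the filter, transform and sort features listed above, the following is a minimal sketch in plain Java. It is not Gotz code and does not use the Gotz API: ParallelPagesSketch, parse and process are hypothetical names, the "price:" page format is made up, and the pool size is simply passed as a parameter, whereas Gotz makes it configurable per task pool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

/** Hypothetical sketch (not part of Gotz): parse pages in parallel, then clean the values. */
final class ParallelPagesSketch {

    /** Stand-in for a parse step: returns the text after "price:", or "" if absent. */
    static String parse(String page) {
        int i = page.indexOf("price:");
        return i < 0 ? "" : page.substring(i + "price:".length()).trim();
    }

    /** Parses pages on a fixed pool of the given size, then filters, transforms and sorts. */
    static List<Integer> process(List<String> pages, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one parse task per page; the tasks run concurrently on the pool.
            List<Future<String>> futures = new ArrayList<>();
            for (String page : pages) {
                futures.add(pool.submit(() -> parse(page)));
            }
            // Collect the raw values, waiting for each task to finish.
            List<String> raw = new ArrayList<>();
            for (Future<String> future : futures) {
                raw.add(future.get());
            }
            return raw.stream()
                    .filter(s -> s.matches("\\d+"))     // filter: keep only plain numbers
                    .map(Integer::valueOf)              // transform: text to Integer
                    .sorted(Comparator.naturalOrder())  // sort: ascending
                    .collect(Collectors.toList());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList(
                "price: 30", "no data here", "price: 10", "price: n/a");
        System.out.println(process(pages, 2)); // prints [10, 30]
    }
}
```

A fixed-size pool keeps the number of concurrent requests bounded, which is one plausible reason to make the thread count a tunable setting.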
Gotz ETL Reference

Maithilish
maithilish@gmail.com