Scoopi Web Scraper
Scoopi web scraper extracts and transforms data from HTML pages. JSoup and HtmlUnit make it quite easy to scrape web pages in Java (see the short sketch after the list below), but things get complicated when the data comes from a large number of pages. Some of the challenges in extracting large sets of data from unstructured sources such as HTML pages are:
- Being unstructured, the data may require many queries to scrape
- Data may not be in the desired format and may need filtering and transformation to be usable
- The connection may drop mid-run, losing all the work done so far
- When data comes from thousands of pages, performance matters
- Using scraper libraries requires proficiency in Java or Python
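To give a sense of how little code a single-page scrape needs, here is a minimal JSoup sketch; the URL and selector are illustrative placeholders, not part of Scoopi:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class QuoteScraper {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page (placeholder URL)
        Document doc = Jsoup.connect("https://example.com/quotes").get();

        // Print the text of every element matching a CSS selector
        for (Element quote : doc.select("div.quote > span.text")) {
            System.out.println(quote.text());
        }
    }
}
```

For one page this is trivial; the difficulty lies in repeating it across thousands of pages while handling failures, formats and throughput, which is exactly what the list above describes.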
Scraping libraries do well on a limited set of pages, but they are not meant to handle thousands of them. Scoopi was developed with these aspects in mind. It is built upon JSoup and HtmlUnit. Some of the features of Scoopi are:
- Scoopi is completely definition driven; no coding knowledge is required. The data structure, task workflow and pages to scrape are defined in a set of YML definition files. It can be configured to use either JSoup or HtmlUnit as the scraper
- Queries can be written either as selectors with JSoup or as XPath with HtmlUnit (the two styles are compared in the sketch after this list)
- Scoopi persists pages and parsed data to the file system, so it can recover from a failed state without repeating completed tasks
- Scoopi is a multi-threaded application that processes pages in parallel for maximum throughput
- Allows the data to be transformed, filtered and sorted
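To make the two query styles concrete, the sketch below extracts the same hypothetical element with a JSoup selector and with an HtmlUnit XPath expression. The URL, queries and class names are illustrative only, and the imports assume the classic HtmlUnit 2.x package names; in Scoopi itself such query strings go into the YML definition files rather than Java code:

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomNode;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class QueryStyles {
    public static void main(String[] args) throws Exception {
        // JSoup: CSS selector syntax (placeholder URL; assumes the element exists)
        Document doc = Jsoup.connect("https://example.com/items").get();
        String bySelector = doc.selectFirst("div.item span.price").text();

        // HtmlUnit: XPath syntax against the same kind of element
        try (WebClient client = new WebClient()) {
            client.getOptions().setJavaScriptEnabled(false); // plain HTML is enough here
            HtmlPage page = client.getPage("https://example.com/items");
            List<?> nodes = page.getByXPath("//div[@class='item']//span[@class='price']");
            String byXPath = ((DomNode) nodes.get(0)).getTextContent();
            System.out.println(bySelector + " / " + byXPath);
        }
    }
}
```

Which style to use is largely a matter of taste; selectors tend to be terser, while XPath can express conditions that selectors cannot.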
For the complete list of features, see the Scoopi GitHub page.
In this step-by-step guide, we explain the Scoopi definition file in detail through a set of examples. For the sake of clarity, we have split the guide into fourteen-odd pages; the overall concept, however, is quite simple and should not take more than a day to learn.