Design Decisions
HTML Parsing with jsoup
To check HTML, we parse it into an internal (DOM-like) representation. For this task we use jsoup HTML parser, an open-source parser without external dependencies.
To quote from the jsoup website:
jsoup
is a Java library for working with real-world HTML.
It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
- Goals of this decision
-
Check HTML programmatically by using an existing API that provides access and finder methods to the DOM tree of the file(s) to be checked.
- Decision Criteria
-
-
Few dependencies, so the HSC binary stays as small as possible.
-
Accessor and finder methods to find images, links and link-targets within the DOM tree.
-
- Alternatives
-
-
HTTPUnit
: a testing framework for web applications and sites. Its main focus is web testing, and it suffers from a large number of dependencies. -
jsoup
: a plain HTML parser without any dependencies (!) and a rich API to access all HTML elements in DOM-like syntax.
-
Find details on how HSC implements HTML parsing in the HTML encapsulation concept.
String Similarity Checking with Jaro-Winkler-Distance
The small java-string-similarity library (by Ralph Allen Rice) contains implementations of several similarity-calculation algorithms. As it is a public binary, available at Maven Central, we have used it as external library dependencies. Primarily, we have used Jaro-Winkler strategy to find similarity.
Changing Groovy to Plain Java
In 2024/2025 we decided to port the html sanity checker to plain java, because in some environments the use of the groovy-framework caused trouble. In some cases different versions of the groovey enwironment uesed in maven projekts result in an incompatibility that renders the build not working. Using plain java we get rid of all gropvy dependencies and therefore eleminating all version conflicts.
Feedback
Was this page helpful?
Glad to hear it! Please tell us how we can improve.
Sorry to hear that. Please tell us how we can improve.