HTML Sanity Check (HSC) Architecture Documentation

This material is open source and provided under the Creative Commons Sharealike 4.0 license. It comes without any guarantee. Use at your own risk. arc42 and its structure by Dr. Peter Hruschka and Dr. Gernot Starke. AsciiDoc version initiated by Markus Schärtel and Jürgen Krey, completed and maintained by Ralf Müller and Gernot Starke.

Version 2.0.0-rc3 of 2025-04-18

Note	In the following text, "HTML Sanity Check" is abbreviated as HSC.

Goals of this Documentation

This documentation is an example of arc42 documentation.

You may copy this documentation or parts of it for your own projects. In such cases you must include a link or reference to arc42 or aim42 (we regard this as fair-use).

Compared with real-world projects, the amount of documentation relative to the amount of code is oversized here.

Disclaimer

We provide absolutely no guarantee, neither for the accuracy of this documentation nor for any property or feature of the software described here.

Do not use this software in critical situations or projects.

Introduction and Goals

HSC shall support authors creating digital formats with hyperlinks and integration of images and similar resources.

Requirements Overview

The overall goal of HSC is to create neat and clear reports, showing errors within HTML files - as shown in the adjoining figure.

Basic Usage

Precondition: A user configures the location (directory and filename) of one or more HTML file(s), including the corresponding image’s directory.
Action: HSC performs various checks on the files, and
Postcondition: Reports its results either on the console or as HTML report.

HSC can run

From the command line (CLI), or
As Gradle-plugin, or
As Maven-plugin.

Terminology: What Can Go Wrong in HTML Files?

Apart from purely syntactical errors, many things can go wrong in HTML, especially with respect to hyperlinks, anchors and ids, as those are often maintained manually.

Primary sources of problems are bad links (in technical terms: URIs). For further information, see the background information on URIs.

Broken Cross References: Cross-references (internal links) can be broken, e.g. due to missing or misspelled link-targets. See BrokenCrossReferencesChecker
Missing image files: Referenced image files can be missing or misspelled. See MissingImageFilesChecker.
Missing local resources: Referenced local resources (other than images) can be missing or misspelled. See MissingLocalResourcesChecker
Duplicate link targets: link-targets can occur several times with the same name - so the browser cannot know which is the desired target. See DuplicateIdChecker.
Illegal links: The links (aka anchors or URIs) can contain illegal characters or violate HTML link syntax. See IllegalLinkChecker.
Broken external links: External http links can be broken due to myriads of reasons: misspelled, link-target currently offline, illegal link syntax. See BrokenHttpLinksChecker.
Missing Alt Attribute in Image Tags: Images missing an alt-attribute. See MissingImgAltAttributeChecker.

Checking and reporting these errors and flaws is the central business requirement of HSC.

Important terms (domain terms) of HTML sanity checking are documented in a (small) domain model.

General Functionality

Table 1. General Requirements
ID	Functionality	Description
G-1	read HTML file	HSC shall read a single (configurable) HTML file
G-2	Gradle-plugin	HSC can be run as Gradle-plugin.
G-3	command line usage	HSC can be called from the command line with arguments and options
G-4	configurable output	output can be configured to console or file
G-5	free and open source	all required dependencies shall be compliant to the CC-SA-4 licence.
G-6	available via public repositories	Maven Central
G-7	configurable to check multiple HTML files	configure a set of files to be processed in a single run and produce a joint report (useful e.g. for API documentation with many HTML files referencing each other)

Types of Sanity Checks

Table 2. Required Checks
ID	Check	Description
R-1	missing image files	Check all image tags if the referenced image files exist. See MissingImageFilesChecker
R-2	broken internal links	Check all internal links from anchor-tags (href="#XYZ") if the link targets "XYZ" are defined. See BrokenCrossReferencesChecker
R-3	missing local files	either other html-files, pdf’s or similar. See MissingLocalResourcesChecker
R-4	duplicate link targets	Check all bookmark definitions (… id="XYZ") whether the id’s ("XYZ") are unique. See DuplicateIdChecker
R-5	malformed links	Check all links for syntactical correctness
R-6	missing alt-attribute	in image-tags. See MissingImgAltAttributeChecker
R-7	unused-images	Check for files in image-directories that are not referenced by any of the HTML files in this run
R-8	illegal link targets	Check for malformed or illegal anchors (link targets).

Table 3. Optional Checks
ID	Check	Description
Opt-1	missing external images	Check externally referenced images for availability
Opt-2	broken external links	Check external links for both syntax and availability

Reporting and Output Requirements

Table 4. Reporting Requirements
ID	Requirement	Description
R-1	various output formats	Checking output in plain text and HTML
R-2	output to stdout	HSC can output results on stdout (the console)
R-3	configurable file output	HSC can store results in files in configurable directories

Quality Goals

Table 5. Quality-Goals
Priority	Quality-Goal	Scenario
1	Correctness	Every broken internal link (cross reference) is found.
1	Correctness	Every missing local image is found.
2	Flexibility	Multiple checking algorithms, report formats and clients. At least Gradle, command-line and a graphical client have to be supported.
2	Safety	Content of the files to be checked is never altered.
2	Correctness	Correctness of every checker is automatically tested for positive AND negative cases
2	Correctness	Every reporting format is tested: Reports must exactly reflect checking results.
3	Performance	Check of 100kB html file performed under 10 secs (excluding gradle startup)

Stakeholder

Table 6. Stakeholder
Role	Description	Goal, Intention
Documentation author	writes documentation with Html output	wants to check that the resulting document contains good links, image references
arc42 user	uses the arc42 template for architecture documentation	wants a small but practical example of how to apply arc42.
aim42 contributor	contributes to the aim42 method guide	check generated html code to ensure links and images are correct during (gradle-based) build process
software developer		wants an example of pragmatic architecture documentation and arc42 usage

Background Information on URIs

The generic structure of a Uniform Resource Identifier consists of the following parts: [type][://][subdomain][domain][port][path][file][query][hash]

An example, visualized:

The java.net.URL class contains a generic parser for URLs and URIs. See the following snippet, taken from the unit test class WebTest.groovy:

Generic URI Structure

    @Test
    void testGenericURISyntax() {
        // based upon an example from the Oracle(tm) Java tutorial:
        // http://docs.oracle.com/javase/tutorial/networking/urls/urlInfo.html
        def aURL =
                new URL("http://example.com:42/docs/tutorial/index.html?name=aim42#INTRO")
        aURL.with {
            assert getProtocol() == "http"
            assert getAuthority() == "example.com:42"
            assert getHost() == "example.com"
            assert getPort() == 42
            assert getPath() == "/docs/tutorial/index.html"
            assert getQuery() == "name=aim42"
            assert getRef() == "INTRO"
        }
    }

URIs are used to reference other resources. For HSC it is useful to distinguish between internal (local) and external references:

Internal references, a.k.a. Cross-References
External references
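
As a rough illustration of this distinction, the following Java sketch sorts an href value into one of the categories used in this document. The class `LinkClassifier`, the enum `LinkKind` and the method `classify` are hypothetical names chosen for this example and are not taken from the HSC code base; only `java.net.URI` is part of the JDK. Treating every URI with a scheme as external is a simplifying assumption.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class LinkClassifier {

    // Hypothetical categories, named after the terms used in this document.
    enum LinkKind { CROSS_REFERENCE, LOCAL_RESOURCE, EXTERNAL, ILLEGAL }

    static LinkKind classify(String href) {
        if (href.startsWith("#")) {
            return LinkKind.CROSS_REFERENCE;   // target inside the same document
        }
        try {
            URI uri = new URI(href);
            // Assumption: any scheme (http, https, mailto, ...) counts as external,
            // a reference without a scheme is a local file reference.
            return uri.getScheme() == null ? LinkKind.LOCAL_RESOURCE : LinkKind.EXTERNAL;
        } catch (URISyntaxException e) {
            return LinkKind.ILLEGAL;           // e.g. an unescaped space
        }
    }

    public static void main(String[] args) {
        String[] samples = {"#intro", "images/logo.png",
                "https://example.com/a.html", "bad link.html"};
        for (String href : samples) {
            System.out.println(href + " -> " + classify(href));
        }
    }
}
```

A real checker would probably need finer rules, for example for `file:` URIs, which this sketch simply counts as external.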

Intra-Document URIs

a file… ref can be an internal link, or a URI without protocol…

References on URIs and HTML Syntax

IETF RFC-2396 on URI Syntax: The fundamental reference!
W3C HTML Reference
Wikipedia on URI-Schemes

Constraints

HSC shall be:

platform-independent and should run on the major operating systems (Windows™, Linux, and Mac-OS™)
integrated with the Gradle build tool
runnable from the command line
developed under a liberal open-source license

Context

Business Context

Table 7. Business Context
Neighbor	Description
user	documents software with toolchain that generates html. Wants to ensure that links within this html are valid.
build system
local html files	HSC reads and parses local html files and performs sanity checks within those.
local image files	HSC checks if linked images exist as (local) files.
external web resources	HSC can be configured to optionally check for the existence of external web resources. Due to the nature of web systems, this check might need a significant amount of time and might yield invalid results due to network and latency issues.

Deployment Context

The following diagram shows the participating computers (Node) with their technical connections plus the major Artifact of HSC, the hsc-plugin-binary.

Table 8. Deployment Context
Node / Artifact	Description
Node hsc-development	where development of HSC takes place
Artifact hsc-cli	compiled and packaged version of HSC including required dependencies.
Artifact hsc-maven-plugin	compiled and packaged version of HSC including required dependencies.
Artifact hsc-gradle-plugin	compiled and packaged version of HSC including required dependencies.
Artifact hsc-core	compiled and packaged version of HSC core functionality including required dependencies.
Node artifact repository (Maven Central)	public Java artifact repository, cf. Maven Central Search. HSC binaries are uploaded to this server.
Node hsc user computer	where arbitrary documentation takes place with html as output formats.
Artifact build.gradle	Gradle build script configuring (among other things) the HSC plugin to perform the Html checking.
Artifact build.maven	Maven POM configuring (among other things) the HSC plugin to perform the Html checking.
Artifact commandline	Command line used to invoke the HSC command line interface.

For details, see the deployment view.

Solution Strategy

Implement HSC in Java with minimal external dependencies.
- Implement a core library for the functionality that is tool-independent and has minimal external dependencies.
- Wrap this implementation into a Gradle and a Maven plugin, so it can be used within automated builds. Details are given in the Gradle plugin concept and Maven plugin concept.
- Create a command line interface.
Apply the template-method-pattern (see e.g. https://sourcemaking.com/design_patterns/template_method) to enable:
- multiple checking algorithms. See the concept for checking algorithms,
- both HTML (file) and text (console) output. See the reporting concept.

Building Block View

Whitebox HtmlSanityChecker

Rationale

We used functional decomposition to separate responsibilities:

CheckerCore shall encapsulate checking logic and Html parsing/processing.
all kinds of outputs (console, html-file, graphical) shall be handled in a separate component (Reporter)
Implementation of Gradle specific stuff shall be encapsulated.

Contained Blackboxes

Table 9. HtmlSanityChecker building blocks
HSC Core	hsc core: html parsing and sanity checking, configuration, reporting.
HSC Gradle Plugin	integrates the Gradle build tool with HSC, enabling arbitrary gradle builds to use HSC functionality.
HSC Maven Plugin	integrates the Maven build tool with HSC, enabling arbitrary maven builds to use HSC functionality.
HSC Graphical Interface	(planned, not implemented)

Interfaces

Table 10. HtmlSanityChecker internal interfaces
Interface	Description
usage via shell	an (arc42) user uses a command line shell to call HSC
Buildsystem	Currently restricted to Gradle: The build system uses HSC as configured in the buildscript.
Local filesystem	HSC needs access to several local files, especially the html page to be checked and to the corresponding image directories.
External websites	to check external links, HSC needs to access external sites via http HEAD or GET requests.

HSC Core (Blackbox)

Intent/Responsibility: HSC Core contains the core functions to perform the various sanity checks. It parses the html file into a DOM-like in-memory representation, which is then used to perform the actual checks.
Interfaces

Table 11. HSC Core Interfaces
Interface (From-To)	Description
Command Line Interface → Checker	Uses the #AllChecksRunner class.
Gradle Plugin → Checker	Exposes HSC via a standard Gradle plugin, as described in the Gradle user guide.

Files

org.aim42.htmlsanitycheck.AllChecksRunner
org.aim42.htmlsanitycheck.HtmlSanityCheckGradlePlugin

Building Blocks - Level 2

HSC Core (Whitebox)

Figure 1. HSC Core (Whitebox)

Rationale

This structure follows a strictly functional decomposition:

parsing and handling html input
checking
collecting checking results

Contained Blackboxes

Table 12. HSC Core building blocks
Checker	Abstract class, used in form of the template-pattern. Shall be subclassed for all checking algorithms.
AllChecksRunner	Facade to the different Checker instances. Provides a (parameter-driven) command-line interface.
ResultsCollector (Whitebox)	Collects all checking results. Its interface `Results` is contained in the whitebox description
Reporter	Reports checking results to either console or an html file.
HtmlParser	Encapsulates html parsing, provides methods to search within the (parsed) html.
Suggester	In case of checking issues, suggests alternatives by comparing the faulty element to the one present in the html file. Currently not implemented

Checker and xyzChecker Subclasses

The abstract Checker provides a uniform interface (public void check()) to different checking algorithms. It is based upon the extensible concept for checking algorithms.

Building Blocks - Level 3

ResultsCollector (Whitebox)

Figure 2. Results Collector (Whitebox)

Rationale

This structure follows the hierarchy of checks, namely managing results for:

a number of pages/documents, containing:
a single page, each containing many
single checks within a page

Contained Blackboxes

Table 13. ResultsCollector building blocks
Per-Run Results	results for potentially many Html pages/documents.
Single-Page-Results	results for a single page
Single-Check-Results	results for a single type of check (e.g. missing-images check)
Finding	a single finding, (e.g., "image 'logo.png' missing"). Can hold suggestions and (planned for future releases) the responsible html element.

Interface `Results`

The Results interface is used by all clients (especially Reporter subclasses, graphical and command-line clients) to access checking results. It consists of three distinct APIs for overall results (RunResults), single-page results (PageResults) and single-check results (CheckResults). See the interface definitions below, taken from the Groovy source code:

Interface RunResults

public interface RunResults {

    // returns results for all pages which have been checked
    List<SinglePageResults> getResultsForAllPages();

    // how many pages were checked in this run?
    int nrOfPagesChecked();

    // how many checks were performed in all?
    int nrOfChecksPerformedOnAllPages();

    // how many findings (errors and issues) were found in all?
    int nrOfFindingsOnAllPages();

    // how long did checking take (in milliseconds)?
    Long checkingTookHowManyMillis();
}

Interface PageResults

public interface PageResults {

    // what's the title of this page?
    String getPageTitle();

    // what's the filename and path?
    String getPageFileName();

    String getPageFilePath();

    // how many items have been checked?
    int nrOfItemsCheckedOnPage();

    // how many problems were found on this page?
    int nrOfFindingsOnPage();

    // how many different checks have run on this page?
    int howManyCheckersHaveRun();
}

Interface CheckResults

public interface CheckResults {

    // return a description of what is checked
    // (e.g. "Missing Images Checker" or "Broken Cross-References Checker")
    String description();

    // returns all findings/problems found during this check
    List<Finding> getFindings();
}
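
To make the relationship between the three result levels more tangible, here is a minimal Java sketch in which the run-level numbers are derived by summing up the page-level and check-level ones. The classes below are heavily simplified stand-ins written for this example (findings are plain strings, constructors are invented); only the method names `nrOfPagesChecked`, `nrOfFindingsOnAllPages`, `nrOfFindingsOnPage` and `howManyCheckersHaveRun` are borrowed from the interfaces above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ResultsSketch {

    // Stand-in for CheckResults: one type of check with its findings.
    static final class CheckResults {
        final String description;
        final List<String> findings;
        CheckResults(String description, String... findings) {
            this.description = description;
            this.findings = new ArrayList<>(Arrays.asList(findings));
        }
    }

    // Stand-in for PageResults: one page holds the results of many checks.
    static final class PageResults {
        final String pageFileName;
        final List<CheckResults> checks;
        PageResults(String pageFileName, CheckResults... checks) {
            this.pageFileName = pageFileName;
            this.checks = new ArrayList<>(Arrays.asList(checks));
        }
        int howManyCheckersHaveRun() { return checks.size(); }
        int nrOfFindingsOnPage() {
            return checks.stream().mapToInt(c -> c.findings.size()).sum();
        }
    }

    // Stand-in for RunResults: one run holds the results of many pages.
    static final class RunResults {
        final List<PageResults> pages;
        RunResults(PageResults... pages) {
            this.pages = new ArrayList<>(Arrays.asList(pages));
        }
        int nrOfPagesChecked() { return pages.size(); }
        int nrOfFindingsOnAllPages() {
            return pages.stream().mapToInt(PageResults::nrOfFindingsOnPage).sum();
        }
    }

    public static void main(String[] args) {
        RunResults run = new RunResults(
                new PageResults("index.html",
                        new CheckResults("Missing Images Checker", "image 'logo.png' missing"),
                        new CheckResults("Duplicate Id Checker")),
                new PageResults("api.html",
                        new CheckResults("Broken Cross-References Checker",
                                "link target 'intro' missing", "link target 'faq' missing")));
        System.out.println(run.nrOfPagesChecked() + " pages, "
                + run.nrOfFindingsOnAllPages() + " findings");
    }
}
```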

Runtime View

Note	Not appropriate for this system due to very simple implementation.

Deployment View

Figure 3. Deployment^[1]

Table 14. Deployment
Node / Artifact	Description
HSC plugin binary	The compiled version of HSC, including required dependencies.
hsc-development	Where development of HSC takes place
Artifact Repository (Maven Central)	Public Java artifact repository, cf. Maven Central Search.
HSC user computer	Where arbitrary documentation takes place with HTML as output formats.
`build.gradle`	Gradle build script configuring (among other things) the HSC plugin to check some documentation.

The three nodes (computers) shown in Deployment^[1] are connected via the Internet.

Sanity checker will be:

Bundled
1. As a core jar file (including all checkers and reporters),
2. As a respective Plugin jar file for Maven and Gradle,
3. As a CLI jar file, providing a main method with parameters and options to run all checks from the command line,
Uploaded to the Maven Central repository (Gradle Plugin also to Gradle Plugin Portal),
Referencable within a Gradle build-file.

Technical and Crosscutting Concepts

HTML Checking Domain Model

Figure 4. HTML Checking Domain Model

Table 15. Domain Model
Term	Description
Anchor	Html element to create →Links. Contains link-target in the form `<a href="link-target">`
Cross Reference	Link from one part of the document to another part within the same document. A special form of →Internal Link, with a →Link Target in the same document.
External Link	Link to another page or resource at another domain.
Finding	Description of a problem found by one →Checker within the →Html Page.
Html Element	HTML pages (documents) are made up of HTML elements, e.g. `<a href="link target">`, `<img src="image.png">` and others. See the W3-Consortium
Html Page	A single chunk of HTML, mostly regarded as a single file. Shall comply to standard HTML syntax. Minimal requirement: Our HTML parser can successfully parse this page. Contains →Html Elements. Also called Html Document.
id	Identifier for a specific part of a document, e.g. `<h2 id="someHeader">`. Often used to describe →Link Targets.
Internal Link	Link to another section of the same page or to another page of the same domain. Also called Local Link.
Link	Any reference in the →Html Page that lets you display or activate another part of this document (→Internal Link) or another document, image or resource (can be either →Internal (local) or →External Link). Every link leads from the Link Source to the Link Target
Link Target	The target of any →Link, e.g. a heading or any other part of an →Html Document, or any internal or external resource (identified by URI). Expressed by →id
Local Resource	local file, either other Html files or other types (e.g. pdf, docx)
Run Result	The overall results of checking a number of pages (at least one page).
Single Page Result	A collection of all checks for a single →Html Page.
URI	Uniform Resource Identifier. Defined in RFC-2396. The ultimate source of truth concerning link syntax and semantics.

Gradle Plugin Concept and Development

You should definitely read the original Gradle User Guide on custom plugin development.

To enable the required Gradle integration, we implement a lean wrapper as described in the Gradle user guide.

class HtmlSanityCheckPlugin implements Plugin<Project> {

    final static String HTML_SANITY_CHECK = "htmlSanityCheck"

    void apply(Project project) {
        project.tasks.register( HTML_SANITY_CHECK, HtmlSanityCheckTask.class)
    }
}

Directory Structure and Required Files

|-htmlSanityCheck
   |  |-src
   |  |  |-main
   |  |  |  |-org
   |  |  |  |  |-aim42
   |  |  |  |  |  |-htmlsanitycheck
   |  |  |  |  |  |  | ...
   |  |  |  |  |  |  |-HtmlSanityCheckPlugin.groovy // (1)
   |  |  |  |  |  |  |-HtmlSanityCheckTask.groovy
   |  |  |  |-resources
   |  |  |  |  |-META-INF                          // (2)
   |  |  |  |  |  |-gradle-plugins
   |  |  |  |  |  |  |-htmlSanityCheck.properties  // (3)
   |  |  |-test
   |  |  |  |-org
   |  |  |  |  |-aim42
   |  |  |  |  |  |-htmlsanitycheck
   |  |  |  |  |  |  | ...
   |  |  |  |  |  |  |-HtmlSanityCheckPluginTest
   |

(1) the actual plugin code: HtmlSanityCheckPlugin.groovy and HtmlSanityCheckTask.groovy
(2) Gradle expects plugin properties in META-INF
(3) property file containing the name of the actual implementation class: implementation-class=org.aim42.htmlsanitycheck.HtmlSanityCheckPlugin

Passing Parameters From Buildfile to Plugin

To be done

Building the Plugin

The plugin code itself is built with gradle.

Uploading to Public Archives

TBD

Further Information on Creating Gradle Plugins

The Gradle user guide describes how to write plugins.

Maven Plugin Concept and Development

Basic information on creating a Maven Plugin is described in the Maven User Guide chapter of plugin development.

The plugin implementation is located in HtmlSanityCheckMojo.java (htmlSanityCheck-maven-plugin/src/main/java/org/aim42/htmlSanityCheck/maven/).

Directory Structure and Required Files

To be done

Passing Parameters From Buildfile to Plugin

To be done

Building the Plugin

The plugin code itself is built with maven.

Uploading to Public Archives

TBD

Further Information on Creating Maven Plugins

The Maven user guide describes how to write plugins.

Flexible Checking Algorithms

HSC uses the template-method-pattern to enable flexible checking algorithms:

The Template Method defines a skeleton of an algorithm in an operation, and defers some steps to subclasses.

— https://sourcemaking.com/design_patterns/template_method

We achieve that by defining the skeleton of the checking algorithm in one operation, deferring the specific checking algorithm steps to subclasses.

The invariant steps are implemented in the abstract base class, while the variant checking algorithms have to be provided by the subclasses.

Template method "performCheck"

/**
 * * template method for performing a single type of checks on the given @see HtmlPage.
 * <p>
 * Prerequisite: pageToCheck has been successfully parsed,
 * prior to constructing this Checker instance.
 **/
public SingleCheckResults performCheck(final HtmlPage pageToCheck) {
    // assert non-null htmlPage
    assert pageToCheck != null; // NOSONAR(S4274)

    checkingResults = new SingleCheckResults();

    // description is set by subclasses
    initCheckingResultsDescription();

    return check(pageToCheck);// <1> delegate check() to subclass
}
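
How a subclass plugs into this template method can be sketched as follows. This is an illustration under simplifying assumptions, not the HSC source: `Page` and `Results` are reduced stand-ins for `HtmlPage` and `SingleCheckResults`, `MissingAltChecker` and `initDescription` are names invented for this example, and a page is represented only by the alt values of its images (`null` meaning the attribute is absent).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TemplateMethodSketch {

    // Stand-in for HtmlPage: only the alt values of its <img> tags.
    static final class Page {
        final List<String> imgAltValues;
        Page(List<String> imgAltValues) { this.imgAltValues = imgAltValues; }
    }

    // Stand-in for SingleCheckResults.
    static final class Results {
        String description = "";
        final List<String> findings = new ArrayList<>();
    }

    static abstract class Checker {
        protected Results results;

        // Template method: invariant steps first, then delegation to the subclass.
        public final Results performCheck(Page page) {
            if (page == null) {
                throw new IllegalArgumentException("page must not be null");
            }
            results = new Results();
            initDescription();
            return check(page);
        }

        protected abstract void initDescription();
        protected abstract Results check(Page page);
    }

    // Variant step: report every image whose alt attribute is absent.
    static final class MissingAltChecker extends Checker {
        @Override
        protected void initDescription() {
            results.description = "Missing alt-attribute check";
        }

        @Override
        protected Results check(Page page) {
            for (int i = 0; i < page.imgAltValues.size(); i++) {
                if (page.imgAltValues.get(i) == null) {
                    results.findings.add("image #" + (i + 1) + " has no alt attribute");
                }
            }
            return results;
        }
    }

    public static void main(String[] args) {
        Page page = new Page(Arrays.asList("logo", null, "diagram"));
        Results r = new MissingAltChecker().performCheck(page);
        System.out.println(r.description + ": " + r.findings);
    }
}
```

Adding another checking algorithm then only requires a further subclass; the calling code keeps using `performCheck`.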

Figure 5. Template-Method Overview

Table 16. Template Method
Component	Description
Checker	abstract base class, containing the public template method performCheck() plus the abstract method check(), which subclasses implement
MissingImageFilesChecker	checks if referenced local image files exist
MissingImgAltAttributeChecker	checks if there are image tags without alt-attributes
BrokenCrossReferencesChecker	checks if cross references (links referenced within the page) exist
DuplicateIdChecker	checks if any id has multiple definitions
MissingLocalResourcesChecker	checks if referenced other resources exist
BrokenHttpLinksChecker	checks if external links are valid
IllegalLinkChecker	checks if links do not violate HTML link syntax

MissingImageFilesChecker

Addresses requirement Required Checks (R-1).

Checks if image files referenced in <img src="someFile.jpg"> really exist on the local file system.

The (little) problem with checking images is their path: Consider the following HTML fragment (from the file testme.html):

<img src="./images/one-image.jpg">

This image file ("one-image.jpg") has to be located relative to the directory containing the corresponding HTML file.

Therefore the expected absolute path of the "one-image.jpg" has to be determined from the absolute path of the html file under test.

We check for existing files using the usual Java API, but have to do some directory arithmetic to get the absolutePathToImageFile:

File f = new File( absolutePathToImageFile );
if(f.exists() && !f.isDirectory())
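
The directory arithmetic mentioned above could look roughly like the following sketch, which uses `java.nio.file` instead of `java.io.File`. The class `ImagePathSketch` and its two methods are hypothetical helpers written for this example; HSC's actual implementation may differ.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class ImagePathSketch {

    // Resolves an image reference relative to the directory of the HTML file.
    static Path resolveImagePath(Path htmlFile, String imgSrc) {
        Path baseDir = htmlFile.toAbsolutePath().getParent();
        return baseDir.resolve(imgSrc).normalize();   // removes "./" and "../" segments
    }

    // True only if the resolved path is an existing regular file (not a directory).
    static boolean imageExists(Path htmlFile, String imgSrc) {
        return Files.isRegularFile(resolveImagePath(htmlFile, imgSrc));
    }

    public static void main(String[] args) {
        Path html = Paths.get("docs", "testme.html");
        System.out.println(resolveImagePath(html, "./images/one-image.jpg"));
        System.out.println(imageExists(html, "./images/one-image.jpg"));
    }
}
```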

MissingImgAltAttributeChecker

Addresses requirement Required Checks (R-6).

Simple syntactic check: iterates over all <img> tags to check if the image has an alt attribute.

BrokenCrossReferencesChecker

Addresses requirement Required Checks (R-2).

Cross references are document-internal links where the href="link-target" from the html anchor tag has no prefix like http, https, ftp, telnet, mailto, file and such.

Only links with prefix # shall be taken into account, e.g. <a href="#internalLink">.
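
The core of such a check can be sketched as a comparison between the `#`-prefixed hrefs and the ids defined in the page. The class and method names below are hypothetical, and the hrefs and ids are assumed to have been extracted from the parsed page beforehand. Skipping a bare `#` (which browsers treat as "top of the document") is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CrossReferenceSketch {

    // Returns every "#target" href whose target is not among the defined ids.
    static List<String> findBrokenCrossReferences(List<String> hrefs, Set<String> definedIds) {
        List<String> broken = new ArrayList<>();
        for (String href : hrefs) {
            if (!href.startsWith("#")) {
                continue;                      // not a cross reference
            }
            String target = href.substring(1);
            if (target.isEmpty()) {
                continue;                      // bare "#": top of the document
            }
            if (!definedIds.contains(target)) {
                broken.add(href);
            }
        }
        return broken;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList("#intro", "#missing", "other.html", "#");
        Set<String> ids = new HashSet<>(Arrays.asList("intro"));
        System.out.println(findBrokenCrossReferences(hrefs, ids));
    }
}
```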

DuplicateIdChecker

Addresses requirement Required Checks (R-4).

Sections, especially headings, can be made link-targets by adding the id="xyz" attribute, yielding for example html headings like the following example.

Problems occur if the same link target is defined several times (also shown below).

<h2 id="seealso">First Heading</h2>
<h2 id="seealso">Second Heading</h2>
<a href="#seealso">Duplicate definition - where shall I go now?</a>
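
Detecting such duplicates boils down to finding values that occur more than once in the list of ids. A minimal sketch, assuming the ids have already been collected from the parsed page (the class and method names are invented for this example):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class DuplicateIdSketch {

    // Returns every id that occurs more than once, in order of first repetition.
    static Set<String> findDuplicateIds(List<String> ids) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (String id : ids) {
            if (!seen.add(id)) {       // add() returns false if the id was already present
                duplicates.add(id);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        System.out.println(findDuplicateIds(Arrays.asList("seealso", "intro", "seealso")));
    }
}
```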

MissingLocalResourcesChecker

Addresses requirement Required Checks (R-3).

Current limitations:

Does NOT perform deep checking of references with anchors of the following form:

<a href="api/Artifact.html#target">GroupInit</a>

containing both a local (file) reference plus an internal anchor #target

See issues #252 (false positives) and #253 (deep links shall be checked)
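
A deep check would presumably first have to separate the file part of such a reference from its anchor, so that the file can be checked for existence and the anchor can be looked up inside that file. The following sketch shows only this first step using `java.net.URI`; the class and method names are hypothetical and nothing here reflects how HSC does or will implement it.

```java
import java.net.URI;

final class DeepLinkSketch {

    // Returns {filePart, anchorPart}; anchorPart is null if the href has no "#...".
    // URI.create throws IllegalArgumentException if the href is not a legal URI.
    static String[] splitReference(String href) {
        URI uri = URI.create(href);
        return new String[] { uri.getPath(), uri.getFragment() };
    }

    public static void main(String[] args) {
        String[] parts = splitReference("api/Artifact.html#target");
        System.out.println("file: " + parts[0] + ", anchor: " + parts[1]);
    }
}
```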

BrokenHttpLinksChecker

Addresses requirement Optional Checks (Opt-2).

Problems here are networking issues, latency and HTTP return codes. This checker is planned, but currently not implemented.
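
Purely as an illustration of the issues named above, a check based on an HTTP HEAD request could be sketched with the JDK's `java.net.http` client (Java 11 or later). Everything in this block is an assumption made for the example: the class and method names, the 5-second timeout, and the rule that status codes from 200 to 399 count as reachable.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

final class HttpLinkSketch {

    // Builds a HEAD request with a timeout, so a slow server cannot block the run.
    static HttpRequest headRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .method("HEAD", HttpRequest.BodyPublishers.noBody())
                .timeout(Duration.ofSeconds(5))
                .build();
    }

    // Assumption of this sketch: 2xx and 3xx responses count as reachable.
    static boolean isSuccess(int statusCode) {
        return statusCode >= 200 && statusCode < 400;
    }

    // Network errors and timeouts are reported as "not reachable".
    static boolean isReachable(HttpClient client, String url) {
        try {
            HttpResponse<Void> response =
                    client.send(headRequest(url), HttpResponse.BodyHandlers.discarding());
            return isSuccess(response.statusCode());
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();
        for (String url : args) {
            System.out.println(url + " reachable: " + isReachable(client, url));
        }
    }
}
```

Because some servers reject HEAD requests, a real checker might need to fall back to GET, which this sketch does not attempt.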

IllegalLinkChecker

Addresses requirement Required Checks (R-5).

This checker is planned, but currently not implemented.

Encapsulate HTML Parsing

We encapsulate the third-party HTML parser (https://jsoup.org) in simple wrapper classes with interfaces specific to our different checking algorithms.

Flexible Reporting

HSC allows for different output options:

formats (HTML and text) and
destinations (file and console)

The reporting subsystem uses the template method pattern to allow different output formats (e.g. Console and HTML). The overall structure of reports is always the same:

Graphical clients can use the API of the reporting subsystem to display reports in arbitrary formats.

The (generic and abstract) reporting is implemented in the abstract Reporter class as follows:

/**
 * main entry point for reporting - to be called when a report is requested
 * Uses template-method to delegate concrete implementations to subclasses
*/
    public void reportFindings() {
        initReport()            // (1)
        reportOverallSummary()  // (2)
        reportAllPages()        // (3)
        closeReport()           // (4)
    }

    private void reportAllPages() {
        pageResults.each { pageResult ->
            reportPageSummary( pageResult ) // (5)
            pageResult.singleCheckResults.each { resultForOneCheck ->
               reportSingleCheckSummary( resultForOneCheck )  // (6)
               reportSingleCheckDetails( resultForOneCheck )  // (7)
            }
            reportPageFooter()                                // (8)
        }
    }

(1) initialize the report, e.g. create and open the file, copy CSS, JavaScript and image files.
(2) create the overall summary, with the overall success percentage and a list of all checked pages with their success rate.
(3) iterate over all pages
(4) write report footer - in HTML report also create back-to-top-link
(5) for a single page, report the nr of checks and problems plus the success rate
(6) for every singleCheck on that page, report a summary and
(7) all detailed findings for a singleCheck.
(8) for every checked page, create a footer, page break or similar to graphically distinguish pages between each other.
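
The same structure can be sketched in plain Java with one concrete subclass. This is a reduced illustration, not the HSC code: a page is represented only by its file name, the per-check steps are omitted, and `TextReporter` with its output format is invented for this example.

```java
import java.util.Arrays;
import java.util.List;

final class ReporterSketch {

    static abstract class Reporter {
        private final List<String> pageNames;

        Reporter(List<String> pageNames) { this.pageNames = pageNames; }

        // Template method: the order of the steps is fixed here.
        public final void reportFindings() {
            initReport();
            reportOverallSummary(pageNames.size());
            for (String page : pageNames) {
                reportPageSummary(page);
                reportPageFooter();
            }
            closeReport();
        }

        protected abstract void initReport();
        protected abstract void reportOverallSummary(int nrOfPages);
        protected abstract void reportPageSummary(String pageName);
        protected abstract void reportPageFooter();
        protected abstract void closeReport();
    }

    // Concrete reporter that collects plain text in memory.
    static final class TextReporter extends Reporter {
        final StringBuilder out = new StringBuilder();

        TextReporter(List<String> pageNames) { super(pageNames); }

        @Override protected void initReport() { out.append("REPORT\n"); }
        @Override protected void reportOverallSummary(int nrOfPages) {
            out.append("pages checked: ").append(nrOfPages).append('\n');
        }
        @Override protected void reportPageSummary(String pageName) {
            out.append("page: ").append(pageName).append('\n');
        }
        @Override protected void reportPageFooter() { out.append("----\n"); }
        @Override protected void closeReport() { out.append("END\n"); }
    }

    public static void main(String[] args) {
        TextReporter reporter = new TextReporter(Arrays.asList("a.html", "b.html"));
        reporter.reportFindings();
        System.out.print(reporter.out);
    }
}
```

An HTML reporter would override the same five steps and emit markup instead, without any change to `reportFindings`.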

Styling the Reporting Output

The HtmlReporter explicitly generates css classes together with the html elements, based upon css styling re-used from the Gradle JUnit plugin.
Stylesheets, a minimized version of jQuery javascript library plus some icons are copied at report-generation time from the jar-file to the report output directory.
Styling the back-to-top arrow/button is done as a combination of JavaScript plus some css styling, as described in https://www.webtipblog.com/adding-scroll-top-button-website/.

Copy Required Resources to Output Directory

When creating the HTML report, we need to copy the required resource files (css, JavaScript) to the output directory.

The corresponding copy method was modeled on Gradle source code.

Attributions

Credits for the arrow-icon https://www.iconfinder.com/icons/118743/arrow_up_icon

Design Decisions

HTML Parsing with jsoup

To check HTML, we parse it into an internal (DOM-like) representation. For this task we use the jsoup HTML parser, an open-source parser without external dependencies.

To quote from the jsoup website:

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.

Goals of this decision

Check HTML programmatically by using an existing API that provides access and finder methods to the DOM tree of the file(s) to be checked.

Decision Criteria

Few dependencies, so the HSC binary stays as small as possible.
Accessor and finder methods to find images, links and link-targets within the DOM tree.

Alternatives

HTTPUnit: a testing framework for web applications and sites. Its main focus is web testing, and it suffers from a large number of dependencies.
jsoup: a plain HTML parser without any dependencies (!) and a rich API to access all HTML elements in DOM-like syntax.

Find details on how HSC implements HTML parsing in the HTML encapsulation concept.

String Similarity Checking with Jaro-Winkler-Distance

The small java-string-similarity library (by Ralph Allen Rice) contains implementations of several similarity-calculation algorithms. As it is publicly available as a binary at Maven Central, we use it as an external dependency. We primarily use its Jaro-Winkler strategy to determine similarity.
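
To show what the Jaro-Winkler measure computes, here is a self-contained Java sketch of the textbook algorithm together with a small helper that picks the most similar candidate. It deliberately does not use the library named above; the class `JaroWinklerSketch`, the method `suggest` and the usual scaling factor of 0.1 with a prefix limit of 4 are assumptions of this example, not a description of the library's API.

```java
import java.util.Arrays;
import java.util.List;

final class JaroWinklerSketch {

    static double jaro(String s1, String s2) {
        if (s1.equals(s2)) return 1.0;
        int len1 = s1.length(), len2 = s2.length();
        if (len1 == 0 || len2 == 0) return 0.0;

        // Characters match if equal and not farther apart than this window.
        int window = Math.max(0, Math.max(len1, len2) / 2 - 1);
        boolean[] matched1 = new boolean[len1];
        boolean[] matched2 = new boolean[len2];
        int matches = 0;
        for (int i = 0; i < len1; i++) {
            int from = Math.max(0, i - window);
            int to = Math.min(len2 - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;

        // Count matched characters that appear in a different order.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < len1; i++) {
            if (!matched1[i]) continue;
            while (!matched2[k]) k++;
            if (s1.charAt(i) != s2.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        double transpositions = outOfOrder / 2.0;
        return (m / len1 + m / len2 + (m - transpositions) / m) / 3.0;
    }

    // Boosts the Jaro score for strings sharing a common prefix (up to 4 chars).
    static double jaroWinkler(String s1, String s2) {
        double jaro = jaro(s1, s2);
        int max = Math.min(4, Math.min(s1.length(), s2.length()));
        int prefix = 0;
        while (prefix < max && s1.charAt(prefix) == s2.charAt(prefix)) prefix++;
        return jaro + prefix * 0.1 * (1.0 - jaro);
    }

    // Returns the candidate most similar to the faulty name, or null if there is none.
    static String suggest(String faulty, List<String> candidates) {
        String best = null;
        double bestScore = -1.0;
        for (String candidate : candidates) {
            double score = jaroWinkler(faulty, candidate);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(jaroWinkler("MARTHA", "MARHTA"));
        System.out.println(suggest("logo.pgn", Arrays.asList("header.jpg", "logo.png")));
    }
}
```

For a misspelled image reference such as `logo.pgn`, the helper would pick `logo.png` from the files that actually exist, which is the kind of suggestion a finding can carry.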

Changing Groovy to Plain Java

In 2024/2025 we decided to port the HTML Sanity Checker to plain Java, because the use of the Groovy framework caused trouble in some environments. In some cases, different versions of the Groovy environment used in Maven projects resulted in incompatibilities that broke the build. By using plain Java we get rid of all Groovy dependencies and thereby eliminate these version conflicts.

Glossary

See the domain model for explanations of important terms.

1. This diagram is outdated and needs replacement, e.g., with a PlantUML diagram.
