Introduction and Goals

HSC shall support authors who create digital documents with hyperlinks and integrated images or similar resources.

Requirements Overview

The overall goal of HSC is to create neat and clear reports that show errors within HTML files, as shown in the adjoining figure.

Figure: sample HSC report

Basic Usage


  1. A user configures the location (directory and filename) of one or more HTML files, including the corresponding image directory.

  2. HSC performs various checks on the files.

  3. HSC reports its results either on the console or as an HTML report.

HSC can run

  • From the command line (CLI), or

  • As a Gradle plugin.

Terminology: What Can Go Wrong in HTML Files?

Apart from purely syntactical errors, many things can go wrong in HTML, especially with hyperlinks, anchors and ids, because those are often maintained manually.

Primary sources of problems are bad links (in technical terms: URIs). For further information, see the background information on URIs.

Broken Cross References

Cross-references (internal links) can be broken, e.g. due to missing or misspelled link targets. See BrokenCrossReferencesChecker.

Missing image files

Referenced image files can be missing or misspelled. See MissingImageFilesChecker.

Missing local resources

Referenced local resources (other than images) can be missing or misspelled. See MissingLocalResourcesChecker.

Duplicate link targets

Link targets can occur several times with the same name, so the browser cannot know which one is the desired target. See DuplicateIdChecker.

Illegal links

The links (aka anchors or URIs) can contain illegal characters or violate HTML link syntax. See IllegalLinkChecker.

Broken external links

External HTTP links can be broken for many reasons: a misspelled address, a link target that is currently offline, or illegal link syntax. See BrokenHttpLinksChecker.

Missing Alt Attribute in Image Tags

Image tags can lack an alt attribute. See MissingImgAltAttributeChecker.

Checking and reporting these errors and flaws is the central business requirement of HSC.
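To make two of these error categories concrete, the following is a minimal sketch of how broken cross references and duplicate link targets could be detected. It is not the implementation of BrokenCrossReferencesChecker or DuplicateIdChecker: the class and method names are hypothetical, and the regular expressions assume double-quoted attribute values, where a real checker would more likely use an HTML parser.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; these names are not part of HSC.
final class LinkTargetSketch {
    // Simplification: only double-quoted attribute values are recognized.
    private static final Pattern ID = Pattern.compile("\\sid=\"([^\"]*)\"");
    private static final Pattern INTERNAL_HREF = Pattern.compile("\\shref=\"#([^\"]*)\"");

    // All id values in document order, duplicates included.
    static List<String> allIds(String html) {
        List<String> ids = new ArrayList<>();
        Matcher m = ID.matcher(html);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    // Targets referenced via href="#..." for which no id is defined.
    static Set<String> brokenCrossReferences(String html) {
        Set<String> defined = new HashSet<>(allIds(html));
        Set<String> broken = new LinkedHashSet<>();
        Matcher m = INTERNAL_HREF.matcher(html);
        while (m.find()) {
            if (!defined.contains(m.group(1))) {
                broken.add(m.group(1));
            }
        }
        return broken;
    }

    // Ids that are defined more than once.
    static Set<String> duplicateIds(String html) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (String id : allIds(html)) {
            if (!seen.add(id)) {
                duplicates.add(id);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        String html = "<h1 id=\"intro\">Intro</h1><p id=\"intro\">again</p>"
                + "<a href=\"#intro\">ok</a><a href=\"#missing\">broken</a>";
        System.out.println(brokenCrossReferences(html)); // prints [missing]
        System.out.println(duplicateIds(html));          // prints [intro]
    }
}
```

Both checks only read the HTML text and never modify it.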

Important terms (domain terms) of HTML sanity checking are documented in a (small) domain model.

General Functionality

Table 1. General Requirements
Functionality: Description

  • read HTML file: HSC shall read a single (configurable) HTML file.

  • Gradle plugin: HSC can be run as a Gradle plugin.

  • command line usage: HSC can be called from the command line with arguments and options.

  • configurable output: output can be configured to go to the console or to a file.

  • free and open source: all required dependencies shall be compliant with the CC-SA-4 licence.

  • available via public repositories: Maven Central.

  • configurable to check multiple HTML files: configure a set of files to be processed in a single run and produce a joint report (useful, e.g., for API documentation with many HTML files referencing each other).

Types of Sanity Checks

Table 2. Required Checks
Check: Description

  • missing image files: Check for all image tags whether the referenced image files exist. See MissingImageFilesChecker.

  • broken internal links: Check for all internal links from anchor tags (href="#XYZ") whether the link targets "XYZ" are defined. See BrokenCrossReferencesChecker.

  • missing local files: Check whether referenced local files exist, such as other HTML files, PDFs or similar. See MissingLocalResourcesChecker.

  • duplicate link targets: Check for all bookmark definitions (… id="XYZ") whether the ids ("XYZ") are unique. See DuplicateIdChecker.

  • malformed links: Check all links for syntactical correctness.

  • missing alt-attribute: Check image tags for a missing alt attribute. See MissingImgAltAttributeChecker.

  • unused image files: Check for files in image directories that are not referenced by any of the HTML files in this run.

  • illegal link targets: Check for malformed or illegal anchors (link targets).
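As an illustration of the two image-related checks in the table above, here is a small sketch under stated assumptions. The class `ImageCheckSketch` and its methods are hypothetical and are not the code of MissingImageFilesChecker or MissingImgAltAttributeChecker. The sketch treats any `src` value containing a colon as non-local (for example `https:` or `data:`) and resolves all other values against a base directory.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; these names are not part of HSC.
final class ImageCheckSketch {
    private static final Pattern IMG_TAG =
            Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Simplification: only double-quoted src values are recognized.
    private static final Pattern SRC =
            Pattern.compile("\\ssrc=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern ALT =
            Pattern.compile("\\salt=", Pattern.CASE_INSENSITIVE);

    // All <img ...> tags in document order.
    static List<String> imgTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = IMG_TAG.matcher(html);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    // Image tags that carry no alt attribute at all.
    static List<String> tagsWithoutAlt(String html) {
        List<String> result = new ArrayList<>();
        for (String tag : imgTags(html)) {
            if (!ALT.matcher(tag).find()) {
                result.add(tag);
            }
        }
        return result;
    }

    // Local src values whose file does not exist below baseDir.
    static List<String> missingImageFiles(String html, Path baseDir) {
        List<String> missing = new ArrayList<>();
        for (String tag : imgTags(html)) {
            Matcher src = SRC.matcher(tag);
            if (!src.find()) {
                continue; // no src attribute, nothing to look up
            }
            String ref = src.group(1);
            if (ref.contains(":")) {
                continue; // assumed non-local (has a scheme such as https: or data:)
            }
            if (!Files.exists(baseDir.resolve(ref))) {
                missing.add(ref);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String html = "<img src=\"images/logo.png\" alt=\"Logo\">"
                + "<img src=\"images/chart.png\">"
                + "<img src=\"https://example.com/x.png\" alt=\"remote\">";
        System.out.println(tagsWithoutAlt(html));
        // The result of the next call depends on the local file system.
        System.out.println(missingImageFiles(html, Paths.get("build", "docs")));
    }
}
```

The base directory `build/docs` in `main` is only an example value; in HSC the directories come from the user's configuration.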

Table 3. Optional Checks
Check: Description

  • missing external images: Check externally referenced images for availability.

  • broken external links: Check external links for both syntax and availability.
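The two aspects of the external link check, syntax and availability, could be separated as in the following sketch. It is an assumption-laden illustration, not the code of BrokenHttpLinksChecker: the class name, the method names and the rule that any status from 200 to 399 counts as reachable are all hypothetical choices. Only standard `java.net` classes are used.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration only; these names are not part of HSC.
final class ExternalLinkSketch {

    // Syntax check: parses as a URI, uses http or https, and names a host.
    static boolean hasValidHttpSyntax(String link) {
        try {
            URI uri = new URI(link);
            String scheme = uri.getScheme();
            return scheme != null
                    && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))
                    && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Availability check: sends a HEAD request (requires network access).
    // Assumption: any status from 200 to 399 counts as reachable.
    static boolean isReachable(String link, int timeoutMillis) {
        if (!hasValidHttpSyntax(link)) {
            return false;
        }
        try {
            HttpURLConnection connection =
                    (HttpURLConnection) URI.create(link).toURL().openConnection();
            connection.setRequestMethod("HEAD");
            connection.setConnectTimeout(timeoutMillis);
            connection.setReadTimeout(timeoutMillis);
            int status = connection.getResponseCode();
            connection.disconnect();
            return status >= 200 && status < 400;
        } catch (IOException | IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] links = {"https://example.com/", "http://exa mple.com/", "#intro"};
        for (String link : links) {
            System.out.println(link + " -> syntax ok: " + hasValidHttpSyntax(link));
        }
    }
}
```

Keeping the syntax check free of network access means it can run in every build, while the slower availability check can remain optional, in line with the table above.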

Reporting and Output Requirements

Table 4. Reporting Requirements
Requirement: Description

  • various output formats: Checking output is available in plain text and HTML.

  • output to stdout: HSC can output results on stdout (the console).

  • configurable file output: HSC can store results in files in configurable directories.

Quality Goals

Table 5. Quality-Goals
Scenario

  • Every broken internal link (cross reference) is found.

  • Every missing local image is found.

  • Multiple checking algorithms, report formats and clients. At least Gradle, command-line and a graphical client have to be supported.

  • Content of the files to be checked is never altered.

  • Correctness of every checker is automatically tested for positive AND negative cases.

  • Every reporting format is tested: reports must exactly reflect checking results.

  • Checking a 100 kB HTML file takes less than 10 seconds (excluding Gradle startup).


Table 6. Stakeholder
Role: Description; Goal, Intention

  • Documentation author
    Description: writes documentation with HTML output.
    Goal: wants to check that the resulting document contains working links and image references.

  • arc42 user
    Description: uses the arc42 template for architecture documentation.
    Goal: wants a small but practical example of how to apply arc42.

  • aim42 contributor
    Description: contributes to the aim42 method guide.
    Goal: wants to check generated HTML code to ensure links and images are correct during the (Gradle-based) build process.

  • software developer
    Goal: wants an example of pragmatic architecture documentation and arc42 usage.

Background Information on URIs

The generic structure of a Uniform Resource Identifier consists of the following parts: [type][://][subdomain][domain][port][path][file][query][hash]

An example, visualized:

Figure: generic URI example
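The generic structure can also be explored in code. The following sketch uses the standard `java.net.URI` class to split a link into its parts. The host `example.com` is an assumed placeholder, and the class name `UriPartsSketch` is hypothetical. Note that `java.net.URI` names some parts differently from the structure above: [type] corresponds to the scheme, [subdomain][domain] to the host, [path][file] to the path, and [hash] to the fragment.

```java
import java.net.URI;

// Hypothetical illustration only; this class is not part of HSC.
final class UriPartsSketch {

    // Returns the parts of an absolute, hierarchical URI in one line.
    static String describe(String link) {
        URI uri = URI.create(link);
        return "scheme=" + uri.getScheme()
                + " host=" + uri.getHost()
                + " port=" + uri.getPort()
                + " path=" + uri.getPath()
                + " query=" + uri.getQuery()
                + " fragment=" + uri.getFragment();
    }

    public static void main(String[] args) {
        // example.com is an assumed host, used for illustration only.
        System.out.println(describe(
                "http://example.com:42/docs/tutorial/index.html?name=aim42#INTRO"));
        // prints: scheme=http host=example.com port=42
        //         path=/docs/tutorial/index.html query=name=aim42 fragment=INTRO
        //         (all on one line)
    }
}
```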

The java.net.URL class contains a generic parser for URLs and URIs. See the following snippet, taken from the unit test class WebTest.groovy:

Generic URI Structure
    void testGenericURISyntax() {
        // based upon an example from the Oracle(tm) Java tutorial:
        def aURL =
                new URL("")
        aURL.with {
            assert getProtocol() == "http"
            assert getAuthority() == ""
            assert getHost() == ""
            assert getPort() == 42
            assert getPath() == "/docs/tutorial/index.html"
            assert getQuery() == "name=aim42"
            assert getRef() == "INTRO"
        }
    }

URIs are used to reference other resources. For HSC it is useful to distinguish between internal (local) and external references:

  • Internal references, a.k.a. Cross-References

  • External references
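One possible way to draw this distinction in code is sketched below. The rule used here is an assumption, not HSC's documented behavior: a reference counts as internal when it has no scheme (or the `file` scheme) and names no host, and as external otherwise. The class and enum names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration only; these names are not part of HSC.
final class ReferenceKindSketch {
    enum Kind { INTERNAL, EXTERNAL, MALFORMED }

    // Assumed rule: no scheme (or "file") and no host means internal.
    static Kind classify(String href) {
        try {
            URI uri = new URI(href);
            String scheme = uri.getScheme();
            boolean localScheme = scheme == null || scheme.equalsIgnoreCase("file");
            boolean hasHost = uri.getHost() != null;
            return (localScheme && !hasHost) ? Kind.INTERNAL : Kind.EXTERNAL;
        } catch (URISyntaxException e) {
            return Kind.MALFORMED;
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "#intro",                  // anchor within the same document
            "chapter2.html#summary",   // other local file
            "https://example.com/",    // assumed example host
            "ht tp://broken"           // illegal character in the scheme
        };
        for (String href : samples) {
            System.out.println(href + " -> " + classify(href));
        }
    }
}
```

Treating malformed references as a separate kind keeps syntax problems apart from the question of where a link points.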

Intra-Document URIs

A file… reference can be an internal link, or a URI without protocol…

References on URIs and HTML Syntax