HTML Sanity Check (HSC) provides some basic sanity checking on HTML files.
It can be helpful in case of HTML generated from, e.g., Asciidoctor, Markdown or other formats — as converters usually don’t check for missing images or broken links.
HSC can currently be used:
As a Gradle plugin, or
Programmatically from Java or other JVM languages (TBD).
Future releases
In the future, we plan to provide
Depending on your usage, you have to:
Install the Gradle Plugin, or
Install the core library for programmatic use (TBD).
Depending on your usage, find the respective:
Gradle Plugin examples, and
Core library examples (TBD).
The overall goal is to create neat and clear reports that show any errors found within HTML files, as shown in the adjoining figure.
Find all '<a href="XYZ">' where XYZ is not defined.
<a href="#missing">internal anchor</a>
...
<h2 id="missinG">Bookmark-Header</h2>
In this example, the bookmark is misspelled.
Use checkerClass BrokenCrossReferencesChecker.
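The cross-reference check above can be illustrated with a minimal sketch. It is not HSC's implementation (HSC builds on the Jsoup parser); it uses only the JDK and simplified regular expressions that assume double-quoted attribute values. The class name `BrokenCrossReferenceSketch` and the method `findBrokenAnchors` are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of HSC: finds local anchors without a matching id.
class BrokenCrossReferenceSketch {
    // Simplification: only double-quoted attribute values are recognized.
    private static final Pattern ID = Pattern.compile("\\sid=\"([^\"]*)\"");
    private static final Pattern LOCAL_HREF = Pattern.compile("\\shref=\"#([^\"]+)\"");

    static List<String> findBrokenAnchors(String html) {
        // Collect every id defined in the document.
        Set<String> ids = new HashSet<>();
        Matcher idMatcher = ID.matcher(html);
        while (idMatcher.find()) {
            ids.add(idMatcher.group(1));
        }
        // Report every local link target that has no matching id (comparison is case-sensitive).
        List<String> broken = new ArrayList<>();
        Matcher hrefMatcher = LOCAL_HREF.matcher(html);
        while (hrefMatcher.find()) {
            String target = hrefMatcher.group(1);
            if (!ids.contains(target)) {
                broken.add(target);
            }
        }
        return broken;
    }

    public static void main(String[] args) {
        String html = "<a href=\"#missing\">internal anchor</a>"
                + "<h2 id=\"missinG\">Bookmark-Header</h2>";
        System.out.println(findBrokenAnchors(html)); // prints [missing]
    }
}
```

Because ids are compared case-sensitively, the misspelled `missinG` does not satisfy the link to `#missing`.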
Images referenced in <img src="XYZ"… tags refer to external files.
The plugin checks the existence of these files.
Use checkerClass MissingImageFilesChecker.
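As a rough sketch of this check, assuming images are resolved relative to a base directory: the class `MissingImageFilesSketch`, the method `findMissingImages`, and the rule that `http:`, `https:` and `data:` sources are skipped are all assumptions made for this illustration, not HSC's actual behavior.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of HSC: lists local image files that do not exist.
class MissingImageFilesSketch {
    // Simplification: only double-quoted src values are recognized.
    private static final Pattern IMG_SRC =
            Pattern.compile("<img\\b[^>]*?\\ssrc=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static List<String> findMissingImages(String html, Path baseDir) {
        List<String> missing = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            String src = m.group(1);
            // Assumption: remote and inline images are not local files, so skip them.
            if (src.startsWith("http:") || src.startsWith("https:") || src.startsWith("data:")) {
                continue;
            }
            // Resolve the reference against the base directory and test for existence.
            if (!Files.exists(baseDir.resolve(src))) {
                missing.add(src);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String html = "<img src=\"images/logo.png\"><img src=\"https://example.org/remote.png\">";
        System.out.println(findMissingImages(html, Paths.get(".")));
    }
}
```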
If any id is defined more than once, an anchor linking to it becomes ambiguous.
Use checkerClass DuplicateIdChecker.
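A duplicate-id check can be sketched in a few lines. This is an illustration under simplifying assumptions (double-quoted attribute values, regex instead of a real HTML parser); `DuplicateIdSketch` and `findDuplicateIds` are hypothetical names, not HSC classes.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of HSC: reports ids that occur more than once.
class DuplicateIdSketch {
    // Simplification: only double-quoted id values are recognized.
    private static final Pattern ID = Pattern.compile("\\sid=\"([^\"]*)\"");

    static Set<String> findDuplicateIds(String html) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>(); // keeps first-seen order
        Matcher m = ID.matcher(html);
        while (m.find()) {
            String id = m.group(1);
            // Set.add returns false when the id was already present.
            if (!seen.add(id)) {
                duplicates.add(id);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        String html = "<h2 id=\"intro\">A</h2><p id=\"intro\">B</p><p id=\"other\">C</p>";
        System.out.println(findDuplicateIds(html)); // prints [intro]
    }
}
```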
Checks all local files (e.g., downloads) referenced from HTML.
Use checkerClass MissingLocalResourcesChecker.
Image-tags should contain an alt-attribute that the browser displays when the original image file cannot be found or cannot be rendered. Having alt-attributes is a good and defensive style.
Use checkerClass MissingAltInImageTagsChecker.
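The alt-attribute check could look roughly like the following sketch. It assumes that no attribute value contains a `>` character and uses regular expressions rather than a parser; `MissingAltSketch` and `findImagesWithoutAlt` are hypothetical names for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of HSC: lists <img> tags that lack an alt attribute.
class MissingAltSketch {
    // Simplification: assumes no attribute value contains '>'.
    private static final Pattern IMG_TAG =
            Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern ALT_ATTR =
            Pattern.compile("\\salt\\s*=", Pattern.CASE_INSENSITIVE);

    static List<String> findImagesWithoutAlt(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = IMG_TAG.matcher(html);
        while (m.find()) {
            String tag = m.group();
            // Keep the tag when no alt attribute appears inside it.
            if (!ALT_ATTR.matcher(tag).find()) {
                result.add(tag);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<img src=\"a.png\" alt=\"Logo\"><img src=\"b.png\">";
        System.out.println(findImagesWithoutAlt(html)); // prints [<img src="b.png">]
    }
}
```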
The current version (derived from branch 1.0.0-RC-2) contains a simple implementation that identifies errors (status >400) and warnings (status 1xx or 3xx).
Status codes are configurable as ranges (some people might want content behind paywalls NOT to result in errors).
Localhost or numerical IP addresses are currently NOT marked as suspicious.
Please comment in case you have additional requirements.
Use checkerClass BrokenHttpLinksChecker.
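To illustrate how configurable status-code ranges might classify a response, here is a minimal sketch. The ranges are assumptions: the text above says "status >400", while this sketch also counts 400 itself as an error and caps errors at 599. The names `StatusCodeSketch`, `classify` and the default range constants are hypothetical and do not reflect HSC's configuration API.

```java
// Hypothetical sketch, not part of HSC: maps an HTTP status to a severity via ranges.
class StatusCodeSketch {
    enum Severity { OK, WARNING, ERROR }

    // Assumed defaults; each entry is an inclusive {from, to} range.
    static final int[][] DEFAULT_WARNINGS = { {100, 199}, {300, 399} };
    static final int[][] DEFAULT_ERRORS = { {400, 599} };

    private static boolean inAnyRange(int status, int[][] ranges) {
        for (int[] range : ranges) {
            if (status >= range[0] && status <= range[1]) {
                return true;
            }
        }
        return false;
    }

    static Severity classify(int status, int[][] warnings, int[][] errors) {
        // Errors take precedence over warnings if the ranges overlap.
        if (inAnyRange(status, errors)) return Severity.ERROR;
        if (inAnyRange(status, warnings)) return Severity.WARNING;
        return Severity.OK;
    }

    public static void main(String[] args) {
        System.out.println(classify(404, DEFAULT_WARNINGS, DEFAULT_ERRORS)); // prints ERROR
        System.out.println(classify(301, DEFAULT_WARNINGS, DEFAULT_ERRORS)); // prints WARNING
        // Paywall-tolerant variant: 402 (Payment Required) is excluded from the error ranges.
        int[][] paywallTolerant = { {400, 401}, {403, 599} };
        System.out.println(classify(402, DEFAULT_WARNINGS, paywallTolerant)); // prints OK
    }
}
```

Passing the ranges as parameters mirrors the idea that users can reconfigure which codes count as errors, for example to tolerate links that lead behind a paywall.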
ftp, ntp or other protocols are currently not checked, but should be.
In addition to checking HTML, this project serves as an example for arc42.
Please see our software architecture documentation.
This tiny piece rests on incredible groundwork:
Jsoup HTML parser and analysis toolkit - robust and easy-to-use.
IntelliJ IDEA - my (Gernot) best (programming) friend.
Of course, Groovy, Gradle, JUnit and Spock framework.
The plugin heavily relies on code provided by Gradle.
Inspiration on code organization, implementation and testing of the plugin came from the Asciidoctor-Gradle-Plugin by Andres Almiray.
Code for string similarity calculation by Ralph Rice.
Implementation, maintenance and documentation by
Initially: Gernot Starke,
Currently: Gerd Aschemann and several other contributors.
Once upon a time, rackerlabs hosted gradle-linkchecker-plugin, an open source Gradle plugin.
It validated that all links in a local HTML file tree go out to other existing local files or remote web locations, creating a simple text file report.
However, as of 2024-08-14 they have deleted the repository (there seems to be a fork at https://github.com/leonard84/gradle-linkchecker-plugin).
It was perhaps based on a similar approach (linkchecker-maven-plugin) for Maven.
Benjamin Muschko has created a (Go-based) command-line tool to check links, called link verifier.
html-proofer is written in Ruby and provides different usage scenarios (programmatically, CLI, and Docker).
htmltest is also written in Go(Lang) and claims to be fast compared to html-proofer (stay tuned; we plan to make HSC run quickly with Graal).
Please report issues or suggestions.
In case you want to check out, build, fork and/or contribute, take a look at our Development Information.
Currently, code is published under the Apache-2.0 licence, documentation under Creative-Commons-Sharealike-4.0. Some day we’ll unify that 😬.