class AllCheckers {
public final static Set<Class> checkerClazzes =
[BrokenCrossReferencesChecker,
BrokenHttpLinksChecker,
DuplicateIdChecker,
ImageMapChecker,
MissingAltInImageTagsChecker,
MissingImageFilesChecker,
MissingLocalResourcesChecker].toSet()
}
Design Discussions
Automated Checker Factory
Issue 185 proposes to automatically detect and instantiate Checker (sub-)classes.
Problem
We want to adhere to the Open-Close principle (OCP) when adding new checkers:
There shall be no need to touch other code when writing a new checker, especially should there be no central registry of checkers.
Up to v 1.0.0-RC-1 we violated the OCP and allowed major redundancies in the code:
Solution Approach
A solution needs to address two issues:
1.) find all Checker subclasses One approach here relies on Java annotations, described by Przemysław Wojnowski, and uses the Reflections library.
2.) instantiate these classes (aka Factory for the checkers)
Refactoring the Main Checking Loop
Issue 190 proposes to merge the two classes AllChecksRunner and ChecksRunner.
Proposed (new) Scenario
Handling 403 and 405 Status Codes
Issue 219 explains that the naive link checks based upon a http HEAD request sometimes fail, e.g. for links to amazon.com:
For example: (`curl -I performs a HEAD request)
<snip> curl -I https://www.amazon.de/dp/3446443487 HTTP/2 405 server: Server content-type: text/html;charset=UTF-8 </snip>
This is clearly a false negative, as the URL itself is correct and the page exists.
Proposed Approach
Reason for this "double check" is to keep the transmitted data volume low (and performance higher).
The general behavior we implement shall be the following `
if (responseCode in successCodes) then return
else {
// try GET
connection.setRequestMethod("GET")
int finalResponseCode = connection.getResponseCode()
switch (realResponseCode) {
case warningCodes: println "real warning $responseCode"; break
case errorCodes: println "real error $responseCode"; break
default: println "Error: Unknown or unclassified response code"
}
}
Handling 30x Status Codes
Issue 244 explains that status codes 30x (redirect of some kind) might signal an error, but can return the new location of the requested resource.
For example: (`curl -I performs a HEAD request)
<snip> curl -I http://www.arc.de HTTP/1.1 301 Moved Permanently Content-Type: text/html; charset=UTF-8 Connection: keep-alive Keep-Alive: timeout=15 Date: Wed, 24 Oct 2018 14:16:04 GMT Server: Apache X-Powered-By: PHP/7.0.32 Location: https://arc42.de/ </snip>
So - the url exists in principle (in the example above,
it only changed from http
to https
…
Proposed Approach
A very simple solution would be to access the location header and return that together with the warning message.
if (responseCode in [301, 302, 303, 307, 308]) then {
// get location
String newLocation
if (connection.headerFields.'Location') {
// add connection.headerFields.Location.first() to warning message
}
}
Issue 252 (MissingLocalResourcesChecker gives false positives)
Background
The short form of a URL:
<a href="/foo">
can actually mean one of the following:
-
/foo.html
-
more generally
/foo.<extension>
-
/foo/index.<extension>
where the extension is one of the following: html, htm, shtml, phtml, php, asp, aspx, xml
Proposed Approach
-
Define a proper name for the "link without extension" (
prefixOnlyHref
) -
When checking local resources, add a special case for URLs with
prefixOnlyHref
. -
Add a new configuration item with the extensions to be checked. The default should be the list given above.
-
Add these defaults to
Web
Decision
Feedback
Was this page helpful?
Glad to hear it! Please tell us how we can improve.
Sorry to hear that. Please tell us how we can improve.