Design Discussions

Automated Checker Factory

Issue 185 proposes to automatically detect and instantiate Checker (sub-)classes.

Problem

We want to adhere to the Open-Close principle (OCP) when adding new checkers:

There shall be no need to touch other code when writing a new checker, especially should there be no central registry of checkers.

Up to v 1.0.0-RC-1 we violated the OCP and allowed major redundancies in the code:

class AllCheckers {

    public final static Set<Class> checkerClazzes =
            [BrokenCrossReferencesChecker,
             BrokenHttpLinksChecker,
             DuplicateIdChecker,
             ImageMapChecker,
             MissingAltInImageTagsChecker,
             MissingImageFilesChecker,
             MissingLocalResourcesChecker].toSet()
}

Solution Approach

A solution needs to address two issues:

1.) find all Checker subclasses One approach here relies on Java annotations, described by Przemysław Wojnowski, and uses the Reflections library.

2.) instantiate these classes (aka Factory for the checkers)

Refactoring the Main Checking Loop

Issue 190 proposes to merge the two classes AllChecksRunner and ChecksRunner.

Proposed (new) Scenario

mainloop new

Handling 403 and 405 Status Codes

Issue 219 explains that the naive link checks based upon a http HEAD request sometimes fail, e.g. for links to amazon.com:

For example: (`curl -I performs a HEAD request)

<snip>

curl -I https://www.amazon.de/dp/3446443487
HTTP/2 405
server: Server
content-type: text/html;charset=UTF-8

</snip>

This is clearly a false negative, as the URL itself is correct and the page exists.

Proposed Approach

GET after HEAD failapproach

Reason for this "double check" is to keep the transmitted data volume low (and performance higher).

The general behavior we implement shall be the following `

if (responseCode in successCodes) then return
else {

    // try GET
    connection.setRequestMethod("GET")
    int finalResponseCode = connection.getResponseCode()

    switch (realResponseCode) {
        case warningCodes: println "real warning $responseCode"; break
        case errorCodes: println "real error $responseCode"; break

        default: println "Error: Unknown or unclassified response code"
    }
}

Handling 30x Status Codes

Issue 244 explains that status codes 30x (redirect of some kind) might signal an error, but can return the new location of the requested resource.

For example: (`curl -I performs a HEAD request)

<snip>

curl -I http://www.arc.de

HTTP/1.1 301 Moved Permanently
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Keep-Alive: timeout=15
Date: Wed, 24 Oct 2018 14:16:04 GMT
Server: Apache
X-Powered-By: PHP/7.0.32
Location: https://arc42.de/
</snip>

So - the url exists in principle (in the example above, it only changed from http to https…​

Proposed Approach

A very simple solution would be to access the location header and return that together with the warning message.

if (responseCode in [301, 302, 303, 307, 308]) then {
    // get location
    String newLocation
    if (connection.headerFields.'Location') {
         // add connection.headerFields.Location.first() to warning message
    }
}

Issue 252 (MissingLocalResourcesChecker gives false positives)

Problem Description

The MLR-Checker currently gives false positives ("false alarms") if the link has a form like

<a href="/foo">

Background

The short form of a URL:

<a href="/foo">

can actually mean one of the following:

  • /foo.html

  • more generally /foo.<extension>

  • /foo/index.<extension>

where the extension is one of the following: html, htm, shtml, phtml, php, asp, aspx, xml

Proposed Approach

  • Define a proper name for the "link without extension" (prefixOnlyHref)

  • When checking local resources, add a special case for URLs with prefixOnlyHref.

  • Add a new configuration item with the extensions to be checked. The default should be the list given above.

  • Add these defaults to Web

Decision

Document the decision in Detail by moving all information from this Discussion to chapter 9 "Design Decisions" of your arc42 docs.