There are many online services which offer to probe a website for broken links. I prefer using wget to do the legwork for me. It is slow, but it gets the job done for my audits, reporting on both internal and external broken links.
wget --spider -r -nd -nv -w 2 -o run1.log https://example.org
The command uses wget to recursively scan a website (https://example.org) in a non-intrusive way (spider mode).
- --spider: Makes wget act like a web crawler (checks links without downloading)
- -r: Recursive download (follows links)
- -nd: Do not create a hierarchy of directories
- -nv: Non-verbose (quiet mode)
- -w 2: Wait 2 seconds between requests (so as to not flood the website)
- -o run1.log: Save output to run1.log instead of stdout
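For a large site, it can also help to cap the recursion depth and identify the audit in the request's user agent so the site owner can recognise the traffic. A minimal variation on the command above; the depth limit of 3 and the contact address are assumptions for illustration, not part of my usual audit:

# Assumed variant: limit the crawl to three levels of links and announce who is crawling
wget --spider -r -l 3 -nd -nv -w 2 \
     --user-agent="link-audit (webmaster@example.org)" \
     -o run2.log https://example.org

Here -l 3 (--level) stops wget from following links more than three levels deep, and --user-agent simply replaces the default identification string.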
Find broken links in the log file
An example of the log output:
2025-06-24 10:00:00 URL: https://example.org/good-page 200 OK
2025-06-24 10:00:02 URL: https://example.org/broken-page 404 Not Found
2025-06-24 10:00:04 URL: https://example.org/missing.jpg [following]
2025-06-24 10:00:06 URL: https://example.org/missing.jpg 404 Not Found
2025-06-24 10:00:08 URL: https://example.org/forbidden-page: 403 Forbidden
2025-06-24 10:00:10 URL: https://broken.com: Failed: Name or service not known
2025-06-24 10:00:12 URL: https://example.org/redirect-loop: Too many redirects
- The URLs with the 404 Not Found error are missing.
- The URLs with the 403 Forbidden error point to resources the server refuses to serve.
- The URLs with the Failed: Name or service not known error indicate failed DNS resolution or a failed connection.
- The URLs with the Too many redirects error indicate redirect loops (HTTP 301/302 issues).
All the above are types of broken links.
Instead of trawling through the log file line-by-line, use grep to filter for errors.
grep -E '404|Failed|error' run1.log
Or for a cleaner list of just broken URLs:
grep -B1 '404 Not Found' run1.log | grep 'https://'
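To tally each type of failure, or to pull just the offending URLs into a deduplicated list, the filters can be combined. A rough sketch that assumes the log format shown above, where the URL is the fourth whitespace-separated field; adjust the patterns and the awk column if your wget version logs differently:

# Count how many times each error type appears in the log
grep -oE '404 Not Found|403 Forbidden|Failed|Too many redirects' run1.log | sort | uniq -c

# List the unique URLs on error lines, stripping any trailing colon
grep -E '404|403|Failed|Too many redirects' run1.log | awk '{print $4}' | sed 's/:$//' | sort -u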