Broken Links Checker

Mar 24, 2021 · 2 min read


We use a Python script and wget to check for broken links on this site.

The Python script scrapes links from the specified folders, checks each link, and logs the broken ones to /tmp/brokenlinks.log.


brokenlinks.py


import glob
import bs4
import urllib.request
import os.path

def checkLinks(links):
  # Request each link and log the broken ones to /tmp/brokenlinks.log
  brokenCount = 0
  log = open("/tmp/brokenlinks.log", "w")
  for link in links:
    message = ""
    try:
      status_code = urllib.request.urlopen(link).getcode()
      if status_code != 200:
        message = link + " status code: " + str(status_code)
    except Exception:
      message = link + " broken"

    if message:
      brokenCount += 1
      print(message)
      log.write(link + "\n")

  log.close()
  return brokenCount

def extractLinks(files):
  # Parse each HTML file and collect the external (http/https) links
  httpLinks = []
  for file in files:
    with open(file) as f:
      soup = bs4.BeautifulSoup(f, features="html5lib")

    links = [link['href'] for link in soup('a') if 'href' in link.attrs]

    for link in links:
      if link.startswith('http'):
        httpLinks.append(link)

  # Deduplicate, dropping query strings
  linkSet = set()
  for link in httpLinks:
    linkSet.add(link.split('?')[0])

  return linkSet

def findFiles(dirs, pattern):
  # Recursively find files matching pattern under each directory
  files = list()
  for dir in dirs:
    glb = dir + '/**/' + pattern
    paths = glob.glob(glb, recursive=True)
    files.extend(paths)

  return files

def checkDirs(dirs):
  # Verify that every directory to scan actually exists
  clean = True
  for dir in dirs:
    if not os.path.isdir(dir):
      print("no such dir: " + dir)
      clean = False
  return clean

# entry point

dirs = []
dirs.append('public/tutorial')
dirs.append('public/post')

if not checkDirs(dirs):
  exit(1)

htmlFiles = findFiles(dirs,'*.html')

links = extractLinks(htmlFiles)
print("total links: " + str(len(links)))

brokenCount = checkLinks(links)
print("broken links: " + str(brokenCount))

The Python urllib doesn't follow the moved links and shows them as broken. Use the following bash script to recheck the broken links with wget. The script takes /tmp/brokenlinks.log as input and writes the final list to /tmp/brokenlinks.lst.

Alternatively, you can comment out the checkLinks(links) call in brokenlinks.py so that it only scrapes the links, then use recheck.sh to check all of them.
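If you take that route, the entry point also needs to write the scraped links to /tmp/brokenlinks.log so that recheck.sh has something to read. A minimal sketch of the modified entry point (the logging loop here is our addition, not part of the original script):

links = extractLinks(htmlFiles)
print("total links: " + str(len(links)))

# brokenCount = checkLinks(links)   # skipped -- wget does the checking
with open("/tmp/brokenlinks.log", "w") as log:
  for link in links:
    log.write(link + "\n")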

recheck.sh


#!/bin/bash
# Recheck the links in /tmp/brokenlinks.log with wget and write the
# ones that are still broken to /tmp/brokenlinks.lst

echo "-- Broken Links --" > /tmp/brokenlinks.lst

while read -r line; do
  wget -q --spider "$line"
  retVal=$?
  if [ $retVal -ne 0 ]; then
    echo "$line : broken"
    echo "$line" >> /tmp/brokenlinks.lst
  fi
done < /tmp/brokenlinks.log

Setup and Run

To run the above scripts, first install the required Python modules. Note that html5lib is needed because the script uses it as the BeautifulSoup parser.

sudo apt install python3-pip
pip3 install beautifulsoup4 html5lib

Next, edit brokenlinks.py and specify the folders to scan in the dirs list defined at the entry point of the script. By default, output is written to the /tmp folder; change the path if required.
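For example, to scan other sections of the generated site, add the corresponding public/ subfolders (the extra folder name below is just a placeholder, substitute your own):

dirs = []
dirs.append('public/tutorial')
dirs.append('public/post')
dirs.append('public/notes')   # hypothetical extra folder

Then build the Hugo site and run the script: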

cd hugo-workspace
rm -rf public
hugo

python3 brokenlinks.py

Finally, run recheck.sh to get the final list of broken links.
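Assuming recheck.sh is in the current directory, it can be run with bash; the final list ends up in /tmp/brokenlinks.lst:

bash recheck.sh
cat /tmp/brokenlinks.lst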