Broken Links Checker

Mar 24, 2021 · 2 min read


We use a Python script and wget to check for broken links on this site.

The Python script scrapes links from the specified folders, checks each link, and logs the broken ones to /tmp/brokenlinks.log.


brokenlinks.py


import glob
import bs4
import urllib.request
import os.path

def checkLinks(links):
  # Request each link and log the broken ones to /tmp/brokenlinks.log
  brokenCount = 0
  log = open("/tmp/brokenlinks.log", "w")
  for link in links:
    message = ""
    try:
      status_code = urllib.request.urlopen(link).getcode()
      if status_code != 200:
        message = link + " status code: " + str(status_code)
    except Exception:
      message = link + " broken"

    if message:
      brokenCount += 1
      print(message)
      log.write(link + "\n")

  log.close()
  return brokenCount

def extractLinks(files):
  # Parse each HTML file and collect the external (http/https) links
  httpLinks = []
  for file in files:
    with open(file) as f:
      soup = bs4.BeautifulSoup(f, features="html5lib")

    links = [link['href'] for link in soup('a') if 'href' in link.attrs]

    for link in links:
      if link.startswith('http'):
        httpLinks.append(link)

  # Deduplicate, dropping query strings
  linkSet = set()
  for link in httpLinks:
    linkSet.add(link.split('?')[0])

  return linkSet

def findFiles(dirs, pattern):
  # Recursively find files matching pattern under each directory
  files = list()
  for dir in dirs:
    glb = dir + '/**/' + pattern
    paths = glob.glob(glb, recursive=True)
    files.extend(paths)

  return files

def checkDirs(dirs):
  # Verify that every directory to scan actually exists
  clean = True
  for dir in dirs:
    if not os.path.isdir(dir):
      print("no such dir: " + dir)
      clean = False
  return clean

# entry point

dirs = []
dirs.append('public/tutorial')
dirs.append('public/post')

if not checkDirs(dirs):
  exit(1)

htmlFiles = findFiles(dirs,'*.html')

links = extractLinks(htmlFiles)
print("total links: " + str(len(links)))

brokenCount = checkLinks(links)
print("broken links: " + str(brokenCount))

The Python urllib doesn't follow the moved links and shows them as broken. Use the following bash script to recheck the broken links with wget. The script takes /tmp/brokenlinks.log as input and writes the final list to /tmp/brokenlinks.lst.

Alternatively, you can comment out the checkLinks(links) call in brokenlinks.py so that it only scrapes the links, then use recheck.sh to check all of them.
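If you take that route, the entry point also needs to write the scraped links to /tmp/brokenlinks.log so that recheck.sh has something to read. A minimal sketch of the modified entry point (the logging loop here is our addition, not part of the original script):

links = extractLinks(htmlFiles)
print("total links: " + str(len(links)))

# brokenCount = checkLinks(links)   # skipped -- wget does the checking
with open("/tmp/brokenlinks.log", "w") as log:
  for link in links:
    log.write(link + "\n")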

recheck.sh


#!/bin/bash
# Recheck the links in /tmp/brokenlinks.log with wget and write the
# ones that are still broken to /tmp/brokenlinks.lst

echo "-- Broken Links --" > /tmp/brokenlinks.lst

while read -r line; do
  wget -q --spider "$line"
  retVal=$?
  if [ $retVal -ne 0 ]; then
    echo "$line : broken"
    echo "$line" >> /tmp/brokenlinks.lst
  fi
done < /tmp/brokenlinks.log

Setup and Run

To run the above scripts, first install the required Python modules. Note that html5lib is needed because the script uses it as the BeautifulSoup parser.

sudo apt install python3-pip
pip3 install beautifulsoup4 html5lib

Next, edit brokenlinks.py and specify the folders to scan in the dirs list defined at the entry point of the script. By default, output is written to the /tmp folder; change the path if required.
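For example, to scan other sections of the generated site, add the corresponding public/ subfolders (the extra folder name below is just a placeholder, substitute your own):

dirs = []
dirs.append('public/tutorial')
dirs.append('public/post')
dirs.append('public/notes')   # hypothetical extra folder

Then build the Hugo site and run the script: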

cd hugo-workspace
rm -rf public
hugo

python3 brokenlinks.py

Finally, run recheck.sh to get the final list of broken links.
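Assuming recheck.sh is in the current directory, it can be run with bash; the final list ends up in /tmp/brokenlinks.lst:

bash recheck.sh
cat /tmp/brokenlinks.lst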