Broken Link Checker
We use a Python script and wget to check for broken links on this site.
Log Broken Links
The Python script scrapes links from the specified folders, checks each link, and logs the broken ones to /tmp/brokenlinks.log.
brokenlinks.py
import glob
import bs4
import urllib.request
import os.path

def checkLinks(links):
    brokenCount = 0
    log = open("/tmp/brokenlinks.log", "w")
    for link in links:
        message = ""
        try:
            status_code = urllib.request.urlopen(link).getcode()
            if status_code != 200:
                message = link + " status code: " + str(status_code)
        except Exception:
            message = link + " broken"
        if message:
            brokenCount += 1
            print(message)
            log.write(link + "\n")
    log.close()
    return brokenCount

def extractLinks(files):
    httpLinks = []
    for file in files:
        with open(file) as f:
            soup = bs4.BeautifulSoup(f, features="html5lib")
            links = [link['href'] for link in soup('a') if 'href' in link.attrs]
            for link in links:
                if link.startswith('http'):
                    httpLinks.append(link)
    # drop query strings and de-duplicate
    linkSet = set()
    for link in httpLinks:
        linkSet.add(link.split('?')[0])
    return linkSet

def findFiles(dirs, pattern):
    files = list()
    for dir in dirs:
        glb = dir + '/**/' + pattern
        paths = glob.glob(glb, recursive=True)
        files.extend(paths)
    return files

def checkDirs(dirs):
    clean = True
    for dir in dirs:
        if not os.path.isdir(dir):
            print("no such dir: " + dir)
            clean = False
    return clean

# entry point
dirs = []
dirs.append('public/tutorial')
dirs.append('public/post')

if not checkDirs(dirs):
    exit()

htmlFiles = findFiles(dirs, '*.html')
links = extractLinks(htmlFiles)
print("total links: " + str(len(links)))
brokenCount = checkLinks(links)
print("broken links: " + str(brokenCount))
Recheck Broken Links
The Python urllib doesn't follow moved links and shows them as broken. Use the following bash script to recheck the broken links with wget. The script takes /tmp/brokenlinks.log as input and writes the final list to /tmp/brokenlinks.lst.
Alternatively, you can comment out the checkLinks(links) call in brokenlinks.py so that it only scrapes the links, then use recheck.sh to check all of them.
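If you go that route, note that checkLinks() is also what writes /tmp/brokenlinks.log, so the scraped links have to be written out explicitly. A minimal sketch of the modified entry point (the exact edit is up to you):

# entry point (sketch): scrape only, let recheck.sh do the checking
htmlFiles = findFiles(dirs, '*.html')
links = extractLinks(htmlFiles)
print("total links: " + str(len(links)))
# brokenCount = checkLinks(links)   # commented out
with open("/tmp/brokenlinks.log", "w") as log:
    for link in links:
        log.write(link + "\n")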
recheck.sh
echo "-- Broken Links --" > /tmp/brokenlinks.lst
while read -r line; do
wget -q --spider "$line"
retVal=$?
if [ $retVal -ne 0 ]; then
echo "$line : broken"
echo "$line" >> /tmp/brokenlinks.lst
fi
done < /tmp/brokenlinks.log
Setup and Run
To run the above scripts, first install the required Python modules.
sudo apt install python3-pip
pip3 install beautifulsoup4
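The script asks BeautifulSoup for the html5lib parser (features="html5lib"), so that parser needs to be installed as well:
pip3 install html5lib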
Next, edit brokenlinks.py and specify the folders to scan in the dirs list defined at the entry point of the script. By default, output is written to the /tmp folder; change the path if required. Build the Hugo site and run the script.
cd hugo-workspace
rm -rf public
hugo
python3 brokenlinks.py
Finally, run recheck.sh to get the final list of broken links.
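One way to run it, assuming recheck.sh is saved in the same hugo-workspace directory:
bash recheck.sh
cat /tmp/brokenlinks.lst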