Warm up pages using the Trafilatura web scraping tool

Hello,

Today I want to share how you can use this very user-friendly web scraping tool to warm up the pages of your website or store.

$ python3 --version
Python 3.8.6 # version 3.6 or higher is fine

pip3 install trafilatura  
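You can verify the installation with the CLI's version flag (a quick check; the flag is available in recent releases):

trafilatura --version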

There are two ways to generate a list of links.

The following examples return lists of links. If --list is absent, the pages that have been found are directly retrieved, processed, and returned in the chosen output format (by default: TXT on standard output).
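For example, with --list removed, Trafilatura fetches a page and prints its extracted text directly, which is a quick way to confirm the tool works on your site:

trafilatura -u "https://www.sitemaps.org/"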

Run link discovery through a sitemap for sitemaps.org and store the resulting links in a file:

trafilatura --sitemap "https://www.sitemaps.org/" --list > mylinks.txt  

Using an already known sitemap URL:

trafilatura --sitemap "https://www.sitemaps.org/sitemap.xml" --list  

Substitute the sitemaps.org address with your own website or store URL.
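Before moving on, a quick sanity check confirms the file actually contains URLs:

wc -l mylinks.txt    # how many links were discovered
head -n 3 mylinks.txt    # peek at the first few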

Next, once the links are ready, we can start crawling the website pages. We will use the popular curl (https://linux.die.net/man/1/curl) tool combined with parallel (https://linux.die.net/man/1/parallel) to get things done faster.

pip3 install trafilatura
sudo apt-get update -y
sudo apt-get install -y parallel
trafilatura --sitemap "https://website-address.com/" --list > mylinks.txt
rm -f config
# curl's -K/--config file expects "option = value" lines, not full curl commands
while read -r f; do printf 'url = "%s"\noutput = "/dev/null"\n' "$f" >> config; done < mylinks.txt
curl -s -K config
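For reference, the loop above produces a curl config file whose entries look like this (the URLs here are illustrative placeholders):

url = "https://website-address.com/page-1"
output = "/dev/null"
url = "https://website-address.com/page-2"
output = "/dev/null"

Each url line is paired with an output line, so every page is fetched and its body discarded, which is exactly what we want for cache warming.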

A second, faster way using parallel:

# parallel substitutes each URL from mylinks.txt for {} and keeps 5 curl jobs running at once
parallel -j 5 'curl -s -o /dev/null {}' < mylinks.txt
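If you want visibility into what the warm-up actually did, a small variation (a sketch; the report filename is just an example) uses curl's -w flag to record each HTTP status code:

parallel -j 5 'curl -s -o /dev/null -w "%{http_code} %{url_effective}\n" {}' < mylinks.txt > warm-report.txt

Tune -j to a level your server can comfortably absorb; more jobs is not always faster if the origin starts queueing requests.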

You can combine these commands into a one-liner and run it as a cron task, or you can check out the Docker image I've created just for this and run it as a container, or schedule it as a Kubernetes CronJob.
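Here is a sketch of that one-liner and a matching crontab entry (substitute your own domain; the schedule and log path are example values):

trafilatura --sitemap "https://website-address.com/" --list | parallel -j 5 'curl -s -o /dev/null {}'

# crontab entry: warm the cache every day at 03:00
0 3 * * * trafilatura --sitemap "https://website-address.com/" --list | parallel -j 5 'curl -s -o /dev/null {}' >> /var/log/cache-warm.log 2>&1

Note that cron runs with a minimal PATH, so you may need absolute paths to trafilatura and parallel in the entry.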

As a free gift at the end of this article, please check out the instructions in my GitHub repository, https://github.com/nemke82/asger-cache-warm, and use them as a starting point on your Trafilatura and Parallel journey. Good luck!