Today I want to share how you can use Trafilatura, a very user-friendly web scraping tool, to warm up the cache for pages on your website or store.
$ python3 --version
Python 3.8.6 # version 3.6 or higher is fine
pip3 install trafilatura
There are two ways to generate a list of links.
The following examples return lists of links. If --list is absent, the pages that have been found are directly retrieved, processed, and returned in the chosen output format (by default: TXT on standard output).
Run link discovery through the sitemap of sitemaps.org and store the resulting links in a file:
trafilatura --sitemap "https://www.sitemaps.org/" --list > mylinks.txt
Using an already known sitemap URL:
trafilatura --sitemap "https://www.sitemaps.org/sitemap.xml" --list
Substitute the sitemaps.org address with the URL of your own website or store.
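If you are curious what --sitemap does under the hood, here is a minimal, stdlib-only Python sketch of the core idea (an illustration, not Trafilatura's actual implementation): parse the sitemap XML and collect the <loc> entries, assuming the standard sitemaps.org namespace.

```python
# Minimal sketch of sitemap link discovery: extract <loc> URLs
# from sitemap XML using only the Python standard library.
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def sitemap_links(xml_text):
    """Return all <loc> URLs found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{{{SITEMAP_NS}}}loc")]

example = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.sitemaps.org/</loc></url>
  <url><loc>https://www.sitemaps.org/protocol.html</loc></url>
</urlset>"""

print(sitemap_links(example))
```

Trafilatura also handles fetching, sitemap indexes, and compressed sitemaps for you, which is why the CLI is the more practical choice here.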
Next, once the link list is ready, we can start crawling the website pages. We will use the popular curl (https://linux.die.net/man/1/curl) tool combined with parallel (https://linux.die.net/man/1/parallel) to get things done faster.
pip3 install trafilatura
sudo apt-get update -y
sudo apt-get install -y parallel
trafilatura --sitemap "https://website-address.com/" --list > mylinks.txt
rm -f config
while read -r f; do
  printf 'url = "%s"\noutput = "/dev/null"\n' "$f" >> config
done < mylinks.txt
curl -s -K config
Note that curl -K expects a config file of curl options, one per line, so each link gets a url = entry paired with an output = entry.
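If you prefer to stay in Python, the same warm-up loop can be sketched with the standard library alone. Note that warm is a hypothetical helper (not part of Trafilatura), and the fetch parameter exists only so the network call can be swapped out for testing:

```python
# Stdlib-only sketch of a parallel cache warmer: request every URL
# from mylinks.txt with a small thread pool, discarding the bodies.
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def warm(urls, workers=5, fetch=None):
    """Request every URL; return a {url: status_or_error} mapping."""
    fetch = fetch or (lambda u: urlopen(u, timeout=10).status)
    results = {}

    def one(url):
        try:
            results[url] = fetch(url)
        except Exception as exc:
            results[url] = repr(exc)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(one, urls))
    return results

if __name__ == "__main__" and os.path.exists("mylinks.txt"):
    with open("mylinks.txt") as fh:
        urls = [line.strip() for line in fh if line.strip()]
    for url, status in warm(urls).items():
        print(url, status)
```

Five workers roughly matches the parallel -j 5 setting used below; tune it to what your server can absorb without hurting live traffic.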
The second, faster way uses parallel:
rm -f config
printf '#!/bin/sh\ncurl -s -o /dev/null "$1"\n' > config
chmod +x config
parallel -I% -j 5 ./config % < mylinks.txt
You can combine these into a one-liner and run it as a cron task, or you can check out the Docker image I created just for this purpose and run it as a container, or configure it as a Kubernetes scheduled task (CronJob).
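For the cron route, a minimal crontab entry might look like the following (the schedule and site URL are illustrative, and it assumes trafilatura and curl are on the cron user's PATH):

```
# Warm the cache every night at 03:00
0 3 * * * trafilatura --sitemap "https://website-address.com/" --list | xargs -n 1 curl -s -o /dev/null
```

Piping through xargs -n 1 runs one curl invocation per discovered link, so no intermediate file is needed.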
As a free gift at the end of this article, please check out the instructions in my GitHub repository, https://github.com/nemke82/asger-cache-warm, and use them as a starting point on your Trafilatura and Parallel journey. Good luck!