
A self-challenge: make a Python bot to post to Hubzilla. Having left Mastodon, I needed to re-enable my Al Jazeera RSS feed bot.
The concept is simple enough: (1) get the RSS feed, (2) look for new items, (3) post only the latest items, and (4) make note of the time when the last post was made.
Hubzilla is a powerhouse and it is possible to (1) follow RSS feeds and (2) mirror those as posts in a channel. However, there is a warning about RSS feeds in Hubzilla which reads “Heavy system resource usage”. Having previously used Hubzilla on a shared hosting platform, I can confirm that it does add some strain to the system. I’m now on a VPS, so I could quite easily enable this and be done with it, but where is the fun in that? Let’s keep the strain off the application and get Python to do the heavy lifting.
The challenge came with the formatting and how Hubzilla parses links. In Hubzilla, a link is just a URL and is displayed as such, with no expansion to retrieve the Open Graph data. So if you post www.example.com, that is exactly what you end up with in the post.
When posting from the UI, Hubzilla retrieves the Open Graph data at the time of post creation, not after the post has been made, which is how I think this works in Mastodon and other ActivityPub applications. I therefore had to retrieve not just the new item from the RSS feed but also fetch the article’s associated image and explicitly include it in the post sent to Hubzilla.
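For reference, here is the BBCode body the bot sends for each item. Hubzilla renders the [url] and [img] tags; {link}, {title}, {description} and {image_url} are placeholders filled from the feed item and the fetched article:

[url={link}] {title}[/url]<br>{description}<br><br>[img]{image_url}[/img]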
The code
The full code is below. There are three unique variables to be changed: rss_url, api_url and auth (the channel’s username and password).
The text file rssbot_last_run.txt is used to store the date/time of the last post transmission. This file needs to be writable.
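For illustration, after a run the file holds a single line with the timestamp (in %a, %d %b %Y %X format) followed by the timezone; the date here is made up:

Mon, 01 Jan 2024 12:00:00 UTC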
The script itself runs as a cron job, once every 20 minutes. My crontab entry looks like this:
*/20 * * * * cd /var/www/html/hubzilla-bot && python3 hubzilla-bot.py >/dev/null 2>&1
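If you would rather keep the script’s output for troubleshooting than discard it, a variant that appends to a log file works too (the log path here is just an example):

*/20 * * * * cd /var/www/html/hubzilla-bot && python3 hubzilla-bot.py >> /var/log/hubzilla-bot.log 2>&1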
Finally, if there is more than one post to make (as there will be on the first run), there is a 20-second delay between posts.
#!/usr/bin/env python3
import time
from datetime import datetime
from dateutil import parser
import requests
import xml.etree.ElementTree as ET
from newspaper import Article

# Setup variables
LOCAL_TIMEZONE = "UTC"
last_run_path = "./rssbot_last_run.txt"
time_format_code = '%a, %d %b %Y %X'
now_dt = datetime.now()
now_str = now_dt.strftime(time_format_code)
rss_url = "https://www.aljazeera.com/xml/rss/all.xml"

# Fetch the image URL from the article using newspaper3k
def fetch_image_url(article_link):
    try:
        article = Article(article_link)
        article.download()
        article.parse()
        return article.top_image
    except Exception as e:
        print(f"Error fetching image URL: {e}")
        return ""

# Get last run date/time
try:
    with open(last_run_path, "r") as myfile:
        data = myfile.read()
    if not data:
        # Set last run date on the first run if the file is empty
        with open(last_run_path, "w") as myfile:
            myfile.write("%s %s" % (now_str, LOCAL_TIMEZONE))
        print("Wrote %s" % (last_run_path))
        with open(last_run_path, "r") as myfile:
            data = myfile.read()
except Exception as e:
    print(f"Error reading last run file: {e}")
    data = "%s %s" % (now_str, LOCAL_TIMEZONE)

lr_dt = parser.parse(data)
lr_tim = time.mktime(lr_dt.timetuple())
print("LAST RUN: %s" % (lr_dt))
lrgr_entry_count = 0

# Get RSS feed and collect new entries
new_entries = []
response = requests.get(rss_url)
if response.status_code == 200:
    xml_data = response.text
    root = ET.fromstring(xml_data)
    for item in root.findall(".//item"):
        link = item.find("link").text
        title = item.find("title").text
        description = item.find("description").text
        # Check if entry is newer than the last run
        pub_date_str = item.find("pubDate").text
        pub_date = parser.parse(pub_date_str)
        pub_tim = time.mktime(pub_date.timetuple())
        if pub_tim > lr_tim:
            lrgr_entry_count += 1
            print("New Entry: %s" % (title))
            new_entries.append({
                "title": title,
                "link": link,
                "description": description
            })

# Post new entries, if any, to Hubzilla
if len(new_entries) > 0:
    api_url = "https://example.com/api/z/1.0/item/update"  # Replace with your URL path
    auth = ("USER", "PASSWORD")  # Replace with your actual username and password
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    for entry in new_entries:
        link = entry["link"]
        title = entry["title"]
        description = entry["description"]
        # Fetch the image URL from the linked article using newspaper3k
        image_url = fetch_image_url(link)
        # Hubzilla does not expand links, so include the image explicitly (BBCode)
        payload = {
            "body": f"[url={link}] {title}[/url]<br>{description}<br><br>[img]{image_url}[/img]",
        }
        # Introduce a 20-second delay before each post
        time.sleep(20)
        response = requests.post(api_url, data=payload, auth=auth, headers=headers)
        if response.status_code == 200:
            print(f"Successfully posted: {title}")
        else:
            print(f"Error posting: {title} - Status Code: {response.status_code}")

# Save new last run time if there were new entries
if lrgr_entry_count > 0:
    with open(last_run_path, "w") as myfile:
        myfile.write("%s %s" % (now_str, LOCAL_TIMEZONE))
else:
    print("No New Entries")
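Before leaving the bot to cron, it is worth sanity-checking the endpoint and credentials with a one-off post. A minimal sketch, using the same endpoint, basic auth and form-encoded body field as the script (example.com and the credentials are placeholders):

#!/usr/bin/env python3
import requests

# Post a simple plain-text item to the channel as a smoke test
api_url = "https://example.com/api/z/1.0/item/update"  # your hub's item/update endpoint
resp = requests.post(api_url, data={"body": "Test post from the bot"},
                     auth=("USER", "PASSWORD"))
print(resp.status_code, resp.text[:200])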