Wow, my RSS reader has exploded! What's with all these articles here?
Simple. I decided to make up for another lightwo year by mirroring guides I published on Steam! Yes, it's been even longer than a year since my previous blog post, but it's fiiiiine...
Why?
Try it for yourself. Unless the server is under immense load, every page mirrored to this blog should load faster than its source page. Moreover, if you look at the download stats, a guide without images is 2 MB smaller as a result!
But does it contain JavaScript?
No SoyScript here, chief.
Why didn't you scrape the BBCode and generate it soylessly without all those divs?
Another problem with Steam: short session length. Every 24 hours, this machine would have to re-authenticate in order to access section edit pages. Otherwise, I suppose it would be more than doable, considering the only other publicly inaccessible data, image IDs, are loaded on the edit page as well.
But what about the Web API Key? lolol
I don't know to what extent it can be used, but if it CAN be used for guides... I'll be damned.
Still, why would anyone want a way to scrape guides they can't access from their account? To pollute SEO? My intention, of course, was only to mirror what I created.
This is static as fuck! I can't see anyone else's guides!
Yeah. I don't dare let hell loose with unreliable CGI scripts and the vulnerabilities that could arise from having no understanding of [insert scripting language here] or, well, of vulnerabilities themselves. Sorry.
Now, onto the real meat and potatoes, which is hopefully, but probably not, up to date.
This was made to be used with Pelican, which processes Markdown as well as HTML to generate articles. Some adjustments may need to be made to the code depending on how your pages are generated.
This is the script, using bash, sed, wget, and xidel (the variables at the top must be adjusted):
#!/bin/bash
sg_dir=$HOME/blog/content/steam_guides
content_dir=$HOME/blog/persist
url_list=$sg_dir/guides_list
template=$sg_dir/template.html_
guide_raw=/tmp/sg_page.html
guide_contents=/tmp/sg_contents.html
img_dir_remote=/mirrored/ugc
img_dir=$content_dir$img_dir_remote
# Start loop for each URL from list in $url_list
#TODO: Take script arguments to process only the given articles (names from $url_list)
while read -r line
do
blogpost_name=$(echo "$line" | cut -d' ' -f1)
blogpost_path="${sg_dir}/${blogpost_name}.html"
guide_url="$(echo "$line" | cut -d' ' -f2)"
cp "$template" "$blogpost_path"
# Download the page
# Wait and retry in case of an error
# Error: wget exit code != 0 (response code != 200) or page title ends with "Error" (Valve moment)
i=0
until [ $i -ge 6 ]
do
if wget -O "$guide_raw" "$guide_url" && [ "$(xidel --xpath "//title/not(ends-with(text(),':: Error'))" "$guide_raw")" = "true" ]; then
break
else
if [ $i -ge 5 ]; then # Abort instead of processing error page
exit 1
fi
echo "Error downloading page! Retry $((i+1))/5..."
sleep 1
i=$((i+1))
fi
done
# Remove shit from URLs
# Modals
sed -i "s/&insideModal=1//g" "$guide_raw"
# Link filter
sed -i "s|https://steamcommunity.com/linkfilter/?url=||g" "$guide_raw"
# Extract title, date, tags, etc. and the guide itself, insert to template page
# Trivial
title="$(xidel --xpath "//div[@class='workshopItemTitle']" "$guide_raw")"
appname="$(xidel --xpath "//div[contains(@class, 'apphub_AppName')]" "$guide_raw")"
summary="$(xidel --xpath "//div[@class='guideTopDescription']" "$guide_raw")"
# Tricky
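# Multiple authors or tags come out one per line; the sed calls below join them with commas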
authors="$(xidel --xpath "//div[@class='friendBlockContent']/text()[1]" "$guide_raw" | sed '{:q;N;s/\n/, /g;t q}')"
tags="$(xidel --xpath "//div[@class='workshopTags']/a" "$guide_raw" | sed -z 's/\n/, /g;s/, $/\n/')"
# Try to wrap your head around this one
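# Steam prints fuzzy dates: e.g. "12 Jun, 2021 @ 5:34pm" for past years, but "12 Jun @ 5:34pm"
# (no year) for the current one. Strip the comma, keep the first three fields, then swap any
# leftover "@" for the current year so GNU date can parse it into YYYY-MM-DD.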
date="$(date --date "$(xidel --xpath "//div[@class='detailsStatRight'][1]" "$guide_raw" | sed 's/,//' | cut -d' ' -f1-3 | sed "s/@/$(date +%Y)/")" "+%Y-%m-%d")"
# Page
xidel --printed-node-format=html --xpath "//div[@class='guide subSections']" "$guide_raw" > "$guide_contents"
# Apply stuff
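# ${!value} is indirect expansion: it reads the variable whose name is stored in $value, so
# each _placeholder_ in the template gets replaced by the matching variable's contents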
for value in title appname summary authors tags date guide_url
do
declare ${value}="$(echo "${!value}" | sed -e 's/[()&]/\\&/g')" # Escape characters that are special in the sed replacement
sed -i "s|_${value}_|${!value}|" "$blogpost_path"
done
# Apply content
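# At the _content_ marker, sed reads in the extracted guide body, then deletes the marker line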
echo "$content" > $string_file
sed -e "/_content_/ {" -e "r $guide_contents" -e "d" -e "}" -i "$blogpost_path"
# Save all images
# Must use wget --content-disposition to preserve file name
# Workaround: Use response headers printed by wget for the $img_filename variable
# XXX: Some images fail with 404, seemingly due to wget -N
mkdir -p "$img_dir"
while read -r line
do
if [ -n "$line" ]; then # Skip if no images are present
echo "Downloading image (skipping if not present)..."
i=0
until [ $i -ge 5 ]
do
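# NB: the three-argument match() in the awk call below is a GNU awk (gawk) extension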
img_filename="$(wget --server-response -N --content-disposition -q -P "$img_dir" "$line" 2>&1 | grep "Content-Disposition:" | tail -1 | awk 'match($0, /filename\*=UTF-8\047\047(.+);/, f){ print f[1] }')"
# If $img_filename is empty, wait 1s and retry
if [ ! -z "$img_filename" ]; then
sed -i "s|$line|$img_dir_remote/$img_filename|g" "$blogpost_path"
sleep 0.25 # Seems to cause less trouble
break
else
echo "Empty image filename! Retry $((i+1))/5..."
echo "Image: $line"
sleep 1
i=$((i+1))
fi
done
fi
done <<< "$(xidel --xpath '//img[starts-with(@src, "https://steamuserimages-a.akamaihd.net/ugc/")]/@src' "$blogpost_path")"
# Insert id=#Section_name and href=#Section_name for sections to be navigable
#TODO
#while read line
#do
# #
#done <<< "$(xidel --xpath "//div[@class='subSection detailBox']/@id" "$blogpost_path")"
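# A possible body for the loop above; untested, and it assumes the section title lives in a
# subSectionTitle child of each id-carrying div (the div already has id=$line, so only the
# href side is needed; titles containing sed-special characters would need escaping first):
#  section_title="$(xidel --xpath "//div[@id='$line']//div[@class='subSectionTitle']" "$blogpost_path")"
#  sed -i "s|<div class=\"subSectionTitle\">$section_title</div>|<div class=\"subSectionTitle\"><a href=\"#$line\">$section_title</a></div>|" "$blogpost_path"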
# Replace YouTube embeds with Invidious
#TODO
# Implementation
# Add into this div:
# <div class="sharedFilePreviewYouTubeVideo sizeFull" id="W_eFZ4HzU7Q"></div>
# ...this iframe:
# <iframe class="sharedFilePreviewYouTubeVideo sizeFull" src="https://yewtu.be/embed/W_eFZ4HzU7Q?local=true&autoplay=0" allowfullscreen="1" frameborder="0"></iframe>
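# Untested sketch of the above, reusing each placeholder div's id as the video ID:
#while read -r line
#do
#  sed -i "s|id=\"$line\"></div>|id=\"$line\"><iframe class=\"sharedFilePreviewYouTubeVideo sizeFull\" src=\"https://yewtu.be/embed/$line?local=true\&autoplay=0\" allowfullscreen=\"1\" frameborder=\"0\"></iframe></div>|" "$blogpost_path"
#done <<< "$(xidel --xpath '//div[contains(@class,"sharedFilePreviewYouTubeVideo")]/@id' "$blogpost_path")"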
done < "$url_list"
rm "$guide_raw" "$guide_contents"
This is an example template which works with Pelican ($template):
<html>
<head>
<title>_title_ - _appname_</title>
<meta name="tags" content="Steam Guide, _tags_" />
<meta name="category" content="Guides" />
<meta name="date" content="_date_" />
<meta name="authors" content="_authors_" />
<meta name="summary" content="_summary_" />
</head>
<body>
<p class="summary">_summary_</p>
_content_
<p class="mirroredFrom"><p>-- <a href="_guide_url_">Mirrored from Steam</a></p>
</body>
</html>
Since it doesn't scrape Steam guide lists, it relies on a ready-made list, which also reduces the number of requests to Steam ($url_list):
foo https://steamcommunity.com/sharedfiles/filedetails/?id=6969696969
bar https://steamcommunity.com/sharedfiles/filedetails/?id=6969696969
baz https://steamcommunity.com/sharedfiles/filedetails/?id=6969696969
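With the variables adjusted and the list filled in, a full run boils down to this (the script name is whatever you saved it as; mirror_guides.sh is just a stand-in here), after which Pelican rebuilds the site as usual:
./mirror_guides.sh
pelican content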
Last but not least, an excerpt of Steam's massively bloated CSS, good enough to make the mirrored pages look the same in most cases.
This project is a bit rough around the edges, and these issues will hopefully be addressed:
- Provide arguments to process only certain articles from the list, not all of them
- Fix the 404s occurring while saving/applying images (wget timestamping issue)
- Make section titles into anchors so they can be linked to directly
Please note that I made this like a total amateur, relying on code snippets from elsewhere, even though it wouldn't have come anywhere close to a working state without some prior experience. Still, I am very happy with the outcome and hope that you will benefit from it.
Just to get it off my chest, I blame Steam for the majority of the stupid compromises I had to make. Not only are entire Steam-generated pages made out of divs, but images are served dynamically without a filename in the URL, the error pages use response code 200 OK, timestamps are given as fuzzy dates... and so many other annoyances! AAAGHH!
What's the loicense, mate?
No clue. Use it however you like, but don't blame me if it doesn't work or breaks everything. Read through it at least once for good measure; I'm not an expert. Also, at least consider pointing to this article if you end up actually using it and not generating a black hole or something.