Adding dynamic content to your home page with WebFetch, wget and sed...
08/21/2000
(copyleft) 2000 by Brian Coyle brianc@magicnet.net
This document may be freely distributed under the FDL.
http://www.gnu.org/copyleft/fdl.html
INTRO
In November 1999, Ian Kluft, author of WebFetch [1], gave a presentation
to ELUG [2]. WebFetch is a collection of Perl modules providing the
framework to collect (or prefetch) web information for inclusion on a
web server.
Since I was running Apache on my home server, I figured this would be
an excellent tool to add some dynamic content to my home page. Until
then, the page had been a simple collection of search engine
forms [0].
Following the WebFetch instructions, I set up a cron job to collect the
headlines for Slashdot and LinuxToday. Then I modified my Apache config
to process .shtml files and embed the include files. Finally, I copied
my index.html file to index.shtml and added the include comments.
This worked well as an initial prototype, but I wanted to also fetch
headlines from SolarisCentral and Hacker News Network. These sites
didn't have WebFetch modules. I also decided to grab the Daily
Static comic from UserFriendly. [See important note below on
copyrights and politics of prefetching - 3]
I was quite happy with the look of my new page, but my wife asked, "Gee,
you've got all that Geek News and stuff, how about adding something I
might be interested in?"
So, I set off to get the newspaper headlines from the Orlando Sentinel,
the local radar image from Channel 9 television and the regional
satellite image from CNN.
Now my personal home page is automatically updated a few times each day,
and since the content is stored locally on my server, I don't have to
connect to the Internet to peruse the headlines. I can pick and choose the
links I want to follow...
PUTTING IT ALL TOGETHER
I won't go into the details of getting, installing and configuring your
system for WebFetch. That's covered in the WebFetch documentation and FAQ.
I will show you how I put together the complete package, including the
cron scripts to collect and parse the data.
Once you have WebFetch installed and tested, you'll need to decide where
to put your scripts and resulting content files. I use $HOME/bin to
store my scripts and created a 'fetched' directory under my Apache
html doc root.
Next, decide what content, headlines or whatever you wish to fetch.
Check to see if the site already has a WebFetch module or offers
a 'syndication file'. Obviously, an existing WebFetch module will
make things much easier.
As mentioned in the INTRO, I selected Slashdot and LinuxToday headlines
(these have WebFetch modules), the syndicated SolarisCentral and Hacker
News Headlines, along with headlines from the Orlando Sentinel.
Additionally, I prefetch the UserFriendly comic strip and weather images.
While SolarisCentral offers a syndication file, it is in text format and
requires some parsing to convert to html. Fortunately, SC also offers
a ksh script to do just that [4]. I modified that script, primarily changing
to bash, but also to output the html as I wanted it formatted.
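
To give a feel for that kind of conversion, here is a minimal sketch. It
is not the SolarisCentral script, and the record layout is an assumption:
records separated by a line containing only "%%", with the headline on
the first line of each record and its URL on the second. The function
name ultramode_to_html is made up for illustration.

```shell
#!/bin/bash
# Sketch only: the record layout is an assumption, not the documented
# SolarisCentral format.
# Assumed layout: records separated by a line containing only "%%",
# headline on the first line of a record, URL on the second.
ultramode_to_html() {
    awk '
        /^%%$/ { n = 0; next }    # separator: start a new record
        { n++ }                   # count lines within the record
        n == 1 { title = $0 }     # first line: headline text
        n == 2 { printf "<li><a href=\"%s\">%s</a></li>\n", $0, title }
    ' "$1"
}
```

Calling ultramode_to_html on the fetched text file prints one list item
per record, which can be redirected into an .html file for inclusion.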
Now we need to fetch, parse and output the desired HTML. I accomplished
this with a 'webfetch' script kicked off a few times each day via cron.
Here's the crontab entry-
30 7,12,17,20 * * * /home/brian/bin/webfetch
The quickest way to learn something is to dive in- so let's!
-----------------------------------------------------------------------------
$ cat ~/bin/webfetch
#!/bin/bash
# Fetch various html headlines and
# store in /home/httpd/html/fetched
#
/usr/local/bin/ping.net >/dev/null 2>&1 # fire up the ppp link
sleep 5 # wait for ppp to settle
echo "Headlines as of: $(date) " > /home/httpd/html/fetched/fetch_time.html
perl -w -MWebFetch::Slashdot -e "&fetch_main" -- --dir /home/httpd/html/fetched/
perl -w -MWebFetch::LinuxToday -e "&fetch_main" -- --dir /home/httpd/html/fetched/
wget -q -r -l1 -O /home/httpd/html/fetched/SolarisCentral.txt http://www.SolarisCentral.org/news/ultramode.txt && /home/brian/bin/parseSC.sh
wget -q -r -l1 -O /home/httpd/html/fetched/headlines.html http://www.hackernews.com/headlines.html
wget -q -r -l1 -O /home/httpd/html/fetched/radar.gif http://www.insidecentralflorida.com/auto_includes/weather/doppler/max1.gif
wget -q -r -l1 -O /home/httpd/html/fetched/sesat.png http://www.cnn.com/WEATHER/accu.data/sesat.png
wget -q -r -l1 -O /home/httpd/html/fetched/OSO.txt http://www.orlandosentinel.com/ && /home/brian/bin/parseOSO.sh
#
# Only get the UserFriendly comic once a day (at 07:30 cron)
#
if [ `date +%k` -eq 7 ]; then
wget -q -r -l1 -O /home/httpd/html/fetched/UF.txt http://www.userfriendly.org/static/ && /home/brian/bin/parseUF.sh
fi
--------------------------------------------------------------------------------
I use dial-on-demand, so I run a ping.net script first. This script simply
issues a single ping to my ISP. I often use this in cron jobs that need
Internet access; it ensures the link is active.
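
The ping.net script itself is not listed here. A hypothetical stand-in
could be as small as the sketch below; the host name isp.example.net and
the function name bring_up_link are placeholders, so substitute a host at
your own ISP.

```shell
#!/bin/bash
# Hypothetical stand-in for ping.net; the original script is not shown.
# ISP_HOST is a placeholder: substitute a host at your own ISP.
ISP_HOST="${ISP_HOST:-isp.example.net}"

bring_up_link() {
    # One echo request is enough to make dial-on-demand start dialing.
    # The output is discarded; only the side effect matters.
    ping -c 1 "$ISP_HOST" >/dev/null 2>&1
}
```

The first ping may well be lost while the modem connects, which is why
the webfetch script sleeps for a few seconds afterwards.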
Next, I echo the current date and time into the fetch_time.html file. This
is used later to set the title and a header on the home page.
A couple of WebFetch calls are next. One gets the Slashdot headlines, the other
LinuxToday. Since WebFetch formats the output properly, there's no need to
further massage the results.
wget is used next to retrieve the SolarisCentral, Hacker News and Orlando
Sentinel pages, along with the weather images.
Finally, the UserFriendly strip is pulled down, but only if the hour is 7.
Because the comic is only updated once each day, there is no need to get it
more often than this.
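
The hour test can also be written as a small function, which makes it
easy to try out by hand. This is a sketch of an alternative, not the
code used above: it uses the zero-padded %H format instead of %k, and
the name is_comic_hour is made up.

```shell
#!/bin/bash
# Succeeds when the given hour (default: the current hour) is 7.
# %H yields a zero-padded hour such as "07"; the 10# prefix forces
# base 10 so that "08" and "09" are not rejected as invalid octal.
is_comic_hour() {
    local hour="${1:-$(date +%H)}"
    [ "$((10#$hour))" -eq 7 ]
}
```

In the webfetch script this would be used as
"if is_comic_hour; then ... fi" around the UserFriendly fetch.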
There are three sites that require additional parsing of the fetched page -
SolarisCentral, Orlando Sentinel, and UserFriendly. For each of these, I
have written a parse script. The SolarisCentral script is available from
that site [4], so I won't go into it, but let's look at the other two.
-------------------------------------------------------------------------------
$ cat parseOSO.sh
#!/bin/bash
# parse the Orlando Sentinel Online headlines
theFile=/home/httpd/html/fetched/OSO.txt # location of the file
tmpFile=/tmp/OSO.tmp # location of the temp file
outFile=/home/httpd/html/fetched/OSO.html # html with links
# need to cut twice since there's more than one occurrence...
#
startHL=`grep -n "<\!-- BEGIN BREAKING NEWS HEADLINES -->" $theFile | cut -d: -f1`
startHL=`echo $startHL | cut -d' ' -f1`
endHL=`grep -n "<\!-- END BREAKING NEWS HEADLINES -->" $theFile | cut -d: -f1`
endHL=`echo $endHL | cut -d' ' -f1`
startmoreHL=`grep -n "<\!-- START BOTTOM HEADLINES -->" $theFile | cut -d: -f1`
endmoreHL=`grep -n "<\!-- BEGIN CALENDAR" $theFile | cut -d: -f1`
endmoreHL=`echo $endmoreHL | cut -d' ' -f1`
sed -n -e "$startHL,$endHL w $tmpFile.HL" \
-e "$startmoreHL,$endmoreHL w $tmpFile.HL" $theFile # get headlines
sed -n -e "s:.*
$/!s:^M:
:' $tmpFile.new > $outFile
rm $tmpFile*
exit
-------------------------------------------------------------------------------
I won't describe each line in detail, but in a nutshell... This script
finds the beginning and end of the headlines (flagged by the 'BEGIN' and
'END' comments) and some additional headlines. Then sed is used to find
the href links, clean up the trailing '' tags and write out the results.
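
The same "find the block between two marker comments" step can be
sketched with a sed address range instead of looking up line numbers
with grep -n. This is a swapped-in technique for illustration, not what
parseOSO.sh does, and extract_links is a made-up name.

```shell
#!/bin/bash
# Print the href-bearing lines found between two marker comments.
# Sketch using a sed address range (/begin/,/end/) in place of the
# grep -n line-number lookup used in parseOSO.sh.
# The markers are used as regular expressions and must not contain "/".
extract_links() {
    local file="$1" begin="$2" end="$3"
    sed -n "/$begin/,/$end/p" "$file" | grep -i 'href='
}
```

For example:
extract_links OSO.txt '<!-- BEGIN BREAKING NEWS HEADLINES -->' '<!-- END BREAKING NEWS HEADLINES -->'
Note that if the begin marker appears more than once, sed prints every
matching range, not just the first.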
Now let's take a look at the UserFriendly parser.
-------------------------------------------------------------------------------
$ cat parseUF.sh
#!/bin/bash
# parse the UserFriendly daily static comic
theFile=/home/httpd/html/fetched/UF.txt # location of the html file
outFile=/home/httpd/html/fetched/UF.gif #
#
start=`grep -n "<\!--Start Current Strip-->" $theFile | cut -d: -f1`
let start="$start + 2" # skip the "" line
url=`sed -n -e "$start s:.* SRC=\":: " -e "$start s:\">:: p" $theFile`
wget -q -r -l1 -O $outFile $url
exit
-------------------------------------------------------------------------------
Besides being much smaller than the other parsers, this script invokes
wget to retrieve the .gif for the comic strip. This is trivial, once sed
provides the url.
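
To see the url extraction in isolation, here is a small sketch that
reads HTML on standard input and prints the value of a SRC attribute.
It assumes the attribute is written as SRC="..." in upper case with
double quotes; img_src is a made-up name and the sample tag in the
usage line is invented.

```shell
#!/bin/bash
# Print the value of the SRC="..." attribute on each input line that
# has one (the last such attribute, if a line has several).
# Unlike parseUF.sh this does not depend on a line number.
img_src() {
    sed -n 's/.*SRC="\([^"]*\)".*/\1/p'
}
```

Usage: echo '<IMG SRC="http://example.org/strip.gif" BORDER=0>' | img_src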
FINISHING UP
OK, now that we have all these headlines and such, what do we do with them?
Well, the UserFriendly strip and weather images can be coded as regular
'img src=' html. But, we need to embed the headline filenames in a .shtml
file so Apache will do the server-side include magic.
This is done as described in the Apache and WebFetch documentation with
an html comment containing '#include file={filename}'. You can even
place the comments within other tags for special effects. For example,
remember the date and time file we created when the webfetch script
started? Here's how I use that file to set the title and a header:
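
A minimal sketch of the idea, assuming index.shtml sits in the doc root
next to the 'fetched' directory: the shell function below (the name
write_index_shtml is made up) writes a small page that includes
fetch_time.html in both the title and a header. The page layout is an
assumption for illustration, not the actual home page.

```shell
#!/bin/bash
# Write a minimal index.shtml that pulls fetch_time.html into both the
# title and a header via Apache server-side includes.
# Hypothetical layout: the page body is an assumption.
write_index_shtml() {
    cat > "$1" <<'EOF'
<html>
<head>
<title><!--#include file="fetched/fetch_time.html" --></title>
</head>
<body>
<h2><!--#include file="fetched/fetch_time.html" --></h2>
</body>
</html>
EOF
}
```

Usage: write_index_shtml /home/httpd/html/index.shtml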
FINAL NOTES
Someday when I feel the need to brush up on my perl skills, I may create
WebFetch modules for these sites. Until then, wget and sed are getting
the job done...
To avoid reconfiguring my client browsers [5], I changed my index.html file
to redirect to the index.shtml (note, hank is my server name)...
-----------------------------------------------------------------------------
$ cat index.html
-----------------------------------------------------------------------------
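
One common way to do such a redirect is a meta refresh tag. The sketch
below writes a small index.html of that kind; using meta refresh is an
assumption about how the redirect was done, and write_redirect is a
made-up name. The server name hank comes from the text above.

```shell
#!/bin/bash
# Write an index.html that immediately redirects to index.shtml.
# "hank" is the server name mentioned above; the meta refresh approach
# is an assumption, not necessarily the original markup.
write_redirect() {
    cat > "$1" <<'EOF'
<html>
<head>
<meta http-equiv="refresh" content="0; url=http://hank/index.shtml">
</head>
<body>
<a href="http://hank/index.shtml">index.shtml</a>
</body>
</html>
EOF
}
```

The plain link in the body gives browsers that ignore the refresh tag a
way to reach the page.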
I always create a link back to the original site for any content I prefetch.
That way, I can quickly get to the very latest information. This can be
pretty important in Central Florida with hurricane season well under way!
For example, here's how my weather images and OSO headlines are linked:
------------------------------------------------------------------------------