Adding dynamic content to your home page with WebFetch, wget and sed...

08/21/2000
(copyleft) 2000 by Brian Coyle
brianc@magicnet.net

This document may be freely distributed under the FDL.
http://www.gnu.org/copyleft/fdl.html

INTRO

In November 1999, Ian Kluft, author of WebFetch [1], gave a presentation to ELUG [2]. WebFetch is a collection of Perl modules providing the framework to collect (or prefetch) web information for inclusion on a web server. Since I was running Apache on my home server, I figured this would be an excellent tool to add some dynamic content to my home page. Before I built the dynamic version, my home page was a simple collection of search engine forms [0].

Following the WebFetch instructions, I set up a cron job to collect the headlines for Slashdot and LinuxToday. Then I modified my Apache config to process .shtml files and embed the include files. Finally, I copied my index.html file to index.shtml and added the include comments.

This worked well as an initial prototype, but I also wanted to fetch headlines from SolarisCentral and Hacker News Network. These sites didn't have WebFetch modules. I also decided to grab the Daily Static comic from UserFriendly. [See the important note below on copyrights and the politics of prefetching - 3]

I was quite happy with the new look of my page, but my wife asked, "Gee, you've got all that Geek News and stuff, how about adding something I might be interested in?" So, I set off to get the newspaper headlines from the Orlando Sentinel, the local radar image from Channel 9 television and the regional satellite image from CNN.

Now my personal home page is automatically updated a few times each day, and since the content is stored locally on my server, I don't have to connect to the Internet to peruse the headlines. I can pick and choose the links I want to follow...

PUTTING IT ALL TOGETHER

I won't go into the details of getting, installing and configuring your system for WebFetch.
That's covered in the WebFetch documentation and FAQ. I will show you how I put together the complete package, including the cron scripts to collect and parse the data.

Once you have WebFetch installed and tested, you'll need to decide where to put your scripts and the resulting content files. I use $HOME/bin to store my scripts and created a 'fetched' directory under my Apache html doc root.

Next, decide what content (headlines or whatever) you wish to fetch. Check to see if the site already has a WebFetch module or offers a 'syndication file'. Obviously, an existing WebFetch module will make things much easier. As mentioned in the INTRO, I selected Slashdot and LinuxToday headlines (these have WebFetch modules), the syndicated SolarisCentral and Hacker News headlines, along with headlines from the Orlando Sentinel. Additionally, I prefetch the UserFriendly comic strip and weather images.

While SolarisCentral offers a syndication file, it is in text format and requires some parsing to convert to html. Fortunately, SC also offers a ksh script to do just that [4]. I modified that script, primarily to change it to bash, but also to output the html formatted the way I wanted it.

Now we need to fetch, parse and output the desired HTML. I accomplished this with a 'webfetch' script kicked off a few times each day via cron. Here's the crontab entry:

30 7,12,17,20 * * * /home/brian/bin/webfetch

The quickest way to learn something is to dive in, so let's!

-----------------------------------------------------------------------------
$ cat ~/bin/webfetch
#!/bin/bash
# Fetch various html headlines and
# store in /home/httpd/html/fetched
#
/usr/local/bin/ping.net >/dev/null 2>&1   # fire up the ppp link
sleep 5                                   # wait for ppp to settle

echo "Headlines as of: $(date) " > /home/httpd/html/fetched/fetch_time.html

perl -w -MWebFetch::Slashdot -e "&fetch_main" -- --dir /home/httpd/html/fetched/
perl -w -MWebFetch::LinuxToday -e "&fetch_main" -- --dir /home/httpd/html/fetched/

wget -q -r -l1 -O /home/httpd/html/fetched/SolarisCentral.txt \
    http://www.SolarisCentral.org/news/ultramode.txt && /home/brian/bin/parseSC.sh
wget -q -r -l1 -O /home/httpd/html/fetched/headlines.html \
    http://www.hackernews.com/headlines.html
wget -q -r -l1 -O /home/httpd/html/fetched/radar.gif \
    http://www.insidecentralflorida.com/auto_includes/weather/doppler/max1.gif
wget -q -r -l1 -O /home/httpd/html/fetched/sesat.png \
    http://www.cnn.com/WEATHER/accu.data/sesat.png
wget -q -r -l1 -O /home/httpd/html/fetched/OSO.txt \
    http://www.orlandosentinel.com/ && /home/brian/bin/parseOSO.sh
#
# Only get the UserFriendly comic once a day (at 07:30 cron)
#
if [ `date +%k` -eq 7 ]; then
    wget -q -r -l1 -O /home/httpd/html/fetched/UF.txt \
        http://www.userfriendly.org/static/ && /home/brian/bin/parseUF.sh
fi
-----------------------------------------------------------------------------

I use dial-on-demand, so I run a ping.net script first. This script simply issues a single ping to my ISP. I often use this in cron jobs that need Internet access; it ensures the link is active.

Next, I echo the current date and time into the fetch_time.html file. This is used later to set the title and a header on the home page.

A couple of WebFetch calls are next. One gets the Slashdot headlines, the other LinuxToday. Since WebFetch formats the output properly, there's no need to further massage the results.

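
The ping.net helper mentioned above is not listed in this article. As a rough idea of what such a script might contain, here is a minimal sketch. The function name wake_link and the host isp.example.net are placeholders made up for illustration; they are not the author's actual script or ISP.

```shell
#!/bin/bash
# Hypothetical stand-in for ping.net (the real script is not shown here).
# isp.example.net is a placeholder; substitute a host at your own ISP.
ISP_HOST="${ISP_HOST:-isp.example.net}"

wake_link() {
    # A single echo request (-c 1) is enough traffic to trigger
    # dial-on-demand. The reply itself does not matter, so all
    # output is discarded.
    ping -c 1 "$1" >/dev/null 2>&1
}

# Typical use at the top of a cron script:
#   wake_link "$ISP_HOST"
#   sleep 5    # give ppp a moment to settle
```

Keeping the ping in a small function (or a separate script, as the author does) means every cron job that needs the link can reuse it.
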
wget is used next to retrieve the SolarisCentral, Hacker News and Orlando Sentinel pages, along with the weather images. Finally, the UserFriendly strip is pulled down, but only if the hour is 7. Because the comic is only updated once each day, there is no need to get it more often than this.

There are three sites that require additional parsing of the fetched page - SolarisCentral, Orlando Sentinel, and UserFriendly. For each of these, I have written a parse script. The SolarisCentral script is available from SolarisCentral [4], so I won't go into that, but let's look at the other two.

-----------------------------------------------------------------------------
$ cat parseOSO.sh
#!/bin/bash
# parse the Orlando Sentinel Online headlines
theFile=/home/httpd/html/fetched/OSO.txt    # location of the file
tmpFile=/tmp/OSO.tmp                        # location of the temp file
outFile=/home/httpd/html/fetched/OSO.html   # html with links

# need to cut twice since there's more than one occurrence...
#
startHL=`grep -n "<\!-- BEGIN BREAKING NEWS HEADLINES -->" $theFile | cut -d: -f1`
startHL=`echo $startHL | cut -d' ' -f1`
endHL=`grep -n "<\!-- END BREAKING NEWS HEADLINES -->" $theFile | cut -d: -f1`
endHL=`echo $endHL | cut -d' ' -f1`
startmoreHL=`grep -n "<\!-- START BOTTOM HEADLINES -->" $theFile | cut -d: -f1`
endmoreHL=`grep -n "<\!-- BEGIN CALENDAR" $theFile | cut -d: -f1`
endmoreHL=`echo $endmoreHL | cut -d' ' -f1`

sed -n -e "$startHL,$endHL w $tmpFile.HL" \
       -e "$startmoreHL,$endmoreHL w $tmpFile.HL" $theFile   # get headlines
sed -n -e "s:.*
$/!s:^M:

:' $tmpFile.new > $outFile
rm $tmpFile*
exit
-----------------------------------------------------------------------------

I won't describe each line in detail, but in a nutshell... This script finds the beginning and end of the headlines (flagged by the 'BEGIN' and 'END' comments) and some additional headlines. Then sed is used to find the href links, clean up the trailing '' tags and write out the results.

Now let's take a look at the UserFriendly parser.

-----------------------------------------------------------------------------
$ cat parseUF.sh
#!/bin/bash
# parse the UserFriendly daily static comic
theFile=/home/httpd/html/fetched/UF.txt   # location of the html file
outFile=/home/httpd/html/fetched/UF.gif
#
#
start=`grep -n "<\!--Start Current Strip-->" $theFile | cut -d: -f1`
let start="$start + 2"                    # skip the "" line
url=`sed -n -e "$start s:.* SRC=\":: " -e "$start s:\">:: p" $theFile`
wget -q -r -l1 -O $outFile $url
exit
-----------------------------------------------------------------------------

Besides being much smaller than the other parser, this script invokes wget to retrieve the .gif for the comic strip. This is trivial once sed provides the url.

FINISHING UP

OK, now that we have all these headlines and such, what do we do with them? Well, the UserFriendly strip and weather images can be coded as regular 'img src=' html. But we need to embed the headline filenames in a .shtml file so Apache will do the server-side include magic. This is done as described in the Apache and WebFetch documentation, with an html comment containing '#include file="{filename}"'.

You can even place the comments within other tags for special effects. For example, remember the date and time file we created when the webfetch script started? Here's how I use that file to set the title and a header:

<!--#include file="fetched/fetch_time.html" --> -- Find On the 'Net...
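
To make the pieces above concrete, here is a minimal sketch that writes a bare-bones index.shtml pulling in the prefetched files. The file names under fetched/ are the ones the webfetch script creates; the page layout, the alt text and the OUT variable are assumptions for illustration only and are not the author's actual page. The WebFetch-generated headline files are left out because their names are not given in this article. Apache must already be configured to parse .shtml files for the include comments to be expanded.

```shell
#!/bin/bash
# Sketch: generate a minimal index.shtml that uses server-side includes.
# OUT is a hypothetical knob; by default the page is written to the
# current directory. Run this from the Apache html doc root so the
# relative fetched/ paths resolve.
OUT="${OUT:-index.shtml}"

# The quoted 'EOF' keeps the shell from expanding anything in the page.
cat > "$OUT" <<'EOF'
<html>
<head>
<title><!--#include file="fetched/fetch_time.html" --></title>
</head>
<body>
<h3><!--#include file="fetched/fetch_time.html" --></h3>
<!--#include file="fetched/headlines.html" -->
<!--#include file="fetched/OSO.html" -->
<img src="fetched/radar.gif" alt="FL Radar image">
<img src="fetched/sesat.png" alt="SE Satellite image">
<img src="fetched/UF.gif" alt="UserFriendly comic">
</body>
</html>
EOF
```

Note that the headline files go in as include comments, while the images are ordinary img tags, since the browser fetches those itself.
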

FINAL NOTES

Someday when I feel the need to brush up on my perl skills, I may create WebFetch modules for these sites. Until then, wget and sed are getting the job done...

To avoid reconfiguring my client browsers [5], I changed my index.html file to redirect to index.shtml (note: hank is my server name)...

-----------------------------------------------------------------------------
$ cat index.html
-----------------------------------------------------------------------------

I always create a link back to the original site for any content I prefetch. That way, I can quickly get to the very latest information. This can be pretty important in Central Florida with hurricane season well under way! For example, here's how my weather images and OSO headlines are linked:

-----------------------------------------------------------------------------
FL Radar image

SE Satellite image

Orlando Sentinel Headlines

-----------------------------------------------------------------------------

As you can see, I built my page with tables to hold the headlines, but I'll leave the exercise of formatting your page to you...

REFERENCES

[0] You can find an example of the search forms at the bottom of the bitbucket - http://www.magicnet.net/~brianc/#SEARCH

[1] WebFetch home page - http://www.webfetch.org/

[2] Ian's ELUG presentation - http://www.webfetch.org/elug-199911/

[3] It is important to emphasize: this is for your own personal use. Don't use these techniques to grab content and republish it as your own. It's a violation of copyright laws. Also, don't fetch more than a few times each day. Most content doesn't update more often than that.

[4] http://www.SolarisCentral.org/who/parse.shtml

[5] Actually, it's because I'm lazy and didn't want to go change the home page default for every web browser, on every box in the house, in every OS, for each user. :) I've got several machines, and all have at least two OSes, many web browsers and several users. Some of the browser defaults point to just the host URL, some actually specify host/index.html. Redirecting index.html was simply the easy way out.