All of the interesting technological, artistic or just plain fun subjects I'd investigate if I had an infinite number of lifetimes. In other words, a dumping ground...

Thursday 27 September 2007

A good Python exercise - needs a mail log file


http://www.cs.ucla.edu/classes/fall04/cs131/hw/pr.html

Project. Python email log web front end
Background

Your boss at Amalgamated Data Associates is concerned about recent
problems the company is having with sending email. Many of the company's
business consulting issues are handled via email, and your boss needs to
know in real time whether email is being sent to your customers on time.

Your boss doesn't like the hassle of logging into your company's
Unix-based email servers to check on the status of incoming and outgoing
email, and would prefer to have a simple web page that can be visited to
view message status. You are assigned the job of building a
quick-and-dirty web server that can let your boss view the log.

"While you're at it," your boss says, "let's condense the information
into a simple form that I can understand, rather than the gibberish that
is in the actual log file." Your boss is referring to the fact that the
actual log lines look like this:

Nov 18 12:37:21 kelton sendmail[8943]: [ID 801593 mail.info]
iAIKbNg08943: from=poobah, size=76, class=0, nrcpts=1,
msgid=<200411182037.iAIKbNg08943@kelton.seas.ucla.edu>,
relay=poobah@localhost
Nov 18 12:37:22 kelton sendmail[8972]: [ID 801593 mail.info]
iAIKcBX08972: from=bazzfazz, size=80, class=0, nrcpts=1,
msgid=<200411182038.iAIKcBX08972@kelton.seas.ucla.edu>,
relay=bazzfazz@localhost
Nov 18 12:37:23 kelton sendmail[8943]: [ID 801593 mail.info]
iAIKbNg08943: to=dyore@cs.ucla.edu, ctladdr=poobah (5836/102),
delay=00:00:00, xdelay=00:00:00, mailer=relay, pri=30076,
relay=mailscanner.seas.ucla.edu. [164.67.100.71], dsn=2.0.0, stat=Sent
(MAA19327 Message accepted for delivery)
Nov 18 12:48:12 kelton sendmail[8972]: [ID 801593 mail.info]
iAIKcBX08972: to="moe&joe"@amalgda.com, ctladdr=bazzfazz (5836/102),
delay=00:10:49, xdelay=00:10:49, mailer=relay, pri=30080,
relay=mailscanner.seas.ucla.edu. [164.67.100.80], dsn=2.0.0, stat=Sent
(MAA26789 Message accepted for delivery)

whereas your boss wants to see something that looks more like this, with
more-recently-delivered messages first:

Email report generated 2004-11-18 12:45:12

delay  date        time      from      to                     id
10:50  2004-11-18  12:48:12  bazzfazz  "moe&joe"@amalgda.com  200411182038.iAIKcBX08972
00:02  2004-11-18  12:37:23  poobah    dyore@cs.ucla.edu      200411182037.iAIKbNg08943

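The mapping from log entries to report rows can be sketched as follows. This is only one possible reading of the sample, under several assumptions: each log entry occupies one physical line in the real file (the wrapping above is for display), the delay column is the time between the "from" and "to" log timestamps shown as minutes:seconds (which reproduces the 10:50 and 00:02 in the sample), the id column is the msgid up to the @ sign, and only entries with stat=Sent count as delivered. The name build_report is hypothetical.

```python
import re
from datetime import datetime

MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Timestamp, host, sendmail[pid], optional "[ID ...]" tag, queue ID, fields.
LINE_RE = re.compile(
    r"^(?P<mon>[A-Z][a-z]{2}) +(?P<day>\d{1,2}) (?P<clock>\d\d:\d\d:\d\d) "
    r"\S+ sendmail\[\d+\]: (?:\[ID [^\]]*\] )?(?P<qid>\w+): (?P<rest>.*)$")


def build_report(lines, year):
    """Pair 'from' and 'to' entries by queue ID; newest delivery first."""
    accepted = {}   # queue ID -> (time logged, sender, message ID)
    delivered = []
    for line in lines:
        m = LINE_RE.match(line)
        if m is None or m["mon"] not in MONTHS:
            continue                      # not a line of the expected shape
        try:
            hour, minute, second = map(int, m["clock"].split(":"))
            when = datetime(year, MONTHS[m["mon"]], int(m["day"]),
                            hour, minute, second)
        except ValueError:
            continue                      # corrupted timestamp
        rest = m["rest"]
        sender = re.match(r"from=(.*?), size=", rest)
        recipient = re.match(r"to=(.*?), ctladdr=", rest)
        if sender:
            msgid = re.search(r"msgid=<([^@>]*)", rest)
            accepted[m["qid"]] = (when, sender[1], msgid[1] if msgid else "")
        elif recipient and "stat=Sent" in rest and m["qid"] in accepted:
            start, who, ident = accepted[m["qid"]]
            seconds = int((when - start).total_seconds())
            delay = f"{seconds // 60:02d}:{seconds % 60:02d}"
            delivered.append((when, delay, who, recipient[1], ident))
    # Most recently delivered messages come first.
    delivered.sort(key=lambda row: row[0], reverse=True)
    return [(delay, when.strftime("%Y-%m-%d"), when.strftime("%H:%M:%S"),
             who, to, ident)
            for when, delay, who, to, ident in delivered]
```

A "from" entry with no matching "to" entry simply produces no row, so a message that has been accepted but not yet delivered is left out of the report.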
Assignment

Write a web server that repeatedly does the following:

   1. Accept a request that specifies a URL of the form
<http://HOST:PORT/emaillog.rpy?date=DATE> or
<http://HOST:PORT/emaillog.rpy>.
   2. For each such request, find all email log entries corresponding to
messages delivered on DATE. DATE must be either a string of the form
YYYY-MM-DD that specifies a date using the Gregorian calendar, or the
empty string which stands for today's date. If ?date=DATE is not given,
find all email log entries regardless of date.
   3. Generate a text/html response that contains a summary of the
email message log in question, using a table of the form shown above.

If an invalid DATE parameter is given, your web server should respond
with an appropriate error diagnostic.
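The three cases for the DATE parameter (absent, empty, or YYYY-MM-DD) can be told apart along these lines. This is a rough sketch, not part of the assignment: parse_date_param is a hypothetical helper, and it assumes the query arguments have already been reduced to a plain dictionary of strings, which is not how Twisted hands them over. The calendar check is done by hand so that any four-digit year is accepted.

```python
import re
from datetime import date

DATE_RE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def days_in_month(year, month):
    """Length of a month in the Gregorian calendar."""
    if month == 2:
        leap = year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
        return 29 if leap else 28
    return 30 if month in (4, 6, 9, 11) else 31


def parse_date_param(args, today=None):
    """Return None (no filter) or a (year, month, day) tuple.

    Raises ValueError when the parameter is present but invalid.
    """
    if "date" not in args:
        return None                       # ?date=DATE not given: all dates
    raw = args["date"]
    if raw == "":                         # empty string stands for today
        current = today or date.today()
        return (current.year, current.month, current.day)
    match = DATE_RE.fullmatch(raw)
    if match is None:
        raise ValueError(f"date must look like YYYY-MM-DD, got {raw!r}")
    year, month, day = map(int, match.groups())
    if not 1 <= month <= 12 or not 1 <= day <= days_in_month(year, month):
        raise ValueError(f"no such calendar date: {raw!r}")
    return (year, month, day)
```

The server would catch the ValueError and turn its message into the error diagnostic, rather than letting the exception escape.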

Your web server should consult the log file named by the environment
variable SYSLOGFILE. If that variable is not set, your server should
either exit immediately, or act as if SYSLOGFILE named the empty file
/dev/null.
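The second option, treating an unset SYSLOGFILE as the empty file, might look like the sketch below. The function name read_log_lines is made up for illustration, and reading the whole file into memory is an assumption that suits small logs only.

```python
import os


def read_log_lines(environ=os.environ):
    """Return the lines of the file named by SYSLOGFILE, or [] if unset."""
    # os.devnull is "/dev/null" on Unix-like systems; reading it yields nothing.
    path = environ.get("SYSLOGFILE", os.devnull)
    with open(path, encoding="utf-8", errors="replace") as log:
        return log.read().splitlines()
```

Passing the environment as a parameter makes the fallback easy to exercise without touching the real process environment.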

Your web server should allow multiple overlapping requests. For example,
if one browser wants the logs for November 17 and another browser
simultaneously asks for the logs for November 18, both requests should
be processed independently: your server should not wait to answer the
second request simply because the first browser is slow to accept its
response.

When analyzing system logs you need not worry about hiccups in the
system clock due to operator adjustments or daylight-saving changes.

Implement your web server with Twisted, an event-driven networking
framework written in Python. Twisted documentation is available but is a
bit scanty; you may find it easier simply to read the source code. You
can find a copy of the Twisted 1.3.0 source code and documentation in
~eggert/src/Twisted-1.3.0 on SEASnet. A compressed tar image is also
available in the file ~eggert/src/tarpit/Twisted-1.3.0.tar.gz and at
Twisted Matrix Laboratories.

You may find it necessary to modify Twisted. If so, minimize the number
of changes to the existing source code. For example, rather than
modifying an existing class, use a new subclass of your own whenever
possible.

An .rpy file is a resource script. It is like a normal Python .py file,
with one extra restriction: an .rpy file must define a global variable
named resource whose value is an instance of a subclass of
twisted.web.resource.Resource. By default, the Twisted web server
renders the resource in response to a URL naming the resource script.
Please see the twisted.web.resource.Resource.render API for a few more
details, but the full story is perhaps best discovered by reading the
Twisted source code mentioned above. A simple example .rpy file can be
found in ~eggert/twisted/public-html/test.rpy on SEASnet.

All web pages that your web server generates must conform to
XHTML-1.0-Strict. For an example of a conforming web page, please see
this very web page itself.
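As a starting point, a report page could be assembled roughly as below. The function render_report and the COLUMNS tuple are hypothetical names, and the markup is only a sketch that is believed to fit the Strict DTD; the output should still be run through a validator. Escaping every cell matters because addresses such as "moe&joe"@amalgda.com contain characters that are special in XHTML.

```python
from html import escape

COLUMNS = ("delay", "date", "time", "from", "to", "id")


def render_report(rows, generated):
    """Return an XHTML 1.0 Strict page; rows are 6-tuples of strings."""
    head = "".join(f"<th>{escape(name)}</th>" for name in COLUMNS)
    body = "\n".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows)
    return f"""<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head><title>Email report</title></head>
<body>
<table>
<caption>Email report generated {escape(generated)}</caption>
<tr>{head}</tr>
{body}
</table>
</body>
</html>
"""
```

Every element is explicitly closed and lower-case, which XHTML requires, and the caption comes first inside the table.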
Suggestions for testing

Verify your web pages with the W3C Markup Validation Service.

To figure out which SEASnet host you're running on, type the shell
command hostname. If it outputs the string lindbrook, you are running on
lindbrook.seas.ucla.edu. To simplify exposition the rest of this section
assumes that you're running on lindbrook.

To avoid clashes with other people's web servers, use a port number that
is 14000 greater than your student ID modulo 284. For example, if your
student ID is 123456789, then use port 14001 (because 14001 is 14000 +
123456789%284, as you can easily confirm by typing that expression into
an interactive Python session) by using the --port=14001 option when
generating your web server with mktap as described below.
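The port rule is a one-line computation. The helper name port_for is invented here purely for illustration:

```python
def port_for(student_id):
    """Port number: 14000 plus the student ID modulo 284."""
    return 14000 + student_id % 284


print(port_for(123456789))  # prints 14001
```

Because the remainder is always between 0 and 283, every port falls in the range 14000 through 14283.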

Here's a recipe for getting started with Twisted. Execute the following
shell commands:

   # Specify log file location.
   setenv SYSLOGFILE /var/log/syslog

   # Make a new directory and build a web server that will run in it.
   mkdir twisted
   cd twisted
   mktap web --path=public-html --port=14001

   # Put your web pages in this directory.  Here is a sample:
   mkdir public-html
   cp ~eggert/twisted/public-html/test.rpy public-html/test.rpy

   # Run your web server.  Always use "-n", for debug mode.
   twistd -n -o -f web.tap
   # You should see some log messages starting with "Log opened"
   # and ending with "set uid/gid".

   # Now you can test your web server.
   # You should see more log messages as you test.

   # Type Control-C to exit your web server.

To test whether your server is working with the sample web pages
mentioned above, use a browser to visit
<http://lindbrook.seas.ucla.edu:14001/test.rpy?date=2004-11-30>. You
should get a Test Web Page that lists its URL, URI, and query arguments,
among other things.

If the above commands don't seem to work for you, perhaps it's because
your environment is not set up correctly. Make sure that your PYTHONPATH
environment variable contains
~eggert/opt/SunOS-5.8-sparc/Twisted-1.3.0/lib/python2.4/site-packages
and that your PATH environment variable contains both
/usr/local/python-2.4b2/bin and
~eggert/opt/SunOS-5.8-sparc/Twisted-1.3.0/bin.
Submitting your work

Submit a file named pr.tgz. It should be a gzipped tar file containing
all the source files that are needed to build and run your project. One
of these files must be called README.txt, which must be a simple ASCII
text file that starts with your name and student ID, and contains a
brief discussion and documentation of any extra features that you have
added to your web server (see below). For example, you might use the
following command to create pr.tgz:

   cd $HOME/twisted
   tar cf - README.txt *.py public-html | gzip -9 >pr.tgz

This causes tar to generate a single output stream containing the named
files; the | is the Unix pipe symbol, which causes tar's output to be
sent as input to gzip. Don't forget the - (separated by spaces) after
the cf.

Before submitting pr.tgz, test it by running the following commands:

   mkdir testdirectory
   cd testdirectory
   # Substitute your own port for "14001".
   mktap web --path=public-html --port=14001
   gunzip <../pr.tgz | tar xf -
   setenv SYSLOGFILE /var/log/syslog
   twistd -n -o -f web.tap

   # [Test your web server here.]

   # When testing is done, clean up as follows.
   # First, type Control-C.
   # Then, remove your test copy as follows:
   cd ..
   rm -fr testdirectory

Grading

The base assignment, described above, is worth 70% of the total grade.
You can make up the difference, and then some, by adding extra features;
this can boost your score to at most 120% of the total grade. Feel
free to gussy up your web server, by creating a nice input form front
end, having it efficiently propagate log changes to the browser, or any
other goodies you might think of. For ideas, you might look at Dominique
Hazaël-Massieux's Crow's Nest (which also happens to be written using
Twisted, but please don't look at its Python source code, just look at
its user-visible behavior). If you'd like extra credit for your
improvements, please propose them to the TAs in advance to get an idea
of how much they think they're worth.
Questions and answers

This section answers questions that came up after the assignment was
originally published. Thanks to Brandon Gabbert for Q1 through Q10.

Q1. If the syslog entries do not contain a year field for the date, do
we just assume that all years are the current year?

A1. Yes.
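In code, that assumption amounts to supplying the year yourself when turning a syslog timestamp into a full date. A minimal sketch, with a hypothetical helper name and an explicit month table so the result does not depend on the locale:

```python
from datetime import datetime

MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}


def parse_syslog_time(stamp, year=None):
    """Parse 'Nov 18 12:37:21', assuming the current year unless told otherwise."""
    if year is None:
        year = datetime.now().year
    month_name, day, clock = stamp.split()
    hour, minute, second = map(int, clock.split(":"))
    # datetime() raises ValueError for impossible values such as day 33.
    return datetime(year, MONTHS[month_name], int(day), hour, minute, second)
```

Calling split() with no argument also copes with the double space that syslog puts before single-digit days.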

Q2. From an OS standpoint, why does the log not contain a year field?

A2. There's no good reason, really. The original designer of the log
file format did it that way, and it's too late to fix it now.

Q3. Are we supposed to return all lines in the log from the date given,
only those lines whose facility and severity are mail.info, or only
those lines that correspond to actual sending/receiving of mail? If the
latter, is it sufficient to look at the fields after the colon to
determine if an entry is a send/recv line?

A3. For the basic report you can look at only the lines of the general
form shown in the assignment. You can also look at other lines, for
extra-credit reports, if you want.

Q4. Can we assume that all messages that have a 'from' entry will
eventually have a 'to' entry, or do we need to consider the case when
the email is somehow cancelled before it is sent?

A4. You can't assume that, no. For example, you might be looking at the
log file when the first line has been generated, but the second one
hasn't been generated yet.

Q5. From the man page and the existing syslog, it is not clear to me
whether all email entries will have a fixed format (i.e. [from, size,
class, nrcpts, msgid, relay] and [to, ctladdr, delay, xdelay, mailer,
pri, relay, dsn, stat]). Is this the case?

A5. For the basic assignment, please just ignore all fields that aren't
relevant to the assignment. You can also ignore log lines that seem to
be corrupted.

Q6. Must we ignore all fields except those listed in the spec (delay,
date, time, to, from, id)?

A6. For the basic assignment you can ignore the fields you don't need.
For extra credit you can pay attention to them.

Q7. Can we assume that all entries tied to a particular email will have
the same ID (e.g. the ID iAIKbNg08943 from the spec occurs in both
halves of that particular email)?

A7. Yes.

Q8. What exactly constitutes an invalid date? Obviously we need to check
for invalid format (date=thisIsNotADate, etc.), but are all sequences of
numbers that match the format valid dates? It would seem to me that any
date before, say, the invention of the internet is not a valid date.
Similarly, tomorrow is not a valid date. Should we consider the semantic
validity of the input date in addition to its syntactic validity?

A8. You needn't do range checking on date years. Dates can use any year
from 0000 through 9999. However, if someone gives you a bogus month,
day, hour, minute, or second (e.g., "2004-00-33 24:60:60") you should
report an error instead of crashing.

Q9. If the user enters an invalid date, we are to issue a diagnostic
error. Is this something that needs to be standardized across the class,
or can I do something interesting in this case, like prompt the user to
correct the date using a form or something like that?

A9. You can do something interesting.

Q10. Is it proper to include the angle brackets when displaying nonlocal
addresses? E.g., a local account is logged as fazzbazz, but nonlocal
accounts are logged as <acct@somewhere.org>. In the spec, the sample log
entries do not include angle brackets, but in the real syslogs, they do.
I currently remove them so the output corresponds to the sample in the
spec, but since there are some emails whose to/from fields are <>, when
I remove the brackets, I get an empty field which looks funny.

A10. Please leave them empty; it'll make it easier to grade.
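One way to read this answer is: strip a surrounding pair of angle brackets and accept that <> becomes an empty field. The helper below is a hypothetical sketch of that reading.

```python
def display_address(logged):
    """Drop one surrounding pair of angle brackets, if present."""
    if logged.startswith("<") and logged.endswith(">"):
        return logged[1:-1]
    return logged


print(repr(display_address("<acct@somewhere.org>")))  # 'acct@somewhere.org'
print(repr(display_address("<>")))                    # ''
```

Local account names without brackets pass through unchanged.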
© 2004 Paul Eggert. See copying rules.
$Id: pr.html,v 1.7 2004/12/02 18:35:47 eggert Exp $
