Perl one-liners for email analysis

I thought it’d be interesting to know what times of day people were most likely to send me email. My email is stored in mbox format (I use Thunderbird and mutt), so I wrote a couple of Perl one-liners to analyze it for me.

The first one-liner prints a histogram, in 80 columns, of activity per hour of the day. The second prints the same counts in a tab-separated form suitable for import into a spreadsheet.

Histogram:

perl -nle '$sum[$1]++ if m/^Date: .* (\d\d):\d\d:\d\d/; END {foreach (@sum) { $max = $_ if $_ > $max }; $div = $max/80; foreach (@sum) { print $i++ . " " . ("#" x ($_ / $div)) . " ($_)";}}' /path/to/Inbox

0 #################################### (115)
1 ########################## (84)
2 ################### (62)
3 ################ (54)
4 ############ (40)
5 ######### (31)
6 ####### (23)
7 ######################## (79)
8 ####################################### (126)
9 ############################################### (152)
10 ######################################### (133)
11 ###################################### (124)
12 ############################################################### (202)
13 ############################################################## (200)
14 ############################################################ (192)
15 #################################################################### (218)
16 ######################################################################## (229)
17 ################################################################ (206)
18 ################################################## (160)
19 ############################### (101)
20 ##################################### (118)
21 ######################################## (129)
22 ######################################################### (183)
23 ######################################## (129)

Tabular data:

perl -nle '$sum[$1] += 1 if m/^Date: \w{3}, \d+ \w{3} \d{4} (\d\d):\d\d:\d\d/; END {foreach (@sum) { print $i++ . "\t" . $_;} }' /path/to/Inbox

While I was at it, I wanted to know what the most common timezone offsets were. Again, I wrote two separate one-liners. One prints a histogram, and the other doesn’t.

Histogram:

perl -nle '$tz{$1} += 1 if m/^Date: .*([+-]\d{4})/; END {foreach (values %tz) {$max = $_ if $_ > $max }; $div = $max/80; foreach (sort(keys %tz)) { print "$_ " . ("#" x ($tz{$_}/$div)) . " ($tz{$_})"; }}' /path/to/Inbox

Non-histogram:

perl -nle '$tz{$1} += 1 if m/^Date: .*([+-]\d{4})/; END {foreach (sort(keys %tz)) { print "$_ $tz{$_}"; }}' /path/to/Inbox

I subscribe to various email lists, and each has different characteristics. I was surprised to find that my family email box usage pattern was fairly spread out around the clock, except that it drops off significantly during dinner and during the wee hours of the morning. Evening hours are the most active.

I’ve taken the timezone one-liner and modified it to tell me the most common months of the year, or the most common days of the week for email to be sent. For all my email boxes, analyzed over the last few years, email is most active on weekdays, and drops off on weekends.

Mon ############################################################### (5630)
Tue ##################################################################### (6129)
Wed ######################################################################## (6372)
Thu ##################################################################### (6155)
Fri ############################################################ (5329)
Sat ############################## (2675)
Sun ########################## (2368)

I tried translating those one-liners into Ruby, but it wasn’t as compact, and doing it as a one-liner in Java just isn’t going to happen.

Perl 5 to 6

Moritz Lenz has written a series of informative blog posts about Perl 6, for Perl 5 programmers. Here’s a bit of his introduction:

> Perl 6 is underdocumented. That’s no surprise, because (apart from the specification) writing a compiler for Perl 6 seems to be much more urgent than writing documentation that targets the user.

> Unfortunately that means that it’s not easy to learn Perl 6, and that you have to have a profound interest in Perl 6 to actually find the motivation to learn it from the specification, IRC channels or from the test suite.

> This project, which I’ll preliminary call “Perl 5 to 6” (in lack of a better name) attempts to fill that gap with a series of short articles.

[Read more…](http://perlgeek.de/blog-en/perl-5-to-6/)

Google’s new web browser: Chrome

Google is [releasing](http://www.google.com/chrome) a beta web browser called “[Chrome](http://www.google.com/chrome)” tomorrow, and they’ve even got a [comic strip](http://www.google.com/googlebooks/chrome/) to explain the design choices they made, and how it’s supposed to make life better.

The browser is based on [WebKit](http://en.wikipedia.org/wiki/WebKit).
They aim to make JavaScript vastly faster with a new JavaScript virtual
machine called V8. At the same time, the Mozilla team is beefing up
Firefox 3.1 with a faster JavaScript engine called [TraceMonkey](http://www.pcmag.com/article2/0,2704,2328737,00.asp).

V8 and TraceMonkey reportedly race down the freeway while IE 7 and IE 8
are left puttering along at pedestrian speeds.

Firefox 3.1 to include Theora video support

[LWN reports](http://lwn.net/Articles/292939) that the Ogg Theora video format will be supported in Firefox 3.1. I believe this is a game-changing move on the web. Because there are no patent royalties to pay, it will be easier and cheaper to distribute video that renders on any OS running Firefox. It will catapult the Theora video format into the mainstream.

An LWN reader [pointed out](http://lwn.net/Articles/293076/) that Theora has traditionally lagged behind MPEG-4 in quality and performance, but that this is being remedied by the in-progress “Thusnelda” project.

xguest

I just discovered and installed the _xguest_ package for Fedora 8 and 9. Here’s what it does:

> Installing this package sets up the xguest user to be used as a temporary account to switch to or as a kiosk user account. The account is disabled unless SELinux is in enforcing mode. The user is only allowed to log in via gdm [or the fast-user-switching applet]. The home and temporary directories of the user will be polyinstantiated and mounted on tmpfs.

Here’s how to install it:

yum install xguest

I hit a brick wall when I first tried it. I thought my machine was in SELinux Enforcing mode, when it wasn’t — it was in Permissive mode. I fixed it using system-config-selinux.
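I used the GUI, but the mode can also be checked from a terminal before blaming xguest. This is a minimal sketch using the standard `getenforce` command; the `selinux_is_enforcing` helper is a hypothetical name I made up for illustration, not part of the xguest package.

```shell
# Hypothetical helper: succeeds only when SELinux reports Enforcing.
# getenforce prints Enforcing, Permissive, or Disabled.
selinux_is_enforcing() {
  [ "$(getenforce 2>/dev/null)" = "Enforcing" ]
}

if selinux_is_enforcing; then
  echo "SELinux is enforcing; the xguest account should be enabled."
else
  echo "SELinux is not enforcing; as root, 'setenforce 1' switches a Permissive system to Enforcing until the next reboot."
fi
```

To make the change survive a reboot, the usual place is the `SELINUX=enforcing` line in `/etc/selinux/config`, which is the same setting system-config-selinux edits.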

It’s possible to change what the xguest user can do using system-config-selinux. I’ve attached a screenshot showing what capabilities can be granted or revoked.

[Screenshot: SELinux Administration for xguest user]

Test-driven development in Perl

There’s an impressively in-depth presentation from [OSCON 2008](http://en.oreilly.com/oscon2008/public/schedule/proceedings) about [Practical Test Driven Development in Perl](http://assets.en.oreilly.com/1/event/12/Practical%20Test-driven%20Development%20Presentation.pdf). It covers Test::More, Test::Class, Test::Differences, Test::Deep and Test::MockObject.

I also found the following to be interesting: [Even Faster Web Sites](http://assets.en.oreilly.com/1/event/12/Even%20Faster%20Web%20Sites%20Presentation%202.ppt) and [Pro PostgreSQL](http://assets.en.oreilly.com/1/event/12/Pro%20PostgreSQL%20Presentation.odp). Reading these helps me to know a little bit about what I don’t know.

Visualize your hard drive using a TreeMap viewer

Every once in a while, I get low on disk space, and hunting manually for large directories or files to delete is tedious. [Tree Map visualization](http://en.wikipedia.org/wiki/Treemap) tools make the job easier. There’s [WinDirStat](http://windirstat.info/) for Windows, [KDirStat](http://kdirstat.sourceforge.net) for KDE, and [Disk Usage Analyzer](http://live.gnome.org/GnomeUtils/Baobab) (baobab) for GNOME.

![TreeMap Image](http://library.gnome.org/users/baobab/stable/figures/baobab_fullscan.png.en)

Article: A Patent Lie, and other patent happenings

Timothy B. Lee of the Cato Institute wrote [A Patent Lie](http://www.nytimes.com/2007/06/09/opinion/09lee.html?_r=3&oref=slogin&oref=slogin&oref=slogin), in which he explains why copyright is better for the software industry than patents:

> Don’t software companies need patent protection? In fact, companies, especially those that are focused on innovation, don’t: software is already protected by copyright law, and there’s no reason any industry needs both types of protection. The rules of copyright are simpler and protection is available to everyone at very low cost. In contrast, the patent system is cumbersome and expensive. Applying for patents and conducting patent searches can cost tens of thousands of dollars. That is not a huge burden for large companies like Microsoft, but it can be a serious burden for the small start-up firms that produce some of the most important software innovations.

The good news about software patents is that [they’ve been weakened](http://en.wikipedia.org/wiki/KSR_v._Teleflex), so patent troll companies can’t wreak quite as much havoc as they have in the past, and there’s not as much money in it. Apparently, [patent troll companies are getting smarter](http://www.linuxworld.com/community/?q=node/16789) about working with open source, most recently with Red Hat:

> Trolls need to collect money to survive, and open source vendors can’t give it to them. The good news from this settlement [with RedHat], and [Blackboard’s](http://www.linuxworld.com/news/2007/020107-blackboard-no-action-against-open-source.html), is that trolls are realizing that hitting an open source company is like robbing a store where the safe is on a time lock. They can do damage and hurt people, but the money isn’t available to them.

The settlement was also [documented by Groklaw](http://www.groklaw.net/article.php?story=20080611191302741).

Products to avoid

The nice thing about mass-market commercial software is that I can purchase it at a small fraction of the cost to develop it myself, which I would never do because I don’t have the time. Unfortunately, home-user mass-market software seems to lack quality. Here are some that I recommend against.

* [Greeting Card Factory](http://www.google.com/search?q=greeting+card+factory). When I opened the package, I discovered that the software shipped on about six separate CDs! I purchased the software in 2007 — an enlightened age where most people have DVD drives. I’m impatient, and disliked having to play disk jockey to install the software. Once installed, I discovered that it’s cumbersome to use — too much clicking with the mouse required to get the job done. There’s no good preview of card greeting messages in the template browser, so I have to load each one in, click through the buttons to see the message, and then start all over again to find an appropriate card. It sure is a waste of time. The best greeting card software I’ve used was American Greetings, but that version was designed years ago and required inserting CDs to load some of the cards. Hallmark’s software was the most polished, robust, and least annoying, but I liked the quality of cards from American Greetings better.

UPDATE: There is a good way to preview greeting card messages in the template browser — you have to increase the zoom level to the maximum, and additional preview controls become visible.

* Symantec and McAfee AntiVirus. They slow down a computer too much (by 20% or more!), and any activation process that annoys my grandmother is too much of a hassle. Switch to [AVG Free](http://www.google.com/search?q=AVG+free). I run Vista with an unprivileged account, and so far, I haven’t needed AV. I ran AVG Free on Windows XP for several years and never got a virus, because I didn’t download and install random software and because my user account didn’t have administrative privileges.

There’s hardware to avoid as well:

* [Kodak printers](http://printers.kodak.com/). I decided to give a Kodak printer a try because of the promise of cheaper ink. The printer has been a constant hassle ever since we purchased it. Just tonight, even after selecting the best print quality, it still printed every other line as faded and smudgy. My wife seems to know the ritual to make it print better, but she’s not here at the moment. Avoid Kodak printers at all costs. Go with an Epson or an HP — they provide quality results. If a laser printer fits your needs, they’re usually more reliable than an inkjet printer.

Fedora 9, NVIDIA, VMware Server

I’ve upgraded four systems to Fedora 9 in the past couple of weeks. For the systems with NVIDIA cards, it was a bumpy ride until NVIDIA released a [new driver](http://www.nvidia.com/object/linux_display_ia32_173.08.html). To install it as a pre-built RPM package, see [this blog post](http://nareshv.blogspot.com/2008/04/fedora-9-rawhide-and-latest-nvidia-179.html).

For the system that runs VMware Server, it was necessary to [upgrade to version 1.0.6](http://www.howtoforge.com/vmware-server-installation-on-a-fedora9-desktop), which supports the 2.6.25 kernel shipped with Fedora 9.