Backup: Immediate, Full and Long-term

Preserving the availability of digital artifacts is a goal worthy of pursuit. First, I’ve got thousands of digital family photos, and I don’t want to lose them to hard drive failure or lock them up so that they’re hard to get to. Second, I’ve got my email stored on my computer for the past several years. The recent email is what’s most valuable to me, but every once in a while, I need to search through email archives to find things, like a license key for [Date Book 5](http://www.pimlicosoftware.com/datebk5.htm). Third, it took weeks to install software and configure our laptop. I don’t want to have to repeat that work if the hard drive happens to stop working — especially if a project I’m working on needs to be done soon.

There are three main types of backup that are important to me: Immediate backup, full backup and long-term archival.

__Immediate Backup__
-----------------

What I’m currently working on with the computer is usually more important than what I was doing on the computer a few weeks ago. The auto-save and even the “undo” feature of most word processing programs can help me when one of my children touches the keyboard and accidentally deletes most of the text. Auto-save and undo won’t help if my laptop is stolen or the hard drive fails. That’s why I use [mozy.com](http://mozy.com) for automated, off-site backups of my Windows laptop. It’s well worth $5.00 per month for this service, and it’s easy to pay for: skip eating out for lunch once per month.

For backup to happen regularly, automation is key, especially for immediate backup. I would make full backups more frequently if it were an automated process. I use a monthly repeating reminder so I remember to back up the things that aren’t automated.
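For the Linux boxes in the house, one way to take the reminder out of the loop is a cron job. Here’s a minimal sketch, assuming rsync and SSH access to a second machine; the paths, user name and hostname below are placeholders:

    # /etc/cron.d/photo-backup (hypothetical paths and hostname)
    # Every Sunday at 02:30, copy the photo directory to another machine over SSH.
    30 2 * * 0  backupuser  rsync -a /home/backupuser/photos/ backuphost:/backups/photos/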

__Full Backup__
-----------

Admittedly, hard drives don’t fail often, and laptops that usually stay at home aren’t often stolen (at least, not in my neighborhood). But when it does happen, it’s a pain to reinstall the myriad applications we use on a semi-regular basis. This is why a periodic, full backup is valuable. Doing a full backup with optical media takes too much time. External hard drives are much faster, have more capacity, and are inexpensive. They plug in using standard connectors such as USB, FireWire or eSATA. I store my external USB hard drive in a fireproof box.
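On a Linux machine, a full backup to the external drive can be as simple as an rsync of the whole filesystem. A rough sketch, assuming the drive shows up as /dev/sdb1 (the device name and mount point are placeholders):

    # Mount the external drive and mirror the root filesystem onto it
    mount /dev/sdb1 /mnt/usbdisk
    rsync -aHx --delete / /mnt/usbdisk/full-backup/   # -x stays on one filesystem
    umount /mnt/usbdisk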

__Long Term Archival__
------------------

I want the best of my digital memories (e.g. photos) to be preserved for decades or centuries. A CD-ROM may still be readable in ten or twenty years, but probably not in fifty or a hundred, and there may not even be a drive around that can read it by then. Will computers in fifty years recognize the JPEG format? No idea!

To preserve digital artifacts for that long requires periodically refreshing them into newer formats and onto newer storage media. It’s a good idea to use open, standardized formats rather than proprietary ones. For photos, this means using JPEG and PNG in preference to Photoshop format.
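As one concrete example, ImageMagick can batch-convert Photoshop files to PNG. A sketch, assuming ImageMagick is installed and the PSD files carry a flattened composite (the `[0]` layer):

    # Convert every .psd in the current directory to a flattened .png
    for f in *.psd; do
        convert "$f[0]" "${f%.psd}.png"
    done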

Rather than refresh constantly, there’s the option of _printing_ photos and documents. It’s going to be easier to view a physical photo or a printed document in a hundred years than to unlock the secrets of an old hard drive.

__Trust, but verify__
----------------

I tend to trust my backup solutions, but it’s necessary to verify that they’re working. My brother’s computer periodically downloads my digital photos. I trusted that this was, at least in part, a good off-site backup. I learned recently, however, that his computer deletes old photos when space gets low, which is often.
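Verification doesn’t need to be fancy; comparing checksums of the originals against the backup copy catches both missing and corrupted files. A sketch, with placeholder paths:

    # Checksum the originals, then verify them against the backup copy
    cd /home/photos && find . -type f -print0 | xargs -0 md5sum > /tmp/photos.md5
    cd /mnt/backup/photos && md5sum -c /tmp/photos.md5 | grep -v ': OK$'
    # Any output from the last command is a file that is missing or differs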

__Resources__
----------
[Preserving Your Digital Memories: What you can do](http://www.digitalpreservation.gov/you/digitalmemories.html)

A few backup solutions: Mozy, Carbonite, SyncBackSE, and JungleDisk.

Interesting projects for backup using P2P protocols (including features such as encryption and
fault tolerance): [Tahoe](http://allmydata.org/trac/tahoe)
with a [writeup from LWN](http://lwn.net/Articles/280483/) and [Flud](http://www.flud.org).

Using the 2.6.26 Linux Kernel Debugger (KGDB) with VMware

Reading the Linux kernel documentation on KGDB wasn’t enough for me to be able
to use the newly built-in KGDB kernel debugger in version 2.6.26 or 2.6.27.
The breakthrough for me was reading [part of Jason Wessel’s
guide](http://www.kernel.org/pub/linux/kernel/people/jwessel/kgdb/ch03s03.html).

I have two machines:

* developer – where I run gdb
* target – where the kernel is being debugged, running in VMware

Configure VMware on the developer machine

* Power down the guest (target)
* Edit the VM guest settings
* Add a serial port
* Use named pipe `/tmp/com_1` (it’s really a UNIX domain socket)
* Configure it to “Yield CPU on poll” (under Advanced)
* Install ‘socat’, if not already installed

Configure and Compile the kernel on the developer or the target machine

* Get kernel 2.6.26 or newer
* `make menuconfig` # or make gconfig
* Under Kernel Hacking (the corresponding `.config` entries are shown after this list):
  * enable KGDB
  * enable the Magic SysRq key
  * enable “Compile the kernel with debug info”
* Build the kernel: `make`
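For reference, here’s roughly what those menuconfig selections end up looking like in the generated `.config` (option names taken from the 2.6.26 Kconfig; double-check against your tree):

    # Kernel hacking options for KGDB over a serial port
    CONFIG_MAGIC_SYSRQ=y
    CONFIG_DEBUG_INFO=y
    CONFIG_KGDB=y
    CONFIG_KGDB_SERIAL_CONSOLE=y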

Configure target

* Enable the Magic SysRq key on the target:
  * Edit /etc/sysctl.conf and set `kernel.sysrq = 1`
  * or run `sysctl -w kernel.sysrq=1` # this doesn’t survive a reboot
* Install the development kernel:
  * On the developer machine:
    `rsync -av --exclude .git ./ root@target.host.name:/mnt/work/linux-2.6.26`
  * On the target, a Red Hat-based system:
    `make install`
    `make modules_install`
* Edit /boot/grub/grub.conf and set `timeout=15` (a kernel-argument alternative for kgdboc is sketched after this list)
* Boot into the newly installed kernel
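As an alternative to configuring kgdboc through sysfs at runtime (next section), the same setting can be passed on the kernel command line in grub.conf. A sketch using the documented `kgdboc` and `kgdbwait` parameters (the `root=` value here is just a placeholder):

    # kernel line for the debug kernel in /boot/grub/grub.conf
    kernel /vmlinuz-2.6.26 ro root=LABEL=/ kgdboc=ttyS0,115200 kgdbwait
    # kgdbwait makes the kernel stop early in boot and wait for gdb to attach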

Start debugging

* On the target:
  `echo ttyS0 > /sys/module/kgdboc/parameters/kgdboc`
* On the developer machine:
  `socat -d -d /tmp/com_1 PTY:` # note which pty is allocated (/dev/pts/1 in my case)
  `gdb vmlinux`
  `set remotebaud 115200`
  `target remote /dev/pts/1`
* On the target, do one of the following:
  * `echo "g" > /proc/sysrq-trigger`
  * Type ALT-SysRq-G
* Ready, get set, go! Go back to the developer machine and use gdb to set breakpoints, continue, etc. (a short example follows).
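As a quick sanity check once gdb is attached, set a breakpoint on something that’s easy to trigger from the target; `sys_sync` works for this in these kernels (exact symbol names may differ in other versions):

    (gdb) break sys_sync      # easy to trigger: just run `sync` on the target
    (gdb) continue
    # run `sync` on the target; gdb should stop at the breakpoint
    (gdb) bt                  # look at the call stack
    (gdb) continue            # let the target resume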

I set up debugging because I wanted to understand the behavior of the kernel
when loading a module. It turns out that loading of the module failed because
sitting in a debugger delayed the execution, causing a timeout in module load
by the time I stepped through the code. Use of printk turned out to work
better.

HP xw4600: HOWTO enable hardware virtualization

How to enable Intel hardware virtualization on an HP xw4600:

* Boot into the hardware BIOS setup
* Go to Security -> System Security
* Enable both types of virtualization (VTx and VTd)
* Save settings, and power-cycle the machine.

I’m running Linux, Fedora 9, and using KVM, so I run the following:

modprobe kvm-intel

Loading that module will fail if hardware virtualization isn’t enabled.
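A quick way to check from Linux whether the CPU supports VT-x in the first place is /proc/cpuinfo. Note the flag only says the CPU supports it; if the flag is present but the modprobe still fails, virtualization is most likely still disabled in the BIOS:

    # Intel CPUs advertise VT-x as the "vmx" flag (AMD uses "svm")
    egrep -c '(vmx|svm)' /proc/cpuinfo   # non-zero means the CPU supports it
    lsmod | grep kvm                     # after modprobe, verify the module loaded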

DjangoCon videos on YouTube

I’m not a Django programmer, but for those who are, this may be useful. YouTube has videos of the inaugural DjangoCon conference:
[http://www.youtube.com/results?search_query=djangocon&search=tag](http://www.youtube.com/results?search_query=djangocon&search=tag)

Transferring a linux hard drive to a new machine

For over a year, I’ve endured a development machine that would lock up under heavy disk I/O. Yesterday, I apparently complained loudly enough that I was given a new machine to replace it. I didn’t want to reinstall Fedora 9, so I transferred my old hard drive to the new machine, as the primary drive. To get it to boot and function properly, here’s what I did:

* Booted with the Fedora 9 install CD into “rescue mode”
* Ran the following commands once I had a shell:

mount --bind /dev /mnt/sysimage/dev
mount --bind /sys /mnt/sysimage/sys
chroot /mnt/sysimage
mv /boot/initrd-2.6.25…i686.img /boot/initrd-2.6.25…i686.img.orig
mkinitrd /boot/initrd-2.6.25…i686.img 2.6.25…i686

* Then I ran ‘grub’, and typed the following:

root (hd0,0)
setup (hd0)
quit

* Ejected the install CD, and rebooted. Once booted, I noticed that my network cards weren’t set up quite right. My new network card was listed as “eth2” in system-config-network, and I didn’t actually have cards for the listed “eth0” and “eth1” interfaces anymore. I didn’t know what file to change to get my new card listed as “eth0”, so I ran the following command to find out what files I might need to edit:

find /etc -type f -print0 | xargs -0 grep "eth[01]"

That command listed the following files, among others:

* /etc/udev/rules.d/70-persistent-net.rules
* /etc/vmware/locations

I edited /etc/udev/rules.d/70-persistent-net.rules and ripped out the assignments for my old NIC interfaces, and set the new one to be “eth0”, then rebooted and used `system-config-network` to set up my network.
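For anyone unfamiliar with that file, the entries on Fedora 9 look roughly like this (the MAC address below is a placeholder); there is one rule per NIC, and `NAME` is the interface name it gets:

    # /etc/udev/rules.d/70-persistent-net.rules
    SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:1a:2b:3c:4d:5e", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"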

When I ran my VMware guest, VMware Server gave me an error message about not being able to use bridged mode for the selected interface. With my old computer, VMware had used eth1 for bridged networking, and I didn’t have an “eth1” interface anymore. I edited /etc/vmware/locations and changed “eth1” to “eth0”, and restarted vmware. This time, bridged mode worked correctly.

Web App Security Statistics

Perhaps this is a bit old, but it’s the first time I’ve seen it, and I thought it was interesting enough to share.

[http://www.webappsec.org/projects/statistics/](http://www.webappsec.org/projects/statistics/)

* More than 7% of the analyzed sites can be compromised automatically
* Detailed manual and automated assessment using white-box and black-box methods shows that the probability of detecting a high-severity vulnerability reaches 96.85%.
* The most prevalent vulnerabilities are Cross-Site Scripting, Information Leakage, SQL Injection and Predictable Resource Location

Git Book, yap

The Pragmatic Bookshelf is releasing a [book on using Git](http://www.pragprog.com/titles/tsgit/pragmatic-version-control-using-git) for version control.

Steven Walter released a new command-line front-end for git called [yap](http://lwn.net/Articles/297285/). It’s supposed to make it easier to work not only with git but also with Subversion repositories. It’s available from [http://repo.or.cz/w/yap.git](http://repo.or.cz/w/yap.git)

MySQL or PostgreSQL?

I’ve often wondered why people seem to prefer either MySQL or PostgreSQL. For the most part, I think it comes down to the following:

* Familiarity.
* Friends (a.k.a. support system) being more familiar with one over the other.
* Ease of getting started. Most web hosting providers support MySQL out of the box.
* Name recognition.
* Ease of support.

Here are some resources that could be useful for learning the pros and cons of each database:

* [What MySQL can learn from PostgreSQL](http://www.scribd.com/doc/2575733/The-future-of-MySQL-The-Project)
* [What can PostgreSQL learn from MySQL](http://www.postgresonline.com/journal/index.php?/archives/48-What-can-PostgreSQL-learn-from-MySQL.html) and the [accompanying presentation](http://www.commandprompt.com/files/mysql_learn.pdf)
* [MySQL quirks and limitations](http://use.perl.org/~Smylers/journal/34246)
* [Why PostgreSQL?](http://wiki.postgresql.org/wiki/Why_PostgreSQL_Instead_of_MySQL:_Comparing_Reliability_and_Speed_in_2007)

Effective forms of communication

Have you ever wondered what forms of communication are the most and the least
effective for software engineers? See Scott Ambler’s [“Models of Communication” diagram in his essay](http://www.agilemodeling.com/essays/communication.htm). Face-to-face is most effective, and paper is the least effective, with email, telephone and video conferencing falling in-between the two ends of the spectrum.

REST versus RPC

Have you considered the merits and applicability of RESTful web apps? Here are a few notes I’ve made.

There was quite a [discussion about RPC, REST, and message queuing](http://steve.vinoski.net/blog/2008/07/13/protocol-buffers-leaky-rpc) — they are not the same thing. Each one is needed in a different scenario. All are used in building distributed systems.

Wikipedia’s [explanation of REST](http://en.wikipedia.org/wiki/Representational_State_Transfer) is quite informative, especially their [examples](http://en.wikipedia.org/wiki/Representational_State_Transfer#Example) of RPC versus REST.

The poster “soabloke” says RPC “Promotes tightly coupled systems which are difficult to
scale and maintain. Other abstractions have been more successful in building
distributed systems. One such abstraction is message queueing where systems
communicate with each other by passing messages through a distributed queue.
REST is another completely different abstraction based around the concept of a
‘Resource’. Message queuing can be used to simulate RPC-type calls
(request/reply) and REST might commonly use a request/reply protocol (HTTP) but
they are fundamentally different from RPC as most people conceive it.”

The [REST FAQ](http://rest.blueoxen.net/cgi-bin/wiki.pl?RestFaq) says, “Most applications that self-identify as using “RPC” do not conform to the REST. In particular,
most use a single URL to represent the end-point (dispatch point) instead of using a multitude of
URLs representing every interesting data object. Then they hide their data objects behind method
calls and parameters, making them unavailable to applications built of the Web. REST-based
services give addresses to every useful data object and use the resources themselves as the
targets for method calls (typically using HTTP methods)… REST is incompatible with
‘end-point’ RPC. Either you address data objects (REST) or you don’t.”

RPC: Remote Procedure Call assumes that people agree on what kinds of procedures they would like
to do. RPC is about algorithms, code, etc. that operate on data, rather than about the data
itself. Usually fast. Usually binary encoded. Okay for software designed and consumed by a
single vendor.

REST: All data is addressed using URLs, and is encoded using a standard MIME type. Data that is
made up of other data would simply have URLs pointing to the other data. Assumes that people
won’t agree on what they want to do with data, so they let people get the data, and act on it
independently, without agreeing on procedures.
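
To make the contrast concrete, here’s a rough sketch against a hypothetical user service (the URLs and parameters are made up for illustration):

    # RPC style: one endpoint; the operation and its arguments travel in the request body
    curl -X POST http://api.example.com/endpoint -d 'method=getUser&id=42'
    curl -X POST http://api.example.com/endpoint -d 'method=deleteUser&id=42'

    # REST style: each user is a resource with its own URL; the HTTP method is the verb
    curl http://api.example.com/users/42                        # fetch a representation
    curl -X PUT http://api.example.com/users/42 -d @user42.xml  # update it
    curl -X DELETE http://api.example.com/users/42              # delete it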