mtnwestruby: Simple Bayesian Networks

Mountain West Ruby Conference: SBN
16 March 2007

Carl Youngblood. Simple Bayesian Networks with Ruby.

How do we handle uncertainty? Probability theory.
How do we come up with probabilities? From experience or from beliefs, but either of these may be inaccurate because of too few samples or skewed samples.
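
As a rough illustration of how a prior belief and observed evidence combine, here is a minimal sketch of Bayes' rule in plain Ruby. It does not use the sbn gem or its API; the `posterior` helper and all of the numbers (a 10% chance of a broken build, a warning that appears in 80% of broken builds and 20% of good ones) are assumptions made up for the example.

```ruby
# Bayes' rule: P(H | E) = P(E | H) * P(H) / P(E)
# prior               - belief in the hypothesis before seeing evidence
# p_evidence_if_true  - chance of seeing the evidence when the hypothesis holds
# p_evidence_if_false - chance of seeing the evidence when it does not
def posterior(prior, p_evidence_if_true, p_evidence_if_false)
  # Total probability of the evidence across both cases.
  p_evidence = p_evidence_if_true * prior + p_evidence_if_false * (1.0 - prior)
  p_evidence_if_true * prior / p_evidence
end

# Hypothetical numbers, for illustration only.
p_broken = posterior(0.10, 0.80, 0.20)
puts format("P(broken | warning) = %.3f", p_broken) # prints P(broken | warning) = 0.308
```

If the prior or the two conditional rates come from only a handful of samples, the result inherits that inaccuracy, which is the concern raised in the talk.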


gem install sbn

The Ruby version is faster than his C++ version. “Premature optimization is the root of all evil.” (Donald Knuth)

XMLBIF is the format that the Bayesian network is stored to and read from. It is compatible with the Java Bayes package. See http://en.wikipedia.org/wiki/Bayesian_network

He couldn’t get his SBN demonstration program working as expected. Sounds like SBN isn’t ready for prime time. Future improvements:

  • Exact inference
  • True continuous variables
  • Etc.

http://www.pixelglow.com/macstl/

Q: Why not do this in R, and connect R to Ruby?
A: Good idea.

I think it would be cool if someone showed how to connect SBN into CruiseControl.rb to identify what patterns of development and which developers are most likely to break a build.

mtnwestruby: Ruport

Mountain West Ruby Conference by Gregory Brown
16 March 2007
Pragmatic Community Driven Development in Ruby. Or “Rolling the dice with Ruport”. “Reporting Sucks”.

Gregory spent several minutes at the beginning of the presentation trying to get his Nvidia X display settings worked out for both the projector and his screen. None of the other people using Linux desktops had the same problems, and neither did the people using Windows or Mac OS X.

Ruport couldn’t have happened without a community to support it. Community = people + problems. Why is community important? Community helps root out bad ideas, and we all have bad ideas. With a good community, you can discover bad ideas sooner, and replace them with better ideas (hopefully).

How do we leverage community? One aspect is to pick a license that encourages people to get involved.

What license do we choose? Licenses aren’t as easy to “refactor” as code. Choose a license that is “right” for your project. Don’t write your own. In order to be taken seriously at a community and at a corporate level, pick one of the already approved OSS licenses. Picking a license is a compromise.

BSD/MIT favors individuals.
GPL favors communities.

Reporting is a big domain. Sometimes, it’s necessary to integrate with software with less than ideal licensing terms. The Ruport community chose the Ruby License.

Communities are idea warehouses.

You can only really work passionately on your problems. Find people who have the same problems. Learn to say “no”. Sometimes, less is more – you’ve got so many ideas floating around and such large groups – sometimes you need smaller groups and a focused, trimmed down set of ideas so that you can define software and implement it. It’s easier to read tens of emails per week than thousands.

Discussion led to code, which led to bug reports.

Mailing lists are not a bug tracker. GForge is, but it does too much. They decided to use SVN + Trac for Ruport, which allows a casual user to subscribe to what’s happening and to contribute. It’s amazing how much good tools can affect a community and allow contributions. But what about RubyForge? Virtually all interesting Ruby software is on RubyForge, which gives your software exposure. They used svk and cron to mirror their repository over to RubyForge.

Friction affects contribution. Explain how people can contribute bug reports, patches, etc.

Every patch is valuable. Not every patch is suitable, but the relationships you can establish are important. Patches (suitable or not) give you an idea of what people are trying to get out of your software. You still want relevant patches. What is relevant? It took them 15 months to figure it out, which is way too long. They came up with a roadmap and a scope definition. They didn’t just scope features, but the design as well.

We have to be careful not to accept code/features that are only for ourselves – that will never be used by anyone else. They moved some things out of the core into plugins.

Unique project identity is good. Half-way implemented features are bad for users.

Recommends the book “Producing Open Source Software” by Karl Fogel, which is available for free online.

Ruport 1.0 will be released on May 15, 2007.

Sometimes, developers have to “hide” from the large community surrounding their project, simply because they don’t have time to respond to all questions and ideas. They have to make time for families, work, and for development. It’s good to get smart users involved to help answer questions.

mtnwestruby: Ruby Queues

Mountain West Ruby Conference
Ruby Queues (RQ) by Ara Howard
16 March 2007

http://www.linuxjournal.com/article/7922
Ara works for NOAA, primarily with satellite data sets. 50 KLOC, all paid for by taxpayer dollars. He builds medium-sized distributed systems of 10-20 nodes.

RQ helps build instant distributed Linux clusters. When presenting RQ to scientists, he rarely mentions Ruby. Today, he will talk about the technical side of RQ. RQ isn’t one of the most interesting pieces of software he’s written, but he learned more than average while writing it. One of the reasons he teaches and presents is that he learns while doing it.

RQ has been used to help generate power outage maps after hurricanes hit. Why did he develop it? The lab purchased a bunch of Linux machines instead of a Cray, because it was cheaper. His job was to make them work together. He tends to believe that the first link on Google will yield the information he needs, so he went looking for a simple distributed computing framework. The solutions he found were the wrong fit, or overly heavyweight. In their environment, the programs that act on the data follow the data, because it’s more expensive to move data than to move programs. He decided to write RQ.

Tried using MySQL for the server queue controller. However, it adds complication with setting up usernames and passwords, and getting approval for the security thereof. He decided to leverage what was already approved – traditional UNIX file permissions and NFS for shared access to data. He also couldn’t run a process as root, or have it listen on a TCP port.

Needed NFS-safe lock files.

gem install lockfile # he wrote this package

NFS lockd wasn’t very good at throughput or fairness. One node would get the lock 500 times in a row, then the next node 500 times, etc. He wrote lock-polling code with a back-off algorithm. It took a while to get it right.
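
The lock-polling idea can be sketched in a few lines of Ruby. This is a minimal illustration of polling with exponential back-off and jitter, not the lockfile gem's code or API: the `with_lock` helper, its parameters, and the delay values are all assumptions made for the example. It relies on exclusive file creation, which the Linux open(2) manual describes as supported over NFS only with NFSv3 or later on 2.6+ kernels, so treat it as a sketch of the retry loop rather than an NFS-safe lock.

```ruby
require "tmpdir"

# Hypothetical helper: hold a lock file while the block runs.
def with_lock(path, max_attempts: 10, base_delay: 0.01, max_delay: 1.0)
  attempts = 0
  begin
    attempts += 1
    # CREAT | EXCL raises Errno::EEXIST if the lock file already exists.
    File.open(path, File::WRONLY | File::CREAT | File::EXCL) do |f|
      f.write(Process.pid.to_s)
    end
  rescue Errno::EEXIST
    raise "could not acquire #{path} after #{attempts} attempts" if attempts >= max_attempts
    # Double the wait on each failure, up to max_delay.
    delay = [base_delay * (2**(attempts - 1)), max_delay].min
    # Random jitter keeps competing nodes from polling in lockstep.
    sleep(delay * (0.5 + rand / 2.0))
    retry
  end

  begin
    yield
  ensure
    # Release the lock even if the block raised.
    File.delete(path) if File.exist?(path)
  end
end

lock_path = File.join(Dir.tmpdir, "rq-demo-#{Process.pid}.lock")
with_lock(lock_path) { puts "holding #{lock_path}" }
```

The jitter is one simple way to address the fairness problem described above, where a single node wins the lock many times in a row; how the real library handles it may differ.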

Ended up using SQLite for the shared data store. “Beats the pants off pstore, fsdb, Madeleine, etc.” Most of these less-than-ideal solutions didn’t work well with heavily cached NFS data. They would run for 2 weeks, then get corruption. In contrast, SQLite is very robust over NFS: it detects and recovers from corruption.

gem install slave

How does a normal user install daemon processes? RQ cron.

nrtq query – input and output in YAML. He didn’t tell scientists that it was YAML. They didn’t need to know. Using YAML meant he didn’t have to write his own parser, and it’s human readable.
RQ is being used on a single host to queue jobs. There’s a Rails plugin.
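
The YAML point above is easy to see with Ruby's standard library: a record can be written as human-readable text and read back without a custom parser. This is a minimal sketch; the job fields shown (`command`, `priority`, `tag`) are made up for illustration and are not RQ's actual schema.

```ruby
require "yaml"

# Hypothetical job record; the field names are illustrative only.
job = {
  "command"  => "process_scene /data/scene_001",
  "priority" => 5,
  "tag"      => "demo"
}

text   = YAML.dump(job)        # human-readable text that can be edited by hand
parsed = YAML.safe_load(text)  # back to a plain Hash, no parser to write

puts text
puts parsed == job
```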

Lessons learned:

  • NFS is quirky, but it’s the de facto standard. We get to live with it and work around the quirks.
  • LVM kills performance.
  • Roll your own NFS locking. The standard one is insufficient.
  • Use NFS hard mounts. They put nodes to sleep until the NFS server comes back online.
  • RQ does not move data around. They use vsftpd to allow data to be moved.
  • Constraints are good. Turns out many people and organizations operate under the same constraints.

Linux C++ IDE; NX

Lately, I’ve been developing on Linux. When developing remotely, I can get
along with a shell and vim, with VNC, or with remote-X. However, none of these
options are as fast or as nice as using NX. Here are the instructions to install and use
the NX server and client on Fedora Core 5 and 6:
http://fedoranews.org/contributors/rick_stout/freenx/

What’s the best C++ IDE in Linux? Out of the three IDEs I have evaluated, I’d
recommend either SlickEdit or NetBeans C++. I haven’t tried Emacs. I’ve installed KDevelop, but haven’t tried it much yet.

Eclipse CDT

  Overall: Immature and overcomplicated. I prefer vim with a ctags file, jedit, nedit, or gedit.
  Code Completion: Broken -- rarely works
  Search by Symbol or Reference: Broken
  Debugger support: Yes. Ugly user interface
  Custom build (bjam): Yes
  Project support: Yes. Automatically adds new files, removes old files from workspace
  Refactoring support: No
  Subversion support: Yes, with plugin

SlickEdit

  Overall: Excellent IDE
  Code Completion: The best of the bunch, but not as good as Visual Studio
  Search by Symbol or Reference: Excellent
  Debugger support: Yes. Difficult to set up
  Custom build (bjam): Yes
  Project support: Yes
  Refactoring support: Good
  Subversion support: Yes
  Notes: Has fairly good key emulation support for Visual Studio, Vim, Brief, Emacs, etc.
  Language Support: Tagging and syntax highlighting for C++, Java, Perl, Python and Ruby (to name just a few).

NetBeans C++

  Overall: Better than Eclipse CDT
  Code Completion: Yes
  Search by Symbol or Reference: Yes
  Debugger support: Yes, but haven't yet figured out how to set breakpoints.
  Custom build (bjam): Yes
  Project support: Not yet evaluated
  Refactoring support: No
  Subversion support: Yes, with plugin or with NetBeans beta 6.0.

KDevelop

  Overall: Not yet evaluated
  Code Completion: Yes
  Search by Symbol or Reference: Symbol - Yes (using ctags); Reference - Unknown.
  Debugger support: Yes
  Custom build (bjam): Most likely
  Project support: Yes
  Refactoring support: Unknown
  Subversion support: Yes

None of these tools are as good at code completion as Microsoft Visual Studio 2005.

Fedora Core 6 Disk Encryption

Here’s how to set up an encrypted disk and swap partition on Fedora 6. Refer to Disk encryption in Fedora: Past, present and future for more information. For RedHat (RHEL 4) or CentOS 4, refer to http://wiki.centos.org/TipsAndTricks/EncryptedFilesystem.

Warning: I have no idea how to set up encrypted disks in combination with LVM. I tend to shy away from LVM because it’s yet another layer of abstraction, making it difficult to recover a broken system. However, the following links may be of help: [1], [2].

In these examples, I’m encrypting the /home partition located on partition /dev/sda5, and the swap partition located on /dev/sda3. The partitions will be different on your system.

Create and Format Encrypted Disk

Before you start, you may want to obliterate the partition that will hold the encrypted file system:

$ shred /dev/sda5

Set up the encrypted disk:

$ cryptsetup -y --cipher aes-cbc-essiv:sha256 --key-size 256 luksFormat /dev/sda5
  # You must type "YES" to proceed
  # It will prompt you for a passphrase twice
$ cryptsetup luksOpen /dev/sda5 home
$ mkfs.ext3 -L /home /dev/mapper/home
$ cryptsetup luksClose home

Create /etc/crypttab

Create the /etc/crypttab file. It should be formatted as follows:

swap    /dev/sda3       /dev/urandom swap,cipher=aes-cbc-essiv:sha256
home    /dev/sda5       none    luks

Edit /etc/fstab

/dev/mapper/home        /home                   ext3    defaults        1 2
/dev/mapper/swap        swap                    swap    defaults        0 0

Whenever you boot the system, it will prompt you for your passphrase for the /home partition.

Linux, Asus M2V, Attansic Ethernet and SATA hard drive problems.

At work, I got a shiny new Linux development machine: an AMD 64 Dual Core 3800+ processor running on an Asus M2V 1.xx motherboard.

After installing Fedora Core 6, I ran into two problems. First, the built-in Attansic L1 Ethernet adapter wasn’t recognized. Google research revealed that an Attansic L1 driver will probably appear in the mainline Linux kernel in a few months. Rather than wait, I plugged in a supported Ethernet card.

Second, the SATA hard drive driver timed out. Occasionally, the system froze up without many error messages showing up in the system log. I logged in at the console as root and ran “exec tail -f /var/log/messages” (redirecting syslog to a remote machine is a better solution). The next time the system froze up, I saw more output in syslog. It contained approximately the following:

ata1.00 exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00 BMDMA stat 0x4 timeout
ata1.00 qc timeout (cmd 0xec)

Google helped me stumble on the following workaround, which seems to work. I don’t know what it does. Edit /etc/grub.conf. Find the lines that say “kernel” and add “acpi=force irqpoll noapic hda=noprobe” to the end, and reboot.

title Fedora Core (2.6.19-1.2911.fc6)
        root (hd0,5)
        kernel /vmlinuz-2.6.19-1.2911.fc6 ro root=LABEL=/ rhgb quiet acpi=force irqpoll noapic hda=noprobe
        initrd /initrd-2.6.19-1.2911.fc6.img
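
With several kernels installed, every “kernel” line needs the same change. The edit can be sketched as a small Ruby filter; this is only an illustration, the `add_kernel_args` helper is hypothetical, and it should be run against a copy of the file and the result reviewed before replacing /etc/grub.conf.

```ruby
# The workaround arguments from the text above.
EXTRA_ARGS = "acpi=force irqpoll noapic hda=noprobe"

# Hypothetical helper: append the arguments to every "kernel" line
# that does not already carry them, leaving all other lines untouched.
def add_kernel_args(conf_text, extra = EXTRA_ARGS)
  conf_text.lines.map do |line|
    if line =~ /\A\s*kernel\s/ && !line.include?(extra)
      "#{line.chomp} #{extra}\n"
    else
      line
    end
  end.join
end

# Sample stanza in the same shape as the listing above, before the edit.
sample = <<~CONF
  title Fedora Core (2.6.19-1.2911.fc6)
          root (hd0,5)
          kernel /vmlinuz-2.6.19-1.2911.fc6 ro root=LABEL=/ rhgb quiet
          initrd /initrd-2.6.19-1.2911.fc6.img
CONF

puts add_kernel_args(sample)
```

Running the filter twice leaves the file unchanged the second time, so it is safe to reapply after a kernel update adds a new stanza.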

Useful commands (helpfully documented on this blog):

  • dmidecode – tells me what motherboard I have
  • lspci – tells me what built-in Ethernet adapter I have
  • cat /proc/cpuinfo – tells me about my CPU

What’s happening with Version Control Systems?

I’ve long had an interest in version control systems (VCS), also known
as source code management (SCM) systems, beginning with RCS, SCCS and
CVS. CVS was already showing its age when I started using it in 1998.
When the company I worked for, Axent, was acquired by Symantec in 2000,
we switched to using Perforce. At first, I thought Perforce was a step
backwards from CVS. After using it heavily for a few months, it was
clear that CVS and WinCVS didn’t come close to the ease-of-use and
features of Perforce and p4win. CVS was dreadfully slow compared to
Perforce, which was lightning fast (and still is).

Perforce encourages third-party developers to develop add-ons for use
with their software, which is almost as good as what you get with an
active open-source project. Although Perforce is proprietary, it’s about as open
as I’ve ever seen a commercial project. It runs on many platforms, has
conversion scripts to migrate CVS repositories to Perforce, etc. It’s not
cheap, unless you’re working on an open-source project, in which case, you can
get free licenses to use it.

At some point, I heard about the Subversion
project, which aimed to correct many of the deficiencies of CVS. Those
were the pre-1.0 days, and it was interesting to watch the development
of Subversion.

About the same time, BitKeeper was in the news. It was different from
CVS, Subversion and Perforce because it was a distributed version
control system. The idea appealed to me because a developer could have
version control for his/her private changes without having to check in
to the main repository until they were ready. At that time, there
weren’t any mature open-source distributed version control systems to
investigate.

I switched jobs late in 2004, and my new company was using Subversion.
Overall, I have been very pleased with Subversion in day-to-day use.
It’s much better than CVS. We had some reliability problems with the
Subversion server. It was running on Windows with the BDB database
storage back-end. When it was switched to a Linux server with the FSFS
back-end, it became much more reliable. My team uses TortoiseSVN, an
excellent user interface that integrates with Windows Explorer.

I’ve periodically kept tabs on version control systems. Many open-source variants have sprung up over the last few years: Mercurial, Bazaar-NG, Git/Cogito, Darcs, SVK, Arch and Monotone.
Lately though, I haven’t seen any great reviews on which ones are the
most mature, or what the pros and cons are of each. So, I’ve done some
google research to figure it out, focusing primarily on the distributed
variants.

The conclusion I’ve come to is that the developers of each version
control system are learning from the developers of the other version
control systems, and each project is improving. The Subversion developers are
learning from the distributed version control developers. Recently, there was
an SVN developer summit and they tried out Mercurial, which tells me that there’s merit to the distributed approach.

If you’re already using a modern version control system, the cost to
switch may outweigh the benefit. Organizations seem to be able to
cope with legacy tools like Visual SourceSafe and CVS, although better tools
can make developers’ lives easier.

Here’s my own highly subjective comparison table. I’ve marked, in red, some of the things
I think are noteworthy. I focused my efforts on the competitors that
seem to have garnered the most community adoption. I’ve included one
commercial system, Perforce. Each item is rated on a scale of 1 to 10, 10 being the best. (Update: There’s a better table than mine at http://bazaar-vcs.org/RcsComparisons and various comparisons at Wikipedia)

Comparison of Source Code Management systems

January 31, 2007                Subversion  SVK   Git/Cogito  Mercurial  Bazaar-NG  Darcs  Perforce  Notes
Command-line name               svn         svk   git / cg    hg         bzr        darcs  p4
Cross-Platform                  10          9     6           10         10         9      10        Windows, Linux, Mac, Solaris, etc.
Maturity                        9           6     8           7          5          8      10        Maturity based on lifetime, and project flux in code
Maturity: GUI                   9           0     5           4          3          1      10
Disconnected/offline operation  2           10    10          10         10         10     0         Disconnected 1. editing of files, 2. branching, 3. merging, 4. history, etc. Especially handy when there’s no network connectivity, such as when on an airplane.
Community Adoption              10          2     8           7          5          2      1
Documentation Quality           10          7     7           8          6          8      10
Storage Format: Robustness      5           5     10          8          7          5      5         Storage format least susceptible to corruption.
Storage Format: Not in flux     1           1     10          8          1          1      ?
(re)Merging support             0           9     9           9          9          10     4         Remembers prior merges, cherry-picking, etc.
Repository Size                 1           9     10          9          ?          ?      ?
Speed                           2           7     10          8          6                 10
Scalability                     9           9     10          9          5          5      9
Commercial Backing              10          5     10          10         10         5      10
Subversion Integration          10          8     6           5          4          4      ?         Tailor can be used to migrate changes between all systems
Totals:                         88          87    119         112        81         68     79
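
Recomputing the totals row suggests it is a plain column sum in which unrated cells (“?” or missing) count as zero, so systems with gaps (Bazaar-NG, Darcs and Perforce) score lower than they otherwise might. The sketch below reproduces that arithmetic; treating the missing Speed entry as belonging to Darcs is an assumption, chosen because it is the reading that matches the published totals.

```ruby
SYSTEMS = %w[Subversion SVK Git/Cogito Mercurial Bazaar-NG Darcs Perforce]

# One row per criterion, in table order; nil marks an unrated cell.
SCORES = [
  [10, 9,  6,  10, 10,  9,   10],  # Cross-Platform
  [9,  6,  8,  7,  5,   8,   10],  # Maturity
  [9,  0,  5,  4,  3,   1,   10],  # Maturity: GUI
  [2,  10, 10, 10, 10,  10,  0],   # Disconnected/offline operation
  [10, 2,  8,  7,  5,   2,   1],   # Community Adoption
  [10, 7,  7,  8,  6,   8,   10],  # Documentation Quality
  [5,  5,  10, 8,  7,   5,   5],   # Storage Format: Robustness
  [1,  1,  10, 8,  1,   1,   nil], # Storage Format: Not in flux
  [0,  9,  9,  9,  9,   10,  4],   # (re)Merging support
  [1,  9,  10, 9,  nil, nil, nil], # Repository Size
  [2,  7,  10, 8,  6,   nil, 10],  # Speed (Darcs cell assumed missing)
  [9,  9,  10, 9,  5,   5,   9],   # Scalability
  [10, 5,  10, 10, 10,  5,   10],  # Commercial Backing
  [10, 8,  6,  5,  4,   4,   nil]  # Subversion Integration
]

# Sum each column, skipping unrated cells (equivalent to counting them as 0).
totals = SCORES.transpose.map { |column| column.compact.sum }

SYSTEMS.zip(totals).sort_by { |_, total| -total }.each do |name, total|
  puts "#{name}: #{total}"
end
```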

If I were to pick a VCS today, it would probably be Git, followed by Mercurial. What follows are my unpolished notes and ideas.

Git/Cogito

Git is very scalable, and is the fastest open-source version control
system available. Git has a wide community of professional engineers
supporting it, and it has a bright future. There are graphical user
interfaces available for Git such as gitk and qgit, although none of
them are as mature as the user interfaces available for Subversion.
Cogito is the easy-to-use command-line wrapper around git. See also the
Cogito Wiki. According to Keith Packard of xorg fame, Git has the most
robust/reliable repository storage format. Advantages of git and all
distributed VCSes include 1. offline repository access, 2. private
branches, 3. distributed backups including change history.

For those wishing to use Git/Cogito on Windows, use Cygwin and select the git and/or cogito packages and read the information at http://git.or.cz/gitwiki/WindowsInstall. For those organizations wishing for excellent Windows-Explorer integration, use git-cvsserver in combination with TortoiseCVS.

To install git and cogito on Fedora, run the following as root:
  yum install git cogito qgit

I’ve reluctantly decided that Git isn’t as mature as Subversion, which
shouldn’t be surprising because Subversion has been around for longer. Git
isn’t the right fit for all projects. Git was designed for monolithic code
bases, not for modular code bases, although work is in progress to allow it to
support sub projects
(similar to svn:externals).
“Such flexibility is an implicit feature of centralized SCMs, but is much more
difficult to implement in a distributed system like git. As a result, git
currently lacks built-in subproject support, although gitweb does have a notion
of subprojects.”

There’s a document that describes Common Mistakes made when using Git. Unfortunately, most of it isn’t written yet — there’s only a loose outline.

Tutorials:

Tools — See http://git.or.cz/gitwiki/InterfacesFrontendsAndTools

Mercurial

The OpenSolaris project decided between Bazaar-NG, Git and Mercurial.
Mercurial was chosen primarily because 1. it was fast (although Git is
faster), 2. the Mercurial developers were very responsive to the
OpenSolaris developers, 3. OpenSolaris developers felt like they
could hack Python code, and 4. the repository format works well with ZFS &
NetApp filesystem snapshotting. Their evaluation of Git is here,
and it looks like the listed downsides are now out-of-date or superficial. The Mozilla project had a “version control shootout”, and although they haven’t yet made a decision, Mercurial and Bazaar-NG sounded the best to them.

The following has diagrams to illustrate distributed merging:
http://www.selenic.com/mercurial/wiki/index.cgi/UnderstandingMercurial

Mercurial is more mature than Bazaar-NG, and Mercurial is faster:
http://sayspy.blogspot.com/2006/11/bazaar-vs-mercurial-unscientific.html

“Technologically, centralized systems are a single point of failure–
any problems with the central server are problems for all people using
it.” — http://bazaar-vcs.org/WhyUseBzr

Mercurial supports access control, email notify, line-ending conversion,
etc.:
http://www.selenic.com/mercurial/wiki/index.cgi/UsingExtensions

SVK

SVK is built on top of Subversion, so it should, in theory, integrate
well with an existing Subversion repository, allowing developers to use
a distributed tool even if the master server remains a Subversion
server. Community adoption is high enough to have some confidence
in the future of the project, although adoption isn’t nearly as high as with Git, Mercurial or Bazaar-NG.

It used to be difficult to install, but you can now get a prebuilt
installer for Windows and probably for Linux as well. Working copies
(sandboxes) have no extra metadata (no .svn directories, which interfere
with find, etc.). The repository format is significantly smaller than
with Subversion. I’ve found that SVK is much faster than Subversion,
although I haven’t used it much. There is not yet a graphical
user interface, which is a must for many organizations/communities.

The good, the bad and the ugly about SVK (Sept 2006):
http://kitenet.net/~joey/blog/entry/svk.html

Darcs

Users of darcs, including myself, appreciate its simplicity and
ease-of-use (note: Cogito, Mercurial and Bazaar-NG are also easy to
use). Downsides of darcs are that 1. Darcs is implemented in Haskell,
which limits the contributing developer community (perhaps it will inspire
people to learn Haskell), 2. it depends on having Haskell libraries installed and
3. there’s no graphical user interface, unless you consider darcsweb. Still, I like darcs, and I use it on my
home Linux box. Like Perforce and SVK, darcs doesn’t clutter up directories
with .darcs metadata. It used to be that Darcs wasn’t very scalable, but
I hear that it’s become much more scalable as of mid-2006. I’ve read that
Mercurial and Darcs feel somewhat similar in their command-line user
interface.

Mirroring Subversion with Darcs and Tailor (Sept 2006):
http://fiatdev.com/articles/2006/09/10/mirroring-subversion-with-darcs-and-tailor

Subversion

Subversion has a bright future, I think, and we may yet see some of the
advantages of distributed systems appear. For those who need merge
history tracking, which makes future merges from the same branch
easier, there’s svnmerge.py. In a future release, Subversion will have this feature built-in.

The Subversion 1.4 release brought impressive speedups for working copy operations.

Control/Power

Changing information flow by switching from a centralized system to a
distributed system will empower or disempower different sets of people.
I wouldn’t be surprised if one encounters resistance in switching.

In the centralized model, developers are empowered to make any change
they want, which may affect everyone, without consulting others. Of course,
if they abuse that power, they may lose commit access. With a distributed system, an integrator pulls in people’s changes based on what and whom they trust. If
you’re aiming for quality code that doesn’t destabilize a system, it
sounds like a good approach, and it works well for Linux kernel development. Most distributed systems can also be used like a
centralized system, so that no integrator is required: individuals can push their changes to the master repository.

HOWTO Make Windows XP unusable

A friend of mine was cleaning out what he thought was cruft from his
c:\Windows\System32 directory when he deleted oembios.dat. His computer failed
to boot after that, and a system restore disk didn’t help. Although he could boot into a command prompt, he couldn’t boot up in safe mode. He fixed the problem
by copying an oembios.dat file from another computer. Read more about this here. The oembios.dat file may be related to Windows Product Activation.