Archive for August, 2006

Massaging Data

Tuesday, August 22nd, 2006


Update: It seems that this behavior is no longer exhibited in the current version of Rails (1.1.6), as each test is wrapped in a transaction. So this entire post is pretty much moot now.

Rails’ tests (unit, functional, and integration) use a database with read-only values. Anything which modifies the database is not saved. This allows the tests to run in isolation from each other, which is a Good Thing.

In the real world, however, I often find myself needing to write methods which go and massage a bunch of data in the database. A simple example would be a nightly cron job which creates invoices for accounts whose billing cycles are due.

In Rails this is done by creating a static method like Account.create_all_invoices, and invoking it from the crontab via “script/runner -e production Account.create_all_invoices”. Since it is a method, it can be tested in a unit or integration test. (I think the latter is more appropriate, since a high-level method like this often touches a number of models.)

But this method’s main “output” is not a return value, but rather an adjustment (or lots of them) to the database. For example, I may want to go find every account with an open billing item, create an invoice in the invoices table, and then mark the item closed. What I really want to test when this method is done is that there are some number of new invoices in the invoices table, that the billing items have been marked closed, and that the account has no open billing items. But you can’t check most of this output, because it exists as changes to the database, not direct return values.

My impression is that the Rails way views this as a Bad Thing. Every method should return its results, rather than going in and massaging a bunch of data in the database. This makes it more orthogonal and easier to test.

I agree with this philosophy in theory, but I’m not sure that it is realistic for real-world applications. I’d like to think that’s because I’m still trapped in the SQL paradigm, so maybe someone can enlighten me on the pure Rails way to do this.

Here’s a concrete example:

class Account < ActiveRecord::Base
        has_many :invoices
        has_many :billable_items

        def open_billable_items
                # keep only the items that are still open
                billable_items.select { |item| item.open? }
        end

        def create_invoice
                open_billable_items.each do |item|
                        item.close   # creates the invoice and marks the item closed in the db
                end
        end

        def self.create_all_invoices
                Account.find(:all).each do |account|
                        account.create_invoice
                end
        end
end

class BillingTest < ActionController::IntegrationTest
        fixtures :accounts, :billable_items

        def test_create_all_invoices
                assert_equal 5, accounts(:first).open_billable_items.length
                assert_equal 0, Invoice.count

                Account.create_all_invoices

                assert_equal 0, accounts(:first).open_billable_items.length
                assert_equal 5, Invoice.count
        end
end

Maybe the right thing to do here is not to test at such a high level, but instead test only BillableItem.close by having it return the created invoice. This bothers me though. It needs to work at the high level, so why can’t I test that?

And this is a very simple example. In reality, the nightly cron job may be touching dozens of tables and thousands or even millions of rows. Returning all affected rows as a result doesn’t make much sense, and may be completely impossible due to memory limitations. (The whole reason we use a database is so that we can operate on large sets of data without having to instantiate every record at once!)
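One possible middle ground, sketched here in plain Ruby rather than Rails, is to have the method return a small summary of what it did (counts only) while the test still inspects the changed state directly. Everything in this sketch is hypothetical: the Struct-based BillableItem and Invoice stand in for the ActiveRecord models, and Billing and InvoiceRunSummary are invented names that appear nowhere in the code above.

```ruby
# Hypothetical stand-ins for the ActiveRecord models, so the sketch runs without Rails.
BillableItem = Struct.new(:account_id, :state) do
  def open?
    state == :open
  end

  def close
    self.state = :closed
  end
end

Invoice = Struct.new(:account_id)

# Hypothetical summary value: counts only, never the affected rows themselves.
InvoiceRunSummary = Struct.new(:accounts_processed, :invoices_created)

class Billing
  attr_reader :items, :invoices

  def initialize(items)
    @items = items
    @invoices = []
  end

  # Closes every open item, records one invoice per item, and returns counts.
  def create_all_invoices
    summary = InvoiceRunSummary.new(0, 0)
    open_by_account = @items.select(&:open?).group_by(&:account_id)
    open_by_account.each do |account_id, open_items|
      open_items.each do |item|
        item.close
        @invoices << Invoice.new(account_id)
        summary.invoices_created += 1
      end
      summary.accounts_processed += 1
    end
    summary
  end
end

billing = Billing.new(Array.new(5) { BillableItem.new(1, :open) })
summary = billing.create_all_invoices
puts "#{summary.invoices_created} invoices for #{summary.accounts_processed} account(s)"
```

The summary stays the same size no matter how many rows are touched, so it sidesteps the memory problem, but it does not replace checking the database itself: a test would still assert on the stored invoices and on the items' state after the call.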

Repositories vs. Collections

Monday, August 21st, 2006

The concept of a repository is a simple one on the surface, but it has surprisingly deep and subtle ramifications. Putting data in a repository encourages or even enforces unification, normalization, and consistency.

Source repositories were the first type of repository I encountered. I started using them for their archival capabilities, but as soon as I did I got a surprising side-effect: normalization of my project directory and file layouts. Almost overnight, my source trees turned from sprawling, unpredictable jungles of files into well-organized and consistently-named file hierarchies.

But it’s not just source. How about media? Many people have a movie “collection,” which is usually a pile of DVDs, VHS tapes, and maybe even laserdiscs which are not organized in any particular fashion. Some have covers, some don’t. Some are copies, some are store-bought. It’s a mess, but most people don’t think of it that way. It’s just the default, because that’s the way collections tend to be.

Repositories, on the other hand, tend toward the opposite. It’s actively difficult, and in some cases impossible, to have a repository that disorganized. iTunes is a great example of a media repository. That’s not to say that it can’t be a mess; it certainly can. But you have to sort of work at it. By default things are pretty well-organized.

Matz Gets Touchy-Feely

Thursday, August 17th, 2006

From The Philosophy of Ruby:

“Instead of emphasizing the what, I want to emphasize the how part: how we feel while programming. That’s Ruby’s main difference from other language designs. I emphasize the feeling, in particular, how I feel using Ruby. I didn’t work hard to make Ruby perfect for everyone, because you feel differently from me. No language can be perfect for everyone. I tried to make Ruby perfect for me, but maybe it’s not perfect for you. The perfect language for Guido van Rossum is probably Python.

[…]

For me the purpose of life is partly to have joy. Programmers often feel joy when they can concentrate on the creative side of programming, so Ruby is designed to make programmers happy.”

Tabs Have All The Advantages

Wednesday, August 9th, 2006

Tabs vs. spaces is one of those arguments that has raged across software codebases since time immemorial. I am always astonished that this simple matter has not been cleared up yet. Let me make it easy for everyone and just provide the answer: tabs have every single advantage. There is every reason to use them, and no reason not to.

Tabs allow every programmer to configure their indents the way that they want. Just because you like 2-character indents doesn’t mean that I do. Why force me to use your indentation size? Let me make a comparison to throw further light on the issue.

Sometimes fledgling web designers, realizing that HTML displays a little differently for everyone depending on their browser and screen settings, decide that they must have pixel-perfect control over what every user sees. So they make their HTML and then dump it to a giant JPG to display as the website. This is dumb, of course, but the argument is that they want it to look the same for everyone. Using spaces instead of tabs is the same sort of mentality: that everyone else MUST have the same size indents as you. The newbie web designer can be forgiven. Experienced programmers, on the other hand, should know better.

What it boils down to is that the tab character is a better representation of the actual data. A line indent is a different operation from a space. We have a standard character to represent this difference. They even map one-to-one with the keys on the keyboard: to indent code we press tab, to break apart tokens on the line we press the spacebar. It seems so utterly and completely obvious that we should have these map to a tab character and a space character, respectively, that I am astonished anyone ever thought otherwise.

The only reasonable argument I’ve heard for spaces is that some editors don’t handle tabs correctly. Please tell me, what editor that is good enough for programming can’t handle a simple tab? Every modern programmer’s editor I’ve encountered (e.g. vim, emacs, TextMate…) handles them correctly. Why are we making allowances for crappy editors that can’t do something as simple as process a tab character? If we thought that way, then we’d all still be writing table-based HTML so that we could fully support Netscape 4.0.

It should be noted that tabs, when used correctly, are only to be used for code indents. A tab is not a shortcut for pressing the spacebar a few times. Let me illustrate a common tab-use mistake, filling in the → character to indicate a tab character and a . for space:


→if (($this->some_value && $this->some_other_value) ||
→→→$this->third_value)
→→do_stuff();

The correct use of tabs in this case is:


→if (($this->some_value && $this->some_other_value) ||
→....$this->third_value)
→→do_stuff();

See here for a more detailed description of this technique.

The second line has a single layer of code indent (one tab), but then a few spaces to make it line up in an aesthetically pleasing way with the previous line. This will work correctly with any size tabs, and more correctly represents the data.
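To make the “works with any size tabs” claim concrete, here is a minimal Ruby sketch that expands the leading tabs of the two continuation lines above at several tab widths and measures how far each one sits from the start of the if line. The helper names (expand_leading_tabs, first_code_column, alignment_offset) are made up for this illustration, and the sketch assumes tabs only ever appear at the start of a line.

```ruby
# Expand the run of leading tabs into `width` spaces each.
# Assumes tabs appear only at the start of the line.
def expand_leading_tabs(line, width)
  line.sub(/\A\t+/) { |tabs| " " * (tabs.length * width) }
end

# Column of the first non-whitespace character once tabs are expanded.
def first_code_column(line, width)
  expand_leading_tabs(line, width).index(/\S/)
end

# How far the continuation line starts to the right of the first line.
def alignment_offset(first, continuation, width)
  first_code_column(continuation, width) - first_code_column(first, width)
end

FIRST_LINE  = "\tif (($this->some_value && $this->some_other_value) ||"
MIXED_ALIGN = "\t    $this->third_value)"  # one tab of indent, four spaces of alignment
TAB_ALIGN   = "\t\t\t$this->third_value)"  # tabs misused for alignment

[2, 4, 8].each do |width|
  mixed = alignment_offset(FIRST_LINE, MIXED_ALIGN, width)
  tabs  = alignment_offset(FIRST_LINE, TAB_ALIGN, width)
  puts "tab width #{width}: tab+spaces offset=#{mixed}, tabs-only offset=#{tabs}"
end
```

With one tab of indent followed by spaces, the offset stays at four columns at every width, so the continuation line keeps its place relative to the line above. With tabs used for alignment, the offset grows with the tab width, which is why that version only lines up for whoever shares the author’s setting.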

So please, I beg you. We gave up on tag soup and table-based layouts in favor of CSS; we gave up on “This site optimized for 800×600” in favor of liquid layouts; now please, please, stop expanding tabs into spaces in code.