(mis)adventures in software development...

12 April 2013

I’ll write a script instead…

Category Python

A Python script to convert date fields in a CSV file from MM/DD/YYYY to DD/MM/YYYY format — because Excel wouldn’t do it properly!

I imagine there must be many jobs, especially in and around the finance industry, where using Excel is part of daily life. I feel sorry for those people.

I find myself using spreadsheets quite a bit lately, and it’s often a frustrating experience. There’s a lot of annoyingly repetitive pointing and clicking for one thing, which is frustrating for a Unixy guy like myself. Then there’s Excel’s remarkable ability to find unique and unexpected ways to mangle the data I’m manipulating each step of the way. (And from what I’ve seen, Libre Office isn’t much better.) But what really gets to me is when I have to do something that seems like it should be relatively simple to do, but turns out to be anything but simple.

For example, I recently found myself wanting to import a CSV file into Excel. A seemingly straightforward enough task in itself, assuming the CSV file is properly formatted, which in this case it was. The problem was that the CSV file in question had a date-time field, and the date part was in US format (MM/DD/YYYY hh:mm), and I wanted it in “Australian” format. (DD/MM/YYY hh:mm). Well, to be exact, I didn’t’ really care about the date format, but I wanted it converted into a proper “date” field in Excel with the correct value, so I could sort by date. This proved to be a problem. Annoyingly though, the functionality appears to be there. There is a “Text Import Wizard” which in theory allows date fields to be marked, and even the order (e.g. “D/M/Y”) to be specified. It just didn’t work. For me. For this file. I think it was getting confused by the time part, and despite much trial and error and fiddling on my part, it would not parse the date-time field correctly.

There might have been a way to get this work. And I’m pretty sure I did get something like this working in the past (maybe using Libre Office??). But in frustration, I thought, “screw Excel, I’ll write a script to do this instead.”

And so I did. I wrote a Python script to reformat date-time fields in a CSV file just so I could then load the CSV file into Excel and have it properly recognise the date-time values. What fun. There’s time well spent.

So for anyone who might be even remotely interested — and so that they don’t have to do this themselves — here is the script.

It takes 3 command line arguments:

  • Input filename
  • Output filename
  • Index of the date-time field to reformat (zero-based)

It makes the following assumptions:

  • File is comma delimited.
  • First line in the file is a header, and discarded, so output is without a header line.
  • In date-time fields, only the date part is modified. If there’s any text after the date part, it’s included in the results unmodified, so both date and date-time fields should be handled properly.
#!/usr/bin/env python

import sys
import re
import csv

if (len(sys.argv) > 3):
    input_filename = sys.argv[1]
    print "Using input file:", input_filename
    output_filename = sys.argv[2]
    print "Using output file:", output_filename
    field_index = int(sys.argv[3])
    sys.exit("Usage: csv_us2au_date.py <inputfile> <outputfile> <fieldindex>")

prog = re.compile(r'(\d+)/(\d+)/(\d+)(.*)')

of = open(output_filename, "wb")

with open(input_filename, 'rb') as csvfile:
    writer = csv.writer(of)
    reader = csv.reader(csvfile)
    next(csvfile)       # skip first line!

    for row in reader:
        m = prog.search(row[field_index])
        if m is None:
            print "INVALID DATE AT LINE %d" % (reader.line_num)
            v = row[field_index]
            row[field_index] = '"' + v + '"'
            row[field_index] = "%02d/%02d/%d" % (int(m.group(2)),
            if m.group(4):
                row[field_index] += m.group(4)

So nothing earth-shattering there, just annoying this is what I had to resort to in an attempt to get Excel to do what I wanted it to.

Which got me thinking about how, as a coder, there is a tendency to re-invent the wheel (“That way of doing this is fucked, I want to do things my way, because my way will be better!”) This is something I (and I imagine most coders) have to be careful of, and often fight against.

For one thing, it’s not always efficient in terms of time and energy. In this case, it’s debatable whether I saved any time. It’s not a long or complex script by any means, but the input file wasn’t particularly big either. It probably would have taken me a similar amount of time just to manually adjust the date fields. Although that would have been much less “fun”.

(The problem was, I didn’t stop just there. Once I started coding, I kept coding! I ended up writing a whole bunch of far less trivial scripts to essentially automate all the data processing I was intending to do in Excel manually. This definitely took more time than it would have taken to do manually, but if nothing else it was an interesting exercise in Python programming and using SQLAlchemy. It also means doing the same thing when Tax Return time comes round again next year should be a lot simpler — assuming the broker doesn’t change the format of it’s CSV export file!)

On the other hand, many a great software product or open source project has started just this way — with someone cracking the shits and turning to coding to solve a particular problem or automate some tedious task.

This is all well and good for those of us who can code — for those of us who do have the option of saying, “screw it, I’ll write a script to do this instead”. But what about those who don’t have that “luxury”? What about those who’s only interest in computers and software is as a tool to get things done? Like those people who might have to use spreadsheets as part of their day job (for want of a more specialised tool)? What do they do when face with some simple text processing task, like in this case converting a date-time format. What then? Write a VBA macro? Sure, that’s presumably why there’s a scripting language is embedded into Excel, to allow (power) users to deal with (advanced/unusual) situations the software by default cannot handle (easily). But that means learning VBA, and by extension, learning something about scripting/programming. Not everyone is interested or cut out to be a coder, so this isn’t going to be a great option for them. Which leaves lots of manual pointing and clicking.

Of course, I’m looking at this from the point of view of a programmer. I know that converting between date formats is a simple text processing task, the sort of simple text processing task computers are well suited for. Which is why I am frustrated to find I cannot get Excel to do it easily. Perhaps someone less computer-literate might just resign themselves to do it manually and not give it a second thought.

There are some that therefore say this is why some coding knowledge is useful even for non-coders in this day and age. Well, kind of. If someone is interested in learning to code, I’m all for it, have at it. But I think the reality is not everyone is cut out to be a (good) coder. And there are some people who just have no interest in coding, which I think is fine. Computers after all are just a tool, a machine, a labour saving device. I don’t think computer users should need to be coders to use computers effectively, in the same way car drivers shouldn’t need to be mechanics to drive a car! If anything, I think the onus is on us coders to do a better job of creating and designing software that allows users to do what they need to do as efficiently as possible and with the minimum amount of annoyances possible.

In this respect, it seems despite the advances that have been made in technology and human-computer interaction, we still have a long way to go to make the technology stay out of the way for users who just want to get things done, without having to tinker under the hood. Well, at least that’s the case with Excel.