voidynullness

(mis)adventures in software development...

    
25 July 2013

Extract emails from Gmail with Python via IMAP

Category Python

Using Python to get email from a Gmail account via IMAP: The Least You Need To Know.

I have a filter set up in a Gmail account to automatically collect what are essentially automatically generated emails from a particular source, and file them neatly away under a label, leaving the inbox relatively uncluttered with their consistently formatted regularness. Thing is, those particular emails are not achieving their potential for maximum usefulness to me by just sitting there in my Gmail account. And with their predictable content, they are ripe candidates for parsing and extracting the useful data, and storing it in a more useful form, like a database or spreadsheet. But instead they just pile up, and silently mock my supposed geek cred with each passing day. Well, I have had enough of their judgmental presence! It is high time I finally worked out a way to programmatically download and parse their sorry arses. And anyway, if the NSA is allegedly algorithmically accessing my emails, why shouldn’t I? So as part of my continuing effort to Script My Life, I once again turn to trusty Python to make my life easier, one hackily bespoke script at a time.

The nice thing about Python is that there’s a module for just about everything. The not-so-nice thing about Python is that there’s usually more than one module for everything. And often a bewildering number of packages offering the same functionality. Usually all deprecated — except for one. Or possibly all “un-Pythonic” — except for one.

So while there may be more than one way to programmatically get emails out of Gmail with Python, one option is IMAP. Gmail can be accessed via IMAP, and conveniently enough the Python Standard Library has an IMAP interface (imaplib), so it appears not unreasonable to use IMAP. Unless you’re reading this in a future where IMAP has been deprecated. Or deemed un-Pythonic.

First step is to create an IMAP4 instance, preferably the SSL variant for security, connected to the Gmail server at imap.gmail.com:

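#!/usr/bin/env python3

import sys
import imaplib
import getpass
import email
import email.utils  # imported explicitly because the date helpers live here
import datetime

# Connect over SSL; IMAP4_SSL defaults to port 993.
M = imaplib.IMAP4_SSL('imap.gmail.com')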

Next we can attempt to log in. If the login fails, an exception of type imaplib.IMAP4.error will be raised:

try:
    # getpass prompts for the password without echoing it
    M.login('notatallawhistleblowerIswear@gmail.com', getpass.getpass())
except imaplib.IMAP4.error:
    print("LOGIN FAILED!!!")
    # ... exit or deal with failure...

If the login is successful, we can now do IMAPy things with our IMAP4 object. Most methods of IMAP4 return a tuple where the first element is the return status of the operation (usually 'OK' for success), and the second element is a list of response data, each item of which is either a string (bytes on Python 3) or a tuple.

For example, to get a list of mailboxes on the server, we can call list():

rv, mailboxes = M.list()
if rv == 'OK':
    print("Mailboxes:")
    print(mailboxes)
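Each entry in that list is a raw response line rather than a bare mailbox name. As a rough sketch, assuming entries look like the sample lines below (made up for illustration, not captured from a real account), a small hypothetical helper, parse_list_entry(), can split one into flags, hierarchy delimiter and name. It makes no attempt to handle every form a server may send.

```python
import re

# Pattern for one LIST response line: (flags) "delimiter" name
LIST_LINE = re.compile(r'\((?P<flags>[^)]*)\) "(?P<delim>[^"]*)" (?P<name>.*)')

def parse_list_entry(entry):
    """Split one list() entry into (flags, delimiter, mailbox name).

    Returns None for lines this simple pattern does not understand.
    """
    if isinstance(entry, bytes):
        # Python 3's imaplib hands back bytes
        entry = entry.decode('ascii', 'replace')
    match = LIST_LINE.match(entry)
    if match is None:
        return None
    name = match.group('name').strip()
    # Drop the surrounding quotes the server may put around the name
    if len(name) >= 2 and name[0] == '"' and name[-1] == '"':
        name = name[1:-1]
    return match.group('flags').split(), match.group('delim'), name

# Sample lines (illustrative only)
sample = [
    b'(\\HasNoChildren) "/" "INBOX"',
    b'(\\HasNoChildren) "/" "Top Secret/PRISM Documents"',
]
for entry in sample:
    print(parse_list_entry(entry))
```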

With Gmail, this will return a list of labels. To open one of the mailboxes/labels, call select():

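# The name is wrapped in double quotes because it contains a space;
# Python 3's imaplib does not add the quotes for you.
rv, data = M.select('"Top Secret/PRISM Documents"')
if rv == 'OK':
    print("Processing mailbox...\n")
    process_mailbox(M) # ... do something with emails, see below ...
    M.close()
M.logout()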

So with the mailbox selected, we can now get the emails within it. For example, we can get all the emails in the selected mailbox and for each one output the message number, subject, and date:

# Note: This function definition needs to be placed
#       before the previous block of code that calls it.
def process_mailbox(M):
    rv, data = M.search(None, "ALL")
    if rv != 'OK':
        print("No messages found!")
        return

    for num in data[0].split():
        # Use a separate name so the search result is not overwritten.
        rv, msg_data = M.fetch(num, '(RFC822)')
        if rv != 'OK':
            print("ERROR getting message", num.decode())
            return

        # fetch() returns bytes on Python 3, so parse with
        # message_from_bytes() rather than message_from_string().
        msg = email.message_from_bytes(msg_data[0][1])
        print('Message %s: %s' % (num.decode(), msg['Subject']))
        print('Raw Date:', msg['Date'])
        date_tuple = email.utils.parsedate_tz(msg['Date'])
        if date_tuple:
            # mktime_tz() gives a UTC timestamp; fromtimestamp() makes it local
            local_date = datetime.datetime.fromtimestamp(
                email.utils.mktime_tz(date_tuple))
            print("Local Date:",
                  local_date.strftime("%a, %d %b %Y %H:%M:%S"))

We use the search() method to get a list of message sequence numbers, then loop over these, calling fetch() to get the actual messages.
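search() accepts other IMAP search criteria besides ALL. Here is a minimal sketch, assuming you only want messages that arrived on or after a given date. The hypothetical helper since_criterion() builds a SINCE criterion in the dd-Mon-yyyy form IMAP expects, and message_numbers() (also hypothetical) shows the shape of what search() hands back, using made-up response data. Nothing here talks to a server.

```python
import datetime

# IMAP wants English month abbreviations, so avoid the
# locale-dependent strftime('%b')
MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

def since_criterion(day):
    """Build a SEARCH criterion for messages whose internal
    (arrival) date is on or after `day`."""
    return '(SINCE "%d-%s-%d")' % (day.day, MONTHS[day.month - 1], day.year)

def message_numbers(search_data):
    """Turn the data part of a search() result into a list of ints."""
    # search() returns a one-item list holding space-separated numbers
    return [int(n) for n in search_data[0].split()]

criterion = since_criterion(datetime.date(2013, 7, 25))
print(criterion)

# Made-up response data, shaped like the second element search() returns
print(message_numbers([b'1 2 3']))
```

With a live connection the criterion would be passed along as M.search(None, criterion).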

fetch() returns the raw message contents. To avoid having to parse the actual message data from fetch() ourselves, we can use the email package from the standard library. Once again, there are a few different packages floating around for doing this kind of thing, but I think email is currently the one least likely to get you down-voted on Stack Overflow.

message_from_string() and its bytes counterpart message_from_bytes() both return a message object, and we can then access header items on that object as if it were a dictionary.
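To make the dictionary-style access concrete, here is a small sketch that parses a made-up message (the addresses and subject are invented) without touching a server. It also decodes an RFC 2047 encoded Subject, which msg['Subject'] would otherwise return in its raw encoded form. decode_header() and make_header() come from the standard library's email.header module; decoded_header() is just a hypothetical wrapper name.

```python
import email
from email.header import decode_header, make_header

# An invented message, in the raw form fetch() would return it
RAW = (b"From: sender@example.com\r\n"
       b"To: me@example.com\r\n"
       b"Subject: =?utf-8?q?caf=C3=A9_report?=\r\n"
       b"Date: Thu, 25 Jul 2013 10:30:00 +1000\r\n"
       b"\r\n"
       b"Body text.\r\n")

msg = email.message_from_bytes(RAW)

# Header lookup is case-insensitive and gives None for missing headers
print(msg['from'])
print(msg['X-No-Such-Header'])

def decoded_header(message, name):
    """Return a header as readable text, decoding RFC 2047 encoded words."""
    raw = message[name]
    if raw is None:
        return None
    return str(make_header(decode_header(raw)))

print(decoded_header(msg, 'Subject'))
```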

Which brings us to the “Date” header, and the potentially thorny issue of date and timezone. If you don’t care about the date/time the emails were sent, then things are much simpler. But if you do care about such matters, note that the contents of the “Date” header may vary depending on the email client sending the email, and the timezone of the sender. Reliably converting to local time can be surprisingly tricky. Python once again comes to the rescue with numerous modules promising to assist you with all your deepest darkest date conversion needs. The code snippet above shows one possible way of converting to local time, using the capabilities of email.utils: parsedate_tz() splits the header into a tuple that includes the UTC offset, mktime_tz() turns that into a UTC timestamp, and datetime.fromtimestamp() converts the timestamp to local time. It’s currently working well enough for my purposes, but there may be better ways to accomplish this.
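As one alternative sketch, on Python 3.3 or later email.utils.parsedate_to_datetime() gives a timezone-aware datetime directly, and astimezone() then converts it. The header value below is invented for the example, and to_utc_and_local() is a hypothetical helper name, not something from the standard library.

```python
import datetime
import email.utils

def to_utc_and_local(date_header):
    """Parse a Date header into (UTC datetime, local datetime), or None."""
    if not date_header:
        return None
    try:
        sent = email.utils.parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        # Unparseable header; the exception type depends on the Python version
        return None
    if sent.tzinfo is None:
        # A "-0000" offset yields a naive datetime; treat it as UTC here
        sent = sent.replace(tzinfo=datetime.timezone.utc)
    utc = sent.astimezone(datetime.timezone.utc)
    local = sent.astimezone()  # no argument: the system's local timezone
    return utc, local

utc, local = to_utc_and_local("Thu, 25 Jul 2013 10:30:00 +1000")
print("UTC:  ", utc.strftime("%a, %d %b %Y %H:%M:%S %z"))
print("Local:", local.strftime("%a, %d %b %Y %H:%M:%S %z"))
```

Keeping the datetime timezone-aware means the offset travels with the value, which is handy if the result ends up in a database or spreadsheet.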

The message body can be obtained by calling msg.get_payload(), which will return the payload data as a string if the message is not multi-part (for a multi-part message it returns a list of message objects, one per part). For text messages, you could then parse the data using regular expressions. For parsing contents of HTML emails however, you must not use regular expressions. Ever. Or you will feel the Lovecraftian wrath of the Stack Overflow minions. Instead, use an HTML parser, or a higher-level scraper like Beautiful Soup.
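As a sketch of the plain-text case, the hypothetical helper below returns the first text/plain part of a message, walking the parts if it is multi-part, and then pulls a value out with a regular expression. The message is built locally with invented content (the "Order total" line is made up), so no server is needed.

```python
import email
import re
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def plain_text_body(msg):
    """Return the first text/plain part of a message as text, or None."""
    # walk() yields the message itself and then, for multi-part
    # messages, each nested part in turn
    for part in msg.walk():
        if part.get_content_type() == 'text/plain':
            # decode=True undoes base64 / quoted-printable and returns bytes
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or 'utf-8'
            return payload.decode(charset, 'replace')
    return None

# Build a made-up multi-part message with text and HTML alternatives
outer = MIMEMultipart('alternative')
outer['Subject'] = 'Daily report'
outer.attach(MIMEText('Order total: 42.50\n', 'plain'))
outer.attach(MIMEText('<p>Order total: <b>42.50</b></p>', 'html'))

# Round-trip through bytes, as if it had come back from fetch()
msg = email.message_from_bytes(outer.as_bytes())

body = plain_text_body(msg)
match = re.search(r'Order total: (\d+\.\d+)', body)
if match:
    print(match.group(1))
```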

UPDATE:

  • A complete version of the above code is available in this gist.
  • For a variation on the theme, see also this gist: a script that simply dumps all emails in an IMAP folder to files.