For my final project, I needed all of the Google chats my husband Caleb and I have ever sent each other. The next post gets into how and why I made poetry with that text. This post just covers how I got the data in the first place.
Collecting the Data
I first tried to export everything using Google Takeout, Google’s cross-product data export tool. I’ve used it successfully in the past, but this time it gave me a file with large swaths of missing messages and bad data. While trying to diagnose that problem, I found another solution: this Google Apps Script for automatically scraping chats out of Gmail and populating a Google Spreadsheet. Google Apps Scripts can only run for so long (and will give you an error if you send too many requests to Google’s servers), so I modified the script to request only a given range of messages, then manually re-ran it several times over the course of a couple of hours.
I set up the script so that the first column of my spreadsheet is the message thread ID number (which counts up as you move back in time, useful for keeping track of which messages were already requested), the second column is the date, the third and fourth columns are sender and recipient, and the last column contains the full text of the message. The end result looked like this:
In the spreadsheet, I manually deleted messages with people other than Caleb, combined the messages sent from my personal and NYU accounts, and exported all the data as a CSV.
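As a minimal sketch of how that five-column layout can be read back in Python: the field names below are hypothetical labels for the columns described above, and the thread ID in the sample row is made up for illustration.

```python
import csv
import io
from collections import namedtuple

# Column order as described above; the field names are illustrative labels.
ChatRow = namedtuple('ChatRow', 'thread_id date sender recipient body')

# A one-row stand-in for the exported file (the thread ID "1" is made up).
sample_csv = io.StringIO(
    '1,2/27/2015,Matthew,Caleb,Any good burrito places in Manhattan?\n'
)

# csv.reader yields each row as a list of strings, one per column.
chat_rows = [ChatRow(*fields) for fields in csv.reader(sample_csv)]
print(chat_rows[0].sender, '->', chat_rows[0].body)
```

Naming the columns once makes it harder to mix up the sender and recipient positions later on.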
Parsing the Data
Unfortunately, six years is a long time for Google to maintain consistent data formatting, particularly while completely overhauling its instant messaging products. The newer messages were formatted cleanly when my script scraped them, with one message per row, like so:
|2/27/2015||Matthew||Caleb||Any good burrito places in Manhattan?|
But scraping older messages gave me something like this:
|8/14/2011||Caleb||Matthew||Caleb: Toaster: black, white, or red?
me: Hey. Do you still need toaster color input?
Caleb: If you could.
me: How about black?
me: Sorry I didn't get back to you earlier. I went kind of sleep for a couple of hours.
…with the body of the “message” saved as:
<span><span style="font-weight:bold">Caleb</span>: Toaster: black, white, or red?</span><br><span><span style="font-weight:bold">me</span>: Hey. Do you still need toaster color input?</span>, and so on.
In a few cases, for whatever reason, times were included in the message body as well:
12:27 PM me: Are you there?
12:28 PM me: Cool. I'm having lunch now.
i'm trying to decide what to do for lunch
12:29 PM Caleb: how has your day been?
12:30 PM me: Pretty good. Finished up web ads, pending revisions.
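Those in-body times only carry an hour and a minute, so they have to be combined with the date recorded for the thread as a whole. A rough sketch of that step, using the standard library's strptime as a stand-in for a more forgiving date parser; the helper name merge_time and the thread date are made up for illustration:

```python
from datetime import datetime


def merge_time(thread_timestamp, time_text):
    """Return thread_timestamp with its hour and minute taken from time_text."""
    # '%I:%M %p' matches 12-hour clock strings such as '12:27 PM'.
    clock = datetime.strptime(time_text.strip(), '%I:%M %p')
    return thread_timestamp.replace(hour=clock.hour, minute=clock.minute)


thread_date = datetime(2015, 1, 5)  # hypothetical thread date
stamped = merge_time(thread_date, '12:27 PM')
print(stamped)
```

Messages without a time of their own, like the "i'm trying to decide" line above, can simply keep the most recent timestamp seen.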
All of this meant that the sender, recipient, and timestamp recorded for a chat thread as a whole were not enough to describe the individual messages inside it. Instead, I had to use Beautiful Soup to parse these HTML-formatted chat threads and extract the relevant data. The result was a list of chat messages, each stored as a list with the format [timestamp, senderName, messageText]. Because all of this HTML parsing took several minutes, I only did it once, and then used the pickle module to store my clean data to a file.
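To make the thread structure concrete, here is a minimal sketch that splits one of the older HTML bodies into (sender, text) pairs. It deliberately uses the standard library's HTMLParser as a dependency-free stand-in for Beautiful Soup, and the class name ThreadSplitter is hypothetical, so treat it as an illustration of the idea and not as the parsing code used for the project.

```python
from html.parser import HTMLParser


class ThreadSplitter(HTMLParser):
    """Collect (sender, text) pairs from an old-style GChat thread body."""

    def __init__(self):
        super().__init__()
        self.messages = []   # finished (sender, text) pairs
        self.depth = 0       # how many <span> tags we are currently inside
        self.sender = None   # carried over when a message has no bold name
        self.text = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag != 'span':
            return
        self.depth -= 1
        if self.depth == 0:
            # The outer span just closed, so one message is complete.
            text = self.text[2:] if self.text.startswith(': ') else self.text
            if text:
                self.messages.append((self.sender, text))
            self.text = ''

    def handle_data(self, data):
        if self.depth == 2:
            self.sender = data   # the inner bold span holds the name
        elif self.depth == 1:
            self.text += data    # the outer span holds ': message text'


sample = (
    '<span><span style="font-weight:bold">Caleb</span>'
    ': Toaster: black, white, or red?</span><br>'
    '<span><span style="font-weight:bold">me</span>'
    ': Hey. Do you still need toaster color input?</span>'
)
splitter = ThreadSplitter()
splitter.feed(sample)
splitter.close()
for sender, text in splitter.messages:
    print(sender, '|', text)
```

Keeping the last sender around mirrors how the chat log itself reads: a line with no name belongs to whoever spoke last.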
After a lot of tinkering, this is what I came up with:
import csv
import pickle

from bs4 import BeautifulSoup
from dateutil.parser import parse


# Given a row of the CSV file (which may or may not contain
# multiple threaded messages), return a list of individual chat messages.
# Columns: 0 = thread ID, 1 = date, 2 = sender, 3 = recipient, 4 = text.
def formatRow(row):
    rows = []

    # The main CSV timestamp will be the default timestamp for all messages
    timestamp = parse(row[1])

    # The main sender will be the default sender, unless others are specified.
    # (Because GChat formats the usernames differently at different points,
    # this simple string search is enough of a check)
    if 'caleb' in row[2].lower():
        sender = 'caleb'
    elif 'matthew' in row[2].lower():
        sender = 'matthew'

    # Now, pull the text in the message or thread
    threadText = row[4]

    # Check for empty "messages"
    if str(threadText) == '':
        return []

    # Beautiful soup doesn't like weak breaks, so remove them
    threadText = threadText.replace('<wbr>', '')

    # Parse the thread text as HTML. This will both remove HTML-encoded
    # special characters and allow us to tear apart old threads that were
    # all saved in the same message. The lxml parser is named explicitly
    # because the loop below relies on the <html><body> wrapper (and the
    # <p> around bare text) that it adds.
    thread = BeautifulSoup(threadText, 'lxml')

    # I don't want to read links in my poetry, so remove them.
    for link in thread.find_all('a'):
        link.extract()

    for msg in thread.html.body.children:
        if msg.name == 'p':
            # If the text is stored in a <p> tag, then it's a simple message.
            rows.append([timestamp, sender, str(msg.contents[0])])
        else:
            # If the text is organized by <div> tags, then it has built-in
            # timestamps. (The child indices here assume the time is the
            # text of the div's first child and the message span sits
            # inside its second child.)
            if msg.name == 'div':
                # Update the base timestamp with this new hour and minute
                timeText = msg.contents[0].contents[0].strip()
                if len(timeText) > 0:
                    newTimestamp = parse(timeText)
                    timestamp = timestamp.replace(hour=newTimestamp.hour,
                                                  minute=newTimestamp.minute)
                msg = msg.contents[1].span

            # If the text is contained in a span tag, then it maybe
            # has the sender's name in bold before the message.
            if msg.name == 'span':
                # Update the message
                if msg.contents[0].name == 'span':
                    if 'caleb' in msg.span.contents[0].lower():
                        sender = 'caleb'
                    elif 'me' in msg.span.contents[0].lower():
                        sender = 'matthew'
                    content = str(msg.contents[1])
                else:
                    content = str(msg.contents[0])

                # Remove the colon if present
                if content[:2] == ': ':
                    content = content[2:]

                # Append a new row
                if len(content) > 0:
                    rows.append([timestamp, sender, content])

    return rows


#---------------------- Main Logic ------------------------

# This is the list of data that we'll want in the future.
rows = []

with open('source/RawChatArchive.csv', newline='') as csvfile:
    messageList = csv.reader(csvfile)
    for row in messageList:
        rows += formatRow(row)

# The pickle module allows us to simply save a Python data structure in a
# file, no manual parsing or formatting necessary.
with open('chatlog.p', 'wb') as outfile:
    pickle.dump(rows, outfile)
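Reading the cleaned data back in later is the mirror image of that last step. A small round-trip sketch, assuming the [timestamp, sender, text] row format described earlier; it writes to a temporary directory (with a made-up file name) so it does not touch a real chatlog.p:

```python
import os
import pickle
import tempfile
from datetime import datetime

# One row in the [timestamp, sender, text] format.
rows = [[datetime(2015, 2, 27), 'matthew',
         'Any good burrito places in Manhattan?']]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'chatlog_example.p')  # hypothetical file name

    # Save the list exactly as it is, datetime objects included.
    with open(path, 'wb') as outfile:
        pickle.dump(rows, outfile)

    # Load it back without re-running any of the HTML parsing.
    with open(path, 'rb') as infile:
        restored = pickle.load(infile)

print(restored[0][1], restored[0][2])
```

Pickle files should only be loaded when they come from a source you trust, which is fine here because the file is one you wrote yourself.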