Matthew Kaney at ITP

(Specifically, Reading and Writing Electronic Text projects)

Final Project: Two Poems

Reading and Writing Electronic Text



Finally, a pair of poems created as part of my final project:

Poem 1 (Read Live at Our Final Performance)

In 2010:

i said: kidding, considering, answering
meaning, meeting, seeing, eating, being

you said: code is compromised, Restaurants used
preview fill, people come, typing is, code is compromised

i said: caring, wearing, bring, breaking, making
missing, seeing, being, feeling, dealing

you said: way do go, Doughnuts is owned, goal being
George emailed, Noah got Call, Express is

i said: computers is, Am wrapping, nobody is expecting
Chris isn’t, Kevin doesn’t want, backend is

you said: hoping, ping, playing, spelling, setting
staying, sitting, styling, trying, string, starting

i said: Here, I’m sending you the photos.
you said: okay, yeah, you want it on

In 2011:

i said: Sheriff issued, Becky fill, Chris saw
T-Rex believes, Chris did send, person was thinking giving

you said: command used, question is, numbers is
Daniel is looking, comments are, plants became

i said: putting, cutting, shutting, editing
meaning, evening, vending, sending, ending

you said: brain was, Pandora do, electricity is
email sent, notifications do, English are

i said: fair CNN, Historic Times
saddest emoticon, soda machine

you said: numbers is, Heitz has been, god had
tail is, parsers are, fact think, problem arrived

i asked: You mean the actual typing?
you said: if you stress too much you’ll die.

In 2012:

i said: specific reason, nice Christmas Boy
feeling bad, first-born child expensive, proprietary software

you said: irritating, awaiting, updating
living, leaving, leaning, learning, turning

i said: Pete look, re going, nobody was
narration is, Adobe fixed, list shared

you said: meeting, muttering, limiting
helping, yelling, telling, staring, starting

i said: disagreeing, distracting, tracking
studying, stuffing, mystifying, missing

you said: implying, importing, morning
compounding, complaining, compiling, coming

i said: Okay, you can come over now.
you asked: have you tried other browsers?

In 2013:

i said: Code signing is, paycheck is, names are
money is, Club has taken, thing seeing

you said: waking, taking, staying, sitting
hammering, hurting, irving, unnerving

i said: other hand, previous day’s stuff
physical right, task management system

you said: grammatical clarity, Happy St. Patrick’s Day
yoko ono, Progress report, lab rat

i said: people seemed, Technicolor had
commercials are, Liz said, music is going

you said: programming, rolling, ring, running
arguing, bragging, banging, banning, burning

i asked: But, what are you afraid happened?
you said: I can help you research stuff.

In 2014:

i said: Gruber doesn’t respond, show is
person checked, turnaround looking, OP is

you said: composing, compiling, camping
fucking, king, liking, looking, linking, thinking

i said: Chat seems, classes entered, people is
images posted, project is, New seems

you said: wedding, reading, rendering, ending
starting, steering, starring, string, trying, rising

i said: accounts has, listeners attached
clinic says, services thank, street including

you said: app was taken, people are, difference is
people arguing, class was, reason including

i asked: When are you going to be done?
you said: can you do me a favor

In 2015:

i said: discussing, licensing, sequencing
exciting, writing, pricing, trying, timing

you said: Christmas is, question asking, Queens doesn’t exist
records started being kept, thing looks, everyone thinks

i said: boring, morning, running, stunning
topping, dropping, operating, playing

you said: thread agree, York is visiting
emphasis added copied, kafkaesque is

i said: separate link, reading anything
Adventures Time, grad student, net neutrality

you said: map indicates, point think, cookies are messed
host are, K. Sounds, Schwab sent, decision has been

i asked: So, how do you want to play this?
you asked: Could you do something like that?

Poem 2

In 2010:

i said: Wednesday evening, version online
different exclusively-gay porn shops

you said: major characters, da da da da da da da
good game, silly thing, friends locations, birthday present box cardboard

i said: writing, trying, string, starring, storing
bugging, bring, boring, storing, scoring, caring

you said: names leads, stuff came, woman tied, error is
output buffering, Apple emailed, Chrome isn’t

i said: daily dose, new episode Douglas
fiery passion, Thanks lot, bip-bop music

you said: adding, landing, branding, boring
buying, bribing, biking, boxing, focusing

i said: I’ll talk to you tonight, promise.
you asked: take one back with you would you?

In 2011:

i said: book based, Kevin sends, Atkinson was
Liz says, Adam wants, designers decided

you said: answer is, tone intended, today is
core set, today is, solution was, TA is talking

i said: different times, sandwich home, straight women
RIcky Gervais, something different, MAME rom

you said: sticking, setting, sitting, staying
aligning, lightning, fighting, firing, flying

i said: old code, research scientist, parents blown-glass vase
Muppet franchise, funny story, blows glass

you said: overdue balance, meals week, Prismatic Pumpkin
new group, multiple star reviews, INTERNET PEOPLE

i said: What do you mean, “what text is this?”
you asked: did you use that phone number?

In 2012:

i said: addressing, pressing, depressing
falling, fling, filling, filing, floating, loading

you said: fucking STRING, right sorry, friend Steve
new tab, word chippy, case scenario

i said: Pete look, Man thought, phone upgrade hanging
Pete included, paper burned, legislature was tricked

you said: thing has happened, heh was, post saw
stuff running, messages coming, person working

i said: statement needs, tie is, someone addressing
way do, proposal want, ones are, rule is

you said: internet security, Marc Heitz
free software people, last night, new petition

i asked: What would you like to do to me?
you said: do you even have to ask

In 2013:

i said: Linux is, Times had, time visited
way ask, Stewart talked, Stewart talked, references are

you said: picking, packing, panicking, banning
morning, warning, boring, bing, buying, banging

i said: technique had, music streaming wasn’t
movies come, wife connotes, person seems, anybody explained

you said: Linux wireless sadness, internet petitions
regular Milky Way, holy shit brandi

i said: release party, communication right
good Kelly, website design service, Kansas City

you said: certain things, love Doug, robot band
new car, school shooting stuff, hundai accent

i asked: You think Emily or Simon?
you asked: Can you wait up till midnight?

In 2014:

i said: great project, pretentious white people
first people, Visual Language assignment

you said: hell got delayed, phone call is, Mattel pulled
class did, days is, account is, couples doing

i said: mailing, bailing, boiling, buying
going, outgoing, knowing, running, opening

you said: size happened, Businesses stayed, class got caught
text file, Apple has, blob sounds, class is changing

i said: trying, turing, during, driving
chatting, cheating, seeing, sing, sawing, swing

you said: boring, porting, printing, inviting
phrasing, freezing, rising, raising, raining

i said: If it’s loose, you can pull it off.
you asked: What are you going to do?

In 2015:

i said: separate link, fine promise, dubious circumstances max
Union Square, physical membership card

you said: Reilly admits, section is, cards get
something making, robot doing, post contains

i said: grocery store, Booty Don’t Lie
Ice King, right same, music industry, Steve Jobs

you said: eating, sting, sitting, splitting, selling
summarizing, amazing, mailing, making

i said: melody were, lady sitting
Romney is, White has, Club doesn’t, Pizza was named

you said: New York, regular food list, new year
fine right, happy hour today, gain media attention

i asked: How long do you think it will take?
you asked: do you have a preference?

Final Project: Code and Description


Generative text easily lends itself to conceptual, often humorous writing, a tone I frequently adopt in my work. So for my final project, I wanted to push myself towards a more personal approach to electronic text. As a source, I looked to the Gmail chat messages sent between my husband and me. We’ve sent roughly 45,000 messages back and forth, covering the six-year span of our relationship (except for several months when we used the now-defunct Google Wave). Chatting has always been an important, and sometimes primary, form of communication for us, and these chats form a rich sample. Loving, angry, depressed, intimate, and often painfully banal, they archive the daily facts of what happened and how we felt in our respective lives.

In a previous post, I discuss how I collected the data. Once the data was collected and parsed, the writing could begin. I started with a series of prototype programs, each trying to randomly extract a different type of information from the messages. From these loose experiments, I developed my favorites, eventually turning them into self-contained methods.

Grabbing Noun Phrases

One thing I immediately noticed was how much of our communication was basically “Are you there? What’s up?” types of messages. Reading 200 such messages back to back could be compelling poetry, but it’s not very narrative. I really wanted to tell the story of the things that happened to us, so noun phrases seemed like a good place to start. I used TextBlob to do part-of-speech tagging, though instead of using its built-in noun phrase extraction, I wrote my own, because it allowed me a bit more control over which phrases I looked for (and I was happier with the results I got).

from textblob import TextBlob

#------------ From a set of rows, get all noun phrases -----
# (Very simplified; "noun phrase" here means x * adj + y * noun)
def getNounPhrases(chatList):
    phrases = []
    
    for row in chatList:
        newBlob = TextBlob(row[2])
        
        phrase = []
        hasNoun = False
        
        for tag in getTags(newBlob):
            if len(tag[0]) > 1:
                if tag[1] == 'JJ':
                    phrase.append(tag[0])
                elif tag[1] in ('NN', 'NNS', 'NNP'):
                    hasNoun = True
                    phrase.append(tag[0])
                else:
                    if hasNoun and len(phrase) > 1:
                        phrases.append(' '.join(phrase))
                    hasNoun = False
                    phrase = []
        
        if hasNoun and len(phrase) > 1:
            phrases.append(' '.join(phrase))
    
    return phrases

Running this code on things I said in 2010 yields “good inspiration, empty box, other box, half full, organic mac cheeses”, among others.
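The adjective/noun scan can be seen in isolation on a hand-tagged example. This is just a sketch of the rule the comment describes, with the tags written out by hand rather than coming from TextBlob:

```python
# Hand-tagged tokens, standing in for TextBlob's tagger output
tags = [("the", "DT"), ("empty", "JJ"), ("box", "NN"),
        ("was", "VBD"), ("organic", "JJ"), ("mac", "NN"),
        ("cheeses", "NNS")]

phrases, phrase, hasNoun = [], [], False

# Collect runs of adjectives + nouns; any other part of speech
# flushes the phrase collected so far
for word, tag in tags:
    if tag == "JJ":
        phrase.append(word)
    elif tag in ("NN", "NNS", "NNP"):
        hasNoun = True
        phrase.append(word)
    else:
        if hasNoun and len(phrase) > 1:
            phrases.append(" ".join(phrase))
        hasNoun, phrase = False, []

# End-of-message flush, as in getNounPhrases
if hasNoun and len(phrase) > 1:
    phrases.append(" ".join(phrase))

print(phrases)  # ['empty box', 'organic mac cheeses']
```

Single-word nouns (“the box”) never make it in, since a phrase needs at least two words.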

Getting Simple Sentences

From there, I realized that I could similarly search for tiny sentences, in the most basic “sentence = noun + verb” elementary grammar sense. These have even more narrative thrust than a mere list of noun phrases.

#-------------- Get Subject + Verb ---------------------------
def getTinySentences(chatList):
    phrases = []
    
    for row in chatList:
        newBlob = TextBlob(row[2])
        
        for sentence in newBlob.sentences:
            phrase = []
            
            for tag in getTags(sentence):
                # Ignore single-character "words"
                if len(tag[0]) > 1:
                    if tag[1][0:2] == 'VB':
                        # If we get a verb, check whether we already
                        # have a noun. If so, add the verb to our
                        # phrase.
                        if len(phrase) > 0:
                            phrase.append(tag)
                    elif tag[1] in ('NN', 'NNS', 'NNP', 'PRP'):
                        # Otherwise, if we have a noun or pronoun
                        # (PRP is the Penn Treebank pronoun tag),
                        # start assembling the phrase
                        phrase = [tag]
                    else:
                        # Any other word ends the phrase. If we've
                        # already collected a noun and a verb,
                        # append the phrase to the output.
                        if len(phrase) > 1:
                            phrases.append(' '.join(word[0] for word in phrase))
                        
                        phrase = []
    return phrases

For example, “circle indicates, girls watch, noise got” and so on.

Getting “—ing” Words

From there, I looked at getting all gerunds (or rather, all words ending in “-ing”, except variants of “thing”, which were obnoxious showing up in a list of verbs). The code here is pretty simple:

#-------------------- Get a list of "gerunds" from a chat list -----------
def getGerunds(chatList):
    gerunds = set()

    for row in chatList:
        blob = TextBlob(row[2])
        
        for word in blob.words:
            if word.lower()[-3:] == 'ing' and word.lower()[-5:] != 'thing':
                gerunds.add(word.lower())
    
    return gerunds

This yields “saying, walking, breaking, pointing” etc. Nice, but it got much better when I sorted the words by phonetic similarity, as I did in a much older project. I built a second function to sort the output of the first (while disallowing duplicates, hence the set in the gerund function).

# Given a word and a list of words/phrases, sort the list of words/phrases
# based on how similarly they sound to the given word. Not rhyming per se,
# but a nice combo of rhyme, alliteration, assonance, and so on.
def rankProximity(comparison, wordList):
    comparison = pronouncer.cleanWord(comparison)
    outputList = []
    
    for word in wordList:
        cleanWord = pronouncer.cleanWord(word.split()[0])
        
        if cleanWord in pronouncer.phoneticDict and comparison != cleanWord:
            matcher = difflib.SequenceMatcher(None,
                                pronouncer.phoneticDict[comparison][0],
                                pronouncer.phoneticDict[cleanWord][0])
            similarity = matcher.ratio()
            
            outputList.append([word, similarity])
    
    outputList.sort(key=lambda x: x[1], reverse=True)
    
    return outputList

Sorting a given list (again, me in 2010) by proximity to the word “making” gives us: mocking and taking (at an 80% match), breaking (73% match), moneymaking (71% match), and so on.
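Under the hood, that ranking is just difflib’s SequenceMatcher ratio over two phoneme lists. A standalone sketch, with the phoneme lists written out by hand rather than pulled from the CMU dictionary:

```python
import difflib

# Hand-written phoneme lists, standing in for CMU dictionary entries
making = ["M", "EY", "K", "IH", "NG"]
taking = ["T", "EY", "K", "IH", "NG"]
morning = ["M", "AO", "R", "N", "IH", "NG"]

# ratio() returns 2*M/T, where M is the number of matching elements
# and T is the combined length of both sequences
print(difflib.SequenceMatcher(None, making, taking).ratio())   # 0.8
print(difflib.SequenceMatcher(None, making, morning).ratio())
```

The 0.8 here is the same “80% match” for “taking” mentioned above; “morning” shares fewer phonemes with “making” and scores lower.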

General Search and a Couple Utilities

Also, I just wanted to be able to search for specific words, and this general-purpose code fit the bill:

def chatGrep(search, chatList, maxLength=10):
    output = []
    
    for row in chatList:
        blob = TextBlob(row[2])
        
        for sentence in blob.sentences:
            if search.lower() in sentence.lower() and len(sentence.words) < maxLength:
                output.append(sentence)
    
    return output

Finally, you might notice that I use getWords() and getTags() methods several times above. Basically, TextBlob parses apostrophes and contractions out as separate tokens, which I’m sure is sometimes useful, but here it’s more trouble than it’s worth. So, I wrote a couple of methods that reassemble contractions while still giving us the useful natural language processing that TextBlob provides.

def getWords(sentence):
    phrase = sentence.words
    
    out = []
    
    suffixes = ["n't", "'s", "'m", "'re", "'ve", "'d", "'ll"]
    
    for word in phrase:
        if word.lower() in suffixes:
            out[-1] = out[-1] + word
        else:
            out.append(word)
            
    return out

def getTags(sentence):
    phrase = sentence.tags
    
    out = []
    
    suffixes = ["t", "s", "m", "re", "ve", "d", "ll"]
    suffix = False
    
    for word in phrase:
        if len(out) > 0 and (word[0].lower() == "'" or word[0].lower() == "n"):
            out[-1][0] = out[-1][0] + word[0]
            suffix = True
        elif suffix and word[0].lower() in suffixes:
            out[-1][0] = out[-1][0] + word[0]
        else:
            out.append([word[0], word[1]])
            suffix = False
            
    return out

Pulling It All Together

Of course, the important thing about these chats was that I knew who said them, and when. In my final poem, I wanted to sort things chronologically by year and distinguish the things that Caleb and I said, so I wrote a simple function that returns a subset of rows, which I could then pass into any of the above functions.

#------------ Basic filtered rows -------------------------
def filteredRows(person, year):
    subset = []
    
    for row in rows:
        if row[1] == person and row[0].year == year:
            subset.append(row)
    
    return subset

With all these functions in place, I wrote a script for assembling their output, first into lines and then into a poem:

# From a set of gerunds, make a list of a certain number of syllables, where
# no word is repeated and each word sounds similar to the previous one.
def getGerundList(gerunds, maxSyllableLength):
    firstWord = random.choice(list(gerunds))
    
    while pronouncer.cleanWord(firstWord) not in pronouncer.phoneticDict:
        firstWord = random.choice(list(gerunds))
    
    gerundList = [firstWord]
    syllableLength = len(pronouncer.getMeter(firstWord))
    
    while syllableLength < maxSyllableLength:
        gerunds.remove(gerundList[-1])
        sortedList = rankProximity(gerundList[-1], gerunds)
        
        gerundList.append(random.choice(sortedList[:3])[0])
        syllableLength += len(pronouncer.getMeter(gerundList[-1]))
    
    return gerundList

# From a set of phrases, string together a list of phrases of the desired
# syllable length.
def getPhraseList(phrases, maxSyllableLength):
    firstPhrase = random.choice(phrases)
    
    while not pronouncer.check(firstPhrase):
        firstPhrase = random.choice(phrases)
    
    phraseList = [firstPhrase]
    syllableLength = len(pronouncer.getMeter(firstPhrase))
    
    tries = 0
    
    while syllableLength < maxSyllableLength and tries < 300:
        newPhrase = random.choice(phrases)
        
        if pronouncer.check(newPhrase):
            phraseList.append(newPhrase)
            syllableLength += len(pronouncer.getMeter(newPhrase))
            tries = 0
        
        tries += 1
    
    return phraseList

#-------------------------------------------------------------
# Generate one stanza for each year (2010-2015)
for year in range(2010, 2016):
    myRows = filteredRows('matthew', year)
    hisRows = filteredRows('caleb', year)
    
    print 'In ' + str(year) + ':\n'
    
    # Each stanza has six pairs of lines
    for i in range(0, 6):
        # The speaker alternates each line
        if i % 2 == 0:
            currRows = myRows
            currPerson = 'i'
        else:
            currRows = hisRows
            currPerson = 'you'
        
        # The speaker can say one of three types of things, randomly chosen
        path = random.randint(0, 2)
        
        if path == 0:
            gerunds = getGerunds(currRows)
            gerundList = getGerundList(gerunds, 8)
            print currPerson + ' said: ' + ', '.join(gerundList)
            gerundList = getGerundList(gerunds, 10)
            print ', '.join(gerundList)
        elif path == 1:
            tinySentences = getTinySentences(currRows)
            tinySentenceList = getPhraseList(tinySentences, 8)
            print currPerson + ' said: ' + ', '.join(tinySentenceList)
            tinySentenceList = getPhraseList(tinySentences, 10)
            print ', '.join(tinySentenceList)
        elif path == 2:
            nounPhrases = getNounPhrases(currRows)
            nounPhraseList = getPhraseList(nounPhrases, 8)
            print currPerson + ' said: ' + ', '.join(nounPhraseList)
            nounPhraseList = getPhraseList(nounPhrases, 10)
            print ', '.join(nounPhraseList)
        
        print ''
    
    #---- End each year/stanza with a pair of sentences where ----
    #--------------- each of us addresses the other --------------
    mySentences = pronouncer.findWordsWithLength(8, chatGrep('you ', myRows))
    mySentence = random.choice(mySentences)
    
    if mySentence[-1] == '?':
        verb = 'asked'
    else:
        verb = 'said'
    
    print 'i ' + verb + ': ' + str(mySentence)
    
    hisSentences = pronouncer.findWordsWithLength(7, chatGrep('you ', hisRows))
    hisSentence = random.choice(hisSentences)
    
    if hisSentence[-1] == '?':
        verb = 'asked'
    else:
        verb = 'said'
    
    print 'you ' + verb + ': ' + str(hisSentence) + '\n'

Bonus: Better Pronouncing Module Functionality

I used my CMU Pronouncing Dictionary functions on this project to get lines with certain numbers of syllables and to compare the similarity of two words. Unfortunately, this only works for words in the pronouncing dictionary, which was limiting, especially considering that I write linguistic monstrosities like “bit-shifting/packing/masking”. So, I updated my pronouncing library to break such compound words apart and then evaluate them based on the sum of the pronunciations of each part. In the interest of full disclosure, that code is here too:

import string

# Array of all vowel sounds in the CMU pronouncing dictionary
vowelSounds = ["AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
               "IH", "IY", "OW", "OY", "UH", "UW"]

# Now, import and populate the CMU pronouncing dictionary
cmu = open('cmudict', 'r')

phoneticDict = {}

for line in cmu:
    tokens = line.split()
    
    # Only grab the first pronunciation in the dictionary, for now
    if tokens[1] == "1":
        key = tokens[0]
        phonemes = []
        stresses = []

        # Some phonemes have a number indicating stress (e.g. "EH1").
        for phoneme in tokens[2:]:
            # If this is one of those, split the phoneme from the stress
            if not phoneme.isalpha():
                stresses.append(int(phoneme[-1]))
                phoneme = phoneme[:-1]

            phonemes.append(phoneme)

        phoneticDict[key] = (phonemes, stresses)

cmu.close()

# Convert tokens (with attached whitespace, punctuation, and irregular
# capitalization) to a clean, all-caps form
def cleanWord(word):
    newWord = word.upper()
    newWord = newWord.strip(string.punctuation)
    return newWord

# Check whether two words are both a) in the pronouncing dictionary and
# b) rhyming. Returns True or False.
def rhymingWords(firstWord, secondWord):
    firstWord = cleanWord(breakUpWords(firstWord)[-1])
    secondWord = cleanWord(breakUpWords(secondWord)[-1])
    
    if firstWord == secondWord:
        return False
    
    if firstWord in phoneticDict:
        searchSyllables, searchStresses = phoneticDict[firstWord]
    else:
        return False
    
    if secondWord in phoneticDict:
        wordSyllables, wordStresses = phoneticDict[secondWord]
    else:
        return False
    
    lastPhoneme = ''
    stressCounter = 1
    lastStress = 0

    for i in range(1, 1 + min(len(searchSyllables), len(wordSyllables))):
        if (searchSyllables[-i] == wordSyllables[-i] and
           stressCounter <= len(wordStresses) and
           stressCounter <= len(searchStresses)):
            lastPhoneme = searchSyllables[-i]

            if lastPhoneme in vowelSounds:
                lastStress = (wordStresses[-1 * stressCounter] and
                             searchStresses[-1 * stressCounter])

                stressCounter += 1
        else:
            break

    if (lastPhoneme in vowelSounds) and (lastStress > 0):
        return True
    else:
        return False

# For a given string, split the string into individual words and then return
# the meter of the entire string of words, in terms of stresses where
# 0 = no stress, 1 = primary stress, and 2 = secondary stress.
# For example, [1, 0, 1, 1, 0, 2]
def getMeter(line):
    words = breakUpWords(line)
    meter = []
    
    for word in words:
        currWord = cleanWord(word)
        
        if currWord in phoneticDict:
            meter += phoneticDict[currWord][1]
        else:
            return False
    
    return meter

# Get a list of words that rhyme with a certain word. The first parameter
# is the word that the results should rhyme with. The optional second
# parameter is the list of words for searching in. If no list is specified,
# the function will search the entire pronouncing dictionary.
def findWordsWithRhyme(rhyme, searchList=phoneticDict.keys()):
    result = []
    
    for word in searchList:
        if rhymingWords(rhyme, word):
            result.append(word)
    
    return result

# Return a list of words that have the given pattern of stresses. The first
# parameter is a list of stresses (0, 1, or 2). The optional second parameter
# is the list of words for searching in. If no list is specified, the
# function will search the entire pronouncing dictionary.
def findWordsWithMeter(meter, searchList=phoneticDict.keys()):
    result = []
    
    for word in searchList:
        searchMeter = getMeter(word)
        
        if searchMeter == meter:
            result.append(word)
    
    return result

# Return a list of words with a certain syllable length. The first parameter
# is a number of syllables, and the optional second parameter is a list of
# legal words or phrases. If no list is specified, the function will
# search the entire pronouncing dictionary.
def findWordsWithLength(syllableLength, searchList=phoneticDict.keys()):
    result = []
    
    for word in searchList:
        searchMeter = getMeter(word)
        
        if searchMeter and len(searchMeter) == syllableLength:
            result.append(word)
    
    return result

# Convert a phrase (multiple words, as well as compound words formed with
# hyphens or slashes) into a list of single words.
def breakUpWords(phrase):
    if cleanWord(phrase) in phoneticDict:
        return [phrase]

    input1 = phrase.split()
    output = []

    for word1 in input1:
        input2 = word1.split('-')

        for word2 in input2:
            input3 = word2.split('/')

            for word3 in input3:
                output.append(word3)
    return output

# Check whether a phrase consists of a legal (that is, in the pronouncing
# dictionary) word, or a series of legal words (either in a phrase, or in
# compound words). Returns True or False.
def check(phrase):
    words = breakUpWords(phrase)

    for word in words:
        if cleanWord(word) not in phoneticDict:
            return False

    return True

Final Project Prologue: Harvesting Data


For my final project, I needed all of the Google chats my husband Caleb and I have ever sent each other. The next post gets into how and why I made poetry with that text. This post just covers how I got the data in the first place.

Collecting the Data

I first tried to export everything using Google Takeout, Google’s cross-product data export tool. I’ve used it successfully in the past, but for whatever reason it ran into trouble this time, giving me a file with large swaths of missing messages and bad data. While trying to diagnose that problem, I found another solution: this Google Apps Script for automatically scraping chats out of Gmail and populating a Google Spreadsheet. Google Apps Scripts can only run for so long (and will give you an error if you send too many requests to Google’s servers), so I modified the script to request only a range of messages at a time, manually re-running it several times over the course of a couple of hours.

I set up the script so that the first column of my spreadsheet is the message thread ID number (counting up chronologically as you move back in time—useful for keeping track of which messages were already requested), the second column is the date, the third and fourth columns are sender and recipient, and the last column contains the full text of the message. The end result looked like this:

Chat Spreadsheet

With my spreadsheet, I manually deleted messages with people other than Caleb, combined the messages sent from both my personal and NYU accounts, and exported all the data as a CSV.

Parsing the Data

Unfortunately, six years is a long time for Google to maintain consistent data formatting, particularly while completely overhauling their instant messaging products. This means that the new messages were formatted correctly when my script scraped them, like so:

2/27/2015 Matthew Caleb Any good burrito places in Manhattan?

But scraping older messages gave me something like this:

8/14/2011 Caleb Matthew Caleb: Toaster: black, white, or red?
me: Hey. Do you still need toaster color input?
Caleb: Yes
Caleb: If you could.
me: How about black?
Caleb: Sure
me: Sorry I didn't get back to you earlier. I went kind of sleep for a couple of hours.

…with the body of the “message” saved as: <span><span style="font-weight:bold">Caleb</span>: Toaster: black, white, or red?</span><br><span><span style="font-weight:bold">me</span>: Hey. Do you still need toaster color input?</span>, and so on.
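For instance, here’s roughly what pulling the sender and message text out of one of those spans looks like with Beautiful Soup. This is a simplified sketch of the real parsing code, which appears below, and assumes bs4 is installed:

```python
from bs4 import BeautifulSoup

# One message from an old-style thread, as in the example above
html = ('<span><span style="font-weight:bold">Caleb</span>'
        ': Toaster: black, white, or red?</span>')

msg = BeautifulSoup(html, 'html.parser').span
sender = msg.span.get_text()           # the bold inner span holds the name
text = msg.contents[1].lstrip(': ')    # the rest is the message body

print(sender + ': ' + text)
```

The real messages required more bookkeeping (default senders, embedded timestamps), but this is the core extraction.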

In a few cases, for whatever reason, times were included in the message body as well:

7/26/2010 Caleb Matthew
12:27 PM me: Are you there?
 Caleb: yes
12:28 PM me: Cool. I'm having lunch now.
 Caleb: cool
  i'm trying to decide what to do for lunch
 me: Ah.
12:29 PM Caleb: how has your day been?
12:30 PM me: Pretty good. Finished up web ads, pending revisions.
 Caleb: cool

Obviously, this meant that the sender, recipient, and timestamp information associated with the chat thread was not useful. Instead, I had to use Beautiful Soup to parse these HTML-formatted chat threads and extract the relevant data. The result was a list of chat messages, each stored as a list with the format [timestamp, senderName, messageText]. Because all of this HTML parsing took several minutes, I only did it once, and then I used the pickle module to store my clean data to a file.
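The pickle step is just a one-time dump of the parsed rows, which later scripts can load back in seconds. A minimal sketch, with a hypothetical file name and a made-up row:

```python
import pickle
from datetime import datetime

# Hypothetical parsed rows: timestamp, sender name, message text
rows = [[datetime(2015, 2, 27, 12, 27), 'matthew',
         'Any good burrito places in Manhattan?']]

# Save the cleaned data once...
with open('chats.pickle', 'wb') as f:
    pickle.dump(rows, f)

# ...and later runs can skip the slow HTML parsing entirely
with open('chats.pickle', 'rb') as f:
    loaded = pickle.load(f)

print(loaded == rows)  # True
```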

After a lot of tinkering, this is what I came up with:

import csv
from dateutil.parser import *
from bs4 import BeautifulSoup
import pickle

# Given a row of the CSV file (which may or may not contain
# multiple threaded messages), return a list individual chat messages.
def formatRow(input):
    rows = []
    
    # The main CSV timestamp will be the default timestamp for all messages
    timestamp = parse(input[0])
    
    # The main sender will be the default sender, unless others are specified.
    # (Because GChat formats the usernames differently at different points,
    # this simple string search is enough of a check)
    if 'caleb' in input[1].lower():
        sender = 'caleb'
    elif 'matthew' in input[1].lower():
        sender = 'matthew'
    
    # Now, pull the text in the message or thread
    threadText = input[3]
    
    # Check for empty "messages"
    if str(threadText) == '':
        return []
    
    # Beautiful soup doesn't like weak breaks, so remove them
    threadText = threadText.replace('<wbr>', '')
    
    # Parse the thread text as HTML. This will both remove HTML-encoded
    # special characters and allow us to tear apart old threads that were
    # all saved in the same message.
    thread = BeautifulSoup(threadText)
    
    # I don't want to read links in my poetry, so remove them.
    for link in thread.findAll('a'):
        link.extract()
    
    for msg in thread.html.body.children:
        if msg.name == 'p':
            # If the text is stored in a <p> tag, then it's a simple message.
            rows.append([timestamp, sender, unicode(msg.contents[0])])
        else:
            # If the text is organized by <div> tags, then it has built-in
            # timestamps.
            if msg.name == 'div':
                # Update the base timestamp with this new hour and minute
                if len(msg.contents[0].contents[0].strip()) > 0:
                    newTimestamp = parse(msg.contents[0].contents[0].strip())
                    timestamp = timestamp.replace(hour=newTimestamp.hour,
                                                  minute=newTimestamp.minute)
                msg = msg.contents[1].span
            
            # If the text is contained in a span tag, then it may have
            # the sender's name in bold before the message.
            if msg.name == 'span':
                # Update the message
                if msg.contents[0].name == 'span':
                    if 'caleb' in msg.span.contents[0].lower():
                        sender = 'caleb'
                    elif 'me' in msg.span.contents[0].lower():
                        sender = 'matthew'
                    content = unicode(msg.contents[1])
                else:
                    content = unicode(msg.contents[0])
                
                # Remove the colon if present
                if content[:2] == ': ':
                    content = content[2:]
                
                # Append a new row
                if len(content) > 0:
                    rows.append([timestamp, sender, content])
    return rows

#---------------------- Main Logic ------------------------

# This is the list of data that we'll want in the future.
rows = []

with open('source/RawChatArchive.csv', 'rb') as csvfile:
    messageList = csv.reader(csvfile)
    for row in messageList:
        formatted = formatRow(row)
        rows += formatted        

# The pickle module allows us to simply save a Python data structure in a
# file, no manual parsing or formatting necessary.
pickle.dump(rows, open("chatlog.p", "wb"))
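Reading the pickled data back in a later script is a one-liner. A minimal round-trip sketch (the sample rows and the chatlog_demo.p filename here are made up for illustration; the real script reads chatlog.p as written above):

```python
import pickle

# Stand-in for the real [timestamp, sender, messageText] rows
sample_rows = [["2010-01-01 12:30", "matthew", "Pretty good."],
               ["2010-01-01 12:31", "caleb", "cool"]]

# Dump the structure to disk, then load it back untouched
with open("chatlog_demo.p", "wb") as f:
    pickle.dump(sample_rows, f)

with open("chatlog_demo.p", "rb") as f:
    rows = pickle.load(f)
```

No manual parsing or formatting is needed on either end, which is exactly why Pickle is convenient for caching the slow HTML-parsing step.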

Assignment 4: Pronouncing Module

Reading and Writing Electronic Text


For this assignment, I focused on extending the functionality of my CMU pronouncing dictionary work from earlier in the semester, and wrapping up everything in a module I called “pronouncer.py”.

Rhyme

My module can be used to test whether two words rhyme. Note that both words must be in the CMU pronouncing dictionary; if either is missing, the function returns False.

>>> pronouncer.rhymingWords('apple', 'orange')
False

You can also search the pronouncing dictionary for rhymes:

>>> pronouncer.findWordsWithRhyme('persuasive')
['INVASIVE', 'PERVASIVE', 'SUPERABRASIVE', 'ABRASIVE', 'EVASIVE']

Or, if you want to work within an existing corpus, you can specify a list of words to search within:

>>> pronouncer.findWordsWithRhyme('lemon', ['demon', 'yemen', 'salmon'])
['yemen']

Meter

Based on the stress information in the pronouncing dictionary, the module can do incredibly simple scansion-type analysis (expressed as a list of 1s, 2s, and 0s, for primary stress, secondary stress, and no stress, respectively). Currently, the system makes no attempt to guess meter for words it doesn’t know, so this isn’t always that helpful.

>>> pronouncer.getMeter('To sit in solemn silence')
[1, 1, 0, 1, 0, 1, 0]

More helpful is the ability to search for words that have a given stress. Like the rhyme functions, this can either search the entire pronouncing dictionary, or a provided word list.

>>> pronouncer.findWordsWithMeter([1, 0, 1, 0, 0])
['VIVISEPULTURE', 'NONEXECUTIVE', 'SUNAMERICA', 'GENEMEDICINE', 'PUBLIC-SPIRITED', 'SELF-DELIVERANCE', 'PORT-VICTORIA', 'OVERSHADOWING', 'SELF-SUFFICIENCY', 'PHOTOFINISHING', 'MOCK-HEROICALLY']

Module Code

import string

# Array of all vowel sounds in the CMU pronouncing dictionary
vowelSounds = ["AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
               "IH", "IY", "OW", "OY", "UH", "UW"]

# Now, import and populate the CMU pronouncing dictionary
cmu = open('cmudict', 'r')

phoneticDict = {}

for line in cmu:
    tokens = line.split()
    
    # Only grab the first pronunciation in the dictionary, for now
    if tokens[1] == "1":
        key = tokens[0]
        phonemes = []
        stresses = []

        # Some phonemes have a number indicating stress (e.g. "EH1").
        for phoneme in tokens[2:]:
            # If this is one of those, split the phoneme from the stress
            if not phoneme.isalpha():
                stresses.append(int(phoneme[-1]))
                phoneme = phoneme[:-1]

            phonemes.append(phoneme)

        phoneticDict[key] = (phonemes, stresses)

cmu.close()


# Convert tokens (with attached whitespace, punctuation, and irregular
# capitalization) to a clean, all-caps form
def cleanWord(word):
    newWord = word.upper()
    newWord = newWord.strip(string.punctuation)
    return newWord

# Check whether two words are both a) in the pronouncing dictionary and
# b) rhyming. Returns True or False.
def rhymingWords(firstWord, secondWord):
    if firstWord == secondWord:
        return False
    
    firstWord = cleanWord(firstWord)
    secondWord = cleanWord(secondWord)
    
    if firstWord in phoneticDict:
        searchSyllables, searchStresses = phoneticDict[firstWord]
    else:
        return False
    
    if secondWord in phoneticDict:
        wordSyllables, wordStresses = phoneticDict[secondWord]
    else:
        return False
    
    lastPhoneme = ''
    stressCounter = 1
    lastStress = 0

    for i in range(1, 1 + min(len(searchSyllables), len(wordSyllables))):
        if (searchSyllables[-i] == wordSyllables[-i] and
           stressCounter <= len(wordStresses) and
           stressCounter <= len(searchStresses)):
            lastPhoneme = searchSyllables[-i]

            if lastPhoneme in vowelSounds:
                lastStress = (wordStresses[-1 * stressCounter] and
                             searchStresses[-1 * stressCounter])

                stressCounter += 1
        else:
            break

    if (lastPhoneme in vowelSounds) and (lastStress > 0):
        return True
    else:
        return False

# For a given string, split the string into individual words and then return
# the meter of the entire string of words, in terms of stresses where
# 0 = no stress, 1 = primary stress, and 2 = secondary stress.
# For example, [1, 0, 1, 1, 0, 2]
def getMeter(line):
    words = line.strip().split()
    meter = []
    
    for word in words:
        currWord = cleanWord(word)
        
        if currWord in phoneticDict:
            meter += phoneticDict[currWord][1]
    
    return meter

# Get a list of words that rhyme with a certain word. The first parameter
# is the word that the results should rhyme with. The optional second
# parameter is the list of words for searching in. If no list is specified,
# the function will search the entire pronouncing dictionary.
def findWordsWithRhyme(rhyme, searchList=phoneticDict.keys()):
    result = []
    
    for word in searchList:
        if rhymingWords(rhyme, word):
            result.append(word)
    
    return result

# Return a list of words that have the given pattern of stresses. The first
# parameter is a list of stresses (0, 1, or 2). The optional second parameter
# is the list of words for searching in. If no list is specified, the
# function will search the entire pronouncing dictionary.
def findWordsWithMeter(meter, searchList=phoneticDict.keys()):
    result = []
    
    for word in searchList:
        cleaned = cleanWord(word)
        
        if cleaned in phoneticDict and phoneticDict[cleaned][1] == meter:
            result.append(word)
    
    return result

Midterm: The/Is/Is/The

Reading and Writing Electronic Text


For my midterm, I decided to focus more on my work with the CMU pronouncing dictionary, in order to create better meter and rhyme. I was still trying to figure out ways to extract interesting words from a source text. Eventually, this will have to involve some natural language parsing, I’m sure, but for now, I discovered that I could simply search for phrases that take the form “the [something]” or “is [something]”.
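The phrase extraction itself can be sketched with a simple regular expression (a toy illustration with a made-up sentence; the actual script below walks the token list instead of using regexes):

```python
import re

text = "The result is unequal, and the function is all."

# Grab the single word immediately following "the" or "is"
the_words = re.findall(r'\bthe\s+(\w+)', text, flags=re.IGNORECASE)
is_words = re.findall(r'\bis\s+(\w+)', text, flags=re.IGNORECASE)
```

Running this yields `['result', 'function']` for the "the" phrases and `['unequal', 'all']` for the "is" phrases, which become the raw material for the two kinds of lines.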

The Form

My poetic form is “The/Is/Is/The”, for lack of a better name. It consists of four eight-line stanzas, where each line is a two-word phrase beginning with either “the” or “is”: in the first and third stanzas, each “the” line is followed by an “is” line, and in the second and fourth stanzas, each “is” line is followed by a “the” line.

I was interested in the way that the structure “is [something]” can be used as either an absolute statement or a question. This form is designed to distill a document (presumably non-fiction, in the present tense) to some of its most basic assertions. By alternating “is” lines and “the” lines, I think the poem creates some ambiguity between these two readings.

For the meter, I started with each line having three syllables, with the stress on the second, such as “the pathway is winding” (two amphibrachs, x / x | x / x). When that became too repetitive, I added two more patterns that fit within the same scheme: “the wall is unbroken” (x / | x x / x) and “the result is confusing” (x x / | x x / x). Additionally, the last line of each stanza is two syllables, and the last lines of a pair of stanzas rhyme.
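In terms of the stress lists the code works with, the three line templates can be written out directly (a sketch, with the assumption that “the” and “is” are always unstressed):

```python
# Each template pairs a subject stress pattern with a predicate pattern
templates = [([1, 0], [1, 0]),     # "the pathway is winding"  (x / x | x / x)
             ([1],    [0, 1, 0]),  # "the wall is unbroken"    (x / | x x / x)
             ([0, 1], [0, 1, 0])]  # "the result is confusing" (x x / | x x / x)

# A full line is "the" + subject + "is" + predicate,
# with 0 (no stress) for the two function words
lines = [[0] + subj + [0] + pred for subj, pred in templates]
```

The first two templates produce six-syllable lines and the third produces seven, which is close enough to keep the rhythm recognizable while breaking up the repetition.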

Poems

Here’s a poem generated from the U.S. Senate Report on the CIA’s Detention and Interrogation Program:

the arrest
is entitled
the techniques
is familiar
the u.s
is surrounded
the purpose
is for

is detailed
the geneva
is because
the conditions
is going
the capture
is also
the lore

the first
is important
the u.s
is obstructing
the war
is created
the senate
is told

is pointed
the rectal
is no
the destruction
is going
the abu
is also
the cold

This one’s from The C Programming Language by Brian Kernighan and Dennis Ritchie:

the space
is exhausted
the result
is encountered
the return
is provided
the body
is said

is useful
the function
is entered
the standard
is stated
the object
is present
the head

the request
is divided
the result
is unequal
the result
is illegal
the function
is all

is likely
the prefix
is more
the expression
is noted
the proper
is entered
the call

And finally, a poem generated from the iTunes Terms and Conditions:

the device
is unlawful
the update
is protected
the gift
is provided
the licensed
is for

is removed
the transaction
is subject
the entire
is subject
the external
is strongly
the store

the app
is provided
the app
is acquired
the extent
is provided
the licensed
is your

is defined
the external
is granted
the licensed
is subject
the commercial
is granted
the store

The Code

import sys
import string
import random

# Array of all vowel sounds in the CMU speaking dictionary
vowelSounds = ["AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY", "IH", "IY", "OW", "OY", "UH", "UW"]

# Check whether two words are both a) in the pronouncing dictionary and b) rhyming.
# It's assumed that you'll call cleanWord on each word before passing it into this function.
def rhymingWords(firstWord, secondWord):
    if firstWord == secondWord:
        return False
    
    if firstWord in phoneticDictionary:
        searchSyllables, searchStresses = phoneticDictionary[firstWord]
    else:
        return False
    
    if secondWord in phoneticDictionary:
        wordSyllables, wordStresses = phoneticDictionary[secondWord]
    else:
        return False
    
    lastPhoneme = ''
    stressCounter = 1
    lastStress = 0

    for i in range(1, 1 + min(len(searchSyllables), len(wordSyllables))):
        if (searchSyllables[-i] == wordSyllables[-i] and
           stressCounter <= len(wordStresses) and
           stressCounter <= len(searchStresses)):
            lastPhoneme = searchSyllables[-i]

            if lastPhoneme in vowelSounds:
                lastStress = (wordStresses[-1 * stressCounter] and
                             searchStresses[-1 * stressCounter])

                stressCounter += 1
        else:
            break

    if (lastPhoneme in vowelSounds) and (lastStress > 0):
        return True
    else:
        return False

# Convert tokens (with attached whitespace, punctuation, and irregular capitalization) to a clean, all-caps form
def cleanWord(word):
    newWord = word.upper()
    newWord = newWord.strip(string.punctuation)
    return newWord

# Given a list of words (which should have already been passed through cleanWord()), return a randomly selected
# word that has the given pattern of stresses (passed in as an array: [0, 1, 0], for example)
# 
# This function WILL shuffle your list in place. Watch out for that.
def wordWithPattern(searchList, stresses):
    random.shuffle(searchList)
    
    for word in searchList:
        if word in phoneticDictionary and phoneticDictionary[word][1] == stresses:
            return word.lower()


# Import and populate the speaking dictionary
cmu = open('cmudict', 'r')

phoneticDictionary = {}

for line in cmu:
    tokens = line.split()
    
    # Only grab the first pronunciation in the dictionary
    if tokens[1] == "1":
        key = tokens[0]
        phonemes = []
        stresses = []

        # Some syllables have a number indicating stress (e.g. "EH1").
        # We're not using it, so strip that character off.
        for phoneme in tokens[2:]:
            if not phoneme.isalpha():
                stresses.append(int(phoneme[-1]))
                phoneme = phoneme[:-1]

            phonemes.append(phoneme)

        phoneticDictionary[key] = (phonemes, stresses)

cmu.close()

# Now start the real work ----------------------------------------------------

# First, pull in a list of all words in the file
words = []

for line in sys.stdin:
    line = line.strip()
    newWords = line.split()
    words += newWords

# Now, collect the words that directly follow "the" or "is"
phrases = {'subject': [],
           'predicate': []}

currType = ''

for word in words:
    if currType != '':
        # Filter out a couple of words I don't want... single letter words and articles
        if len(cleanWord(word)) > 1 and cleanWord(word) != 'AN':
            phrases[currType].append(cleanWord(word))
        currType = ''
    elif cleanWord(word) == 'THE':
        currType = 'subject'
    elif cleanWord(word) == 'IS':
        currType = 'predicate'

rhymes = set()

for subj in phrases['subject']:
    for pred in phrases['predicate']:
        if (rhymingWords(subj, pred) and
                phoneticDictionary[subj][1] in ([1], [0, 1]) and
                phoneticDictionary[pred][1] in ([1], [0, 1])):
            rhymes.add((subj, pred))


# Now, print the poem

print '\n\n'

# This could be extended out to any number of pairs of stanzas
for i in range(0, 2):
    currRhyme = random.choice(list(rhymes))

    for j in range(0, 4):
        if j == 3:
            print 'the ' + wordWithPattern(phrases['subject'], [1, 0])
            print 'is ' + currRhyme[1].lower()
        elif random.random() < 0.4:
            print 'the ' + wordWithPattern(phrases['subject'], [1, 0])
            print 'is ' + wordWithPattern(phrases['predicate'], [1, 0])
        elif random.random() < 0.5:
            print 'the ' + wordWithPattern(phrases['subject'], [0, 1])
            print 'is ' + wordWithPattern(phrases['predicate'], [0, 1, 0])
        else:
            print 'the ' + wordWithPattern(phrases['subject'], [1])
            print 'is ' + wordWithPattern(phrases['predicate'], [0, 1, 0])

    print ''

    for j in range(0, 4):
        if j == 3:
            print 'is ' + wordWithPattern(phrases['predicate'], [1, 0])
            print 'the ' + currRhyme[0].lower()
        elif random.random() < 0.4:
            print 'is ' + wordWithPattern(phrases['predicate'], [1, 0])
            print 'the ' + wordWithPattern(phrases['subject'], [1, 0])
        elif random.random() < 0.5:
            print 'is ' + wordWithPattern(phrases['predicate'], [0, 1])
            print 'the ' + wordWithPattern(phrases['subject'], [0, 1, 0])
        else:
            print 'is ' + wordWithPattern(phrases['predicate'], [1])
            print 'the ' + wordWithPattern(phrases['subject'], [0, 1, 0])

    print ''

print '\n\n'

Assignment 3: Wiki

Reading and Writing Electronic Text


For this assignment, I looked into the Wikipedia API. I’m interested in making poems that are about something (at least in an incredibly vague sense of “about”), and Wikipedia is as good a resource as any for learning about a wide variety of topics (or, for computer poets, at least gleaning the vaguest impression of knowledge).

I wanted to write a program that could take articles on two topics and mash them up, generating metaphorical comparisons between the two. I thought I’d grab sentences and combine them, but Wikipedia is quite messy, littered with markup. So, instead of trying to parse out all of the markup from sentences, I decided to focus on words again. I used my CMU Pronouncing Dictionary code from the previous assignment to filter each of the two articles into a list of words found in the CMU dictionary. I then found words that only appeared in one of the two articles, and formed pairs of words based on phonetic similarity. By focusing on words that only appear in one article, I hoped to single out interesting words, while establishing a series of contrasts.
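The "words unique to each article" step boils down to set operations (a toy sketch with made-up word lists; the real script builds its lists from the filtered article text):

```python
# Hypothetical cleaned word lists from the two articles
computer = {"COMMANDS", "MECHANICAL", "PEOPLE", "SCIENCE"}
poetry = {"CLASSICAL", "COMBINED", "PEOPLE", "SCIENCE"}

# Words appearing in only one article form the pool of contrasts;
# shared words like PEOPLE and SCIENCE drop out
only_computer = computer - poetry
only_poetry = poetry - computer
```

Pairing words across the two leftover pools is what produces the "I am X and you are Y" contrasts.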

Here’s an example, contrasting “Computer” and “Poetry”:

I am Computer and you are Poetry
I am practically and you are grammatical.
And while you are plants, I am plan.
And while you are random, I am rand.

And while you are larger, I am large.
And you are not result, for you are results.
And while you are compilation, I am computational.
And while you are dramas, I am dumas.

And you are not careful, for you are carroll.
And while you are character, I am microcomputer.
And you are not dummer, for you are adam.
And while you are two, I am too.

And I am replacements while you are placement.
I am not discussed, for I am difficulties.
And I am catering while you are cutting.
I am not thus, for I am us.

And I am fail while you are fable.
I am not free, for I am frees.
And while you are tory, I am torpedo.
I am not two, for I am tool.

For the most part, the words chosen by the program are interesting and relevant to their specific topic. Wikipedia mentions or cites a lot of names, which generally don’t lead to interesting lines. (Though, I think “And you are not careful, for you are carroll” is wonderful.) And, of course, there’s no checking of part of speech or anything, so the lines are often a syntactic mess.

Here’s another, on the same topic:

I am Computer and you are Poetry
I am not departing, for I am interpreting.
I am commands and you are combined.
I am mechanical and you are classical.

And you are not bus, for you are basis.
And you are not francisco, for you are france.
And you are not imperative, for you are emotive.
And you are not serve, for you are served.

And I am multiply while you are multiple.
And while you are memory, I am mit.
I am scales and you are schools.
I am not rising, for I am devising.

I am people and you are appeal.
I am i and you are rhyme.
I am not open, for I am punch.
I am not frequent, for I am frequently.

I am not introducing, for I am producing.
I am not cultures, for I am compilers.
I am sine and you are science.
And I am slowly while you are sleep.

And, here’s my code:

import urllib
import json
import sys
import random
import difflib
import string

# Function to find the closest phonetic word
def closestWord(search_word, word_list):
    closest = ['', 0]

    for word in word_list:
        if not word == search_word:
            matcher = difflib.SequenceMatcher(None, phoneticDictionary[search_word],
                                              phoneticDictionary[word])
            similarity = matcher.ratio()

            if similarity > closest[1]:
                closest = [word, similarity]

    return closest


# Import and populate the speaking dictionary
cmu = open('cmudict', 'r')

phoneticDictionary = {}

for line in cmu:
    tokens = line.split()

    if tokens[1] == "1":
        key = tokens[0]
        phonemes = []

        # Some syllables have a number indicating stress (e.g. "EH1").
        # We're not using it, so strip that character off.
        for phoneme in tokens[2:]:
            if not phoneme.isalpha():
                phoneme = phoneme[:-1]

            phonemes.append(phoneme)

        phoneticDictionary[key] = phonemes

cmu.close()

#Treat the first two command-line arguments as Wikipedia titles, and load the
#articles
titles = [sys.argv[1], sys.argv[2]]
query = {"format": "json",
         "action": "query",
         "titles": '|'.join(titles),
         "prop": "revisions",
         "rvprop": "content"}
query_string = urllib.urlencode(query)

url = "http://en.wikipedia.org/w/api.php?" + query_string

response = urllib.urlopen(url).read()

data = json.loads(response)

#Now, grab the content for the two pages

titles = []
pageContent = []

for id in data['query']['pages']:
    print id
    words = data['query']['pages'][id]['revisions'][0]['*'].split(' ')
    
    cleanWords = []
    
    for word in words:
        word = word.upper()
        word = word.strip(string.punctuation)
        
        if word in phoneticDictionary:
            cleanWords.append(word)
    
    titles.append(data['query']['pages'][id]['title'])
    pageContent.append(cleanWords)

#Now, clean up the lists further. First, remove duplicates within each list
pageContent[0] = list(set(pageContent[0]))
pageContent[1] = list(set(pageContent[1]))

#Second, remove all overlapping words between the two lists.
#(Iterate over a copy, since we remove items from the original list.)
for word in list(pageContent[0]):
    if word in pageContent[1]:
        pageContent[0].remove(word)
        pageContent[1].remove(word)

#Now, generate poem

#First line
print 'I am ' + titles[0] + ' and you are ' + titles[1]

templates = [['I am ', 0, ' and you are ', 1],
             ['And I am ', 0, ' while you are ', 1],
             ['I am not ', 1, ', for I am ', 0],
             ['And you are not ', 0, ', for you are ', 1],
             ['And while you are ', 1, ', I am ', 0]]

for i in range(1, 20):
    randoWord = random.choice(pageContent[0])
    matchWord = closestWord(randoWord, pageContent[1])[0]
    
    words = [randoWord.lower(), matchWord.lower()]
    pattern = random.choice(templates)
    print pattern[0] + words[pattern[1]] + pattern[2] + words[pattern[3]] + '.'
    #Break it up into four-line stanzas
    if i % 4 == 3:
        print ''

Assignment 2: Speech

Reading and Writing Electronic Text


I used the CMU Pronouncing Dictionary from Carnegie Mellon University's computational speech group. The dictionary is a very long text document with pronunciations for basically any word you can think of, laid out like so:

...
POET'S 1 P OW1 AH0 T S
POETS 1 P OW1 AH0 T S
POFAHL 1 P AA1 F AA0 L
POFF 1 P AO1 F
...
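A single entry can be unpacked with a few lines (a sketch using the POETS line above; the digit trailing a vowel phoneme marks its stress level):

```python
line = "POETS 1 P OW1 AH0 T S"
tokens = line.split()

word = tokens[0]
# Strip the stress digit from vowel phonemes like "OW1"
phonemes = [p[:-1] if p[-1].isdigit() else p for p in tokens[2:]]
stresses = [int(p[-1]) for p in tokens[2:] if p[-1].isdigit()]
```

This leaves a clean phoneme list for comparing word sounds, plus a separate stress list for later metrical work.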

My rough goal is to cluster words that sound similar (words that rhyme, alliterative words, homophones, etc) into abstract, tongue-twistery text. As it turns out, Python has a good way to calculate the similarity of two lists (in this case, lists of phonemes) using difflib, a module for computing differences in sequences of text.
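For example, comparing the phoneme lists for POET'S and POFF from the excerpt above (a sketch; SequenceMatcher's ratio is 2 × matching elements / total elements):

```python
import difflib

poets = ["P", "OW", "AH", "T", "S"]   # POET'S
poff = ["P", "AO", "F"]               # POFF

# Only the initial "P" matches, so ratio = 2 * 1 / (5 + 3) = 0.25
matcher = difflib.SequenceMatcher(None, poets, poff)
similarity = matcher.ratio()
```

Scanning a whole word list and keeping the highest-ratio candidate gives the "phonetically closest word" used below.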

Working with the text of a recent New York Times article on the FCC and net neutrality, I wrote code to generate pairs consisting of a random word from the article plus the word phonetically closest to it.

HOUSEHOLD, SO-CALLED
DIETZ, ITS
CAN, AN
TECHNICAL, CRITICAL
RULES, RULE
BRANDED, BRAND
SPIGOT, BIGGEST
AWAY, WAY
IS, IMPOSE
BEGINNING, BUILDING

To generate my poem, I wrote some code to take pairs of words from the article and then link them with similar adjacent words.

proclaiming in been the a right wrote the
thinking is impose tolls tools to
turn away way were work with waiting until efficiently into
technology activist activists have had branded brand of a right
support he hear dire i think thing that at the
days the a right wrote the
east coast-west
gatekeeper power our future few days dazed brian urban dictionary
window proclaiming proclaimed we week 102
posted a the policy technology activist activists have had branded

Here’s the code:

import sys
import random
import difflib

# Function to find the closest phonetic word
def closest_word(search_word):
    closest = ['', 0]

    for word in words:
        if not word == search_word:
            matcher = difflib.SequenceMatcher(None, phoneticDictionary[search_word],
                                              phoneticDictionary[word])
            similarity = matcher.ratio()

            if similarity > closest[1]:
                closest = [word, similarity]

    return closest


# Import and populate the speaking dictionary
cmu = open('cmudict', 'r')

phoneticDictionary = {}

for line in cmu:
    tokens = line.split()

    if tokens[1] == "1":
        key = tokens[0]
        phonemes = []

        # Some syllables have a number indicating stress (e.g. "EH1").
        # We're not using it, so strip that character off.
        for phoneme in tokens[2:]:
            if not phoneme.isalpha():
                phoneme = phoneme[:-1]

            phonemes.append(phoneme)

        phoneticDictionary[key] = phonemes

cmu.close()

wordpairs = {}
lastword = ''

words = set()

for line in sys.stdin:
    line = line.strip()
    line = line.upper()

    if lastword != '':
        wordpairs[lastword] = line

    lastword = line

    if line in phoneticDictionary:
        words.add(line)

# Cycle through lines, starting each with a random word
for i in range(0, 10):
    randomWord = random.choice(list(words))
    lineWords = []

    for j in range(0, 5):
        lineWords.append(randomWord)
        lineWords.append(wordpairs[randomWord])

        if wordpairs[randomWord] in words:
            randomWord = closest_word(wordpairs[randomWord])[0]

            if randomWord in lineWords:
                break
        else:
            break

    print " ".join(lineWords).lower()

Assignment 1: Experimental text justification

Reading and Writing Electronic Text


While experimenting with Python's string manipulation, I noticed that Python has tools for padding strings to a given length in order to center, left-justify, or right-justify text rendered in a monospaced font. The fourth popular layout, text justified at both margins, is notably not supported, as that traditionally involves variable-width spaces.

While considering this problem, it occurred to me that padding strings with extra spaces between words would look clunky, but then I hit upon a solution: vowels. Whereas consonants provide much of the definition of words, vowels can be repeated or omitted with minimal impact on readability. The following Python program justifies a column of text to the width of the first line, simply by duplicating or removing vowels (it skips capital vowels, as they often occur at the beginning of words, where chopping them off looks more awkward).

# Justify text by adding and subtracting vowels
import sys

vowels = ['a', 'e', 'i', 'o', 'u']
lineLength = 0

for line in sys.stdin:
    line = line.strip()

    if lineLength == 0:
        lineLength = len(line)
    else:
        index = 0

        while len(line) != lineLength:
            if line[index] in vowels:
                if len(line) < lineLength:
                    line = line[:index] + line[index] + line[index:]
                    index = (index + 2) % len(line)
                elif len(line) > lineLength:
                    line = line[:index] + line[index + 1:]
            else:
                index = (index + 1) % len(line)

    print line

As an example, take one of Shakespeare's sonnets.

Let me confess that we two must be twain,
Although our undivided loves are one:
So shall those blots that do with me remain,
Without thy help, by me be borne alone.
In our two loves there is but one respect,
Though in our lives a separable spite,
Which though it alter not love's sole effect,
Yet doth it steal sweet hours from love's delight.
I may not evermore acknowledge thee,
Lest my bewailed guilt should do thee shame,
Nor thou with public kindness honour me,
Unless thou take that honour from thy name:
  But do not so, I love thee in such sort,
  As thou being mine, mine is thy good report.

When run through this Python script, each line is transformed, resulting in the following, nicely-justified text:

Let me confess that we two must be twain,
Althoouugh oouur undivided loves are one:
S shll thse blots that do with me remain,
Wiithoout thy help, by me be borne alone.
In ur two loves there is but one respect,
Thoouugh iin our lives a separable spite,
Whch thgh t alter not love's sole effect,
Yt dth t stl swt hrs from love's delight.
I maay noot eeveermoore acknowledge thee,
Lst my bwiled guilt should do thee shame,
Noor thou with public kindness honour me,
Unlss thu take that honour from thy name:
Buut do not so, I love thee in such sort,
As th bing mine, mine is thy good report.

Beautiful.