Identify important people from emails (alt)

Programming Foundation for Data Analytics

Project Title: Identification of Roles of People in Email Conversations

Project Goal: You are given a CSV file “220.csv”, a filtered email archive containing a “Subject” column, names in the “From”, “To”, “CC”, and “BCC” columns, and an “Importance” column. You are asked to write a Python program to compute statistics that will help identify the role each involved person (including organizations) plays in the email conversations. In particular, for each person (or organization) that appears in the file, you need to find out how many conversations that person was involved in, how many times that person appeared in each of the “From”, “To”, “CC”, and “BCC” columns, and whether that person ever appeared in a conversation that was marked as highly important.

Names of people appear in the format Last, First Middle-Initial (Location), e.g.,

Khammivong, Somsanouk N (HOU)

Note that the middle initial may be omitted. Also note that a name can be an organization, such as

DRCS_Support

Multiple names in the same column are separated by “;”.
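The name format above can be taken apart with a short sketch like the following. The helper names `split_names` and `parse_name` and the regular expression are assumptions made for illustration, not part of the assignment. Note that a cell is split on “;” and not on “,”, because the comma is part of each person's name.

```python
import re

# Hypothetical pattern for "Last, First Middle-Initial (Location)".
# The middle initial and the location are both treated as optional.
NAME_RE = re.compile(
    r"^(?P<last>[^,]+),\s*(?P<first>\S+)"
    r"(?:\s+(?P<middle>[A-Za-z])\.?)?"
    r"(?:\s*\((?P<location>[^)]+)\))?$"
)


def split_names(cell):
    """Split one From/To/CC/BCC cell on ';' and drop empty pieces."""
    return [part.strip() for part in cell.split(";") if part.strip()]


def parse_name(name):
    """Return the parts of a person's name, or mark it as an organization."""
    match = NAME_RE.match(name)
    if match is None:
        # Anything that does not fit the person format (e.g. no comma)
        # is treated here as an organization name.
        return {"organization": name}
    return match.groupdict()


cell = "Khammivong, Somsanouk N (HOU); DRCS_Support"
for entry in split_names(cell):
    print(parse_name(entry))
```

For the statistics themselves the full name string is enough as a key; splitting it into parts is only useful if names need to be compared or cleaned further.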

Each email is marked as “Normal” or “High” in importance.

You are encouraged to open the CSV file in Notepad to examine the format of the data.

Output: The output of the program should be a CSV file whose header line contains column names “Name”, “Total”, “Conversations”, “From”, “To”, “CC”, “BCC”, and “Importance”. The other rows contain data specified as follows:

  • Column “Name”: the name of the person or organization
  • Column “Total”: the total number of times the person appears in all the emails.
  • Column “Conversations”: the number of conversations the person is involved in (i.e., the number of conversations in which that person appears in at least one of the “From”, “To”, “CC”, and “BCC” columns of at least one email).
  • Column “From”: the number of times the person appears in the “From” column.
  • Column “To”: the number of times the person appears in the “To” column.
  • Column “CC”: the number of times the person appears in the “CC” column.
  • Column “BCC”: the number of times the person appears in the “BCC” column.
  • Column “Importance”: If the person appears in at least one email that is marked “High” importance, the value is “High”. Otherwise, the value is “Normal”. 
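The column definitions above can be checked on a tiny in-memory example before touching the real file. This is a minimal sketch under two assumptions: the rows below are made up and are not taken from “220.csv”, and a conversation is identified simply by its subject text. The function name `summarize` is hypothetical.

```python
import csv
import io
from collections import defaultdict

COLUMNS = ["From", "To", "CC", "BCC"]
FIELDS = ["Name", "Total", "Conversations"] + COLUMNS + ["Importance"]

# Made-up rows for illustration only; they are not taken from 220.csv.
emails = [
    {"Subject": "Budget", "From": "A", "To": "B;C", "CC": "", "BCC": "", "Importance": "Normal"},
    {"Subject": "Budget", "From": "B", "To": "A", "CC": "C", "BCC": "", "Importance": "High"},
    {"Subject": "Lunch", "From": "C", "To": "A", "CC": "D", "BCC": "", "Importance": "Normal"},
]


def summarize(rows):
    """Build one output record per name, following the column definitions."""
    counts = defaultdict(lambda: dict.fromkeys(COLUMNS, 0))
    subjects = defaultdict(set)   # name -> subjects of conversations it appears in
    high = set()                  # names seen in at least one "High" email
    for row in rows:
        for column in COLUMNS:
            for name in row[column].split(";"):
                name = name.strip()
                if not name:
                    continue
                counts[name][column] += 1
                subjects[name].add(row["Subject"])
                if row["Importance"] == "High":
                    high.add(name)
    result = []
    for name in sorted(counts):
        record = {
            "Name": name,
            "Total": sum(counts[name].values()),
            "Conversations": len(subjects[name]),
            "Importance": "High" if name in high else "Normal",
        }
        record.update(counts[name])
        result.append(record)
    return result


# Write to an in-memory buffer so the sketch leaves no file behind.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, FIELDS)
writer.writeheader()
writer.writerows(summarize(emails))
print(buffer.getvalue())
```

In this toy data, “A” appears once in “From” and twice in “To”, so its “Total” is 3 while its “Conversations” count is 2, which shows why the two columns are tracked separately.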


##PROGRESS SO FAR

import csv
import re

# Pattern intended to strip reply/forward markers such as "RE:", "FW:" and "Fwd:" from a subject
p = re.compile(r'([\[\(] *)?.*(RE?S?|FW?|Fwd?|re\[\d+\]?) *([-:;)\]][ :;\])-]*)|\]+ *$', re.IGNORECASE)

with open('220.csv', 'r', newline='') as f:
    f_reader = csv.reader(f)
    for row in f_reader:
        subject = p.sub('', row[0])  # clean the 1st column
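The idea behind cleaning the subject is that replies and forwards of the same email should count as one conversation. A minimal sketch of that step, using a deliberately simpler pattern than the one above (the pattern `PREFIX` and the helper `clean_subject` are assumptions for illustration and only handle leading “RE:”, “FW:” and “Fwd:” markers):

```python
import re

# Simplified, hypothetical pattern: one or more leading reply/forward markers.
PREFIX = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)


def clean_subject(subject):
    """Remove leading reply/forward markers so related emails share a subject."""
    return PREFIX.sub("", subject).strip()


for raw in ["RE: FW: Budget review", "Fwd: budget review", "Budget review"]:
    print(repr(raw), "->", repr(clean_subject(raw)))
```

Subjects that differ only in letter case would still be counted as separate conversations here; comparing `clean_subject(s).casefold()` values is one possible way to group them, if that matches how the data should be interpreted.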

Solution

##PROGRESS SO FAR

import csv
import re
from collections import defaultdict


def normalize(name):
    # drop the 'BIS' marker, surrounding whitespace and stray single quotes
    return name.replace('BIS', '').strip().strip("'")


# Pattern intended to strip reply/forward markers such as "RE:", "FW:" and "Fwd:" from a subject
p = re.compile(r'([\[\(] *)?.*(RE?S?|FW?|Fwd?|re\[\d+\]?) *([-:;)\]][ :;\])-]*)|\]+ *$', re.IGNORECASE)

columns = ["From", "To", "CC", "BCC"]

# dict of dict of subjects list
conv = {item: defaultdict(list) for item in columns}
importances = defaultdict(list)

with open('220.csv', 'r', encoding='iso-8859-1', newline='') as f:
    f_reader = csv.DictReader(f)
    for row in f_reader:
        subject = p.sub('', row['Subject'])  # clean the subject column
        importance = row['Importance']
        for item in columns:
            for name in row[item + ': (Name)'].split(';'):
                name = normalize(name)
                if name:  # make sure it is not empty
                    # append to the corresponding subjects list
                    conv[item][name].append(subject)
                    importances[name].append(importance)

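all_names = list(importances.keys())

# eliminate duplicate entries: merge a name into a longer name that contains it
for name in all_names:
    for val in all_names:
        # skip targets that have already been merged away themselves
        if name != val and name in val and val in importances:
            importances[val].extend(importances.pop(name))
            for item in columns:
                conv[item][val].extend(conv[item].pop(name, []))
            break  # name is merged; stop so it is not re-created or merged twice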

# csv fields
fields = ['Name', 'Total', 'Conversations', 'From', 'To', 'CC', 'BCC', 'Importance']
data = []

# counts for each unique name
for name in sorted(importances.keys()):
    importance = 'High' if 'High' in importances[name] else 'Normal'
    conversations = len(set(subject for item in columns for subject in conv[item][name]))
    total = sum(len(conv[item][name]) for item in columns)
    payload = {'Name': name, 'Total': total, 'Conversations': conversations, 'Importance': importance}
    for item in columns:
        payload[item] = len(conv[item][name])
    data.append(payload)

# write to the output file
with open('output.csv', 'w', encoding='iso-8859-1', newline='') as f:
    f_writer = csv.DictWriter(f, fields)
    f_writer.writeheader()
    f_writer.writerows(data)
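Once an output file exists, a quick consistency check can catch counting mistakes. The sketch below relies on two properties that follow from the column definitions: “Total” should equal the sum of the “From”, “To”, “CC”, and “BCC” counts, and “Conversations” cannot exceed “Total”. The function name `find_inconsistent_rows` and the sample numbers are made up for illustration.

```python
import csv
import io

COLUMNS = ["From", "To", "CC", "BCC"]


def find_inconsistent_rows(rows):
    """Return the names whose counts contradict the column definitions."""
    bad = []
    for row in rows:
        per_column = sum(int(row[column]) for column in COLUMNS)
        total = int(row["Total"])
        conversations = int(row["Conversations"])
        if (total != per_column
                or conversations > total
                or row["Importance"] not in ("High", "Normal")):
            bad.append(row["Name"])
    return bad


# Made-up sample in the output layout; for a real run, read the file with
# open("output.csv", newline="", encoding="iso-8859-1") instead.
sample = io.StringIO(
    "Name,Total,Conversations,From,To,CC,BCC,Importance\r\n"
    "DRCS_Support,3,2,1,2,0,0,High\r\n"
    "Example_Org,5,1,1,1,1,1,Normal\r\n"
)
print(find_inconsistent_rows(csv.DictReader(sample)))
```

The second sample row is flagged because its “Total” of 5 does not match the four column counts, which add up to 4.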