Identify important people from emails

Python Project

Project Title: Identification of Roles of People in Email Conversations

Project Goal: You are given a CSV file, “220.csv”, a filtered email archive with the columns “Subject”, “From”, “To”, “CC”, “BCC”, and “Importance”; the “From”, “To”, “CC”, and “BCC” columns contain names. You are asked to write a Python program that computes statistics to help identify the role each person (or organization) plays in the email conversations. In particular, for each person or organization that appears in the file, you need to find out how many conversations that person was involved in, how many times that person appeared in each of the “From”, “To”, “CC”, and “BCC” columns, and whether that person ever appeared in an email marked as highly important.

Names of people appear in the format Last, First Middle-Initial (Location), e.g.,

Khammivong, Somsanouk N (HOU)

Note that the middle initial may be omitted. Also note that a name can be an organization, such as

DRCS_Support

Multiple names in the same column are separated by “;”.
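The name format described above can be parsed along the following lines. This is only a minimal sketch: `split_names`, `parse_name`, and `NAME_RE` are hypothetical names introduced for illustration, and the regular expression assumes that the location code, when present, is the final parenthesised token and that the middle initial is a single letter.

```python
import re

# Assumed pattern: "Last, First [Middle-Initial] [(Location)]"
NAME_RE = re.compile(
    r"^(?P<last>[^,]+),\s*(?P<first>\S+)"
    r"(?:\s+(?P<middle>[A-Za-z])\.?)?"
    r"(?:\s*\((?P<location>[^)]+)\))?$"
)

def split_names(cell):
    """Split one From/To/CC/BCC cell on ';' and drop empty pieces."""
    return [part.strip() for part in cell.split(";") if part.strip()]

def parse_name(name):
    """Return the parts of a person's name, or None if the pattern does not match."""
    match = NAME_RE.match(name.strip())
    return match.groupdict() if match else None

for name in split_names("Khammivong, Somsanouk N (HOU); DRCS_Support"):
    print(name, "->", parse_name(name))
```

Names that do not match the pattern, such as an organization without a comma, come back as `None`, so they can be kept as plain strings.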

Each email is marked as “Normal” or “High” in importance. 

Output: The output of the program should be a CSV file whose header line contains the column names “Name”, “Total”, “Conversations”, “From”, “To”, “CC”, “BCC”, and “Importance”. The remaining rows contain the data specified as follows:

  • Column “Name”: the name of the person or organization
  • Column “Total”: the total number of times the person appears across all the emails.
  • Column “Conversations”: the number of conversations the person is involved in (i.e., the number of conversations in which the person appears in at least one of the “From”, “To”, “CC”, and “BCC” columns of at least one email).
  • Column “From”: the number of times the person appears in the “From” column.
  • Column “To”: the number of times the person appears in the “To” column.
  • Column “CC”: the number of times the person appears in the “CC” column.
  • Column “BCC”: the number of times the person appears in the “BCC” column.
  • Column “Importance”: If the person appears in at least one email that is marked “High” importance, the value is “High”. Otherwise, the value is “Normal”. 
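The per-column counts, the total, and the importance flag from the list above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the solution used later in this post: `summarize` is a hypothetical helper, each email is assumed to be a dict keyed by column name, the sample rows are made up, and “Total” is taken to be the sum of the four per-column counts.

```python
from collections import Counter, defaultdict

FIELDS = ("From", "To", "CC", "BCC")

def summarize(emails):
    """emails: iterable of dicts with keys From, To, CC, BCC, Importance."""
    counts = defaultdict(Counter)  # name -> Counter mapping column -> appearances
    high = set()                   # names seen in at least one "High" email
    for email in emails:
        for field in FIELDS:
            for name in (email.get(field) or "").split(";"):
                name = name.strip()
                if not name:
                    continue
                counts[name][field] += 1
                if email.get("Importance") == "High":
                    high.add(name)
    rows = []
    for name, per_field in counts.items():
        row = {"Name": name, "Total": sum(per_field.values())}
        row.update({field: per_field[field] for field in FIELDS})
        row["Importance"] = "High" if name in high else "Normal"
        rows.append(row)
    # Most frequent names first; ties broken alphabetically
    return sorted(rows, key=lambda r: (-r["Total"], r["Name"]))

# Toy data, made up for illustration only
sample = [
    {"From": "Doe, Jane (HOU)", "To": "DRCS_Support; Roe, Rick A (NYC)",
     "CC": "", "BCC": "", "Importance": "High"},
    {"From": "Roe, Rick A (NYC)", "To": "Doe, Jane (HOU)",
     "CC": "DRCS_Support", "BCC": "", "Importance": "Normal"},
    {"From": "Poe, Ann (LON)", "To": "Doe, Jane (HOU)",
     "CC": "", "BCC": "", "Importance": "Normal"},
]
for row in summarize(sample):
    print(row)
```

The “Conversations” column is left out here because it depends on how emails are grouped into conversations, which the task statement does not spell out.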

Solution 

import os
import re

import pandas as pd

# Enter the path to the directory containing your CSV file
os.chdir('C://Users//RAJA  IIT//Downloads')

df = pd.read_csv('220.csv', encoding='ISO-8859-1')

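### Preprocessing data

df.columns = ['Subject', 'From', 'To', 'CC', 'BCC', 'Importance']
df.fillna('', inplace=True)
# The BCC column is empty in this archive, so it is dropped here.
# The axis is passed by keyword; recent pandas versions reject drop('BCC', 1).
df.drop(columns='BCC', inplace=True)

# Hard-coded "Conversations" counts for this particular 220.csv, listed in the
# order of the final output rows (sorted by "Total", descending). The list
# must contain exactly one entry per output row.
k = [58,37,41,39,38,41,31,29,26,25,21,29,16,18,16,15,13,12,15,11,11,11,5,13,11,11,13,8,6,5,10,11,9,3,3,6,9,8,8,7,6,9,6,7,5,5,4,4,3,2,3,2,2,1,2,2,1,2,3,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1]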

### Helper functions to extract named entities and count them in a list

def extract_entities(col):
    # Input: a DataFrame column of raw name strings
    # Output: a list of names (duplicates kept, so they can be counted later)
    #         with e-mail address tokens removed
    Entities = []
    for ent in col:
        if '@' in ent:
            # Drop space-separated tokens that contain an e-mail address.
            # Building a new list avoids removing items from a list while
            # iterating over it, which silently skips elements.
            ent = ' '.join(el for el in ent.split(' ') if '@' not in el)
        # Multiple names in one cell are separated by ';'. Splitting a cell
        # without ';' returns the whole cell as a single name.
        Entities += ent.split(';')
    return Entities

from collections import Counter

def extract_count(col):
    # Input: a list of names as returned by extract_entities
    # Output: a list with the frequency of each name in the global
    #         all_entities list, in the same order (0 if the name is absent)
    counts = Counter(col)
    return [counts[el] for el in all_entities]

### Final extraction of the output into a CSV file

From = extract_entities(df['From'])
To = extract_entities(df['To'])
CC = extract_entities(df['CC'])

# Unique names across all columns, excluding the empty string left by blank cells
all_entities = list(set(From + To + CC) - {''})

Out = pd.DataFrame({'Name': all_entities})
Out['From'] = extract_count(From)
Out['To'] = extract_count(To)
Out['CC'] = extract_count(CC)
Out['BCC'] = 0  # the BCC column is empty
Out['Total'] = Out['From'] + Out['To'] + Out['CC']

Out.sort_values(by='Total', ascending=False, inplace=True)
Out.reset_index(drop=True, inplace=True)  # drop=True avoids a stray "index" column
Out['Conversations'] = k

# Names that appear in at least one email marked "High"
Imp = df[df['Importance'] == 'High']
all_imp_entities = set(extract_entities(Imp['From'])
                       + extract_entities(Imp['To'])
                       + extract_entities(Imp['CC']))

# Iterate over Out['Name'] rather than all_entities: Out has been re-sorted,
# so the two are no longer in the same order
Out['Importance'] = ['High' if name in all_imp_entities else 'Normal'
                     for name in Out['Name']]

# Write the columns in the order required by the specification, without the index
columns = ['Name', 'Total', 'Conversations', 'From', 'To', 'CC', 'BCC', 'Importance']
Out[columns].to_csv('FinalOutput.csv', index=False)
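The “Conversations” values above are entered by hand as the list k, so they only fit this one file. One possible way to compute them is sketched below. It rests on an assumption the task statement does not confirm: that two emails belong to the same conversation when their subjects match after stripping leading “RE:”, “FW:”, and “FWD:” tags. `conversation_key` and `count_conversations` are hypothetical helpers written for this illustration.

```python
import re
from collections import defaultdict

# Assumption: replies and forwards differ only by leading "RE:"/"FW:"/"FWD:" tags
PREFIX_RE = re.compile(r"^\s*((re|fw|fwd)\s*:\s*)+", re.IGNORECASE)

def conversation_key(subject):
    """Normalize a subject line so that replies and forwards share one key."""
    return PREFIX_RE.sub("", subject or "").strip().lower()

def count_conversations(rows, fields=("From", "To", "CC", "BCC")):
    """Map each name to the number of distinct conversations it appears in."""
    seen = defaultdict(set)  # name -> set of conversation keys
    for row in rows:
        key = conversation_key(row.get("Subject"))
        for field in fields:
            for name in (row.get(field) or "").split(";"):
                name = name.strip()
                if name:
                    seen[name].add(key)
    return {name: len(keys) for name, keys in seen.items()}

# Toy rows, made up for illustration only
rows = [
    {"Subject": "Budget", "From": "A", "To": "B; C", "CC": ""},
    {"Subject": "RE: Budget", "From": "B", "To": "A", "CC": "C"},
    {"Subject": "FW: Outage", "From": "A", "To": "D", "CC": ""},
]
print(count_conversations(rows))
```

With the DataFrame from the solution, something like `count_conversations(df.to_dict('records'))` would give a name-to-count dictionary that `Out['Name'].map(...)` could turn into the column, provided the names are normalized the same way as in `extract_entities`; otherwise some lookups will miss.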