Segment text

Word-segmentation task (Writing a single Python Program)

Assignment specifications

The purpose of this assignment is to write a Python program that makes word segmentation decisions (e.g., resolving the ambiguity of the sounds /sk/ as part of either "this kite" or "the sky"). Recall the work by Saffran et al. on the role of transition probabilities in word learning. Saffran et al. posited that word-internal transition probabilities (e.g., the frequency of the sound sequence /Is/ within "this") are higher than between-word transition probabilities (e.g., the frequency of the sequence /sk/ across "this kite"), and that this distinction can be used to segment words from the continuous stream of sounds in linguistic input.
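
As a rough illustration of the statistic involved, the sketch below estimates a transition probability as count(pair) / count(first letter) over a toy letter sequence. The function name `transition_probability` and the toy string are assumptions made for this example; they are not part of the provided skeleton.

```python
def transition_probability(letters, first, second):
    """Estimate p(second | first) as count(first second) / count(first)."""
    text = "".join(letters)
    first_count = text.count(first)
    if first_count == 0:
        return None  # undefined: the first letter never occurs
    return text.count(first + second) / first_count


# Toy corpus (hypothetical), stored as a list of letters
toy = list("thiskitethesky")
p_th = transition_probability(toy, "t", "h")
p_sk = transition_probability(toy, "s", "k")
print(p_th, p_sk)
```

Note that `str.count` counts non-overlapping occurrences, so a doubled letter such as "aa" inside "aaa" would be counted once. For pairs of two different letters this makes no difference.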

For this assignment, the task is simplified so that the program will read a corpus of prior "training input" from a file and will then prompt a user for a number of possible actions, including most importantly determining the likely word boundary in a short sequence of adjacent sounds. Sounds here are treated as orthographic letters. In addition, the program will allow the user to add input to the corpus and to damage the corpus in targeted ways and thereby skew its statistical properties. The program will loop until the user quits. When the user quits, the latest version of the corpus list is saved to the corpus file.
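
To make the idea of skewing the statistics concrete, here is a small sketch of how removing a sequence changes a transition probability. The helper names `remove_sequence` and `pair_probability` and the toy corpus are hypothetical, chosen only for this illustration.

```python
def remove_sequence(corpus, sequence):
    """Return a new letter list with every occurrence of sequence removed."""
    return list("".join(corpus).replace(sequence, ""))


def pair_probability(corpus, first, second):
    """p(second | first); assumes first occurs at least once in corpus."""
    text = "".join(corpus)
    return text.count(first + second) / text.count(first)


# Toy corpus (hypothetical), stored as a list of letters
corpus = list("thecatsatonthemat")
before = pair_probability(corpus, "t", "h")

damaged = remove_sequence(corpus, "the")
after = pair_probability(damaged, "t", "h")

print(len(corpus), before)
print(len(damaged), after)
```

Removing a sequence also joins the letters on either side of it, so the damage can create letter pairs that were not adjacent in the original corpus.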

Below is a sample of how the program should run.

python word_segmentation.py

Reading corpus from file: corpus.txt

The corpus consists of a sequence of 639 letters

What do you want to do?

s (segment input to find the best word boundary)

a (add new input to the corpus)

d (damage the corpus)

q (quit)

>s

type a 3-letter sequence for segmentation: thb

Here is the proposed word boundary given the training corpus:

Proposed end of one word: t h

Proposed beginning of new word:  b

What do you want to do?

s (segment input to find the best word boundary)

a (add new input to the corpus)

d (damage the corpus)

q (quit)

>q

Writing corpus to file: corpus.txt

Good bye

To start you off, a training corpus file (called corpus.txt) is provided which contains an excerpt from the Saffran et al. paper. It consists of 753 letters that make up 100 words. Some skeleton code is also provided in a file called word_segmentation.py and you will need to expand this code in order to do the assignment.

If you look at the file word_segmentation.py you will find that three functions are provided (one for reading the corpus from the file at the start of the program, one for tidying the text to make it lowercase etc. and another for writing the corpus back to the file at the end). The skeleton of the main loop is also provided. As it stands, the program will run, but will only prompt the user for options, without actually doing anything with the user input.
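
The following sketch shows what the tidying step does to a raw string before it becomes a list of letters. The function `tidy` is a stand-in that mirrors the regular expression used by the provided tidy_text; the sample string is made up for the illustration.

```python
import re


def tidy(text):
    """Lowercase the text and drop non-word characters and underscores."""
    return re.sub(r"[\W_]", "", text.lower())


# Hypothetical raw input with punctuation, a newline, an underscore and digits
raw = "The sky,\nthis_kite! 42"
letters = list(tidy(raw))
print(letters)
```

Because `\W` only matches characters that are not letters, digits, or the underscore, digits survive the tidying, so a corpus "letter" may in fact be a digit.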

You will see that the file includes comments describing what the provided parts of the program do. The comments also provide some pseudo-code giving the details of what your code should do. 

Questions

  1. Write a line of code that reads the corpus from the file corpus.txt and stores it as a variable. The variable will contain a Python list, such that each element of the list represents a letter. The line that you write will involve calling the function get_corpus_from_file(myfile) that has already been written for you.
  2. Write a function segment_sequence(corpus, letter1, letter2, letter3) that takes the corpus as its first input parameter (corpus) followed by 3 letters as the next 3 input parameters (letter1, letter2, letter3) and computes the most likely word boundary. Before printing the proposed segmentation, it informs the user what it is about to do, so the printed output should look like this (e.g., it might segment the string 'thb' into 'th' and 'b'):

Here is the proposed word boundary given the training corpus:

Proposed end of one word: th

Proposed beginning of new word: b

  3. Write the code that deals with the situation where the user wants to quit the program (i.e. the user types q). This will involve writing the current corpus to the file (using the provided function write_corpus_to_file(mycorpus, myfile)), saying goodbye to the user, and exiting the loop.
  4. Write the code that deals with the situation where the user wants to segment a sequence of letters (i.e. the user types s). This will involve calling the segment_sequence function that you wrote in 2 above.
  5. Write the code that deals with the situation where the user wants to add input to the corpus (i.e. the user types a). This will involve prompting the user for new input, and appending the letters from the user onto the end of the corpus, then printing the length of the new corpus.
  6. Write the code that deals with the situation where the user wants to damage the stored corpus in order to skew the statistics (i.e. the user types d). This will involve prompting the user for the sequence of letters to be removed from the corpus (e.g., they might ask to remove the sequence 'the'), then removing that sequence wherever it occurs in the corpus, and printing two things: a statement confirming what (tidied) sequence was removed from the corpus and the new size of the corpus.
  7. Modify the program so that it checks whether the corpus contains a letter before attempting to calculate any probabilities. For example, if an attempt is made to segment the sequence 'zth' but there are no instances of the letter 'z' in the corpus, the program should say "segmentation unavailable because corpus does not contain the following letter:" and specify the letter that has count 0. This is important in order to avoid dividing by zero in the calculation of the probabilities. Also, if the user wants to damage the corpus by removing a sequence that is not in the corpus, the program should say "You can't remove a sequence that does not exist in the corpus" and specify what (tidied) subsequence did not match. This will involve modifying the code you wrote for question 2 and question 6. Note that in cases where the corpus does not contain a relevant letter or sequence, the rest of the code written in questions 2 and 6 should not run. You can use if statements or, in the case of the function you wrote for question 2, you can use the Python command return to exit a function.
  8. Compared to rule-based cognitive models whose capabilities are defined by explicit rules, this model is a statistical model. Speculate in 1-2 paragraphs about how rule-based and statistical models might behave differently when subjected to damage.

#!/usr/bin/env python

import re

####------FUNCTION (PROVIDED)---------------
#### tidy_text(text)
#### convert to lowercase
#### remove new line and whitespace and punctuation from a string
def tidy_text(text):
    text = text.lower()
    text = re.sub(r'[\W_]', '', text)
    return text

####------FUNCTION (PROVIDED)---------------
#### get_corpus_from_file(myfile)
#### read in contents of myfile
#### return corpus as list of characters
def get_corpus_from_file(myfile):
    print("Reading corpus from file:", myfile)
    with open(myfile) as f:  # the file is closed when this block ends
        corpus = f.read()
    tidied_corpus = tidy_text(corpus)
    corpus = list(tidied_corpus)  ## split text into list of characters
    print("The corpus consists of a sequence of", len(corpus), "letters")
    return corpus

####------FUNCTION (PROVIDED)---------------
#### write_corpus_to_file(mycorpus, myfile)
#### write the contents of mycorpus into myfile
def write_corpus_to_file(mycorpus, myfile):
    f = open(myfile, 'w')
    newcorpus = ''.join(mycorpus)
    f.write(newcorpus)
    f.close()

####------FUNCTION (REQUIRED) [Question 2]------------
#### segment_sequence(corpus, letter1, letter2, letter3)
#### calculate transition probability letter1 to letter2:
####     p(letter2 | letter1) = count(sequence letter1 letter2)/count(letter1)
#### calculate transition probability letter2 to letter3 as:
####     p(letter3 | letter2) = count(sequence letter2 letter3)/count(letter2)
#### print a message to the user:
####     "Here is the proposed word boundary given the training corpus:"
#### print the first half of the segmentation as:
####        "Proposed end of one word:" ____
#### print the second half of the segmentation as:
####        "Proposed beginning of the new word:" ____

####------CODE REQUIRED HERE [Question 1]-------------
#### store the contents of corpus.txt in a variable
#### using get_corpus_from_file(myfile)
#### (the variable will contain a python list)

#####-------MAIN LOOP (PARTIALLY PROVIDED)---
##### Prompt the user for choice, and act on it.
while True:
    print("What do you want to do?")
    print("s (segment input to find the best word boundary)")
    print("a (add new input to the corpus)")
    print("d (damage the corpus)")
    print("q (quit)")
    user_input = input("> ")

    #   -------CODE REQUIRED HERE [Question 3]---
    #   if the user input is "q"
    #       save list in file using write_corpus_to_file(mycorpus,myfile)
    #       say goodbye
    #       exit the main while-loop

    #   -------CODE REQUIRED HERE [Question 4]
    #   else if the user input is "s"
    #       prompt user for sequence of 3 letters (e.g., thb)
    #       split the sequence into 3 individual letters
    #       segment the user's 3 letters

    #   -------CODE REQUIRED HERE [Question 5]
    #   else if the user input is "a"
    #       prompt user for new input, tidy the text (see tidy_text above)
    #       append the new input to the end of the corpus
    #       report the size of the corpus

    #   -------CODE REQUIRED HERE [Question 6]
    #   else if the user input is "d"
    #       prompt the user for the sequence of letters to be removed
    #       tidy if necessary
    #       remove that sequence from the corpus
    #       report the size of the corpus

    #  Uncomment this line below when you've written the above code
    #  for the 'if' statements; otherwise Python will return an error
    #  for an unheralded 'else' statement
    #    else:
    #        print("Your input was not recognized as a valid choice.")

#   ------- PROSE ANSWER REQUIRED HERE (Question 8)
"""
Make sure to keep your answer within the triple quotes below;
otherwise what you write will be interpreted as Python code.
"""

Solution 

#!/usr/bin/env python

import re

####------FUNCTION (PROVIDED)---------------
#### tidy_text(text)
#### convert to lowercase
#### remove new line and whitespace and punctuation from a string
def tidy_text(text):
    text = text.lower()
    text = re.sub(r'[\W_]', '', text)
    return text

####------FUNCTION (PROVIDED)---------------
#### get_corpus_from_file(myfile)
#### read in contents of myfile
#### return corpus as list of characters
def get_corpus_from_file(myfile):
    print("Reading corpus from file:", myfile)
    with open(myfile) as f:  # the file is closed when this block ends
        corpus = f.read()
    tidied_corpus = tidy_text(corpus)
    corpus = list(tidied_corpus)  ## split text into list of characters
    print("The corpus consists of a sequence of", len(corpus), "letters")
    return corpus

####------FUNCTION (PROVIDED)---------------
#### write_corpus_to_file(mycorpus, myfile)
#### write the contents of mycorpus into myfile
def write_corpus_to_file(mycorpus, myfile):
    f = open(myfile, 'w')
    newcorpus = ''.join(mycorpus)
    f.write(newcorpus)
    f.close()

####------FUNCTION (REQUIRED) [Question 2]------------
#### segment_sequence(corpus, letter1, letter2, letter3)
#### calculate transition probability letter1 to letter2:
####     p(letter2 | letter1) = count(sequence letter1 letter2)/count(letter1)
#### calculate transition probability letter2 to letter3 as:
####     p(letter3 | letter2) = count(sequence letter2 letter3)/count(letter2)
#### print a message to the user:
####     "Here is the proposed word boundary given the training corpus:"
#### print the first half of the segmentation as:
####        "Proposed end of one word:" ____
#### print the second half of the segmentation as:
####        "Proposed beginning of the new word:" ____
def segment_sequence(corpus, letter1, letter2, letter3):
    corpus_str = "".join(corpus)   # list to str
    if corpus.count(letter1) == 0:    # make sure we don't have zero division
        print("Segmentation unavailable because corpus does not contain the following letter: " + letter1)
    elif corpus.count(letter2) == 0:  # make sure we don't have zero division
        print("Segmentation unavailable because corpus does not contain the following letter: " + letter2)
    else:
        # transition probability letter1 to letter2
        p21 = corpus_str.count(letter1 + letter2)/corpus.count(letter1)
        # transition probability letter2 to letter3
        p32 = corpus_str.count(letter2 + letter3)/corpus.count(letter2)
        print("Here is the proposed word boundary given the training corpus:")
        if p21 > p32:   # letter1 to letter2 is more probable than letter2 to letter3
            print("Proposed end of one word: " + letter1 + letter2)
            print("Proposed beginning of the new word: " + letter3)
        else:
            print("Proposed end of one word: " + letter1)
            print("Proposed beginning of the new word: " + letter2 + letter3)

####------CODE REQUIRED HERE [Question 1]-------------
#### store the contents of corpus.txt in a variable
#### using get_corpus_from_file(myfile)
#### (the variable will contain a python list)
corpus = get_corpus_from_file('corpus.txt')

#####-------MAIN LOOP (PARTIALLY PROVIDED)---
##### Prompt the user for choice, and act on it.
while True:
    print("What do you want to do?")
    print("s (segment input to find the best word boundary)")
    print("a (add new input to the corpus)")
    print("d (damage the corpus)")
    print("q (quit)")
    user_input = input("> ")

    #   -------CODE REQUIRED HERE [Question 3]---
    #   if the user input is "q"
    #       save list in file using write_corpus_to_file(mycorpus,myfile)
    #       say goodbye
    #       exit the main while-loop
    if user_input == "q":
        print("Writing corpus to file: corpus.txt")
        write_corpus_to_file(corpus, "corpus.txt")
        print("Good bye")
        break

    #   -------CODE REQUIRED HERE [Question 4]
    #   else if the user input is "s"
    #       prompt user for sequence of 3 letters (e.g., thb)
    #       split the sequence into 3 individual letters
    #       segment the user's 3 letters
    elif user_input == "s":
        text = tidy_text(input("Enter the sequence of 3 letters (e.g., thb): "))
        if len(text) == 3:   # unpacking below fails for any other length
            a, b, c = text   # split into 3 letters
            segment_sequence(corpus, a, b, c)
        else:
            print("Please enter exactly 3 letters.")

    #   -------CODE REQUIRED HERE [Question 5]
    #   else if the user input is "a"
    #       prompt user for new input, tidy the text (see tidy_text above)
    #       append the new input to the end of the corpus
    #       report the size of the corpus
    elif user_input == "a":
        text = input("Enter the sequence to add to the corpus: ")
        add_text = tidy_text(text)
        corpus.extend(list(add_text))   # append letters from add_text to corpus
        print("The corpus consists of a sequence of", len(corpus), "letters")

    #   -------CODE REQUIRED HERE [Question 6]
    #   else if the user input is "d"
    #       prompt the user for the sequence of letters to be removed
    #       tidy if necessary
    #       remove that sequence from the corpus
    #       report the size of the corpus
    elif user_input == "d":
        text = input("Enter the sequence to remove from the corpus: ")
        del_text = tidy_text(text)
        corpus_str = "".join(corpus)   # list to str
        if del_text in corpus_str:  # make sure the sequence is present in corpus
            corpus = list(corpus_str.replace(del_text, ""))   # remove del_text and convert back to list
            print("The sequence " + del_text + " is removed from the corpus")
            print("The corpus consists of a sequence of", len(corpus), "letters")
        else:
            print("You can't remove a sequence that does not exist in the corpus: " + del_text)

    #  Uncomment this line below when you've written the above code
    #  for the 'if' statements; otherwise Python will return an error
    #  for an unheralded 'else' statement
    else:
        print("Your input was not recognized as a valid choice.")

#   ------- PROSE ANSWER REQUIRED HERE (Question 8)
"""
When a rule-based model is damaged, it simply loses the segmentations (rules)
that are removed from it, and the damage does not affect other parts of the
model.

Damage to the statistical model, on the other hand, also affects the parts of
the model adjacent to the damaged part. For example, removing the sequence
'the' from the corpus lowers the counts of 't', 'h' and 'e', which changes the
transition probabilities of other sequences that contain those letters. This
implies that the statistical model is more sensitive to damage than the
rule-based model.
"""