For this problem, I would like you to use hash techniques to identify the most common phrases and the most similar paragraphs in two classics by Mark Twain: “The Adventures of Tom Sawyer” and “Adventures of Huckleberry Finn.” Specifically, I would like your program to do the following:
Part 1: Finding the most common phrases. For N = 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10, list the top 10 most common phrases of N consecutive words in these two novels, together with the frequency of each phrase in each novel. The format for each phrase in the output shall be:
phrase frequency_in_Tom_Sawyer frequency_in_Huckleberry_Finn
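As a rough sketch of the counting step for Part 1, the Python below uses a hash table (a `Counter`) keyed by phrase. The assignment does not prescribe a language, so Python is used here only for illustration. The function names `ngram_counts` and `top_common_phrases` are hypothetical. The sketch also assumes that a "common" phrase is one that occurs in both novels, and that phrases are ranked by their combined frequency; neither is stated explicitly in the problem.

```python
from collections import Counter


def ngram_counts(words, n):
    """Count every run of n consecutive words, keyed by the joined phrase."""
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


def top_common_phrases(words_a, words_b, n, k=10):
    """Return up to k (phrase, count_a, count_b) rows for phrases in both texts.

    Assumption: rank by combined frequency, breaking ties alphabetically.
    """
    counts_a = ngram_counts(words_a, n)
    counts_b = ngram_counts(words_b, n)
    shared = counts_a.keys() & counts_b.keys()  # phrases present in both hashes
    ranked = sorted(shared, key=lambda p: (-(counts_a[p] + counts_b[p]), p))
    return [(p, counts_a[p], counts_b[p]) for p in ranked[:k]]


# Tiny stand-in word lists; the real input would be the words of each novel.
tom = "the cat sat on the mat and the cat slept".split()
huck = "the cat ran and the dog sat on the mat".split()
rows = top_common_phrases(tom, huck, 2)
for phrase, in_tom, in_huck in rows:
    print(phrase, in_tom, in_huck)
```

Each printed row follows the required output format. Running the same function for N = 1 through 10 covers all the cases the problem asks for.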
Part 2: Finding the most similar paragraphs. Below is a paragraph from another of Mark Twain’s novels:
“THE Mississippi is well worth reading about. It is not a commonplace river, but on the contrary is in all ways remarkable. Considering the Missouri its main branch, it is the longest river in the world—four thousand three hundred miles. It seems safe to say that it is also the crookedest river in the world, since in one part of its journey it uses up one thousand three hundred miles to cover the same ground that the crow would fly over in six hundred and seventy-five. It discharges three times as much water as the St. Lawrence, twenty-five times as much as the Rhine, and three hundred and thirty-eight times as much as the Thames. No other river has so vast a drainage-basin: it draws its water supply from twenty-eight States and Territories; from Delaware, on the Atlantic seaboard, and from all the country between that and Idaho on the Pacific slope - a spread of forty-five degrees of longitude. The Mississippi receives and carries to the Gulf water from fifty-four subordinate rivers that are navigable by steamboats, and from some hundreds that are navigable by flats and keels. The area of its drainage-basin is as great as the combined areas of England, Wales, Scotland, Ireland, France, Spain, Portugal, Germany, Austria, Italy, and Turkey; and almost all this wide region is fertile; the Mississippi valley, proper, is exceptionally so.”
I want you to expand your program for Part 1 to do the following:
List the top 10 paragraphs in each of the two novels,
“The Adventures of Tom Sawyer” and “Adventures of Huckleberry Finn,” that are most similar to the paragraph given above.
One critical part is how to define similarity between two paragraphs. I would like you to use the
simplest method: count the number of words that appear in both paragraphs and use that
number as the similarity measure.
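The similarity measure described above could be sketched in Python roughly as follows. The names `shared_word_count` and `most_similar` are hypothetical. The sketch assumes that each distinct word is counted once (a set intersection) and that punctuation is left attached to words; the problem statement does not settle either point, so treat both as choices to revisit.

```python
def shared_word_count(paragraph_a, paragraph_b):
    """Number of distinct words that occur in both paragraphs (assumed measure)."""
    words_a = set(paragraph_a.lower().split())
    words_b = set(paragraph_b.lower().split())
    return len(words_a & words_b)


def most_similar(paragraphs, target, k=10):
    """Return up to k (score, paragraph) pairs, highest score first."""
    scored = [(shared_word_count(p, target), p) for p in paragraphs]
    scored.sort(key=lambda pair: -pair[0])  # stable sort: ties keep document order
    return scored[:k]


# Tiny stand-in data; the real input would be one novel's paragraphs
# and the Mississippi paragraph quoted in the problem.
target = "the river is long and the river is wide"
paragraphs = [
    "the boat is on the river",
    "tom painted a fence",
    "a long wide river",
]
best = most_similar(paragraphs, target, k=2)
for score, paragraph in best:
    print(score, paragraph)
```

Calling `most_similar` once per novel, with `k=10`, gives the two lists that Part 2 asks for.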
Note for both Part 1 and Part 2: When you solve Part 1 and Part 2, you shall convert all phrases and words to lower case for comparison. You shall also use exactly one blank space to separate any two words. For example, “cat   dog” (with several spaces between the words) is preprocessed into “cat dog”.
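A minimal sketch of this preprocessing step in Python, assuming a hypothetical helper named `normalize` and assuming that punctuation is left untouched (the note only mentions case and spacing):

```python
def normalize(text):
    """Lowercase the text and collapse every run of whitespace to one space."""
    # str.split() with no arguments splits on any whitespace run and drops
    # leading/trailing whitespace, so joining with " " yields single spaces.
    return " ".join(text.lower().split())


print(normalize("Cat    Dog"))
```

Applying the same function to both the novels and the query paragraph keeps the comparison consistent.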
Solution
# Initialize hash with two key-value pairs.
my %animals = (cat => "meow", dog => "bark");
say %animals;

# Add another key-value pair.
%animals.push: (bird => "chirp");
say %animals;

# This returns the cat value.
my $cat = %animals<cat>;