Summary You will implement a little search engine to do two
Summary
You will implement a little search engine to do two things: (a) gather and index keywords that appear in a set of plain text documents, and (b) search for user-input keywords against the index and return a list of matching documents in which these keywords occur.
https://www.cs.rutgers.edu/courses/112/classes/fall_2016_venugopal/progs/prog4/prog4.html
Implementation
Download the attached lse_project.zip file (https://drive.google.com/open?id=0B7pAsB1LuvrWRWtJajNURHo2Y0k) to your computer. DO NOT unzip it. Instead, follow the instructions on the Eclipse page under the section \"Importing a Zipped Project into Eclipse\" to get the entire project into your Eclipse workspace.
You will see a project called Little Search Engine with a single class, search.LittleSearchEngine. This is where you will fill in your code, see details that follow.
The project also contains two sample text documents, AliceCh1.txt, and WowCh1.txt, directly under the project folder. You may use these samples to test your program. (Be sure to get other online text documents--or make your own--for more rigorous testing.) The names of these files are in a file called docs.txt that may be given as input to a driver that runs the search engine, which can in turn send this file name as the first argument to the makeIndex method in LittleSearchEngine, to match with the docsFile parameter.
The project also has a noisewords.txt file that contains a list of \"noise\" words, one per line. Noise words are commonplace words (such as \"the\") that must be ignored by the search engine. You will use this file (and this file ONLY) to filter out noise words from the documents you read, when gathering keywords.
NOTE: You will need to write your own driver to test your implementation. This driver can be given a file that contains the names of all the documents, as well as the noise words file, as input. It can then set up a LittleSearchEngine object and call its methods as needed to test the implementation.
Following is the sequence of method calls that will be performed on a LittleSearchEngine object, to index and search keywords.
LittleSearchEngine() - Already implemented.
The constructor creates new (empty) keywordsIndex and noiseWords hash tables. The keywordsIndex hash table is the MASTER hash table, which indexes all keywords from all input documents. The noiseWords hash table stores all the noise words. Both of these are fields in the LittleSearchEngine class.
Every key in the keywordsIndex hash table is a keyword. The associated value for a keyword is an array list of (document,frequency) pairs for the documents in which the keyword occurs, arranged in descending order of frequencies. A (document,frequency) pair is held in an Occurrenceobject. The Occurrence class is defined in the LittleSearchEngine.java file, at the top. In an Occurrence object, the document field is the name of the document, which is basically the file name, e.g. AliceCh1.txt.
void makeIndex(String docsFile, String noiseWordsFile) - Already implemented.
Indexes all the keywords in all the input documents. See the method documentation and body in the LittleSearchEngine.java file for details.
If you want to index the given sample documents, the first parameter would be the file docs.txt and the second parameter would be the noise words file, noisewords.txt
After this method finishes executing, the full index of all keywords found in all input documents will be in the keywordsIndex hash table.
The makeIndex methods calls methods loadKeyWords and mergeKeyWords, both of which you need to implement.
HashMap loadKeyWords(String docFile) - You implement.
This method creates a hash table for all keywords in a single given document. See the method documentation for details.
This method MUST call the getKeyWord method, which you need to implement.
String getKeyWord(String word) - You implement.
Given an input word read from a document, it checks if the word is a keyword, and returns the keyword equivalent if it is.
See the method documentation for details. Here are some examples of input parameter word, and returned value.
Note that if there is more than one trailing punctuation (as in the \"World...\" and \"World?!\" examples above), the method strips all of them. Also, the last example makes it clear that punctuation appearing anywhere but at the end is not stripped, and the word is rejected.
void mergeKeyWords - You implement.
Merges the keywords loaded from a single document (in method loadKeyWords) into the global keywordsIndex hash table.
See the method documentation for details. This method MUST call the insertLastOccurence method, which you need to implement.
ArrayList insertLastOccurrence(ArrayList occs) - You implement.
See the method documentation for details. Note that this method uses binary search on frequency values to do the insertion. The return value is the sequence of mid points encountered during the search, using the regular (not lazy) binary search we covered in class. This return value is not used by the calling method-it is only going to be used for grading this method.
For example, suppose the list had the following frequency values (including the last one, which is to be inserted):
Then, the binary search (on the list excluding the last item) would encounter the following sequence of midpoint indexes:
Note that if a subarray has an even number of items, then the midpoint is the last item in the first half.
After inserting 6, the input list would be updated to this:
and the sequence 2 4 3 would be returned.
If the new item is a duplicate of something that already exists, it doesn\'t matter if the new item is placed before or after the existing item.
Note that the items are in DESCENDING order, so the binary search would have to be done accordingly.
ArrayList top5search(String kw1, String kw2) - You implement.
This method computes the search result for the input \"kw1 OR kw2\", using the keywordsIndex hash table. The result is a list of NAMES of documents (limited to the top 5) in which either of the words \"kw1\" or \"kw2\" occurs, arranged in descending order of frequencies. See the method documentation in the code for additional details.
As an example, suppose the search is for \"deep or world\", in the given test documents, AliceCh1.txt (call it A) and WowCh1.txt (call it W). The word \"deep\" occurs twice in A and once in W, and the word \"world\" occurs once in A and 7 times in W:
The result of the search is:
in that order. (Recall that the name of a document is the same as the name of the document file.)
NOTE:
If a document occurs in both keywords\' match list, consider the one with the higher frequency - do NOT add frequencies.
Return AT MOST 5 non-duplicate entries. This means if there are more than 5 non-duplicate entries, then return the five top frequency entries, but if there are fewer than 5 non-duplicate entries, then return all of them.
If a document in the first match list (for the first keyword) has the same frequency as a document in the second match list (for the second keyword), and both are candidates for inclusion in the output (they are not the same document), then pick the document in the first list before the document in the second list.
Implementation Rules
Do NOT change the package name on the first line.
Do NOT add any import statements. With the existing imports, you may use any of the classes in java.lang, java.io and java.util.
Do NOT change the class Occurrence in ANY way.
Do NOT change the headers of any of the existing methods in LittleSearchEngine in ANY way.
Do NOT change the code in any of the implemented methods in ANY way.
Do NOT add any class fields in LittleSearchEngine.
You MAY add helper methods to LittleSearchEngine, but you must make them private.
Grading
When a method is graded, the correct versions of other methods will be used. Also, all data structures will be set to their correct (expected) states before a method is called.
Submission
Submit your LittleSearchEngine.java file.
| Input Parameter | Returned value |
|---|---|
| distance. | distance (strip off period) |
| equi-distant | null (not all alphabetic characters) |
| Rabbit | rabbit (convert to lowercase) |
| Between | null (noise word) |
| we\'re | null (not all alphabetic characters) |
| World... | world (strip trailing periods) |
| World?! | world (strip trailing ? and !) |
| What,ever | null (not all alphabetic characters) |
Solution
package search;
import java.io.*;
import java.util.*;
/**
* This class encapsulates an occurrence of a keyword in a document. It stores
* the document name, and the frequency of occurrence in that document.
* Occurrences are associated with keywords in an index hash table.
*
*/
class Occurrence {
/**
* Document in which a keyword occurs.
*/
String document;
/**
* The frequency (number of times) the keyword occurs in the above document.
*/
int frequency;
/**
* Initializes this occurrence with the given document,frequency pair.
*
* @param doc
* Document name
* @param freq
* Frequency
*/
public Occurrence(String doc, int freq) {
document = doc;
frequency = freq;
}
/*
* (non-Javadoc)
*
* @see java.lang.Object#toString()
*/
public String toString() {
return \"(\" + document + \",\" + frequency + \")\";
}
}
/**
* This class builds an index of keywords. Each keyword maps to a set of
* documents in which it occurs, with frequency of occurrence in each document.
* Once the index is built, the documents can searched on for keywords.
*
*/
public class LittleSearchEngine {
/**
* This is a hash table of all keywords. The key is the actual keyword, and
* the associated value is an array list of all occurrences of the keyword
* in documents. The array list is maintained in descending order of
* occurrence frequencies.
*/
HashMap<String, ArrayList<Occurrence>> keywordsIndex;
/**
* The hash table of all noise words - mapping is from word to itself.
*/
HashMap<String, String> noiseWords;
/**
* Creates the keyWordsIndex and noiseWords hash tables.
*/
public LittleSearchEngine() {
keywordsIndex = new HashMap<String, ArrayList<Occurrence>>(1000, 2.0f);
noiseWords = new HashMap<String, String>(100, 2.0f);
}
/**
* This method indexes all keywords found in all the input documents. When
* this method is done, the keywordsIndex hash table will be filled with all
* keywords, each of which is associated with an array list of Occurrence
* objects, arranged in decreasing frequencies of occurrence.
*
* @param docsFile
* Name of file that has a list of all the document file names,
* one name per line
* @param noiseWordsFile
* Name of file that has a list of noise words, one noise word
* per line
* @throws FileNotFoundException
* If there is a problem locating any of the input files on disk
*/
public void makeIndex(String docsFile, String noiseWordsFile)
throws FileNotFoundException {
boolean valid = false;
while (!valid) {
try {
Scanner sc = new Scanner(new File(noiseWordsFile));
valid = true;
} catch (FileNotFoundException e) {
return;
}
}
Scanner sc = new Scanner(new File(noiseWordsFile));
while (sc.hasNext()) {
String word = sc.next();
noiseWords.put(word, word);
}
valid = false;
while (!valid) {
try {
sc = new Scanner(new File(docsFile));
valid = true;
} catch (FileNotFoundException e) {
return;
}
}
sc = new Scanner(new File(docsFile));
while (sc.hasNext()) {
String docFile = sc.next();
HashMap<String, Occurrence> kws = loadKeyWords(docFile);
mergeKeyWords(kws);
}
}
/**
* Scans a document, and loads all keywords found into a hash table of
* keyword occurrences in the document. Uses the getKeyWord method to
* separate keywords from other words.
*
* @param docFile
* Name of the document file to be scanned and loaded
* @return Hash table of keywords in the given document, each associated
* with an Occurrence object
* @throws FileNotFoundException
* If the document file is not found on disk
*/
public HashMap<String, Occurrence> loadKeyWords(String docFile)
throws FileNotFoundException {
HashMap<String, Occurrence> keywords = new HashMap<String, Occurrence>();
boolean valid = false;
while (!valid) {
try {
Scanner words = new Scanner(new File(docFile));
valid = true;
} catch (FileNotFoundException e) {
return keywords;
}
}
Scanner words = new Scanner(new File(docFile));
int freq = 1;
while (words.hasNext()) {
String word = words.next();
if (getKeyWord(word) != null) {
word = getKeyWord(word);
if (!keywords.containsKey(word)) {
Occurrence occurs = new Occurrence(docFile, freq);
keywords.put(word, occurs);
} else {
keywords.get(word).frequency++;
}
}
}
return keywords;
}
/**
* Merges the keywords for a single document into the master keywordsIndex
* hash table. For each keyword, its Occurrence in the current document must
* be inserted in the correct place (according to descending order of
* frequency) in the same keyword\'s Occurrence list in the master hash
* table. This is done by calling the insertLastOccurrence method.
*
* @param kws
* Keywords hash table for a document
*/
public void mergeKeyWords(HashMap<String, Occurrence> kws) {
ArrayList<Occurrence> list = new ArrayList<Occurrence>();
for (String key : kws.keySet()) {
Occurrence occ = kws.get(key);
if (!keywordsIndex.containsKey(key)) {
ArrayList<Occurrence> occurList = new ArrayList<Occurrence>();
occurList.add(occ);
keywordsIndex.put(key, occurList);
} else {
list = keywordsIndex.get(key);
list.add(occ);
insertLastOccurrence(list);
keywordsIndex.put(key, list);
}
}
}
/**
* Given a word, returns it as a keyword if it passes the keyword test,
* otherwise returns null. A keyword is any word that, after being stripped
* of any trailing punctuation, consists only of alphabetic letters, and is
* not a noise word. All words are treated in a case-INsensitive manner.
*
* Punctuation characters are the following: \'.\', \',\', \'?\', \':\', \';\' and \'!\'
*
* @param word
* Candidate word
* @return Keyword (word without trailing punctuation, LOWER CASE)
*/
public String getKeyWord(String word) {
word = word.trim();
char end = word.charAt(word.length() - 1);
while (end == \'.\' || end == \',\' || end == \'?\' || end == \':\'
|| end == \';\' || end == \'!\') {
word = word.substring(0, word.length() - 1);
if (word.length() > 1) {
end = word.charAt(word.length() - 1);
} else {
break;
}
}
word = word.toLowerCase();
for (String noiseWord : noiseWords.keySet()) {
if (word.equalsIgnoreCase(noiseWord)) {
return null;
}
}
for (int j = 0; j < word.length(); j++) {
if (!Character.isLetter(word.charAt(j))) {
return null;
}
}
return word;
}
/**
* Inserts the last occurrence in the parameter list in the correct position
* in the same list, based on ordering occurrences on descending
* frequencies. The elements 0..n-2 in the list are already in the correct
* order. Insertion is done by first finding the correct spot using binary
* search, then inserting at that spot.
*
* @param occs
* List of Occurrences
* @return Sequence of mid point indexes in the input list checked by the
* binary search process, null if the size of the input list is 1.
* This returned array list is only used to test your code - it is
* not used elsewhere in the program.
*/
public ArrayList<Integer> insertLastOccurrence(ArrayList<Occurrence> occs) {
if (occs.size() == 1) {
return null;
}
int lastFreq = occs.get(occs.size() - 1).frequency;
Occurrence temp = occs.get(occs.size() - 1);
int lo = 0;
int hi = occs.size() - 1;
int mid;
ArrayList<Integer> midIndex = new ArrayList<Integer>();
while (lo <= hi) {
mid = (lo + hi) / 2;
midIndex.add(mid);
if (lastFreq > occs.get(mid).frequency) {
hi = mid - 1;
} else if (lastFreq < occs.get(mid).frequency) {
lo = mid + 1;
} else {
break;
}
}
if (midIndex.get(midIndex.size() - 1) == 0) {
if (temp.frequency < occs.get(0).frequency) {
occs.add(1, temp);
occs.remove(occs.size() - 1);
return midIndex;
}
}
occs.add(midIndex.get(midIndex.size() - 1), temp);
occs.remove(occs.size() - 1);
/*
* for(int i = 0; i < midIndex.size(); i++) {
* System.out.print(midIndex.get(i) + \" \"); }
*
* System.out.println();
*/
return midIndex;
}
/**
* Search result for \"kw1 or kw2\". A document is in the result set if kw1 or
* kw2 occurs in that document. Result set is arranged in descending order
* of occurrence frequencies. (Note that a matching document will only
* appear once in the result.) Ties in frequency values are broken in favor
* of the first keyword. (That is, if kw1 is in doc1 with frequency f1, and
* kw2 is in doc2 also with the same frequency f1, then doc1 will appear
* before doc2 in the result. The result set is limited to 5 entries. If
* there are no matching documents, the result is null.
*
* @param kw1
* First keyword
* @param kw1
* Second keyword
* @return List of NAMES of documents in which either kw1 or kw2 occurs,
* arranged in descending order of frequencies. The result size is
* limited to 5 documents. If there are no matching documents, the
* result is null.
*/
public ArrayList<String> top5search(String kw1, String kw2) {
ArrayList<String> resultList = new ArrayList<String>();
ArrayList<Occurrence> listOne = new ArrayList<Occurrence>();
kw1 = kw1.toLowerCase();
if (keywordsIndex.get(kw1) != null) {
listOne = keywordsIndex.get(kw1);
}
ArrayList<Occurrence> listTwo = new ArrayList<Occurrence>();
kw2 = kw2.toLowerCase();
if (keywordsIndex.get(kw2) != null) {
listTwo = keywordsIndex.get(kw2);
}
/*
* System.out.println(); System.out.print(kw1 + \": \"); for(int i = 0; i
* < listOne.size(); i++) { //print out the first list if(i+1 ==
* listOne.size()) { System.out.print(listOne.get(i)); } else {
* System.out.print(listOne.get(i) + \", \"); } }
*
* System.out.println(); System.out.print(kw2 + \": \"); for(int i = 0; i
* < listTwo.size(); i++) { //print out the second list if(i+1 ==
* listTwo.size()) { System.out.print(listTwo.get(i)); } else {
* System.out.print(listTwo.get(i) + \", \"); } }
*/
for (int p = 0; p < listOne.size(); p++) {
if (resultList.size() <= 4) {
int L1 = listOne.get(p).frequency;
String docNameOne = listOne.get(p).document;
for (int w = 0; w < listTwo.size(); w++) {
String docNameTwo = listTwo.get(w).document;
int L2 = listTwo.get(w).frequency;
if (L2 <= L1) {
if (!resultList.contains(docNameOne)
&& resultList.size() <= 4) {
resultList.add(docNameOne);
}
} else if (L2 > L1) {
if (!resultList.contains(docNameTwo)
&& resultList.size() <= 4) {
resultList.add(docNameTwo);
}
}
}
}
}
for (int k = 0; k < resultList.size(); k++) {
if (k + 1 == resultList.size()) {
System.out.print(resultList.get(k));
} else {
System.out.print(resultList.get(k) + \", \");
}
}
if (resultList.size() == 0) {
return null;
}
return resultList;
}
}
TestDriver.java
package apps;
import search.*;
import java.io.*;
import java.util.*;
class Occurrence {
String document;
int frequency;
public Occurrence(String doc, int freq) {
document = doc;
frequency = freq;
}
/* (non-Javadoc)
* @see java.lang.Object#toString()
*/
public String toString() {
return \"(\" + document + \",\" + frequency + \")\";
}
}
public class TestDriver
{
static BufferedReader keyboard = new BufferedReader(new InputStreamReader(System.in));
public static void main(String args[]) throws IOException
{
String docsFile = \"docs.txt\";
String noiseWords = \"noisewords.txt\";
LittleSearchEngine google = new LittleSearchEngine();
google.makeIndex(docsFile, noiseWords);
String kw1 = \"Between\";
String kw2 = \"distance\";
google.top5search(kw1, kw2);
System.out.println(google.getKeyWord(\"World?!\"));
}}









