Hello, I need help with the following assignment:
This assignment will give you practice in manipulating lists of data stored in arrays.
1 General Instructions
Read the general instructions on preparing and submitting assignments.
Read the Counting Words problem description below. Implement the histogram function to complete the desired program.
You must use dynamically allocated arrays for this purpose.
For your initial implementation, use ordered insertion to keep the words in order and ordered sequential search when looking for words. Note that the array utility functions from the lecture notes are available to you as part of the provided code.
Although we are counting words in this program, the general pattern of counting occurrences of things is a common analysis step in laboratory work, statistical studies, and business tasks. The results of such a program are often fed into other programs for further processing and/or display. Such results are often displayed as histograms. The CSV output format is a common “data exchange” format recognized by many programs. Almost all spreadsheets, for example, will read CSV files.
When you have the program running, execute it using a short paragraph of text as an input, saving the output in a file ending with a “.csv” extension. Run a spreadsheet program (e.g., Microsoft Excel). You should be able to load your .csv file directly into the spreadsheet.
Try displaying your results as a histogram. In Excel (2007), for example, select the two columns of data, choose “Sort” from the Data tab and sort your data on the numeric column. Then, with the two sorted columns of data still selected, go to the Insert tab and select a 2D Bar chart. Save your spreadsheet in Excel (.xls) format. You will turn this in later.
As documents get larger, the total number of words increases far, far faster than does the number of distinct words. Our personal vocabularies are only so large, after all. In fact, most writers unconsciously limit themselves to writing with a small fraction of their personal “reading” vocabularies. So most words in a large document are bound to be repeats. That means that, for this application, the speed of the functions for searching for words is probably more important than the speed of the functions for inserting new words into the array. We do many more searches than insertions.
Try running your program on one of the large text files provided in the assignment directory. Time it to see how long it takes. Now replace all uses of ordered sequential search by calls to the binary search function. Run it on the same input, timing it again. You should see a substantial improvement.
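To get a feel for why the switch to binary search helps, here is a minimal sketch that counts how many array elements each strategy examines. The names `orderedSeqSearchSteps` and `binarySearchSteps` are hypothetical and are not part of arrayUtils.h; they follow the same logic as the provided searches but return a probe count instead of a position.

```cpp
#include <string>

// Hypothetical helper: number of elements examined by an ordered
// sequential search (stops at the first element >= item).
int orderedSeqSearchSteps(const std::string list[], int listLength,
                          const std::string& item)
{
    int steps = 0;
    int loc = 0;
    while (loc < listLength) {
        ++steps;                          // one element examined
        if (!(list[loc] < item))
            break;                        // found it, or passed where it would be
        ++loc;
    }
    return steps;
}

// Hypothetical helper: number of elements examined by a binary search.
int binarySearchSteps(const std::string list[], int listLength,
                      const std::string& item)
{
    int steps = 0;
    int first = 0;
    int last = listLength - 1;
    while (first <= last) {
        ++steps;                          // one element examined
        int mid = first + (last - first) / 2;
        if (list[mid] == item)
            break;
        if (item < list[mid])
            last = mid - 1;               // continue in the lower half
        else
            first = mid + 1;              // continue in the upper half
    }
    return steps;
}
```

On the 13 distinct words of the sample input, looking up "words" (the last entry) examines all 13 elements sequentially but only 4 with binary search. The gap grows quickly as the number of distinct words increases.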
Use the button below to submit your completed program and your saved spreadsheet.
2 Problem Description
2.1 Counting Words
Develop a program to prepare a list of all words occurring in a plain-text document, counting how many times each word occurs. In determining whether two words are different, punctuation (non-alphabetic characters) other than hyphens and differences in upper/lower case should be ignored. Portions of text that cannot be interpreted as words (e.g.: “ !@#%* ”) should be ignored.
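As an illustration of these rules, here is a small sketch of the word normalization. The function name `normalizeWord` is hypothetical, and it uses the `<cctype>` classification functions rather than the explicit character-range tests in the provided starter code; under the default "C" locale the two approaches should agree for plain ASCII text.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: keeps hyphens, lower-cases letters, drops everything else.
// May return an empty string if the input holds no letters or hyphens.
std::string normalizeWord(const std::string& raw)
{
    std::string result;
    for (char c : raw) {
        // <cctype> functions require a value representable as unsigned char
        unsigned char uc = static_cast<unsigned char>(c);
        if (c == '-')
            result += c;
        else if (std::isalpha(uc))
            result += static_cast<char>(std::tolower(uc));
    }
    return result;
}
```

With this sketch, `normalizeWord("Words,")` gives `words`, `normalizeWord("doesn't")` gives `doesnt`, `normalizeWord("built-in")` is unchanged, and `normalizeWord("!@#%*")` gives the empty string, which the caller should skip.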
The program will be launched with either one or three parameters on the command line. The first parameter is always the maximum number of distinct words (i.e., discounting repetitions of the same word) expected in the input.
If that is the only parameter supplied, e.g.,
then the program reads from the standard input (cin) and writes its output to the standard output (cout).
Optionally, the program can be run with two additional parameters, the file names to be used for input and output respectively, e.g.,
The function you will write, which does the actual input, counting, and output, should read a word at a time (the usual string >> operator will do just fine). Each word should be processed to remove punctuation and convert upper-case characters to lower case, then checked against an array of already encountered words. If it is in there, increment an associated counter. If not, it must be added.
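The "increment if present, otherwise add" step can be sketched on its own as follows. The function `recordWord` and its parameters are hypothetical names chosen for illustration; it keeps two parallel arrays in alphabetical order and does its own search and shifting rather than calling the arrayUtils.h functions.

```cpp
#include <string>

// Hypothetical helper: records one occurrence of word in two parallel arrays
// (words[] kept in ascending order, counts[] holding the matching counters).
// Returns false only when word is new and the arrays are already full.
bool recordWord(std::string* words, int* counts, int& size, int capacity,
                const std::string& word)
{
    // Ordered sequential search: stop at the first entry not less than word
    int pos = 0;
    while (pos < size && words[pos] < word)
        ++pos;

    if (pos < size && words[pos] == word) {
        ++counts[pos];                    // already known: just count it
        return true;
    }
    if (size >= capacity)
        return false;                     // new word, but no room left

    // Shift later entries up one slot in both arrays, then insert at pos
    for (int i = size; i > pos; --i) {
        words[i] = words[i - 1];
        counts[i] = counts[i - 1];
    }
    words[pos] = word;
    counts[pos] = 1;
    ++size;
    return true;
}
```

Note that the capacity check happens only after the search fails. A repeated word never needs extra room, so a full array is not an error until a genuinely new word arrives.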
2.2 Input
Input is plain text and continues to the end of file.
Example:
2.3 Output
If you are able to process the entire document using no more distinct words than indicated in the first command line parameter then your output should list all the distinct words encountered (in lower-case) together with the count of the number of occurrences of each such word.
The actual output format will be Comma-Separated Values (CSV) format. In this format, all non-numeric outputs are enclosed in double quotes (“…”) and successive values are separated by a comma or a line break. You should produce one line for each distinct word. Each line will contain the word and its occurrence count.
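A single output line in this format could be built as in the sketch below. The function name `csvLine` is hypothetical. It assumes the word has already been reduced to lower-case letters and hyphens, so no escaping of embedded quotes or commas is needed.

```cpp
#include <string>

// Hypothetical helper: formats one word/count pair as a CSV line (no newline).
// Assumes word contains only letters and hyphens, so no quote escaping is done.
std::string csvLine(const std::string& word, int count)
{
    return "\"" + word + "\"," + std::to_string(count);
}
```

For example, `csvLine("very", 2)` yields `"very",2`, which a spreadsheet reads as a text cell followed by a numeric cell.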
In the event that you find that the input contains more than the predicted maximum number of distinct words, no output is written to the indicated output stream. Instead, a message should be written to the standard error stream (cerr):
replacing ## by the predicted maximum number.
Example: For the sample input given earlier, if the program is invoked with a sufficiently large MaxCount parameter, the output would be
2.4 Notes
Your program should reside in the single file wordcount.cpp. You will be provided with the following files:
wordcount.cpp
A starter version of the program, including the main function. Pay special attention to the //* comments as these will guide the code you need to develop.
arrayUtils.h
Various array manipulation functions we have developed in class and in the text.
test0.in
The sample input from this writeup
*.txt
Some larger inputs (entire books), courtesy of the Gutenberg Project
bin
The bin directory contains Windows and Unix versions of the compiled solution. Use this to check your output by comparing your program and the solution on the same inputs.
makefile
Project management file for people working in Unix/Linux/CygWin
All files provided for this assignment can be found in this directory, or, if you are logged in to a CS Dept machine, in /home/cs333/Assignments/array_wc.
#include <cstdlib>
#include <iostream>
#include <string>
#include <fstream>
#include "arrayUtils.h"
using namespace std;
// Strips all punctuation (except hyphens) from a word and
// changes any upper-case characters to lower-case. Note that
// if a word has nothing except punctuation, this may reduce
// it to the empty string.
string reduceWords (string word)
{
string result;
for (int i = 0; i < word.size(); ++i)
{
char c = word[i];
if (c == '-')
result += c;
else if (c >= 'a' && c <= 'z')
result += c;
else if (c >= 'A' && c <= 'Z')
result += (c - 'A' + 'a'); // converts to lower-case
}
return result;
}
// Read words from the provided input stream, reducing each word and counting
// how many times each word appears. Write the resulting words and counts (in
// alphabetic order by word) in CSV format to the output stream.
// - Assume that the input contains a maximum of MaxWords distinct words
// (after reduction). If more distinct words than this are actually
// encountered, write nothing to the output stream but print an error
// message on the standard error stream.
void histogram(const int MaxWords, istream& input, ostream& output)
{
//* Declare an array of strings to hold all words
//* and an array of int to hold the number of times each word
//* encountered.
// Read the words from the input and count them
string word;
while (input >> word)
{
//* Reduce the word and, if any characters are left
//* check to see if the word is already in our array
//* If so, add one to that word's counter
//* If not, is there room to add it to the array?
//** If so, add the word and a counter to the arrays, setting the
//** counter to 1.
//** If not, print the error message and abort the program [exit(1);]
}
//* Print all the words found, with their counts, in .csv format.
//* Clean up the arrays we created
}
int main (int argc, char** argv)
{
if (argc != 2 && argc != 4)
{
cerr << "Usage: " << argv[0] << " MaxWords [inFileName outFileName]" << endl;
return -1;
}
int MaxWords = atoi(argv[1]);
if (argc == 2)
{
// No file names in command line - use standard in and standard out
histogram (MaxWords, cin, cout);
}
else
{
// Take input and output file names from the command line
ifstream in (argv[2]);
ofstream out (argv[3]);
histogram (MaxWords, in, out);
}
return 0;
}
#ifndef ARRAYUTILS_H
#define ARRAYUTILS_H
// Add to the end
// - Assumes that we have a separate integer (size) indicating how
// many elements are in the array
// - and that the "true" size of the array is at least one larger
// than the current value of that counter
template <typename T>
void addToEnd (T* array, int& size, T value)
{
array[size] = value;
++size;
}
// Add value into array[index], shifting all elements already in positions
// index..size-1 up one, to make room.
// - Assumes that we have a separate integer (size) indicating how
// many elements are in the array
// - and that the "true" size of the array is at least one larger
// than the current value of that counter
template <typename T>
void addElement (T* array, int& size, int index, T value)
{
// Make room for the insertion
int toBeMoved = size - 1;
while (toBeMoved >= index) {
array[toBeMoved+1] = array[toBeMoved];
--toBeMoved;
}
// Insert the new value
array[index] = value;
++size;
}
// Assume the elements of the array are already in order
// Find the position where value could be added to keep
// everything in order, and insert it there.
// Return the position where it was inserted
// - Assumes that we have a separate integer (size) indicating how
// many elements are in the array
// - and that the "true" size of the array is at least one larger
// than the current value of that counter
template <typename T>
int addInOrder (T* array, int& size, T value)
{
// Make room for the insertion
int toBeMoved = size - 1;
while (toBeMoved >= 0 && value < array[toBeMoved]) {
array[toBeMoved+1] = array[toBeMoved];
--toBeMoved;
}
// Insert the new value
array[toBeMoved+1] = value;
++size;
return toBeMoved+1;
}
// Search an array for a given value, returning the index where
// found or -1 if not found.
template <typename T>
int seqSearch(const T list[], int listLength, T searchItem)
{
int loc;
for (loc = 0; loc < listLength; loc++)
if (list[loc] == searchItem)
return loc;
return -1;
}
// Search an ordered array for a given value, returning the index where
// found or -1 if not found.
template <typename T>
int seqOrderedSearch(const T list[], int listLength, T searchItem)
{
int loc = 0;
while (loc < listLength && list[loc] < searchItem)
{
++loc;
}
if (loc < listLength && list[loc] == searchItem)
return loc;
else
return -1;
}
// Removes an element from the indicated position in the array, moving
// all elements in higher positions down one to fill in the gap.
template <typename T>
void removeElement (T* array, int& size, int index)
{
int toBeMoved = index + 1;
while (toBeMoved < size) {
array[toBeMoved-1] = array[toBeMoved];
++toBeMoved;
}
--size;
}
// Search an ordered array for a given value, returning the index where
// found or -1 if not found.
template <typename T>
int binarySearch(const T list[], int listLength, T searchItem)
{
int first = 0;
int last = listLength - 1;
int mid;
bool found = false;
while (first <= last && !found)
{
mid = (first + last) / 2;
if (list[mid] == searchItem)
found = true;
else
if (searchItem < list[mid])
last = mid - 1;
else
first = mid + 1;
}
if (found)
return mid;
else
return -1;
}
#endif
This is a file with not very many words, and very
few words are repeated words.
Solution
Here is the completed code, following the instructions above. I ran the program both with and without an input file; the outputs are shown below. For the .csv output file, I created a Google spreadsheet to draw a histogram. This is the link to that spreadsheet:
https://docs.google.com/spreadsheets/d/1DLbzKLD8HhqIDeaSyzQk2-oSAVwIBLY7mTphBF4r7b8/edit?usp=sharing
Please don't forget to rate the answer if it helped. Thank you very much.
wordcount.cpp
#include <cstdlib>
#include <iostream>
#include <string>
#include <fstream>
#include "arrayUtils.h"
using namespace std;
// Strips all punctuation (except hyphens) from a word and
// changes any upper-case characters to lower-case. Note that
// if a word has nothing except punctuation, this may reduce
// it to the empty string.
string reduceWords (string word)
{
    string result;
    for (string::size_type i = 0; i < word.size(); ++i)
    {
        char c = word[i];
        if (c == '-')
            result += c;
        else if (c >= 'a' && c <= 'z')
            result += c;
        else if (c >= 'A' && c <= 'Z')
            result += (char)(c - 'A' + 'a'); // converts to lower-case
    }
    return result;
}
// Read words from the provided input stream, reducing each word and counting
// how many times each word appears. Write the resulting words and counts (in
// alphabetic order by word) in CSV format to the output stream.
// - Assume that the input contains a maximum of MaxWords distinct words
// (after reduction). If more distinct words than this are actually
// encountered, write nothing to the output stream but print an error
// message on the standard error stream.
void histogram(const int MaxWords, istream& input, ostream& output)
{
    //* Declare an array of strings to hold all words
    //* and an array of int to hold the number of times each word
    //* encountered.
    string *words = new string[MaxWords];
    int *count = new int[MaxWords];
    // sizew and sizec track the number of entries in words and count;
    // the two arrays are kept parallel, so they always hold the same value.
    int index, sizew = 0, sizec = 0;
    // Read the words from the input and count them
    string word;
    while (input >> word)
    {
        //* Reduce the word and, if any characters are left
        word = reduceWords(word);
        if (word.empty())
            continue;
        //* check to see if the word is already in our array
        index = seqOrderedSearch(words, sizew, word);
        if (index != -1)
        {
            //* If so, add one to that word's counter
            count[index]++;
        }
        else if (sizew < MaxWords)
        {
            //* If not, is there room to add it to the array?
            //** If so, add the word and a counter to the arrays, setting the
            //** counter to 1
            index = addInOrder(words, sizew, word);
            addElement(count, sizec, index, 1);
        }
        else
        {
            //** If not, print the error message and abort the program [exit(1);]
            cerr << "Input file contains more than " << MaxWords << " words " << endl;
            exit(1);
        }
    }
    //* Print all the words found, with their counts, in .csv format.
    for (int i = 0; i < sizew; i++)
    {
        output << "\"" << words[i] << "\"," << count[i] << endl;
    }
    //* Clean up the arrays we created
    delete [] words;
    delete [] count;
}
int main (int argc, char** argv)
{
    if (argc != 2 && argc != 4)
    {
        cerr << "Usage: " << argv[0] << " MaxWords [inFileName outFileName]" << endl;
        return -1;
    }
    int MaxWords = atoi(argv[1]);
    if (argc == 2)
    {
        // No file names in command line - use standard in and standard out
        histogram (MaxWords, cin, cout);
    }
    else
    {
        // Take input and output file names from the command line
        ifstream in (argv[2]);
        ofstream out (argv[3]);
        histogram (MaxWords, in, out);
        in.close();
        out.close();
    }
    return 0;
}
sample output 1
./a.out 100
This is a file with not very many words, and very
few words are repeated words.
"a",1
"and",1
"are",1
"few",1
"file",1
"is",1
"many",1
"not",1
"repeated",1
"this",1
"very",2
"with",1
"words",3
sample output 2 for ./a.out 1000 words.txt wordscount.csv
words.txt
You may provide arbitrary sections, in the same format as the ones above. This may be of use for extremely complicated
plugins where more information needs to be conveyed that doesn't fit into the categories of "description" or
"installation." Arbitrary sections will be shown below the built-in sections outlined above.
wordscount.csv
"above",2
"arbitrary",2
"as",1
"be",3
"below",1
"built-in",1
"categories",1
"complicated",1
"conveyed",1
"description",1
"doesnt",1
"extremely",1
"fit",1
"for",1
"format",1
"in",1
"information",1
"installation",1
"into",1
"may",2
"more",1
"needs",1
"of",2
"ones",1
"or",1
"outlined",1
"plugins",1
"provide",1
"same",1
"sections",3
"shown",1
"that",1
"the",4
"this",1
"to",1
"use",1
"where",1
"will",1
"you",1
Link to histogram https://docs.google.com/spreadsheets/d/1DLbzKLD8HhqIDeaSyzQk2-oSAVwIBLY7mTphBF4r7b8/edit?usp=sharing