C PROGRAMMING: Computational Stylistics (big data)

1. Write a program that reads in a text file character-by-character and computes the relative frequency of each letter A-Z. Discard all non-alphabetic characters and change each lowercase letter to upper case. The program determines the total number of alphabetic letters and the total number of each letter A-Z. After the file has been read in, compute the relative frequency of each letter by dividing the count for each letter by the total number of alphabetic letters. Output 26 lines that look like this:

A 0.08167

B 0.01492

C 0.02782

D 0.04253

E 0.12702

The program uses command line arguments for the file name of the input file and output file:

C:\>relative sampleOne.txt frequencyOne.txt

Of course, the files will be different lengths. Plan on files the size of books.
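The counting and normalizing step of part 1 can be sketched as a function over an in-memory string, which makes the arithmetic easy to check by hand. This is only an illustration: the name relative_frequencies is hypothetical (it does not come from the assignment), and the code assumes a character set such as ASCII in which the letters A-Z are contiguous.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: fills freq[0..25] with the relative frequency of
   A..Z in text and returns the number of alphabetic letters counted.
   Assumes 'A'..'Z' are contiguous, as in ASCII. */
long relative_frequencies(const char *text, double freq[26])
{
    long count[26] = {0};
    long total = 0;
    size_t i;
    int k;

    for (i = 0; text[i] != '\0'; i++) {
        int c = toupper((unsigned char)text[i]);
        if (c >= 'A' && c <= 'Z') {   /* discard everything else */
            count[c - 'A']++;
            total++;
        }
    }
    /* Divide each count by the total; guard against text with no letters. */
    for (k = 0; k < 26; k++)
        freq[k] = (total > 0) ? (double)count[k] / total : 0.0;
    return total;
}
```

For the string "Hello, World!" the function counts 10 letters, of which 3 are L, so freq[11] comes out as 0.3. A file-based program does the same work, but reads its characters with getc instead of walking a string.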

2. Write a program that computes the Root Mean Square (RMS) difference between two sets of relative frequencies. The input will be two files of relative frequencies created by the above program.

C:\>RMSdistance frequencyOne.txt frequencyTwo.txt

Read the two sets of relative frequencies into two arrays of 26 cells each. The RMS difference between two sets of frequencies is the square root of the average square of the difference between corresponding frequencies.

Say that the relative frequency of 'A' in sample one is freqOne[0] and the relative frequency of 'A' in sample two is freqTwo[0]. The square of the difference is:

(freqOne[0] - freqTwo[0])^2

A similar square for B will be:

(freqOne[1] - freqTwo[1])^2

Compute twenty-six such squares and compute their average. Then take the square root of that average. Output that single value. This is the root mean square difference. Of course, if the two files are the same, the RMS difference will be zero. If the two files are radically different, the RMS difference will be large.

So the distance between two text samples has been boiled down to a single number. Presumably the closer two files are to each other the smaller this number will be. Notice that the number does not depend on the size of the file. Ideally, an author will habitually use the same words and style and the difference between two samples from the same author will be smaller than the difference between two samples of two different authors.

Write the RMS difference to the monitor.
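Reading one of the frequency files produced in part 1 into a 26-cell array, as part 2 requires, might look like the following. This is a sketch under assumptions: the helper name read_frequencies and its return convention are invented for the example, and the file is assumed to hold one "letter frequency" pair per letter, in order from A to Z.

```c
#include <stdio.h>

/* Hypothetical helper: reads 26 "letter frequency" pairs into freq[0..25].
   Returns 0 on success, -1 if the file cannot be opened or is malformed. */
int read_frequencies(const char *fileName, double freq[26])
{
    FILE *in = fopen(fileName, "r");
    char letter;
    int i;

    if (in == NULL)
        return -1;
    for (i = 0; i < 26; i++) {
        /* The leading space in the format skips newlines and blank lines. */
        if (fscanf(in, " %c %lf", &letter, &freq[i]) != 2 || letter != 'A' + i) {
            fclose(in);
            return -1;
        }
    }
    fclose(in);
    return 0;
}
```

A main function for RMSdistance would call this once for argv[1] and once for argv[2], then print the RMS difference of the two arrays with printf.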

3. Download the files unknown.txt, austin.txt, dickens.txt, and twain.txt from this site. Compute the relative frequencies for each. The file unknown.txt is from an unknown author. The other samples of text are from known authors. Now you can compare the distance between the writing styles of two authors. With your data you can tell which two authors are most similar. The known author most similar to the Unknown sample will (presumably) be the one who wrote it.

A way to do this is to create a confusion matrix:

	Austin	Dickens	Twain	Unknown
Austin	0.0
Dickens	0.004	0.0
Twain	- X -	- - -	0.0
Unknown	- - -	- - -	- - -	0.0

Each cell of the matrix contains the RMS difference between the two samples named in its row and column. So the RMS difference between the samples for Austin and Twain will be in the cell marked X above. Since the RMS difference is symmetric, only the cells below the main diagonal need to be filled in.

Write (and submit) a brief report of your conclusions. Your report should include a confusion matrix and an explanation (in complete, grammatical sentences) of your conclusion. Make this a paragraph of about ten sentences. Yes, computer science majors are actually expected to know how to write.

Submit both programs and your report for part three.
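Once the RMS differences for part 3 have been computed, printing the lower triangle of the confusion matrix and picking the known author closest to the unknown sample could be sketched as follows. The names SAMPLES, print_lower_triangle, and closest_known are assumptions made for this example, and the code assumes the unknown sample occupies the last row, as in the matrix in part 3. The distances themselves must come from your own runs; none are supplied here.

```c
#include <stdio.h>

#define SAMPLES 4   /* Austin, Dickens, Twain, Unknown */

/* Prints the cells on and below the main diagonal, one row per sample. */
void print_lower_triangle(FILE *out, const char *names[SAMPLES],
                          double dist[SAMPLES][SAMPLES])
{
    int row, col;

    for (col = 0; col < SAMPLES; col++)
        fprintf(out, "\t%s", names[col]);
    fprintf(out, "\n");
    for (row = 0; row < SAMPLES; row++) {
        fprintf(out, "%s", names[row]);
        for (col = 0; col <= row; col++)
            fprintf(out, "\t%.5f", dist[row][col]);
        fprintf(out, "\n");
    }
}

/* Returns the index of the known sample (0 .. SAMPLES-2) whose distance to
   the last sample, assumed to be the unknown one, is smallest. */
int closest_known(double dist[SAMPLES][SAMPLES])
{
    int best = 0, col;

    for (col = 1; col < SAMPLES - 1; col++)
        if (dist[SAMPLES - 1][col] < dist[SAMPLES - 1][best])
            best = col;
    return best;
}
```

Only the bottom row is consulted when choosing the closest author; the rest of the matrix is still worth printing because the report asks for it.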

Answer to (1):

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    long alpha[26] = {0};   /* count of each letter A-Z */
    long total = 0;         /* total number of alphabetic letters */
    int c;                  /* int, not char, so that EOF can be detected */
    int i;
    FILE *in, *out;

    if (argc != 3) {
        fprintf(stderr, "usage: %s inputFile outputFile\n", argv[0]);
        return EXIT_FAILURE;
    }
    in = fopen(argv[1], "r");
    if (in == NULL) {
        fprintf(stderr, "cannot open %s\n", argv[1]);
        return EXIT_FAILURE;
    }
    out = fopen(argv[2], "w");
    if (out == NULL) {
        fprintf(stderr, "cannot open %s\n", argv[2]);
        fclose(in);
        return EXIT_FAILURE;
    }

    /* Read character by character. Convert to upper case and count only
       A-Z (assumes the letters are contiguous, as in ASCII). */
    while ((c = getc(in)) != EOF) {
        c = toupper(c);
        if (c >= 'A' && c <= 'Z') {
            alpha[c - 'A']++;
            total++;
        }
    }
    fclose(in);

    /* Output 26 lines: the letter and its relative frequency. */
    for (i = 0; i < 26; i++) {
        double freq = (total > 0) ? (double)alpha[i] / total : 0.0;
        fprintf(out, "%c %.5f\n", 'A' + i, freq);
    }
    fclose(out);
    return 0;
}
