The package bayesm includes the dataset Scotch
I need help with R code and visualizations for the questions below. The datasets ship with R packages, which can be installed directly from within R.
Q2: Whisky. The package bayesm includes the dataset Scotch, which reports which brands of whisky 2218 respondents consumed in the previous year.
a) Draw a barchart of the number of respondents per brand. What ordering of the brands do you think is best?
b) There are 20 named brands and a further category Other.brands. That entails drawing a lot of bars. If you decided to plot only the biggest brands individually and group the rest all together in the ‘Other’ group, what cutoff would you use for defining a big brand?
c) Another version of the dataset called whiskey is given in the package flexmix. It is made up of two data frames, whiskey with the basic data, and whiskey_brands with information on whether the whiskeys are blends or single malts. How would you incorporate this information in your graphics, by using colour, by using a different ordering, or by drawing two graphics rather than one?
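Parts a) and b) can be sketched roughly as follows. This is only an illustration under stated assumptions: the data frame toy, its brand names, and the helper lump_small() are hypothetical stand-ins, and the cutoff of 2 is arbitrary. The sketch assumes the data are a 0/1 table with one column per brand, so that column sums give the number of respondents per brand.

```r
# Toy 0/1 consumption table standing in for the Scotch data
# (hypothetical brands and values, for illustration only).
toy <- data.frame(
  BrandA = c(1, 1, 0, 1, 1, 0),
  BrandB = c(0, 1, 0, 0, 1, 0),
  BrandC = c(1, 0, 0, 0, 0, 0),
  BrandD = c(0, 0, 1, 0, 0, 0)
)

# a) Respondents per brand, sorted so the largest brand comes first.
brand_counts <- sort(colSums(toy), decreasing = TRUE)

# b) Lump every brand below the cutoff into a single "Other" bar.
lump_small <- function(counts, cutoff) {
  big <- counts[counts >= cutoff]
  small_total <- sum(counts[counts < cutoff])
  if (small_total > 0) c(big, Other = small_total) else big
}

lumped <- lump_small(brand_counts, cutoff = 2)
barplot(lumped, las = 2, ylab = "Number of respondents")
```

With bayesm installed, data(Scotch, package = "bayesm") should load the real table, and Scotch can then replace toy. Ordering by frequency usually makes a barchart of many categories easier to read than alphabetical order, and a sensible cutoff is often a visible gap in the sorted counts.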
Solution
The olive oils dataset is well known and can be found in several packages, for instance as olives in extracat. The original source for the data is the paper [Forina et al., 1983].
a) Draw a scatterplot matrix of the eight continuous variables. Which of the fatty acids are strongly positively associated and which strongly negatively associated?
b) Are there outliers or other features worth mentioning?
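The complete R snippet is:
# olives ships with extracat (see above), not psych
# install.packages("extracat")  # run once; may need the CRAN archive if unavailable
library(extracat)
data(olives)
pairs(olives[,c(3:10)])  # scatterplot matrix of the eight fatty acids
cor(olives[,c(3:10)])    # pairwise Pearson correlations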
> cor(olives[,c(3:10)])
              palmitic palmitoleic     stearic      oleic    linoleic   linolenic
palmitic     1.0000000  0.83560497 -0.17039178 -0.8373354  0.46068446  0.31932669
palmitoleic  0.8356050  1.00000000 -0.22218545 -0.8524384  0.62162666  0.09311163
stearic     -0.1703918 -0.22218545  1.00000000  0.1135987 -0.19781693  0.01891719
oleic       -0.8373354 -0.85243835  0.11359873  1.0000000 -0.85031837 -0.21817123
linoleic     0.4606845  0.62162666 -0.19781693 -0.8503184  1.00000000 -0.05743858
linolenic    0.3193267  0.09311163  0.01891719 -0.2181712 -0.05743858  1.00000000
arachidic    0.2282991  0.08548117 -0.04097892 -0.3199623  0.21097260  0.62023577
eicosenoic   0.5019518  0.41635048  0.14037748 -0.4241459  0.08904499  0.57831851
              arachidic  eicosenoic
palmitic     0.22829912  0.50195179
palmitoleic  0.08548117  0.41635048
stearic     -0.04097892  0.14037748
oleic       -0.31996234 -0.42414586
linoleic     0.21097260  0.08904499
linolenic    0.62023577  0.57831851
arachidic    1.00000000  0.32866349
eicosenoic   0.32866349  1.00000000
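Pairs with the largest absolute correlations are the most strongly associated. The strongest positive association is between palmitic and palmitoleic (0.84), followed by palmitoleic and linoleic (0.62), linolenic and arachidic (0.62), and linolenic and eicosenoic (0.58). The strongest negative associations all involve oleic: with palmitoleic (-0.85), linoleic (-0.85), and palmitic (-0.84).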
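Reading the strongest pairs off a printed matrix is error-prone, so they can also be ranked programmatically. The sketch below is only illustrative: rank_pairs() is a hypothetical helper, and cm is a four-acid sub-matrix typed in from the correlation output above rather than computed from the data.

```r
# Sub-matrix copied from the cor() output above (four of the eight acids).
acids <- c("palmitic", "palmitoleic", "stearic", "oleic")
cm <- matrix(c(
   1.00000000,  0.83560497, -0.17039178, -0.8373354,
   0.83560497,  1.00000000, -0.22218545, -0.85243835,
  -0.17039178, -0.22218545,  1.00000000,  0.11359873,
  -0.8373354,  -0.85243835,  0.11359873,  1.00000000
), nrow = 4, byrow = TRUE, dimnames = list(acids, acids))

# List each unordered pair once, ranked by absolute correlation.
rank_pairs <- function(cor_mat) {
  idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cor_mat)[idx[, "row"]],
    var2 = colnames(cor_mat)[idx[, "col"]],
    r    = cor_mat[idx],
    stringsAsFactors = FALSE
  )
  out[order(abs(out$r), decreasing = TRUE), ]
}

ranked <- rank_pairs(cm)
print(ranked)
```

On the full data, rank_pairs(cor(olives[,c(3:10)])) should give the same ranking over all 28 pairs, assuming the column positions match those used above.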
The outlier analysis can be performed using the following code:
library(outliers)
# TRUE where the chi-squared score of a value exceeds the 0.9 quantile
outs <- scores(olives$palmitic, type="chisq", prob=0.9)
olives$palmitic[outs]
The results are:
> olives$palmitic[outs]
[1] 911 911 875 943 952 1529 1510 1539 1527 1518 1514 1620 1543 1721 1742 1517
[17] 1577 1590 1621 1753 1679 1693 1692 1638 1680 926 916 905 610 920 952 922
[33] 1732 1515 1521 1639
This was done for palmitic; the same analysis can be repeated for each of the other numeric variables.
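Repeating the analysis for every numeric variable can be done in a single pass. The sketch below avoids the outliers package and uses a base-R rule instead: flag_chisq() is a hypothetical helper that assumes the chi-squared score is the squared deviation from the mean divided by the variance, compared against a chi-squared quantile with one degree of freedom. That is my understanding of what the call above computes, but it is not verified against the package source. The data frame toy is made up for illustration.

```r
# Flag values whose score (x - mean)^2 / var exceeds the chosen
# quantile of a chi-squared distribution with 1 degree of freedom.
# Note that this flags unusually low values as well as high ones.
flag_chisq <- function(x, prob = 0.9) {
  score <- (x - mean(x))^2 / var(x)
  score > qchisq(prob, df = 1)
}

# Hypothetical toy data; with the real data use olives[,c(3:10)] instead.
toy <- data.frame(
  a = c(10, 11, 9, 10, 12, 10, 11, 30),
  b = c(5, 6, 5, 4, 5, 6, 5, 5)
)

# One vector of flagged values per column.
flagged <- lapply(toy, function(col) col[flag_chisq(col)])
print(flagged)
```

Because the rule is two-sided and depends only on the mean and variance, it will flag a fixed-looking share of points even in well-behaved data, so the flagged values are best checked against the scatterplot matrix before being called outliers.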


