1. Which of the following are the key ideas underlying Classification and Regression Trees?
Trimming
Recursive Partitioning
Pruning
Feedforward Partitioning
None of the Above
2. What does it mean for a data mining method to be data-driven, not model-driven? Provide two examples of models that are considered data-driven and explain.
3. List one example of a model-driven method and explain.
4. Name and define the three types of layers in a neural network.
5. Which of the following are commonly used to measure impurity in a classification tree?
ROC Curve
Gini Index
Confusion Matrix
Entropy
None of the Above
6. In a neural network, define and explain what the w and θ values represent and how they are used in the model.
7. A neural network characterized by a one-way flow with no cycles is described as a Multilayer Feedforward Network.
True
False
8. Which data mining method does CHAID (Chi-Squared Automatic Interaction Detection) apply to, and what is its primary purpose?
9. When plotting the probability of passing an exam against the number of hours spent studying, what type of curve is shown in the following illustration?
Probability Curve
Naïve-Bayes Curve
Linear Regression curve
Logistic Regression curve
None of the above
10. In a Neural Network, define and explain the following
Case Updating-
Batch Updating-
11. Which of the following are true of Naïve Bayes?
Incorporates the concept of conditional probability
Named after the Reverend Thomas Bayes
Can only be used with Categorical Variables
Data-driven not model driven
None of the above
12. Which of the following characterize Classification and Regression Trees?
Considered highly transparent, easy to interpret
Can be used for either Classification or Prediction
Model-Driven (requires the assumptions of statistical models)
Computationally cheap even on large samples
All of the above
13. What is meant by the term “black box,” and which data mining model does it generally apply to?
Which of the following is a more effective visualization of the data?
Pie
Bar
Both graphs are equally effective
14. For evaluating regression results, is it better to use Adjusted R Squared or R Squared? Explain
15. Define and Explain the following terms, as they apply to variable selection.
16. Why is it important to partition data when we develop a model? List two types of partitions and explain.
16b. Explain how lift charts are used to explain model performance
17. In evaluating model performance, which of the following metrics is most useful and why?
18. Oversampling is used when the event of interest is rare
True
False
19. Which of the following is true of the Naïve Rule?
Classify all records as belonging to the most prevalent class
Is another term for Naïve Bayes
Often used as benchmark
Using external predictor info should outperform the Naïve Rule
All of the above
20. Which of the following characterizes k-nearest neighbors:
Used for classification (categorical outcome) or Prediction (numerical outcome)
Highly automated, data driven method
Relies on distance between records to determine neighbors
Uses R-Squared to evaluate performance
All of the Above
21. Correlation analysis is a key step in dimension reduction
True
False
22. A model that fits the training data perfectly, leaving no error (residuals), is likely to perform well with new data
True
False
23. A chart that plots the pairs (sensitivity, 1 - specificity) as the cutoff value increases from 0 to 1 is known as an ROC curve
True
False
24. When the event of interest is rare, which method may be appropriate in order to develop a model?
Overfitting
Oversampling
PCA
CHAID
None of the Above
Solution
Three problems are solved below; post the remaining questions separately to get the other answers.
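Q1) The correct answers are Options B and C, Recursive Partitioning and Pruning
Explanation: Classification and regression trees rest on two ideas. Recursive partitioning repeatedly splits the records into subgroups that are more homogeneous in the outcome, and pruning cuts back the fully grown tree to avoid overfitting.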
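The recursive partitioning idea named above can be illustrated with a minimal sketch. It assumes a single numeric predictor and uses weighted Gini impurity to choose each split; the function names (`gini`, `best_split`, `grow_tree`, `predict`) and the hours-studied data are hypothetical, invented for illustration, and pruning is not shown. It is not the procedure used by any particular software package.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows):
    """rows is a list of (x, label). Return (threshold, weighted_gini), or None
    if every record has the same x value and no split is possible."""
    rows = sorted(rows)
    best = None
    for i in range(1, len(rows)):
        if rows[i - 1][0] == rows[i][0]:
            continue  # no threshold can separate identical x values
        threshold = (rows[i - 1][0] + rows[i][0]) / 2
        left = [lab for x, lab in rows if x <= threshold]
        right = [lab for x, lab in rows if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

def grow_tree(rows, max_depth=3):
    """Recursively partition rows until nodes are pure or max_depth is reached."""
    labels = [lab for _, lab in rows]
    leaf = {"leaf": Counter(labels).most_common(1)[0][0]}
    if max_depth == 0 or len(set(labels)) == 1:
        return leaf
    split = best_split(rows)
    if split is None:
        return leaf
    threshold = split[0]
    left = [r for r in rows if r[0] <= threshold]
    right = [r for r in rows if r[0] > threshold]
    return {"threshold": threshold,
            "left": grow_tree(left, max_depth - 1),
            "right": grow_tree(right, max_depth - 1)}

def predict(tree, x):
    """Follow the splits down to a leaf and return its majority class."""
    while "leaf" not in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["leaf"]

# Hypothetical data: hours studied and exam outcome.
data = [(1.0, "fail"), (2.0, "fail"), (3.0, "fail"),
        (6.0, "pass"), (7.0, "pass"), (8.0, "pass")]
tree = grow_tree(data)
print(tree)
```

On this toy data a single split separates the two classes, so the recursion stops after one level. A real tree would consider every predictor at every node and would then be pruned against validation data.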
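Q2) A data-driven method learns the relationship between the predictors and the outcome directly from the data, without assuming a particular functional form; a model-driven method imposes the assumptions of a statistical model.
Classification and Regression Trees - The splits are chosen by searching the data for the partitions that make the subgroups most homogeneous, so no statistical model is assumed
k-Nearest Neighbors - A new record is classified from the outcomes of the most similar records in the training data, again without assuming any form for the relationship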
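Q5) The correct answers are Options B and D, Gini Index and Entropy
The Gini index measures how often a randomly chosen record in a node would be wrongly labelled if it were labelled at random according to the class proportions in that node. Entropy measures the same heterogeneity on a logarithmic scale. The ROC curve and the confusion matrix evaluate classifier performance, not node impurity.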
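As a rough illustration of the Gini measure above, and of entropy, the other impurity option listed in question 5, here is a small sketch using the standard definitions: Gini = 1 - sum of p_k squared and entropy = sum of p_k * log2(1 / p_k), where p_k is the proportion of class k in the node. The function names and the sample labels are assumptions made for this example.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Proportion of records in each class within a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini_index(labels):
    """Gini impurity: 0 for a pure node, larger for a more mixed node."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Entropy in bits: 0 for a pure node, 1 for an even two-class mix."""
    return sum(p * math.log2(1 / p) for p in class_proportions(labels))

# Hypothetical nodes: one pure, one evenly mixed between two classes.
pure_node = ["yes", "yes", "yes", "yes"]
mixed_node = ["yes", "yes", "no", "no"]

print(gini_index(pure_node), entropy(pure_node))    # 0.0 0.0
print(gini_index(mixed_node), entropy(mixed_node))  # 0.5 1.0
```

Both measures are zero for a pure node and reach their maximum when the classes are evenly mixed, which is why a tree prefers splits that lower them.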


