
TASK: You are to perform a data analysis on a dataset of your choice to find intelligence that you can act on. This can include hidden meanings in the data, trends, etc., that are otherwise not obvious.

STEPS:
1. Download a dataset - from Data.Gov or another source of your choice.
2. Eyeball the data - what is the data about?
3. Analyze: Load the data in Tableau, perform an analysis, and create a dashboard if applicable.

DELIVERABLES:
1. Written executive report
   A. What is the data about?
   B. What kind of analysis did you do? Give screenshots from Tableau and explain why you did what you did.
   C. What did you find? Actionable intelligence such as hidden meanings, trends, etc.
   D. What can you do with this intelligence?
2. Oral presentation
   A. Give a presentation of at most 15 minutes covering the material in your written report.

Solution

If the goal is prediction accuracy, average many prediction models together. In general, the prediction algorithms that most frequently win Kaggle competitions or the Netflix Prize blend multiple models together. The idea is that by averaging (or majority voting across) several good prediction algorithms you can reduce variability without giving up much bias. One of the earliest descriptions of this idea was a much simplified version based on bootstrapping samples and building multiple prediction functions, a process called bagging (short for bootstrap aggregating; a rough code sketch appears below). Random forests, another remarkably successful prediction algorithm, is based on a similar idea with classification trees.

When testing many hypotheses, correct for multiple testing. A well-known comic points out the problem with standard hypothesis testing when many tests are performed: classic hypothesis tests are designed to call a set of data significant 5% of the time even when the null is true (i.e., nothing is going on). One very common choice for correcting for multiple testing is to use the false discovery rate (FDR) to control the rate at which the things you call significant are false discoveries. People like this measure because you can think of it as the rate of noise among the signals you have discovered. Benjamini and Hochberg gave the first definition of the false discovery rate and provided a procedure to control it (sketched below). There is also a very readable introduction to FDR by Storey and Tibshirani.

When you have data measured over space, distance, or time, you should smooth. This is one of the oldest ideas in statistics (regression is a form of smoothing, and Galton popularized it a long time ago). I personally like locally weighted scatterplot smoothing (loess/lowess) a lot; a short example appears below. Cleveland's paper on loess is a good reference.
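As a minimal sketch of the bagging idea, here is one way it could look in Python with NumPy and scikit-learn (an assumption for illustration only; the assignment itself uses Tableau, and the dataset here is synthetic). Each decision tree is trained on a bootstrap sample of the training data and the ensemble majority-votes its predictions.

```python
# Minimal bagging sketch: bootstrap the training set, fit one tree per
# bootstrap sample, and majority-vote the predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_models = 50
votes = np.zeros((n_models, len(X_test)), dtype=int)

for m in range(n_models):
    # Draw a bootstrap sample (sample with replacement, same size as the training set).
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=m).fit(X_train[idx], y_train[idx])
    votes[m] = tree.predict(X_test)

# Majority vote across the ensemble.
y_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("bagged accuracy:", (y_pred == y_test).mean())
```

Averaging the trees is what reduces the variance of any single overfit tree; random forests add a second layer of randomness by also subsampling features at each split.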
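As a sketch of the Benjamini-Hochberg step-up procedure mentioned above (again in Python, purely for illustration; the p-values in the example are made up), the i-th smallest p-value is compared against (i/m)*q and everything up to the largest rank passing that check is declared significant.

```python
# Minimal Benjamini-Hochberg sketch: given a list of p-values, decide which
# tests can be called significant while controlling the FDR at level q.
import numpy as np

def benjamini_hochberg(pvalues, q=0.05):
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)                   # indices of p-values, smallest first
    ranked = p[order]
    # BH condition: p_(i) <= (i/m) * q for the i-th smallest p-value.
    below = ranked <= (np.arange(1, m + 1) / m) * q
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()      # largest rank satisfying the condition
        rejected[order[: k + 1]] = True     # reject all hypotheses up to rank k
    return rejected

# Example: a few small (likely real) p-values mixed with noise.
pvals = [0.001, 0.008, 0.039, 0.041, 0.2, 0.5, 0.7, 0.9]
print(benjamini_hochberg(pvals, q=0.05))
```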
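Finally, a short lowess smoothing example (assuming Python with statsmodels and matplotlib is available; the noisy sine data is synthetic). The frac parameter plays the role of the smoothing span.

```python
# Minimal smoothing sketch: locally weighted scatterplot smoothing (lowess)
# on noisy, ordered data, using statsmodels.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.4, size=x.size)   # signal + noise

# frac controls the span: the fraction of points used for each local fit.
smoothed = lowess(y, x, frac=0.2, return_sorted=True)

plt.scatter(x, y, s=10, alpha=0.4, label="raw data")
plt.plot(smoothed[:, 0], smoothed[:, 1], color="red", label="lowess fit")
plt.legend()
plt.show()
```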
