Q1. Explain what regularization is and why it is useful.
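As an illustration of the concept this question probes (not a prescribed answer), here is a minimal NumPy sketch of L2 (ridge) regularization, which shrinks coefficients toward zero to trade a little bias for lower variance:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam*I)^(-1) X^T y.
    lam=0 recovers ordinary least squares; larger lam shrinks w."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=50)

w_ols = ridge_fit(X, y, lam=0.0)     # unregularized fit
w_ridge = ridge_fit(X, y, lam=10.0)  # heavily regularized fit

# The penalized solution has a smaller coefficient norm.
assert np.linalg.norm(w_ridge) < np.linalg.norm(w_ols)
```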
Q2. Which data scientists do you admire most? Which startups?

Q3. How would you validate a predictive model of a quantitative outcome variable that you built using multiple regression?
Q4. Explain what precision and recall are. How do they relate to the ROC curve?
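A minimal, dependency-free sketch of the definitions this question targets (precision = TP / (TP + FP), recall = TP / (TP + FN)), using made-up example labels:

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
p, r = precision_recall(y_true, y_pred)
# 2 true positives, 1 false positive, 2 false negatives
assert abs(p - 2 / 3) < 1e-9 and r == 0.5
```

Note that the ROC curve instead plots the true positive rate (recall) against the false positive rate as the decision threshold varies.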
Q5. How can you prove that one improvement you've brought to an algorithm is really an improvement over not doing anything?
Q6. What is root cause analysis?
Q7. Are you familiar with pricing optimization, price elasticity, inventory management, competitive intelligence? Give examples.
Q8. What is statistical power?
Q9. Explain what resampling methods are and why they are useful. Also explain their limitations.
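One common resampling method this question could cover is the bootstrap. A minimal sketch of a percentile bootstrap confidence interval for the mean (parameters and data are illustrative, not from the source):

```python
import random

def bootstrap_mean_ci(data, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI: resample with replacement, recompute
    the statistic, and read off the empirical quantiles."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(data, k=len(data))) / len(data)
        for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

data = [2.1, 2.4, 1.9, 2.8, 2.2, 2.5, 2.0, 2.3]
lo, hi = bootstrap_mean_ci(data)
assert lo < sum(data) / len(data) < hi
```

A key limitation the question hints at: resampling assumes the observed sample is representative of the population, so it can badly misestimate uncertainty for very small or biased samples.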
Q10. Is it better to have too many false positives, or too many false negatives? Explain.
Q11. What is selection bias, why is it important and how can you avoid it?
Q12. Give an example of how you would use experimental design to answer a question about user behavior.
Q13. What is the difference between "long" and "wide" format data?
Q14. What method do you use to determine whether the statistics published in an article (e.g., in a newspaper) are wrong, or are presented to support the author's point of view rather than as correct, comprehensive factual information on a specific subject?
Q15. Explain Edward Tufte's concept of "chart junk."
Q16. How would you screen for outliers and what should you do if you find one?
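One standard screening rule this question invites is Tukey's IQR fence. A rough, dependency-free sketch (the quartile estimate here is deliberately crude; the cutoff k=1.5 is the conventional choice, not from the source):

```python
def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule),
    using a crude rank-based quartile estimate."""
    s = sorted(values)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

data = [10, 12, 11, 13, 12, 11, 10, 95]
assert iqr_outliers(data) == [95]
```

What to do with a flagged point depends on context: investigate it first, since it may be a data-entry error, a genuine extreme value, or the most interesting observation in the set.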
Q17. How would you use extreme value theory, Monte Carlo simulations, mathematical statistics, or any other approach to correctly estimate the chance of a very rare event?
Q18. What is a recommendation engine? How does it work?
Q19. Explain what a false positive and a false negative are. Why is it important to differentiate these from each other?
Q20. Which tools do you use for visualization? What do you think of Tableau? R? SAS? (for graphs).
Q21. How can you efficiently represent 5 dimensions in a chart (or in a video)?
Q22. What are the Data Science lessons from the failure to predict the 2016 US Presidential election (and from the Super Bowl LI comeback)?
Q23. What problems arise if the distribution of the new (unseen) test data is significantly different than the distribution of the training data?
Q24. What are bias and variance, and what are their relation to modeling data?
Q25. Why might it be preferable to include fewer predictors rather than many?
Q26. What error metric would you use to evaluate how good a binary classifier is? What if the classes are imbalanced? What if there are more than 2 groups?
Q27. What are some ways I can make my model more robust to outliers?
Q28. What is overfitting and how to avoid it?
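A small NumPy sketch of the phenomenon behind this question: training error can only decrease as a model gets more flexible, which is exactly why training error alone cannot detect overfitting and a held-out set or cross-validation is needed (the data here is synthetic and illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 20)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=20)

def train_mse(degree):
    """MSE of a polynomial fit evaluated on its own training data."""
    coeffs = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# A degree-9 polynomial contains degree-2 as a special case, so its
# training error is never higher -- but it may generalize far worse.
assert train_mse(9) <= train_mse(2)
```

Common remedies the question expects: hold-out validation or cross-validation, regularization, simpler models, and more training data.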
Q29. What is the curse of dimensionality?
Q30. How can you determine which features are the most important in your model?
Q31. When can parallelism make your algorithms run faster? When could it make your algorithms run slower?
Q32. What is the idea behind ensemble learning?
Q33. In unsupervised learning, if the ground truth about a dataset is unknown, how can we determine the most useful number of clusters?
Q34. What makes a good data visualization?
Q35. What are some of the common data quality issues when dealing with Big Data? What can be done to avoid them or to mitigate their impact?
Q36. In an A/B test, how can we ensure that assignment to the various buckets is truly random?
Q37. How would you conduct an A/B test on an opt-in feature?
Q38. How would you determine the influence of a Twitter user?