Uploader: 高宏飞

Shared on 2026-06-23

Author: Vishwanathan Narayanan

Know Data science with numpy, pandas, scipy, sklearn

DESCRIPTION
"Data science and Machine learning interview questions using Python" is a companion for people aspiring to work in data science and machine learning. It provides answers to the most frequently asked questions in an easy-to-remember and presentable form. The book is mainly intended for last-minute revision before an interview, as all the important concepts and terminology are given in a simple and understandable format. Many examples are provided so that they can be used while giving answers in an interview. The book is divided into six chapters. It starts with the Data Science Basic Questions and Terms, then covers questions related to Python Programming, Numpy, Pandas, and Scipy and its Applications, and finally covers Matplotlib and Statistics with Excel Sheet.

KEY FEATURES
- Questions related to core/basic Python, Excel, and basic and advanced statistics are included
- The book will prove to be a companion whenever you want to go for an interview
- Simple words have been used in the answers to make them easy to remember

WHAT WILL YOU LEARN
- You can learn the basic concepts and terms related to Data Science and Python programming
- You will get to learn how to program in Python and the basics of Numpy
- You will become familiar with the interview questions related to Pandas and learn the concepts of Scipy, Matplotlib, and Statistics with Excel Sheet

WHO THIS BOOK IS FOR
The book is mainly intended to help people present their answers in a sensible way to the interviewer. The answers have been carefully rendered to keep things simple and yet represent the seriousness and complexity of the matter. Since data science is incomplete without mathematics, we have also included a part of the book dedicated to statistics.

Table of Contents
1. Data Science Basic Questions and Terms
2. Python Programming Questions
3. Numpy Interview Que

ISBN: 9389845785
Publisher: BPB Publications
Publish Year: 2020
Language: English
Pages: 152
File Format: PDF
File Size: 5.8 MB
Text Preview (First 20 pages)
Data Science and Machine Learning Interview Questions using Python
A Complete Question Bank to Crack Your Interview
by Vishwanathan Narayanan

SECOND REVISED AND UPDATED EDITION 2020
FIRST EDITION 2019
Copyright © BPB Publications, India
ISBN: 978-93-89845-785

All Rights Reserved. No part of this publication may be reproduced or distributed in any form or by any means or stored in a database or retrieval system, without the prior written permission of the publisher, with the exception of the program listings, which may be entered, stored and executed in a computer system, but cannot be reproduced by means of publication.

LIMITS OF LIABILITY AND DISCLAIMER OF WARRANTY
The information contained in this book is true and correct to the best of the author's and publisher's knowledge. The author has made every effort to ensure the accuracy of this publication, but cannot be held responsible for any loss or damage arising from any information in this book. All trademarks referred to in the book are acknowledged as properties of their respective owners.

Distributors:
BPB PUBLICATIONS, 20, Ansari Road, Darya Ganj, New Delhi-110002. Ph: 23254990/23254991
MICRO MEDIA, Shop No. 5, Mahendra Chambers, 150 DN Rd., Next to Capital Cinema, V.T. (C.S.T.) Station, MUMBAI-400 001. Ph: 22078296/22078297
DECCAN AGENCIES, 4-3-329, Bank Street, Hyderabad-500195. Ph: 24756967/24756400
BPB BOOK CENTRE, 376 Old Lajpat Rai Market, Delhi-110006. Ph: 23861747

Published by Manish Jain for BPB Publications, 20 Ansari Road, Darya Ganj, New Delhi-110002 and Printed by him at Repro India Ltd, Mumbai

Dedicated to
Pratyangira, Bala, Durga, Mom, Dad, Chitti my aunt, my sister Ishwarya, Sridhar my brother-in-law, and all my mentors, especially Shiv, without whom this book would still be a dream. The support extended by Shyam Sir, Khadak and BPB Publications is also very much appreciated. Durga has been a great inspiration for this book. She has always been, and will be, my encouragement to write more books. I also remember Sudarshan as a friend in need. This book is also dedicated to my students, from whom I learned as much as I taught them. Above all, the blessing of the Almighty, without which even a blade of grass does not move, is remembered here.

About the Author
Mr Vishwanathan has twenty years of hard-core experience in the software industry, spanning many multinational companies and domains. Playing with data to derive meaningful insights has been his domain, and that is what took him towards data science and machine learning.

Preface
Data science is one of the hottest topics, mainly because of the application areas it is involved in and because things which were once impossible with earlier software have been made easy. This book tries to condense the ocean of data science into a small book which is mainly intended to be used for last-minute revision. All the important concepts have been given in a simple and understandable format for use before an interview. The book tries to include the various terminologies and logic used in both Data Science and Machine learning. As such, you can say that this book acts as a companion whenever you want to go for an interview.
Simple words have been used in the answers to make them easy to remember and to present. Examples have been provided wherever deemed necessary so that they can be used while giving answers in an interview. The author has tried to consolidate whatever he came across in the multiple interviews he attended and put it into words, so that the reader gets a sense of how an interview would go. With the number of data science jobs increasing, the author is sure that everyone who wants to pursue this field would like to keep this book as a constant companion. The author will shortly be coming out with a new book on R too, to make a complete data science stack. Happy reading to all the readers; your feedback is highly appreciated.

Foreword
It is not wrong to say that today's dynamic world is driven totally by statistics. With decision making becoming important to success, the use of software for this task has become common, thanks to advances in technology. While software applications for this task always existed, the volume they could handle and their ability to represent complex equations related to statistics and probability were limited. Thanks to the pandas, numpy, scipy and sklearn modules of Python, this problem has been removed to a great extent and is no longer a challenge. With complex mathematical concepts easily convertible to algorithms, the life of the data scientist and analyst has become quite
easy. This book is mainly intended to help people present their answers in a sensible way to the interviewer. The answers have been carefully rendered to keep things simple and yet represent the seriousness and complexity of the matter. Since data science is incomplete without mathematics, we have also included a part of the book dedicated to statistics. Python has already taught us that small code does not mean less powerful code; the same concept has been adopted to keep the book a powerful weapon for anyone attending an interview.

Table of Contents

1. Data Science Basic Questions and Terms
Q1: Explain the steps involved in data science?
Q2: Explain variable and different types of variables?
Q3: Explain Categorical measurement?
Q4: Explain Binary variables?
Q5: Explain Nominal measurement?
Q6: Explain Ordinal variable?
Q7: Explain Continuous variables?
Q8: Explain Discrete variables?
Q9: Is it possible to convert continuous values to discrete and vice versa?
Q10: What are interval variables?
Q11: What are ratio variables?
Q12: What are Univariate and Bivariate variables?
Q13: What is measurement error?
Q14: Explain Validity?
Q15: Explain Reliability?
Q16: What are the different ways to test hypotheses?
Q17: Explain the different types of variation?
Q18: Explain repeated-measures design?
Q19: What is independent design?
Q20: Explain the role of randomization w.r.t variation?
Q21: Explain various summary measures.
Q22: Explain alternate hypotheses and null hypotheses.
Q23: What is p value?
Q24: What happens when null hypotheses is rejected?
Q25: Explain directional and non-directional hypotheses.
Q26: Explain fit of model?
Q27: What is relation between sample and population?
Q28: What is estimation?
Q29: Explain deviation score?
Q30: Explain variance?
Q31: Explain Standard deviation.
Q32: Explain standard error.
Q33: What is precision?
Q34: Explain confidence intervals.
Q35: Explain confidence level.
Q36: Explain alpha.
Q37: Explain Beta.
Q38: Explain Accuracy.
Q39: Explain Bias.
Q40: What is central limit theorem?
Q41: Explain Absolute value?
Q42: What is degree of freedom?
Q43: Explain cluster sampling.
Q44: Explain Correlation coefficients?
Q45: Explain sample space.
Q46: What is non parametric algorithm?
Q47: How can learning be classified?
Q48: What is classification?
Q49: Explain the steps involved in classification.
Q50: What is regression?
Q51: Explain the similarities and differences between Classification and Regression.
Q52: Explain various terms encountered during classification algorithm.
Q53: Explain multi class classification?
Q54: Explain multi label classification?
Q55: Explain how multi label problem can be solved?
Q56: Explain some important metrics with respect to testing a model?
Q57: What is logistic regression?
Q58: Explain Naïve Bayes.
Q59: What is Stochastic Gradient Descent?
Q60: Explain decision tree algorithm.
Q61: What is Gini index?
Q62: Is Gini index the only means which can be used in decision tree?
Q63: What is Pruning w.r.t. decision tree?
Q64: What is random forest?
Q65: Explain the difference between Random forest and decision tree.
Q66: What is overfitting and underfitting?
Q67: What are the reasons for under fitting occurrences?
Q68: Does over fitting get affected by noise?
Q69: Explain KNN (K Nearest Neighbour) steps involved, advantage and disadvantage.
Q70: Explain selection bias.
Q71: What does selection bias indicate w.r.t. algorithm?
Q72: What is Bootstrap sample?
Q73: What is Resampling?
Q74: Explain tail.
Q75: Explain the difference between one way test and two way test.
Q76: Explain degree of freedom.
Q77: What is predictive modeling?
Q78: What is time series analysis?
Q79: What is deep learning?
Q80: What is Convolutional Neural Network?
Q81: What are different ways to determine optimal value of clusters?
Q82: What are various distance related functions for similarity measures?

2. Python Programming Questions
Q1: Is Python Object oriented?
Q2: Is Python case sensitive?
Q3: What kind of language is Python?
Q4: What are different versions of Python?
Q5: Explain different implementations of Python?
Q6: Is Python loosely typed?
Q7: How to start a new block in Python?
Q8: How to get data type of a particular variable?
Q9: How many ways can Python program be run?
Q10: Explain the importance of Pylint and Pychecker.
Q11: Explain Zen of Python.
Q12: How to print Zen in Python?
Q13: Explain Python data types.
Q14: How can we switch variables in Python?
Q15: What is the use of pass statement in Python?
Q16: Is Python pass by value or pass by reference?
Q17: Does Python support chained operations?
Q18: Explain ALL and ANY.
Q19: Explain the difference between IS and ==.
Q20: Explain supported collection of data type w.r.t. Python?
Q21: Create a simple number list?
Q22: Can you create nested list?
Q23: Explain CRUD (Create, Update, and Delete) operations from list.
Q24: Explain operations in dictionary.
Q25: Explain operation with tuples.
Q26: Explain del?
Q27: If del can remove variable can it remove tuple variable?
Q28: Delete last element in a list.
Q29: Predict the output of following code.
Q30: What do you mean by list comprehension?
Q31: Explain the preferred way for looping through list?
Q32: Find the reverse of the dictionary?
Q33: How to sort dictionary by value?
Q34: What is the use of shuffle function?
Q35: What is the preferred way to get a value based on key in Python?
Q36: Explain alternate way of merging 2 or more dictionaries without using update method?
Q37: What is the preferred way of fetching last element/second last and so on from a list?
Q38: What is the preferred way for reversing a list?
Q39: Explain various string utility functions in Python.
Q40: How to check whether two strings are equal.
Q41: Can string use single quote or double quote?
Q42: Explain type conversions on collection types.
Q43: Explain set theory operations supported by set data type.
Q44: Explain frozenset?
Q45: Explain functions in Python?
Q46: What is a Boolean function?
Q47: Can we specify data type for arguments as well as return types in Python?
Q48: Explain variable arguments?
Q49: Write a program to find occurrences or count of characters in given word.
Q50: What is **kwargs?
Q51: Write a simple Lambda expression?
Q52: Lambda forms in Python contain statements? True or False?
Q53: Explain filter function?
Q54: Explain steps involved in reading and writing a file?
Q55: Explain the term "with statement"?
Q56: Explain the preferred way of reading a big file?
Q57: Explain modules in Python.
Q58: Explain different ways of importing modules.
Q59: Can we create our own module?
Q60: Explain in brief about os module and its corresponding functions.
Q61: Using os module print the directory structure.
Q62: Explain dir function.
Q63: Explain exception handling in Python.
Q64: How to create user defined exception?
Q65: What is the use of raise statement?
Q66: How to create own class in Python? Explain constructors.
Q67: Is it necessary to have the first argument of class function as self? Can't we rename it to any other variable?
Q68: Explain inheritance in Python.
Q69: How to determine whether a particular class is sub class?
Q70: Does Python support multiple inheritance?
Q71: How is diamond problem resolved in case of Python?
Q72: Does Python support private method and variables?
Q73: Can __ be used for other purpose than creating private variables or functions?
Q74: Does Python support abstract classes?
Q75: Differentiate between static methods and class methods in Python.
Q76: What are named tuple?
Q77: How to sort using lambdas?
Q78: Explain Generators?
Q79: What is generator expression?
Q80: When Python program exits, all the memory is released? Say true or false?
Q81: Can a function be passed as parameter to another function?
Q82: Can a function be returned as result from another function?
Q83: Explain decorator function.
Q84: How can we represent big text in Python?
Q85: What is PEP 8?
Q86: What is anaconda?
Q87: How to install external modules?
Q88: What is Jupyter notebook?
Q89: What is pickling and unpickling?
Q90: Explain the importance of setup.py?
Q91: Is it possible to make connections to database using Python?
Q92: Explain meta programming?
Q93: Explain Python memory model.

3. Numpy Interview Questions
Q1: What is numpy?
Q2: How to install numpy?
Q3: How to create single dimension numpy array?
Q4: Explain different attributes provided by numpy?
Q5: Explain some utility methods provided by numpy for creating different elements?
Q6: How can we change shape of an object?
Q7: Which all data types are supported in Python?
Q8: Explain various simple mathematical operations which can be done on numpy?
Q9: Explain slicing operation in numpy?
Q10: Explain Boolean indexing?
Q11: Perform matrix multiplication using numpy?
Q12: Explain various functions available with numpy?
Q13: What is broadcast?
Q14: Explain rules of broadcasting.
Q15: Explain some statistical measures supported by numpy.
Q16: Explain functions available in numpy.linalg.
Q17: How to save numpy data from memory to flat file?
Q18: What is the use of where and extract?
Q19: What is the use of ndenumerate?
Q20: Explain how can we draw a histogram using numpy?

4. Pandas Interview Questions
Q1: What is Pandas?
Q2: How does Pandas represent data?
Q3: How to create Series?
Q4: How to create Data frame?
Q5: How are missing values represented in data frame?
Q6: Explain the process of creating indexes w.r.t. pandas?
Q7: Explain various attributes associated with series.
Q8: Explain various statistical measures supported by pandas.
Q9: Explain reindexing.
Q10: Explain bfill and ffill.
Q11: What all type of iterations are provided in pandas data frame?
Q12: Explain how sorting is supported in pandas?
Q13: How to override default reload option in pandas?
Q14: Explain various slicing options available with pandas?
Q15: Explain advanced statistics with pandas.
Q16: Explain rolling function.
Q17: How can we handle NA in pandas?
Q18: Explain group by function.
Q19: Explain merge functions w.r.t data frame.
Q20: Explain concat method.
Q21: Explain how time related range can be generated in pandas.
Q22: Explain which all data sources can pandas retrieve values.
Q23: Can you compare some of the functions of R and Python?
Q24: How to print a histogram using pandas?

5. Scipy and its Applications
Q1: Explain Scipy library.
Q2: Explain how can we perform Normality Tests.
Q3: Explain how can we perform correlation test?
Q4: Explain tests pertaining to Parametric Statistical Hypothesis Tests.
Q5: Explain how to test Nonparametric Statistical Hypothesis Tests.
Q6: Implement logistic regression in Python?
Q7: Explain how to implement decision tree in Python.
Q8: How to implement Random forest in Python?
Q9: How to implement support vector machine in Python?
Q10: Which all kernels are supported by svm in Python?
Q11: Implement KNN algorithm using Python.
Q12: How to select k in KNN algorithm?
Q13: How to implement K means in Python?
Q14: How can accuracy of any model be calculated?
Q15: Explain regression metrics.
Q16: Explain how we can print a decision tree or see the rules of the decision tree?
Q17: What is the use of boosting techniques?
Q18: Explain some of the advantages and disadvantages of boosting techniques?
Q19: What is AdaBoost?
Q20: Explain Gradient boosting?
Q21: Explain XGBoost?
Q22: Explain the differences/similarities between bagging and Boosting?
Q23: Write a small snippet to perform operation with neural networks using tensorflow and keras?

6. Matplotlib Samples to Remember
Q1: Explain how to draw bar plot.
Q2: How to draw histogram?
Q3: How to draw line chart?
Q4: Draw Pie chart.
Q5: How to get the equation of the line printed line plot?
Q6: Draw scatter plot.

7. Statistics with Excel Sheet
Q1: Does Excel have any support for statistics?
Q2: Find correlation using Excel.
Q3: How to get Histogram in excel?
Q4: Explain how to get Descriptive Statistics using Excel.
Q5: Explain how to perform Anova in excel?
Q6: Explain how to perform Rank and Percentile in excel.

CHAPTER 1
Data Science Basic Questions and Terms

Note: [Q: Question Number and Ans: Answer]

Q1: Explain the steps involved in data science?
Ans: Following are the steps involved:
- Get data from the various data sources available.
- Generate a research question from the data.
- Identify the variables present in the data. Also identify the important variables, that is, the variables to be analyzed.
- Generate a hypothesis.
- Analyze the data using graphs, for example a histogram.
- Fit a model to the analyzed data.
- Accept or reject the hypothesis.
- The answer to the research question is found.

Example of the above steps:
Get data related to temperature for India. Reference: https://data.gov.in/catalog/annual-and-seasonal-maximum-temperature-india
A template of the data set:
"YEAR","ANNUAL","JAN-FEB","MAR-MAY","JUN-SEP","OCT-DEC"
"1901","28.96","23.27","31.46","31.27","27.25"
"1902","29.22","25.75","31.76","31.09","26.49"
"1903","28.47","24.24","30.71","30.92","26.26"
"1904","28.49","23.62","30.95","30.67","26.40"
"1905","28.30","22.25","30.00","31.33","26.57"
"1906","28.73","23.03","31.11","30.86","27.29"
"1907","28.65","24.23","29.92","30.80","27.36"
"1908","28.83","24.42","31.43","30.72","26.64"
"1909","28.39","23.52","31.02","30.33","26.88"
"1910","28.53","24.20","31.14","30.48","26.20"
"1911","28.62","23.90","30.70","31.14","26.31"
"1912","28.95","24.88","31.10","31.15","26.57"
"1913","28.67","24.25","30.89","30.92","26.42"
"1914","28.66","24.59","30.73","30.84","26.40"
"1915","28.94","23.22","31.06","31.51","27.18"
"1916","28.82","24.57","31.88","30.52","26.32"
"1917","28.11","24.52","30.06","30.24","25.74"
"1918","28.66","23.57","30.68","31.11","26.77"
Research question: Is the annual temperature in India rising?
Variable of interest from the above data set: ANNUAL.
Hypothesis: Temperature is rising.
Analyze data from the above data set.
Fit the model.
Hypothesis accepted or rejected.

Q2: Explain variable and different types of variables?
Ans: Anything whose value can change is called a variable. Variables are of different types:
Dependent/Outcome: A variable being affected, for example the annual temperature in the above example.
Independent/Predictor: A variable affecting the outcome, for example deforestation, pollution, and so on in the above example.

Q3: Explain Categorical measurement?
Ans: A categorical measurement contains categories, i.e. distinct entities. For example, the categories of life on earth are plants, animals, and so on.

Q4: Explain Binary variables?
Ans: Binary variables are those in which only two classes exist, like alive or dead, male or female, on or off.

Q5: Explain Nominal measurement?
Ans: Nominal measurements are categorical measurements with more than two classes and no inherent order. Such categories can be labelled with numbers too.

Q6: Explain Ordinal variable?
Ans: These are nominal variables which have a logical order. Examples include team ranks in cricket or football, or the merit list of students in a grade.

Q7: Explain Continuous variables?
Ans: These are variables which can take any value on the measurement scale. An example is the pitch of a voice, which can take any possible value within its range.

Q8: Explain Discrete variables?
Ans: These are variables which can take only certain fixed values in a range. For example, the number of customers in a bank.

Q9: Is it possible to convert continuous values to discrete and vice versa?
Ans: Yes, depending on the motive of the study, it is possible to convert continuous values to discrete and vice versa. For example, the level of water in a tank can take any value in the range and is therefore a continuous variable. But we can approximate it to three levels such as empty, half full, or full, and it then becomes discrete in nature.

Q10: What are interval variables?
Ans: These are variables which are grouped on intervals.
For example, age can be divided into ranges like 10-20, 20-30, and so on, and a person of a particular age is placed in one of these groups. When the intervals are equal, they represent equal differences in the property being measured.

Q11: What are ratio variables?
Ans: This is a sub type of interval variables where the ratio of scales is used for measurement. For example, water is represented in chemistry as H2O, which represents two atoms of hydrogen and one atom of oxygen. Thus, the ratio of the elements is 2:1.

Q12: What are Univariate and Bivariate variables?
Ans: Univariate variable: When only one variable is under consideration, it is called a univariate variable study.
Bivariate variable: Involves the study of the relationship between two variables.

Q13: What is measurement error?
Ans: The numerical discrepancy between the measured value and the actual value is called measurement error. For example, suppose we want 1 kilogram of fruit and the vendor's weighing machine shows 1 kilogram when we buy it. If another machine then shows that the same fruit weighs 0.1 kilogram less than expected, this difference is what we call measurement error.

Q14: Explain Validity?
Ans: Validity indicates whether an instrument measures what it is supposed to measure.

Q15: Explain Reliability?
Ans: Reliability indicates whether the instrument gives consistent results across different conditions. For example, if we measure the same value twice on the same entity, the results from the instrument should remain the same if it is to be reliable. Such tests are known as test-retest.

Q16: What are the different ways to test hypotheses?
Ans: There are two ways in which hypotheses can be tested:
Correlational research
- This is also known as cross-sectional research
- This involves observing the natural pattern or occurrence to test
- Original occurrences are not manipulated
Experimental research
- We select the variables of interest
- Then we manipulate some aspect of the environment
- Observe the effect on the selected variable

Q17: Explain the different types of variation?
Ans: There are two types of variation, explained as follows:
Systematic variation: Introduced by the experimenter. The participants are tested under different conditions, and the difference in conditions is introduced by the experimenter. For example, to test the use of woolen clothes w.r.t. temperature, we can test a group of 20 people in both hot and cold climates. Thus, the difference introduced here is in terms of temperature only.
Unsystematic variation: Introduced by random factors that exist between the experimental conditions. For example, to test the use of woolen clothes w.r.t. temperature, we can test a group of 20 people. Some of the selected people might behave differently than expected due to factors like illness and so on.

Q18: Explain repeated-measures design?
Ans: The same measure is taken under different conditions on the same set of participants.
The difference between the two conditions can be caused by the following:
- The manipulation/changes carried out on the participants
- Factors that might affect the way in which a participant performs from one time to the next

Q19: What is independent design?
Ans: The same measure is taken under different conditions on different sets of participants. The differences between the two conditions can be caused by the following:
- The manipulation/changes carried out on the participants
- Differences in the nature or characteristics of the participants in each case

Q20: Explain the role of randomization w.r.t variation?
Ans: By using randomization, we can ensure that any variation observed is due to the changes in the conditions/variables introduced rather than to any other unexpected changes during the process. Thus, it helps in removing other sources of systematic variation.

Q21: Explain various summary measures.
Ans: Mode: Represents the value/score which occurs most frequently in the data set. For example, suppose the T-shirt numbers of the players scoring goals in football are as follows: 1, 2, 2, 3, 4, 1, 1, 1, 1, 1, 1. Arranged as a frequency table, the value 1 occurs 7 times, 2 occurs twice, and 3 and 4 occur once each. Thus the mode in the above example is 1, which has the maximum frequency of 7. This can also be read directly off a histogram of the values.
Median: This is the middle value obtained by ordering the values/scores in ascending order. If there are two middle values, their average is taken. For example: the median of 2, 3, 4, 5, 6 is 4. The median of 2, 3, 4, 5 is the average of 3 and 4, which is 3.5.
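The mode and median examples above can be checked with a minimal sketch using Python's standard library. The variable names are illustrative and are not taken from the book.

```python
from collections import Counter
from statistics import median, mode

# T-shirt numbers of the goal scorers, as listed in the mode example
scores = [1, 2, 2, 3, 4, 1, 1, 1, 1, 1, 1]

# Frequency table: how often each value occurs
frequency = Counter(scores)
for value, count in sorted(frequency.items()):
    print(value, count)

most_common = mode(scores)             # value with the highest frequency
median_odd = median([2, 3, 4, 5, 6])   # single middle value of the sorted data
median_even = median([2, 3, 4, 5])     # average of the two middle values
print(most_common, median_odd, median_even)
```

Counter builds the frequency table that a histogram would display, so the mode is simply the key with the largest count.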
The median is the summary measure least affected by outliers.
Mean: Represents the mathematical average, which is the sum of all the elements divided by the number of elements. For example, the average strike rate of a batsman in cricket is the average of the strike rates in individual matches.
Range of scores: The difference between the maximum value and the minimum value is called the range of scores. This indicates dispersion.
Trimmed mean: Represents the mean after removing extreme cases from both ends, i.e. from the minimum and the maximum end. The minimum and maximum values may not be normal and may hence be outliers. So, while finding the trimmed mean, we specify the percentage of values to be ignored from both ends. Hence, the trimmed mean gives a better representation of the data, excluding outliers.
Interquartile range (IQR): A measure of variability, based on dividing a data set into quartiles. Quartiles are the three values that split the sorted data into four equal parts:
- Q1 is the middle value in the first half of the rank-ordered data set
- Q2 is the median value in the set
- Q3 is the middle value in the second half of the rank-ordered data set
The interquartile range is equal to Q3 minus Q1. The lower quartile is the median of the lower half of the data. The upper quartile is the median of the upper half of the data.
Mean absolute deviation: Represents the mean of the absolute values of the deviations from the mean.
Mean absolute deviation from median: Represents the mean of the absolute values of the deviations from the median.
Outliers: Represent values which are not normal or not within the expected range, for example data corrupted at the time of capture or for some other reason. Since outliers affect the mean and other summary values, they need to be handled properly.

Q22: Explain alternate hypotheses and null hypotheses.
Ans: Alternative hypotheses:
- Also called experimental hypotheses
- Denoted by H1
- Assumes that the predicted effect exists
Null hypotheses:
- Denoted by H0
- Assumes that the predicted effect does not exist
- Thus, this represents the opposite of the alternative hypothesis

Q23: What is p value?
Ans: It measures the strength of the evidence against the null hypothesis: it is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true. If this value is less than the significance level, the null hypothesis is rejected; otherwise it is not rejected. The range of values that leads the researcher to accept the null hypothesis is called the region of acceptance. The region other than the region of acceptance is called the region of rejection.

Q24: What happens when null hypotheses is rejected?
Ans: When the null hypothesis is rejected, the alternative hypothesis is supported. If the rejected null hypothesis was actually true, a Type 1 error has been made. The probability of a Type 1 error occurring is called the significance level.
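The decision rule behind Q23 and Q24 can be illustrated with a small sketch. It follows the standard convention that the null hypothesis is rejected when the p value falls below the significance level. The coin-toss data, the 0.05 significance level, and the helper name two_sided_binomial_p are assumptions made for illustration and do not come from the book.

```python
from math import comb

def two_sided_binomial_p(heads: int, tosses: int) -> float:
    """Exact two-sided p value for H0: the coin is fair (p = 0.5)."""
    # Use the tail that is at least as extreme as the observed count
    k = max(heads, tosses - heads)
    upper_tail = sum(comb(tosses, i) for i in range(k, tosses + 1)) / 2 ** tosses
    # The fair-coin distribution is symmetric, so doubling one tail covers both
    return min(1.0, 2 * upper_tail)

alpha = 0.05  # significance level: the accepted probability of a Type 1 error
p_value = two_sided_binomial_p(heads=16, tosses=20)  # hypothetical observation
reject_null = p_value < alpha

print(f"p value = {p_value:.4f}, reject H0: {reject_null}")
```

Choosing a smaller alpha makes a Type 1 error less likely, at the cost of making it harder to reject a null hypothesis that is in fact false.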
Q25: Explain directional and non-directional hypotheses.
Ans: Directional hypotheses:
- Give an indication of whether the effect being studied would grow positively or negatively
- A one tailed test is generally used for such cases
Non directional hypotheses:
- Do not give an indication of whether the effect being studied would grow positively or negatively
- A two tailed test is generally used for such cases

Q26: Explain fit of model?
Ans: It represents the degree to which the determined statistical model represents the data. A model can under fit, over fit, or fit the data well. A model which is a good fit has low variance between the calculated values and the measured values.

Q27: What is relation between sample and population?
Ans: Samples are a subset or part of the original data or population. If the population is very big, performing analysis on the whole population is not easy. Hence, a subset of the population data is taken, which is known as sample data. Whether the sample data is a true representation of the original data set is determined with the help of estimation, confidence intervals, and so on.

Q28: What is estimation?
Ans: By using the information available from a sample, we can make inferences w.r.t. the population, which is known as estimation. The parameters used are the mean, standard deviation, and so on.

Q29: Explain deviation score?
Ans: This is defined as the difference between the actual score/value and the mean.

Q30: Explain variance?
Ans: Variance is the average of the squared deviations of the measured values from the mean. It indicates how far the observed values lie from the calculated average. It is an indication of how much the individuals in a group differ or vary from each other.
The population variance is given by:
$PV = \frac{\sum_{i=1}^{N} (X_i - X)^2}{N}$
where
PV is the population variance
X is the population mean
X_i is the i-th element from the population
N is the number of elements in the population
The sample variance is given by:
$SV = \frac{\sum (x_i - x)^2}{n - 1}$
SV is the sample variance
x is the sample mean
x_i is the i-th element from the sample
n is the number of elements in the sample

Q31: Explain Standard deviation.
Ans: The square root of the variance is called the standard deviation. Taking the square root keeps the measure in the same units as the original data. It indicates how close the measured points are to the mean: the smaller the standard deviation, the nearer the points are to the mean, and vice versa.

Q32: Explain standard error.
Ans: Standard error indicates how well a sample represents the original population. When we draw various small samples from the original population, we would like to know how much a sample statistic differs from the population value. This is represented by the standard error. The smaller the standard error, the closer the sample is to a true representation of the original population.
$SE = \sqrt{\frac{\sum (\text{sample mean} - \text{population mean})^2}{\text{number of samples}}}$

Q33: What is precision?
Ans: It refers to the closeness between estimates from different samples. It is therefore inversely related to standard error: the smaller the standard error, the higher the precision.

Q34: Explain confidence intervals.
Ans: A confidence interval gives the boundaries within which the population mean is expected to fall. The limits are constructed such that, for a certain percentage of samples, the true value of the population mean falls within the range.

Q35: Explain confidence level.
Ans: Refers to the percentage of all possible samples that can be expected to include the true population parameter.

Q36: Explain alpha.
Ans: Alpha is defined as 1 minus the confidence level. It is the probability that the true value lies outside the confidence interval. If the confidence level is 99%, then alpha is 1 - 0.99, which is 0.01.

Q37: Explain Beta.
Ans: The probability of committing a Type 2 error is called beta. A Type 2 error is one in which a false null hypothesis is not rejected. The probability of avoiding a Type 2 error (1 - beta) is called the power of the test.

Q38: Explain Accuracy.
Ans: It indicates how closely a sample statistic matches the corresponding population parameter. If the means of the sample and the population are exactly equal, we can say that the sample is fully accurate. If they are not equal, we say that the sample is accurate to within n, where n is the difference between the sample value and the population value.

Q39: Explain Bias.
Ans: Bias indicates whether the estimate from a sample overestimates or underestimates the population value. For example, if the population mean is 4 and the sample mean calculated is 3, this is an underestimate. An estimate in which the sample and population values are not equal is called a biased estimate.

Q40: What is central limit theorem?
Ans: It states that the distribution of the sample mean of independent random variables drawn from the same distribution approaches a normal distribution as the sample size becomes large, whatever the shape of the original distribution. Generally, a sample size above 30, or sometimes 40, is taken as the reference. This allows us to use the normal distribution for larger samples without having to draw hundreds or thousands of samples. The standard normal distribution is preferred because its mean is zero and its variance is one.

Q41: Explain Absolute value?
Ans: The absolute value of a number is its magnitude irrespective of its sign, so it is never negative.

Q42: What is degree of freedom?
Ans: It is equal to the number of independent observations in a sample minus the number of population parameters to be estimated.

Q43: Explain cluster sampling.
Ans: In this method, the number of clusters or groups to be formed from the population data is decided in advance (generally denoted by N). The number of elements in each cluster is known, and each element of the population is assigned to exactly one cluster. For example, clustering can be done on attributes like customer state. Some clusters are then selected at random. Sampling can be further classified as:
- One-stage sampling: All of the elements within the selected clusters are included in the sample.
- Two-stage sampling: A subset of elements within the selected clusters is randomly selected.

Q44: Explain Correlation coefficients?
Ans: Correlation indicates the relationship between two variables.
Variables are positively correlated when an increase in one variable is accompanied by an increase in the other. Variables are negatively correlated when an increase in one variable is accompanied by a decrease in the other. The formula is given by:
$r = \frac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}}$
where r is the correlation coefficient, and x and y are the deviations of the two variables under consideration from their respective means. The value of the correlation coefficient ranges from -1 to +1.

Q45: Explain sample space.
Ans: The set of all possible outcomes of a statistical experiment is called the sample space. Any single outcome from this space is called a sample point. A set of one or more sample points is called an event. When events do not have any sample point in common, they are known as mutually exclusive events.

Q46: What is non parametric algorithm?
Ans: A non parametric algorithm does not make any assumptions about the distribution of the data.

Q47: How can learning be classified?
Ans: Following are the classifications:
Supervised:
- Data is clearly labeled, and the algorithms learn to predict the output from the input data
- Offline analysis of data is possible
- The number of classes is predefined
- Accuracy is generally high
- Examples include classification and regression
Unsupervised:
- Data is unlabeled, and the algorithms learn the inherent structure of the input data
- Analysis is on real-time data
- The number of classes may be unknown
- Accuracy ranges from moderate to high
- Examples include clustering and association
Semi-supervised:
- A mixture of labeled and unlabeled data is used
- Can be considered intermediate between the above two

Q48: What is classification?
Ans: Classification predictive modeling is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y). The output variables are often called labels or categories. The mapping function predicts the class or category for a given observation.
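To make the idea of a mapping function from inputs X to discrete labels y concrete, here is a minimal sketch of a one-nearest-neighbour classifier written with the Python standard library. The technique (1-NN), the function name `predict_1nn`, and the tiny labelled data set are assumptions introduced for illustration; the text does not prescribe any particular classifier.

```python
from math import dist


def predict_1nn(train_X, train_y, x):
    """Return the label of the training point closest to x (Euclidean distance)."""
    nearest = min(range(len(train_X)), key=lambda i: dist(train_X[i], x))
    return train_y[nearest]


# Hypothetical labelled observations: (height_cm, weight_kg) -> category.
train_X = [(150, 50), (155, 55), (180, 85), (185, 90)]
train_y = ["small", "small", "large", "large"]

# The function plays the role of f: it maps a new observation to a class label.
print(predict_1nn(train_X, train_y, (152, 53)))
print(predict_1nn(train_X, train_y, (182, 88)))
```

Because the prediction is read directly from the stored training points, this sketch makes no assumption about how the data is distributed, which also makes it an example of a non parametric method.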