Data Science Interview Questions 2024
Data Science Interview Questions
Data Science is an interdisciplinary field that combines scientific processes, algorithms, and machine learning techniques to analyze raw data, identify patterns, and generate actionable insights using statistical and mathematical tools.
The Data Science Life Cycle involves:
- Requirement gathering – Understanding business goals.
- Data acquisition and maintenance – Data cleaning, warehousing, staging, and architecture.
- Data exploration and processing – Mining, analyzing, and summarizing data.
- Algorithm application – Predictive analysis, regression, text mining, etc.
- Insights communication – Data visualization and reporting through business intelligence tools.
1. What is Data Science?
Data Science is an interdisciplinary field that combines various scientific processes, algorithms, tools, and machine learning techniques to find common patterns and gather meaningful insights from raw input data using statistical and mathematical analysis.
The life cycle of data science proceeds through the following stages:
- It starts with gathering the business requirements and relevant data.
- Once the data is acquired, it is maintained by performing data cleaning, data warehousing, data staging, and data architecture.
- Data processing covers exploring the data, mining it, and analyzing it, the results of which are then used to summarize the insights extracted from the data.
- Once the exploratory steps are completed, the cleansed data is subjected to various algorithms like predictive analysis, regression, text mining, pattern recognition, etc., depending on the requirements.
- In the final stage, the results are communicated to the business in a visually appealing manner. This is where the skills of data visualization and reporting, and different business intelligence tools, come into the picture.
2. What is the difference between data analytics and data science?
- Data science involves transforming data using various technical analysis methods to extract meaningful insights, which a data analyst can then apply to their business scenarios.
- Data analytics deals with checking existing hypotheses and information, and answers questions for a better and more effective business decision-making process.
- Data science drives innovation by answering questions that build connections and solutions for future problems. Data analytics focuses on extracting present meaning from existing historical context, whereas data science focuses on predictive modeling.
- Data science can be considered a broad subject that makes use of various mathematical and scientific tools and algorithms to solve complex problems, whereas data analytics is a more specific field dealing with concentrated problems using fewer tools, mainly statistics and visualization.
3. What are some of the techniques used for sampling? What is the main advantage of sampling?
Data analysis cannot be performed on the whole volume of data at a time, especially when it involves larger datasets. It becomes crucial to take data samples that represent the whole population and then perform the analysis on them. While doing this, it is very important to draw the sample carefully so that it truly represents the entire dataset.
There are two major categories of sampling techniques, based on the usage of statistics (a short code sketch of both follows the list below):
- Probability Sampling techniques: Clustered sampling, Simple random sampling, Stratified sampling.
- Non-Probability Sampling techniques: Quota sampling, Convenience sampling, snowball sampling, etc.
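Below is a minimal Python sketch of the two probability-sampling techniques named above, simple random sampling and stratified sampling, using pandas and scikit-learn. The DataFrame and the "segment" column are invented purely for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical population with two strata, "A" (70%) and "B" (30%).
df = pd.DataFrame({
    "value": range(1000),
    "segment": ["A"] * 700 + ["B"] * 300,
})

# Simple random sampling: every row has the same chance of selection.
simple_sample = df.sample(n=100, random_state=42)

# Stratified sampling: preserve the A/B proportions of the population.
stratified_sample, _ = train_test_split(
    df, train_size=100, stratify=df["segment"], random_state=42
)

print(simple_sample["segment"].value_counts())
print(stratified_sample["segment"].value_counts())
```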
4. Differentiate between the long and wide format data.
| Long-Format Data | Wide-Format Data |
| --- | --- |
| Each row of the data represents one observation (one point in time) of a subject; each subject has its data spread over multiple rows. | The repeated responses of a subject are stored as separate columns of a single row. |
| The data can be recognized by considering rows as groups. | The data can be recognized by considering columns as groups. |
| This format is most commonly used in R analyses and for writing to log files after each trial. | This format is rarely used in R analyses and is most commonly used in stats packages for repeated-measures ANOVAs. |
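As a small illustration of the two layouts, the sketch below converts a hypothetical wide-format table to long format and back using pandas; the column names are made up for the example.

```python
import pandas as pd

# Wide format: repeated responses of a subject sit in separate columns.
wide = pd.DataFrame({
    "subject": ["s1", "s2"],
    "trial_1": [3.1, 2.8],
    "trial_2": [3.4, 2.9],
})

# Wide -> long: each repeated response becomes its own row.
long = wide.melt(id_vars="subject", var_name="trial", value_name="response")

# Long -> wide: responses go back into one row per subject.
wide_again = long.pivot(index="subject", columns="trial", values="response")

print(long)
print(wide_again)
```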
5. What does it mean when the p-values are high and low?
A p-value is the probability of obtaining results at least as extreme as those actually observed, assuming that the null hypothesis is correct. It represents the probability that the observed difference occurred purely by chance.
- A low p-value (≤ 0.05) means that the null hypothesis can be rejected: the observed data is unlikely under a true null.
- A high p-value (≥ 0.05) indicates strength in favor of the null hypothesis: the observed data is likely under a true null.
- A p-value around 0.05 is marginal, and the hypothesis could go either way.
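To make the interpretation concrete, here is a small illustrative example using SciPy's two-sample t-test on synthetic data; the groups and the 0.05 threshold are assumptions made purely for the demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=100)   # synthetic sample A
group_b = rng.normal(loc=52, scale=5, size=100)   # synthetic sample B

t_stat, p_value = stats.ttest_ind(group_a, group_b)

if p_value <= 0.05:
    print(f"p = {p_value:.4f}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f}: fail to reject the null hypothesis")
```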
6. When is resampling done?
Resampling is a methodology used to sample data in order to improve accuracy and quantify the uncertainty of population parameters. It ensures the model is good enough by training it on different patterns of a dataset so that variations are handled. It is also done when models need to be validated using random subsets of the data, or when labels are substituted on data points while performing significance tests.
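The sketch below illustrates two common resampling ideas, bootstrapping a statistic to quantify its uncertainty and k-fold cross-validation to validate a model on random subsets; the data and the model choice are placeholders for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
data = rng.normal(loc=10, scale=2, size=500)

# Bootstrap: resample with replacement to estimate uncertainty of the mean.
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(1000)]
print("95% CI for the mean:", np.percentile(boot_means, [2.5, 97.5]))

# Cross-validation: train and validate on different subsets of the dataset.
X, y = make_classification(n_samples=300, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("CV accuracy per fold:", scores)
```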
7. What do you understand by Imbalanced Data?
Data is said to be highly imbalanced if it is distributed unequally across the different categories (classes). Such datasets degrade model performance and lead to inaccurate models, since a model can score well simply by favoring the majority class.
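A small sketch of how imbalance is typically detected and handled follows; the 95/5 split, the class-weighting option, and the oversampling step are illustrative choices, not the only remedies.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical dataset: 95% of labels are 0, only 5% are 1.
df = pd.DataFrame({"feature": range(1000),
                   "label": [0] * 950 + [1] * 50})
print(df["label"].value_counts(normalize=True))

# Option 1: let the model reweight classes inversely to their frequency.
model = LogisticRegression(class_weight="balanced")

# Option 2: oversample the minority class until both classes are equal in size.
minority = df[df["label"] == 1]
balanced = pd.concat([df, minority.sample(n=900, replace=True, random_state=0)])
print(balanced["label"].value_counts())
```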
8. Are there any differences between the expected value and mean value?
There are not many differences between the two, but it should be noted that they are used in different contexts. The term mean value generally refers to a probability distribution, whereas the term expected value is used in contexts involving random variables.
9. What do you understand by Survivorship Bias?
Survivorship bias refers to the logical error of focusing on aspects that survived some process while overlooking those that did not, due to their lack of prominence. This bias can lead to wrong conclusions.
10. Define the terms KPI, lift, model fitting, robustness and DOE?
- KPI: KPI stands for Key Performance Indicator that measures how well the business achieves its objectives.
- Lift: This is a performance measure of the target model measured against a random choice model. Lift indicates how good the model is at prediction versus if there was no model.
- Model fitting: This indicates how well the model under consideration fits given observations.
- Robustness: This represents the system’s capability to handle differences and variances effectively.
- DOE: DOE stands for Design of Experiments, the design of a task that aims to describe and explain the variation of information under conditions hypothesized to reflect the variables.
11. Define confounding variables?
Confounding variables are also known as confounders. They are a type of extraneous variable that influences both the independent and the dependent variable, causing a spurious association, i.e., a mathematical relationship between variables that are associated but not causally related to each other.
12. Define and explain selection bias?
Selection bias occurs when the researcher has to decide which participants to study and the selection is not random. It is also called the selection effect, and it is caused by the method of sample collection.
Four types of selection bias are explained below:
- Sampling Bias: When the sample is not drawn at random from the population, some members of the population have fewer chances of being included than others, resulting in a biased sample. This causes a systematic error known as sampling bias.
- Time interval: Trials may be stopped early when an extreme value is reached, but if all the variables are otherwise similar, the variable with the highest variance has a higher chance of achieving that extreme value.
- Data: It is when specific data is selected arbitrarily and the generally agreed criteria are not followed.
- Attrition: Attrition in this context means the loss of the participants. It is the discounting of those subjects that did not complete the trial.
13. Define the bias-variance trade-off?
Let us first understand the meaning of bias and variance in detail:
Bias: Bias is a kind of error introduced into a machine learning model when the algorithm is oversimplified. While training, the model makes simplifying assumptions so that it can learn the target function more easily. Algorithms such as decision trees and SVMs tend to have low bias, whereas linear and logistic regression are algorithms with high bias.
Variance: Variance is also a kind of error. It is introduced into an ML model when the algorithm is made highly complex. Such a model also learns noise from the training data set and therefore performs badly on the test data set. This leads to overfitting as well as high sensitivity to the training data.
When the complexity of a model is increased, a reduction in error is seen at first because of the lower bias of the model. However, this only continues up to a particular point, the optimal point. If we keep increasing the complexity beyond it, the model becomes overfitted and suffers from high variance.
Trade-off of Bias and Variance: Since bias and variance are both errors in machine learning models, it is essential that any machine learning model has low variance as well as low bias so that it can achieve good performance.
Let us see some examples. The K-Nearest Neighbour algorithm is a good example of an algorithm with low bias and high variance. This trade-off can easily be reversed by increasing the value of k, i.e. increasing the number of neighbours considered, which in turn increases the bias and reduces the variance (see the sketch below).
Another example is the support vector machine. This algorithm also has high variance and, correspondingly, low bias, and the trade-off can be reversed by decreasing the value of the regularization parameter C: a smaller C (stronger regularization) increases the bias and decreases the variance.
So, the trade-off is simple. If we increase the bias, the variance will decrease and vice versa.
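The KNN example can be made concrete with the sketch below: as k increases, the gap between training and test accuracy typically narrows (lower variance) while the model becomes less flexible (higher bias). The dataset and the particular k values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 5, 25):
    # Larger k = more neighbours averaged = higher bias, lower variance.
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k:2d}  train={knn.score(X_tr, y_tr):.2f}  "
          f"test={knn.score(X_te, y_te):.2f}")
```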
14. Define the confusion matrix?
For a binary classifier, it is a matrix with 2 rows and 2 columns that summarizes the 4 possible outcomes of the classifier's predictions. It is used to derive various measures like specificity, error rate, accuracy, precision, sensitivity, and recall.
The test data set should contain the correct (observed) labels as well as the predicted labels. If the binary classifier performed perfectly, the predicted labels would exactly match the observed labels; in real-world scenarios they match only partly. The four outcomes in the confusion matrix mean the following:
- True Positive: This means that the positive prediction is correct.
- False Positive: This means that the positive prediction is incorrect.
- True Negative: This means that the negative prediction is correct.
- False Negative: This means that the negative prediction is incorrect.
The formulas for calculating the basic measures that come from the confusion matrix are given below (a worked example in code follows this list):
- Error rate: (FP + FN)/(P + N)
- Accuracy: (TP + TN)/(P + N)
- Sensitivity = TP/P
- Specificity = TN/N
- Precision = TP/(TP + FP)
- F-Score = (1 + b²)(Precision × Recall)/(b² × Precision + Recall). Here, b is usually 0.5, 1, or 2.
In these formulas:
FP = false positives
FN = false negatives
TP = true positives
TN = true negatives
P = total actual positives (TP + FN)
N = total actual negatives (TN + FP)
Also,
Sensitivity is the measure of the True Positive Rate. It is also called recall.
Specificity is the measure of the true negative rate.
Precision is the measure of a positive predicted value.
F-score is the harmonic mean of precision and recall.
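As a worked example, the sketch below computes the confusion matrix and the measures listed above with scikit-learn; the true and predicted labels are made up.

```python
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# For binary 0/1 labels, ravel() returns TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, TN, FN:", tp, fp, tn, fn)

print("Accuracy   :", accuracy_score(y_true, y_pred))
print("Precision  :", precision_score(y_true, y_pred))
print("Recall     :", recall_score(y_true, y_pred))   # sensitivity / TPR
print("Specificity:", tn / (tn + fp))                 # TN / N
print("F1 score   :", f1_score(y_true, y_pred))       # b = 1
```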
15. What is logistic regression? State an example where you have recently used logistic regression?
Logistic Regression is also known as the logit model. It is a technique to predict the binary outcome from a linear combination of variables (called the predictor variables).
For example, let us say that we want to predict the outcome of elections for a particular political leader, i.e. whether this leader is going to win the election or not. So, the result is binary: win (1) or loss (0). The input is a linear combination of predictor variables such as the money spent on advertising, the past work done by the leader and the party, and so on.
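A minimal sketch of this election example with scikit-learn follows; the two features (advertising spend and a past-performance score) and all the numbers are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: [advertising spend (in millions), past performance score]
X = np.array([[1.0, 4.0], [2.5, 7.0], [0.5, 3.0],
              [3.0, 8.5], [1.5, 5.0], [2.0, 6.5]])
y = np.array([0, 1, 0, 1, 0, 1])   # 1 = won the election, 0 = lost

model = LogisticRegression().fit(X, y)

new_candidate = np.array([[2.2, 7.5]])
print("Predicted outcome:", model.predict(new_candidate)[0])
print("Win probability  :", model.predict_proba(new_candidate)[0, 1])
```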
16. What is Linear Regression? What are some of the major drawbacks of the linear model?
Linear regression is a technique in which the score of a variable Y is predicted using the score of a predictor variable X. Y is called the criterion variable. Some of the drawbacks of Linear Regression are as follows:
- Its assumption of linearity (a linear relationship between the predictor and the outcome, with well-behaved errors) is a major drawback, since real relationships are often not linear.
- It cannot be used for binary outcomes; we have Logistic Regression for that.
- It is prone to overfitting, and the basic linear model has no built-in way to deal with it.
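For completeness, here is a minimal sketch of fitting a simple linear regression with scikit-learn on synthetic data, predicting the criterion variable Y from the predictor variable X.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))                 # predictor variable X
y = 3.0 * X.ravel() + 2.0 + rng.normal(0, 1, 100)     # criterion variable Y

model = LinearRegression().fit(X, y)
print("slope    :", model.coef_[0])
print("intercept:", model.intercept_)
print("R^2      :", model.score(X, y))
```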
17. What is a random forest? Explain its working.
Classification is very important in machine learning: it is essential to know which class an observation belongs to. Hence, we have various classification algorithms in machine learning, like logistic regression, support vector machines, decision trees, the Naive Bayes classifier, etc. One such classification technique that is near the top of the classification hierarchy is the random forest classifier.
So, firstly we need to understand a decision tree before we can understand the random forest classifier and how it works. Suppose we have a string of 9 characters, 5 ones and 4 zeroes, and we want to classify the characters of this string using their features. These features are colour (red or green in this case) and whether the character is underlined or not. Now, suppose we are only interested in red, underlined characters.
The decision tree starts with colour, since we are only interested in the red observations, and separates the red and the green-coloured characters. The "No" branch, i.e. the branch that holds all the green-coloured characters, is not expanded further because we only want red, underlined characters. The "Yes" branch is expanded again into a "Yes" and a "No" branch based on whether the characters are underlined or not.
This is how a typical decision tree is drawn. Real-life data is rarely this clean, but this gives an idea of how decision trees work. Let us now move to the random forest.
Random Forest
It consists of a large number of decision trees that operate as an ensemble. Basically, each tree in the forest gives a class prediction, and the class with the maximum number of votes becomes the prediction of our model. For instance, if 4 decision trees predict 1 and 2 predict 0, prediction 1 will be taken.
The underlying principle of a random forest is that several weak learners combine to form a strong learner. The steps to build a random forest are as follows (a short sketch follows the list):
- Build several decision trees on the samples of data and record their predictions.
- Each time a split is considered for a tree, choose a random sample of m predictors as the split candidates out of all p predictors. This happens for every tree in the random forest.
- Apply the rule of thumb: at each split, m ≈ √p.
- Combine the trees' predictions using the majority-vote rule.
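A minimal sketch of these steps using scikit-learn's RandomForestClassifier is shown below; the synthetic dataset and the hyperparameters are illustrative, and max_features="sqrt" corresponds to the m ≈ √p rule of thumb.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of decision trees in the ensemble
    max_features="sqrt",   # m ~ sqrt(p) predictors considered at each split
    random_state=0,
).fit(X_tr, y_tr)

# The forest's prediction is the majority vote of its trees.
print("Test accuracy:", forest.score(X_te, y_te))
```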
18. In a time interval of 15 minutes, the probability that you may see a shooting star or a bunch of them is 0.2. What is the percentage chance of you seeing at least one shooting star if you are under the sky for about an hour?
Let us say that Prob is the probability that we see at least one shooting star in a 15-minute interval.
So, Prob = 0.2
Now, the probability that we do not see any shooting star in 15 minutes is
= 1 – Prob = 1 – 0.2 = 0.8
The probability that we do not see any shooting star for an hour (four consecutive 15-minute intervals) is:
= (1 – Prob) × (1 – Prob) × (1 – Prob) × (1 – Prob)
= 0.8 × 0.8 × 0.8 × 0.8 = (0.8)⁴
= 0.4096 ≈ 0.41
So, the probability that we see at least one shooting star within an hour is = 1 – 0.41 = 0.59
So, there is roughly a 60% chance that we see a shooting star in the time span of an hour.
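The same calculation can be checked quickly in Python:

```python
p_15min = 0.2                       # P(at least one star in 15 minutes)
p_none_hour = (1 - p_15min) ** 4    # no star in four consecutive 15-min slots
p_at_least_one = 1 - p_none_hour

print(p_none_hour)       # 0.4096
print(p_at_least_one)    # 0.5904 -> roughly a 60% chance
```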
19. What is deep learning? What is the difference between deep learning and machine learning?
Deep learning is a paradigm of machine learning in which multiple layers of processing are used to extract high-level features from the data. The neural networks are designed in such a way that they try to simulate the behaviour of the human brain.
Deep learning has shown incredible performance in recent years, partly because of this close analogy with the human brain.
The difference between machine learning and deep learning is that deep learning is a paradigm, or a part, of machine learning inspired by the structure and functions of the human brain, implemented through artificial neural networks.
20. What is a Gradient and Gradient Descent?
Gradient: The gradient measures how much the output of a function changes with respect to a small change in its input. In other words, it is a measure of the change in the weights with respect to the change in error. Mathematically, the gradient is the slope of a function.
Gradient Descent: Gradient descent is a minimization algorithm. In machine learning it is usually applied to the cost (loss) function, but it can minimize any differentiable function given to it.
Gradient descent, as the name suggests, means descending, or a decrease in something. The analogy often used for gradient descent is a person climbing down a hill or mountain. The update rule describing gradient descent is:
b = a – γ∇f(a)
So, if a person is climbing down the hill, the next position the climber moves to is denoted by "b" in this equation, and "a" is the current position. The minus sign is there because gradient descent is a minimization algorithm. Gamma (γ) is the learning rate, a weighting factor for the step size, and the remaining term, the gradient ∇f(a), gives the direction of the steepest descent.
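To make the update rule concrete, here is a minimal sketch of gradient descent minimizing a simple one-dimensional cost function f(a) = (a − 3)²; the learning rate (gamma) and the starting point are arbitrary choices for illustration.

```python
def gradient(a):
    # Derivative of the cost function f(a) = (a - 3)**2.
    return 2 * (a - 3)

a = 10.0       # current position
gamma = 0.1    # learning rate (step size)

for _ in range(50):
    b = a - gamma * gradient(a)   # b = a - gamma * grad f(a)
    a = b

print(a)   # converges towards the minimum at a = 3
```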