in DataScience|ML-AI by Expert (113,390 points) | 6,832 views

2 Answers


I am writing down all of the top questions -

Machine Learning Interview Questions

1. WHAT IS DATA NORMALIZATION AND WHY DO WE NEED IT? 

I felt this one would be important to highlight. Data normalization is a very important preprocessing step, used to rescale values to fit in a specific range to ensure better convergence during backpropagation. In general, it boils down to subtracting each feature’s mean from every data point and dividing by that feature’s standard deviation. If we don’t do this, then some of the features (those with high magnitude) will be weighted more in the cost function (if a higher-magnitude feature changes by 1%, then that change is pretty big in absolute terms, but for smaller features it’s quite insignificant). Data normalization puts all features on a comparable scale, so they are weighted equally.
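As a minimal sketch of the mean/standard-deviation rescaling described above (the function name `standardize`, the `eps` guard, and the sample values are my own assumptions for illustration):

```python
import numpy as np

def standardize(X, eps=1e-12):
    """Rescale each column (feature) to zero mean and unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)          # one mean per feature
    std = X.std(axis=0)            # one standard deviation per feature
    return (X - mean) / (std + eps)  # eps avoids division by zero for constant features

# Two features on very different scales (hypothetical values).
X = np.array([[1000.0, 0.1],
              [2000.0, 0.2],
              [3000.0, 0.3]])
X_scaled = standardize(X)
print(X_scaled)
```

In practice the mean and standard deviation are usually computed on the training set only and then reused for validation and test data.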

 

2. EXPLAIN DIMENSIONALITY REDUCTION, WHERE IT’S USED, AND ITS BENEFITS?

Dimensionality reduction is the process of reducing the number of feature variables under consideration by obtaining a set of principal variables which are basically the important features. Importance of a feature depends on how much the feature variable contributes to the information representation of the data and depends on which technique you decide to use. Deciding which technique to use comes down to trial-and-error and preference. It’s common to start with a linear technique and move to non-linear techniques when results suggest inadequate fit. Benefits of dimensionality reduction for a data set may be:

  • Reduce the storage space needed.
  • Speed up computation (for example in machine learning algorithms): fewer dimensions mean less computing, and fewer dimensions can also allow the use of algorithms that are unfit for a large number of dimensions.
  • Remove redundant features; for example, there is no point in storing a terrain’s size in both sq meters and sq miles (maybe data gathering was flawed).
  • Reducing a data set’s dimension to 2D or 3D may allow us to plot and visualize it, observe patterns, and gain insights.
  • Too many features or too complex a model can lead to overfitting.
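To make the redundant-feature case concrete, here is a rough sketch of one linear technique, PCA, done with a plain SVD. The helper name `pca` and the synthetic terrain data are assumptions for illustration, not something from a specific library.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components; also return explained-variance ratios."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)
    return X_centered @ components.T, explained[:n_components]

# Hypothetical data: the same terrain size stored in sq meters and in sq miles.
rng = np.random.default_rng(0)
area_m2 = rng.uniform(50, 500, size=200)
area_sq_miles = area_m2 / 2_589_988.11  # 1 sq mile is about 2,589,988.11 sq meters
X = np.column_stack([area_m2, area_sq_miles])

Z, ratio = pca(X, n_components=1)
print(Z.shape, ratio)
```

Because the second column is just a rescaled copy of the first, a single component should carry essentially all of the variance.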

 

3. HOW DO YOU HANDLE MISSING OR CORRUPTED DATA IN A DATASET? 

You could find missing/corrupted data in a dataset and either drop those rows or columns, or decide to replace them with another value. In Pandas, there are two very useful methods: isnull() and dropna() that will help you find columns of data with missing or corrupted data and drop those values. If you want to fill the invalid values with a placeholder value (for example, 0), you could use the fillna() method.
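A short sketch of those three pandas methods on a made-up DataFrame (the column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["A", "B", None, "A"],
})

# isnull() flags missing cells; summing gives a count per column.
missing_per_column = df.isnull().sum()

# dropna() removes every row that has at least one missing value.
dropped = df.dropna()

# fillna() replaces missing values; here a median for numbers and a placeholder for text.
filled = df.fillna({"age": df["age"].median(), "city": "unknown"})

print(missing_per_column)
print(dropped)
print(filled)
```

Whether to drop or fill depends on how much data you can afford to lose and on whether the placeholder would distort the feature.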

 

4. EXPLAIN THIS CLUSTERING ALGORITHM. 

I wrote a popular article, The 5 Clustering Algorithms Data Scientists Need to Know, explaining all of them in detail with some great visualizations.

 

5. HOW WOULD YOU GO ABOUT DOING AN EXPLORATORY DATA ANALYSIS (EDA)?

The goal of an EDA is to gather some insights from the data before applying your predictive model, i.e. gain some information. Basically, you want to do your EDA in a coarse-to-fine manner. We start by gaining some high-level global insights. Check for imbalanced classes. Look at the mean and variance of each class. Check out the first few rows to see what it’s all about. Run a pandas df.info() to see which features are continuous or categorical, and their type (int, float, string).
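The first coarse pass could look roughly like this. The tiny DataFrame and its column names (`fare`, `sex`, `survived`) are invented for illustration.

```python
import pandas as pd

# Hypothetical passenger-style data with a binary target column.
df = pd.DataFrame({
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05, 8.46],
    "sex": ["male", "female", "female", "female", "male", "male"],
    "survived": [0, 1, 1, 1, 0, 0],
})

print(df.head())   # first few rows
df.info()          # column types and non-null counts

# Class balance of the target.
class_counts = df["survived"].value_counts()

# Mean and variance of a feature within each class.
per_class_stats = df.groupby("survived")["fare"].agg(["mean", "var"])

print(class_counts)
print(per_class_stats)
```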

Next, drop unnecessary columns that won’t be useful in analysis and prediction. These can simply be columns that look useless, ones where many rows have the same value (i.e. it doesn’t give us much information), or ones missing a lot of values. We can also fill in missing values with the most common value in that column, or the median.

Now we can start making some basic visualizations. Start with high-level stuff. Do some bar plots for features that are categorical and have a small number of groups. Bar plots of the final classes. Look at the most “general features.” Create some visualizations about these individual features to try and gain some basic insights. Now we can start to get more specific. Create visualizations between features, two or three at a time. How are features related to each other?

You can also do a PCA to see which features contain the most information. Group some features together as well to see their relationships. For example, what happens to the classes when A = 0 and B = 0? How about A = 1 and B = 0? Compare different features. For example, if feature A can be either “female” or “male” then we can plot feature A against which cabin they stayed in to see if males and females stay in different cabins. Beyond bar, scatter, and other basic plots, we can do a PDF/CDF, overlaid plots, etc. Look at some statistics like distribution, p-value, etc.

Finally it’s time to build the ML model. Start with easier stuff like Naive Bayes and linear regression. If you see that those suck or the data is highly non-linear, go with polynomial regression, decision trees, or SVMs. The features can be selected based on their importance from the EDA. If you have lots of data you can use a neural network. Check the ROC curve, precision, and recall.

 

6. HOW DO YOU KNOW WHICH MACHINE LEARNING MODEL YOU SHOULD USE?

While one should always keep the “no free lunch theorem” in mind, there are some general guidelines. I wrote an article on how to select the proper regression model here. This cheat sheet is also fantastic!

 

7. WHY DO WE USE CONVOLUTIONS FOR IMAGES RATHER THAN JUST FC LAYERS?

This one was pretty interesting since it’s not something companies usually ask. As you would expect, I got this question from a company focused on computer vision. This answer has two parts to it. Firstly, convolutions preserve, encode, and actually use the spatial information from the image. If we used only FC layers we would have no relative spatial information. Secondly, Convolutional Neural Networks (CNNs) have a partially built-in translation invariance, since each convolution kernel acts as its own filter/feature detector.

 

8. WHAT MAKES CNNS TRANSLATION INVARIANT?

As explained above, each convolution kernel acts as its own filter/feature detector. So let’s say you’re doing object detection, it doesn’t matter where in the image the object is since we’re going to apply the convolution in a sliding window fashion across the entire image anyways.
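Here is a toy sketch of that sliding-window behaviour. The same kernel is applied to an image and to a shifted copy of it. Strictly speaking, the response map shifts along with the object while its peak value stays the same. The helper `correlate2d_valid` and the blob images are my own illustration, written without a deep learning framework.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy "detector" for a 2x2 bright blob.
kernel = np.ones((2, 2))

image_a = np.zeros((6, 6))
image_a[1:3, 1:3] = 1.0   # blob near the top-left
image_b = np.zeros((6, 6))
image_b[3:5, 2:4] = 1.0   # same blob shifted down by 2 and right by 1

resp_a = correlate2d_valid(image_a, kernel)
resp_b = correlate2d_valid(image_b, kernel)
peak_a = np.unravel_index(np.argmax(resp_a), resp_a.shape)
peak_b = np.unravel_index(np.argmax(resp_b), resp_b.shape)
print(peak_a, peak_b, resp_a.max(), resp_b.max())
```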

9. WHY DO WE HAVE MAX-POOLING IN CLASSIFICATION CNNS?

Again, as you would expect, this is for a role in computer vision. Max-pooling in a CNN allows you to reduce computation since your feature maps are smaller after the pooling. You don’t lose too much semantic information since you’re taking the maximum activation. There’s also a theory that max-pooling contributes a bit to giving CNNs more translation invariance. Check out this great video from Andrew Ng on the benefits of max-pooling.
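A minimal sketch of 2x2 max-pooling with stride 2, assuming a single-channel feature map with even height and width (the function name and the sample values are made up):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep the maximum of every non-overlapping 2x2 block."""
    h, w = feature_map.shape
    assert h % 2 == 0 and w % 2 == 0, "sketch assumes even height and width"
    # Split rows and columns into (block index, position inside block).
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 1, 5, 6],
                 [2, 2, 7, 8]])
pooled = max_pool_2x2(fmap)
print(pooled)  # [[4 2]
               #  [2 8]]
```

The 4x4 map becomes 2x2, so the next layer has a quarter as many activations to process.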

 

10. WHY DO SEGMENTATION CNNS TYPICALLY HAVE AN ENCODER-DECODER STYLE / STRUCTURE?

The encoder CNN can basically be thought of as a feature extraction network, while the decoder uses that information to predict the image segments by “decoding” the features and upscaling to the original image size.

 

11. WHAT IS THE SIGNIFICANCE OF RESIDUAL NETWORKS?

The main thing that residual connections did was allow for direct feature access from previous layers. This makes information propagation throughout the network much easier. One very interesting paper about this shows how using local skip connections gives the network a type of ensemble multi-path structure, giving features multiple paths to propagate throughout the network.
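A bare-bones sketch of a skip connection, assuming a two-layer transformation F(x) with hypothetical weight matrices `w1` and `w2`. It is not the exact block from any particular ResNet variant.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    # F(x): a small two-layer transformation (hypothetical shapes and weights).
    fx = w2 @ relu(w1 @ x)
    # Skip connection: the input is added back unchanged.
    return x + fx

x = np.array([1.0, -2.0, 3.0])
w_zero = np.zeros((3, 3))

# With all-zero weights F(x) is zero, so the block passes x through untouched.
out = residual_block(x, w_zero, w_zero)
print(out)
```

Even when the learned part contributes nothing, the earlier features still reach the next layer directly.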

 

12. WHAT IS BATCH NORMALIZATION AND WHY DOES IT WORK?

Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. The idea is then to normalize the inputs of each layer in such a way that they have a mean output activation of zero and standard deviation of one. This is done for each individual mini-batch at each layer i.e compute the mean and variance of that mini-batch alone, then normalize. This is analogous to how the inputs to networks are standardized. How does this help? We know that normalizing the inputs to a network helps it learn. But a network is just a series of layers, where the output of one layer becomes the input to the next. That means we can think of any layer in a neural network as the first layer of a smaller subsequent network. Thought of as a series of neural networks feeding into each other, we normalize the output of one layer before applying the activation function, and then feed it into the following layer (sub-network).
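A rough sketch of the training-time forward pass for one mini-batch. The learnable scale `gamma` and shift `beta` and the small `eps` are standard parts of batch normalization that the paragraph above does not mention. The function name and input shapes are my own assumptions.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)                     # per-feature mean of this mini-batch
    var = x.var(axis=0)                       # per-feature variance of this mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, roughly unit variance
    return gamma * x_hat + beta

# Hypothetical pre-activations: 64 examples, 4 units, far from zero mean / unit variance.
rng = np.random.default_rng(0)
pre_activations = rng.normal(loc=5.0, scale=3.0, size=(64, 4))

out = batch_norm_forward(pre_activations, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))
```

With `gamma` set to ones and `beta` to zeros the output is just the normalized activations. During training the network is free to learn other values.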

 

13. HOW WOULD YOU HANDLE AN IMBALANCED DATASET?

I have an article about this! Check out #3 :)

 

14. WHY WOULD YOU USE MANY SMALL CONVOLUTIONAL KERNELS SUCH AS 3X3 RATHER THAN A FEW LARGE ONES?

This is very well explained in the VGGNet paper. There are two reasons. First, you can use several smaller kernels rather than a few large ones to get the same receptive field and capture more spatial context, but with the smaller kernels you are using fewer parameters and computations. Secondly, because a stack of smaller kernels means more layers, you’ll be able to use more activation functions and thus have a more discriminative mapping function being learned by your CNN.
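The arithmetic behind the first reason can be checked in a few lines. This assumes stride 1, no dilation, the same number of input and output channels `C` in every layer, and it ignores biases. The helper name is made up.

```python
def stacked_conv_stats(kernel_size, num_layers, channels):
    """Receptive field and weight count for a stack of identical square conv layers."""
    # Each extra stride-1 layer widens the receptive field by (kernel_size - 1).
    receptive_field = 1 + num_layers * (kernel_size - 1)
    # Each layer has kernel_size^2 weights per (input channel, output channel) pair.
    weights = num_layers * kernel_size * kernel_size * channels * channels
    return receptive_field, weights

C = 64
rf_small, w_small = stacked_conv_stats(kernel_size=3, num_layers=3, channels=C)
rf_large, w_large = stacked_conv_stats(kernel_size=7, num_layers=1, channels=C)

print(rf_small, w_small)  # 7 110592
print(rf_large, w_large)  # 7 200704
```

Three 3x3 layers cover the same 7x7 region as one 7x7 layer with 27C² weights instead of 49C², and they apply three non-linearities instead of one.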

 

15. DO YOU HAVE ANY OTHER PROJECTS THAT WOULD BE RELATED HERE?

Here you’ll really draw connections between your research and their business. Is there anything you did or any skills you learned that could possibly connect back to their business or the role you are applying for? It doesn’t have to be 100 percent exact, just somehow related such that you can show that you will be able to directly add lots of value.

Hope this will help everyone.

by (140 points)
500 Most asked Machine-Learning/Data-Science Interview questions

Statistics:

1. What is the Central Limit Theorem and why is it important?

“Suppose that we are interested in estimating the average height among all people. Collecting data for every person in the world is impossible. While we can’t obtain a height measurement from everyone in the population, we can still sample some people. The question now becomes, what can we say about the average height of the entire population given a single sample. The Central Limit Theorem addresses this question exactly.” Read more here.
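A small simulation can make this tangible. Here I assume a skewed population (exponential with mean 1 and standard deviation 1) instead of heights, and draw many samples of size 100. The sample means should cluster around 1 with a spread close to 1/sqrt(100) = 0.1.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_means(num_samples, sample_size):
    """Mean of each of num_samples independent samples from an exponential population."""
    return [
        statistics.fmean(random.expovariate(1.0) for _ in range(sample_size))
        for _ in range(num_samples)
    ]

means = sample_means(num_samples=2000, sample_size=100)
center = statistics.fmean(means)   # should be near the population mean, 1.0
spread = statistics.stdev(means)   # should be near sigma / sqrt(n) = 0.1

print(center, spread)
```

Plotting a histogram of `means` would show the familiar bell shape even though the population itself is far from normal.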

2. What is sampling? How many sampling methods do you know?

 

“Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative subset of data points to identify patterns and trends in the larger data set being examined.” Read the full answer here.

3. What is the difference between type I vs type II error?

 

“A type I error occurs when the null hypothesis is true, but is rejected. A type II error occurs when the null hypothesis is false, but erroneously fails to be rejected.” Read the full answer here.

4. What is linear regression? What do the terms p-value, coefficient, and r-squared value mean? What is the significance of each of these components?

 

A linear regression is a good tool for quick predictive analysis: for example, the price of a house depends on a myriad of factors, such as its size or its location. In order to see the relationship between these variables, we need to build a linear regression, which predicts the line of best fit between them and can help conclude whether or not these two factors have a positive or negative relationship. Read more here and here.
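As a rough illustration of two of the terms in the question, here is a least-squares line fitted to made-up house sizes and prices. The coefficient (slope) is the change in predicted price per unit of size, and r-squared is the fraction of the price variance the line explains. The p-value is left out of this sketch.

```python
import numpy as np

# Hypothetical data: house size in sq meters and price in thousands.
size_sqm = np.array([50.0, 70.0, 90.0, 110.0, 130.0])
price = np.array([150.0, 200.0, 260.0, 310.0, 360.0])

# Degree-1 polynomial fit = ordinary least-squares line.
slope, intercept = np.polyfit(size_sqm, price, deg=1)
predicted = slope * size_sqm + intercept

# r-squared = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((price - predicted) ** 2)
ss_tot = np.sum((price - price.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(slope, intercept, r_squared)
```

A positive slope here means larger houses are predicted to cost more, and an r-squared close to 1 means the line fits these points well.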

5. What are the assumptions required for linear regression?

 

There are four major assumptions: 1. There is a linear relationship between the dependent variables and the regressors, meaning the model you are creating actually fits the data, 2. The errors or residuals of the data are normally distributed and independent from each other, 3. There is minimal multicollinearity between explanatory variables, and 4. Homoscedasticity. This means the variance around the regression line is the same for all values of the predictor variable.

6. What is a statistical interaction?

 

“Basically, an interaction is when the effect of one factor (input variable) on the dependent variable (output variable) differs among levels of another factor.” Read more here.

7. What is selection bias?

 

“Selection (or ‘sampling’) bias occurs in an ‘active,’ sense when the sample data that is gathered and prepared for modeling has characteristics that are not representative of the true, future population of cases the model will see. That is, active selection bias occurs when a subset of the data are systematically (i.e., non-randomly) excluded from analysis.” Read more here.

8. What is an example of a data set with a non-Gaussian distribution?

“The Gaussian distribution is part of the Exponential family of distributions, but there are a lot more of them, with the same sort of ease of use, in many cases, and if the person doing the machine learning has a solid grounding in statistics, they can be utilized where appropriate.” Read more here.

9. What is the Binomial Probability Formula?

“The binomial distribution consists of the probabilities of each of the possible numbers of successes on N trials for independent events that each have a probability of π (the Greek letter pi) of occurring.” 
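The quote describes the distribution without writing the formula out. In the same notation, the probability of exactly x successes in N trials is C(N, x) · π^x · (1 − π)^(N − x). A small sketch (the function name is my own):

```python
from math import comb

def binomial_pmf(x, n, pi):
    """Probability of exactly x successes in n independent trials with success probability pi."""
    return comb(n, x) * pi ** x * (1 - pi) ** (n - x)

# Probability of exactly 2 heads in 3 fair coin flips.
p_two_heads = binomial_pmf(2, 3, 0.5)
print(p_two_heads)  # 0.375
```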

Data Science : 

Q1. What is Data Science? List the differences between supervised and unsupervised learning. 

Data Science is a blend of various tools, algorithms, and machine learning principles with the goal to discover hidden patterns from the raw data. How is this different from what statisticians have been doing for years? The answer lies in the difference between explaining and predicting.


 

For further reading, visit: https://drive.google.com/file/d/1nSHicZ81uEHuEwGerQ8nspVMj6e0BFRl/view?usp=sharing
