Benefits of Enrolling in a Software Testing Course in Pune
If you’re looking to learn software testing in Pune, you’re in luck! Pune is a…
An interdisciplinary field that constitutes various scientific processes, algorithms, tools, and machine learning techniques working to help find common patterns and gather sensible insights from the given raw input data using statistical and mathematical analysis is called Data Science.
The following figure represents the life cycle of data science.
Data analysis can not be done on a whole volume of data at a time especially when it involves larger datasets. It becomes crucial to take some data samples that can be used for representing the whole population and then perform analysis on it. While doing this, it is very much necessary to carefully take sample data out of the huge data that truly represents the entire dataset.
There are majorly two categories of sampling techniques based on the usage of statistics, they are:
Differentiate between the long and wide format data.
Long format Data | Wide-Format Data |
Here, each row of the data represents the one-time information of a subject. Each subject would have its data in different/ multiple rows. | Here, the repeated responses of a subject are part of separate columns. |
The data can be recognized by considering rows as groups. | The data can be recognized by considering columns as groups. |
This data format is most commonly used in R analyses and to write into log files after each trial. | This data format is rarely used in R analyses and most commonly used in stats packages for repeated measures ANOVAs. |
A p-value is the measure of the probability of having results equal to or more than the results achieved under a specific hypothesis assuming that the null hypothesis is correct. This represents the probability that the observed difference occurred randomly by chance.
Resampling is a methodology used to sample data for improving accuracy and quantify the uncertainty of population parameters. It is done to ensure the model is good enough by training the model on different patterns of a dataset to ensure variations are handled. It is also done in the cases where models need to be validated using random subsets or when substituting labels on data points while performing tests.
Data is said to be highly imbalanced if it is distributed unequally across different categories. These datasets result in an error in model performance and result in inaccuracy.
There are not many differences between these two, but it is to be noted that these are used in different contexts. The mean value generally refers to the probability distribution whereas the expected value is referred to in the contexts involving random variables.
This bias refers to the logical error while focusing on aspects that survived some process and overlooking those that did not work due to lack of prominence. This bias can lead to deriving wrong conclusions.
Confounding variables are also known as confounders. These variables are a type of extraneous variables that influence both independent and dependent variables causing spurious association and mathematical relationships between those variables that are associated but are not casually related to each other.
The selection bias occurs in the case when the researcher has to make a decision on which participant to study. The selection bias is associated with those researches when the participant selection is not random. The selection bias is also called the selection effect. The selection bias is caused by as a result of the method of sample collection.
Four types of selection bias are explained below:
Let us first understand the meaning of bias and variance in detail:
Bias: It is a kind of error in a machine learning model when an ML Algorithm is oversimplified. When a model is trained, at that time it makes simplified assumptions so that it can easily understand the target function. Some algorithms that have low bias are Decision Trees, SVM, etc. On the other hand, logistic and linear regression algorithms are the ones with a high bias.
Variance: Variance is also a kind of error. It is introduced into an ML Model when an ML algorithm is made highly complex. This model also learns noise from the data set that is meant for training. It further performs badly on the test data set. This may lead to over lifting as well as high sensitivity.
When the complexity of a model is increased, a reduction in the error is seen. This is caused by the lower bias in the model. But, this does not happen always till we reach a particular point called the optimal point. After this point, if we keep on increasing the complexity of the model, it will be over lifted and will suffer from the problem of high variance.
Trade-off Of Bias And Variance: So, as we know that bias and variance, both are errors in machine learning models, it is very essential that any machine learning model has low variance as well as a low bias so that it can achieve good performance.
Let us see some examples. The K-Nearest Neighbor Algorithm is a good example of an algorithm with low bias and high variance. This trade-off can easily be reversed by increasing the k value which in turn results in increasing the number of neighbours. This, in turn, results in increasing the bias and reducing the variance.
Another example can be the algorithm of a support vector machine. This algorithm also has a high variance and obviously, a low bias and we can reverse the trade-off by increasing the value of parameter C. Thus, increasing the C parameter increases the bias and decreases the variance.
So, the trade-off is simple. If we increase the bias, the variance will decrease and vice versa.
It is a matrix that has 2 rows and 2 columns. It has 4 outputs that a binary classifier provides to it. It is used to derive various measures like specificity, error rate, accuracy, precision, sensitivity, and recall.
The test data set should contain the correct and predicted labels. The labels depend upon the performance. For instance, the predicted labels are the same if the binary classifier performs perfectly. Also, they match the part of observed labels in real-world scenarios. The four outcomes shown above in the confusion matrix mean the following:
The formulas for calculating basic measures that comes from the confusion matrix are:
In these formulas:
FP = false positive
FN = false negative
TP = true positive
RN = true negative
Also,
Sensitivity is the measure of the True Positive Rate. It is also called recall.
Specificity is the measure of the true negative rate.
Precision is the measure of a positive predicted value.
F-score is the harmonic mean of precision and recall.
Logistic Regression is also known as the logit model. It is a technique to predict the binary outcome from a linear combination of variables (called the predictor variables).
For example, let us say that we want to predict the outcome of elections for a particular political leader. So, we want to find out whether this leader is going to win the election or not. So, the result is binary i.e. win (1) or loss (0). However, the input is a combination of linear variables like the money spent on advertising, the past work done by the leader and the party, etc.
Linear regression is a technique in which the score of a variable Y is predicted using the score of a predictor variable X. Y is called the criterion variable. Some of the drawbacks of Linear Regression are as follows:
Classification is very important in machine learning. It is very important to know to which class does an observation belongs. Hence, we have various classification algorithms in machine learning like logistic regression, support vector machine, decision trees, Naive Bayes classifier, etc. One such classification technique that is near the top of the classification hierarchy is the random forest classifier.
So, firstly we need to understand a decision tree before we can understand the random forest classifier and its works. So, let us say that we have a string as given below:
So, we have the string with 5 ones and 4 zeroes and we want to classify the characters of this string using their features. These features are colour (red or green in this case) and whether the observation (i.e. character) is underlined or not. Now, let us say that we are only interested in red and underlined observations. So, the decision tree would look something like this:
So, we started with the colour first as we are only interested in the red observations and we separated the red and the green-coloured characters. After that, the “No” branch i.e. the branch that had all the green coloured characters was not expanded further as we want only red-underlined characters. So, we expanded the “Yes” branch and we again got a “Yes” and a “No” branch based on the fact whether the characters were underlined or not.
So, this is how we draw a typical decision tree. However, the data in real life is not this clean but this was just to give an idea about the working of the decision trees. Let us now move to the random forest.
Random Forest
It consists of a large number of decision trees that operate as an ensemble. Basically, each tree in the forest gives a class prediction and the one with the maximum number of votes becomes the prediction of our model. For instance, in the example shown below, 4 decision trees predict 1, and 2 predict 0. Hence, prediction 1 will be considered.
The underlying principle of a random forest is that several weak learners combine to form a keen learner. The steps to build a random forest are as follows:
Let us say that Prob is the probability that we may see a minimum of one shooting star in 15 minutes.
So, Prob = 0.2
Now, the probability that we may not see any shooting star in the time duration of 15 minutes is = 1 – Prob
1-0.2 = 0.8
The probability that we may not see any shooting star for an hour is:
= (1-Prob)(1-Prob)(1-Prob)*(1-Prob)
= 0.8 * 0.8 * 0.8 * 0.8 = (0.8)⁴
≈ 0.40
So, the probability that we will see one shooting star in the time interval of an hour is = 1-0.4 = 0.6
So, there are approximately 60% chances that we may see a shooting star in the time span of an hour.
Deep learning is a paradigm of machine learning. In deep learning, multiple layers of processing are involved in order to extract high features from the data. The neural networks are designed in such a way that they try to simulate the human brain.
Deep learning has shown incredible performance in recent years because of the fact that it shows great analogy with the human brain.
The difference between machine learning and deep learning is that deep learning is a paradigm or a part of machine learning that is inspired by the structure and functions of the human brain called the artificial neural networks. Learn More.
Gradient: Gradient is the measure of a property that how much the output has changed with respect to a little change in the input. In other words, we can say that it is a measure of change in the weights with respect to the change in error. The gradient can be mathematically represented as the slope of a function.
Gradient Descent: Gradient descent is a minimization algorithm that minimizes the Activation function. Well, it can minimize any function given to it but it is usually provided with the activation function only.
Gradient descent, as the name suggests means descent or a decrease in something. The analogy of gradient descent is often taken as a person climbing down a hill/mountain. The following is the equation describing what gradient descent means:
So, if a person is climbing down the hill, the next position that the climber has to come to is denoted by “b” in this equation. Then, there is a minus sign because it denotes the minimization (as gradient descent is a minimization algorithm). The Gamma is called a waiting factor and the remaining term which is the Gradient term itself shows the direction of the steepest descent.
If you’re looking to learn software testing in Pune, you’re in luck! Pune is a…
In the dynamic world of web development, there are several technology stacks available to build…
The world is witnessing an unprecedented surge in data generation, and the demand for data-driven…