Significance of Random Variables
In the context of Data Mining and Analysis
Random variables play a very important role in machine learning. Every feature in your dataset can be treated as a random variable. There are different types of random variables, and we will discuss them below.
So what is a random variable?
Whenever we study an object, we observe that several features represent that object. These features are called variables. A variable is a placeholder that can store any value, be it an integer, a float, a character or a string. Now consider working with a large number of objects. We would find that a single feature can have different values for different objects. For example, say we are analysing the BMI (Body Mass Index) of all male individuals below the age of 20 living in a particular area. On measuring the heights of these individuals, we find that the values of the height feature differ from person to person.
This suggests that the values of the variable are subject to some random variation. If this randomness due to chance is significant, the variable is referred to as a random variable. As a result, a random variable can take on a variety of different values, each with an associated probability, which can be estimated by the relative frequency with which that value occurs. Hence,
A random variable is a variable whose possible values are the outcomes of a random phenomenon.
Types of Random Variables
From the above definition, we can see that random variables come in different types, depending on the attribute of the object being studied.
Quantitative Random Variables- These random variables take numerical values. There are two types of quantitative random variables:
- Discrete Random Variables- These can take only a countable number of distinct values. For example, the number of rooms in your house or the score you earned in your last exam.
- Continuous Random Variables- These can take any value within an interval, so the number of possible values is uncountably infinite. They usually arise from measurements, for example your weight and height. You might think that a feature like weight can only take a countable number of distinct values like 68 kg or 70 kg, but with a more accurate weighing balance a value of 68.1 kg could be measured, and with an even more accurate one a value of 68.13 kg could be measured. In other words, as the accuracy of measurement increases, the number of possible outcomes grows without bound. Hence, for a continuous random variable, probabilities are defined over intervals of values rather than at specific values.
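To make the idea of probabilities over intervals concrete, here is a minimal sketch in Python. The weights are made-up illustrative values, not taken from any dataset, and `interval_probability` is a hypothetical helper name. It estimates the probability that a weight falls inside an interval as the fraction of observations lying in that interval.

```python
# Illustrative, made-up weights in kg (an assumption, not real data).
weights_kg = [68.13, 68.1, 70.0, 68.47, 69.95, 71.2, 68.92, 67.8]


def interval_probability(values, low, high):
    """Estimate P(low <= X < high) as the fraction of values in [low, high)."""
    inside = [v for v in values if low <= v < high]
    return len(inside) / len(values)


# Fraction of observed weights from 68 kg (inclusive) to 69 kg (exclusive).
p_68_to_69 = interval_probability(weights_kg, 68.0, 69.0)
print(p_68_to_69)
```

Asking for the probability of one exact value, say exactly 68 kg, is not very informative for a continuous variable, which is why the question is phrased in terms of an interval.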
Qualitative Random Variables- These random variables take non-numerical values. There are two types of qualitative random variables:
- Nominal Random Variables- These can have two or more categories without any intrinsic ordering among them. For example, eye colour can have the values red, green, blue etc.
- Ordinal Random Variables- These are similar to nominal random variables, except that the categories have an ordering or ranking among them. For example, officer ranks can have the values Major, Captain, Lieutenant or Officer Cadet, where there is a specific ordering among them.
Most of the time during data analysis, the values of qualitative random variables are mapped to numerical values for easier data processing (for example, using scikit-learn's LabelEncoder or OneHotEncoder). This is also called “Categorical Data handling” and is an important task in data pre-processing.
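As a rough illustration of what such a mapping does, here is a small hand-rolled sketch in plain Python. It does not use the scikit-learn encoders; the sample values, the assumed rank order and the helper name `one_hot` are chosen only for this example. The ordinal variable is mapped to integers that preserve the ranking, while the nominal variable is one-hot encoded so that no artificial ordering is introduced.

```python
# Hypothetical category values, used only for illustration.
eye_colours = ["red", "green", "blue", "green"]
ranks = ["Captain", "Officer Cadet", "Major", "Lieutenant"]

# Ordinal variable: map each category to an integer that preserves the ranking
# (lowest rank first; this order is an assumption made for the example).
rank_order = ["Officer Cadet", "Lieutenant", "Captain", "Major"]
rank_to_int = {rank: i for i, rank in enumerate(rank_order)}
encoded_ranks = [rank_to_int[r] for r in ranks]


def one_hot(values):
    """One-hot encode values; columns follow the sorted unique categories."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Nominal variable: one column per category, no ordering implied.
categories, encoded_colours = one_hot(eye_colours)

print(encoded_ranks)
print(categories)
print(encoded_colours)
```

Integer codes suit ordinal variables because the numbers carry the ranking. For nominal variables, one-hot columns avoid suggesting that, say, green is "greater than" blue.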
Now that we are clear on the definition and types of random variables, let's go a little deeper. We know that a random variable can take on different values, each with some associated probability. If we plot these values against their associated probabilities, we get a probability distribution graph. A probability distribution describes how a random variable is distributed. It shows us which values the random variable is most likely to take and which are less probable.
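The idea can be sketched in a few lines of Python. The heights below are made-up illustrative values (not data from any real dataset), and the variable names are assumptions for this example. Each value's relative frequency is used as an estimate of its probability, and a simple text bar stands in for the plot.

```python
from collections import Counter

# Illustrative, made-up heights in inches (an assumption, not real data).
heights_in = [68, 66, 68, 70, 68, 62, 66, 75, 68, 70]

counts = Counter(heights_in)
n = len(heights_in)

# The relative frequency of each value serves as its estimated probability.
distribution = {value: count / n for value, count in sorted(counts.items())}

# Print a crude text histogram: one '#' per observation of that value.
for value, prob in distribution.items():
    print(f"{value} in: {prob:.2f} " + "#" * counts[value])

# The value with the highest estimated probability.
most_likely = max(distribution, key=distribution.get)
print(most_likely)
```

Because the probabilities are relative frequencies, they add up to 1 across all observed values.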
Let’s see this in practice on a real dataset. We’ll be working on the BMI dataset taken from Kaggle. The features Sex, Age, Height, Weight and BMI are all random variables. Sex is a nominal random variable and Age is a discrete random variable. Although Height and Weight are continuous, they are treated as discrete random variables here.
From the distribution of the Height feature, we observe that the most frequent height is 68 inches, while very few individuals have a height of 62 inches or 75 inches.
Importance of Random Variables
Random variables help in determining the probability of an outcome. They have many real-life applications, especially in data analysis and decision making. Consider an example where an insurance company offers three different types of health coverage to its customers: basic, premium and exclusive. The company recommends an insurance plan depending upon various customer features like age, marital status, salary, BMI, etc. Now assume a young engineer wants to buy insurance. Which plan should be offered to him? After analysing all the customer features, the company comes up with the following probabilities:
- Basic: 0.62
- Premium: 0.20
- Exclusive: 0.18
The sum of all probabilities is 1. Since Basic has the highest probability, the data suggests that the engineer should opt for the basic insurance plan.
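The same reasoning can be written as a short Python sketch. The probabilities are the ones listed above; the variable names are assumptions made for this example. It checks that the probabilities form a valid distribution and then picks the plan with the highest probability.

```python
import math

# Plan probabilities as given in the example above.
plan_probabilities = {"Basic": 0.62, "Premium": 0.20, "Exclusive": 0.18}

# A valid probability distribution sums to 1 (up to floating-point error).
total = sum(plan_probabilities.values())
is_valid = math.isclose(total, 1.0)

# Recommend the plan with the highest probability.
recommended_plan = max(plan_probabilities, key=plan_probabilities.get)

print(is_valid)
print(recommended_plan)
```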
Random variables are also important for generalizing the behaviour of a population. They help us understand the dataset better and thereby select the machine learning model that best fits our analysis. Neural networks that make decisions typically output probabilities over the possible outcomes. Generative Adversarial Networks likewise try to learn the probability distribution of the input dataset so that they can generate new samples resembling it.