The evolving role of the data analyst means that analysts now need to understand certain aspects of machine learning, especially training models. Data analysts need to be comfortable not just with analyzing data but also with preparing it for machine learning models. While some analysts may be familiar with the basic steps, a deeper understanding of why we perform certain data preparation tasks, such as standardizing or splitting data, is crucial for building robust and effective models. In this article, I want to tackle five of the most common questions data analysts have about machine learning. Let's start.
1. Why do we split data into training and test sets?
The purpose of training a machine learning model is to make predictions on new data, so it is important to assess the model's performance on data that it has not yet seen. We split the data into training and test sets in order to evaluate how well our model can generalize to new, unseen data. This is similar to what happens in school. Students are given materials to study to prepare for the test. To test how well the students have understood the material, they are tested on questions that they have not seen before.
By using a portion of the data to train the model and another portion to test it, we can get an estimate of how well the model will perform on new, unseen data. The training data is used to fit the parameters of the model, while the test data is used to evaluate the performance of the model on data it has not yet seen. If we did not split the data and instead used all of it to train the model, we would have no way of telling whether the model had overfit to the training data, and it could perform poorly on new data without our noticing. By splitting the data, we can check that our model is not only memorizing the training data but is also generalizing well to new data.
To split data, you can use the train_test_split function from scikit-learn (sklearn). Here is how it is imported and used:
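A minimal sketch, assuming a small made-up dataset (the arrays `X` and `y` and the seed value 42 are illustrative and not taken from a real project):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each, plus a binary target.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Hold out 20% of the rows for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```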
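Here, the test size is 0.2, which means that 20% of the data is held out for testing and the remaining 80% is used for training. The random_state parameter seeds the random number generator that shuffles the data before splitting. This makes the split reproducible: running the code again with the same value produces the same training and test sets.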
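To see why the held-out set matters, here is a rough sketch in which the labels are pure random noise, so there is nothing real to learn. The data, the seed values, and the choice of a decision tree are all assumptions made for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: random features and random 0/1 labels (no real signal).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = rng.integers(0, 2, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# An unrestricted decision tree can memorize every training row.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

The training accuracy comes out perfect because the tree has memorized the training rows, while the test accuracy should land close to chance (around 0.5). Without the test set, the perfect training score would look like a great model.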
2. Why is it important to standardize data for machine learning models before fitting?
Standardizing data means rescaling each feature so that it has a mean of zero and a standard deviation of one. Why is it important? Well, in machine learning, different features (variables) in the dataset may have different units or scales. For example, one feature might represent income in dollars (which could be in the thousands), while another feature might represent age in years (which would be much smaller numbers). Without standardization, features with larger scales (in this case, income in dollars) could dominate the model's calculations, leading to biased or inaccurate predictions. This matters most for models that rely on distances or gradient-based optimization, such as k-nearest neighbors or regularized linear models.
A good example would be organizing a race involving different cars, where each car has its speed measured in different units (one in kilometers per hour and another in miles per hour). It would be difficult to compare the speeds without standardizing the units. For instance, the car measured in kilometers per hour might seem the fastest simply because the same speed is a larger number in kilometers per hour than in miles per hour. So, to get a fair comparison, the units must be standardized.
To standardize the data, you can use StandardScaler from scikit-learn (sklearn). Let's look at an example. Let's say we have data that is a mixture of heights and weights. Here is how the data would be without standardization:
If we were to pass this data to a machine learning model, the model could be biased towards whichever feature has the larger numeric spread during training, neglecting potentially valuable information in the other feature. Standardization transforms the data to a common scale, so that no feature dominates simply because of its units. Here is how the data will look after standardization:
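A minimal sketch of the transformation, assuming made-up height (cm) and weight (kg) values, since any small numeric table would behave the same way:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 is height in cm, column 1 is weight in kg.
data = np.array([
    [150.0, 50.0],
    [160.0, 65.0],
    [170.0, 70.0],
    [180.0, 85.0],
    [190.0, 95.0],
])

scaler = StandardScaler()
scaled = scaler.fit_transform(data)  # learns each column's mean and std, then rescales

print(scaled.round(2))
# Each column now has (approximately) mean 0 and standard deviation 1.
print(scaled.mean(axis=0).round(2), scaled.std(axis=0).round(2))
```

When this is combined with a train/test split, the scaler is usually fit on the training set only (`fit_transform`) and then applied to the test set with `transform`, so that no information from the test data leaks into training.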
You can see that the data is now standardized.
3. What is continuous numerical data in Machine Learning? Give an example.
Continuous numerical data is numerical data that can take on any value within a range, including decimal values. In other words, a continuous variable can take on an infinite number of values between any two points. This is in contrast to discrete data, such as the number of children in a household, which can only take specific, countable values. A good example of continuous data would be the height of a person. It can be 170 cm, 170.5 cm, 120.1298909090928 cm, and so on. There are infinitely many possible values between any two heights within the measurable range.
4. If a target column in the dataset has 0s and 1s, is this a classification or regression problem?
It is a classification problem, more specifically a binary classification problem. The target column is often the last column in the data, and it defines the problem that you're trying to solve. If the target variable holds categories (e.g., spam/not spam email), it's a classification problem. When there are only two categories, they are typically labeled 0 and 1, and the model aims to predict one of the two classes. Conversely, if the target column is continuous (e.g., house price), it's a regression problem. In this case, the model learns to predict a continuous value for new data points.
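As a quick sanity check, scikit-learn has a helper, `type_of_target`, that inspects a target column and reports what kind of problem it looks like. The sketch below uses made-up values (`spam_labels` and `house_prices` are hypothetical):

```python
from sklearn.utils.multiclass import type_of_target

# Hypothetical targets: 0/1 spam labels and house prices in dollars.
spam_labels = [0, 1, 1, 0, 1]
house_prices = [250000.50, 310000.75, 199999.99]

print(type_of_target(spam_labels))   # binary -> classification
print(type_of_target(house_prices))  # continuous -> regression
```

Treat this as a heuristic rather than a verdict: a continuous target that happens to contain only whole numbers would be reported as multiclass, so knowing what the column actually represents still matters.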
5. As a data analyst, should I expect to train models?
As a data analyst, whether you are expected to train models depends largely on the specific role and the organization you work for. I know data analysts who are not involved in training models. Their responsibilities revolve around data cleaning and preparation, understanding the data, finding patterns, and generating insights (Exploratory Data Analysis). These organizations have a clear separation between data analysts and data scientists. However, in other organizations, especially smaller ones or those with overlapping roles, data analysts might be expected to train basic models.
Conclusion
These are some of the common questions that are asked by data analysts. Whether you want to transition into a machine learning engineer role or stay in your current one, grasping these fundamental concepts is essential. Remember, the key is to continuously learn, experiment, and refine your skills as you navigate the ever-evolving world of data and machine learning. You can check out the book "50 Days of Data Analysis with Python," which will give you hands-on experience to grasp these and other aspects of data analysis. Thanks for reading.