{"id":7050,"date":"2023-08-28T15:00:06","date_gmt":"2023-08-28T20:00:06","guid":{"rendered":"https:\/\/www.sharpsightlabs.com\/?p=7050"},"modified":"2024-02-06T15:02:42","modified_gmt":"2024-02-06T21:02:42","slug":"sklearn-make_classification","status":"publish","type":"post","link":"https:\/\/www.sharpsightlabs.com\/blog\/sklearn-make_classification\/","title":{"rendered":"Sklearn make_classification, Explained"},"content":{"rendered":"
With the rise of AI, machine learning has suddenly become very popular.
Machine learning has been around for decades, but machine learning systems are becoming increasingly important in a range of fields, from healthcare to finance to marketing.
Python, with a range of libraries for data science and ML, has arguably become the top language for machine learning. And the most popular machine learning library in Python is scikit-learn (often referred to as sklearn).
In this post, we're going to take a close look at one particular function from scikit-learn: make_classification.
This function generates synthetic datasets for classification problems, which makes it very useful for practicing machine learning and for evaluating machine learning algorithms.
We'll look at what the make_classification function does and how the syntax is structured, and I'll also show you a simple example.
The blog post is divided into sections, and if you need anything specific, just click on one of the following links.
**Table of Contents:**

- A Quick Introduction to Sklearn make_classification
- The Syntax of Sklearn make_classification
- Examples of How to Use Make Classification

That said, let's dive into the sklearn make_classification function.
## A Quick Introduction to Sklearn make_classification

The sklearn make_classification function allows Python users to create datasets that they can use for classification models.

It allows you to make data with binary labels and multiclass labels.

For example, here is a plot of a binary dataset that I made with make_classification:

[image: scatterplot of a binary classification dataset generated with make_classification]

(I'll show you how to create this exact dataset later.)

And importantly, it provides functionality that allows you to specify things like:

- the number of samples
- the number of features
- the number of classes
- the amount of separation between the classes
- the number of clusters per class
Now that we've seen a brief overview of its capabilities, let's delve deeper into the syntax of make_classification to understand how we can use it properly.
## The Syntax of Sklearn make_classification

Here, I'm going to explain the syntax of the Scikit Learn make_classification function.

I'll explain the high-level syntax, but also some of the details about the most important parameters.
#### A quick note

Everything I'm about to explain assumes that you have Scikit Learn installed on your machine, and that you've imported make_classification as follows:
```python
from sklearn.datasets import make_classification
```

With that said, let's look at the syntax.
### make_classification syntax

The basic syntax is very, very simple.
Assuming that you've imported the function as described above, you can call the function by typing `make_classification()`.
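For instance, here's a minimal sketch of what a bare call returns (the shapes simply reflect the defaults discussed below):

```python
from sklearn.datasets import make_classification

# Call the function with all defaults; it returns two Numpy arrays.
X, y = make_classification()

print(X.shape)  # (100, 20): the default n_samples and n_features
print(y.shape)  # (100,): one label per sample
```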
There are a few important parameters as well that you can specify inside the parentheses:

[image: annotated syntax of make_classification, showing its main parameters]

In some sense, the parameters are the most important part of the function, because they determine the exact structure and content of the output dataset.

That being the case, let's quickly discuss the important parameters.

### The Parameters of make classification

The Scikit Learn make classification function has quite a few parameters, but I believe that the most important are:
- `n_samples`
- `n_features`
- `n_classes`
- `n_informative`
- `n_redundant`
- `class_sep`
- `n_clusters_per_class`
- `random_state`

Let's look at each of these, one at a time.
###### `n_samples`

The `n_samples` parameter controls the number of samples in the output dataset.

Said differently, it controls the number of examples (or the number of rows of data, if you're thinking of a simple row-and-column dataset).

By default, this is set to 100.
###### `n_features`

The `n_features` parameter controls the number of features in the output dataset.

Remember, the features are like the inputs to a machine learning model. They are the columns that a machine learning algorithm learns from in order to make a prediction. Labels/targets, in turn, are like the outputs of a model (to learn a bit more about features and labels, read our blog post on Supervised vs Unsupervised machine learning).

This count includes the informative features, redundant features, and repeated features (if you use them when you create your dataset).

By default, this parameter is set to 20.
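As a quick sketch of both `n_samples` and `n_features` together (the specific values below are just for illustration):

```python
from sklearn.datasets import make_classification

# n_samples sets the number of rows; n_features sets the number of columns.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

print(X.shape)  # (500, 8)
print(y.shape)  # (500,)
```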
###### `n_classes`

The `n_classes` parameter controls the number of classes in the output dataset.

As mentioned above, the classes are the different possible categories for the target variable (remember that in supervised learning the dataset has a target/label variable that we're trying to predict).

By default, `n_classes = 2`, so make_classification will produce a binary dataset.
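For instance, here's a small sketch of a 3-class dataset. One detail to be aware of: scikit-learn requires that `n_classes * n_clusters_per_class` be at most `2**n_informative`, so I'm also raising `n_informative` above its default of 2:

```python
import numpy as np
from sklearn.datasets import make_classification

# 3 classes * 2 clusters per class = 6 <= 2**3 = 8, so this is valid.
X, y = make_classification(n_classes=3, n_informative=3, random_state=0)

print(np.unique(y))  # [0 1 2]
```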
###### `n_informative`

The `n_informative` parameter controls the number of *informative* features in the output dataset.

So what does informative mean?

An informative feature is one that has a relationship with the target label. It carries information that enables us to learn how to predict the categorical values in the data.

So the rest of the features (the un-informative ones) may be noisy or otherwise irrelevant.

Introducing uninformative (or noisy) features into a dataset can be useful, especially for experimental or educational purposes. For example, uninformative features can make a synthetic dataset messier and more realistic, and they let you test how well an algorithm handles irrelevant inputs (see the sketch below).

So it may sound a bit strange to have uninformative features, but if we're making a dataset for machine learning practice or algorithm evaluation, it may actually be useful for the synthetic dataset to have uninformative features.
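One way to see the difference (a small sketch, using `shuffle=False` so that the informative columns come first, followed by any redundant, repeated, and noise columns):

```python
import numpy as np
from sklearn.datasets import make_classification

# 4 features total: 1 informative, 0 redundant, so columns 1-3 are pure noise.
# With shuffle=False, the informative column is column 0.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=1,
                           n_redundant=0, n_clusters_per_class=1,
                           shuffle=False, random_state=0)

print(np.corrcoef(X[:, 0], y)[0, 1])  # informative column: strongly related to y
print(np.corrcoef(X[:, 3], y)[0, 1])  # noise column: correlation near 0
```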
###### `n_redundant`

The `n_redundant` parameter enables you to specify how many redundant features there are. In make_classification, redundant features are generated as linear combinations of the informative features, so they carry no new information of their own.

It might seem odd, but redundant features can be useful if you're practicing machine learning or testing a particular algorithm.

We can use redundant features to mimic the correlated, partially duplicated columns that often show up in real-world data, to test feature-selection and dimensionality-reduction techniques, and more (see the sketch below).

So like the "uninformative" features discussed earlier, redundant features can serve a useful purpose when we practice ML or try to evaluate algorithm performance.
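Here's a small sketch that makes the redundancy visible (the parameter values are illustrative): since redundant columns are linear combinations of the informative ones, they don't increase the rank of the feature matrix.

```python
import numpy as np
from sklearn.datasets import make_classification

# 5 features total: 2 informative + 3 redundant (no noise columns).
X, y = make_classification(n_samples=200, n_features=5, n_informative=2,
                           n_redundant=3, shuffle=False, random_state=0)

# The 3 redundant columns are linear combinations of the first 2,
# so the matrix rank stays at 2 instead of 5.
print(np.linalg.matrix_rank(X))  # 2
```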
###### `class_sep`

The `class_sep` parameter (short for "class separation") controls the amount of separability between the generated classes.

Said differently, `class_sep` allows you to control the degree to which the classes overlap.

There are some algorithm types where you want (or need) the classes to be perfectly separable.

And there are some algorithm types that allow the classes to overlap somewhat (so overlapping classes are good for testing such algorithms).
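As a rough sketch (the values are illustrative), a larger `class_sep` pushes the class centers further apart, which makes the classification problem easier:

```python
import numpy as np
from sklearn.datasets import make_classification

# Compare the distance between class centroids for a small vs. large class_sep.
for sep in (0.5, 3.0):
    X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                               n_redundant=0, n_clusters_per_class=1,
                               class_sep=sep, random_state=0)
    dist = np.linalg.norm(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
    print(f"class_sep = {sep}: centroid distance = {dist:.2f}")
```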
###### `n_clusters_per_class`

The `n_clusters_per_class` parameter allows you to specify how many clusters will be generated for every class.

By default, this is set to 2.

Why would you want to use this?

In some classification datasets, all of the data points for a particular class will form a tight "cluster". They will be grouped together in feature space.

But other times, members of a single class might form multiple clusters of data … they might form separate groups.

Datasets where classes have multiple clusters are generally more complex, and a *synthetic* dataset with multiple clusters per class may be more "realistic."

Essentially, the n_clusters_per_class parameter lets you emulate this complexity and real-worldness in the synthetic data created by make_classification.
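A short sketch (illustrative values), keeping in mind the constraint mentioned earlier, that `n_classes * n_clusters_per_class` must be at most `2**n_informative`:

```python
from sklearn.datasets import make_classification

# Each class is drawn from 2 separate Gaussian clusters in feature space.
# Valid because 2 classes * 2 clusters = 4 <= 2**2 = 4.
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=2,
                           class_sep=2, random_state=0)

print(X.shape, y.shape)  # (500, 2) (500,)
```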
###### `random_state`

The `random_state` parameter allows us to set a seed for the random number generator.

This ensures that any process or function that utilizes random numbers can be reproduced exactly every time we run it.

Essentially, this enables reproducibility when the code is run multiple times, whether by the same individual or different people.

If you want to learn more about seeds and random number generators, read our tutorial on Numpy Random Seed.
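For example (a quick sketch), two calls with the same seed produce identical arrays:

```python
import numpy as np
from sklearn.datasets import make_classification

# The same random_state reproduces the exact same dataset every time.
X1, y1 = make_classification(random_state=42)
X2, y2 = make_classification(random_state=42)

print(np.array_equal(X1, X2))  # True
print(np.array_equal(y1, y2))  # True
```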
##### Other Parameters

There are several other parameters that I'm leaving out here for the sake of brevity, like `weights`, `flip_y`, `hypercube`, and several others.

However, many of these will be somewhat rarely used, so in the beginning, you may want to avoid using them unless absolutely necessary.
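That said, `weights` is worth a quick sketch, since it's the usual way to simulate class imbalance (the 90/10 split below is just an illustration):

```python
import numpy as np
from sklearn.datasets import make_classification

# weights sets the approximate proportion of samples assigned to each class.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Roughly 900 samples in class 0 and 100 in class 1
# (flip_y defaults to 0.01, which adds a little label noise).
print(np.bincount(y))
```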
### The Output of make_classification

The output of the Scikit Learn make_classification function is 2 Numpy arrays.

The first is a Numpy array with shape `(n_samples, n_features)`. This is the so-called `X` array, which contains the feature data.

The second array is a Numpy array with shape `(n_samples,)`. This is the so-called `y` array, which contains the labels. It's essentially a vector of labels associated with every example in `X`. Importantly, the `y` array contains integers representing the classes, with the number of unique integers being determined by the `n_classes` parameter.
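Putting that together, here's a quick sketch of inspecting both outputs (the parameter values are just for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=5, n_classes=3,
                           n_informative=3, random_state=0)

print(X.shape)       # (300, 5): the X array of feature data
print(y.shape)       # (300,): the y array of labels
print(np.unique(y))  # [0 1 2]: integer classes, determined by n_classes
```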
Now that I've shown you the syntax of make_classification, let's look at a couple of examples.

## Examples of How to Use Make Classification

**Examples:**

- Generate data for logistic regression
#### Run this code first

Before you run the examples, make sure that you import the `make_classification` function with this code:
```python
from sklearn.datasets import make_classification

import matplotlib.pyplot as plt
import seaborn as sns
```

Here, we're also importing Seaborn and Matplotlib's Pyplot, which we'll use to visualize the data we generate.

Once you run it, you'll be ready to get started.
### EXAMPLE 1: Generate Data For Logistic Regression

Here, we're going to generate some data that will be well suited for a Logistic Regression model.

We're going to make a dataset with:

- 1000 samples
- 1 feature (which is informative)
- 0 redundant features
- 1 cluster per class
- a class separation of 2

And we're going to initialize the random number generator with `random_state = 2`, so the dataset is exactly reproducible:
```python
X, y = make_classification(n_samples=1000,
                           n_features=1,
                           n_informative=1,
                           n_redundant=0,
                           n_clusters_per_class=1,
                           class_sep=2,
                           random_state=2)
```
And let's visualize this data with a Seaborn scatterplot, so you can see it:

```python
plt.style.use('fivethirtyeight')
sns.scatterplot(x=X.flatten(), y=y, hue=y)
```

OUT:

[image: Seaborn scatterplot of the generated dataset, with the single feature on the x axis and the class label on the y axis]

Here, we have a dataset with 1 feature and 2 classes.

We'll be able to fit a logistic regression model to this, which I'll show you how to do in a future blog post.