Simulated Data Generation for Data Science and Machine Learning

Learn how to generate synthetic data using Python libraries such as NumPy, Scikit-learn, SciPy, Faker, and Synthetic Data Vault (SDV). These methods can be used for machine learning projects, statistical modeling, and other tasks involving data.

As data scientists and machine learning professionals, we often need to test our models on specific scenarios or publish academic papers about custom data science solutions. However, real-world data is not always readily available: it can be expensive to obtain or restricted for privacy reasons. This is where synthetic data comes in, and creating it is a useful skill for data science practitioners.

In this article, we will explore five methods for creating simulated data, toy datasets, and ‘dummy’ values from scratch using Python. Each method relies on a different Python library: NumPy, Scikit-learn, SciPy, Faker, and SDV.

1. Using NumPy

NumPy is a powerful Python library for linear algebra and numerical computing. It is also helpful for data generation. Let’s create a noisy dataset in which the target values depend linearly on a single feature.

import numpy as np
import matplotlib.pyplot as plt

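def create_data(N, w):
    # N samples of one feature drawn uniformly from [0, 10)
    X = np.random.rand(N, 1) * 10
    # Linear target: slope w[0], intercept w[1], plus standard normal noise
    y = w[0] * X + w[1] + np.random.randn(N, 1)
    return X, y

# 200 points scattered around the line y = 2x + 1
X, y = create_data(200, [2, 1])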

plt.figure(figsize=(10, 6))
plt.title('Simulated Linear Data')
plt.xlabel('X')
plt.ylabel('y')
plt.scatter(X, y)
plt.show()

Simulated Linear Data
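
A quick sanity check on data like this is to fit a line and see whether the slope and intercept come back close to the values used to generate it. The sketch below is a variant of create_data that assumes NumPy's Generator API (np.random.default_rng) with a fixed seed so the result is reproducible. The function name create_data_seeded and the seed value are illustrative and not part of the original code.

```python
import numpy as np

def create_data_seeded(N, w, seed=0):
    # Hypothetical variant of create_data with a fixed seed for reproducibility
    rng = np.random.default_rng(seed)
    X = rng.uniform(0, 10, size=(N, 1))        # features in [0, 10)
    noise = rng.standard_normal(size=(N, 1))   # unit-variance Gaussian noise
    y = w[0] * X + w[1] + noise                # slope w[0], intercept w[1]
    return X, y

X, y = create_data_seeded(200, [2, 1])

# Least-squares line fit; np.polyfit expects 1-D arrays
slope, intercept = np.polyfit(X.ravel(), y.ravel(), deg=1)
print(f"estimated slope: {slope:.2f}, estimated intercept: {intercept:.2f}")
```

If the estimates land far from 2 and 1, the generator or the noise level is probably not what you intended.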

We can also use NumPy to generate synthetic time series data with a linear trend and a seasonal component.

def create_time_series(N, w):
    time = np.arange(0, N)
    trend = time * w[0]              # linear trend with slope w[0]
    seasonal = np.sin(time * w[1])   # sine wave with angular frequency w[1]
    noise = np.random.randn(N)       # standard normal noise
    y = trend + seasonal + noise
    return time, y

time, y = create_time_series(100, [0.25, 0.2])

plt.figure(figsize=(10, 6))
plt.title('Simulated Time Series Data')
plt.xlabel('Time')
plt.ylabel('y')
plt.plot(time, y)
plt.show()

Simulated Time Series Data
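
In create_time_series, w[1] acts as the angular frequency of the sine term, so the seasonal cycle repeats every 2π / w[1] steps (roughly 31 steps for w[1] = 0.2). When it is more natural to think in terms of a period, for example 12 steps for monthly data, the generator can be parameterized that way instead. This is a minimal sketch: the function name and the amplitude, noise_std, and seed parameters are assumptions introduced for illustration.

```python
import numpy as np

def create_seasonal_series(N, slope, period, amplitude=1.0, noise_std=1.0, seed=0):
    # Hypothetical helper: linear trend + sine seasonality with an explicit period
    rng = np.random.default_rng(seed)
    time = np.arange(N)
    trend = slope * time
    seasonal = amplitude * np.sin(2 * np.pi * time / period)  # repeats every `period` steps
    noise = noise_std * rng.standard_normal(N)
    return time, trend + seasonal + noise

# Ten full cycles of a 12-step season on top of a gentle upward trend
time, y = create_seasonal_series(120, slope=0.25, period=12, amplitude=3.0)
print(y[:5])
```

Setting noise_std=0.0 gives a clean signal, which is handy for checking that the period is what you expect.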

2. Using Scikit-learn

Scikit-learn has data generators useful for building artificial datasets with controlled size and complexity. Let’s create a random two-class dataset using the make_classification function.

from sklearn.datasets import make_classification
import pandas as pd

X, y = make_classification(n_samples=1000, n_features=5, n_classes=2)

df = pd.DataFrame(X, columns=['feature1', 'feature2', 'feature3', 'feature4', 'feature5'])
df['target'] = y
df.head()

Simulated Classification Data
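
make_classification also exposes parameters that control how hard the problem is. The sketch below uses n_informative, n_redundant, weights, class_sep, flip_y, and random_state to build an imbalanced but reproducible dataset. The specific values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
import pandas as pd

X, y = make_classification(
    n_samples=1000,
    n_features=5,
    n_informative=3,      # features that actually carry class signal
    n_redundant=1,        # linear combinations of the informative features
    n_classes=2,
    weights=[0.9, 0.1],   # roughly 90% class 0, 10% class 1
    class_sep=1.5,        # larger values make the classes easier to separate
    flip_y=0.0,           # no randomly flipped labels
    random_state=42,      # fixed seed for reproducibility
)

df = pd.DataFrame(X, columns=[f"feature{i}" for i in range(1, 6)])
df["target"] = y
print(df["target"].value_counts(normalize=True))
```

An imbalanced toy dataset like this is useful for testing resampling strategies or metrics that are sensitive to class proportions.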

We can also use the make_regression function to create datasets for regression analysis.

from sklearn.datasets import make_regression
import matplotlib.pyplot as plt

X, y, coef = make_regression(n_samples=100, n_features=1, bias=10, noise=50, n_targets=1, random_state=0, coef=True)

plt.figure(figsize=(10, 6))
plt.title('Simulated Regression Data')
plt.xlabel('X')
plt.ylabel('y')
plt.scatter(X, y)
plt.show()

Simulated Regression Data
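
Because coef=True makes make_regression return the coefficient used to generate the data, we can compare it with what a fitted model recovers. The following is a small sketch that assumes scikit-learn's LinearRegression as the model; with noise=50 the estimates should be in the right neighborhood but will not match exactly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y, coef = make_regression(
    n_samples=100, n_features=1, bias=10, noise=50,
    n_targets=1, random_state=0, coef=True,
)

model = LinearRegression().fit(X, y)

true_coef = float(np.ravel(coef)[0])  # coef may come back as a 0-d array
print(f"true coefficient:      {true_coef:.2f}")
print(f"estimated coefficient: {model.coef_[0]:.2f}")
print(f"estimated intercept:   {model.intercept_:.2f} (true bias is 10)")
```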

3. Using SciPy

SciPy is a Python library for scientific computing, optimization, statistical analysis, and many other mathematical tasks. SciPy's stats module can draw simulated data from many statistical distributions, such as the normal, binomial, and exponential distributions.

from scipy.stats import norm, binom, expon

norm_data = norm.rvs(size=1000)
binom_data = binom.rvs(n=50, p=0.8, size=1000)
exp_data = expon.rvs(scale=.2, size=10000)
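
To make these draws reproducible and confirm they behave as expected, rvs accepts a random_state argument, and each distribution can report its theoretical mean. The sketch below repeats the three samples with a fixed seed (the seed value is an arbitrary choice) and prints the sample mean next to the theoretical one.

```python
from scipy.stats import norm, binom, expon

seed = 0  # arbitrary seed, chosen only for reproducibility

norm_data = norm.rvs(loc=0, scale=1, size=1000, random_state=seed)
binom_data = binom.rvs(n=50, p=0.8, size=1000, random_state=seed)
exp_data = expon.rvs(scale=0.2, size=10000, random_state=seed)

# Frozen distributions expose the theoretical mean for comparison
checks = [
    ("normal", norm_data, norm(loc=0, scale=1)),
    ("binomial", binom_data, binom(n=50, p=0.8)),
    ("exponential", exp_data, expon(scale=0.2)),
]
for name, data, dist in checks:
    print(f"{name}: sample mean = {data.mean():.3f}, theoretical mean = {dist.mean():.3f}")
```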

4. Using Faker

Faker is a Python library that generates fake data. We can use it to create realistic-looking user information such as names, addresses, emails, and phone numbers.

from faker import Faker
import pandas as pd

def create_fake_data(N):
    fake = Faker()
    names = [fake.name() for _ in range(N)]
    addresses = [fake.address() for _ in range(N)]
    emails = [fake.email() for _ in range(N)]
    phone_numbers = [fake.phone_number() for _ in range(N)]
    fake_df = pd.DataFrame({'Name': names, 'Address': addresses, 'Email': emails, 'Phone Number': phone_numbers})
    return fake_df

fake_users = create_fake_data(100)
fake_users.head()

Simulated Fake Data

5. Using Synthetic Data Vault (SDV)

SDV is a Python library for creating synthetic datasets using statistical models. We can use it to create synthetic data similar to an existing dataset. Here we fit a Gaussian copula model to the adult demo dataset and sample 100 synthetic rows.

from sdv.datasets.demo import download_demo
from sdv.single_table import GaussianCopulaSynthesizer

adult_data, metadata = download_demo(dataset_name='adult', modality='single_table')

model = GaussianCopulaSynthesizer(metadata)
model.fit(adult_data)
simulated_data = model.sample(100)
simulated_data.head()

Simulated Samples

Creating synthetic data is a useful skill for data science practitioners and professionals. With these five methods, we can generate simulated data for machine learning projects, statistical modeling, and other tasks involving data. The examples shown are easy to follow, so I recommend exploring the code, reading the available documentation, and developing other data generation methods better suited to your needs.

Data scientists, machine learning professionals, and developers can all benefit from synthetic datasets, which can help improve model performance and lower the cost of production and application testing.

Remember to check the notebook with all the methods explored in the article: GitHub - Marcussena/Synthetic-data-generation: Simulated Data Generation for Data Science and…