Simulated Data Generation for Data Science and Machine Learning
As data scientists and machine learning professionals, we often need to test our models on specific scenarios or publish academic papers about custom data science solutions. However, real-world data is not always readily available, and when it is, it can be expensive or private. This is where synthetic data comes in: creating it is a useful skill for data science practitioners and professionals.
In this article, we will explore five methods for creating simulated data, toy datasets, and ‘dummy’ values from scratch using Python. We will use methods from Python libraries and techniques that use built-in Python functions.
1. Using NumPy
NumPy is a powerful Python library for linear algebra and numerical computing. It is also helpful for data generation. Let’s create a noisy dataset in which the feature X has a linear relationship with the target y.
import numpy as np
import matplotlib.pyplot as plt

def create_data(N, w):
    # N samples of a single feature, uniform in [0, 10)
    X = np.random.rand(N, 1) * 10
    # Linear target: slope w[0], intercept w[1], plus standard normal noise
    y = w[0] * X + w[1] + np.random.randn(N, 1)
    return X, y

X, y = create_data(200, [2, 1])

plt.figure(figsize=(10, 6))
plt.title('Simulated Linear Data')
plt.xlabel('X')
plt.ylabel('y')
plt.scatter(X, y)
plt.show()
Figure: Simulated Linear Data
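Because the data is random, every run produces different points. As a minimal sketch, assuming we want reproducible output, we can pass a seed to NumPy's Generator API and then check that a straight-line fit recovers roughly the weights we used. The helper name create_data_seeded and the seed value are hypothetical choices for this illustration, not part of the code above.

```python
import numpy as np

def create_data_seeded(N, w, seed=0):
    # Hypothetical variant of create_data that accepts a seed
    rng = np.random.default_rng(seed)
    X = rng.random((N, 1)) * 10  # uniform in [0, 10)
    y = w[0] * X + w[1] + rng.standard_normal((N, 1))
    return X, y

X, y = create_data_seeded(200, [2, 1], seed=42)

# Fit a degree-1 polynomial; the estimates should be close to w = [2, 1]
slope, intercept = np.polyfit(X.ravel(), y.ravel(), deg=1)
print(f"estimated slope={slope:.2f}, intercept={intercept:.2f}")
```

Calling the function again with the same seed returns the same arrays, which is useful when the dataset is part of a test or a published experiment.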
We can also use NumPy to generate synthetic time series data with a linear trend and a seasonal component.
def create_time_series(N, w):
    time = np.arange(0, N)
    # Linear trend with slope w[0]
    trend = time * w[0]
    # Sinusoidal seasonal component with angular frequency w[1]
    seasonal = np.sin(time * w[1])
    # Standard normal noise
    noise = np.random.randn(N)
    y = trend + seasonal + noise
    return time, y

time, y = create_time_series(100, [0.25, 0.2])

plt.figure(figsize=(10, 6))
plt.title('Simulated Time Series Data')
plt.xlabel('Time')
plt.ylabel('y')
plt.plot(time, y)
plt.show()
Figure: Simulated Time Series Data
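It can help to see what the second weight means in practice. The short sketch below, which reuses the same weights as the example, computes the length of one seasonal cycle from the angular frequency. The variable names period and n_cycles are introduced here only for illustration.

```python
import numpy as np

w = [0.25, 0.2]
time = np.arange(0, 100)
seasonal = np.sin(time * w[1])

# One full cycle of sin(w[1] * t) spans 2*pi / w[1] time steps (about 31.4 here)
period = 2 * np.pi / w[1]

# Number of seasonal cycles contained in the simulated window
n_cycles = len(time) / period

print(f"period = {period:.1f} steps, cycles in series = {n_cycles:.1f}")
```

Note that the seasonal component has amplitude 1, the same as the standard deviation of the noise, so the cycles can be hard to spot in the plot. Multiplying the sine term by a larger constant would make the seasonality more visible.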
2. Using Scikit-learn
Scikit-learn has data generators useful for building artificial datasets with controlled size and complexity. Let’s create a random n-class dataset using the make_classification function.
from sklearn.datasets import make_classification
import pandas as pd
X, y = make_classification(n_samples=1000, n_features=5, n_classes=2)
df = pd.DataFrame(X, columns=['feature1', 'feature2', 'feature3', 'feature4', 'feature5'])
df['target'] = y
df.head()
Output: Simulated Classification Data
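The make_classification function exposes more parameters for controlling the difficulty of the problem. The sketch below is one possible configuration, not the one used above: it assumes we want an imbalanced dataset (about 90% of samples in class 0), no label noise, and a fixed seed. The specific values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
import pandas as pd

X, y = make_classification(
    n_samples=1000,
    n_features=5,
    n_informative=3,     # features that actually carry class signal
    n_redundant=1,       # linear combinations of the informative features
    n_classes=2,
    weights=[0.9, 0.1],  # approximate class proportions (imbalanced)
    flip_y=0,            # no randomly flipped labels
    random_state=42,     # reproducible output
)

df = pd.DataFrame(X, columns=[f"feature{i}" for i in range(1, 6)])
df["target"] = y
print(df["target"].value_counts())
```

An imbalanced toy dataset like this is handy for trying out resampling strategies or metrics other than accuracy.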
We can also use the make_regression function to create datasets for regression analysis.
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
X, y, coef = make_regression(n_samples=100, n_features=1, bias=10, noise=50, n_targets=1, random_state=0, coef=True)
plt.figure(figsize=(10, 6))
plt.title('Simulated Regression Data')
plt.xlabel('X')
plt.ylabel('y')
plt.scatter(X, y)
plt.show()
Figure: Simulated Regression Data
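With coef=True, make_regression also returns the coefficient of the underlying linear model, which the example above stores in coef but does not use. As a rough sketch, we can compare it with the slope of a least-squares fit. Using np.polyfit for the fit is our choice for this illustration, and with noise=50 the estimate is only expected to land in the neighborhood of the true value.

```python
import numpy as np
from sklearn.datasets import make_regression

X, y, coef = make_regression(
    n_samples=100, n_features=1, bias=10, noise=50,
    n_targets=1, random_state=0, coef=True,
)

# Least-squares line through the noisy points
slope, intercept = np.polyfit(X.ravel(), y, deg=1)

print(f"true coefficient: {float(coef):.2f}")
print(f"fitted slope:     {slope:.2f}")
print(f"fitted intercept: {intercept:.2f} (bias was set to 10)")
```

Knowing the ground-truth coefficient is one of the main advantages of simulated data: we can measure how far an estimator is from the truth.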
3. Using SciPy
SciPy is a Python library for scientific computing, optimization, statistical analysis, and many other mathematical tasks. The stats module of SciPy can create simulated data from many statistical distributions, such as the normal, binomial, and exponential distributions.
from scipy.stats import norm, binom, expon
norm_data = norm.rvs(size=1000)
binom_data = binom.rvs(n=50, p=0.8, size=1000)
exp_data = expon.rvs(scale=.2, size=10000)
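A quick sanity check is to compare the sample means with the theoretical means of each distribution. The sketch below repeats the sampling with a fixed random_state (an addition for reproducibility, not part of the snippet above) and uses the mean method of each distribution for the theoretical values.

```python
from scipy.stats import norm, binom, expon

norm_data = norm.rvs(size=1000, random_state=0)
binom_data = binom.rvs(n=50, p=0.8, size=1000, random_state=0)
exp_data = expon.rvs(scale=0.2, size=10000, random_state=0)

# Theoretical means: 0 for the standard normal, n * p = 40 for the
# binomial, and scale = 0.2 for the exponential
print("normal:     ", norm_data.mean(), "vs", norm.mean())
print("binomial:   ", binom_data.mean(), "vs", binom.mean(n=50, p=0.8))
print("exponential:", exp_data.mean(), "vs", expon.mean(scale=0.2))
```

The sample means should be close to the theoretical ones, and the gap shrinks as the sample size grows.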
4. Using Faker
Faker is a Python library that generates fake data. We can use it to create realistic-looking user records, such as names, addresses, emails, and phone numbers.
from faker import Faker
import pandas as pd
def create_fake_data(N):
    fake = Faker()
    # Generate N values for each field
    names = [fake.name() for _ in range(N)]
    addresses = [fake.address() for _ in range(N)]
    emails = [fake.email() for _ in range(N)]
    phone_numbers = [fake.phone_number() for _ in range(N)]
    # Collect the fields into a single DataFrame
    fake_df = pd.DataFrame({'Name': names, 'Address': addresses, 'Email': emails, 'Phone Number': phone_numbers})
    return fake_df

fake_users = create_fake_data(100)
fake_users.head()
Output: Simulated Fake Data
5. Using Synthetic Data Vault (SDV)
SDV is a Python library that allows the creation of synthetic datasets using statistical models. We can use it to create synthetic data similar to an existing dataset.
from sdv.datasets.demo import download_demo
from sdv.single_table import GaussianCopulaSynthesizer
adult_data, metadata = download_demo(dataset_name='adult', modality='single_table')
model = GaussianCopulaSynthesizer(metadata)
model.fit(adult_data)
simulated_data = model.sample(100)
simulated_data.head()
Output: Simulated Samples
Creating synthetic data is a useful skill for data science practitioners and professionals. With these five methods, we can generate simulated data for machine learning projects, statistical modeling, and other data-related tasks. The examples shown are easy to follow, so I recommend exploring the code, reading the available documentation, and developing other data generation methods better suited to your needs.
Data scientists, machine learning professionals, and developers can benefit from synthetic datasets, which can help improve model performance and lower the costs of production and application testing.
Remember to check the notebook with all the methods explored in the article: GitHub - Marcussena/Synthetic-data-generation: Simulated Data Generation for Data Science and…