Essential Python Libraries for Data Manipulation
======================================================
As a data professional, it’s essential to understand how to process your data. In the modern era, that means using programming languages to manipulate datasets quickly and get the results we expect.
Python is the most popular programming language among data professionals, and many of its libraries are helpful for data manipulation. From simple vector operations to parallelization, there is a library for every use case.
So, what are these Python libraries that are essential for Data Manipulation? Let’s get into it.
1. NumPy
The first library we will discuss is NumPy. NumPy is an open-source library for scientific computing. It was first released in 2005 and has been used in many data science cases since.
NumPy is a popular library, providing many valuable features for scientific computing, such as array objects, vector operations, and mathematical functions. Many data science use cases also rely on complex table and matrix calculations, and NumPy simplifies these computations.
Let’s try NumPy with Python. Many data science platforms, such as Anaconda, have NumPy installed by default, but you can always install it via pip.
pip install numpy
After the installation, we’ll create two simple arrays and perform an element-wise addition.
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = a + b
print(c)
Output:
[5 7 9]
We can also perform basic statistics calculations with NumPy.
data = np.array([1, 2, 3, 4, 5, 6, 7])
mean = np.mean(data)
median = np.median(data)
std_dev = np.std(data)
print(f"The data mean:{mean}, median:{median} and standard deviation: {std_dev}")
Output:
The data mean:4.0, median:4.0 and standard deviation: 2.0
It’s also possible to perform linear algebra operations such as matrix calculation.
x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])
dot_product = np.dot(x, y)
print(dot_product)
Output:
[[19 22]
[43 50]]
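Note that for 2-D arrays the @ operator is equivalent, so x @ y produces the same result as np.dot(x, y).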
There is so much you can do with NumPy, from handling data to performing complex calculations. It’s no wonder many libraries use NumPy as their base.
2. Pandas
Pandas is the most popular data manipulation Python library among data professionals. I am sure that many data science courses use Pandas as the foundation for everything that follows.
Pandas is famous because its APIs are intuitive yet versatile, so many data manipulation problems can easily be solved with the library. Pandas lets users perform data operations and analyze data from various input formats such as CSV, Excel, SQL databases, and JSON.
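For instance, reading each of those formats is a single function call. The file names below are placeholders for your own data.
import pandas as pd

# Placeholder file names; substitute your own paths
df_csv = pd.read_csv('sales.csv')
df_excel = pd.read_excel('sales.xlsx')
df_json = pd.read_json('sales.json')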
Pandas is built on top of NumPy, so NumPy object properties still apply to Pandas objects.
Let’s try the library. Like NumPy, it’s usually available by default if you are using a data science platform such as Anaconda, and you can follow the Pandas installation guide if you are unsure.
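For a plain pip setup, the installation works the same way as NumPy.
pip install pandas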
You can create a dataset from NumPy objects, get a DataFrame object (a table-like structure), and show the top five rows of data with the following code.
import numpy as np
import pandas as pd
np.random.seed(0)
months = pd.date_range(start='2023-01-01', periods=12, freq='M')  # month-end frequency; newer pandas versions prefer freq='ME'
sales = np.random.randint(10000, 50000, size=12)
transactions = np.random.randint(50, 200, size=12)
data = {
    'Month': months,
    'Sales': sales,
    'Transactions': transactions
}
df = pd.DataFrame(data)
df.head()
Then you can try several data manipulation activities, such as data selection.
df[df['Transactions'] < 100]
It’s also possible to perform calculations on the data.
total_sales = df['Sales'].sum()
average_transactions = df['Transactions'].mean()
Performing data cleaning with Pandas is also easy.
# Drop rows that contain missing values...
df = df.dropna()
# ...or, alternatively, fill missing values with the column means
df = df.fillna(df.mean(numeric_only=True))
There is so much to do with Pandas for data manipulation. Check out Bala Priya’s article on using Pandas for Data Manipulation to learn further.
3. Polars
Polars is a relatively new data manipulation Python library designed for the swift analysis of large datasets. Polars boasts up to 30x performance gains compared to Pandas in several benchmark tests.
Polars is built on top of Apache Arrow, so it’s efficient at managing memory for large datasets and allows for parallel processing. It also optimizes data manipulation performance using lazy execution, which delays computation until it’s necessary (we’ll see a sketch of this below).
For the Polars installation, you can use the following code.
pip install polars
Like Pandas, you can initiate the Polars DataFrame with the following code.
import numpy as np
import polars as pl
np.random.seed(0)
employee_ids = np.arange(1, 101)
ages = np.random.randint(20, 60, size=100)
salaries = np.random.randint(30000, 100000, size=100)
df = pl.DataFrame({
    'EmployeeID': employee_ids,
    'Age': ages,
    'Salary': salaries
})
df.head()
However, there are differences in how we use Polars to manipulate data. For example, here is how we select data with Polars.
df.filter(pl.col('Age') > 40)
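To take advantage of the lazy execution mentioned earlier, you can build a query on the same DataFrame and only materialize the result at the end. Here is a minimal sketch:
result = (
    df.lazy()
    .filter(pl.col('Age') > 40)
    .select(pl.col('Salary').mean())
    .collect()
)
print(result)
Because nothing is computed until .collect() is called, Polars can optimize the whole query plan before executing it.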
The API is considerably more complex than Pandas’, but it’s helpful if you require fast execution on large datasets. On the other hand, you would not get the benefit if the data size is small.
To know the details, you can refer to Josep Ferrer’s article on how Polars compares to Pandas.
4. Vaex
Vaex is similar to Polars in that the library was developed specifically for manipulating large datasets. However, there are differences in the way they process data. For example, Vaex utilizes memory-mapping techniques, while Polars focuses on a multi-threaded approach.
Vaex is optimally suited for datasets far bigger than what Polars is intended for. While Polars also targets extensive data manipulation, it works best on datasets that still fit into memory, whereas Vaex shines on datasets that exceed it.
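Vaex can be installed via pip.
pip install vaex
Here is a minimal sketch of its NumPy-friendly API. The example builds a small in-memory DataFrame for illustration; a real out-of-core workflow would memory-map a file instead (for example, with vaex.open).
import numpy as np
import vaex

# A small in-memory DataFrame for illustration; out-of-core workflows
# would memory-map a file instead, e.g. df = vaex.open('data.hdf5')
x = np.arange(1_000_000)
df = vaex.from_arrays(x=x, y=x ** 2)

# Expressions are evaluated lazily and in parallel
print(df.mean(df.x))
print(len(df[df.y > 1_000_000]))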
5. CuPy
CuPy is an open-source library that enables GPU-accelerated computing in Python. CuPy was designed as a drop-in replacement for NumPy and SciPy when you need to run calculations on NVIDIA CUDA or AMD ROCm platforms.
This makes CuPy great for applications that require intense numerical computation and can use GPU acceleration. CuPy utilizes the parallel architecture of GPUs, which is beneficial for large-scale computations.
To install CuPy, refer to their GitHub repository, as the right package depends on the platform you use. For example, the following is for the CUDA 11.x platform.
pip install cupy-cuda11x
The APIs are similar to NumPy’s, so you can use CuPy instantly if you are already familiar with NumPy. For example, here is a simple CuPy calculation.
import cupy as cp

# Arrays are allocated on the GPU
x = cp.arange(10)
y = cp.array([2] * 10)

# The multiplication runs on the GPU
z = x * y

# Copy the result back to the host as a NumPy array
print(cp.asnumpy(z))
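Output:
[ 0  2  4  6  8 10 12 14 16 18]
Note that the arrays live in GPU memory the whole time; cp.asnumpy() copies the result back to the host so it can be printed or passed to NumPy-based code.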
CuPy rounds out this list of essential Python libraries and is worth adopting if you continuously work with large-scale numerical computations.
Conclusion
All the Python libraries we have explored are essential in certain use cases. NumPy and Pandas might be the basics, but libraries like Polars, Vaex, and CuPy would be beneficial in specific environments.
If you have any other library you deem essential, please share them in the comments!