Intro – Pandas Code Snippets
Whether you are a seasoned data scientist looking to brush up on your skills or a budding programmer, revisiting the basics and experimenting with short code snippets can prepare you for more complex data handling projects. This article walks through a series of hands-on Pandas code snippets to refresh your understanding and application of this indispensable library.
Importing Pandas
This line imports the Pandas library and aliases it as pd. No output is produced in this step.
import pandas as pd
Creating a DataFrame in Pandas
Creates a DataFrame from a dictionary. The keys of the dictionary become the column names, and the values become the data in the columns.
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]
}
df = pd.DataFrame(data)
print(df)
Output
Name Age
0 Alice 25
1 Bob 30
2 Charlie 35
Reading Data from a CSV File in Pandas
Reads data from a CSV file and creates a DataFrame from it.
# Assume the file 'data.csv' contains:
# Name,Age
# Alice,25
# Bob,30
df = pd.read_csv('data.csv')
print(df)
Output
Name Age
0 Alice 25
1 Bob 30
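To try read_csv without creating a file on disk, one option is to pass an in-memory text buffer instead of a path. The sketch below assumes the same two-row contents as the example above; the names csv_text and df_csv are chosen here only for illustration.

```python
import io

import pandas as pd

# Stand-in for the contents of 'data.csv' (an assumption for this sketch)
csv_text = "Name,Age\nAlice,25\nBob,30\n"

# read_csv accepts any file-like object, not just a file path
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv.shape)   # (number of rows, number of columns)
print(df_csv.dtypes)  # the Age column is parsed as integers
```

This keeps the example reproducible, since it does not depend on a file being present in the working directory.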
Selecting a Column in a Pandas DataFrame
This snippet selects the ‘Name’ column from the DataFrame.
# Assume the file 'data.csv' contains:
# Name,Age
# Alice,25
# Bob,30
df = pd.read_csv('data.csv')
selected_column = df['Name']
print(selected_column)
Output
0 Alice
1 Bob
Name: Name, dtype: object
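Single brackets with one column name return a Series, while a list of column names returns a DataFrame. The following is a small sketch of that difference; the variable names people, names and name_age are illustrative and not part of the original snippets.

```python
import pandas as pd

# Sample data mirroring the DataFrame created earlier in the article
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

names = people['Name']               # one column name -> Series
name_age = people[['Name', 'Age']]   # list of column names -> DataFrame

print(type(names).__name__)
print(type(name_age).__name__)
```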
Filtering Rows in Pandas by column name
Filters the rows to include only those where the age is greater than 25.
# Using DataFrame from snippet 2
filtered_rows = df[df['Age'] > 25]
print(filtered_rows)
Output
Name Age
1 Bob 30
2 Charlie 35
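Filters can also combine several conditions. As a rough sketch, the example below uses & to require two conditions at once and isin() to match against a list of values; the DataFrame people and the result names are assumptions made for this illustration.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

# Each condition needs its own parentheses; use & (and), | (or), ~ (not)
between = people[(people['Age'] > 25) & (people['Age'] < 35)]
print(between)

# isin() keeps rows whose value appears in the given list
named = people[people['Name'].isin(['Alice', 'Charlie'])]
print(named)
```

The parentheses matter because & binds more tightly than the comparison operators in Python.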
Adding a New Column in Pandas Dataframe
Adds a new column to the DataFrame, indicating whether each individual is an adult based on their age.
# Using DataFrame from snippet 2
df['Is Adult'] = df['Age'] >= 18
print(df)
Output
Name Age Is Adult
0 Alice 25 True
1 Bob 30 True
2 Charlie 35 True
Dropping a Column in Pandas
Drops the ‘Is Adult’ column. drop() returns a new DataFrame and leaves the original unchanged.
# Using DataFrame from snippet 6
df_dropped = df.drop(columns=['Is Adult'])
print(df_dropped)
Output
Name Age
0 Alice 25
1 Bob 30
2 Charlie 35
Setting Index in Pandas
Sets the ‘Name’ column as the index of the DataFrame.
# Using DataFrame from snippet 2
df_indexed = df.set_index('Name')
print(df_indexed)
Output
Age
Name
Alice 25
Bob 30
Charlie 35
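One reason to set a meaningful index is label-based lookup with .loc. The sketch below assumes the same three-row data; people, by_name and bob_age are names introduced here for illustration.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})
by_name = people.set_index('Name')

# .loc looks up a row by its index label and a column by its name
bob_age = by_name.loc['Bob', 'Age']
print(bob_age)

# set_index returns a new DataFrame; the original keeps its 'Name' column
print(list(people.columns))
```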
Resetting Index in Pandas
Resets the index of the DataFrame to the default integer index.
# Using DataFrame from snippet 8
df_reset = df_indexed.reset_index()
print(df_reset)
Output
Name Age
0 Alice 25
1 Bob 30
2 Charlie 35
Grouping Data in Pandas
Groups the data by the ‘Department’ column and calculates the mean salary for each department.
data = {
'Department': ['HR', 'IT', 'HR', 'IT'],
'Employee': ['Alice', 'Bob', 'Charlie', 'David'],
'Salary': [50000, 60000, 55000, 62000]
}
df = pd.DataFrame(data)
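# numeric_only=True skips the text column 'Employee'; without it, recent pandas versions raise an error
grouped = df.groupby('Department').mean(numeric_only=True)
print(grouped)
Output
Salary
Department
HR 52500.0
IT 61000.0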
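When more than one statistic is needed per group, a common approach is to select the numeric column first and pass a list of aggregation names to agg(). This is a minimal sketch using the same department data; the names staff and summary are illustrative.

```python
import pandas as pd

staff = pd.DataFrame({
    'Department': ['HR', 'IT', 'HR', 'IT'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David'],
    'Salary': [50000, 60000, 55000, 62000],
})

# Select the numeric column, then request several aggregations at once
summary = staff.groupby('Department')['Salary'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```

Selecting ‘Salary’ before aggregating also avoids any attempt to average the text column ‘Employee’.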
Merging DataFrames in Python
Merges two DataFrames on the ‘Key’ column, keeping only the rows that have a key present in both DataFrames (inner join).
data1 = {
'Key': ['A', 'B', 'C'],
'Value1': [1, 2, 3]
}
data2 = {
'Key': ['B', 'C', 'D'],
'Value2': [4, 5, 6]
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
merged_df = pd.merge(df1, df2, on='Key', how='inner')
print(merged_df)
Output
Key Value1 Value2
0 B 2 4
1 C 3 5
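For comparison with the inner join, the sketch below shows an outer join, which keeps keys from both sides and fills the gaps with NaN. It also passes indicator=True, which adds a ‘_merge’ column reporting where each row came from. The names left, right and outer are chosen for this illustration.

```python
import pandas as pd

left = pd.DataFrame({'Key': ['A', 'B', 'C'], 'Value1': [1, 2, 3]})
right = pd.DataFrame({'Key': ['B', 'C', 'D'], 'Value2': [4, 5, 6]})

# how='outer' keeps every key from both inputs;
# indicator=True adds a '_merge' column: left_only, right_only or both
outer = pd.merge(left, right, on='Key', how='outer', indicator=True)
print(outer)
```

The other accepted values for how include 'left' and 'right', which keep all keys from one side only.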
Sorting Data in Python
Sorts the DataFrame by the ‘Age’ column in descending order.
# Using DataFrame from snippet 2
sorted_df = df.sort_values(by='Age', ascending=False)
print(sorted_df)
Output
Name Age
2 Charlie 35
1 Bob 30
0 Alice 25
Renaming Columns in Pandas
Renames the columns of the DataFrame.
# Using DataFrame from snippet 2
df_renamed = df.rename(columns={'Name': 'Employee Name', 'Age': 'Employee Age'})
print(df_renamed)
Output
Employee Name Employee Age
0 Alice 25
1 Bob 30
2 Charlie 35
Handling Missing Data in Pandas
Fills missing values with the mean of the non-null values in their respective columns.
import numpy as np
data = {
'A': [1, 2, np.nan],
'B': [4, np.nan, np.nan],
'C': [7, 8, 9]
}
df = pd.DataFrame(data)
df_filled = df.fillna(value=df.mean())
print(df_filled)
Output
A B C
0 1.0 4.0 7
1 2.0 4.0 8
2 1.5 4.0 9
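To see where the fill values come from, it can help to print the column means on their own. mean() skips NaN by default, so column A averages 1 and 2 to give 1.5, and column B has only the single value 4. The sketch below assumes the same data; df_nan, col_means and filled are illustrative names.

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': [4, np.nan, np.nan],
    'C': [7, 8, 9],
})

# mean() ignores NaN by default: A -> (1 + 2) / 2, B -> 4 / 1
col_means = df_nan.mean()
print(col_means)

# fillna with a Series fills each column with its matching value
filled = df_nan.fillna(col_means)
print(filled)
```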
Applying Functions in Pandas
Applies a function to each element in the ‘Age’ column.
# Using DataFrame from snippet 2
def age_group(age):
return 'Adult' if age >= 18 else 'Minor'
df['Age Group'] = df['Age'].apply(age_group)
print(df)
Output
Name Age Age Group
0 Alice 25 Adult
1 Bob 30 Adult
2 Charlie 35 Adult
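For a simple two-way rule like this one, a vectorized alternative to apply() is numpy.where, which evaluates the condition on the whole column at once. This is a sketch of that swapped-in technique, not the method used above; the extra row ‘Dana’ is hypothetical and added only so that both labels appear.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],  # 'Dana' is a hypothetical extra row
    'Age': [25, 30, 35, 15],
})

# np.where(condition, value_if_true, value_if_false) works element-wise
people['Age Group'] = np.where(people['Age'] >= 18, 'Adult', 'Minor')
print(people)
```

apply() remains the more flexible choice when the rule cannot be expressed as a simple condition.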
Descriptive Statistics in Pandas
Provides descriptive statistics for the numerical columns in the DataFrame.
# Using DataFrame from snippet 2
print(df.describe())
Output
Age
count 3.000000
mean 30.000000
std 5.000000
min 25.000000
25% 27.500000
50% 30.000000
75% 32.500000
max 35.000000
Unique Values in Pandas
Finds the unique values in the ‘Name’ column.
# Using DataFrame from snippet 2
unique_names = df['Name'].unique()
print(unique_names)
Output
['Alice' 'Bob' 'Charlie']
Value Counts in Pandas
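Counts the occurrences of each unique value in the ‘Fruit’ column. The output below is what pandas 1.x prints; in pandas 2.0 and later the resulting Series is named ‘count’ and its index is labelled ‘Fruit’.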
data = {
'Fruit': ['Apple', 'Banana', 'Apple', 'Banana', 'Banana']
}
df = pd.DataFrame(data)
fruit_counts = df['Fruit'].value_counts()
print(fruit_counts)
Output
Banana 3
Apple 2
Name: Fruit, dtype: int64
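value_counts() can also report proportions instead of raw counts by passing normalize=True. The sketch below assumes the same five fruit values; fruit and shares are names chosen for this example.

```python
import pandas as pd

fruit = pd.Series(['Apple', 'Banana', 'Apple', 'Banana', 'Banana'], name='Fruit')

# normalize=True divides each count by the total number of values
shares = fruit.value_counts(normalize=True)
print(shares)
```

With three bananas and two apples out of five values, the shares come to 0.6 and 0.4.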
Converting DataFrame to NumPy Array in Pandas
Converts the DataFrame to a NumPy array.
# Using DataFrame from snippet 2
array = df.to_numpy()  # equivalent to df.values; to_numpy() is the form the pandas docs recommend
print(array)
Output
[['Alice' 25]
['Bob' 30]
['Charlie' 35]]
Concatenating DataFrames in Pandas
Concatenates two DataFrames along the row axis.
data1 = {
'A': [1, 2, 3],
'B': [4, 5, 6]
}
data2 = {
'A': [7, 8, 9],
'B': [10, 11, 12]
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
concatenated_df = pd.concat([df1, df2])
print(concatenated_df)
Output
A B
0 1 4
1 2 5
2 3 6
0 7 10
1 8 11
2 9 12
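Note that the result above repeats the index labels 0, 1, 2. If a fresh 0..n-1 index is preferred, concat accepts ignore_index=True. This is a minimal sketch with the same values; top, bottom and stacked are illustrative names.

```python
import pandas as pd

top = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
bottom = pd.DataFrame({'A': [7, 8, 9], 'B': [10, 11, 12]})

# ignore_index=True discards the original labels and renumbers the rows
stacked = pd.concat([top, bottom], ignore_index=True)
print(stacked)
```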
DataFrame Slicing in Pandas
Slices the DataFrame to include only rows 1 and 2 (0-indexed). With iloc, the end position 3 is excluded.
# Using DataFrame from snippet 2
sliced_df = df.iloc[1:3]
print(sliced_df)
Output
Name Age
1 Bob 30
2 Charlie 35
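A frequent source of confusion is that iloc slices by position and excludes the end, while loc slices by label and includes it. The sketch below shows both on the same three-row data; the variable names are assumptions for this example.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

by_position = people.iloc[1:3]  # positions 1 and 2; the stop value 3 is excluded
by_label = people.loc[1:2]      # labels 1 through 2; the stop label 2 is included

print(by_position)
print(by_label)
```

Both selections return the same two rows here only because the default index labels happen to match the row positions.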
Changing Data Types in Pandas
Changes the data type of the ‘Age’ column to float.
# Using DataFrame from snippet 2
df['Age'] = df['Age'].astype(float)
print(df.dtypes)
Output
Name object
Age float64
dtype: object
Getting Column Names in Pandas
Retrieves the column names of the DataFrame.
# Using DataFrame from snippet 2
columns = df.columns
print(columns)
Output
Index(['Name', 'Age'], dtype='object')
Checking for Missing Data in Pandas
Returns a DataFrame of Boolean values that is True wherever the original value is missing.
import numpy as np
data = {
'A': [1, 2, np.nan],
'B': [4, np.nan, np.nan],
'C': [7, 8, 9]
}
df = pd.DataFrame(data)
missing_data = df.isnull()
print(missing_data)
Output
A B C
0 False False False
1 False True False
2 True True False
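On larger tables the full True/False grid is hard to read, so a common follow-up is to count the missing values per column by summing the Boolean result. The sketch below assumes the same data; df_nan and missing_per_column are illustrative names.

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': [4, np.nan, np.nan],
    'C': [7, 8, 9],
})

# True counts as 1 and False as 0, so sum() gives missing values per column
missing_per_column = df_nan.isnull().sum()
print(missing_per_column)
```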
Dropping Rows with Missing Data in Pandas
Drops every row that contains at least one missing value.
# Using DataFrame from snippet 24
df_no_missing = df.dropna()
print(df_no_missing)
Output
A B C
0 1.0 4.0 7
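Dropping every incomplete row can discard a lot of data. As a sketch of two gentler options, dropna can be limited to selected columns with subset, or applied to columns instead of rows with axis=1. The names below are chosen for this illustration.

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': [4, np.nan, np.nan],
    'C': [7, 8, 9],
})

# Only drop rows where column 'A' is missing; NaN in 'B' is tolerated
rows_with_a = df_nan.dropna(subset=['A'])
print(rows_with_a)

# axis=1 drops columns (not rows) that contain any missing value
complete_cols = df_nan.dropna(axis=1)
print(complete_cols)
```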
Finding the Index of Maximum and Minimum Values in Pandas
Finds the index of the maximum and minimum values in the ‘Age’ column.
# Using DataFrame from snippet 2
max_age_index = df['Age'].idxmax()
min_age_index = df['Age'].idxmin()
print(f'Max Age Index: {max_age_index}, Min Age Index: {min_age_index}')
Output
Max Age Index: 2, Min Age Index: 0
Saving DataFrame to CSV in Pandas
This saves the DataFrame to a CSV file named ‘output.csv’. Passing index=False leaves the row index out of the file. No output is displayed in the console.
# Using DataFrame from snippet 2
df.to_csv('output.csv', index=False)
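To inspect what would be written without touching the disk, to_csv can be called without a path, in which case it returns the CSV text as a string. The sketch below assumes the three-row Name/Age data from earlier; people and csv_text are illustrative names.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

# With no path argument, to_csv returns the CSV content as a string
csv_text = people.to_csv(index=False)
print(csv_text)
```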
Creating a DataFrame from a Series in Pandas
- A Pandas Series is created with the data ['Alice', 'Bob', 'Charlie'].
- The name parameter gives the Series a name, which in this case is 'Name'.
- The to_frame() method is called on the Series object to convert it into a DataFrame.
- The name of the Series becomes the column name in the DataFrame.
series = pd.Series(['Alice', 'Bob', 'Charlie'], name='Name')
df_series = series.to_frame()
print(df_series)
Output
Name
0 Alice
1 Bob
2 Charlie
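If a different column name is wanted, to_frame() also accepts a name argument that overrides the Series name. In the sketch below, ‘Employee’ is a hypothetical column name picked for illustration.

```python
import pandas as pd

series = pd.Series(['Alice', 'Bob', 'Charlie'], name='Name')

# name= overrides the Series name for the resulting column
# ('Employee' is a hypothetical label used only in this sketch)
renamed = series.to_frame(name='Employee')
print(renamed)
```

The original Series keeps its own name; only the new DataFrame uses the override.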