Python pandas useful tricks (this page gets updated over time)

2022, Jun 27    

Applying a function to multiple columns of a dataframe

import pandas as pd

# Create an example dataframe
df = pd.DataFrame({"User": ["user1", "user2", "user2", "user3", "user2", "user1"],
                  "Amount": [10.0, 5.0, 8.0, 10.5, 7.5, 8.0],
                  "Value":[5,4,6,4,7.5,12]})
                  
# Now create the function where the input is a whole row of a dataframe
# inside each row, you can access to any column of the row using the
# row["column_name"] format.
def compare(row):
    if row["Amount"] > row["Value"]:
        return "Amount is bigger than value"
    elif row["Amount"] < row["Value"]:
        return "Amount is smaller than value"
    else:
        return "Amount is equal to the value"

df["comparison"] = df.apply(compare, axis=1)
df
    User  Amount  Value                    comparison
0  user1    10.0    5.0   Amount is bigger than value
1  user2     5.0    4.0   Amount is bigger than value
2  user2     8.0    6.0   Amount is bigger than value
3  user3    10.5    4.0   Amount is bigger than value
4  user2     7.5    7.5  Amount is equal to the value
5  user1     8.0   12.0  Amount is smaller than value

Flattening a dataframe after doing multiple aggregation on multiple columns

import pandas as pd
import numpy as np

# Create an example dataframe
df = pd.DataFrame({"User": ["user1", "user2", "user2", "user3", "user2", "user1"],
                  "Amount": [10.0, 5.0, 8.0, 10.5, 7.5, 8.0],
                  "Value":[5,4,6,4,5,6]})

# First create the multiple aggregation on multiple column            
df_agg = df.groupby("User").agg({"Amount":[("total_amount","sum"),\
            ("average_amount","mean")],\
            "Value":[("total_value","sum"),\
                      ('value_std', lambda x: np.std(x))]})
df_agg

This is how the df_agg looks like:

            Amount                      Value          
      total_amount average_amount total_value value_std
User                                                   
user1         18.0       9.000000          11  0.500000
user2         20.5       6.833333          15  0.816497
user3         10.5      10.500000           4  0.000000

The df_agg dataframe is a grouped or aggregated dataframe and indexed by the column(s) it is grouped by. Before calling any reset_index, lets see how the columns attribute of the grouped dataframe looks like:

df_agg.columns
MultiIndex([('Amount',   'total_amount'),
            ('Amount', 'average_amount'),
            ( 'Value',    'total_value'),
            ( 'Value',      'value_std')],
           )

You can see from the list above that the User is missing in the above list of columns. This is because it is not a column yet, it is an index. If we reset the index of this dataframe then we will find User as part of the columns.

# We want to first reset the index
df_agg = df_agg.reset_index()
df_agg.columns
MultiIndex([(  'User',               ''),
            ('Amount',   'total_amount'),
            ('Amount', 'average_amount'),
            ( 'Value',    'total_value'),
            ( 'Value',      'value_std')],
           )

Now, we have already dropped the index, only left with hierarchical columns. Now we will get rid of these hierarchical columns with an easy trick. In the list above, the column names which are part of the grouping, have an empty string in the second element of the tuple. We will use this information to create list of column names, where the grouper columns come from the first element of the tuples and the aggregated columns names come from the second element of the tuple.

df_agg.columns = [f"{a}" if b=="" else f"{b}" for a,b in df_agg.columns]
df_agg
    User  total_amount  average_amount  total_value  value_std
0  user1          18.0        9.000000           11   0.500000
1  user2          20.5        6.833333           15   0.816497
2  user3          10.5       10.500000            4   0.000000

Inspired by this and this stackoverflow posts.

Create datetime by infrering the format of the datetime from string datetime column

df["dtime"] = pd.to_datetime(df["dtime_string"], infer_datetime_format = True)