Flexing my Regex muscles

I’ve spent 12 hours scouring the internet for this problem. I am close to giving up and I’ve decided to ask a question on stackoverflow. Since this blog is all about my journey, I am going to share the love and document my exploits here.

Here’s the problem:

On a dataframe, I have a column of duration in a human-readable format like “29 days 4 hours 32 minutes 1 second”. I want to break them down by having columns of days, hours, minutes, seconds with the values derived from the duration column. Like 29 for days, 4 for hours, 32 minutes, and 1 for seconds. I’ve already used this but it’s not working correctly:

# Use regex to extract time values into their respective columns
new_df = df['duration'].str.extract(r'(?P<days>\d+(?= day))|(?P<hours>\d+(?= hour))|(?P<minutes>\d+(?= min))|(?P<seconds>\d+(?= sec))')

Results in the following dataframe:

The new dataframe only has the first value but not the rest. It captured the 29 for days, and 1, 3, 4, 2, for minutes but the subsequent columns values are NaNs.

Ideally, the dataframe should like this below:

I have a feeling something is wrong with my regex.

Update: Yes, I was right about being wrong with the regex. I should’ve been using extractall instead of just extract. Here’s the correct code:

import pandas as pd
import re

list = {'id': ['123','124','125','126','127'],
        'date': ['1/1/2018', '1/2/2018', '1/3/2018', '1/4/2018','1/5/2018'],
        'duration': ['29 days 4 hours 32 minutes',
                     '1 hour 23 minutes',
                     '3 hours 2 minutes 1 second',
                     '4 hours 46 minutes 22 seconds',
                     '2 hours 1 minute']}

df = pd.DataFrame(list)

# Use regex to extract time values into their respective columns
pattern = ( '(?P<days>\d+)(?= days?\s*)|'
          + '(?P<hours>\d+)(?= hours?\s*)|'
          + '(?P<minutes>\d+)(?= minutes?\s*)|'
          + '(?P<seconds>\d+)(?= seconds?\s*)'

new_df = (df.duration.str.extractall(pattern)   # extract all with regex
          .reset_index('match',drop=True)      # merge the matches of the same row
          .unstack(level=-1, fill_value=0)     # remove fill_value if you want NaN instead of 0

Thanks to Quang Hoang and everybody else for answering the problem!

How to put two or more dataframes together

The Pandas function below takes a list of dataframes and concatenates them into. This basic flavor of concat()joins the dataframes vertically. In other words, the rows of one dataframe gets added on to the previous one.

df = pd.concat([df1,df2,df3])

Or if you want, you can store the list of dataframes into a variable first and then call the concat function. Like so:

# we must import pandas first
# put it in the beginning of your file
import pandas as pd

frames = [df1, df2, df3, df4, df5]
df = pd.concat(frames)

On the other hand, if I want to join the dataframes horizontally, then I can use merge().

For example, in the code below, we are merging df1 with df2 using ‘column_name’ as the common column. This is the column from which to base the merge. If there are any other identical columns that exist between the two dataframes, the suffixes are then appended to the each of the column names accordingly.

This particular flavor of merge() joins the dataframes horizontally. In the words, the columns of the dataframes gets added together to make one big mamma jamma of a dataframe;

df_merged = df1.merge(df2,
                      suffixes=('_left', '_right'))

How to read all the files inside a folder into a single dataframe

Have a bunch of csv files and don’t want to import them all individually? Well, I can’t blame you, and you shouldn’t anyways.

The function below reads all the files within a single folder and returns it as one Pandas dataframe. Of course, for it to work the files need to be of the same data structure. In other words, they all got to have the same column names in the same order.

# we import these first
# put them in the beginning of your file
import os
import pandas as pd

def read_data(folder):
    This function reads each the raw data files as dataframes and
    combines them into a single data frame.
    for i, file_name in enumerate(os.listdir(input_folder)):
            # df = pd.read_excel(os.path.join(input_folder, file_name)) # excel
            # df = pd.read_csv(os.path.join(input_folder, file_name), sep='\t') # tsv file
            df = pd.read_csv(os.path.join(input_folder, file_name)) # vanilla csv
            df['file_name'] = file_name
            if i == 0:
                final_df = df.copy()
                final_df = final_df.append(df)

        except Exception as e:
            print(f"Cannot read file: {file_name}")
    return final_df

# provide the folder path where your files reside
folder = 'G:/path/to/data/parent_folder_name'

# call the function
df = read_data(folder)

How to load data into a DataFrame

Below are three Panda functions that I use to load data into a dataframe. I use read_csv() to read a regular .csv file. For excel files, I use read_excel(). And finally, for those weird tab-delimited files the extension of .tsv, I also use read_csv() but with the added parameter of sep='\t'.

# we must first import pandas
# add it at the beginning of your code
import pandas as pd

# to read a regular good ol' csv file
df = pd.read_csv('../data/data_file.csv')

# to read an excel (:::gasp:::) file
df = pd.read_excel('../data/data_file.xlsx')

# to read a .tsv file
df = pd.read_csv('../data/data_file.tsv', sep='\t')

Show missing values in a Pandas DataFrame

The function below show any null, NaN, or empty values in the dataframe. It takes a dataframe as a parameter and returns a dataframe with columns that shows the count and percentage of any null, NaNs, and empty values.

def missing_values_col(df):
    Write or use a previously written function to return the
    total missing values and the percent missing values by column.
    null_count = df.isnull().sum()
    null_percentage = (null_count / df.shape[0]) * 100
    empty_count = pd.Series(((df == ' ') | (df == '')).sum())
    empty_percentage = (empty_count / df.shape[0]) * 100
    nan_count = pd.Series(((df == 'nan') | (df == 'NaN')).sum())
    nan_percentage = (nan_count / df.shape[0]) * 100
    return pd.DataFrame({'num_missing': null_count, 'missing_percentage': null_percentage,
                         'num_empty': empty_count, 'empty_percentage': empty_percentage,
                         'nan_count': nan_count, 'nan_percentage': nan_percentage})

Select Rows from a DataFrame Based on Multiple Conditions in Pandas

I’m continuing to break down a very popular stackoverflow post at https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas.

df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]

Here, the df‘s in the line of code above represent the name of data frame that we are working with.

.loc is a keyword.

The outermost square brackets [...] contains the condition that you want to use as filter. In this case, we have two conditions inside separated by & with the each condition enclosed by an open and close parenthesis (...).

This is condition #1:

(df['column_name'] >= A)

This is condition #2:

(df['column_name'] <= B)

The column_name in condition #1 can be the same or different from condition #2; it all depends on how you want to filter the data frame. The same goes for the >= and <= comparison signs, and A & B.

Suppose I have a data frame called my_tie_collection. If I only want ties that are both blue and made from silk, then, I would type:

my_tie_collection.loc[(my_tie_collection['color'] == 'blue') & (my_tie_collection['material'] == 'silk')]

Take note, I used == to denote matching and I also put single quotes around blue and silk.

Select Rows from a DataFrame based on a Value in a Column in Pandas

Today, I’ll be breaking down a very popular stackoverflow post at https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas.

df.loc[df['column_name'] == some_value]

Above, df is the name of the data frame. You should replace it twice with the name of your data frame.

.loc is a keyword.

Next, replace column_name with the name of the column that contains the values you want to filter.

Finally, replace some_value with the desired value.

For example, if I have a data frame named “my_shoe_collection” and I want to select only the rows where the value of “color” is “blue” then:

my_shoe_collection.loc[my_shoe_collection['color'] == 'blue']

Also, if I have the same data frame named and I want to only select rows where the value of “price” is less than $50, then:

my_shoe_collection.loc[my_shoe_collection['price'] <= 50]

Notice how I got rid of the single quotation marks since I’m dealing with an actual number?