Flexing my Regex muscles

I’ve spent 12 hours scouring the internet for this problem. I am close to giving up and I’ve decided to ask a question on stackoverflow. Since this blog is all about my journey, I am going to share the love and document my exploits here.

Here’s the problem:

On a dataframe, I have a column of duration in a human-readable format like “29 days 4 hours 32 minutes 1 second”. I want to break them down by having columns of days, hours, minutes, seconds with the values derived from the duration column. Like 29 for days, 4 for hours, 32 minutes, and 1 for seconds. I’ve already used this but it’s not working correctly:

# Use regex to extract time values into their respective columns
new_df = df['duration'].str.extract(r'(?P<days>\d+(?= day))|(?P<hours>\d+(?= hour))|(?P<minutes>\d+(?= min))|(?P<seconds>\d+(?= sec))')

Results in the following dataframe:

The new dataframe only has the first value but not the rest. It captured the 29 for days, and 1, 3, 4, 2, for minutes but the subsequent columns values are NaNs.

Ideally, the dataframe should like this below:

I have a feeling something is wrong with my regex.

Update: Yes, I was right about being wrong with the regex. I should’ve been using extractall instead of just extract. Here’s the correct code:

import pandas as pd
import re

list = {'id': ['123','124','125','126','127'],
        'date': ['1/1/2018', '1/2/2018', '1/3/2018', '1/4/2018','1/5/2018'],
        'duration': ['29 days 4 hours 32 minutes',
                     '1 hour 23 minutes',
                     '3 hours 2 minutes 1 second',
                     '4 hours 46 minutes 22 seconds',
                     '2 hours 1 minute']}

df = pd.DataFrame(list)

# Use regex to extract time values into their respective columns
pattern = ( '(?P<days>\d+)(?= days?\s*)|'
          + '(?P<hours>\d+)(?= hours?\s*)|'
          + '(?P<minutes>\d+)(?= minutes?\s*)|'
          + '(?P<seconds>\d+)(?= seconds?\s*)'

new_df = (df.duration.str.extractall(pattern)   # extract all with regex
          .reset_index('match',drop=True)      # merge the matches of the same row
          .unstack(level=-1, fill_value=0)     # remove fill_value if you want NaN instead of 0

Thanks to Quang Hoang and everybody else for answering the problem!

How to put two or more dataframes together

The Pandas function below takes a list of dataframes and concatenates them into. This basic flavor of concat()joins the dataframes vertically. In other words, the rows of one dataframe gets added on to the previous one.

df = pd.concat([df1,df2,df3])

Or if you want, you can store the list of dataframes into a variable first and then call the concat function. Like so:

# we must import pandas first
# put it in the beginning of your file
import pandas as pd

frames = [df1, df2, df3, df4, df5]
df = pd.concat(frames)

On the other hand, if I want to join the dataframes horizontally, then I can use merge().

For example, in the code below, we are merging df1 with df2 using ‘column_name’ as the common column. This is the column from which to base the merge. If there are any other identical columns that exist between the two dataframes, the suffixes are then appended to the each of the column names accordingly.

This particular flavor of merge() joins the dataframes horizontally. In the words, the columns of the dataframes gets added together to make one big mamma jamma of a dataframe;

df_merged = df1.merge(df2,
                      suffixes=('_left', '_right'))

How to read all the files inside a folder into a single dataframe

Have a bunch of csv files and don’t want to import them all individually? Well, I can’t blame you, and you shouldn’t anyways.

The function below reads all the files within a single folder and returns it as one Pandas dataframe. Of course, for it to work the files need to be of the same data structure. In other words, they all got to have the same column names in the same order.

# we import these first
# put them in the beginning of your file
import os
import pandas as pd

def read_data(folder):
    This function reads each the raw data files as dataframes and
    combines them into a single data frame.
    for i, file_name in enumerate(os.listdir(input_folder)):
            # df = pd.read_excel(os.path.join(input_folder, file_name)) # excel
            # df = pd.read_csv(os.path.join(input_folder, file_name), sep='\t') # tsv file
            df = pd.read_csv(os.path.join(input_folder, file_name)) # vanilla csv
            df['file_name'] = file_name
            if i == 0:
                final_df = df.copy()
                final_df = final_df.append(df)

        except Exception as e:
            print(f"Cannot read file: {file_name}")
    return final_df

# provide the folder path where your files reside
folder = 'G:/path/to/data/parent_folder_name'

# call the function
df = read_data(folder)

How to load data into a DataFrame

Below are three Panda functions that I use to load data into a dataframe. I use read_csv() to read a regular .csv file. For excel files, I use read_excel(). And finally, for those weird tab-delimited files the extension of .tsv, I also use read_csv() but with the added parameter of sep='\t'.

# we must first import pandas
# add it at the beginning of your code
import pandas as pd

# to read a regular good ol' csv file
df = pd.read_csv('../data/data_file.csv')

# to read an excel (:::gasp:::) file
df = pd.read_excel('../data/data_file.xlsx')

# to read a .tsv file
df = pd.read_csv('../data/data_file.tsv', sep='\t')

Show missing values in a Pandas DataFrame

The function below show any null, NaN, or empty values in the dataframe. It takes a dataframe as a parameter and returns a dataframe with columns that shows the count and percentage of any null, NaNs, and empty values.

def missing_values_col(df):
    Write or use a previously written function to return the
    total missing values and the percent missing values by column.
    null_count = df.isnull().sum()
    null_percentage = (null_count / df.shape[0]) * 100
    empty_count = pd.Series(((df == ' ') | (df == '')).sum())
    empty_percentage = (empty_count / df.shape[0]) * 100
    nan_count = pd.Series(((df == 'nan') | (df == 'NaN')).sum())
    nan_percentage = (nan_count / df.shape[0]) * 100
    return pd.DataFrame({'num_missing': null_count, 'missing_percentage': null_percentage,
                         'num_empty': empty_count, 'empty_percentage': empty_percentage,
                         'nan_count': nan_count, 'nan_percentage': nan_percentage})

Basic cleaning

The Python function below cleans up textual data. For parameters, it takes a dataframe and a column name. The function encodes and decodes the text. After that, it performs some basic regex parsing. Finally, all the words that are designated as stop words are then lemmatized using NLTK.

def clean(text):
  A simple function to clean up the data. All the words that
  are not designated as a stop word is then lemmatized after
  encoding and basic regex parsing are performed.
  wnl = nltk.stem.WordNetLemmatizer()
  stopwords = nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS
  text = (unicodedata.normalize('NFKD', text)
    .encode('ascii', 'ignore')
    .decode('utf-8', 'ignore')
  words = re.sub(r'[^\w\s]', '', text).split()
  return [wnl.lemmatize(word) for word in words if word not in stopwords]

How to display all the columns and rows in the a Jupyter notebook output

Here’s the code I use in my jupyter notebook to display all the columns, rows, and characters. I usually put in the beginning of my notebook, after the import statements.

# set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', -1)