Flexing my Regex muscles

I’ve spent 12 hours scouring the internet for this problem. I am close to giving up and I’ve decided to ask a question on stackoverflow. Since this blog is all about my journey, I am going to share the love and document my exploits here.

Here’s the problem:

On a dataframe, I have a column of duration in a human-readable format like “29 days 4 hours 32 minutes 1 second”. I want to break them down by having columns of days, hours, minutes, seconds with the values derived from the duration column. Like 29 for days, 4 for hours, 32 minutes, and 1 for seconds. I’ve already used this but it’s not working correctly:

# Use regex to extract time values into their respective columns
new_df = df['duration'].str.extract(r'(?P<days>\d+(?= day))|(?P<hours>\d+(?= hour))|(?P<minutes>\d+(?= min))|(?P<seconds>\d+(?= sec))')

Results in the following dataframe:

The new dataframe only has the first value but not the rest. It captured the 29 for days, and 1, 3, 4, 2, for minutes but the subsequent columns values are NaNs.

Ideally, the dataframe should like this below:

I have a feeling something is wrong with my regex.

Update: Yes, I was right about being wrong with the regex. I should’ve been using extractall instead of just extract. Here’s the correct code:

import pandas as pd
import re

list = {'id': ['123','124','125','126','127'],
        'date': ['1/1/2018', '1/2/2018', '1/3/2018', '1/4/2018','1/5/2018'],
        'duration': ['29 days 4 hours 32 minutes',
                     '1 hour 23 minutes',
                     '3 hours 2 minutes 1 second',
                     '4 hours 46 minutes 22 seconds',
                     '2 hours 1 minute']}

df = pd.DataFrame(list)

# Use regex to extract time values into their respective columns
pattern = ( '(?P<days>\d+)(?= days?\s*)|'
          + '(?P<hours>\d+)(?= hours?\s*)|'
          + '(?P<minutes>\d+)(?= minutes?\s*)|'
          + '(?P<seconds>\d+)(?= seconds?\s*)'
          )

new_df = (df.duration.str.extractall(pattern)   # extract all with regex
          .reset_index('match',drop=True)      # merge the matches of the same row
          .stack()
          .unstack(level=-1, fill_value=0)     # remove fill_value if you want NaN instead of 0
          )

Thanks to Quang Hoang and everybody else for answering the problem!

Basic cleaning

The Python function below cleans up textual data. For parameters, it takes a dataframe and a column name. The function encodes and decodes the text. After that, it performs some basic regex parsing. Finally, all the words that are designated as stop words are then lemmatized using NLTK.

def clean(text):
  """
  A simple function to clean up the data. All the words that
  are not designated as a stop word is then lemmatized after
  encoding and basic regex parsing are performed.
  """
  wnl = nltk.stem.WordNetLemmatizer()
  stopwords = nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS
  text = (unicodedata.normalize('NFKD', text)
    .encode('ascii', 'ignore')
    .decode('utf-8', 'ignore')
    .lower())
  words = re.sub(r'[^\w\s]', '', text).split()
  return [wnl.lemmatize(word) for word in words if word not in stopwords]