
Best practice of grouping regex python

  • I have a list of strings containing arbitrary phone numbers in Python. The extension is optional.

    st = ['(800) 555-1212',
    '1-800-555-1212',
    '800-555-1212x1234',
    '800-555-1212 ext. 1234',
    'work 1-(800) 555.1212 #1234']

    My objective is to split each phone number into its individual groups, viz. '800', '555', '1212' and the optional '1234'.

    I have tried out the following code.

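    import re

    # Step 1: find the phone number in each string and strip every non-digit.
    p1 = re.compile(r'(\d{3}).*(\d{3}).*(\d{4}).*(\d{4})?')
    step1 = [re.sub(r'\D', '', p1.search(t).group()) for t in st]
    # Step 2: split the remaining digits into 3-3-4 plus an optional 4-digit extension.
    p2 = re.compile(r'(\d{3})(\d{3})(\d{4})(\d{4})?')
    step2 = [p2.search(t).groups() for t in step1]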

     

    Here p1 and p2 are the two patterns used to get the desired output.

    for groups in step2:
        print(groups)
    

     

    The output was:

    ('800', '555', '1212', None)
    ('800', '555', '1212', None)
    ('800', '555', '1212', '1234')
    ('800', '555', '1212', '1234')
    ('800', '555', '1212', '1234')

     

    Since I am a newbie, I would like suggestions on whether there are better ways to tackle such problems, or best practices followed in the Python community. Thanks in advance.

     
      December 6, 2021 2:26 PM IST
  • One more (modification of yours):

    import re
    # Raw string so the backslashes reach the regex engine unchanged.
    pattern = re.compile(r'.*(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})[^\d]*(\d{4})?$')
    print([[pattern.match(s).group(i) for i in range(1, 5)] for s in st])
    
    #[['800', '555', '1212', None], ['800', '555', '1212', None], ['800', '555', '1212', '1234'], ['800', '555', '1212', '1234'], ['800', '555', '1212', '1234']]
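    A variation on the same idea, as a rough sketch: named groups let you refer to each part by name instead of by index. The group names (area, exchange, line, ext) and the parse_phone helper are made up for illustration, and the fixed 4-digit extension is an assumption taken from the sample data.

```python
import re

# Hypothetical group names; the 4-digit extension is an assumption
# based on the sample strings, not a general rule for phone numbers.
PHONE = re.compile(
    r'(?P<area>\d{3})\D*(?P<exchange>\d{3})\D*(?P<line>\d{4})\D*(?P<ext>\d{4})?$'
)

st = ['(800) 555-1212',
      '1-800-555-1212',
      '800-555-1212x1234',
      '800-555-1212 ext. 1234',
      'work 1-(800) 555.1212 #1234']


def parse_phone(text):
    """Return a dict of the named groups, or None if nothing matches."""
    m = PHONE.search(text)
    return m.groupdict() if m else None


parsed = [parse_phone(s) for s in st]
for p in parsed:
    print(p)
```

    With search() there is no need for the leading .*, and a missing extension shows up as None under the 'ext' key.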
      December 15, 2021 12:34 PM IST
  • Instead of trying to match the entire string and capture the desired substrings, you can just match runs of 3 or 4 digits.

    Demo on Regex101: https://regex101.com/r/XNbb79/1

    import re
    
    st = ['(800) 555-1212',
    '1-800-555-1212',
    '800-555-1212x1234',
    '800-555-1212 ext. 1234',
    'work 1-(800) 555.1212 #1234']
    
    # findall returns every run of 3 or 4 digits; a lone '1' prefix is skipped.
    for b in [re.findall(r'\d{3,4}', a) for a in st]:
        if len(b) == 3:
            print("number does not have extension")
            print(b)
        if len(b) == 4:
            print("number has extension")
            print(b)

     

    Output:

    number does not have extension
    ['800', '555', '1212']
    number does not have extension
    ['800', '555', '1212']
    number has extension
    ['800', '555', '1212', '1234']
    number has extension
    ['800', '555', '1212', '1234']
    number has extension
    ['800', '555', '1212', '1234']
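    If you need the same shape as the output in the question (always four items, with None for a missing extension), the findall result can be padded. This is only a sketch; split_phone is a hypothetical helper name, and raising ValueError for anything other than 3 or 4 groups is my own choice.

```python
import re


def split_phone(text):
    """Return (area, exchange, line, ext); ext is None when absent."""
    parts = re.findall(r'\d{3,4}', text)
    if len(parts) not in (3, 4):
        # Assumption: anything else is not a phone number we understand.
        raise ValueError('unexpected number of digit groups: %r' % (parts,))
    # Pad with None so the result always has four items.
    return tuple(parts) + (None,) * (4 - len(parts))


print(split_phone('(800) 555-1212'))
print(split_phone('work 1-(800) 555.1212 #1234'))
```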
      December 17, 2021 11:33 AM IST
  • I think re.findall and the similarity of the groups allow you a simpler approach:

    >>> import re
    >>> from pprint import pprint
    >>> res = [re.findall(r'\d{3,4}', s) for s in st]
    >>> pprint(res)
    [['800', '555', '1212'],
     ['800', '555', '1212'],
     ['800', '555', '1212', '1234'],
     ['800', '555', '1212', '1234'],
     ['800', '555', '1212', '1234']]
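    One caveat worth knowing: \d{3,4} will also match part of a longer digit run. If that matters for your data, lookarounds can require that the run is not touching other digits. A small sketch, where the sample string with a 5-digit extension is made up for illustration:

```python
import re

loose = re.compile(r'\d{3,4}')
# (?<!\d) and (?!\d) reject runs that have a digit right before or after them.
strict = re.compile(r'(?<!\d)\d{3,4}(?!\d)')

sample = '800-555-1212 ext. 12345'  # hypothetical 5-digit extension

loose_parts = loose.findall(sample)    # picks up '1234' out of '12345'
strict_parts = strict.findall(sample)  # skips the 5-digit run entirely

print(loose_parts)
print(strict_parts)
```

    Whether silently skipping the longer run is what you want depends on your data; the point is only that the loose pattern can return a truncated extension without any warning.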
      January 29, 2022 2:56 PM IST