Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
882 views
in Technique[技术] by (71.8m points)

python - How to match text in a cell to regex and keep only the text which matches regex?

What I am trying to do: There is a large excel sheet with a lot haphazard customer information. I want to sort the email address and other data in a set format in a new excel file.

I can't quite figure out how to match the cell text(which will have some format like Address Email squished togethe and similar) with the regex and to keep only the regex data in a list.

Would really appreciate some help. Thanks

import sys, os, openpyxl
def sort_email_from_xl():
    sheet = sheet_select()   #Opens the worksheet
    emailRegex = re.compile(r'''([a-zA-Z0-9._%+-]+@+[a-zA-Z0-9.-]+(.[a-zA-Z]{2,4}))''',re.VERBOSE)
    customeremails = []
    for row in range(0, max_row):
        if cell.text == emailRegex:
            mail = cell.text
            customeremails.append(mail)
            return customeremails
    print(customeremails)
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

This code should work (I could only test the regex part though):

import sys, os, openpyxl
def sort_email_from_xl():
    sheet = sheet_select()   #Opens the worksheet
    emailRegex = re.compile(".*?([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+.[a-zA-Z]{2,4}).*?")
    customeremails = []
    for row in range(0, max_row):
        if emailRegex.match(cell.text):
            mail = emailRegex.match(cell.text).groups()[0]
            cell.text = mail
            customeremails.append(mail)
    print(customeremails)

There were many problems with your code. First about the regex:

  • the regex was not allowing text around your email address, added that with .*? at start and end
  • you don't need the re.VERBOSE part as you'd only need it if you want to add inline comments to your regex, see doc
  • you allowed email addresses with many @ in between
  • you matched the TLD separately, that's unneeded

Now, the email regex works for basic usage, but I'd definitively recommend to take a proven email regex from other answers on Stackoverflow.

Then: with emailRegex.match(cell.text) you can check if the cell.text matches your regex and with emailRegex.match(cell.text).groups()[0] you extract only the matching part. You had one return statement too much as well.

For some reason the above code is giving me a NameError: name 'max_row' is not defined

You need to correct the looping through the rows e.g. like documented here


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...