Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

0 votes
146 views
in Technique[技术] by (71.8m points)

python - Filtering a DataFrame based on whether a string is made only of specific letters

So I have a DataFrame that looks like this:

Note: I put the differing letters between * * so they are easy to spot.

      id                                                              genome
0    639  ATGTTTGTTTTT*Y*TTGTTTTATATGTTTGTTTTTCTTGTTTTATATGTTTGTTTTTCTTGTTTTAT
1    640  ATGTTTGTTTTT*J*TTGTTTTATATGTTTGTTTTTCTTGTTTTATATGTTTGTTTTTCTTGTTTTAT
2    641  ATGTTTGTTTTTCTTGTTTTATATGTTTGTTTTTCTTGTTTTATATGTTTGTTTTTCTTGTTTTAT
3    642  ATGTTTGTTTTTCTTGTTTTATATGTTTGTTTTTCTTGTTTTATATGTTTGTTTTTCTTGTTTTAT

I want to filter it by the genome string. Basically, if the string contains any letter other than A, C, T, G, or N, leave this row in the DataFrame; otherwise just delete it.

I was trying this

df = df[~df['genome'].str.contains('[^ACTGN]')]

and this

df = df[df['genome'].str.match('^[ACTGN]+$')]

but nothing seems to work. All I get is that every row is True or every row is False, even though some rows contain different letters.

Question from: https://stackoverflow.com/questions/66052235/filtering-dataframe-based-on-that-if-string-is-made-from-specific-letters

1 Reply

0 votes
by (71.8m points)

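It looks like your strings have leading or trailing spaces (note how the values are aligned in the printout). A space is not in [ACTGN], so it matches '[^ACTGN]' on every row. Try stripping the strings first: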

df['genome'] = df['genome'].str.strip()
df = df[~df['genome'].str.contains('[^ACTGN]')]

Or you can chain them if you don't want to modify your genome column:

df = df[~df['genome'].str.strip().str.contains('[^ACTGN]')]
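
Here is a small self-contained sketch of the idea. The sample values are made up (shortened genomes with spaces added on purpose), since the padding in your real data is only a guess based on the printout. It builds the mask once, so you can keep whichever side you need:

```python
import pandas as pd

# Hypothetical sample data: short genomes padded with spaces on purpose,
# to mimic what the printout alignment suggests.
df = pd.DataFrame({
    "id": [639, 640, 641, 642],
    "genome": [" ATGTTTYTTG", " ATGTTTJTTG", "   ATGTTTCTTG", "ATGTTTCTTG "],
})

# Without stripping, the space itself matches [^ACTGN],
# so every row is flagged as containing an "other" character.
has_other_raw = df["genome"].str.contains("[^ACTGN]")

# After stripping, only rows with a genuinely different letter are flagged.
has_other = df["genome"].str.strip().str.contains("[^ACTGN]")

only_actgn = df[~has_other]  # rows made solely of A, C, T, G, N
with_other = df[has_other]   # rows containing any other letter
```

If the mask still comes back all True after stripping, printing df['genome'].map(repr) should show any remaining hidden characters.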
