KarrieK/pandas_data_cleaning

Repository: https://github.com/KarrieK/pandas_data_cleaning

Cleaning dirty data using Pandas and Jupyter notebook

There is more to life than a million rows - fact. Most data journalists start in Excel, then progress to SQL and so forth, but once your data swells in size, most people struggle to clean millions of rows of dirty data.

Rather than venturing down the SQL cleaning route, and acknowledging that OpenRefine has its limitations, I'm putting together a little cheat sheet on how to clean dirty data using pandas in a Jupyter notebook.

First steps - importing data and taking a look

It's all well and good saying we're going to clean dirty data, but do we even know how it's dirty? We need to eyeball that sucker and figure out how it looks.

First thing we need to do is read our data into pandas and take a look for ourselves.

import pandas as pd

df = pd.read_csv('/user/home/test.csv')

df.head()

Here we import pandas using the alias 'pd', then we read in our data.

df.head() shows us the first five rows and the headers, giving us an idea of what to expect. df.tail() shows us the last five rows.

Take a good look at that data and figure out what values you were expecting and what looks unusual. This is a good time to pull out your data dictionary and start looking through your data.

We also have to consider what type each of our columns is stored as. You might see that numbers have been imported as text strings, making it impossible to perform calculations on them.

To check this we use the following command:

df.dtypes

This will return a list of your data types - the most common types are int, float, datetime and object. An object is often a string; all pandas knows is that it can't perform mathematical calculations on an object.

Next we want to know how many rows and columns are in our dataset. To do that we use .shape like below:

df.shape

Maybe we want to see some key stats for our dataframe without delving too deep - mean values, min and max - just so we can get a feel for what we're working with. To do that we use .describe() like below:

df.describe()

Slicing your data

The quickest and cleanest way to slice off a chunk of our data is boolean indexing - keeping only the rows that meet a condition:

df[df['col1'] > 20]

It's fast and really powerful, and the condition can be anything pandas can evaluate.
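You can also chain conditions together with & (and) and | (or); the column names below are just placeholders for whatever is in your own data:

# rows where col1 is over 20 and col2 matches a label (hypothetical columns)
df[(df['col1'] > 20) & (df['col2'] == 'LEEDS')]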

Merging, joining and concatenating data

Sometimes before we can clean up our dataset we need to restructure or build it; merging, joining and concatenating rows and columns lets us take multiple CSVs and join them together. This saves time when it comes to cleaning our data for analysis.

Concatenating data frames

Below we have three dataframes, df1, df2 and df3, that we want to merge together to create one mighty dataset.

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])

To do this we are going to concatenate them using pd.concat:

frames = [df1, df2, df3]

result = pd.concat(frames)

While I'm a fan of pd.concat, you can also use .append to join your dataframes together - though note that DataFrame.append was deprecated and then removed in pandas 2.0, so pd.concat is the safer long-term choice. Check out the code below:

result = df1.append([df2, df3])
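As a side note, pd.concat can also tag every row with the frame it came from if you pass keys - handy later when you need to trace a dirty row back to its source CSV. The labels below are just illustrative:

# label each source frame; rows keep a MultiIndex of (label, original index)
result = pd.concat(frames, keys=['first', 'second', 'third'])

# pull back just the rows that came from df2
result.loc['second']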

Cleaning

Before we touch a single object, we need to make a copy of our data first:

df2 = df.copy()

Now we can get cracking. Hopefully at this point you have an idea of how your data is dirty and how you can clean it. However, if you suspect that maybe everything isn't what it seems, and that the pesky CSV format has led to disjointed data in columns, we can check that out.

To peer into a single column and make sure it only contains dates - no postcodes, amounts or names - we can use the following command:

df2.DATE.value_counts().sort_index()

This will give us a list of all the unique entries and the count of each. Any unsightly data which has bled in from other columns should be clustered at the bottom, ready for you to strip out.
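If the column really should contain nothing but dates, one way to strip the stragglers out - a sketch, assuming you're happy to drop the bad rows - is to let pd.to_datetime flag anything it can't parse:

# anything that isn't a parseable date becomes NaT (not-a-time)
parsed = pd.to_datetime(df2['DATE'], errors='coerce')

# keep only the rows where DATE parsed cleanly
df2 = df2[parsed.notna()]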

Occasionally there is a trailing or leading space in the column headers which makes life difficult. To check for this, try:

df2.columns

If there is a leading space you can strip it out:

df2.rename(columns=lambda x: x.strip(), inplace=True)
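The same idea works on the values inside a column; if a text column (COUNCIL here is just a stand-in) has stray spaces around its entries, .str.strip() cleans them up:

# COUNCIL is a placeholder for any string column with stray whitespace
df2['COUNCIL'] = df2['COUNCIL'].str.strip()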

But maybe our data types are all wrong. Perhaps our amounts are stored as strings like 1,234,222 and we want them as 1234222 so we can convert them into a numeric value. Then we need to remove the commas. To do this we are going to use str.replace():

df2['amount'] = df2['amount'].str.replace(',', '')
df2.head()

Our example above is a pretty straightforward replacement but what if we need to do something a little more complicated? We want to clean only a segment of our data set based on a condition. We need a conditional replacement.

df.loc[(df['COUNCIL'] == 'LEEDS') & (df['POSTCODE'] == 'LS8'), ['CCG']] = 'LEEDS NORTH CCG'

In this code we are selecting the rows where the COUNCIL column is 'LEEDS' and the POSTCODE column is 'LS8', then setting the CCG column to 'LEEDS NORTH CCG' for just those rows. It's a bit clunky to look at, but when cleaning it's a slice of magic.
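It's worth sanity-checking a replacement like that straight away. One quick way is to count what's now in CCG for the rows you targeted:

# every LEEDS / LS8 row should now read 'LEEDS NORTH CCG'
df.loc[(df['COUNCIL'] == 'LEEDS') & (df['POSTCODE'] == 'LS8'), 'CCG'].value_counts()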

Converting data types

When you load your CSV into pandas, it might not automatically detect the correct data type for a number. Pandas often reads numbers in as objects, and we cannot perform calculations on objects.

The first thing we do is check our data types

df.dtypes

Date         int64
Postcode    object
Names       object
Amount      object

dtype: object

In order to sum or count the 'Amount' column, we need to convert its data type to a number.

To do this we use the following code below:

df['Amount'] = pd.to_numeric(df['Amount'])
df.dtypes
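One caveat: pd.to_numeric raises an error if any stray strings are left in the column. Passing errors='coerce' turns anything unparseable into NaN instead, so you can hunt the stragglers down:

# unparseable values become NaN rather than crashing the conversion
df['Amount'] = pd.to_numeric(df['Amount'], errors='coerce')

# list the rows that failed to convert
df[df['Amount'].isna()]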

Inserting a new column with a fixed value

So maybe I want to join two datasets but before I do I need to know which dataset is which so I can still compare them once they've been joined.

To do that we can create a new column: we specify the position, the header, and the value, which will stay fixed for the length of the dataframe.

df.insert(loc=0, column='Country', value='UK')

Deleting a column

Dropping a column is very simple and straightforward in Pandas.

del df2['column_name']

Be aware that you cannot string multiple column names together like del df2['col1','col2','col3']

Instead you need to stack them on top of each other like below:

del df2['col1']
del df2['col2']
del df2['col3']
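If stacking del statements feels clunky, pandas' drop method takes a list of columns in one go (the names here are placeholders):

df2 = df2.drop(columns=['col1', 'col2', 'col3'])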

Re-ordering columns

Re-ordering columns is very fast and easy. We specify the order we want using double square brackets.

df2 = df2[['A', 'B', 'C','D','E']]

Renaming column headers

In order to rename a column header, we pass a dictionary mapping the current name to the new one.

df2 = df2.rename(columns={'amount_clean': 'amount'})

Dates and time

Working with dates and time is pretty tricky in most programming languages - hell, it's tricky in Excel. What I have found, though, is that you can extract years, months and days from your date column without too much hassle.

We can also convert time stamps into total minutes, hours or seconds using the datetime library.

Dealing with dates

Say you have a Date Column with dates that look like this: 01/02/2010 or 01-05-2010. We want to extract the month or year without splitting it like a string.

First thing you need to do is to confirm what sort of data you're working with. Here we use our handy old dtypes command again. You should see something like this:

df2.dtypes

DATE                          object
POSTCODE                      object
dtype: object

This means that pandas is interpreting our data as an object - a catch-all container for data it can't really parse. Generally this means our data is a string.

We can use a python function to confirm that our DATE column is definitely a string:

type(df2['DATE'][0])

str is our output, confirming our suspicions.

So what we need is a format we can work with. Luckily, Python has a great built-in library called datetime which will do the job for us.

So back at the top of our program we import datetime underneath where we previously imported pandas, and we re-load our CSV:

import pandas as pd
import datetime
df = pd.read_csv('/user/home/test.csv')

Then we convert our Python object into a datetime object, while at the same time creating a new column called 'YEAR' in our dataframe:

df2['YEAR'] = pd.DatetimeIndex(df2['DATE']).year

Run df2.head() after running the conversion above and you should have a new column in your dataframe with years cleanly extracted.
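The same pattern pulls out months and days if you need them:

df2['MONTH'] = pd.DatetimeIndex(df2['DATE']).month
df2['DAY'] = pd.DatetimeIndex(df2['DATE']).day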

Working with times

So I FOI'd a government department for 911/999 calls from a specific city. I need to calculate the mean of my response times to figure out which areas get the quickest response, but my data is a string and looks like this: 00:00:00. I check my data type and, yup, another object.

Well, we need to convert it to something we can work with. Because theoretically our fire department is supposed to arrive at a call within ten minutes, I want the total seconds for each call in a new column.

To do this I make sure I have imported the datetime library like above. The string column first needs to be converted into timedeltas using pd.to_timedelta; then I create an empty list and write a for loop using the .total_seconds() method. It looks something like this:

# parse the '00:00:00' strings into timedeltas first
durations = pd.to_timedelta(df['Best Response Duration Time'])

totes = []
for td in durations:
    totes.append(td.total_seconds())

len(totes)

By grabbing the length of my new list I know if I have the correct number of total seconds for the number of rows in my dataframe.

Now I need to assign that list as a new column in my data frame so I can compare mean response times with postcodes.

se = pd.Series(totes)
df['RESPONSE_TIME_SECONDS'] = se.values

I make sure everything has gone smoothly with df.head(), and we should now have a new column in our dataframe with the total number of seconds stored as a float, ready to be used for analysis.
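With the seconds in place, comparing mean response times by area is a one-liner - assuming, say, a POSTCODE column like the one earlier:

# average response time in seconds for each postcode, fastest first
df.groupby('POSTCODE')['RESPONSE_TIME_SECONDS'].mean().sort_values()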

Saving data

So your data is nice and clean and now you want to save it to CSV. This is pretty easy in pandas: we specify the new file name and the encoding.

df2.to_csv('clean_data.csv', encoding='utf8')
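One small gotcha: to_csv writes the dataframe's index out as an extra column by default. If you don't want it in the file, pass index=False:

df2.to_csv('clean_data.csv', encoding='utf8', index=False)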



