在线时间:8:00-16:00
迪恩网络APP
随时随地掌握行业动态
扫描二维码
关注迪恩网络微信公众号
开源软件名称:KarrieK/pandas_data_cleaning开源软件地址:https://github.com/KarrieK/pandas_data_cleaning开源编程语言:开源软件介绍:Cleaning dirty data using Pandas and Jupyter notebookThere is more to life than a million rows - fact. Most data journalists start in excel, then progress to SQL and so forth but once your data swells in size most people struggle to clean millions of rows of dirty data. Rather than venturing down the SQL cleaning route and acknowledging that OpenRefine has its limitations I'm putting together a little cheat sheet on how to clean dirty data using pandas in Jupyter notebook. First steps - importing data and taking a lookIt's all well and good saying we're going to clean dirty data but do we even know how it's dirty? We need to eyeball that sucker and figure how it looks. First thing we need to do is read our data into pandas and take a look for ourselves.
Here we import pandas using the alias 'pd', then we read in our data.
Take a good look at that data and figure out what values you were expecting and what looks unusual. This is a good time to pull out your data dictionary and start looking though your data. We also have to consider what type of values each of our columns are stored as. You might see that numbers are imported as text strings making it impossible to perform calculations on them. To check this we use the following command:
This will return a list with your data types in it - the most commong types are int, float, datetime and object. An object is often an alias for a string. All Pandas knows is that it cant perform mathematical calculations on an object. Next we want to know how many columns and rows are in our dataset. To do that we use .shape like below:
Maybe we want to see some key stats in our dataframe without delving too deep, mean values, min and max. Just so we can get a feel of what we're working with. To do that we use the .describe like below:
Slicing your dataThe quickest and cleanest way to slice off a chunk of our data is:
It's fast and really powerful, you can also build conditions into it like:
Merging, joining and concatenating dataSometimes before we can clean up our dataset we need to re-structure or build it; merging, joining and concatenating rows and columns enables us to take multiple csvs and join them together. This saves time when it comes to cleaning our data for analysis Concatenating data framesBelow we have three dataframes df1, df2 and df3 that we want to merge together to create one mighty dataset
To do this we are going to concatenate them using pd.concat
While I'm a fan of pd.concat you can use .append to join your dataframes together. Check our the code below:
CleaningBefore we touch a single object we need to make a copy of our data first
Now we can get cracking. Hopefully at this point you have an idea of how your data is dirty and how you can clean it. Howver if you suspect that maybe everything isn't what it seems and that that pesky csv format has led to disjointed data in columns we can check that out. To peer into our data in a single column and make sure it only contains dates and no postcodes or amounts and no names we can use the following command:
This will give us a list of all the unique entries and the number of each. Any unslightly data whcih has bled in from other columns should be clustered at the bottom ready for you to strip out. Ocassionaly there is a trailing or leading space in the column headers which is making life difficult. To check for this try:
If there is a leading space you can strip it out:
But maybe we're good to go, but our data types are all wrong. Perhaps our amounts are a string 1,234,222 and we want them as 1234222 so we can convert them into a numeric value. Then we need to remove the commas. To do this we are going to use
Our example above is a pretty straightforward replacement but what if we need to do something a little more complicated? We want to clean only a segment of our data set based on a condition. We need a conditional replacement.
In this code we are selecting the column CCG where the string is Leeds and the column postcode, where the postcode is LS8. Then we are replacing the value Leeds with LEEDS NORTH CCG based upon that criteria. It's a bit clunky to look at but when cleaning it's a slice of magic. Converting data typesWhen you upload your csv to pandas, it might not automatically detect that the correct data type for a number. Pandas often reads in numbers in as objects. But we can not perform calculations on objects. The first thing we do is check our data types
Date int64 Postcode object Names object Amount object dtype: object In order to sum or count the 'Amount' column we need to convert the data type of an integer. To do this we use the following code below:
Inserting a new column with a fixed valueSo maybe I want to join two datasets but before I do I need to know which dataset is which so I can still compare them once they've been joined. To do that we can create a new column, we specify the index, the header and the values which will remain fixed for the length of the dataframe.
Deleting a columnDropping a column is very simple and straightforward in Pandas.
Be aware that you cannot string multiple column names together like
Instead you need to stack them on top of each other like below
Re-ordering columnsRe-ordering columns is very fast and easy. We specify the order we want using double square brackets.
Renaming column headersIn order to re-name a column header, we need to specify that our current column is equal to a new one.
Dates and timeWorking with dates and time is pretty tricky in post programming languages, hell it's tricky in excel. What I have found though is that you can extract years, months and days from your date column without too much hassle. We can also convert time stamps into total minutes, hours or seconds using the datetime library. Dealing with datesSay you have a Date Column with dates that look like this: 01/02/2010 or 01-05-2010. We want to extract the month or year without splitting it like a string. First thing you need to do is to confirm what sort of data you're working with. Here we use our handy old
This means that Pandas is interpreting our data as an object, a container of sorts of data it's not really able to parse. Generally this could mean that our data is a string. We can use a python function to confirm that our DATE column is definitely a string:
So what we need is a format we can work with, luckily in python there is a great library called Datetime which will do the job for us. So back at the top of our program we import Datetime underneath where we previous imported pandas and we re-load our csv
Then we convert our python object into a Datetime object while at the same time creating a new column called 'Year' in our dataframe:
Run Working with timesSo I FOI'd a government department for 911/999 call from a specific city. I need to calculate the mean of my time to figure out which areas get the quickest responsebut my data is a string and looks like this Well we need to convert it to something we can work with. Because theoretically our fire department is supposed to arrive to a call within ten minutes, I want the total seconds for each call in a new column. To do this I make sure I have imported the library
By grabbing the length of my new list I know if I have the correct number of total seconds for the number of rows in my dataframe. Now I need to assign that list as a new column in my data frame so I can compare mean response times with postcodes.
I make sure everything has gone smoothly with df.head() but we should have a new column in our dataframe with the total number of seconds stored as an integer ready to be used for analysis. Saving dataSo your data is nice and clean and now you want to save it to csv. This is pretty easy in Pandas. We need to specify the new name and the encoding.
|
2023-10-27
2022-08-15
2022-08-17
2022-09-23
2022-08-13
请发表评论