Pandas II: drop() and isna()
In Pandas I tutorial, we used Jupyter Notebooks and Pandas to begin working the Metal Bands by Nation spreadsheet. We used a few different Pandas methods to do an initial inspection. In this tutorial, we will try to clean up and improve the data frame for analysis.
This blog is part of a series of tutorials called Data in Day. Follow these tutorials to create your first end-to-end data science project in just one day. This is a fun easy project that will teach you the basics of setting up your computer for a data science project and introduce you to some of the most popular tools available. It is a great way to get acquainted with the data science workflow.
- General Setup for Data Science Projects with Python
- Virtual Environments I: Installing Pyenv with Homebrew
- Virtual Environments II: Creating a Virtual Environment with Pyenv and Installing Data Science Packages
- Jupyter Notebooks I: Getting Started with Jupyter Notebooks
- GitHub I: Getting Started with GitHub
- Pandas I: read_csv(), head(), tail(), info(), and describe()
- Pandas II: drop() and isna()
- Pandas III: value_counts(), duplicated(), min(), and max()
I. Data Analysis Using a Jupyter Notebook
1. Return to your project folder in Terminal. Enter:
$ jupyter notebook
When jupyter launches, select your project notebook. In the previous tutorial, we did some preliminary analysis using Pandas methods. To re-execute that code and start where we left off, select Kernel > Restart and Run all from the top of the page.
II. Drop a Column with Pandas
2. In the second cell of the notebook, we viewed the top five rows. One thing that we can do to clean and improve the data frame is to get rid of the “Unnamed: 0" column. It seems to be a duplicate of the index.
3. To get rid of that row, we can use drop(). This method will allow us to drop either a row or a column in a data frame. Enter:
df = df.drop(“Unnamed: 0”, axis=1)
How this breaks down is we are using = to reassign our df variable to a modified version of itself. On the right side of this assignment applies drop() to df. For this to work, we need to provide drop() with the necessary arguments, in this case the name of the column and the axis. Keep in mind that if you want to drop more than one column as a time, put the columns in a list. When dropping columns, axis=1.
III. Filling Missing Values with Pandas
5. It seems that there are 8 rows missing from “origin”. Let’s examine the rows where the data is missing by entering:
Now, your notebook should look something like this:
When we enter this line, what we are doing is asking Pandas to display the parts of df such that the column “origin” has a null value. Using the df[df[column]] syntax allows us to slice view portions of the data frame that meet a certain criteria.
When dealing with missing values, using discretion is key. There are a variety of ways we can deal with missing values and there are trade-offs to any approach. Since we are trying to determine what factors influence the number of fans a metal band has, we will keep that in mind for the decision.
Two ways that we might deal with this case are elimination or substitution. We can eliminate the missing values by removing those rows. Since we have 5000 rows of data, our analysis will not be radically affected by losing eight rows. However, we could also minimize the amount of data lost by substituting “Missing”. That way, we can still use the other data for those bands in our analysis.
Since this is on a small scale, this choice is fairly trivial. However, when a data set is larger and the amount of missing data is larger, you’ll want to do what you can to avoid throwing out too much data.
6. I’m going to go with the second option and replace those values with “Missing”. All we need to do is us fillna():
df = df.fillna(“Missing”)
Great, now that we got rid of that extra column and fixed our nulls, we can begin to inspect the data types of the other columns.
7. If you’re following the Data in a Day series, save your notebook as MyProject.ipynb, so we can continue where we left off. Keep reading Pandas III to continue.
III. What Did We Do?
1. Reopened the a project in Jupyter Notebooks.
2. Used drop() to eliminate a column in a Pandas data frame.
3. Chained isna() and sum() to return the number of missing values for all columns 4.
4. Used fillna() to fill missing values
IV. What is Next?
In Pandas III, we are going analyze and transform the “split” and “formed” columns in the Metal Bands by Nation data set.