Pandas III: value_counts(), duplicated(), min(), max()
In Pandas II, we began to clean up the Metal Bands by Nation data set. We eliminated an unneeded column and filled some missing values. Now, we are going to examine the other columns, assess their data types and value, and take action as needed.
This blog is part of a series of tutorials called Data in Day. Follow these tutorials to create your first end-to-end data science project in just one day. This is a fun easy project that will teach you the basics of setting up your computer for a data science project and introduce you to some of the most popular tools available. It is a great way to get acquainted with the data science workflow.
- General Setup for Data Science Projects with Python
- Virtual Environments I: Installing Pyenv with Homebrew
- Virtual Environments II: Creating a Virtual Environment with Pyenv and Installing Data Science Packages
- Jupyter Notebooks I: Getting Started with Jupyter Notebooks
- GitHub I: Getting Started with GitHub
- Pandas I: read_csv(), head(), tail(), info(), and describe()
- Pandas II: drop(), isna()
- Pandas III: value_counts(), duplicated(), min(), and max()
I. Examining Data Types
- Open your project notebook, then click Kernel > Restart and Run All to re-execute our previous work.
2. It would be good to see what other kinds of values we are dealing with. According to what we learned from entering info we have five object (non-numeric) columns, and one numeric column. The numeric column is our target variable — the number of fans (more on that later). The rest of the data is categorical data that informs us about the qualities of the different bands.
II. Value Counts in Python with Pandas
4. The next column we will look at is “origin”. First, we will check the unique values using value_counts(), like this:
There are many values, and Pandas is only displaying the head and the tail of the results. We can wrap the line of code above in len to find out exactly how many unique countries of origin that we have.
III. Duplicates in Python with Pandas
3. Five thousand metal bands seems like a lot, so maybe we can check for duplicates just to make sure all of the rows are unique bands. We can check it like this:
This returns the number of rows where the value of “band_name” is a duplicate. We can drop them using duplicated():
df = df.drop(df[df[“band_name”].duplicated()].index)
Your notebook should look something like this:
IV. Using min() and max() with Python and Pandas
5. Next, we have the column “formed”. Calling value_counts() will return all the unique years and the frequency by which they appear. However, it might be more helpful to find out the earliest and latest years so we can get a range. Enter this into the next cell:
Even though info() indicated that “formed” is an object column, year data is actually numeric. So if we decide to replace “-“, using a substitute like “Missing” will cause problems later on. We are better off using a year that does not appear anywhere else in the data, like “1900”.
V. Transforming Data in Python with Pandas using replace() and astype()
7. You can take care of that replacement by using replace(). To apply replace() to a column in a Pandas data frame, you can use this:
df[“formed”] = df[“formed”].replace(“-”, “1900”)
Now, our min and max should be 1900–2016. If you need to see the actual minimum, simply enter this:
df[df[“formed”] != “1900”][‘formed’].min()
This will display the minimum value of the column `“formed”` in part of df such that the column `“formed”` does not (`!=`) have a value of 1900.
8. Now that we eliminated non-numeric characters from the “formed” column, we can transform the entire column into a numeric column with astype():
df[“formed”] = df[“formed”].astype(int)
9. Now, let’s do the same for the “split” column, since it is also a year, which is numeric data. That is, check the values with value_counts(), replace the “-“ with replace(), and change the data type with astype(). We will also check the min() and max().
VI. What Did We Do?
- Examined the data types of columns using info().
- Addressed duplicates with duplicated().
- Addressed nulls using isna() and fillna().
- Checked ranges using min() and max().
- Cleaned year data in “formed” and “split” with replace().
- Changed the data type of a column using .astype().
VII. What’s Next?
In Pandas IV, we continue the analysis of the Metal Bands by Nation data set by analyzing and transforming the “style” column.