Pandas I: read_csv(), head(), tail(), info(), and describe()
In the last tutorial, I introduced you to Jupyter Notebooks. Now that you have Jupyter installed, I can show you how to use notebooks to work with data. I will also introduce you to Pandas, a popular Python library for working with tabular data such as spreadsheets.
This blog is part of a series of tutorials called Data in a Day. Follow these tutorials to create your first end-to-end data science project in just one day. This is a fun, easy project that will teach you the basics of setting up your computer for a data science project and introduce you to some of the most popular tools available. It is a great way to get acquainted with the data science workflow.
- General Setup for Data Science Projects with Python
- Virtual Environments I: Installing Pyenv with Homebrew
- Virtual Environments II: Creating a Virtual Environment with Pyenv and Installing Data Science Packages
- Jupyter Notebooks I: Getting Started with Jupyter Notebooks
- GitHub I: Getting Started with GitHub
- Pandas I: read_csv(), head(), tail(), info(), and describe()
- Pandas II: drop(), isna()
I. Getting Some Data
1. Follow this link to Kaggle and download the Metal Bands by Nation data set into your project directory.
2. Next, open up Terminal and navigate to your project directory. Once in the directory, enter:
$ jupyter notebook
3. In your browser, a new tab will open showing the project directory. At the top right, you’ll see a drop-down menu labeled “New”. From the drop-down, select the name of your virtual environment. This creates a new Jupyter notebook that uses the packages installed in that virtual environment.
II. Exploring Pandas
4. Check to ensure that you’ve installed Pandas, Matplotlib, and Seaborn. If you’re not sure how to install them, check out this tutorial.
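One way to run this check from inside a notebook cell is to ask Python whether each package can be found. This is a minimal sketch that assumes the three packages are importable under the names pandas, matplotlib, and seaborn; the variable names are only illustrative.

```python
import importlib.util

# Packages this tutorial expects in the active environment.
required = ["pandas", "matplotlib", "seaborn"]

# find_spec() returns None when a package cannot be found.
status = {name: importlib.util.find_spec(name) is not None for name in required}

for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

Anything reported as missing needs to be installed into the virtual environment before moving on.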
5. Once you have Pandas, go back over to the Jupyter notebook and in the first cell, enter:
import pandas as pd
6. In the next cell, we are going to read in the spreadsheet that we downloaded earlier. To do this, enter the following:
df = pd.read_csv("metal_bands_2017.csv")
- df is the name we are using for the variable that’s going to store the spreadsheet, which now becomes a Pandas data frame.
- = is the operator used to assign data to a variable
- pd is the alias for pandas — the library we are using to read the file
- read_csv() is the function in the pandas library that reads a CSV file into a data frame
- "metal_bands_2017.csv" (within the parentheses) is the name of the file we wish to work on
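To try the pieces described above without the downloaded file, the same call can be pointed at a small CSV held in memory. This is only a sketch: the rows and column names below are made up for illustration and are not the Kaggle data set's.

```python
import io

import pandas as pd

# Stand-in for the downloaded file: a tiny in-memory CSV with made-up rows.
# The column names are hypothetical, not those of the Kaggle data set.
csv_text = """name,country,year
Alpha,Norway,1991
Beta,Sweden,1987
Gamma,Finland,2003
"""

# read_csv() accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(type(df))   # a pandas DataFrame
print(df.shape)   # (number of rows, number of columns)
```

With the real file, the only change is passing the file name as a string instead of the StringIO object.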
7. Now, we can begin to inspect the data frame by entering the following into a new cell in the notebook:
df.head()
df.head() shows the first five rows of the data frame. Placing an integer within the parentheses, as in df.head(10), shows that many rows from the top instead. Likewise, df.tail() shows the last five rows. Together they give a quick read on the format and quality of the data.
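The effect of the integer argument is easiest to see on a small example. A minimal sketch, using a toy ten-row data frame with hypothetical values rather than the metal bands data:

```python
import pandas as pd

# Toy data frame with ten rows (hypothetical values).
df = pd.DataFrame({"id": range(1, 11), "score": range(10, 110, 10)})

first_five = df.head()     # default: first 5 rows
first_three = df.head(3)   # first 3 rows
last_two = df.tail(2)      # last 2 rows

print(first_three)
print(last_two)
```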
8. To see all of the names of the columns, you can use:
df.columns
This returns the column names as a pandas Index, which you can read like a list.
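If a plain Python list is needed, the Index that df.columns returns can be converted. A minimal sketch, using a toy data frame whose column names are hypothetical:

```python
import pandas as pd

# Toy data frame (hypothetical values).
df = pd.DataFrame({"name": ["Alpha", "Beta"], "year": [1991, 1987]})

cols = df.columns                # a pandas Index object
col_list = df.columns.tolist()   # the same names as a regular Python list

print(cols)
print(col_list)
```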
9. Next, we want to know what kind of data we are working with. To find out, we can use:
df.info()
This prints a short report with all of the column names, their data types, and the number of non-null values in each column.
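Because info() counts non-null entries, a column with missing data shows a count lower than the number of rows. A minimal sketch on a toy data frame (hypothetical values); capturing the report in a buffer is optional and done here only so the text can be inspected:

```python
import io

import pandas as pd

# Toy data frame with one missing value (hypothetical data).
df = pd.DataFrame({
    "name": ["Alpha", "Beta", "Gamma"],
    "year": [1991.0, None, 2003.0],
})

# info() normally prints its report; passing buf captures the text instead.
buffer = io.StringIO()
df.info(buf=buffer)
report = buffer.getvalue()

print(report)
```

In a notebook, calling df.info() on its own line prints the same report directly under the cell.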
So far, your notebook should look something like this:
10. If you’d like to see just the null values, enter:
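One common way to do this, offered here as an assumption rather than the exact snippet for this step, is to chain isna() with sum(), which counts the missing values in each column. A minimal sketch on a toy data frame with hypothetical values:

```python
import pandas as pd

# Toy data frame with some missing values (hypothetical data).
df = pd.DataFrame({
    "name": ["Alpha", "Beta", "Gamma"],
    "year": [1991.0, None, 2003.0],
    "label": [None, None, "Indie"],
})

# isna() marks each missing cell as True; sum() counts the Trues per column.
null_counts = df.isna().sum()

print(null_counts)
```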
11. Finally, if you want to find out some descriptive statistics, you can use:
df.describe()
It should look something like this:
12. If you’re following along with the Data in a Day series, save this notebook for next time and call it “MyProject.ipynb”.
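As a recap of the describe() step above, here is a minimal sketch on a toy data frame with hypothetical values. By default, describe() summarizes only the numeric columns, so the text column is left out of the result.

```python
import pandas as pd

# Toy data frame with one text column and one numeric column (hypothetical data).
df = pd.DataFrame({
    "name": ["Alpha", "Beta", "Gamma", "Delta"],
    "score": [10, 20, 30, 40],
})

# By default describe() reports count, mean, std, min, quartiles, and max
# for numeric columns only.
stats = df.describe()

print(stats)
print(stats.loc["mean", "score"])
```

Individual statistics can be pulled out of the result with .loc, as the last line shows.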
III. What Did We Do?
- Learned how to launch Jupyter from the command line, open a new notebook, and import a package (Pandas).
- Explored Pandas and learned some of the basic methods for data analysis.
Continue to Pandas II to keep inspecting and transforming the Metal Bands by Nation data set with Python and Pandas.