Intro to Pandas

(Pandas logo by Marc Garcia - https://github.com/pandas-dev/pandas/blob/master/web/pandas/static/img/pandas.svg, BSD, https://commons.wikimedia.org/w/index.php?curid=73107397)

Pandas is a Python library used to store and manipulate tabular data.
It's an essential library that we'll use in the future (and quite frequently so) to prepare our data for predictions, classifications and whatnot.

So, without further ado, let's jump right into it.


Let us first download the data that we'll use to practice all the features and functions of the library.
Go to

It should direct you to this page:


Click on the "Download" button right beside "New Notebook".
It'll download a zipped file.
(Save the zipped file somewhere on the C drive.)
Extract it to your desired folder (also somewhere on the C drive). I put the folder here:


Now open "Anaconda Navigator" and install the pandas package in your desired environment.
If you're not familiar with Anaconda, environments and Jupyter Notebook, head over to this page:


AND NOW

TIME. TO. CODE.😎

Let us first import pandas. We'll do this by typing in:

import pandas as pd

Importing pandas as "pd" is just a commonly used convention; you can use any alias you like, but pd is generally preferred.

Hit Ctrl+Enter to run (execute) that cell.

Now we shall load the dataset by specifying its type, its location and its name.
I typed in:

ds = pd.read_csv('calcofi/bottle.csv')

and run it! (hit Ctrl+Enter)
Here, read_csv tells pandas to read a CSV file (there's also read_excel, read_table and many more). calcofi is the folder where I've stored my dataset.
(The path is relative to your Jupyter notebook file. For example, if the dataset is in the same folder as your notebook, just type the name of the file. If the dataset is in a folder that in turn sits inside your notebook's folder, type the folder name, followed by a forward slash, then the file name. Use ../ to indicate the folder above the current one, then follow it with the file name, and so on. If you still don't get it, watch a YouTube video or something.)
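The relative-path rules above can be sketched with a throwaway CSV in a temporary folder (the folder and file names here are just stand-ins for your own setup):

```python
import os
import tempfile
from pathlib import Path

import pandas as pd

# 'bottle.csv'         -> file sits next to the notebook
# 'calcofi/bottle.csv' -> file sits in a subfolder called "calcofi"
# '../bottle.csv'      -> file sits one folder above the notebook

# Build a tiny fake layout to demonstrate:
tmp = Path(tempfile.mkdtemp())
(tmp / "calcofi").mkdir()
(tmp / "calcofi" / "bottle.csv").write_text("a,b\n1,2\n3,4\n")

os.chdir(tmp)                           # pretend the notebook lives here
ds = pd.read_csv("calcofi/bottle.csv")  # path relative to the notebook
print(ds.shape)                         # (2, 2)
```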

If you type ds and run the cell, it'll show you a sort of summary of the dataset that you've loaded.

If you type and run
ds.shape
it will display the number of rows and the number of columns, respectively, in the form of a tuple.

Here, it shows that we have 864,863 rows and 74 columns.
If you type in
ds.size
it will display the total number of elements in the table (rows times columns).
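Here's a minimal sketch of shape and size on a toy DataFrame (the column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"temp": [10.1, 11.3, 9.8],
                   "salinity": [33.4, 33.6, 33.5]})

rows, cols = df.shape    # .shape is a (rows, columns) tuple
print(rows, cols)        # 3 2
print(df.size)           # 6 -> size is simply rows * cols
```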
Typing in
ds.describe()

will show descriptive statistics of the numeric columns.
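On a small made-up DataFrame, describe() looks like this (count, mean, standard deviation, min, quartiles and max per numeric column):

```python
import pandas as pd

df = pd.DataFrame({"depth": [0, 10, 20, 30],
                   "temp": [18.2, 17.9, 16.5, 14.1]})

stats = df.describe()                # one row per summary statistic
print(stats)
print(stats.loc["mean", "depth"])    # 15.0
```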

And
ds.dtypes
will show you the data type of each column (I'm gonna let you see that for yourself).
Also try
ds.info()
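A quick sketch of both, on a toy DataFrame (column names are just examples):

```python
import pandas as pd

ds = pd.DataFrame({"station": ["A", "B"],
                   "temp": [18.2, 16.5],
                   "casts": [3, 5]})

print(ds.dtypes)   # station: object, temp: float64, casts: int64
ds.info()          # dtypes plus non-null counts and memory usage
```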
If your table does not have any separators or headers (column names),
then you'll have to specify the separator and name the headers manually.
For example, load the following data:
url = 'http://bit.ly/movieusers'
users = pd.read_table(url)
users.head()

You'll see the following:

As you can see, since the data is unlabeled and unseparated, the table is all weird and unusable.
Assume someone told us that '1' is the serial number/user ID, '24' is the age, 'M' refers to the gender, 'technician' refers to the occupation and '85711' is the pin code. And we can see that all these fields are separated by a '|'.
We can fix the above table by writing:
user_cols = ['user_id', 'age', 'gender', 'occupation', 'pin_code']
users = pd.read_table(url, sep='|', header=None, names=user_cols)
users.head()
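You can try the same three parameters without downloading anything, using a small made-up sample in the same '|'-separated, headerless shape:

```python
import io

import pandas as pd

# Two invented rows in the movieusers layout:
raw = "1|24|M|technician|85711\n2|53|F|other|94043\n"

user_cols = ["user_id", "age", "gender", "occupation", "pin_code"]
users = pd.read_table(io.StringIO(raw),
                      sep="|",         # fields are pipe-separated
                      header=None,     # the file has no header row
                      names=user_cols) # so we supply the column names
print(users)
```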
This results in a much cleaner, much more usable dataset:

Alright!
Now the next step would be to select a Series in pandas.
What is a Series, you ask?
A Series is essentially an (m x 1) matrix: m rows and 1 column.
Think of it as selecting a whole feature (column) of a dataset.
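A minimal sketch of the idea, using made-up data: selecting a single column from a DataFrame gives you a Series with m rows and no columns of its own.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Ithaca", "Willingboro"],
                   "state": ["NY", "NJ"]})

col = df["city"]     # selecting one column gives a Series
print(type(col))     # <class 'pandas.core.series.Series'>
print(col.shape)     # (2,) -- m rows, one column
```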

Let's now use another dataset to clearly understand how all of this works.
Type in the following code to download and view your new example dataset:

url2 = 'http://bit.ly/uforeports'
ufo = pd.read_table(url2, sep = ',')
ufo.head()
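As a side note, read_table with sep=',' is equivalent to read_csv, whose default separator is already a comma. A quick check with a small invented sample:

```python
import io

import pandas as pd

raw = "City,State\nIthaca,NY\nHolyoke,CO\n"  # made-up two-row sample

a = pd.read_table(io.StringIO(raw), sep=",")  # read_table with comma sep
b = pd.read_csv(io.StringIO(raw))             # read_csv defaults to sep=","

print(a.equals(b))   # True -- both calls parse this file identically
```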

Now if you type in

ufo['City']

('City' is case sensitive, so take care of that. Write the exact name of the column you want to select.)

You'll see the list of all the cities mentioned in the dataset:

You can also concatenate multiple columns, like:

ufo['Location'] = ufo['City'] + ', ' + ufo['State']
ufo.head()
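The same concatenation on a tiny made-up stand-in for the ufo dataset, so you can see what the + operator does element-wise on string columns:

```python
import pandas as pd

ufo = pd.DataFrame({"City": ["Ithaca", "Holyoke"],
                    "State": ["NY", "CO"]})

# + joins string columns row by row; the ', ' literal is repeated per row
ufo["Location"] = ufo["City"] + ", " + ufo["State"]
print(ufo["Location"].tolist())   # ['Ithaca, NY', 'Holyoke, CO']
```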

Alright!
FINALLY!

THE END

Depending on the type of person you are, this was either painstakingly boring or a quick read.
It might've been too overwhelming, or not enough.
But regardless, I'm glad you read the entirety of it.
We'll be using all these functions in our future "tutorials".

Thank You
And Cheers!
Satwik


Previous Post : Introduction to Anaconda
Next Post : NumPy
