====== Pandas Cheat Sheet ======
https://pandas.pydata.org/
https://pypi.org/project/pandas/
===== Import =====
import pandas as pd
===== Series =====
Example inputs:
# list:
a = [1, 7, 2]
# dictionary:
kv = {"day1": 420, "day2": 380, "day3": 390}
Simple series, no labels:
myseries1 = pd.Series(a)
print(myseries1)
0 1
1 7
2 2
dtype: int64
Series with labels:
myseries2 = pd.Series(a, index=["x","y","z"])
print(myseries2)
x 1
y 7
z 2
dtype: int64
Key-value as series:
mykvseries = pd.Series(kv)
print(mykvseries)
day1 420
day2 380
day3 390
dtype: int64
Subset of key-value input:
mykvseries_filtered = pd.Series(kv, index = ["day1","day2"])
print(mykvseries_filtered)
day1 420
day2 380
dtype: int64
===== Dataframes =====
Input:
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}
Load into a dataframe:
mydataframe = pd.DataFrame(mydataset)
print(mydataframe)
cars passings
0 BMW 3
1 Volvo 7
2 Ford 2
===== Load from a File =====
CSV:
df = pd.read_csv('data.csv')
JSON:
df = pd.read_json('data.json')
===== Simple Analysis =====
First 10 rows:
print(df.head(10))
Last 5 rows (default):
print(df.tail())
Dataset info:
print(df.info())
The result tells us the following:
* Row count and column count
* The name of each column, with the data type
* How many non-null values there are present in each column
===== Clean Empty Cells =====
Drop empty cells, placing the results in a new dataframe:
new_df = df.dropna()
Drop empty cells, modifying the original dataframe:
df.dropna(inplace = True)
Replace empty cells with a default value (''130'' in this example):
# WARNING: This affects all columns!
df.fillna(130, inplace = True)
Replace with a default value in a specific column:
df.fillna({"Calories": 130}, inplace=True)
Replace using the **mean**:
# Mean is the average value (the sum of all values divided by number of values).
x = df["Calories"].mean()
df.fillna({"Calories": x}, inplace=True)
Replace using the **median**:
# Median is the value in the middle, after you have sorted all values ascending.
x = df["Calories"].median()
df.fillna({"Calories": x}, inplace=True)
Replace using the **mode**:
# Mode is the value that appears most frequently.
x = df["Calories"].mode()[0]
df.fillna({"Calories": x}, inplace=True)
===== Clean Wrong Format =====
This example assumes that we have values that are not in a consistent format, but that can still be converted to a date:
df['Date'] = pd.to_datetime(df['Date'], format='mixed')
But, there may be some that can't be converted at all. They will end up with **NaT** (not a time) values. We can remove them with this:
df.dropna(subset=['Date'], inplace = True)
===== Clean Wrong Data =====
Sometimes, data is just wrong, e.g., typos.
For simple fixes, we can update the row directly:
# Assign a value of 45 to the Duration column in row 7:
df.loc[7, 'Duration'] = 45
For large data sets, use rules-based updating:
# For each row with a Duration value larger than 120, assign a new value of 120:
for x in df.index:
if df.loc[x, "Duration"] > 120:
df.loc[x, "Duration"] = 120
Remove bad rows altogether:
# For each row with a Duration value larger than 120, drop the row:
for x in df.index:
if df.loc[x, "Duration"] > 120:
df.drop(x, inplace = True)
===== Remove Duplicates =====
Find duplicates:
print(df.duplicated())
Remove them:
df.drop_duplicates(inplace = True)
===== Correlation =====
The ''corr()'' method calculates the relationship between each column in a data set. The closer to 1 a correlation value is, the more closely related the columns are.
A positive correlation means values are likely to move together, e.g., if one goes up, the other probably will too. A negative correlation shows the opposite, e.g., if one goes up, the other is likely to go down.
df.corr()
Example output:
^ ^ Duration ^ Pulse ^ Maxpulse ^ Calories ^
| Duration | 1.000000 | -0.155408 | 0.009403 | 0.922717 |
| Pulse | -0.155408 | 1.000000 | 0.786535 | 0.025121 |
| Maxpulse | 0.009403 | 0.786535 | 1.000000 | 0.203813 |
| Calories | 0.922717 | 0.025121 | 0.203813 | 1.000000 |
===== Plotting =====
Import ''matplotlib'':
import matplotlib.pyplot as plt
Line plot (default):
df.plot()
plt.show()
Scatter plot:
# You can use .corr() to check for strong correlation and determine good
# argument candidates for a scatter plot.
df.corr()
df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')
plt.show()
Histogram:
df["Duration"].plot(kind = 'hist')
{{tag>python}}