====== Pandas Cheat Sheet ====== https://pandas.pydata.org/ https://pypi.org/project/pandas/ ===== Import ===== import pandas as pd ===== Series ===== Example inputs: # list: a = [1, 7, 2] # dictionary: kv = {"day1": 420, "day2": 380, "day3": 390} Simple series, no labels: myseries1 = pd.Series(a) print(myseries1) 0 1 1 7 2 2 dtype: int64 Series with labels: myseries2 = pd.Series(a, index=["x","y","z"]) print(myseries2) x 1 y 7 z 2 dtype: int64 Key-value as series: mykvseries = pd.Series(kv) print(mykvseries) day1 420 day2 380 day3 390 dtype: int64 Subset of key-value input: mykvseries_filtered = pd.Series(kv, index = ["day1","day2"]) print(mykvseries_filtered) day1 420 day2 380 dtype: int64 ===== Dataframes ===== Input: mydataset = { 'cars': ["BMW", "Volvo", "Ford"], 'passings': [3, 7, 2] } Load into a dataframe: mydataframe = pd.DataFrame(mydataset) print(mydataframe) cars passings 0 BMW 3 1 Volvo 7 2 Ford 2 ===== Load from a File ===== CSV: df = pd.read_csv('data.csv') JSON: df = pd.read_json('data.json') ===== Simple Analysis ===== First 10 rows: print(df.head(10)) Last 5 rows (default): print(df.tail()) Dataset info: print(df.info()) The result tells us the following: * Row count and column count * The name of each column, with the data type * How many non-null values there are present in each column ===== Clean Empty Cells ===== Drop empty cells, placing the results in a new dataframe: new_df = df.dropna() Drop empty cells, modifying the original dataframe: df.dropna(inplace = True) Replace empty cells with a default value (''130'' in this example): # WARNING: This affects all columns! df.fillna(130, inplace = True) Replace with a default value in a specific column: df.fillna({"Calories": 130}, inplace=True) Replace using the **mean**: # Mean is the average value (the sum of all values divided by number of values). x = df["Calories"].mean() df.fillna({"Calories": x}, inplace=True) Replace using the **median**: # Median is the value in the middle, after you have sorted all values ascending. x = df["Calories"].median() df.fillna({"Calories": x}, inplace=True) Replace using the **mode**: # Mode is the value that appears most frequently. x = df["Calories"].mode()[0] df.fillna({"Calories": x}, inplace=True) ===== Clean Wrong Format ===== This example assumes that we have values that are not in a consistent format, but that can still be converted to a date: df['Date'] = pd.to_datetime(df['Date'], format='mixed') But, there may be some that can't be converted at all. They will end up with **NaT** (not a time) values. We can remove them with this: df.dropna(subset=['Date'], inplace = True) ===== Clean Wrong Data ===== Sometimes, data is just wrong, e.g., typos. For simple fixes, we can update the row directly: # Assign a value of 45 to the Duration column in row 7: df.loc[7, 'Duration'] = 45 For large data sets, use rules-based updating: # For each row with a Duration value larger than 120, assign a new value of 120: for x in df.index: if df.loc[x, "Duration"] > 120: df.loc[x, "Duration"] = 120 Remove bad rows altogether: # For each row with a Duration value larger than 120, drop the row: for x in df.index: if df.loc[x, "Duration"] > 120: df.drop(x, inplace = True) ===== Remove Duplicates ===== Find duplicates: print(df.duplicated()) Remove them: df.drop_duplicates(inplace = True) ===== Correlation ===== The ''corr()'' method calculates the relationship between each column in a data set. The closer to 1 a correlation value is, the more closely related the columns are. A positive correlation means values are likely to move together, e.g., if one goes up, the other probably will too. A negative correlation shows the opposite, e.g., if one goes up, the other is likely to go down. df.corr() Example output: ^ ^ Duration ^ Pulse ^ Maxpulse ^ Calories ^ | Duration | 1.000000 | -0.155408 | 0.009403 | 0.922717 | | Pulse | -0.155408 | 1.000000 | 0.786535 | 0.025121 | | Maxpulse | 0.009403 | 0.786535 | 1.000000 | 0.203813 | | Calories | 0.922717 | 0.025121 | 0.203813 | 1.000000 | ===== Plotting ===== Import ''matplotlib'': import matplotlib.pyplot as plt Line plot (default): df.plot() plt.show() Scatter plot: # You can use .corr() to check for strong correlation and determine good # argument candidates for a scatter plot. df.corr() df.plot(kind = 'scatter', x = 'Duration', y = 'Calories') plt.show() Histogram: df["Duration"].plot(kind = 'hist') {{tag>python}}