Things you should know when using Python Pandas
What is Pandas?
Pandas is an open-source Python library that provides high-performance data structures and data analysis tools. It is built for working with labeled, tabular data of mixed types (numbers, strings, dates and so on), and it supports slicing, filtering, reshaping and merging data sets. Pandas also includes descriptive statistics such as the mean, standard deviation and correlation. Regression analysis and formal statistical tests are not part of Pandas itself; they are usually done with companion libraries such as statsmodels or SciPy.
How do I install Pandas?
Installing Pandas is easy; simply type:
"pip install pandas"
into your terminal window and press Enter. This installs the latest Pandas release that is compatible with your Python version, into whichever Python environment that pip command belongs to.
How do I import Pandas?
After installing Pandas, you can import it into your Python script by typing:
"import pandas as pd"
This gives you access to the functions and classes in the Pandas library under the short alias pd, which is the conventional name and the one used in the rest of this article.
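A quick way to confirm that the installation and the import both worked is to print the installed version. This is a minimal sketch; the version string you see depends on your environment.

```python
import pandas as pd

# Print the installed pandas version to confirm the import succeeded.
version = pd.__version__
print(version)
```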
What are the basic data structures in Pandas?
The two primary data structures in Pandas are the Series and the DataFrame. A Series is a one-dimensional labeled array: every value is paired with an index label. A DataFrame is a two-dimensional labeled table with a row index and named columns, and each of its columns is itself a Series.
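The relationship between the two structures can be sketched as follows. The values, index labels and variable names here are made up for illustration.

```python
import pandas as pd

# A Series: one dimension, each value paired with an index label.
s = pd.Series([10, 20, 30], index=["x", "y", "z"])
print(s["y"])  # look a value up by its label -> 20

# A DataFrame: two dimensions, with a row index and named columns.
table = pd.DataFrame({"A": [0, 1, 2], "B": [3, 4, 5]})
print(table.shape)  # (rows, columns) -> (3, 2)

# Selecting a single column of a DataFrame gives back a Series.
column = table["A"]
print(isinstance(column, pd.Series))  # True
```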
How do I create a Series in Pandas?
Creating a Series in Pandas is easy; simply pass a list of values to the pd.Series() constructor. Unless you supply an index yourself, Pandas assigns a default integer index starting at 0. For example, to create a Series of the integers from 0 to 9, you would type:
"pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])".
How do I create a DataFrame in Pandas?
Creating a DataFrame in Pandas is just as easy as creating a Series; simply pass a dictionary to the pd.DataFrame() constructor, where each key is a column name and each value is a list of that column's values. The lists must all have the same length. For example, to create a DataFrame with three columns (A, B, and C), you would type:
"pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5], 'C': [6, 7, 8]})"
How do I index and select data in Pandas?
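Indexing and selecting data in Pandas is similar to indexing and selecting data in NumPy, with square brackets [] doing most of the work. On a DataFrame named df, df['A'] selects the column named A, df[['A', 'B']] selects several columns, and a Boolean condition inside the brackets selects the rows for which that condition is True. For selection by row label or by integer position, Pandas also provides the .loc and .iloc indexers.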
For example, assuming the DataFrame from the previous section has been assigned to a variable named df, you can select all rows with an 'A' value greater than 0 by typing:
"df[df['A'] > 0]"
To select all rows with an 'A' value greater than 0 AND a 'B' value less than 5, type:
"df[(df['A'] > 0) & (df['B'] < 5)]"
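Note that when combining multiple conditions within square brackets like this you MUST put parentheses around each individual condition. In Python the & and | operators bind more tightly than comparison operators such as > and <, so without parentheses the expression is grouped incorrectly and typically raises an error. Also use & and | rather than the keywords "and" and "or", which do not work element by element on a Series.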
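To see what happens when the parentheses are left out, the following minimal sketch rebuilds the three-column DataFrame from the earlier example (the variable names are illustrative). Without parentheses, Python evaluates 0 & df["B"] first and then treats the rest as a chained comparison, which needs a single True or False value and therefore fails on a whole Series.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2], "B": [3, 4, 5], "C": [6, 7, 8]})

# Correct: each condition is parenthesized before being combined with &.
both = df[(df["A"] > 0) & (df["B"] < 5)]
print(both)

# Incorrect: parsed as df["A"] > (0 & df["B"]) < 5, a chained comparison
# that asks for the truth value of a whole Series.
try:
    df[df["A"] > 0 & df["B"] < 5]
    failed = False
except ValueError as exc:
    failed = True
    print("ValueError:", exc)
```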
Finally, to select all rows where 'A' is greater than 0 OR 'B' is less than 5, use the pipe character ("|") instead of the ampersand ("&"):
"df[(df['A'] > 0) | (df['B'] < 5)]"
Indexing and selection can do much more than this, but these basics are enough to get you started!
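The selections described above can be tried together in one short script. This is a sketch that assumes the three-column DataFrame from the earlier example is stored in a variable named df; the other variable names are illustrative. The last line uses the .loc indexer, which takes a row condition and a list of column names in one step.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2], "B": [3, 4, 5], "C": [6, 7, 8]})

# Rows where A is greater than 0.
a_positive = df[df["A"] > 0]
print(a_positive.index.tolist())  # [1, 2]

# Rows where A is greater than 0 OR B is less than 5.
either = df[(df["A"] > 0) | (df["B"] < 5)]
print(either.index.tolist())  # [0, 1, 2]

# .loc filters rows and picks columns at the same time.
subset = df.loc[df["A"] > 0, ["A", "C"]]
print(subset)
```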
What are some basic statistical functions available in Pandas?
Pandas provides many built-in statistical methods that work on both Series and DataFrame objects. Common ones include mean(), median(), mode(), min(), max(), std(), var(), and count(). To calculate one, call it as a method on the object, in the form "object.statistic()". For example, given a DataFrame df with numeric columns named 'age' and 'weight' (a different DataFrame from the A, B, C example above), "df['age'].mean()" returns the mean age of all individuals, while "df[['age', 'weight']].std()" returns the standard deviation of each of the two columns. Note that std() and var() compute the sample statistic (ddof=1) by default, count() counts only non-missing values, and mode() returns a Series or DataFrame rather than a single value, because several values can be tied for most frequent.
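As a minimal sketch of these methods in use, the script below builds a small DataFrame with 'age' and 'weight' columns. The data is invented purely for illustration and does not come from the earlier examples.

```python
import pandas as pd

# Illustrative data only: the 'age' and 'weight' values are made up.
df = pd.DataFrame({"age": [20, 30, 40], "weight": [60.0, 70.0, 80.0]})

mean_age = df["age"].mean()
print(mean_age)  # 30.0

# Sample standard deviation (ddof=1) of each selected column.
spread = df[["age", "weight"]].std()
print(spread)

# Number of non-missing values in each column.
counts = df.count()
print(counts)

# describe() reports several of these statistics at once.
print(df.describe())
```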