# How to use pandas_cub
The README.ipynb notebook will serve as the documentation and usage guide to pandas_cub.
## Installation
`pip install pandas-cub`
## What is pandas_cub?
pandas_cub is a simple data analysis library that emulates the functionality of the pandas library. The library is not meant for serious work. It was built as an assignment for one of Ted Petrou's Python classes. If you would like to complete the assignment on your own, visit [this repository][1]. There are about 40 steps and 100 tests that you must pass in order to rebuild the library. It is a good challenge and teaches you the fundamentals of how to build your own data analysis library.
## pandas_cub functionality
pandas_cub has limited functionality but is still capable of a wide variety of data analysis tasks.
* Subset selection with the brackets
* Arithmetic and comparison operators (+, -, <, !=, etc.)
* Aggregation of columns with most of the common functions (min, max, mean, median, etc.)
* Grouping via pivot tables
* String-only methods for columns containing strings
* Reading in simple comma-separated value files
* Several other methods
## pandas_cub DataFrame
pandas_cub has a single main object, the DataFrame, to hold all of the data. The DataFrame is capable of holding four data types: booleans, integers, floats, and strings. All data is stored in NumPy arrays. Unlike pandas DataFrames, pandas_cub DataFrames have no index. The column names must be strings.
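The rules above can be sketched in plain Python. This is only an illustration, not pandas_cub's actual code: `validate_columns` is a hypothetical helper, plain lists stand in for the NumPy arrays the library uses, and the equal-length check is an assumption about what a rectangular table requires.

```python
# Hypothetical helper, not part of pandas_cub; lists stand in for NumPy arrays.
ALLOWED_TYPES = (bool, int, float, str)

def validate_columns(data):
    """Check a dict of column name -> list of values against the rules above."""
    lengths = set()
    for name, values in data.items():
        if not isinstance(name, str):
            raise TypeError(f"column name {name!r} must be a string")
        for value in values:
            # None is tolerated because string columns use it as a missing marker
            if value is not None and not isinstance(value, ALLOWED_TYPES):
                raise TypeError(f"unsupported value {value!r} in column {name!r}")
        lengths.add(len(values))
    if len(lengths) > 1:
        # assumption: every column must have the same number of rows
        raise ValueError("all columns must have the same number of rows")
    return True

data = {"race": ["White", "Black"], "salary": [45279, 66614]}
print(validate_columns(data))  # True
```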
### Missing value representation
Boolean and integer columns will have no missing value representation. The NumPy NaN is used for float columns and the Python None is used for string columns.
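A short sketch of why only float and string columns can mark missing values. `is_missing` is a hypothetical helper, and `math.nan` stands in for NumPy's NaN (both are ordinary floating-point NaN values), so the example runs without NumPy.

```python
import math

def is_missing(value):
    """True for the two missing markers described above: float NaN and None."""
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)

float_column = [1.5, math.nan, 3.0]
string_column = ["White", None, "Asian"]

print([is_missing(v) for v in float_column])   # [False, True, False]
print([is_missing(v) for v in string_column])  # [False, True, False]

# NaN is itself a float, so an integer or boolean column has no way to hold it.
print(type(math.nan).__name__)  # float
```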
## Code Examples
pandas_cub syntax is very similar to pandas, but it implements far fewer methods. The examples below cover nearly all of the API.
[1]: https://github.com/tdpetrou/pandas_cub
### Reading data with `read_csv`
pandas_cub has a single function, `read_csv`, which takes a single parameter: the location of the file you would like to read in as a DataFrame. This function can only handle simple CSV files, and the delimiter must be a comma. A sample employee dataset is provided in the data directory. Notice that the visual output of the DataFrame is nearly identical to that of a pandas DataFrame. The `head` method returns the first 5 rows by default.
```python
import pandas_cub as pdc
```
```python
df = pdc.read_csv('data/employee.csv')
df.head()
```
<table><thead><tr><th></th><th>dept </th><th>race </th><th>gender </th><th>salary </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Houston Police Department-HPD</td><td>White </td><td>Male </td><td> 45279</td></tr><tr><td><strong>1</strong></td><td>Houston Fire Department (HFD)</td><td>White </td><td>Male </td><td> 63166</td></tr><tr><td><strong>2</strong></td><td>Houston Police Department-HPD</td><td>Black </td><td>Male </td><td> 66614</td></tr><tr><td><strong>3</strong></td><td>Public Works & Engineering-PWE</td><td>Asian </td><td>Male </td><td> 71680</td></tr><tr><td><strong>4</strong></td><td>Houston Airport System (HAS)</td><td>White </td><td>Male </td><td> 42390</td></tr></tbody></table>
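To give a feel for what "simple CSV" means here, the sketch below parses comma-delimited text with nothing more than `str.split`. It is not how `read_csv` is implemented: `read_simple_csv` is a hypothetical function, it returns a plain dict of string lists rather than a DataFrame, and it makes no attempt at quoting, escaping, or type inference.

```python
import io

def read_simple_csv(file_obj):
    """Parse comma-delimited text into a dict of column name -> list of strings."""
    header = file_obj.readline().rstrip("\n").split(",")
    columns = {name: [] for name in header}
    for line in file_obj:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        for name, value in zip(header, line.split(",")):
            columns[name].append(value)
    return columns

# Two rows taken from the employee table above, held in memory instead of a file
sample = io.StringIO(
    "dept,race,gender,salary\n"
    "Houston Police Department-HPD,White,Male,45279\n"
    "Houston Fire Department (HFD),White,Male,63166\n"
)
table = read_simple_csv(sample)
print(table["salary"])  # ['45279', '63166']
```

Every value comes back as a string in this sketch; a field that itself contains a comma would be split incorrectly, which is the kind of file a simple reader cannot handle.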
### DataFrame properties
The `shape` property returns a tuple of the number of rows and columns.
```python
df.shape
```
(1535, 4)
The `len` function returns just the number of rows.
```python
len(df)
```
1535
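The relationship between `shape` and `len` can be sketched with a toy class. `TinyFrame` is a hypothetical stand-in that stores columns as plain lists; it only illustrates how a property and `__len__` can agree with each other, not how pandas_cub implements them.

```python
class TinyFrame:
    """Hypothetical stand-in for a DataFrame: a dict of equal-length columns."""

    def __init__(self, data):
        self._data = dict(data)

    def __len__(self):
        # The number of rows is the length of any one column
        return len(next(iter(self._data.values())))

    @property
    def shape(self):
        # (number of rows, number of columns)
        return len(self), len(self._data)

frame = TinyFrame({"race": ["White", "Black", "Asian"],
                   "salary": [45279, 66614, 71680]})
print(frame.shape)  # (3, 2)
print(len(frame))   # 3
```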
The `dtypes` property returns a DataFrame of the column names and their respective data type.
```python
df.dtypes
```
<table><thead><tr><th></th><th>Column Name</th><th>Data Type </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>dept </td><td>string </td></tr><tr><td><strong>1</strong></td><td>race </td><td>string </td></tr><tr><td><strong>2</strong></td><td>gender </td><td>string </td></tr><tr><td><strong>3</strong></td><td>salary </td><td>int </td></tr></tbody></table>
The `columns` property returns a list of the columns.
```python
df.columns
```
['dept', 'race', 'gender', 'salary']
Set new columns by assigning the `columns` property to a list.
```python
df.columns = ['department', 'race', 'gender', 'salary']
df.head()
```
<table><thead><tr><th></th><th>department</th><th>race </th><th>gender </th><th>salary </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Houston Police Department-HPD</td><td>White </td><td>Male </td><td> 45279</td></tr><tr><td><strong>1</strong></td><td>Houston Fire Department (HFD)</td><td>White </td><td>Male </td><td> 63166</td></tr><tr><td><strong>2</strong></td><td>Houston Police Department-HPD</td><td>Black </td><td>Male </td><td> 66614</td></tr><tr><td><strong>3</strong></td><td>Public Works & Engineering-PWE</td><td>Asian </td><td>Male </td><td> 71680</td></tr><tr><td><strong>4</strong></td><td>Houston Airport System (HAS)</td><td>White </td><td>Male </td><td> 42390</td></tr></tbody></table>
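Renaming through assignment is typically done with a property setter. The sketch below shows the pattern on a hypothetical `TinyFrame` class; the specific checks (one name per column, string names, no duplicates) are assumptions for illustration and may differ from what pandas_cub enforces.

```python
class TinyFrame:
    """Hypothetical stand-in for a DataFrame: a dict of column name -> list."""

    def __init__(self, data):
        self._data = dict(data)

    @property
    def columns(self):
        return list(self._data)

    @columns.setter
    def columns(self, new_names):
        if len(new_names) != len(self._data):
            raise ValueError("need exactly one new name per existing column")
        if not all(isinstance(name, str) for name in new_names):
            raise TypeError("column names must be strings")
        if len(set(new_names)) != len(new_names):
            raise ValueError("column names must be unique")
        # Rebuild the mapping in the same order under the new names
        self._data = dict(zip(new_names, self._data.values()))

frame = TinyFrame({"dept": ["Houston Police Department-HPD"], "salary": [45279]})
frame.columns = ["department", "salary"]
print(frame.columns)  # ['department', 'salary']
```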
The `values` property returns a single NumPy array of all the data.
```python
df.values
```
    array([['Houston Police Department-HPD', 'White', 'Male', 45279],
           ['Houston Fire Department (HFD)', 'White', 'Male', 63166],
           ['Houston Police Department-HPD', 'Black', 'Male', 66614],
           ...,
           ['Houston Police Department-HPD', 'White', 'Male', 43443],
           ['Houston Police Department-HPD', 'Asian', 'Male', 55461],
           ['Houston Fire Department (HFD)', 'Hispanic', 'Male', 51194]],
          dtype=object)
### Subset selection
Subset selection is handled with the brackets. To select a single column, place that column name in the brackets.
```python
df['race'].head()
```
<table><thead><tr><th></th><th>race </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>White </td></tr><tr><td><strong>1</strong></td><td>White </td></tr><tr><td><strong>2</strong></td><td>Black </td></tr><tr><td><strong>3</strong></td><td>Asian </td></tr><tr><td><strong>4</strong></td><td>White </td></tr></tbody></table>
Select multiple columns with a list of strings.
```python
df[['race', 'salary']].head()
```
<table><thead><tr><th></th><th>race </th><th>salary </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>White </td><td> 45279</td></tr><tr><td><strong>1</strong></td><td>White </td><td> 63166</td></tr><tr><td><strong>2</strong></td><td>Black </td><td> 66614</td></tr><tr><td><strong>3</strong></td><td>Asian </td><td> 71680</td></tr><tr><td><strong>4</strong></td><td>White </td><td> 42390</td></tr></tbody></table>
To select rows and columns simultaneously, pass the row selection and the column selection to the brackets, separated by a comma. Here we use integers for rows and strings for columns.
```python
rows = [10, 50, 100]
cols = ['salary', 'race']
df[rows, cols]
```
<table><thead><tr><th></th><th>salary </th><th>race </th></tr></thead><tbody><tr><td><strong>0</strong></td><td> 77076</td><td>Black </td></tr><tr><td><strong>1</strong></td><td> 81239</td><td>White </td></tr><tr><td><strong>2</strong></td><td> 81239</td><td>White </td></tr></tbody></table>
You can use integers for the columns as well.
```python
rows = [10, 50, 100]
cols = [2, 0]
df[rows, cols]
```
<table><thead><tr><th></th><th>gender </th><th>department</th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Male </td><td>Houston Police Department-HPD</td></tr><tr><td><strong>1</strong></td><td>Male </td><td>Houston Police Department-HPD</td></tr><tr><td><strong>2</strong></td><td>Male </td><td>Houston Police Department-HPD</td></tr></tbody></table>
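The comma syntax works because Python packs `df[rows, cols]` into a single tuple before calling `__getitem__`. The sketch below shows one way such a key could be unpacked. `TinyFrame` and its `data` attribute are hypothetical, plain lists stand in for NumPy arrays, and slices and boolean selection are deliberately left out.

```python
class TinyFrame:
    """Hypothetical stand-in for a DataFrame: a dict of column name -> list."""

    def __init__(self, data):
        self.data = dict(data)

    def __getitem__(self, key):
        # frame[rows, cols] arrives as one tuple: (rows, cols)
        if isinstance(key, tuple):
            rows, cols = key
        else:
            rows, cols = None, key  # frame['race'] or frame[['race', 'salary']]

        # Wrap scalars so the rest of the method only deals with lists
        if isinstance(rows, int):
            rows = [rows]
        if isinstance(cols, (int, str)):
            cols = [cols]

        # Translate integer column positions into column names
        names = list(self.data)
        cols = [names[c] if isinstance(c, int) else c for c in cols]

        result = {}
        for name in cols:
            column = self.data[name]
            result[name] = list(column) if rows is None else [column[r] for r in rows]
        return TinyFrame(result)

frame = TinyFrame({
    "race": ["White", "White", "Black", "Asian"],
    "gender": ["Male", "Male", "Male", "Male"],
    "salary": [45279, 63166, 66614, 71680],
})
print(frame[[0, 2], ["salary", "race"]].data)
# {'salary': [45279, 66614], 'race': ['White', 'Black']}
print(frame[3, 2].data)
# {'salary': [71680]}
```

Returning a new `TinyFrame` for every selection mirrors the behavior shown in this guide, where even a single-cell selection is displayed as a table.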
You can also pass a single integer instead of a list.
```python
df[99, 3]
```
<table><thead><tr><th></th><th>salary </th></tr></thead><tbody><tr><