# How to use pandas_cub
The README.ipynb notebook will serve as the documentation and usage guide to pandas_cub.
## Installation
`pip install pandas-cub`
## What is pandas_cub?
pandas_cub is a simple data analysis library that emulates the functionality of the pandas library. The library is not meant for serious work. It was built as an assignment for one of Ted Petrou's Python classes. If you would like to complete the assignment on your own, visit [this repository][1]. There are about 40 steps and 100 tests that you must pass in order to rebuild the library. It is a good challenge and teaches you the fundamentals of how to build your own data analysis library.
## pandas_cub functionality
pandas_cub has limited functionality but is still capable of a wide variety of data analysis tasks.
* Subset selection with the brackets
* Arithmetic and comparison operators (+, -, <, !=, etc.)
* Aggregation of columns with most of the common functions (min, max, mean, median, etc.)
* Grouping via pivot tables
* String-only methods for columns containing strings
* Reading in simple comma-separated value files
* Several other methods
## pandas_cub DataFrame
pandas_cub has a single main object, the DataFrame, to hold all of the data. The DataFrame can hold four data types: booleans, integers, floats, and strings. All data is stored in NumPy arrays. Unlike pandas DataFrames, pandas_cub DataFrames have no index. Column names must be strings.
### Missing value representation
Boolean and integer columns have no missing value representation. NumPy's NaN is used for float columns and Python's None is used for string columns.
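The sketch below illustrates why these conventions follow from NumPy's own behavior. It uses only NumPy and does not call pandas_cub; the variable names are illustrative, and holding strings in an object array is an assumption made for this example.

```python
import numpy as np

# Float arrays can mark a missing value with NaN.
floats = np.array([1.5, np.nan, 3.0])
float_missing = np.isnan(floats)  # boolean mask of the missing entries

# Assumption for this sketch: strings live in an object array,
# so None can stand in for a missing value.
strings = np.array(['a', None, 'c'], dtype=object)
string_missing = np.array([s is None for s in strings])

# Integer arrays have no slot for NaN: assigning it raises an error.
ints = np.array([1, 2, 3])
try:
    ints[0] = np.nan
    int_accepts_nan = True
except ValueError:
    int_accepts_nan = False

print(float_missing, string_missing, int_accepts_nan)
```

Note that NaN never compares equal to itself, which is why the float mask is built with `np.isnan` rather than `==`.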
## Code Examples
pandas_cub syntax is very similar to pandas, but it implements far fewer methods. The examples below cover just about all of the API.
[1]: https://github.com/tdpetrou/pandas_cub
### Reading data with `read_csv`
pandas_cub provides a single top-level function, `read_csv`, which takes a single parameter: the location of the file you would like to read in as a DataFrame. This function can only handle simple CSVs, and the delimiter must be a comma. A sample employee dataset is provided in the data directory. Notice that the visual output of the DataFrame is nearly identical to that of a pandas DataFrame. The `head` method returns the first 5 rows by default.
```python
import pandas_cub as pdc
```
```python
df = pdc.read_csv('data/employee.csv')
df.head()
```
<table><thead><tr><th></th><th>dept </th><th>race </th><th>gender </th><th>salary </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Houston Police Department-HPD</td><td>White </td><td>Male </td><td> 45279</td></tr><tr><td><strong>1</strong></td><td>Houston Fire Department (HFD)</td><td>White </td><td>Male </td><td> 63166</td></tr><tr><td><strong>2</strong></td><td>Houston Police Department-HPD</td><td>Black </td><td>Male </td><td> 66614</td></tr><tr><td><strong>3</strong></td><td>Public Works & Engineering-PWE</td><td>Asian </td><td>Male </td><td> 71680</td></tr><tr><td><strong>4</strong></td><td>Houston Airport System (HAS)</td><td>White </td><td>Male </td><td> 42390</td></tr></tbody></table>
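To make "simple CSV" concrete, here is a minimal sketch of a comma-only reader built on the standard library. It is not pandas_cub's implementation: `read_simple_csv` and `infer_column` are hypothetical names, the function takes text rather than a file path, and the int-then-float-then-string inference is an assumption for illustration.

```python
import csv
import io


def infer_column(values):
    """Convert a list of strings to ints, else floats, else leave as strings."""
    for caster in (int, float):
        try:
            return [caster(v) for v in values]
        except ValueError:
            continue
    return values


def read_simple_csv(text):
    """Parse comma-delimited text with a header row into {column: values}."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    columns = list(zip(*body))  # transpose rows into columns
    return {name: infer_column(list(col)) for name, col in zip(header, columns)}


sample = (
    "dept,salary\n"
    "Houston Police Department-HPD,45279\n"
    "Houston Fire Department (HFD),63166\n"
)
data = read_simple_csv(sample)
print(data['salary'])
```

Storing the result column by column, rather than row by row, matches the idea of one array per column described earlier.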
### DataFrame properties
The `shape` property returns a tuple of the number of rows and columns.
```python
df.shape
```
(1535, 4)
The `len` function returns just the number of rows.
```python
len(df)
```
1535
The `dtypes` property returns a DataFrame of the column names and their respective data type.
```python
df.dtypes
```
<table><thead><tr><th></th><th>Column Name</th><th>Data Type </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>dept </td><td>string </td></tr><tr><td><strong>1</strong></td><td>race </td><td>string </td></tr><tr><td><strong>2</strong></td><td>gender </td><td>string </td></tr><tr><td><strong>3</strong></td><td>salary </td><td>int </td></tr></tbody></table>
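One plausible way to produce readable names like `string` and `int` is to map the one-letter `kind` code of each NumPy dtype to a label. The sketch below shows that idea with plain NumPy; the `KIND_TO_NAME` mapping and the use of object arrays for strings are assumptions, not confirmed details of pandas_cub.

```python
import numpy as np

# Hypothetical mapping from NumPy dtype "kind" codes to display names.
KIND_TO_NAME = {'b': 'bool', 'i': 'int', 'f': 'float', 'O': 'string'}

columns = {
    'dept': np.array(['Houston Police Department-HPD'], dtype=object),
    'salary': np.array([45279]),
}

# Pair each column name with the readable name of its dtype.
dtype_table = [(name, KIND_TO_NAME[arr.dtype.kind]) for name, arr in columns.items()]
print(dtype_table)
```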
The `columns` property returns a list of the column names.
```python
df.columns
```
['dept', 'race', 'gender', 'salary']
Rename the columns by assigning a list of new names to the `columns` property.
```python
df.columns = ['department', 'race', 'gender', 'salary']
df.head()
```
<table><thead><tr><th></th><th>department</th><th>race </th><th>gender </th><th>salary </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Houston Police Department-HPD</td><td>White </td><td>Male </td><td> 45279</td></tr><tr><td><strong>1</strong></td><td>Houston Fire Department (HFD)</td><td>White </td><td>Male </td><td> 63166</td></tr><tr><td><strong>2</strong></td><td>Houston Police Department-HPD</td><td>Black </td><td>Male </td><td> 66614</td></tr><tr><td><strong>3</strong></td><td>Public Works & Engineering-PWE</td><td>Asian </td><td>Male </td><td> 71680</td></tr><tr><td><strong>4</strong></td><td>Houston Airport System (HAS)</td><td>White </td><td>Male </td><td> 42390</td></tr>
</tbody></table>
The `values` property returns a single NumPy array of all the data.
```python
df.values
```
array([['Houston Police Department-HPD', 'White', 'Male', 45279],
['Houston Fire Department (HFD)', 'White', 'Male', 63166],
['Houston Police Department-HPD', 'Black', 'Male', 66614],
...,
['Houston Police Department-HPD', 'White', 'Male', 43443],
['Houston Police Department-HPD', 'Asian', 'Male', 55461],
['Houston Fire Department (HFD)', 'Hispanic', 'Male', 51194]],
dtype=object)
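For readers curious how properties like these can be built, the sketch below implements `shape`, `len`, and a settable `columns` on a tiny stand-in class. `MiniFrame` is hypothetical, it stores plain lists instead of NumPy arrays to stay dependency-free, and the validation rules in the setter are assumptions rather than pandas_cub's documented behavior.

```python
class MiniFrame:
    """Hypothetical stand-in for a DataFrame: a dict of equal-length lists."""

    def __init__(self, data):
        self._data = dict(data)

    def __len__(self):
        # Number of rows, taken from the first column (0 if there are none).
        return len(next(iter(self._data.values()), []))

    @property
    def shape(self):
        return len(self), len(self._data)

    @property
    def columns(self):
        return list(self._data)

    @columns.setter
    def columns(self, new_names):
        if len(new_names) != len(self._data):
            raise ValueError('new column list has the wrong length')
        if not all(isinstance(name, str) for name in new_names):
            raise TypeError('column names must be strings')
        if len(set(new_names)) != len(new_names):
            raise ValueError('column names must be unique')
        # Rebuild the mapping, keeping column order and data unchanged.
        self._data = dict(zip(new_names, self._data.values()))


mf = MiniFrame({'dept': ['HPD', 'HFD'], 'salary': [45279, 63166]})
mf.columns = ['department', 'salary']
print(mf.shape, len(mf), mf.columns)
```

Defining `__len__` is what lets the built-in `len` function work on the object, while `@property` lets `shape` and `columns` be read without parentheses.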
### Subset selection
Subset selection is handled with the brackets. To select a single column, place that column name in the brackets.
```python
df['race'].head()
```
<table><thead><tr><th></th><th>race </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>White </td></tr><tr><td><strong>1</strong></td><td>White </td></tr><tr><td><strong>2</strong></td><td>Black </td></tr><tr><td><strong>3</strong></td><td>Asian </td></tr><tr><td><strong>4</strong></td><td>White </td></tr>
</tbody></table>
Select multiple columns with a list of strings.
```python
df[['race', 'salary']].head()
```
<table><thead><tr><th></th><th>race </th><th>salary </th></tr></thead><tbody><tr><td><strong>0</strong></td><td>White </td><td> 45279</td></tr><tr><td><strong>1</strong></td><td>White </td><td> 63166</td></tr><tr><td><strong>2</strong></td><td>Black </td><td> 66614</td></tr><tr><td><strong>3</strong></td><td>Asian </td><td> 71680</td></tr><tr><td><strong>4</strong></td><td>White </td><td> 42390</td></tr>
</tbody></table>
Simultaneously select rows and columns by passing the row selection and the column selection to the brackets, separated by a comma. Here we use integers for rows and strings for columns.
```python
rows = [10, 50, 100]
cols = ['salary', 'race']
df[rows, cols]
```
<table><thead><tr><th></th><th>salary </th><th>race </th></tr></thead><tbody><tr><td><strong>0</strong></td><td> 77076</td><td>Black </td></tr><tr><td><strong>1</strong></td><td> 81239</td><td>White </td></tr><tr><td><strong>2</strong></td><td> 81239</td><td>White </td></tr>
</tbody></table>
You can use integers for the columns as well.
```python
rows = [10, 50, 100]
cols = [2, 0]
df[rows, cols]
```
<table><thead><tr><th></th><th>gender </th><th>department</th></tr></thead><tbody><tr><td><strong>0</strong></td><td>Male </td><td>Houston Police Department-HPD</td></tr><tr><td><strong>1</strong></td><td>Male </td><td>Houston Police Department-HPD</td></tr><tr><td><strong>2</strong></td><td>Male </td><td>Houston Police Department-HPD</td></tr></tbody></table>
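All of these bracket forms reach the same special method: Python passes whatever is inside the brackets to `__getitem__`, and `df[rows, cols]` arrives as the tuple `(rows, cols)`. The sketch below shows one way such dispatch could look. `MiniFrame` and `_column_names` are hypothetical, and the class stores plain lists rather than NumPy arrays, so treat it as an illustration rather than pandas_cub's code.

```python
class MiniFrame:
    """Hypothetical dict-of-lists frame used only to illustrate bracket dispatch."""

    def __init__(self, data):
        self._data = dict(data)

    def _column_names(self, selection):
        # Accept one name/position or a list of them; always return names.
        names = list(self._data)
        if not isinstance(selection, list):
            selection = [selection]
        return [names[c] if isinstance(c, int) else c for c in selection]

    def __getitem__(self, key):
        if isinstance(key, tuple):
            # df[rows, cols] arrives as the tuple (rows, cols).
            rows, cols = key
            if isinstance(rows, int):
                rows = [rows]
        else:
            # df['a'] or df[['a', 'b']] selects columns and keeps every row.
            rows, cols = None, key
        selected = {}
        for name in self._column_names(cols):
            column = self._data[name]
            selected[name] = list(column) if rows is None else [column[r] for r in rows]
        return MiniFrame(selected)


mf = MiniFrame({'race': ['White', 'Black', 'Asian'], 'salary': [45279, 66614, 71680]})
subset = mf[[0, 2], ['salary', 'race']]
print(subset._data)
```

Every branch returns a new frame, which mirrors the examples here where even a one-cell selection is displayed as a table.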
You can also use a single integer in place of a list.
```python
df[99, 3]
```
<table><thead><tr><th></th><th>salary </th></tr></thead><tbody><tr><td><strong>0</strong></td><td> 66614</td></tr>
</tbody></table>
Or a single string for the columns.
```python
df[99, 'salary']
```