# Pandas Read XML
A tool to help read XML files as pandas dataframes.
See example in [Google Colab here](https://colab.research.google.com/github/minchulkim87/pandas_read_xml/blob/master/pandas_read_xml_example.ipynb)
Isn't it annoying working with data in XML format? I think so. Take a look at this simple example.
```xml
<first-tag>
    <not-interested>
        blah blah
    </not-interested>
    <second-tag>
        <the-tag-you-want-as-root>
            <row>
                <columnA>
                    The data that you want
                </columnA>
                <columnB>
                    More data that you want
                </columnB>
            </row>
            <row>
                <columnA>
                    Yet more data that you want
                </columnA>
                <columnB>
                    Eh, get this data too
                </columnB>
            </row>
        </the-tag-you-want-as-root>
    </second-tag>
    <another-irrelevant-tag>
        some other info that you do not want
    </another-irrelevant-tag>
</first-tag>
```
I wish there were a simple `df = pd.read_xml('some_file.xml')`, like the `pd.read_csv()` and `pd.read_json()` that we all love.
I can't solve this with my time and skills, but perhaps this package will help get you started.
## Install
```bash
pip install pandas_read_xml
```
## Import package
```python
import pandas_read_xml as pdx
```
## Read XML as pandas dataframe
You will need to identify the path to the "root" tag in the XML from which you want to extract the data.
```python
df = pdx.read_xml("test.xml", ['first-tag', 'second-tag', 'the-tag-you-want-as-root'])
```
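To make the idea of a "path to the root tag" concrete, here is a minimal sketch that uses only the standard library's `xml.etree.ElementTree`. It is not how `pandas_read_xml` is implemented; the helper name `rows_under` is hypothetical and only illustrates walking down a list of tags and collecting the children of the last one as rows.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """
<first-tag>
  <not-interested>blah blah</not-interested>
  <second-tag>
    <the-tag-you-want-as-root>
      <row><columnA>The data that you want</columnA><columnB>More data that you want</columnB></row>
      <row><columnA>Yet more data that you want</columnA><columnB>Eh, get this data too</columnB></row>
    </the-tag-you-want-as-root>
  </second-tag>
</first-tag>
""".strip()


def rows_under(xml_text, root_key_list):
    """Hypothetical helper: follow root_key_list, then return each child as a dict."""
    node = ET.fromstring(xml_text)
    # The first key names the document element itself.
    if node.tag != root_key_list[0]:
        raise KeyError(root_key_list[0])
    # Each remaining key steps one level deeper into the tree.
    for key in root_key_list[1:]:
        node = node.find(key)
        if node is None:
            raise KeyError(key)
    # Every child of the final tag is one row; its children are the columns.
    return [{cell.tag: (cell.text or "").strip() for cell in row} for row in node]


rows = rows_under(XML_TEXT, ["first-tag", "second-tag", "the-tag-you-want-as-root"])
```

Each dictionary in `rows` corresponds to one `<row>` element, which is the shape a dataframe constructor expects.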
Sometimes the XML is structured so that pandas treats rows and columns the opposite way from what we expect. In these cases `read_xml` may fail; try passing `transpose=True` as an argument.
### Real example
Here is a real example taken from USPTO. It is one of their "daily diff" files for the US trademark applications data.
```python
test_zip_path = 'https://bulkdata.uspto.gov/data/trademark/dailyxml/applications/apc200219.zip'
root_key_list = ['trademark-applications-daily', 'application-information', 'file-segments', 'action-keys']
df = pdx.read_xml(test_zip_path, root_key_list)
```
# Auto Flatten
The really cumbersome part of working with XML data (or JSON data) is that it does not represent a single table. Rather, it is a (nested) tree representation of what were probably relational databases. Often, this XML data is exported without a clearly documented schema and, more often still, with no clear way of navigating the data.
What is even more annoying is that, in comparison to JSON, the data structures are not consistent across XML files from the same schema. Some files may have multiples of the same tag, resulting in list-type data, while other files of the *same* schema will have only one of that tag, resulting in non-list-type data. At other times, the tags are not present at all, which means that the resulting "column" is not just null but missing entirely. This makes it difficult to "flatten".
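The three cases described above (repeated tag, single tag, absent tag) can be sketched with plain Python dictionaries. The helper `as_list` and the sample records are hypothetical and are not part of this package; they only show one common way to coerce all three cases into a list before flattening.

```python
def as_list(value):
    """Hypothetical helper: make a parsed XML value always iterable as a list."""
    if value is None:
        return []        # the tag was absent
    if isinstance(value, list):
        return value     # the tag was repeated
    return [value]       # the tag appeared exactly once


# Three made-up records that follow the same schema but parse to different shapes.
file_a = {"case-file": {"classification": [{"code": "009"}, {"code": "042"}]}}
file_b = {"case-file": {"classification": {"code": "025"}}}
file_c = {"case-file": {}}

codes = [
    [item["code"] for item in as_list(f["case-file"].get("classification"))]
    for f in (file_a, file_b, file_c)
]
```

After this step every record yields a list, so downstream code no longer needs to branch on the type.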
Pandas already has some tools to help "explode" (items in a list become separate rows) and "normalise" (key-value pairs in one column become separate columns of data), but they fail when there are mixed types within the same tag (column). Besides, "flattening" (combining exploding and normalising) duplicates the other data in the dataframe as well, so memory requirements can balloon.
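As a rough illustration of what "explode" and "normalise" mean, and of why flattening duplicates data, here is a pure-Python sketch on a list of dictionaries. These functions are simplified stand-ins written for this example, not the pandas implementations and not this package's code; the `|` separator in the new column names is chosen to match names such as `case-file|serial-number` used elsewhere in this README.

```python
def explode(rows, column):
    """One output row per item in row[column]; all other columns are copied."""
    out = []
    for row in rows:
        value = row[column]
        items = value if isinstance(value, list) else [value]
        for item in items:
            out.append({**row, column: item})
    return out


def normalise(rows, column):
    """Replace a dict-valued column with one 'column|key' column per key."""
    out = []
    for row in rows:
        flat = {k: v for k, v in row.items() if k != column}
        value = row[column]
        if isinstance(value, dict):
            for key, val in value.items():
                flat[f"{column}|{key}"] = val
        else:
            flat[column] = value
        out.append(flat)
    return out


# Made-up data: the first row holds a list, the second a single dict.
rows = [
    {"serial": "1", "owner": "Acme", "class": [{"code": "009"}, {"code": "042"}]},
    {"serial": "2", "owner": "Bolt", "class": {"code": "025"}},
]
flat = normalise(explode(rows, "class"), "class")
```

Note that `owner` is repeated on every exploded row for serial `1`. With wide tables and deeply nested lists, that repetition is what drives the memory growth.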
So, in this tool, I have also attempted to make a few different tools to separate the relational tables.
A quick example from the same dataframe from USPTO above:
```python
from pandas_read_xml import auto_separate_tables
key_columns = ['action-key', 'case-file|serial-number']
data = df.pipe(auto_separate_tables, key_columns)
```
will separate out what the `auto_separate_tables` function guesses to be separate tables. The resulting `data` is a dictionary where the keys are the "table names" and the corresponding values are the pandas dataframes. Each of the separate tables will have the `key_columns` as common columns.
You can see the list of separated tables by using python dictionary methods.
```python
data.keys()
```
And then view a table.
```python
data['classifications']
```
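The idea of a dictionary of tables that all share the key columns can be sketched in plain Python. This is not how `auto_separate_tables` works internally; `separate_tables`, the `"main"` table name, and the sample records are assumptions made for illustration, and the sketch assumes every list item is a dictionary.

```python
def separate_tables(records, key_columns):
    """Hypothetical sketch: split records into 'main' plus one table per list field."""
    tables = {"main": []}
    for record in records:
        # The key columns are copied into every table so rows can be rejoined later.
        keys = {k: record[k] for k in key_columns}
        main_row = dict(keys)
        for field, value in record.items():
            if field in key_columns:
                continue
            if isinstance(value, list):
                # Each list item becomes a row in a child table named after the field.
                for item in value:
                    tables.setdefault(field, []).append({**keys, **item})
            else:
                main_row[field] = value
        tables["main"].append(main_row)
    return tables


records = [
    {"serial-number": "1", "owner": "Acme",
     "classifications": [{"code": "009"}, {"code": "042"}]},
    {"serial-number": "2", "owner": "Bolt",
     "classifications": [{"code": "025"}]},
]
tables = separate_tables(records, ["serial-number"])
```

Because the child table carries only the key columns plus its own fields, `owner` is stored once per record instead of once per classification.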
There are also other "smaller" functions that each do part of the job:
- `flatten(df)`
- `auto_flatten(df, key_columns)`
- `fully_flatten(df, key_columns)`

There are even more if you look through the code.