How to Use Pandas with Large Data

Why and How to Use Pandas with Large Data. This piece explains how to reduce memory consumption and is a useful reference for learning to process large data sets with pandas.
When you're dealing with large data sets, Pandas can run into slow performance and long runtimes that ultimately stem from insufficient memory. Indeed, Pandas has its own limitations when it comes to big data, due to its algorithms and local memory constraints. Big data is therefore typically stored in computing clusters for higher scalability and fault tolerance, and it is often accessed through the big data ecosystem (AWS EC2, Hadoop, etc.) using Spark and many other tools. Ultimately, one way to use Pandas with large data on a local machine (with its memory constraints) is to reduce the memory usage of the data.

So the question is: how do you reduce the memory usage of data using Pandas? The following explanation is based on my experience with an anonymized large data set (40-50 GB) that required me to reduce its memory usage to fit into local memory for analysis (even before reading the data set into a dataframe).

1. Read CSV file data in chunks

To be honest, I was baffled when I encountered an error and couldn't read the data from the CSV file, only to realize that the memory of my local machine, with 16 GB of RAM, was too small for the data. Here comes the good news and the beauty of Pandas: I realized that pandas.read_csv has a parameter called chunksize. The parameter essentially means the number of rows to be read into a dataframe at any single time in order to fit into local memory. Since the data consists of more than 70 million rows, I specified the chunksize as 1 million rows each time, which broke the large data set into many smaller pieces.

import pandas as pd

# Read the large CSV file with a specified chunksize
df_chunk = pd.read_csv(r'../input/data.csv', chunksize=1000000)

Read CSV file data in chunksize

The operation above results in a TextFileReader object for iteration. Strictly speaking, df_chunk is not a dataframe but an object for further operation in the next step. Once I had the object ready, the basic workflow was to perform an operation on each chunk and concatenate them to form a dataframe at the end (as shown below). By iterating over each chunk, I performed data filtering/preprocessing with a function chunk_preprocessing before appending each chunk to a list. Finally, I concatenated the list into a final dataframe that fits into local memory.

chunk_list = []  # append each chunk df here

# Each chunk is in df format
for chunk in df_chunk:
    # Perform data filtering
    chunk_filter = chunk_preprocessing(chunk)

    # Once the data filtering is done, append the chunk to the list
    chunk_list.append(chunk_filter)

# Concatenate the list into a final dataframe
df_concat = pd.concat(chunk_list)

Workflow to perform an operation on each chunk
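The post mentions a chunk_preprocessing function but never shows its body. Below is a minimal sketch of what such a per-chunk filter could look like; the column names ('col_1', 'col_2') and the filtering condition are placeholders for illustration, not details from the original data set.

def chunk_preprocessing(chunk):
    # Hypothetical example: keep only the columns needed for analysis
    # and drop rows that fail a simple condition
    chunk = chunk[['col_1', 'col_2']]
    chunk = chunk[chunk['col_1'] > 0]
    return chunk

Because each chunk is reduced before it is appended, the peak memory footprint stays close to the size of one raw chunk plus the already-filtered results.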
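To check whether the concatenated dataframe actually fits comfortably in local memory, pandas can report its memory footprint. This check is not part of the original workflow; df_concat refers to the final dataframe from the snippet above.

# Print a summary including memory usage, counting object (string) columns accurately
df_concat.info(memory_usage='deep')

# Or compute the total footprint in megabytes
total_mb = df_concat.memory_usage(deep=True).sum() / 1024 ** 2
print(f'{total_mb:.1f} MB')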
2. Filter out unimportant columns to save memory

Great. At this stage, I already had a dataframe to do all sorts of analysis required. To save more time for data manipulation and computation, I further filtered out some unimportant columns to save more memory.

# Filter out unimportant columns
df = df[['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_7']]

Filter out unimportant columns

3. Change dtypes for columns

The simplest way to convert a pandas column of data to a different type is to use astype(). Changing data types in Pandas is extremely helpful for saving memory, especially if you have large data for intense analysis or computation (for example, feeding data into your machine learning model for training). By reducing the bits required to store the data, I reduced the overall memory usage of the data by up to 50%! Give it a try; I believe you'll find it useful as well. Let me know how it goes.

# Change the dtypes (int64 -> int32)
df[['col_1', 'col_2', 'col_3', 'col_4', 'col_5']] = df[['col_1', 'col_2', 'col_3', 'col_4', 'col_5']].astype('int32')

# Change the dtypes (float64 -> float32)
df[['col_6', 'col_7']] = df[['col_6', 'col_7']].astype('float32')

Change data types to save memory

Final Thoughts

There you have it. Thank you for reading. I hope that sharing my experience of using Pandas with large data helps you explore another useful feature in Pandas for dealing with large data sets: reducing memory usage and, ultimately, improving computational efficiency. Typically, Pandas has most of the features we need for data wrangling and analysis. I strongly encourage you to check them out, as they will come in handy next time.

Also, if you're serious about learning how to do data analysis in Python, then this book is for you: Python for Data Analysis. With complete instructions for manipulating, processing, cleaning, and crunching datasets in Python using Pandas, the book gives a comprehensive, step-by-step guide to using Pandas effectively in your analysis.

Hope this helps. As always, if you have any questions or comments, feel free to leave your feedback below, or you can always reach me on LinkedIn. Till then, see you in the next post!

About the author

Admond Lee is a Big Data Engineer at work and a Data Scientist in action. He has been helping start-up founders and various companies tackle their problems using data, with deep data science and industry expertise. You can connect with him on LinkedIn, Medium, Twitter and Facebook.
