# Analyzing NetEase Cloud Music with Python
- MacOS Sierra 10.12.1
- Python 2.7
- selenium 3.4.3
- phantomjs
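# Preface
> Sometimes discovering yourself is more meaningful than digging into other people. Whose songs do you actually like? Do you really know? Your habits won't lie to you.
# Setting up the crawler environment
## 1. Install selenium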
```shell
pip install selenium
# With Anaconda, you can use: conda install selenium
# On a slow connection, download the archive from https://pypi.python.org/pypi/selenium, unpack it, and run: python setup.py install
```
## 2. Install PhantomJS
### 2.1 Mac
```
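Step 1, download: get the build for your platform from http://phantomjs.org/download.html
Step 2, unpack: double-click the archive or use unzip; either works
Step 3, enter the directory: cd ~/Downloads/phantomjs-2.1.1-macosx/bin # I downloaded to ~/Downloads; your path and version may differ, so adjust accordingly
Step 4, make it executable: chmod +x phantomjs
Step 5, configure the environment: I use zsh, so the file to edit is ~/.zshrc. Add the line export PATH="/Users/mrlevo/Downloads/phantomjs-2.1.1-macosx/bin/:$PATH", then run source ~/.zshrc to apply it (if you do not use zsh, edit ~/.bash_profile instead and add the same line)
Check that it worked: phantomjs -v # output such as 2.1.1 means it worked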
```
If you run into problems on the Mac, see [Installing PhantomJS](https://segmentfault.com/a/1190000009020535).
### 2.2 Windows
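Step 5 above only works if the directory you exported really ends up on `PATH`. As a rough sketch (not part of the original setup), the helper below scans `PATH` for an executable by name. `find_on_path` is a hypothetical name introduced here for illustration, and the code is written to run on both Python 2.7 and Python 3. On Windows you would presumably search for `phantomjs.exe` instead.

```python
import os


def find_on_path(name, path=None):
    """Return the full path of executable `name` found on PATH, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, name)
        # Accept only regular files that the current user may execute
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    location = find_on_path("phantomjs")
    if location is None:
        print("phantomjs is not on PATH; re-check the export line in your shell profile")
    else:
        print("phantomjs found at %s" % location)
```

If this reports that nothing was found even though `phantomjs -v` works in another terminal, the shell profile has probably not been re-sourced in the current session.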
```
Download PhantomJS from the official site, http://phantomjs.org/, and unpack it; the result looks like the screenshot below:
```
![](http://www.writebug.com/myres/static/uploads/2021/10/19/2297c82be14176230a4c6f3bb823c56e.writebug)
> Calling it may raise the error "**Unable to start phantomjs with ghostdriver**", as shown:
![](http://www.writebug.com/myres/static/uploads/2021/10/19/0b3de519b85194a51b9cb7a481aab510.writebug)
> In that case, set the path to PhantomJS explicitly. Alternatively, if the Scripts directory is already in your PATH environment variable, you can unpack PhantomJS into that folder. See [Selenium with GhostDriver in Python on Windows - stackoverflow](http://stackoverflow.com/questions/21768554/selenium-with-ghostdriver-in-python-on-windows). For the whole Windows installation process, see [Installing PIP + Phantomjs + Selenium on Windows](http://blog.csdn.net/eastmount/article/details/47785123). For Mac and Linux/Ubuntu, see [Solved: deploying phantomjs + python on Ubuntu (MacOS)](http://blog.csdn.net/mrlevo520/article/details/73196256).
## 3. Verify the installation
```
# Run the following after starting the Python interpreter
# On Windows
>>> from selenium import webdriver # pip install selenium
>>> driver_detail = webdriver.PhantomJS(executable_path=r"F:\Python\phantomjs-1.9.1-windows\phantomjs.exe")
>>> driver_detail.get('https://www.baidu.com')
>>> news = driver_detail.find_element_by_xpath("//div[@id='u1']/a")
>>> print news.text
新闻
>>> driver_detail.quit() # remember to quit, otherwise the process keeps using memory
------------------------------------------------------------------------
# On Mac
>>> from selenium import webdriver # pip install selenium
>>> driver_detail = webdriver.PhantomJS()
>>> driver_detail.get('https://www.baidu.com')
>>> news = driver_detail.find_element_by_xpath("//div[@id='u1']/a")
>>> print news.text
新闻
>>> driver_detail.quit() # remember to quit, otherwise the process keeps using memory
```
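# Crawling dynamic data
> First get your own user ID. You can find it after logging in to your NetEase Cloud Music account: it is the value after `id=`.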
![](http://www.writebug.com/myres/static/uploads/2021/10/19/3a6022357141aafbb88dfacfa4a50e83.writebug)
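Since the ID is simply the value after `id=`, it can be pulled out of a copied profile URL programmatically. The following is a minimal sketch, not the author's code: `extract_user_id` and `build_profile_url` are hypothetical helpers, and handling of the `#/user/home?id=...` form is an assumption about how the address may appear in a browser. It runs on both Python 2.7 and Python 3.

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs  # Python 2.7


def extract_user_id(profile_url):
    """Return the value of the `id` parameter in a profile URL, or None."""
    parts = urlparse(profile_url)
    # Look in the normal query string first, then in a "#/...?id=..." fragment
    # (the fragment form is an assumption, not something the post states).
    for candidate in (parts.query, parts.fragment.partition("?")[2]):
        values = parse_qs(candidate).get("id")
        if values:
            return values[0]
    return None


def build_profile_url(user_id):
    """Build a profile URL in the same form the crawler script uses."""
    return "http://music.163.com/user/home?id=%s" % user_id


if __name__ == "__main__":
    print(extract_user_id("http://music.163.com/user/home?id=67259702"))
```

Parsing the query string this way is a little more robust than splitting on `=`, because it still works if other parameters follow the ID.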
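> Build the URL to crawl from the ID. I found that once you have someone's ID, their playlists are public!!! That saves the automatic-login step. I also have a bold idea, haha, I'm going to make some big news! But more on that another time~
**Strongly recommended: read this blog post first to learn how to locate elements: [Python crawler: automatic login to 163 Mail with Selenium, and an introduction to Locating Elements](http://blog.csdn.net/eastmount/article/details/47825633)**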
```python
# -*- coding: utf-8 -*-
import traceback
from selenium import webdriver
import selenium.webdriver.support.ui as ui
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import time
import random


# Helper: append one line of text to a file
def write2txt(data, path):
    f = open(path, "a")
    f.write(data)
    f.write("\n")
    f.close()


# Fetch the list of songs that this id likes
def catchSongs(url_id, url):
    user = url_id.split('=')[-1].strip()
    print 'execute user:', user
    # Pass executable_path='/Users/mrlevo/phantomjs-2.1.1-macosx/bin/phantomjs' (your own path) if PhantomJS is not on PATH
    driver = webdriver.PhantomJS()
    driver.get(url)
    driver.switch_to_frame('g_iframe')  # NetEase Cloud Music puts its song elements inside a frame, so switch to it first
    try:
        wait = ui.WebDriverWait(driver, 15)
        wait.until(lambda driver: driver.find_element_by_xpath('//*[@class="j-flag"]/table/tbody'))  # wait for the element to render
        try:
            song_key = 1
            wrong_time = 0
            while wrong_time < 5:  # keep reading rows; after 5 failed reads, assume nothing is left and leave the loop
                try:
                    songs = driver.find_elements_by_xpath('//*[@class="j-flag"]/table/tbody/tr[%s]' % song_key)
                    info_ = songs[0].text.strip().split("\n")
                    if len(info_) == 5:
                        info_.insert(2, 'None')  # rows without an MV entry get 'None' in that column
                    new_line = '%s|' % user + '|'.join(info_)
                    song_key += 1
                    # new_line = "%s|%s|%s|%s|%s|%s|%s"%(user,info_[0],info_[1],info_[2],info_[3],info_[4],info_[5])
                    print new_line
                    # On Mac the text must be encoded before writing. The file is named after the id
                    # and saved in the directory the script runs from. On Windows, drop the .encode('utf-8').
                    write2txt(new_line.encode('utf-8'), user)
                except Exception as ex:
                    wrong_time += 1
                    # print ex
        except Exception as ex:
            pass
    except Exception as ex:
        traceback.print_exc()
    finally:
        driver.quit()


# Get the URL of the playlist of songs that this id likes
def catchPlaylist(url):
    favourite_url = None  # stays None if the lookup below fails, so the return cannot raise
    # Pass executable_path='/Users/mrlevo/phantomjs-2.1.1-macosx/bin/phantomjs' (your own path) if PhantomJS is not on PATH
    driver = webdriver.PhantomJS()
    driver.get(url)
    driver.switch_to_frame('g_iframe')  # NetEase Cloud Music puts its song elements inside a frame, so switch to it first
    try:
        wait = ui.WebDriverWait(driver, 15)
        wait.until(lambda driver: driver.find_element_by_xpath('//*[@class="m-cvrlst f-cb"]/li[1]/div/a'))  # wait for the element located by this xpath
        urls = driver.find_elements_by_xpath('//*[@class="m-cvrlst f-cb"]/li[1]/div/a')
        favourite_url = urls[0].get_attribute("href")
    except Exception as ex:
        traceback.print_exc()
    finally:
        driver.quit()
    # print favourite_url
    return favourite_url


if __name__ == '__main__':
    # Replace the id with your own; any user's playlist can be crawled as long as you have their id
    for url in ['http://music.163.com/user/home?id=67259702']:
        time.sleep(random.randint(2, 4))  # sleep a random 2 to 4 seconds
        url_playlist = catchPlaylist(url)
        time.sleep(random.randint(1, 2))
        if url_playlist:  # skip this user if the playlist URL could not be found
            catchSongs(url, url_playlist)
```
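The row handling inside `catchSongs` (split the row text on newlines, pad the MV column when it is missing, join with `|`) can be pulled out into a pure function that is easy to check without starting a browser. This is only a sketch: `format_row` and `FIELD_COUNT` are hypothetical names, and the extra check that rejects rows with an unexpected number of fields is an addition that the original script does not perform.

```python
FIELD_COUNT = 6  # rank, title, MV flag, duration, artist, album


def format_row(user, row_text):
    """Turn the text of one table row into a pipe-separated record, or None."""
    fields = [part.strip() for part in row_text.strip().split("\n")]
    if len(fields) == FIELD_COUNT - 1:
        fields.insert(2, "None")  # row has no MV entry, so pad the MV column
    if len(fields) != FIELD_COUNT:
        return None  # unexpected layout; the caller can skip or log the row
    return "|".join([user] + fields)


if __name__ == "__main__":
    sample = "5\nDon't Cry (Original)\nMV\n04:44\nGuns N' Roses\nGreatest Hits"
    print(format_row("67259702", sample))
```

Keeping this step separate from the Selenium calls means a change in the page layout shows up as a `None` result rather than as a silently misaligned line in the output file.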
> If all goes well, a file named after your ID appears in the directory where you ran the script. Its contents should look like this:
```shell
67259702|2|因为了解|None|04:08|汪苏泷|慢慢懂
67259702|3|潮鳴り|None|02:37|折戸伸治|CLANNAD ORIGINAL SOUNDTRACK
67259702|4|每个人都会|None|02:58|方大同|橙月 Orange Moon
67259702|5|Don't Cry (Original)|MV|04:44|Guns N' Roses|Greatest Hits
67259702|6|妖孽(Cover:蒋蒋)|None|02:58|醉影An|醉声梦影
67259702|7|好好说再见(Cover 陶喆 / 关诗敏)|None|04:06|锦零/疯疯|zero
67259702|8|好好说再见(cover陶喆)|None|03:34|AllenRock|WarmCovers ·早
# The fields of each crawled record are, in order: id|rank|title|has MV|duration|artist|album
```
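Given that field order, each line can be read back into a small record for analysis. The sketch below is one possible way to do it, not the author's method: `parse_line`, `top_artists`, and the `FIELDS` names are hypothetical, and treating `/` in the artist column as a separator between co-credited artists is an assumption based on the sample rows (it would miscount an artist whose name itself contains `/`).

```python
from collections import Counter

FIELDS = ("user", "rank", "title", "mv", "duration", "artist", "album")


def parse_line(line):
    """Parse one pipe-separated record into a dict, or return None."""
    parts = line.rstrip("\n").split("|")
    if len(parts) != len(FIELDS):
        return None  # comment line, blank line, or a title containing "|"
    record = dict(zip(FIELDS, parts))
    try:
        minutes, seconds = record["duration"].split(":")
        record["seconds"] = int(minutes) * 60 + int(seconds)
    except ValueError:
        return None  # duration is not in mm:ss form
    record["has_mv"] = record["mv"] == "MV"
    return record


def top_artists(lines, n=3):
    """Count songs per artist; co-credited artists ("a/b") each get one count."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        for artist in record["artist"].split("/"):
            counts[artist.strip()] += 1
    return counts.most_common(n)


# Example (file name assumed to match the ID used above):
# import io
# with io.open("67259702", encoding="utf-8") as f:
#     print(top_artists(f))
```

Lines that do not have exactly seven fields are skipped rather than guessed at, so a song title containing `|` is dropped from the tally instead of shifting the columns.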
# Showing the data: ROUND 1
> Next,