# Algorithms - Lede 2018
A course on algorithms used in journalism, for beginning Python programmers.
Taught at the Lede program, Columbia Journalism School, summer 2018
by Jonathan Stray. Some parts adapted from previous work, as noted.
## Course overview
This is a course on algorithmic data analysis in journalism, and also the journalistic analysis of algorithms used in society. The major topics are text processing, visualization of high-dimensional data, regression, machine learning, algorithmic bias and accountability, Monte Carlo simulation, and election prediction.
There are seven weeks of material and 13 classes total. Most classes include a Jupyter notebook that students filled out during class time, following the instructor on screen. The full class notebook is named `class-week-day-topic.ipynb`, while the version of the same notebook ready to be filled out in class is called `class-week-day-topic-empty.ipynb`. Though I believe students learn best when they type the code themselves, I sometimes included particularly annoying or unedifying snippets in the "empty" notebooks.
There is a homework assignment after most classes, given as a notebook which students must fill out themselves. Homework solutions are available separately for instructors -- contact me (try @jonathanstray on Twitter).
All coding is done in Python, using Pandas, matplotlib, and scikit-learn.
Some classes also include slide decks, in pdf and pptx formats. The pptx files are included for potential remixes and because the notes contain extra information, e.g. the source URLs for each slide.
## Prerequisites
Students must have basic ability to work with data in Python/Pandas, such as filtering, grouping, and plotting.
## Administration
- Instructor: Jonathan Stray, jms2361@columbia.edu
- Dates: Mondays and Wednesdays, 7/18-8/29
- Class: 10am-1pm
- Lab: 2pm-5pm
- Location: World Room
- Slack channel: #algorithms
## Teaching notes
For each week, I have included notes about what worked and what didn't, in the hope of improving future iterations of the course. You'll see that it took me a few weeks to dial in the level and style of teaching, so the course is strongest (for this audience) starting in week 3.
## LICENSE
This work is licensed under Creative Commons [CC-BY 3](https://creativecommons.org/licenses/by/3.0/us/), which means you can pretty much do what you want with it, as long as you mention my name in derivative works.
## Syllabus
### Week 1 - Introduction to Algorithms
Algorithms for doing journalism, journalism about algorithms. The purpose of mathematical formalism.
*Materials:*
- [Slides](https://github.com/jstray/lede-algorithms/blob/master/week-1/week-1.pdf)
*Homework:*
- [Average of averages](https://github.com/jstray/lede-algorithms/blob/master/week-1/week-1-homework.ipynb). First you'll do some basic group operations on the Titanic survival data. Then you'll use the results to show that an average of averages is not always the same as the overall average. You must figure out when these two averages **are** equal, and how to compute the overall average from the individual averages.
*What worked and didn't.* Students enjoyed the introductory lecture. The material on linear vs. quadratic running time was a bit abstract, and we didn't really talk about it later in the course, so I'd drop it next time. Most students were able to understand what the "average of averages" problem was about, but many got stuck on the summation notation in the homework. I'd rewrite this to look like code instead of algebra (`averages.mean()` etc.).
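As a rough illustration of what a code-style version of the "average of averages" idea could look like, here is a minimal sketch using only the standard library. The group names and survival flags are invented for this example and are not taken from the Titanic data; the homework itself uses Pandas group operations.

```python
from statistics import mean

# Hypothetical toy data: survival flags (1 = survived) per passenger class.
# These values are made up for illustration only.
groups = {
    "first": [1, 1, 0, 1],
    "second": [1, 0],
    "third": [0, 0, 0, 1, 0, 0],
}

# Average within each group.
averages = {name: mean(values) for name, values in groups.items()}

# Unweighted average of the group averages.
average_of_averages = mean(list(averages.values()))

# Overall average across every individual row.
all_values = [v for values in groups.values() for v in values]
overall_average = mean(all_values)

# Weighting each group average by its group size recovers the overall average.
total_rows = sum(len(values) for values in groups.values())
weighted_average = sum(
    averages[name] * len(values) for name, values in groups.items()
) / total_rows

print("average of averages:", average_of_averages)
print("overall average:    ", overall_average)
print("weighted average:   ", weighted_average)
```

With groups of different sizes, the first number differs from the other two, which is the effect the homework asks students to explain.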
### Week 2-1 - Text processing and TF-IDF
In this class we will develop the ubiquitous vector space document model, with TF-IDF weighting. You will learn to algorithmically summarize documents by extracting keywords, how to compare documents for similarity, and how a search engine and Google News work.
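To make the vector space model concrete, the following is a minimal from-scratch sketch of TF-IDF weighting and cosine similarity. The three documents are invented, the helper names (`tokenize`, `tfidf_vectors`, `cosine_similarity`) are hypothetical, and the weighting shown (raw term count times `log(N / df)`) is only one common variant; libraries such as scikit-learn use a somewhat different formula by default, and the class notebook may differ as well.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented for illustration.
docs = [
    "the senate passed the budget bill",
    "the house debated the budget",
    "the team won the football game",
]

def tokenize(text):
    # Deliberately naive: lowercase and split on whitespace.
    return text.lower().split()

def tfidf_vectors(documents):
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times inverse document frequency (one common variant).
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in counts.items()})
    return vectors

def cosine_similarity(a, b):
    # Sparse dot product over the terms of vector a.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

vectors = tfidf_vectors(docs)

# Keyword extraction: the highest-weighted terms in each document.
keywords = [sorted(v, key=v.get, reverse=True)[:2] for v in vectors]

# Document comparison: the two politics documents vs. politics and sports.
sim_politics = cosine_similarity(vectors[0], vectors[1])
sim_sports = cosine_similarity(vectors[0], vectors[2])
print(keywords, sim_politics, sim_sports)
```

A word that appears in every document, such as "the", gets an inverse document frequency of zero here, so it never shows up as a keyword and contributes nothing to similarity.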
*Materials:*
- [Slides](https://github.com/jstray/lede-algorithms/blob/master/week-2/week-2.pdf)
- [Class notebook](https://github.com/jstray/lede-algorithms/blob/master/week-2/week-2-1-class.ipynb)
*References:*
- [TF-IDF is about what matters](https://planspace.org/20150524-tfidf_is_about_what_matters/) - an article which describes TF-IDF in more detail.
- [How ProPublica's Message Machine Reverse Engineers Political Microtargeting](https://www.propublica.org/nerds/how-propublicas-message-machine-reverse-engineers-political-microtargeting) - a real life example of TF-IDF and cosine similarity used in journalism.
- Stephen Ramsay's short book [Reading Machines](http://www.dansinykin.com/uploads/8/4/0/2/84026824/ramsay_algorithmic_criticism.pdf) - TF-IDF used in the digital humanities, plus a fantastic discussion of text analysis in general.
- The [Overview document mining platform](https://www.overviewdocs.com), a powerful tool you can use to explore document sets, or OCR and convert them. See also this [visualization of the TF-IDF vectors](https://blog.overviewdocs.com/2012/03/16/video-document-mining-with-the-overview-prototype/) of a document set.
- [A full-text visualization of the Iraq war logs](https://blog.overviewdocs.com/2010/12/10/a-full-text-visualization-of-the-iraq-war-logs/) - I used TF-IDF to analyze the Wikileaks Iraq War Logs, which became the inspiration for Overview.
*Homework:*
- [Analyze the State of the Union with TF-IDF](https://github.com/jstray/lede-algorithms/blob/master/week-2-1/week-2-homework.ipynb) to see how topics changed by decade.
*What worked and didn't.* It would help to motivate TF-IDF by showing it used in journalism first, e.g. [message machine](https://www.propublica.org/nerds/how-propublicas-message-machine-reverse-engineers-political-microtargeting). The slide deck describing TF-IDF was adapted from my [computational journalism seminar](http://www.compjournalism.com/) and had too many equations for most students. The switch from distance to similarity within the notebook was confusing. Jonathan Soma's [2017 class on TF-IDF](http://jonathansoma.com/lede/foundations/classes/text%20processing/tf-idf/) repeatedly plots a few dimensions at a time, which seemed to help get the concept across.
Generally, the students were inspired by text analysis and several used TF-IDF for summarization or clustering as part of their Data Studio projects. We could usefully spend more time on other NLP topics like tagging and sentiment analysis (both of which are supported in `TextBlob`). Sentiment analysis in particular is ever-popular, for better or worse.
### Week 2-2 - Vectors, clusters, and visualization
We'll start with the idea of clusters, and the K-means algorithm which identifies them. Then we'll look at the voting patterns of the UK House of Lords in 2012 (yes, there is a reason for this particular data) to develop the general idea of high-dimensional vectors representing data. We'll learn some ways to visualize such high-dimensional spaces, including the classic PCA algorithm.
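The K-means idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the code from the class notebook: the `kmeans` function is hypothetical, the points are invented 2-D data rather than voting records, and the starting centers are simply the first `k` points, whereas library implementations choose them more carefully.

```python
import math

def kmeans(points, k, n_iter=10):
    # Simplification: use the first k points as the starting centers.
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centers[j] = [sum(coord) / len(members) for coord in zip(*members)]
    return labels, centers

# Hypothetical 2-D points forming two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (0.8, 1.1), (9.1, 8.8), (8.9, 9.2)]
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

The same loop works unchanged on longer vectors, for example one entry per vote, because `math.dist` accepts points of any dimension.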
*Materials:*
- [Class notebook](https://github.com/jstray/lede-algorithms/blob/master/week-2/week-2-1-class.ipynb)
*References:*
- [Visualizing K-Means Clustering](https://www.naftaliharris.com/blog/visualizing-k-means-clustering/) - an interactive demonstration by Naftali Harris
- [In Depth: Principal Component Analysis](https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html)
*Homework:*
- [Principal component analysis](https://github.com/jstray/lede-algorithms/blob/master/week-2/week-2-2-homework.ipynb) of data from the General Social Survey.
*What worked and didn't.* Students found the PCA material very abstract -- but the discussion of [Flatland](http://www.geom.uiuc.edu/~banchoff/Flatland/) as a method of thinking about higher dimensions was a hit! It would have been better to build up to PCA from simpler high-dimensional EDA techniques, e.g. `scatter_matrix`.
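For readers who find PCA abstract, it may help to see that the core computation is short. The sketch below assumes NumPy is available and uses an invented 6-by-3 table; `pca_project` is a hypothetical helper written for this example, not a function from the course notebooks or from scikit-learn. It centers the data, finds the directions of greatest variance with a singular value decomposition, and projects onto the first two.

```python
import numpy as np

# Hypothetical data: 6 rows, 3 numeric columns. The first two columns
# are strongly related; the third is nearly constant noise.
X = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 0.4],
    [3.0, 5.9, 0.6],
    [4.0, 8.2, 0.5],
    [5.0, 9.8, 0.4],
    [6.0, 12.1, 0.6],
])

def pca_project(X, n_components=2):
    # Center each column so the components describe variation around the mean.
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    # Fraction of total variance captured by each direction.
    explained = (S ** 2) / np.sum(S ** 2)
    return Xc @ components.T, explained[:n_components]

projected, explained = pca_project(X)
print(projected.shape)
print(explained)
```

Because two of the three columns move together, nearly all of the variance lands on the first component, which is why a 2-D plot of `projected` loses very little information for data like this.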
The GSS PCA assignment was difficult for most students, in part because the General Social Survey website has a horrificia