Algorithms course materials from the Columbia Journalism School Lede program (Jupyter Notebook)

122 files in total

tsv: 52

ipynb: 29

csv: 23


41 views, uploaded 2023-04-30 10:27:58, 63.17MB ZIP


哥伦比亚新闻学院Lede项目的算法课程材料_Jupyter Notebook_下载.zip (122 files)

tickets-warnings.csv 51.2MB

loan-subset.csv 50.62MB

anes_timeseries_2016_rawdata.csv 13.88MB

state-of-the-union.csv 10.01MB

menendez-press-releases.csv 5.3MB

ontime_reports_may_2015_ny.csv 3.86MB

category-training.csv 3.49MB

compas-scores-two-years.csv 2.43MB

compas-scores-two-years-violent.csv 1.53MB

apib12tx.csv 1.2MB

GSS-spending.csv 675KB

bills.csv 614KB

uk-lords-votes.csv 141KB

titanic.csv 86KB

15-16-cycle-disbursements-to-trump-properties.csv 38KB

wine.csv 11KB

lights-camera-algorithm-2.csv 3KB

kitchen-confusion-matrix.csv 2KB

states.csv 818B

election-fundamentals.csv 537B

height-weight.csv 326B

.gitignore 54B

week-4-1-supervised-learning-class.ipynb 835KB

week-3-1-class.ipynb 641KB

week-3-2-class.ipynb 258KB

week-2-2-class.ipynb 196KB

week-5-1-machine-bias-class.ipynb 190KB

week-6-1-election-prediction.ipynb 136KB

week-6-1-election-prediction-empty.ipynb 134KB

week-4-2-feature-selection-class.ipynb 115KB

Simpson's Paradox.ipynb 109KB

week-2-1-class.ipynb 92KB

week-5-2-lending-class.ipynb 76KB

Machine learning with Overview.ipynb 39KB

kitchen-confusion-matrix.ipynb 35KB

stormy-daniels-payments-simulation.ipynb 32KB

week-5-1-machine-bias-class-empty.ipynb 19KB

week-5-2-lending-class-empty.ipynb 12KB

week-5-1-fairness-tradeoffs-homework.ipynb 11KB

Simple Polling Simulation.ipynb 11KB

week-4-1-supervised-learning-class-empty.ipynb 10KB

week-3-1-homework.ipynb 10KB

week-4-1-supervised-learning-homework.ipynb 9KB

week-3-2-homework.ipynb 9KB

week-2-2-homework.ipynb 9KB

week-2-2-class-empty.ipynb 8KB

week-6-2-midwest-polling-errors-homework.ipynb 7KB

week-4-2-text-classification-homework.ipynb 6KB

week-6-1-election-prediction-homework.ipynb 5KB

week-2-1-homework.ipynb 5KB

week-1-homework.ipynb 3KB

README.md 30KB

README.md 2KB

week-1.pdf 4.51MB

week-5-1.pdf 3.58MB

week-2-1.pdf 2.89MB

anes_timeseries_2016_userguidecodebook.pdf 2.78MB

week-5-2.pdf 2.35MB

week-3.pdf 1.72MB

anes_timeseries_2016_varlist.pdf 314KB

week-5-1.pptx 8.84MB

week-1.pptx 7.8MB

week-2-1.pptx 7.19MB

week-3.pptx 3.56MB

week-5-2.pptx 2.68MB

prepare-loan-data.py 588B

download-states.sh 563B

16-US-Pres-GE TrumpvClinton-poll-responses-clean.tsv 376KB

US.tsv 376KB

FL.tsv 28KB

PA.tsv 26KB

NC.tsv 24KB

OH.tsv 22KB

VA.tsv 17KB

NH.tsv 17KB

CO.tsv 15KB

NV.tsv 15KB

WI.tsv 13KB

GA.tsv 13KB

IA.tsv 13KB

MI.tsv 12KB

AZ.tsv 12KB

UT.tsv 9KB

CA.tsv 9KB

NY.tsv 8KB

MO.tsv 8KB

ME.tsv 8KB

TX.tsv 8KB

KS.tsv 6KB

OR.tsv 6KB

MA.tsv 6KB

NJ.tsv 6KB

SC.tsv 6KB

IN.tsv 6KB

IL.tsv 5KB

LA.tsv 5KB

MN.tsv 5KB

WA.tsv 5KB


# Algorithms - Lede 2018

A course on algorithms used in journalism, for beginning Python programmers. Taught at the Lede program, Columbia Journalism School, summer 2018 by Jonathan Stray. Some parts adapted from previous work, as noted.

## Course overview

This is a course on algorithmic data analysis in journalism, and also the journalistic analysis of algorithms used in society. The major topics are text processing, visualization of high dimensional data, regression, machine learning, algorithmic bias and accountability, Monte Carlo simulation, and election prediction.

There are seven weeks of material and 13 classes total. Most classes include a Jupyter notebook that students filled out during class time, following the instructor on screen. The full class notebook is named `class-week-day-topic.ipynb` while this same notebook ready to be filled out in class is called `class-week-day-topic-empty.ipynb`. Though I believe it's important for learning for the students to type the code themselves, I sometimes included particularly annoying or unedifying snippets in the "empty" notebooks.

There is a homework assignment after most classes, given as a notebook which students must fill out themselves. Homework solutions are available separately for instructors -- contact me (try @jonathanstray on Twitter).

All coding is done in Python, using Pandas, matplotlib, and scikit-learn.

Some classes also include slide decks, in pdf and pptx formats. The pptx files are included for potential remixes and because the notes contain extra information, e.g. the source URLs for each slide.

## Prerequisites

Students must have basic ability to work with data in Python/Pandas, such as filtering, grouping, and plotting.
## Administration

- Instructor: Jonathan Stray, jms2361@columbia.edu
- Dates: Mondays and Wednesdays, 7/18-8/29
- Class: 10am-1pm
- Lab: 2pm-5pm
- Location: World Room
- Slack channel: #algorithms

## Teaching notes

For each week, I have included notes about what worked and what didn't, in the hope of improving future iterations of the course. You'll see that it took me a few weeks to dial in the level and style of teaching, so the course is strongest (for this audience) starting in week 3.

## LICENSE

This work is licensed under Creative Commons [CC-BY 3](https://creativecommons.org/licenses/by/3.0/us/), which means you can pretty much do what you want with it, as long as you mention my name in derivative works.

## Syllabus

### Week 1 - Introduction to Algorithms

Algorithms for doing journalism, journalism about algorithms. The purpose of mathematical formalism.

*Materials:*

- [Slides](https://github.com/jstray/lede-algorithms/blob/master/week-1/week-1.pdf)

*Homework:*

- [Average of averages](https://github.com/jstray/lede-algorithms/blob/master/week-1/week-1-homework.ipynb). First you'll do some basic group operations on the Titanic survival data. Then you'll use the results to show that an average of averages is not always the same as the overall average. You must figure out when these two averages **are** equal, and how to compute the overall average from the individual averages.

*What worked and didn't.* Students enjoyed the introductory lecture. The material on linear vs. quadratic running time was a bit abstract, and we didn't really talk about it later in the course, so I'd drop it next time. Most students were able to understand what the "average of averages" problem was about, but many got stuck on the summation notation in the homework. I'd rewrite this to look like code instead of algebra (`averages.mean()` etc.).

### Week 2-1 - Text processing and TF-IDF

In this class we will develop the ubiquitous vector space document model, with TF-IDF weighting.
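
As a rough illustration of the vector space model with TF-IDF weighting, the sketch below builds the vectors by hand and compares documents with cosine similarity. It assumes one common variant (raw term counts, `idf = log(N / df)`, unit-length vectors). The three headlines and the helper names `tokenize`, `tfidf_vectors`, and `cosine_similarity` are invented for illustration and do not come from the class notebook; scikit-learn's `TfidfVectorizer` uses a smoothed IDF by default, so its numbers will differ.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; real pipelines also strip punctuation.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, scaled to unit length."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Assumed variant: tf is the raw count, idf is log(N / df).
        vec = {t: c * math.log(n_docs / df[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors


def cosine_similarity(a, b):
    # Both vectors are unit length, so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


# Hypothetical mini-corpus, made up for this example.
docs = [
    "the senate passed the budget bill",
    "the house debated the budget bill",
    "the team won the championship game",
]
vectors = tfidf_vectors(docs)
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
# Highest-weighted terms act as keywords for a document.
top_terms = sorted(vectors[2], key=vectors[2].get, reverse=True)[:3]
print(round(sim_01, 3), sim_02, top_terms)
```

The first two headlines share `budget` and `bill`, so their similarity is about 0.12. The third shares only `the`, which appears in every document and therefore gets an IDF of zero, so its similarity to the first headline is 0.
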
You will learn to algorithmically summarize documents by extracting keywords, how to compare documents for similarity, and how a search engine and Google News work.

*Materials:*

- [Slides](https://github.com/jstray/lede-algorithms/blob/master/week-2/week-2.pdf)
- [Class notebook](https://github.com/jstray/lede-algorithms/blob/master/week-2/week-2-1-class.ipynb)

*References:*

- [TF-IDF is about what matters](https://planspace.org/20150524-tfidf_is_about_what_matters/) - an article which describes TF-IDF in more detail.
- [How ProPublica's Message Machine Reverse Engineers Political Microtargeting](https://www.propublica.org/nerds/how-propublicas-message-machine-reverse-engineers-political-microtargeting) - a real life example of TF-IDF and cosine similarity used in journalism.
- Stephen Ramsay's short book [Reading Machines](http://www.dansinykin.com/uploads/8/4/0/2/84026824/ramsay_algorithmic_criticism.pdf) - TF-IDF used in the digital humanities, plus a fantastic discussion of text analysis in general.
- The [Overview document mining platform](overviewdocs.com), a powerful tool you can use to explore document sets, or OCR and convert them. See also this [visualization of the TF-IDF vectors](https://blog.overviewdocs.com/2012/03/16/video-document-mining-with-the-overview-prototype/) of a document set.
- [A full-text visualization of the Iraq war logs](https://blog.overviewdocs.com/2010/12/10/a-full-text-visualization-of-the-iraq-war-logs/) - I used TF-IDF to analyze the Wikileaks Iraq War Logs, which became the inspiration for Overview.

*Homework:*

- [Analyze the State of the Union with TF-IDF](https://github.com/jstray/lede-algorithms/blob/master/week-2-1/week-2-homework.ipynb) to see how topics changed by decade.

*What worked and didn't.* It would help to motivate TF-IDF by showing it used in journalism first, e.g. [message machine](https://www.propublica.org/nerds/how-propublicas-message-machine-reverse-engineers-political-microtargeting).
The slide deck describing TF-IDF was adapted from my [computational journalism seminar](http://www.compjournalism.com/) and had too many equations for most. The switch from distance to similarity within the notebook was confusing. Jonathan Soma's [2017 class on TF-IDF](http://jonathansoma.com/lede/foundations/classes/text%20processing/tf-idf/) repeatedly plots a few dimensions at a time, which seemed to help get the concept across.

Generally, the students were inspired by text analysis and several used TF-IDF for summarization or clustering as part of their Data Studio projects. We could usefully spend more time on other NLP topics like tagging and sentiment analysis (both of which are supported in `TextBlob`). Sentiment analysis in particular is ever-popular, for better or worse.

### Week 2-2 - Vectors, clusters, and visualization

We'll start with the idea of clusters, and the K-means algorithm which identifies them. Then we'll look at the voting patterns of the UK House of Lords in 2012 (yes, there is a reason for this particular data) to develop the general idea of high dimensional vectors representing data. We'll learn some ways to visualize such high dimensional spaces, including the classic PCA algorithm.

*Materials:*

- [Class notebook](https://github.com/jstray/lede-algorithms/blob/master/week-2/week-2-1-class.ipynb)

*References:*

- [Visualizing K-Means Clustering](https://www.naftaliharris.com/blog/visualizing-k-means-clustering/) - an interactive demonstration by Naftali Harris
- [In Depth: Principal Component Analysis](https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html)

*Homework:*

- [Principal component analysis](https://github.com/jstray/lede-algorithms/blob/master/week-2/week-2-2-homework.ipynb) of data from the General Social Survey.
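
The K-means idea from the Week 2-2 description can be sketched in plain Python, treating each legislator's voting record as a vector. This is a minimal sketch under stated assumptions: the vote matrix is made up (1 = aye, -1 = no, 0 = absent) and is not taken from `uk-lords-votes.csv`, and starting from the first `k` rows is a simplification chosen to keep the run deterministic. Library implementations such as scikit-learn's `KMeans` pick starting centers more carefully.

```python
import math


def kmeans(points, k, n_iter=20):
    """Plain K-means (Lloyd's algorithm) on a list of equal-length tuples."""
    # Simplified, deterministic start: the first k points are the centers.
    centers = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(centers[j])  # leave an empty cluster in place
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return labels, centers


# Hypothetical voting records: one row per legislator, one column per vote.
votes = [
    (1, 1, -1, -1, 1),
    (-1, -1, 1, 1, -1),
    (1, 1, -1, 0, 1),
    (-1, -1, 1, 1, 0),
    (1, 0, -1, -1, 1),
    (-1, -1, 0, 1, -1),
]
labels, centers = kmeans(votes, k=2)
print(labels)
```

On this toy matrix the two voting blocs separate cleanly, with rows 0, 2, and 4 in one cluster and rows 1, 3, and 5 in the other. Each row here is a 5-dimensional vector; real roll-call data has one dimension per vote, which is what makes a projection such as PCA useful for plotting.
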
*What worked and didn't.* Students found the PCA material very abstract -- but the discussion of [Flatland](http://www.geom.uiuc.edu/~banchoff/Flatland/) as a method of thinking about higher dimensions was a hit! It would have been better to build up to PCA from simpler high dimensional EDA techniques, e.g. `scatter_matrix`. The GSS PCA assignment was difficult for most students, in part because the General Social Survey website has a horrificia
