Bryn Mawr College
CMSC 325: Computational Linguistics
Fall 2018
Course Materials
Prof. Deepak Kumar
Lecture Hours: Mondays & Wednesdays, 10:10 a.m. to 11:30 a.m.
Office Hours: Tuesdays 1:30 p.m. to 2:30 p.m. and during Lab Hours
Lecture Room: 349 PSB
Lab: All labs will meet in Room 230 PSB. Students should register for the lab shown below:
- Fridays 10:10 a.m. to 11:30 a.m. (Led by Prof. Kumar)
Laboratories:
- Computer Science Lab Room 231 (Science Building)
- You may also be able to use your own computer to do the labs for this course.
Texts & Software
TEXTS:
Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition by Daniel Jurafsky & James H. Martin, 2nd Edition, Pearson Prentice Hall 2008.
NOTE: The authors are currently updating the text for a Third Edition. Updates are available electronically at Prof. Jurafsky's website. More details on accessing its content will be provided in the First Week.
Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit (NLTK) by Steven Bird, Ewan Klein, and Edward Loper. Available under Creative Commons License at NLTK Book Site.
SOFTWARE:
Python 3.0+NLTK
This software is installed on all computers in the CS Lab. It can also be installed on your computers. Await instructions in the lectures/labs about installing on your own computers.
Syllabus
Course Description: Class Number: 2201
Introduction to computational models of understanding and processing human languages. How elements of linguistics, computer science, and artificial intelligence can be combined to help computers process human language and to help linguists understand language through computer models. Topics covered: syntax, semantics, pragmatics, generation and knowledge representation techniques. Prerequisite: CMSC 206, or H106 and CMSC 231, or permission of instructor. Haverford: Natural Science (NA)
Enrollment Limit: 24.
Lab Attendance: Attendance in Lab is optional, but will be required during specific weeks. Look for announcements below during the semester. Prof. Kumar will be available in the Lab during all Lab times throughout the semester.
Important Dates
September 5: First lecture
October 1: Exam 1
November 7: Exam 2
December 10: Last lecture
December 12: Exam 3
Assignments
- Assignment#1 (Due on Wednesday, September 19) is posted. Click here for details.
- Assignment#2 (Due on Wednesday, October 3) is posted. Click here for details.
- Assignment#3 (Due on Wednesday, October 24): Click here for details.
- Assignment#4 (Due on Wednesday, November 7): Click here for details.
- Assignment#5 (Due on Monday, December 10): Click here for details.
Lectures
- Week 1 (September 5) NO LAB THIS WEEK
September 5: Introduction to Computational Linguistics. Course overview. Examples of language processing: Google Search, machine translation: Google Translate, iTranslate iPhone app, Microsoft Demo (Nov. 2012). Identifying language tasks and the knowledge required for these tasks. Language processing versus data processing.
Read: Chapter 1 from Jurafsky & Martin.
Slides from today: Introduction.pdf (Slides).
- Week 2 (September 10, 12)
September 10: Regular Languages. Three equivalent models for specifying regular languages: regular expressions, finite automata, and regular grammars. Regular Expressions: for searching and specifying languages.
Basic elements of regular expressions: expressions, anchors, counters, operator precedence, substitution, memory, examples.
Read: Chapter 2 from Jurafsky & Martin.
Do: Look up the documentation of the Linux egrep command. Try out some regular expression searches from class. Also, search in Google for "Microsoft Word regular expression search" and look for a link to a page at the office.microsoft.com site on using regular expressions in Word. Follow the tutorial to learn how to use regular expressions in Word, and note the small differences in the use and specification of patterns in Word versus how we did them in class and in the text.
September 12: Regular expressions in Python - a demo. Also, how to access text files and web pages through Python. Linux tools for processing texts: tr, sort, uniq, etc. How to count number of "words" and their frequencies. NLTK - First Look - Accessing pre-provided text corpora. Examples.
Read: Chapter 1 from NLTK book.
Assignment#1 (Due on Wednesday, September 19) is posted. Click here for details.
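A minimal sketch of the word-counting idea from September 12, roughly what a tr | sort | uniq -c pipeline does. The sample text, the name word_frequencies, and the definition of a "word" (a run of letters with an optional internal apostrophe) are assumptions made for illustration, not the program shown in class.

```python
import re
from collections import Counter

# Hypothetical sample text; in the lab this would come from a file or a web page.
text = "The cat sat on the mat. The dog sat too!"

# Assumed definition of a "word": a run of letters, optionally followed by an
# apostrophe and more letters. Other definitions give different counts.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")

def word_frequencies(s):
    """Return a Counter mapping each lowercased word to its frequency."""
    return Counter(WORD_RE.findall(s.lower()))

freqs = word_frequencies(text)
for word, count in freqs.most_common(3):
    print(word, count)
```

Changing WORD_RE is the quickest way to see how much the counts depend on what is treated as a word.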
- Week 3 (September 17, 19)
September 17: Finite State Automata. Recognition versus generation. Deterministic and non-deterministic FSAs. The equivalence of deterministic and non-deterministic FSAs. Formal Languages. The equivalence between regular expressions, regular languages, and finite state automata. Words: what is a word. Word and sentence tokenization. Ambiguities.
Read: Sections 3.1 and 3.9 from J&M.
September 19: Parsing words using morphology: Morphology: overview, types of affixes. Morphological Parsing using Finite State Transducers.
Read: Chapter 3 from J&M.
Assignment#2 (Due on Wednesday, October 3) is posted. Click here for details.
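As an illustration of recognition with a deterministic FSA (September 17), here is a small table-driven sketch. The toy language (a b, then two or more a's, then !), the state numbering, and the name recognize are assumptions chosen for the example.

```python
# Transition table for a deterministic FSA accepting strings such as
# "baa!" and "baaaa!". State numbers and layout are illustrative choices.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,  # loop: any number of additional a's
    (3, "!"): 4,
}
START = 0
ACCEPT = {4}

def recognize(tape):
    """Return True if the FSA ends in an accepting state after reading tape."""
    state = START
    for symbol in tape:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # no transition defined: reject immediately
            return False
    return state in ACCEPT

print(recognize("baaa!"), recognize("ba!"))
```

Because the machine is deterministic, recognition needs no search: each input symbol selects at most one next state.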
- Week 4 (September 24, 26)
September 24: Porter Stemmer - lexicon free morphological stemming. Porter Stemmer in NLTK. Detection and correction of spelling errors: Minimum edit distance algorithm. Examples.
Read: Chapter 3 from J&M.
Useful Links: Martin Porter's Stemmer Resource page, the original stemming algorithm paper.
September 26: Python implementation of the Minimum Edit Distance algorithm. Examples. The MaxMatch algorithm for word (hashtag) segmentation: Python implementation and examples. Introduction to N-grams. Please see the directory ~dkumar/cs325 for the Python programs from this morning.
Read: SKIM Chapter 4 (though we just got started).
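The two algorithms from this week can be sketched as follows. This is not the code in ~dkumar/cs325; the function names, the default substitution cost of 2 (insertions and deletions cost 1), and the tiny dictionary are assumptions made for illustration.

```python
def min_edit_distance(source, target, sub_cost=2):
    """Dynamic-programming edit distance; insert/delete cost 1 each."""
    n, m = len(source), len(target)
    # dist[i][j] = cheapest way to turn source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # delete every character of source[:i]
    for j in range(1, m + 1):
        dist[0][j] = j  # insert every character of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                              # deletion
                dist[i][j - 1] + 1,                              # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # substitution
            )
    return dist[n][m]


def max_match(text, dictionary):
    """Greedy left-to-right segmentation: always take the longest known word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words


print(min_edit_distance("intention", "execution"))  # 8 with sub_cost=2
# Hypothetical dictionary that exposes the weakness of greedy matching.
toy_dictionary = {"the", "theta", "table", "bled", "down", "own", "there"}
print(max_match("thetabledownthere", toy_dictionary))
```

The second call returns theta, bled, own, there rather than the, table, down, there, which shows why longest-match-first segmentation can go wrong even when every intended word is in the dictionary.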
- Week 5 (October 1, 3)
October 1: Exam 1 is today.
October 3: N-Gram Language models. Google N-grams data. Computing MLEs from N-Gram data. Introduction to Parts of speech classes.
Read: Start reading Chapter 5 from J&M.
Assignment#3 is posted (Due on Wednesday, October 24): Click here for details.
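To make "computing MLEs from N-gram data" concrete, the sketch below estimates bigram probabilities as count(w1 w2) / count(w1) from a three-sentence toy corpus. The corpus, the sentence-boundary markers, and the name bigram_mle are assumptions for this example.

```python
from collections import Counter

def bigram_mle(sentences):
    """Return {(w1, w2): P(w2 | w1)} estimated by relative frequency."""
    contexts, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.lower().split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs
    return {(w1, w2): c / contexts[w1] for (w1, w2), c in bigrams.items()}

# Toy corpus, assumed for illustration.
corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
probs = bigram_mle(corpus)

print(probs[("<s>", "i")])  # 2 of the 3 sentences start with "i"
print(probs[("i", "am")])   # "i" is followed by "am" in 2 of its 3 occurrences
```

Any bigram that never occurs in the corpus is simply absent from the result, which is the zero-probability problem that smoothing addresses.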
- Week 6 (October 8, 10)
October 8: Grammar basics: parts of speech (POS). We will watch the Schoolhouse Rock/Grammar Rock videos for a quick introduction to POS. POS Tagging - defining the problem.
Read: Chapter 5 from J&M.
October 10: POS Tagging: baseline tagger (most probable tag), stochastic taggers: Hidden Markov Model (HMM) taggers.
Read: Chapter 5 & start of Chapter 6 from J&M.
- Week 7 (October 15, 17)
No Classes. Fall Break
- Week 8 (October 22, 24)
October 22: Doing POS Tagging in NLTK: Rule-based tagging (using Regular Expressions); Using the Brown tagged corpus. Doing tagger evaluation for accuracy. Stochastic taggers: Unigram and Bigram taggers.
Read: Python for Linguists, Tutorial#3.
October 24: Stochastic Taggers with backoff. HMM Taggers - Tracing the Viterbi algorithm. Doing HMM tagging in NLTK.
Read: Python for Linguists, Tutorial#4.
Assignment#4 (Due on Wednesday, November 7): Click here for details.
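For tracing the Viterbi algorithm by hand, a small pure-Python sketch can help. It does not use NLTK's HMM tagger; the two-tag model and all of its probabilities are invented for illustration, and the function assumes a non-empty sentence.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return (best tag sequence, its probability) under a bigram HMM."""
    # best[t][tag] = (probability of the best path ending in tag at t, backpointer)
    best = [{tag: (start_p[tag] * emit_p[tag].get(words[0], 0.0), None)
             for tag in tags}]
    for t in range(1, len(words)):
        column = {}
        for tag in tags:
            emit = emit_p[tag].get(words[t], 0.0)
            # Pick the previous tag that maximizes the path probability.
            column[tag] = max(
                (best[t - 1][prev][0] * trans_p[prev][tag] * emit, prev)
                for prev in tags
            )
        best.append(column)
    # Follow backpointers from the most probable final tag.
    last = max(tags, key=lambda tag: best[-1][tag][0])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return list(reversed(path)), best[-1][last][0]

# Hypothetical toy model: N = noun, V = verb.
tags = ["N", "V"]
start_p = {"N": 0.8, "V": 0.2}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {
    "N": {"dogs": 0.4, "fish": 0.5, "sleep": 0.1},
    "V": {"dogs": 0.1, "fish": 0.3, "sleep": 0.6},
}

path, prob = viterbi(["dogs", "fish", "sleep"], tags, start_p, trans_p, emit_p)
print(path, prob)
```

With these numbers the best path is N N V: the verb reading of "fish" scores higher locally, but the final step prefers the path that passed through the noun reading.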
- Week 9 (October 29, 31)
October 29: No class today. Deepak is out of town.
October 31: Syntax and grammars: an introduction. Formal Grammars, language defined by a grammar. Types of grammars, Chomsky Hierarchy. Capturing English syntax: Constituency, Grammatical Relations, Subcategorization/Dependency Relations.
Read: Chapter 12 from J&M.
- Week 10 (November 5, 7)
November 5: Writing grammars for English: noun phrases, verb phrases. Moods/Classes of sentences: declarative, imperative, yes-no questions, wh-questions.
Read: Sections 12.1 to 12.3 from J&M.
November 7: Exam 2 is today.
- Week 11 (November 12, 14)
November 12: CFGs, contd. Probabilistic Context Free Grammars. Recursive Transition Networks (RTNs). Parsing: Top-Down vs. Bottom-Up. Recursive Descent parsers, Shift-Reduce parsers.
Read: Start reading Chapter 13 from J&M.
November 14: Recursive Descent Parsers, Shift-Reduce Parsers. Dynamic Programming for parsing: CKY/CYK Parsing: Chomsky Normal Form, parsing algorithm.
Read: Chapter 13 from J&M.
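A minimal sketch of CKY recognition over a grammar in Chomsky Normal Form. The toy grammar, the lexicon, and the name cky_recognize are hypothetical; the sketch only answers whether a sentence is in the language and does not build parse trees.

```python
from collections import defaultdict

# Toy grammar in Chomsky Normal Form (hypothetical, for illustration only).
BINARY = {  # rules of the form A -> B C, indexed by (B, C)
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
LEXICAL = {  # rules of the form A -> word, indexed by word
    "the": {"Det"}, "a": {"Det"},
    "dog": {"N"}, "cat": {"N"},
    "chased": {"V"}, "saw": {"V"},
}

def cky_recognize(words, start="S"):
    """Return True if the grammar derives the word sequence from start."""
    n = len(words)
    # table[(i, j)] holds the nonterminals that derive words[i:j]
    table = defaultdict(set)
    for i, word in enumerate(words):
        table[(i, i + 1)] |= LEXICAL.get(word, set())
    for span in range(2, n + 1):          # width of the span
        for i in range(n - span + 1):     # left edge
            j = i + span                  # right edge
            for k in range(i + 1, j):     # split point
                for b in table[(i, k)]:
                    for c in table[(k, j)]:
                        table[(i, j)] |= BINARY.get((b, c), set())
    return start in table[(0, n)]

print(cky_recognize("the dog chased a cat".split()))
print(cky_recognize("dog the chased".split()))
```

The three nested loops over span, left edge, and split point are what give the algorithm its cubic running time in sentence length.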
- Week 12 (November 19, 21)
November 19: CKY Parsing. Earley Parsing (Top-down, chart-based). Doing parsing in NLTK (demo, if time).
Read: Chapter 13 from J&M.
Assignment#5 (Due on Wednesday, December 5): Click here for details.
Tutorial on Parsing in NLTK: Click here.
November 21: No class. Thanksgiving Break!
- Week 13 (November 26, 28)
November 26: Parsing with features using ATNs. Semantics: Meaning Representations. Introduction to First-Order Predicate Calculus (FOPC).
Read: Chapter 17 from J&M.
November 28: FOPC, contd. Lambda reduction. Computational Semantics: Syntax-driven semantic analysis. Using semantic attachments to CFG rules. Attachments for POS productions (proper nouns, nouns, verbs, adjectives, etc.). Using attachments to derive meaning from a parse tree: simple atomic wffs, wffs with quantifiers.
Read: Chapter 17 from J&M.
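Lambda reduction and semantic attachments can be imitated with ordinary Python functions, where applying a function plays the role of one reduction step. The mini-lexicon and the string format of the resulting formula are assumptions for illustration; this is not how the text or NLTK represents FOPC.

```python
# Lexical attachments (hypothetical mini-lexicon).
# A transitive verb takes its object first, then its subject.
likes = lambda obj: lambda subj: f"Likes({subj}, {obj})"
john, mary = "John", "Mary"

# VP -> V NP : apply the verb's attachment to the object's meaning.
vp = likes(mary)
# S -> NP VP : apply the VP's attachment to the subject's meaning.
s = vp(john)

print(s)  # Likes(John, Mary)
```

Each grammar rule contributes one function application, so the meaning of the sentence is assembled bottom-up along the parse tree.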
- Week 14 (December 3, 5)
December 3: Semantics of quantifiers in CFG attachments. Doing parsing and semantic analysis together.
Read: Chapter 18 from J&M.
December 5: No class. Deepak is out of town.
- Week 15 (December 10, 12)
December 10: Course Wrap up.
Slides: Click here.
December 12: Exam 3 is today.
Course Policies
Communication
Attendance and active participation are
expected in every class. Participation includes asking questions,
contributing answers, proposing ideas, and providing constructive
comments.
As you will discover, we are proponents of two-way communication
and we welcome feedback during the semester about the course. We
are available to answer student questions, listen to concerns, and
talk about any course-related topic (or otherwise!). Come to
office hours! This helps us get to know you. You are welcome to
stop by and chat. There are many more exciting topics to talk
about that we won't have time to cover in class.
Although computer science work can be intense and solitary, please
stay in touch with us, particularly if you feel stuck on a topic
or project and can't figure out how to proceed. Often a quick
e-mail, phone call or face-to-face conference can reveal solutions
to problems and generate renewed creative and scholarly energy. It
is essential that you begin assignments early, since we will be
covering a variety of challenging topics in this course.
Grading
All graded work will receive a grade of 4.0, 3.7, 3.3, 3.0, 2.7, 2.3, 2.0, 1.7, 1.3, 1.0, or 0.0. At the end of the semester, final grades will be calculated as a weighted average of all grades according to the following weights:
Exam 1: 20%
Exam 2: 20%
Exam 3: 20%
Labs & Written Work: 40%
Total: 100%
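As a worked example of the weighted average, the sketch below applies the weights above to a hypothetical set of component grades. The grades are invented, and how the resulting number maps to a final recorded grade is not specified here.

```python
# Weights taken from the table above.
WEIGHTS = {
    "Exam 1": 0.20,
    "Exam 2": 0.20,
    "Exam 3": 0.20,
    "Labs & Written Work": 0.40,
}

def final_grade(grades):
    """Weighted average of component grades on the 4.0 scale."""
    return sum(WEIGHTS[item] * grades[item] for item in WEIGHTS)

# Hypothetical student, for illustration only.
example = {"Exam 1": 3.7, "Exam 2": 3.3, "Exam 3": 4.0, "Labs & Written Work": 3.7}
print(round(final_grade(example), 2))  # 0.2 * (3.7 + 3.3 + 4.0) + 0.4 * 3.7 = 3.68
```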
Incomplete grades will be given only for verifiable medical
illness or other such dire circumstances.
Submission and Late Policy
All work must be turned in either in hard copy or by electronic
submission, depending on the instructions given in the
assignment. E-mail submissions, when permitted, should request
a "delivery receipt" to document the time and date of submission.
Extensions will be given only in the case of verifiable medical
excuses or other such dire circumstances, if requested in advance
and supported by your Academic Dean.
No assignment will be accepted after it is past due.
No past work can be "made up" after it is due.
No regrade requests will be entertained more than one week after the graded work is returned in class.
Exams
There will be three exams in this course. The exams will be
closed-book and closed-notes. The exams
will cover material from lectures, homework, and assigned
readings (including topics not discussed in class).
Links (To be updated)
The Language Computer Q&A demo
An online version of ELIZA
NLTK Home page
NLTK LITE Tutorials
NLTK LITE API Documentation
Created on May 4, 2018.