W1 - Machine Learning (ML) and Natural Language Processing (NLP)
For the first week, we were re-introduced to Machine Learning and Deep Learning, topics we had tackled in the two previous Elective courses. After the refresher on machine learning, we were introduced to the concept of Natural Language Processing. NLP is the field concerned with enabling machines to interpret natural language (textual) data and derive insights from it, typically through machine learning models.
The activity this week involved setting up our portfolios, as well as brief personal research on machine learning, deep learning, and natural language processing. It also involved expectation setting and a look ahead at how these techniques can be applied in our aspired careers.
W2 - Text Preprocessing
In the second week, we were introduced to text preprocessing techniques and how they help improve the quality of the data. From simple processes like lowercasing and special-character removal to more complex techniques like lemmatization and stop-word removal, the lectures outlined how to implement these steps, as well as how they affect the performance of the model.
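A minimal sketch of a preprocessing pipeline like the one described above, using only the standard library; the tiny stop-word list is an illustrative assumption, not the full list a library like NLTK or spaCy would provide:

```python
import re

# Illustrative stop-word list; real pipelines use a much larger set
STOP_WORDS = {"the", "a", "an", "is", "and", "to", "of"}

def preprocess(text: str) -> list[str]:
    # Lowercase, then strip special characters (keep letters and spaces)
    text = re.sub(r"[^a-z\s]", "", text.lower())
    # Tokenize on whitespace and drop stop words
    return [tok for tok in text.split() if tok not in STOP_WORDS]

print(preprocess("The quick, brown fox is jumping!"))
```

Lemmatization is deliberately left out here, since it needs a vocabulary resource (e.g. WordNet) rather than simple string rules.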
For the activity, we were tasked to apply these text preprocessing techniques to a text dataset, as well as to identify a kind of noise unique to that dataset and implement a custom solution for it.
W3 - Text Representation
For the third week of the semester, we were introduced to ways of representing text data for machine learning models. The three main ones taught to us were bag-of-words, which measures the frequency of each word in a document; TF-IDF, which reduces the score of frequent but unimportant words; and word embeddings, which take the surrounding words into account to provide context for a given term.
The activity this week involved implementing these three representation methods, along with simpler ones like one-hot encoding. The libraries we utilized were ones we were already familiar with, like NumPy and scikit-learn.
W4 & W5 - Named Entity Recognition and spaCy library
Week four introduced us to named entity recognition (NER), which is the task of locating named entities in a document and labelling them with their appropriate entity types. Alongside it was an introduction to spaCy, a powerful Python library that implements NLP processes in a clean and robust manner. The library is said to be widely used in industry, and learning it can help us understand how NLP is implemented in projects outside the academe.
The activity spanned three weeks and involved using spaCy to implement named entity recognition on three documents from a chosen domain and dataset. It let us explore how spaCy works and how it simplifies the implementation of several NLP techniques.
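A minimal sketch of NER in spaCy; to stay self-contained it uses a blank pipeline with a rule-based EntityRuler rather than a pretrained statistical model like en_core_web_sm (which the activity more likely used, but which requires a separate download). The example text and patterns are made up for illustration:

```python
import spacy

# Blank English pipeline with a rule-based entity recognizer
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "spaCy"},   # label the library name as an organization
    {"label": "GPE", "pattern": "Manila"},  # label the city as a geopolitical entity
])

doc = nlp("We used spaCy for our NER project in Manila.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```

With a pretrained model the loop would be identical; only the pipeline construction changes to `spacy.load("en_core_web_sm")`, and the entities come from a statistical tagger instead of hand-written patterns.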
W6 - Prelim Term Examination
The sixth week was reserved primarily for examinations. The day involved two tests: a long quiz about the spaCy and NER activity, and the prelim examination, which covered our overall learnings about NLP, text preprocessing, and text representation.