Word Vectors with Tidy Data Principles

Last week I saw Chris Moody’s post on the Stitch Fix blog about calculating word vectors from a corpus of text using word counts and matrix factorization, and I was so excited! This blog post illustrates how to implement that approach to find word vector representations in R using tidy data principles and sparse matrices. Word vectors, […]
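The post's actual implementation uses R and tidy data principles; as a language-agnostic sketch of the counts-to-PMI step it describes, here is a minimal Python version. The toy corpus and the simplistic sentence-window co-occurrence are assumptions for illustration, not the post's data or windowing scheme:

```python
from collections import Counter
from itertools import combinations
import math

# Toy corpus standing in for a real one; the recipe is the same:
# co-occurrence counts -> PMI matrix -> matrix factorization.
corpus = [
    "cats chase mice",
    "dogs chase cats",
    "mice fear cats",
    "dogs fear nothing",
]

word_counts = Counter()
pair_counts = Counter()
for doc in corpus:
    tokens = doc.split()
    word_counts.update(tokens)
    # treat every unordered pair within the same sentence as one co-occurrence
    for a, b in combinations(tokens, 2):
        pair_counts[frozenset((a, b))] += 1

total_words = sum(word_counts.values())
total_pairs = sum(pair_counts.values())

def ppmi(w1, w2):
    """Positive pointwise mutual information of two words:
    max(0, log p(w1, w2) / (p(w1) p(w2)))."""
    p_pair = pair_counts[frozenset((w1, w2))] / total_pairs
    if p_pair == 0:
        return 0.0
    p1 = word_counts[w1] / total_words
    p2 = word_counts[w2] / total_words
    return max(0.0, math.log(p_pair / (p1 * p2)))
```

Filling a (sparse) matrix with these PPMI values and factorizing it, e.g. with truncated SVD, yields dense word vectors, which is the step the post performs with sparse-matrix tools in R.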

Understanding Gender Roles in Movies with Text Mining

I have a new visual essay up at The Pudding, using text mining to explore how women are portrayed in film. In April 2016, we broke down film dialogue by gender. The essay presented an imbalance in which men delivered more lines than women across 2,000 screenplays. But quantity of lines is only part of the story. What characters […]

Word embeddings in 2017: Trends and future directions

Table of contents: subword-level embeddings; OOV handling; evaluation; multi-sense embeddings; beyond words as points; phrases and multi-word expressions; bias; temporal dimension; lack of theoretical understanding; task and domain-specific embeddings; embeddings for multiple languages; embeddings based on other contexts. The word2vec method based on skip-gram with negative sampling (Mikolov et al., 2013) [49] was published in […]
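For reference, the skip-gram-with-negative-sampling objective the excerpt mentions can be sketched per training pair: maximize the probability of the observed (center, context) pair while pushing down k randomly sampled negative contexts. The toy vectors and learning rate below are assumed values for illustration only:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_loss(v_center, u_pos, u_negs):
    """Negative log-likelihood for one (center, context) pair
    plus k sampled negative contexts:
    -log sigma(u_pos . v) - sum_n log sigma(-u_n . v)."""
    loss = -math.log(sigmoid(dot(v_center, u_pos)))
    for u_neg in u_negs:
        loss -= math.log(sigmoid(-dot(v_center, u_neg)))
    return loss

def sgd_step(v_center, u_pos, u_negs, lr=0.1):
    """One gradient-descent step on the center vector only."""
    grad = [(sigmoid(dot(v_center, u_pos)) - 1.0) * x for x in u_pos]
    for u_neg in u_negs:
        s = sigmoid(dot(v_center, u_neg))
        grad = [g + s * x for g, x in zip(grad, u_neg)]
    return [v - lr * g for v, g in zip(v_center, grad)]

# toy example: one positive context and two sampled negatives
v_center = [0.1, -0.2, 0.3]
u_pos = [0.2, 0.1, -0.1]
u_negs = [[-0.3, 0.2, 0.1], [0.1, 0.1, 0.1]]
loss_before = sgns_loss(v_center, u_pos, u_negs)
loss_after = sgns_loss(sgd_step(v_center, u_pos, u_negs), u_pos, u_negs)
```

A real implementation updates the context vectors too and draws negatives from a smoothed unigram distribution; this sketch only shows the per-pair objective.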

Introduction to Natural Language Processing with NLTK

Hello all and welcome to the second of the series – NLP with NLTK. The first of the series can be found here, in case you missed it. In this article we will talk about basic NLP concepts and use NLTK to implement them. Contents: corpus; tokenization/segmentation; frequency distribution; conditional frequency distribution; normalization; Zipf’s law […]
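The frequency-distribution idea the contents list mentions can be sketched without NLTK: NLTK's `FreqDist` behaves much like a standard `Counter`. The sample text and the crude regex tokenizer below are stand-ins (assumptions) for NLTK's corpora and `word_tokenize`:

```python
from collections import Counter
import re

text = ("the quick brown fox jumps over the lazy dog "
        "the dog barks and the fox runs")

# crude lowercase word tokenization, standing in for nltk.word_tokenize
tokens = re.findall(r"[a-z]+", text.lower())

# frequency distribution; nltk.FreqDist exposes the same most_common view
freq = Counter(tokens)

# Zipf's law: when words are sorted by frequency, rank * frequency
# stays roughly constant, so a handful of words dominate the counts
ranked = freq.most_common()
```

A conditional frequency distribution extends this by keeping one such counter per condition (e.g. per genre or per author), which is what NLTK's `ConditionalFreqDist` does.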

ConceptNet 5.5 and conceptnet.io

ConceptNet is a large, multilingual knowledge graph about what words mean. This is background knowledge that’s very important in NLP and machine learning, and it remains relevant in a time when the typical thing to do is to shove a terabyte or so of text through a neural net. We’ve shown that ConceptNet provides information […]

On Building a “Fake News” Classification Model *update

“A lie gets halfway around the world before the truth has a chance to get its pants on.” – Winston Churchill Since the 2016 presidential election, one topic dominating political discourse has been the issue of “Fake News”. A number of political pundits claim that the rise of significantly biased and/or untrue news influenced the election, though a study by researchers […]

How Twitter Reacted to the Academy Awards and Its Crazy Ending

Award-show-night Twitter is a special genre of tweeting. It’s a several-hour-long affair where Twitter users perform just as much for their followers as the award recipients do. It’s a cutthroat competition to beat your fellow tweeters to the funniest jokes and best takes for every moment of the night. To some, the tweets are the event’s main draw and […]

Introduction to NLP with NLTK – Part 1

Introduction: The idea of using a structured programming language to interact with computers is being challenged by Natural Language Processing (NLP) and Natural Language Understanding methods. NLP holds great promise of making computer interfaces accessible to a wide range of audiences – humans would be able to talk to computers in their own native […]

Here’s What Twitter Was Like During the Super Bowl.

The Patriots’ 34-28 win in Super Bowl 51 was, quite possibly, one of the greatest football games of all time. It featured the largest Super Bowl comeback of all time and was the first ever to go to overtime. For data-minded folks, the game exhibited striking parallels to the election. ESPN’s live prediction model was saying that Atlanta was almost certain to […]