Creating a Text Classification Program: A Tutorial on Natural Language Processing

Deep learning, a crucial element of machine learning workflows, leverages advancements in parallel computing and tools to make sophisticated neural networks achievable.

The availability of user-friendly libraries like Tensorflow, Torch, and Deeplearning4j has democratized development, extending it beyond academic and research-focused settings. This widespread adoption is evident in companies like Huawei and Apple, which are incorporating dedicated deep learning processors into their latest devices.

Deep learning’s impact spans various domains. Notably, Google’s AlphaGo triumphed over human Go players, a game previously considered too complex for computers to master. Sony’s Flow Machines project employs a neural network to generate music in the style of renowned composers. Additionally, Apple’s FaceID utilizes deep learning for facial recognition and tracking changes in a user’s appearance over time.

This article explores the application of deep learning to natural language processing and wine. The objective is to construct a model that can interpret wine reviews written by experts and accurately identify the type of wine being reviewed.

Deep Learning in NLP

Deep learning’s strength in natural language processing (NLP) lies in its ability to learn the intricate structure of sentences and the semantic relationships between words. For instance, state-of-the-art sentiment analysis techniques rely on deep learning to grasp challenging linguistic concepts like negations and mixed sentiments.

Deep learning offers several advantages over other NLP algorithms:

Adaptable Models: Deep learning models are more adaptable than traditional ML models, enabling easy experimentation with different architectures by adding or removing layers. They also accommodate flexible outputs, making them suitable for understanding intricate linguistic structures and developing applications like translation tools, chatbots, and text-to-speech systems.
Reduced Domain Expertise: While some domain knowledge is beneficial, deep learning algorithms’ capacity to independently learn feature hierarchies reduces the need for extensive expertise in the specific problem area. This is particularly advantageous in the intricate field of natural language.
Simplified Continuous Learning: Updating deep learning algorithms with new data is straightforward. Unlike some machine learning algorithms that require processing the entire dataset for updates, deep learning models can readily adapt to live, large datasets.

Today’s Challenge

Our aim is to create a deep learning algorithm capable of determining the wine varietal being reviewed solely from the review’s text. The dataset used is the wine magazine dataset found at https://www.kaggle.com/zynicide/wine-reviews, courtesy of Kaggle user zackthoutt.

The fundamental question is whether we can train a neural network to analyze a wine review, such as:

Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn’t overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.

…and accurately classify it as a white blend. Experienced wine enthusiasts might pick up on clues like apple, citrus, and prominent acidity that suggest a white wine, but can our neural network be trained to do the same? Moreover, can it distinguish between a white blend review and a Pinot Grigio review?

Comparable Algorithms

This task can be classified as an NLP classification problem. Several existing NLP classification algorithms address various NLP challenges. For example, naive Bayes are used in spam detection algorithms, while support vector machines (SVM) classify texts like progress notes in healthcare. Implementing a basic version of these algorithms would provide a useful baseline for our deep learning model.

Naive Bayes

A common implementation of naive Bayes for NLP involves preprocessing text using TF-IDF and then applying the multinomial naive Bayes algorithm to the preprocessed data. This focuses the algorithm on the most significant words in a document. We can implement naive Bayes as follows:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer

df = pd.read_csv('data/wine_data.csv')

counter = Counter(df['variety'].tolist())
top_10_varieties = {i[0]: idx for idx, i in enumerate(counter.most_common(10))}
df = df[df['variety'].map(lambda x: x in top_10_varieties)]

description_list = df['description'].tolist()
varietal_list = [top_10_varieties[i] for i in df['variety'].tolist()]
varietal_list = np.array(varietal_list)

count_vect = CountVectorizer()
x_train_counts = count_vect.fit_transform(description_list)


tfidf_transformer = TfidfTransformer()
x_train_tfidf = tfidf_transformer.fit_transform(x_train_counts)

train_x, test_x, train_y, test_y = train_test_split(x_train_tfidf, varietal_list, test_size=0.3)

clf = MultinomialNB().fit(train_x, train_y)
y_score = clf.predict(test_x)

n_right = 0
for i in range(len(y_score)):
    if y_score[i] == test_y[i]:
        n_right += 1

print("Accuracy: %.2f%%" % ((n_right/float(len(test_y)) * 100)))

Executing this code should yield a result similar to: 73.56%

This is a promising outcome considering there are ten classes.

We can also assess the performance of a support vector machine. To do so, replace the classifier definition with:

1
 clf = SVC(kernel='linear').fit(train_x, train_y)

Running this should produce the following output:

Accuracy: 80.66%

A respectable result as well.

The next step is to attempt to develop a deep learning model that can match or even exceed these results. Success in this endeavor would strongly indicate our deep learning model’s effectiveness in replicating the performance of established machine learning models enriched with domain knowledge.

Model Construction

We will utilize Keras with Tensorflow to create our model. Keras, a Python library, simplifies the development of deep learning models compared to the lower-level Tensorflow API. Along with dense layers, we’ll incorporate embedding and convolutional layers to enable the model to learn both the underlying semantic meaning of words and potential structural patterns within the data.

Data Preparation

First, we need to restructure the data to facilitate processing by our neural network. This involves replacing words with unique numerical identifiers. By combining this with an embedding vector, we can represent words in a way that is both adaptable and semantically aware.

In practice, a more refined preprocessing approach is desirable. It would be beneficial to concentrate on frequently used words while filtering out the most common ones (e.g., “the,” “this,” “a”).

This functionality can be implemented using Defaultdict and NLTK. The following code should be placed in a separate Python module. For this example, it will be in lib/get_top_x_words.py.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
from nltk import word_tokenize
from collections import defaultdict

def count_top_x_words(corpus, top_x, skip_top_n):
    count = defaultdict(lambda: 0)
    for c in corpus:
        for w in word_tokenize(c):
            count[w] += 1
    count_tuples = sorted([(w, c) for w, c in count.items()], key=lambda x: x[1], reverse=True)
    return [i[0] for i in count_tuples[skip_top_n: skip_top_n + top_x]]


def replace_top_x_words_with_vectors(corpus, top_x):
    topx_dict = {top_x[i]: i for i in range(len(top_x))}

    return [
        [topx_dict[w] for w in word_tokenize(s) if w in topx_dict]
        for s in corpus
    ], topx_dict


def filter_to_top_x(corpus, n_top, skip_n_top=0):
    top_x = count_top_x_words(corpus, n_top, skip_n_top)
    return replace_top_x_words_with_vectors(corpus, top_x)

With this, we can proceed to build the model. We aim to incorporate an embedding layer, a convolutional layer, and a dense layer to harness all relevant deep learning features. Keras simplifies this process considerably:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
from keras.models import Sequential
from keras.layers import Dense, Conv1D, Flatten
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.utils import to_categorical
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
from lib.get_top_xwords import filter_to_top_x

df = pd.read_csv('data/wine_data.csv')

counter = Counter(df['variety'].tolist())
top_10_varieties = {i[0]: idx for idx, i in enumerate(counter.most_common(10))}
df = df[df['variety'].map(lambda x: x in top_10_varieties)]

description_list = df['description'].tolist()
mapped_list, word_list = filter_to_top_x(description_list, 2500, 10)
varietal_list_o = [top_10_varieties[i] for i in df['variety'].tolist()]
varietal_list = to_categorical(varietal_list_o)

max_review_length = 150

mapped_list = sequence.pad_sequences(mapped_list, maxlen=max_review_length)
train_x, test_x, train_y, test_y = train_test_split(mapped_list, varietal_list, test_size=0.3)

max_review_length = 150

embedding_vector_length = 64
model = Sequential()

model.add(Embedding(2500, embedding_vector_length, input_length=max_review_length))
model.add(Conv1D(50, 5))
model.add(Flatten())
model.add(Dense(100, activation='relu'))
model.add(Dense(max(varietal_list_o) + 1, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train_x, train_y, epochs=3, batch_size=64)

y_score = model.predict(test_x)
y_score = [[1 if i == max(sc) else 0 for i in sc] for sc in y_score]
n_right = 0
for i in range(len(y_score)):
    if all(y_score[i][j] == test_y[i][j] for j in range(len(y_score[i]))):
        n_right += 1

print("Accuracy: %.2f%%" % ((n_right/float(len(test_y)) * 100)))

Running the code should result in the following output:

Accuracy: 77.20%

Recall that the accuracy for naive Bayes and SVC was 73.56% and 80.66%, respectively. This demonstrates that our neural network is performing competitively against some of the more prevalent text classification methods.

Conclusion

This article explored the creation of a classification deep learning model for analyzing wine reviews.

We successfully built a model capable of competing with and surpassing the performance of some other machine learning algorithms. The hope is that this information inspires the development of applications that analyze more complex datasets and produce more sophisticated outputs.

Note: The code used for this article can be found on GitHub.