Text Analysis and Natural Language Processing

Natural Language Processing (NLP) is about enabling computers to understand and process human language. Here we explore the basics of NLP using Python and libraries such as NLTK, spaCy, and Scikit-learn; the examples below use NLTK and Scikit-learn.

What is NLP?

NLP is used to analyze text data and can be applied to tasks such as:

Tokenization: Splitting text into words or sentences.
Stop word removal: Filtering out common words (e.g. "and", "is").
Sentiment analysis: Deciding whether a text is positive, negative, or neutral.
Text classification: Assigning texts to categories (e.g. spam vs. non-spam).

Step 1: Tokenization

Tokenization means splitting text into smaller units, such as words or sentences.

    import nltk
    from nltk.tokenize import word_tokenize, sent_tokenize

    # Download the tokenizer models once
    # (newer NLTK releases look for 'punkt_tab', older ones for 'punkt')
    nltk.download('punkt')
    nltk.download('punkt_tab')

    # Example text
    text = "Natural Language Processing is amazing! It allows machines to understand text."

    # Tokenize into sentences
    sentences = sent_tokenize(text)
    print("Sentences:", sentences)

    # Tokenize into words
    words = word_tokenize(text)
    print("Words:", words)

Output:

Sentences: ['Natural Language Processing is amazing!', 'It allows machines to understand text.']
Words: ['Natural', 'Language', 'Processing', 'is', 'amazing', '!', 'It', 'allows', 'machines', 'to', 'understand', 'text', '.']

Method	Description
`sent_tokenize`	Splits text into sentences
`word_tokenize`	Splits text into words
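
To make the idea behind these two functions concrete, here is a minimal regex-based sketch that uses only the standard library. The helper names simple_sent_tokenize and simple_word_tokenize are made up for this illustration and are not part of NLTK; NLTK's tokenizers handle abbreviations, contractions, and other edge cases that this sketch does not.

```python
import re

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

text = "Natural Language Processing is amazing! It allows machines to understand text."
sentences = simple_sent_tokenize(text)
words = simple_word_tokenize(text)
print("Sentences:", sentences)
print("Words:", words)
```

On this example text the sketch produces the same sentences and tokens as the NLTK output above. It breaks down quickly on harder input, though: "See e.g. the docs." is split after "e.g." because the rule only looks at punctuation followed by whitespace.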

Step 2: Stop word removal

Stop words are common words that carry little meaning on their own and are often filtered out.

    # Continues from Step 1 (uses nltk and words defined there)
    from nltk.corpus import stopwords

    # Load English stopwords
    nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))

    # Filter out stopwords
    filtered_words = [word for word in words if word.lower() not in stop_words]
    print("Filtered Words:", filtered_words)

Output:

Filtered Words: ['Natural', 'Language', 'Processing', 'amazing', '!', 'allows', 'machines', 'understand', 'text', '.']
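
The filtered list above still contains the punctuation tokens "!" and ".". If those are not wanted either, one option is to keep only alphabetic tokens. The sketch below assumes a tiny hand-written stop word set (STOP_WORDS, hypothetical, standing in for NLTK's much longer English list) so that it runs without any downloads; the function name is also made up for this illustration.

```python
# Hypothetical mini stop word list; NLTK's English list is far longer.
STOP_WORDS = {"is", "it", "to", "the", "and", "a"}

def remove_stopwords_and_punct(tokens, stop_words=STOP_WORDS):
    """Keep tokens that are purely alphabetic and not stop words."""
    return [t for t in tokens if t.isalpha() and t.lower() not in stop_words]

tokens = ['Natural', 'Language', 'Processing', 'is', 'amazing', '!',
          'It', 'allows', 'machines', 'to', 'understand', 'text', '.']
content_words = remove_stopwords_and_punct(tokens)
print("Content words:", content_words)
```

Note that str.isalpha() also discards tokens that contain digits, so a looser test such as str.isalnum() may be a better fit when numbers matter.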

Step 3: Lemmatization

Lemmatization means reducing words to their base (dictionary) forms, for example "machines" to "machine".

    # Continues from Step 2 (uses nltk and filtered_words defined there)
    from nltk.stem import WordNetLemmatizer

    # Initialize lemmatizer
    nltk.download('wordnet')
    lemmatizer = WordNetLemmatizer()

    # Lemmatize words
    lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
    print("Lemmatized Words:", lemmatized_words)

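Output:

Lemmatized Words: ['Natural', 'Language', 'Processing', 'amazing', '!', 'allows', 'machine', 'understand', 'text', '.']

The noun "machines" is reduced to "machine", but "allows" is unchanged because lemmatize() treats every word as a noun unless a part of speech is given; lemmatizer.lemmatize("allows", pos="v") returns "allow".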
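
WordNetLemmatizer.lemmatize accepts an optional pos argument and assumes nouns ("n") when it is omitted, which affects which forms get reduced. The toy lookup below illustrates only that idea. The LEMMA_TABLE dictionary and the toy_lemmatize function are invented for this sketch and do not reflect how WordNet stores or searches its data.

```python
# Hypothetical lookup table keyed by (word, part of speech).
LEMMA_TABLE = {
    ("machines", "n"): "machine",
    ("allows", "v"): "allow",
    ("understood", "v"): "understand",
}

def toy_lemmatize(word, pos="n"):
    """Return the base form for (word, pos), or the word unchanged if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("machines"))         # machine
print(toy_lemmatize("allows"))           # allows (no noun entry, so unchanged)
print(toy_lemmatize("allows", pos="v"))  # allow
```

In a real pipeline the part of speech would typically come from a tagger rather than being passed by hand.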

Step 4: Text classification

Here we classify text as positive or negative using Scikit-learn.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score

    # Example dataset
    texts = ["I love this product", "This is the worst!", "Absolutely amazing!",
             "Not bad, but not great", "Horrible experience"]
    labels = [1, 0, 1, 1, 0]  # 1 = Positive, 0 = Negative

    # Vectorize text data
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)

    # Split data; stratify so that both classes appear in the training set
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.3, random_state=42, stratify=labels)

    # Train model
    model = MultinomialNB()
    model.fit(X_train, y_train)

    # Test model
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy:.2f}")

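Output:

The script prints a single line of the form "Accuracy: N.NN". The exact value depends on how the five examples are split. With only two test examples it can only be 0.00, 0.50, or 1.00, so it should be read as a smoke test rather than a measure of model quality.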

Component	Description
`CountVectorizer`	Converts text into numeric word-count vectors
`MultinomialNB`	Naive Bayes algorithm for classification
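
Conceptually, CountVectorizer builds a vocabulary from the texts and represents each text as a vector of word counts (a bag of words). The following standard-library sketch shows that idea on two of the example sentences. The function bag_of_words is made up for this illustration; it imitates Scikit-learn's default behaviour of lowercasing and keeping tokens of two or more word characters, but returns plain lists rather than a sparse matrix.

```python
import re
from collections import Counter

def bag_of_words(texts):
    """Return (vocabulary, count_vectors) for a list of strings."""
    # Lowercase and keep tokens of at least two word characters.
    tokenized = [re.findall(r"\b\w\w+\b", t.lower()) for t in texts]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(["I love this product", "This is the worst!"])
print(vocabulary)  # ['is', 'love', 'product', 'the', 'this', 'worst']
print(vectors)     # [[0, 1, 1, 0, 1, 0], [1, 0, 0, 1, 1, 1]]
```

Each column corresponds to one vocabulary word, and word order within a sentence is lost. The single-letter token "I" is dropped, as it is with CountVectorizer's default token pattern.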

Complete code

Here we combine all the steps to analyze text.

    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score

    # Example data
    texts = ["I love this product", "This is the worst!", "Absolutely amazing!",
             "Not bad, but not great", "Horrible experience"]
    labels = [1, 0, 1, 1, 0]  # 1 = Positive, 0 = Negative

    # Preprocessing resources
    # (newer NLTK releases look for 'punkt_tab', older ones for 'punkt')
    nltk.download('punkt')
    nltk.download('punkt_tab')
    nltk.download('stopwords')
    nltk.download('wordnet')
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()

    def preprocess(text):
        """Tokenize, drop stop words, lemmatize, and join back into one string."""
        words = word_tokenize(text)
        words = [word for word in words if word.lower() not in stop_words]
        words = [lemmatizer.lemmatize(word) for word in words]
        return " ".join(words)

    processed_texts = [preprocess(text) for text in texts]

    # Vectorization and classification
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(processed_texts)
    # Stratify so that both classes appear in the training set
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.3, random_state=42, stratify=labels)
    model = MultinomialNB()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Evaluation
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy:.2f}")

Tips and common pitfalls

Tips

Use CountVectorizer or TfidfVectorizer to convert text into numeric features.
Start with small datasets before handling large amounts of text.
Preprocess the text carefully for better results.
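
TfidfVectorizer goes one step beyond raw counts by down-weighting words that occur in many documents. The sketch below computes such weights with the standard library only. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation, which is what Scikit-learn documents as its default; the function name tfidf and the pre-tokenized input are choices made for this illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns (vocabulary, L2-normalised TF-IDF rows)."""
    n = len(docs)
    vocabulary = sorted({tok for doc in docs for tok in doc})
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocabulary, rows

docs = [["this", "product", "is", "great"],
        ["this", "product", "is", "horrible"]]
vocabulary, rows = tfidf(docs)
for term, weight in zip(vocabulary, rows[0]):
    print(f"{term}: {weight:.3f}")
```

In the first document the distinguishing word "great" receives a higher weight than "this", "product", and "is", which both documents share.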

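Common pitfalls

Stop words in classification: Filtering out irrelevant words can improve results, but check the list first; NLTK's English list includes negations such as "not", which carry meaning in sentiment tasks.
Imbalanced data: Make sure the classes are reasonably balanced, or account for the imbalance, so that results are not skewed toward the majority class.
Overfitting: Avoid fitting the model too closely to the training data, and always evaluate on data it has not seen.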
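
A quick way to guard against the imbalanced-data pitfall is to look at the label distribution and at the accuracy a trivial classifier would reach by always predicting the most common label. The sketch below does this for the labels from the classification example; majority_baseline is a helper name made up for this illustration.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most_common_label, accuracy of always predicting it)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

labels = [1, 0, 1, 1, 0]  # 1 = Positive, 0 = Negative
print("Distribution:", dict(Counter(labels)))
label, baseline = majority_baseline(labels)
print(f"Always predicting {label} gives accuracy {baseline:.2f}")
```

A trained model is only useful if it clearly beats this baseline on held-out data.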

Summary

Natural Language Processing is a powerful technique for processing and analyzing text data. By combining tokenization, lemmatization, and text classification you can solve many interesting text analysis problems. Start small, experiment with different techniques, and gradually build more advanced solutions!
