
🐍 Detecting AI-Generated Text with Python: Techniques Experts Use

Hey there! Ready to dive into detecting AI-generated text with Python? This friendly guide walks you through it step by step with easy-to-follow examples. Perfect for beginners and pros alike!

SuperML Team

🚀 Introduction to AI-Generated Text Detection - Made Simple!

Detecting AI-generated text has become increasingly important in the digital age. This guide covers several Python-based techniques for identifying machine-generated content, from simple statistical methods to more advanced machine learning approaches.

Here’s a handy trick you’ll love! Here’s how we can tackle this:

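import nltk
from nltk.tokenize import word_tokenize

# One-time download of the tokenizer data (newer NLTK releases use 'punkt_tab')
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

def tokenize_text(text):
    # Split the text into word and punctuation tokens
    return word_tokenize(text)

sample_text = "This is a sample text that could be human or AI-generated."
tokens = tokenize_text(sample_text)
print(f"Tokenized text: {tokens}")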

🚀 Basic Statistical Analysis - Made Simple!

One of the simplest methods to detect AI-generated text is through statistical analysis of word frequencies and sentence structures. AI models often have distinct patterns in their output that differ from human writing.

This next part is really neat! Here’s how we can tackle this:

from collections import Counter

def analyze_word_frequency(tokens):
    return Counter(tokens)

word_freq = analyze_word_frequency(tokens)
print(f"Word frequencies: {word_freq}")
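
Raw counts grow with text length, so comparing two texts usually works better with relative frequencies. Here is a minimal sketch of that idea, assuming a simple lowercase regex tokenizer instead of the NLTK tokenizer above. The helper names `simple_tokens`, `relative_frequencies`, and `type_token_ratio` are illustrative and not part of any library.

```python
import re
from collections import Counter

def simple_tokens(text):
    # Lowercased word tokens; a simplification of word_tokenize (assumption)
    return re.findall(r"[a-z']+", text.lower())

def relative_frequencies(tokens):
    # Share of the text taken up by each word, comparable across text lengths
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def type_token_ratio(tokens):
    # Distinct words divided by total words; 0.0 for empty input
    return len(set(tokens)) / len(tokens) if tokens else 0.0

example = "the cat sat on the mat and the dog sat too"
toks = simple_tokens(example)
rel = relative_frequencies(toks)

# Three most frequent words with their share of the text
print(sorted(rel.items(), key=lambda kv: -kv[1])[:3])
print(f"Type-token ratio: {type_token_ratio(toks):.2f}")
```

A frequency profile like this is only a weak signal on its own, so treat it as one input among several.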

🚀 Perplexity Calculation - Made Simple!

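Perplexity measures how well a probability model predicts a sample. Lower perplexity means the model finds the text more predictable, which suggests the text resembles what that model would generate. The n-gram version below is only a rough proxy: it builds the distribution from the text itself, so it mostly reflects how repetitive the text is.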

Let’s break this down together! Here’s how we can tackle this:

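import math
import nltk
from nltk import ngrams

def calculate_perplexity(text, n=2):
    # Build the n-grams and count how often each one occurs
    n_grams = list(ngrams(text.split(), n))
    N = len(n_grams)
    if N == 0:
        return float("nan")  # text is shorter than n tokens
    freq_dist = nltk.FreqDist(n_grams)
    # Shannon entropy (in bits) of the empirical n-gram distribution,
    # using probabilities (count / N) rather than raw counts
    entropy = -sum((c / N) * math.log2(c / N) for c in freq_dist.values())
    # Perplexity is 2 raised to the entropy
    return 2 ** entropy

text = "This is a sample text for perplexity calculation."
perplexity = calculate_perplexity(text)
print(f"Perplexity: {perplexity}")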
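
To see perplexity as a measure of how well one model predicts a different sample, it helps to train the n-gram counts on a reference text and score a separate one. The sketch below is one way to do that, assuming a bigram model with add-one smoothing and whitespace tokenization. The function names and the tiny reference corpus are made up for illustration.

```python
import math
from collections import Counter

def train_bigram_model(tokens):
    # Count unigrams and bigrams in the reference text
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams, len(unigrams)

def bigram_perplexity(model, tokens):
    unigrams, bigrams, vocab_size = model
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return float("nan")  # need at least two tokens
    log_prob = 0.0
    for prev, word in pairs:
        # Add-one smoothing so unseen bigrams get a small non-zero probability;
        # the extra +1 in the denominator leaves room for unknown words
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size + 1)
        log_prob += math.log2(p)
    # Perplexity = 2 ** (average negative log2 probability per bigram)
    return 2 ** (-log_prob / len(pairs))

reference = "the cat sat on the mat . the dog sat on the rug .".split()
model = train_bigram_model(reference)

seen = "the cat sat on the rug .".split()       # follows the reference patterns
unseen = "rug the on sat cat mat the".split()   # same words, unfamiliar order

print(f"Familiar text:   {bigram_perplexity(model, seen):.2f}")
print(f"Unfamiliar text: {bigram_perplexity(model, unseen):.2f}")
```

The familiar text scores lower because the model has already seen its word pairs. A real detector would need a much larger reference corpus.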

🚀 Burstiness Analysis - Made Simple!

Burstiness refers to the phenomenon where certain words or phrases appear in clusters rather than being evenly distributed throughout the text. Human-written text often exhibits more burstiness than AI-generated content.

Here’s a handy trick you’ll love! Here’s how we can tackle this:

import numpy as np

def calculate_burstiness(tokens):
    word_positions = {}
    for i, word in enumerate(tokens):
        if word not in word_positions:
            word_positions[word] = []
        word_positions[word].append(i)
    
    burstiness_scores = {}
    for word, positions in word_positions.items():
        if len(positions) > 1:
            gaps = np.diff(positions)
            burstiness = np.std(gaps) / np.mean(gaps)
            burstiness_scores[word] = burstiness
    
    return burstiness_scores

burstiness = calculate_burstiness(tokens)
print(f"Burstiness scores: {burstiness}")
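
The ratio above has no upper bound, which makes scores hard to compare across words. One common normalization is the burstiness parameter B = (σ − μ) / (σ + μ), computed over the gaps between occurrences. It is close to -1 for perfectly regular spacing and moves toward 1 for heavily clustered spacing. Here is a minimal sketch, assuming positions are given as sorted token indices. The function name `burstiness_parameter` is illustrative.

```python
import statistics

def burstiness_parameter(positions):
    # Gaps between consecutive occurrences of the same word
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if len(gaps) < 2:
        return None  # need at least three occurrences for a meaningful spread
    mean = statistics.mean(gaps)
    std = statistics.pstdev(gaps)
    # B = (sigma - mu) / (sigma + mu); mean is positive for sorted, distinct positions
    return (std - mean) / (std + mean)

regular = [0, 10, 20, 30, 40]   # evenly spaced occurrences
clustered = [0, 1, 2, 3, 40]    # a tight cluster followed by a long gap

print(f"Regular:   {burstiness_parameter(regular):.3f}")
print(f"Clustered: {burstiness_parameter(clustered):.3f}")
```

Words with only one or two occurrences return `None` here, since a single gap says nothing about spread.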

🚀 Sentiment Analysis - Made Simple!

AI-generated text may have different sentiment patterns compared to human-written text. Analyzing sentiment can provide insights into the nature of the content.

Here’s where it gets exciting! Here’s how we can tackle this:

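import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# One-time download of the VADER lexicon used by the analyzer
nltk.download('vader_lexicon', quiet=True)

def analyze_sentiment(text):
    # Returns negative, neutral, positive, and compound scores
    sia = SentimentIntensityAnalyzer()
    return sia.polarity_scores(text)

sentiment = analyze_sentiment(sample_text)
print(f"Sentiment analysis: {sentiment}")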

🚀 Named Entity Recognition - Made Simple!

AI models might struggle with consistently and accurately using named entities. Analyzing the presence and usage of named entities can help identify AI-generated text.

Here’s a handy trick you’ll love! Here’s how we can tackle this:

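from nltk import ne_chunk, pos_tag
from nltk.tokenize import word_tokenize

# Requires NLTK data for the POS tagger, the named-entity chunker, and the
# 'words' corpus (exact resource names vary by NLTK version)

def extract_named_entities(text):
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)          # part-of-speech tag for each token
    named_entities = ne_chunk(pos_tags)  # tree with entity chunks labeled
    return named_entities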

ner_result = extract_named_entities(sample_text)
print(f"Named entities: {ner_result}")

🚀 Text Coherence Analysis - Made Simple!

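Coherence, meaning how smoothly each sentence follows from the previous one, can differ between human-written and AI-generated text. Measuring the similarity between consecutive sentences gives a rough coherence signal that can help identify AI-generated content.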

Let’s make this super clear! Here’s how we can tackle this:

from gensim.models import Word2Vec
import numpy as np

def analyze_coherence(sentences):
    # Train a simple Word2Vec model
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
    
    # Calculate sentence embeddings
    sentence_embeddings = []
    for sentence in sentences:
        words = [word for word in sentence if word in model.wv]
        if words:
            sentence_embedding = np.mean([model.wv[word] for word in words], axis=0)
            sentence_embeddings.append(sentence_embedding)
    
    # Calculate cosine similarity between consecutive sentences
    coherence_scores = []
    for i in range(len(sentence_embeddings) - 1):
        similarity = np.dot(sentence_embeddings[i], sentence_embeddings[i+1]) / (
            np.linalg.norm(sentence_embeddings[i]) * np.linalg.norm(sentence_embeddings[i+1]))
        coherence_scores.append(similarity)
    
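    # With fewer than two usable sentences there is nothing to compare
    if not coherence_scores:
        return float("nan")
    return float(np.mean(coherence_scores))

# sample_text is a single sentence, so this prints nan; pass multi-sentence text for a real score
sentences = [word_tokenize(sent) for sent in nltk.sent_tokenize(sample_text)]
coherence = analyze_coherence(sentences)
print(f"Text coherence score: {coherence}")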
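
Training Word2Vec on a short text gives noisy vectors, so a simpler baseline can be a useful sanity check. The sketch below swaps in plain bag-of-words vectors and computes cosine similarity between consecutive sentences using only the standard library. The function names and the three example sentences are assumptions for illustration.

```python
import math
import re
from collections import Counter

def sentence_vector(sentence):
    # Bag-of-words counts for one sentence (lowercased regex tokens)
    return Counter(re.findall(r"[a-z']+", sentence.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def adjacent_similarity(sentences):
    # One score per pair of consecutive sentences
    vectors = [sentence_vector(s) for s in sentences]
    return [cosine(x, y) for x, y in zip(vectors, vectors[1:])]

doc = [
    "The model reads the text.",
    "The model scores the text.",
    "Bananas are yellow.",
]
sims = adjacent_similarity(doc)
print([round(s, 3) for s in sims])
```

The first pair shares most of its words and scores high, while the abrupt topic change in the last sentence drops the score to zero.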

🚀 Language Model Perplexity - Made Simple!

Using a pre-trained language model to calculate perplexity can be more effective than simple n-gram models. Lower perplexity suggests that the text is more likely to be generated by an AI model similar to the one used for evaluation.

Here’s a handy trick you’ll love! Here’s how we can tackle this:

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

def calculate_gpt2_perplexity(text):
    model = GPT2LMHeadModel.from_pretrained('gpt2')
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    
    inputs = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs['input_ids'])
    
    loss = outputs.loss
    perplexity = torch.exp(loss)
    return perplexity.item()

gpt2_perplexity = calculate_gpt2_perplexity(sample_text)
print(f"GPT-2 Perplexity: {gpt2_perplexity}")
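
The last two lines of the function hide the core arithmetic: when labels are passed, the model's loss is the average negative log-probability per predicted token, and perplexity is the exponential of that average. Here is a minimal sketch of the same calculation without any model. The per-token probabilities are made-up numbers for illustration, and `perplexity_from_log_probs` is a hypothetical helper name.

```python
import math

def perplexity_from_log_probs(token_log_probs):
    # token_log_probs: natural-log probability assigned to each actual token
    if not token_log_probs:
        return float("nan")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)  # average negative log-likelihood
    return math.exp(mean_nll)

# Hypothetical per-token probabilities (not real model output)
confident = [math.log(p) for p in (0.5, 0.5, 0.5, 0.5)]
surprised = [math.log(p) for p in (0.5, 0.01, 0.5, 0.01)]

print(f"Predictable text: {perplexity_from_log_probs(confident):.2f}")
print(f"Surprising text:  {perplexity_from_log_probs(surprised):.2f}")
```

A few very unlikely tokens are enough to push the score up sharply, which is why short texts give unstable perplexity values.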

🚀 Readability Metrics - Made Simple!

AI-generated text might have different readability characteristics compared to human-written text. Analyzing readability can provide insights into the nature of the content.

Ready for some cool stuff? Here’s how we can tackle this:

import textstat

def analyze_readability(text):
    flesch_reading_ease = textstat.flesch_reading_ease(text)
    flesch_kincaid_grade = textstat.flesch_kincaid_grade(text)
    gunning_fog = textstat.gunning_fog(text)
    
    return {
        "Flesch Reading Ease": flesch_reading_ease,
        "Flesch-Kincaid Grade": flesch_kincaid_grade,
        "Gunning Fog Index": gunning_fog
    }

readability_scores = analyze_readability(sample_text)
print(f"Readability scores: {readability_scores}")
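
To make the first score less of a black box, here is the Flesch Reading Ease formula written out: 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words). This is a minimal sketch that assumes a crude vowel-group syllable counter, so its numbers will not match textstat exactly. The function names are illustrative.

```python
import re

def count_syllables(word):
    # Rough heuristic: count groups of consecutive vowels, minimum one per word
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return None  # nothing to score
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

simple_score = flesch_reading_ease("The cat sat on the mat.")
print(f"Flesch Reading Ease: {simple_score:.3f}")
```

Higher scores mean easier text. Short sentences of one-syllable words, like the example, land at the top of the scale.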

🚀 Stylometric Analysis - Made Simple!

Stylometry involves analyzing writing style features such as sentence length, vocabulary richness, and punctuation usage. These features can help distinguish between human and AI-generated text.

Here’s a handy trick you’ll love! Here’s how we can tackle this:

import re

def analyze_style(text):
    sentences = nltk.sent_tokenize(text)
    words = word_tokenize(text)
    
    avg_sentence_length = len(words) / len(sentences)
    vocabulary_richness = len(set(words)) / len(words)
    punctuation_frequency = len(re.findall(r'[^\w\s]', text)) / len(words)
    
    return {
        "Average Sentence Length": avg_sentence_length,
        "Vocabulary Richness": vocabulary_richness,
        "Punctuation Frequency": punctuation_frequency
    }

style_features = analyze_style(sample_text)
print(f"Stylometric features: {style_features}")
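
Averages can hide a lot, so it can help to add features that capture variation and word choice. The sketch below adds two more features: the spread of sentence lengths and the share of common function words. It assumes regex-based splitting rather than NLTK, and `FUNCTION_WORDS` is a small illustrative subset, not a standard list.

```python
import re
import statistics

# Illustrative subset of English function words (assumption, not a standard list)
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "that", "it", "is"}

def extra_style_features(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return None  # no words to analyze
    return {
        # How much sentence length varies across the text
        "Sentence Length Std Dev": statistics.pstdev(lengths),
        # Share of words that are common function words
        "Function Word Ratio": sum(w in FUNCTION_WORDS for w in words) / len(words),
    }

example = "It is late. The report that I promised is in the shared folder and it is long."
features = extra_style_features(example)
print(features)
```

Both values can be merged into the dictionary returned by `analyze_style` to build a richer feature set.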

🚀 Machine Learning Classifier - Made Simple!

Combining various features extracted from the text, we can train a machine learning model to classify text as human-written or AI-generated.

Ready for some cool stuff? Here’s how we can tackle this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

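# Assume we have a dataset of texts and labels (placeholders; a real dataset needs many more examples)
texts = ["Sample text 1", "Sample text 2", "Sample text 3"]
labels = [0, 1, 0]  # 0 for human-written, 1 for AI-generated

# Split the raw texts first so the vectorizer never sees the test set
texts_train, texts_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

# Feature extraction: fit on the training texts only, then transform the test texts
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(texts_train)
X_test = vectorizer.transform(texts_test)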

# Train a Random Forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))

🚀 Real-life Example: News Article Verification - Made Simple!

In this example, we’ll analyze a news article to determine if it’s likely to be AI-generated or human-written.

Here’s a handy trick you’ll love! Here’s how we can tackle this:

import requests
from bs4 import BeautifulSoup

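def fetch_article(url):
    # Fail fast on network problems and HTTP errors instead of parsing an error page
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    paragraphs = soup.find_all('p')
    return ' '.join(p.get_text() for p in paragraphs)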

url = "https://example.com/news-article"
article_text = fetch_article(url)

# Apply various detection techniques
perplexity = calculate_perplexity(article_text)
sentiment = analyze_sentiment(article_text)
readability = analyze_readability(article_text)
style = analyze_style(article_text)

print(f"Perplexity: {perplexity}")
print(f"Sentiment: {sentiment}")
print(f"Readability: {readability}")
print(f"Style: {style}")

# Based on these results, make a judgment about whether the article is likely AI-generated
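
One simple way to turn several scores into a judgment is to count how many signals cross a threshold and send the text for human review when enough of them agree. The sketch below is purely illustrative. The threshold values are placeholders that have not been calibrated on any data, the signal names are hypothetical, and the rule that a low value counts as AI-like is itself an assumption to check on your own labeled examples.

```python
# Placeholder thresholds, NOT calibrated values; tune them on labeled data
HYPOTHETICAL_THRESHOLDS = {
    "gpt2_perplexity": 30.0,
    "sentence_length_std": 4.0,
    "vocabulary_richness": 0.5,
}

def flag_signals(signals, thresholds=HYPOTHETICAL_THRESHOLDS):
    # Assumption: a value below its threshold counts as one vote for "AI-like"
    return sorted(
        name for name, value in signals.items()
        if name in thresholds and value < thresholds[name]
    )

def summarize(signals, min_votes=2):
    # Ask for human review only when several independent signals agree
    flags = flag_signals(signals)
    verdict = "needs review" if len(flags) >= min_votes else "no strong signal"
    return verdict, flags

# Made-up scores for demonstration
example_signals = {
    "gpt2_perplexity": 18.2,
    "sentence_length_std": 2.5,
    "vocabulary_richness": 0.71,
}
verdict, flags = summarize(example_signals)
print(verdict, flags)
```

The output is phrased as "needs review" rather than a verdict on purpose, since none of these signals is reliable enough alone.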

🚀 Real-life Example: Social Media Post Analysis - Made Simple!

In this example, we’ll analyze a collection of social media posts to identify potential AI-generated content.

Let’s break this down together! Here’s how we can tackle this:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load social media posts (assuming we have a CSV file with 'text' and 'is_ai_generated' columns)
df = pd.read_csv('social_media_posts.csv')

# Split the data
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['is_ai_generated'], test_size=0.2, random_state=42)

# Feature extraction
vectorizer = TfidfVectorizer(max_features=5000)
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Train a Random Forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_vectorized, y_train)

# Make predictions
y_pred = clf.predict(X_test_vectorized)

# Evaluate the model
print(classification_report(y_test, y_pred))

# Analyze feature importance
feature_importance = pd.DataFrame({
    'feature': vectorizer.get_feature_names_out(),
    'importance': clf.feature_importances_
}).sort_values('importance', ascending=False)

print("Top 10 most important features:")
print(feature_importance.head(10))

🚀 Limitations and Challenges - Made Simple!

While these techniques can be effective, it’s important to note their limitations:

  1. AI models are constantly evolving, making detection more challenging.
  2. Some methods may produce false positives or negatives.
  3. Hybrid content (partly human-written, partly AI-generated) can be difficult to classify.
  4. The context and purpose of the text should be considered alongside technical analysis.

To address these challenges, it’s crucial to:

  1. Regularly update detection models and techniques.
  2. Combine multiple methods for more reliable results.
  3. Consider the broader context of the content being analyzed.
  4. Stay informed about the latest developments in AI text generation and detection.
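
False positives matter a lot here, because each one means a human author gets wrongly flagged. Here is a minimal sketch of how the two error rates could be tracked, assuming the same label convention as earlier (0 for human-written, 1 for AI-generated). The labels and predictions are invented for illustration.

```python
def error_rates(y_true, y_pred):
    # 1 = AI-generated, 0 = human-written
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # human text flagged as AI
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # AI text that slipped through
    return {
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else float("nan"),
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else float("nan"),
    }

# Invented labels and predictions for demonstration
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 0, 0]

rates = error_rates(y_true_demo, y_pred_demo)
print(rates)
```

Tracking both rates separately makes it easier to see which kind of mistake a change to the detector trades away.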

🚀 Additional Resources - Made Simple!

For further exploration of AI-generated text detection techniques, consider the following papers:

  1. “Automatic Detection of Machine Generated Text: A Critical Survey” by Jawahar et al. (2020) ArXiv: https://arxiv.org/abs/2011.01314
  2. “Defending Against Neural Fake News” by Zellers et al. (2019) ArXiv: https://arxiv.org/abs/1905.12616
  3. “Detecting Machine-generated Text using Machine Learning: A Systematic Review” by Fagni et al. (2021) ArXiv: https://arxiv.org/abs/2103.04540

These resources provide in-depth discussions of various detection methods and their effectiveness against different types of AI-generated text.

🎊 Awesome Work!

You’ve just learned some really powerful techniques! Don’t worry if everything doesn’t click immediately - that’s totally normal. The best way to master these concepts is to practice with your own data.

What’s next? Try implementing these examples with your own datasets. Start small, experiment, and most importantly, have fun with it! Remember, every data science expert started exactly where you are right now.

Keep coding, keep learning, and keep being awesome! 🚀
