🧹 Text Preprocessing Steps - Preparing Text for NLP

Text preprocessing is the first and most important step in an NLP pipeline: we clean the text and bring it into a structured form so that a machine can interpret it correctly.

🔹 1. Tokenization

Splitting text into small units called tokens (usually words and punctuation marks).

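from nltk.tokenize import word_tokenize

# One-time setup: import nltk; nltk.download('punkt') (newer NLTK releases may need 'punkt_tab')
text = "मशीन लर्निंग बहुत मज़ेदार है।"
print(word_tokenize(text))

Intended output: ['मशीन', 'लर्निंग', 'बहुत', 'मज़ेदार', 'है', '।']

Note: word_tokenize is designed mainly for English, so it may leave the danda attached to the last word ('है।') instead of splitting it off.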
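If the danda should always become its own token, a small regular expression can do the splitting without NLTK. The following is a minimal sketch, not part of NLTK: the name tokenize_hindi and the pattern are assumptions made for illustration, and the pattern only covers Devanagari words, Latin letters and digits, and single punctuation characters.

```python
import re

# Devanagari letters and signs, excluding the danda (U+0964) and double danda (U+0965),
# then Latin letters/digits, then any other single non-space character (punctuation).
TOKEN_PATTERN = re.compile(r"[\u0900-\u0963\u0966-\u097F]+|[A-Za-z0-9]+|\S")

def tokenize_hindi(text):
    """Split text into word tokens and standalone punctuation marks."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize_hindi("मशीन लर्निंग बहुत मज़ेदार है।")
print(tokens)
```

An explicit Unicode range is used instead of \w because Python's built-in re module does not treat Devanagari vowel signs as word characters, so \w+ would cut Hindi words apart.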

🔸 2. Stop Words Removal

Words that are very common in a language but contribute little to the meaning (for example "है" (is), "और" (and), "का" (of)).

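# NLTK's stopwords corpus does not appear to include a 'hindi' list, so
# stopwords.words('hindi') fails. This sketch assumes a small hand-written set instead.
hindi_stopwords = {'मैं', 'रहा', 'हूं', 'है', 'और', 'का'}  # illustrative, not exhaustive
words = ['मैं', 'AI', 'सीख', 'रहा', 'हूं']
filtered = [w for w in words if w not in hindi_stopwords]
print(filtered)  # ['AI', 'सीख']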

🧠 3. Lemmatization

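Converting words to their base (dictionary) form, called the lemma. It relies on grammatical analysis, which is why the part of speech matters: WordNetLemmatizer treats a word as a noun unless pos is given, so "running" stays unchanged without pos="v". WordNet covers English, so the example below uses an English word.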

from nltk.stem import WordNetLemmatizer

# One-time setup: import nltk; nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # pos="v" treats the word as a verb

Output: run

📋 Summary Table

Step	Purpose
Tokenization	Split text into tokens (words and punctuation)
Stop Words Removal	Remove common words that carry little meaning
Lemmatization	Reduce words to their base form
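The three steps in the table can be chained into one function. The sketch below is only an illustration under stated assumptions: STOP_WORDS and LEMMAS are tiny hand-written, hypothetical resources (a real project would need much larger ones), preprocess is a made-up name, and dropping the danda is an extra choice made here rather than a step described above.

```python
import re

# Hypothetical, hand-written resources for illustration only.
STOP_WORDS = {"है", "हैं", "और", "का", "की", "के", "में"}
LEMMAS = {"किताबें": "किताब", "बच्चों": "बच्चा"}  # surface form -> base form

# Devanagari words (danda excluded), Latin words/digits, or one punctuation character.
TOKEN_RE = re.compile(r"[\u0900-\u0963\u0966-\u097F]+|[A-Za-z0-9]+|\S")

def preprocess(text):
    # 1. Tokenization
    tokens = TOKEN_RE.findall(text)
    # 2. Stop words removal (the danda is dropped here as well)
    tokens = [t for t in tokens if t not in STOP_WORDS and t != "।"]
    # 3. Lemmatization by dictionary lookup; unknown words are kept unchanged
    return [LEMMAS.get(t, t) for t in tokens]

result = preprocess("बच्चों की किताबें और कलम घर में हैं।")
print(result)  # ['बच्चा', 'किताब', 'कलम', 'घर']
```

The order matters in this sketch: stop words are matched on surface forms, so filtering happens before the lemma lookup.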

✅ Conclusion

Text preprocessing turns raw text into a structured form that a machine can process easily. It plays an important role in improving the accuracy and performance of an NLP model.

🚀 In the next blog: Bag of Words & TF-IDF (in Hindi)
