NFTs Sentiment Analysis (2nd position Solution)

Raheem Nasirudeen · 6 min read · Jun 10, 2022



This hackathon was hosted and organized by DPhi and bitsCrunch.

My solution placed 2nd among all competitors in solving the business use case provided and judged by the bitsCrunch team.

The congratulatory message for my win in the challenge.


A non-fungible token (NFT) is a non-interchangeable unit of data stored on a blockchain (a form of digital ledger) that can be sold and traded. NFT data units may be associated with digital files such as photos, videos, and audio. In this project, we collected data under various NFT tags on Twitter and used natural language processing (sentiment analysis) and machine learning to identify trending NFTs.


This rising cryptocurrency niche recorded over $23 billion in trading volumes as per the latest DappRadar report. Currently, NFT-related active wallets account for close to 50% of the total crypto industry usage, a statistic that will likely increase given the continued interest in 2022.

Before jumping into the developments and prospects, it is worth understanding why NFTs are gaining traction across the board. There are many factors behind the sudden surge, but the most significant one is the unique, non-interchangeable nature of NFTs. Each NFT token has a unique value, making it a suitable on-chain asset to represent digital collectibles such as in-game items, or off-chain assets like property and tokenized stocks.

Problem Statement

We are currently living in a world with a massive explosion of digital assets — hundreds of blockchains, thousands of metaverses, tens of thousands of NFT collections, and millions of NFTs — and this number is growing rapidly day by day. So there is a dire need to identify new and trending NFT collections across different blockchains to keep up with the latest happenings. Social media plays a crucial role in today's NFT world: collectors flaunt their NFT art on social media platforms, where it can quickly go viral. The aim of this challenge is to identify those collections early using these social media signals.

Hackathon Objective:

Identify the trending NFT collections on Twitter using Twitter data on a daily basis and analyze their sentiments.


  1. Data collection and gathering.
  2. Sentiment analysis.
  3. Generating statistical and time-based features.
  4. Exploratory data analysis.
  5. Named entity recognition.
  6. Data preprocessing.
  7. Word cloud for each sentiment.
  8. Text feature extraction.
  9. Machine learning model.
  10. Model explainability.

Data Collection and Gathering

Most text comes as unstructured data. For this challenge, tweets were collected from Twitter using the Twint library at 14-day intervals. The image below shows the data-collection pipeline; from the raw fields, only the features necessary for the analysis were selected.

The data-collection pipeline on Twitter.
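For reference, a Twint pull of this kind can be sketched as the configuration below. The hashtag, date window, and limit are illustrative placeholders, not the exact settings used in the challenge, and running the search requires network access to Twitter.

```python
# Configuration sketch for pulling NFT tweets with Twint (placeholder
# values; not the exact settings used in the challenge).
import twint

c = twint.Config()
c.Search = "#NFT"        # hashtag to search for
c.Lang = "en"
c.Since = "2022-02-01"   # start of a 14-day window
c.Until = "2022-02-15"   # end of the window
c.Limit = 1000           # cap on the number of tweets
c.Pandas = True          # keep results in a pandas DataFrame
c.Hide_output = True

twint.run.Search(c)
tweets_df = twint.storage.panda.Tweets_df
```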

Sentiment Analysis

The raw tweets come unlabeled, making this an unsupervised task. We therefore generated a sentiment label for each tweet with the TextBlob library, based on its polarity score.

Sentiment Analysis on the polarity score.
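TextBlob returns a polarity score between -1 and 1 for each text; a common convention (assumed here, since the exact cutoffs live in the notebook) maps the score to three labels:

```python
def label_sentiment(polarity: float) -> str:
    """Map a TextBlob-style polarity score in [-1.0, 1.0] to a label."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"

# In practice the score would come from TextBlob, e.g.:
#   from textblob import TextBlob
#   polarity = TextBlob(tweet_text).sentiment.polarity
print(label_sentiment(0.35))
print(label_sentiment(-0.6))
print(label_sentiment(0.0))
```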

Statistical and Time data features

We generated statistical features for each tweet, such as word_count, character_count, and word_density.

Generating statistical features.
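A minimal sketch of these features, assuming word_density means characters per word (a common definition for this feature):

```python
def tweet_stats(tweet: str) -> dict:
    """Compute simple statistical features for one tweet."""
    char_count = len(tweet)
    word_count = len(tweet.split())
    # +1 in the denominator avoids division by zero on empty tweets
    word_density = char_count / (word_count + 1)
    return {
        "word_count": word_count,
        "character_count": char_count,
        "word_density": round(word_density, 2),
    }

print(tweet_stats("NFT drop is live now"))
```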

We also derived features from the date column, such as the day of the week and the day of the month.

Generating date features.
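With pandas, each date feature is a one-liner via the `.dt` accessor (the dates below are illustrative; the real ones come from the scraped tweets):

```python
import pandas as pd

# Illustrative dates standing in for the tweet timestamps.
df = pd.DataFrame({"date": pd.to_datetime(["2022-02-04", "2022-02-08"])})
df["day_of_week"] = df["date"].dt.dayofweek  # Monday=0 ... Sunday=6
df["day"] = df["date"].dt.day                # day of the month

print(df[["day_of_week", "day"]])
```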

Exploratory Data Analysis

In this section, we analyzed the data using univariate and bivariate analysis to answer the following questions:

1. The percentage of each sentiment in the data

2. Top 10 most frequent word counts and character counts in the data

3. Top 10 likes counts in the data

4. The percentage of tweets on each day of the week

5. Average likes count for each sentiment

6. Total word and character counts for each sentiment

7. Sentiment percentage for each day of the week

Univariate and Bivariate Analysis

We have 46.3% of tweets with Neutral sentiment, 45.7% Positive, and 8.0% Negative.
The top 10 most frequent word counts range from 10 to 28 words.
The top 10 character counts range from 72 to 255.
The top 10 likes counts range from 2 to 11.
Most tweets are made on day 4 of the week (Friday) with 48.4%, and the fewest on day 5 (Saturday) with 2.7%.
Positive tweets have the highest average number of likes.
Positive tweets have the highest total word count.
Positive tweets have the highest total character count.

On Friday, positive sentiment is the most common at 48.3%.

On Tuesday, neutral sentiment is the most common at 54.2%.

On Saturday, negative sentiment reaches 15.2%, its highest share across all days of the week.
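Aggregations like the ones above are straightforward with pandas. A sketch on toy data, with the column names assumed to match the scraped dataset:

```python
import pandas as pd

# Toy data standing in for the scraped tweets (column names assumed).
df = pd.DataFrame({
    "sentiment":   ["Positive", "Neutral", "Positive", "Negative", "Neutral"],
    "likes_count": [11, 2, 7, 1, 3],
    "day_of_week": [4, 1, 4, 5, 1],
})

# Percentage of each sentiment in the data
pct = df["sentiment"].value_counts(normalize=True) * 100

# Average likes count for each sentiment
avg_likes = df.groupby("sentiment")["likes_count"].mean()

# Sentiment percentage within each day of the week
by_day = pd.crosstab(df["day_of_week"], df["sentiment"], normalize="index") * 100

print(pct)
print(avg_likes)
```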

Named Entity Recognition

In this section, we used a spaCy model to analyze the top 5 tweets by likes. Below, I show the analysis for the top 2 tweets; the rest can be found in the notebook linked on my GitHub page.

Top 5 tweets by likes.
Analysis of the tweet with the most likes.
Analysis of the tweet with the second-most likes.

Data Preprocessing

In this section, we preprocessed the raw text by lowercasing it, removing stop words, repeated words, and other noise, and then lemmatizing the result.
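A simplified cleaning function along these lines is sketched below. The stop-word set is a tiny illustrative subset, and the lemmatization step is only indicated in a comment, since it would normally rely on NLTK or spaCy:

```python
import re
import string

STOPWORDS = {"the", "a", "an", "and", "is", "to", "of", "in"}  # tiny illustrative subset

def clean_text(text: str) -> str:
    """Lowercase, strip noise and stop words, collapse repeated words."""
    text = text.lower()
    text = re.sub(r"http\S+|@\w+|#", " ", text)  # strip URLs, mentions, '#'
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [w for w in text.split() if w not in STOPWORDS]
    # collapse immediately repeated words
    tokens = [w for i, w in enumerate(tokens) if i == 0 or w != tokens[i - 1]]
    # (lemmatization would follow here, e.g. with NLTK's WordNetLemmatizer)
    return " ".join(tokens)

print(clean_text("Check the NFT NFT drop!! http://example.com"))
```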


Word Cloud analysis

In this section, we visualized the cleaned text data using word clouds.

Word cloud for all data points.
Word Cloud on Positive Sentiment
Word cloud on Neutral sentiment.
Word cloud on Negative sentiment.

Machine Learning

The sentiment labels were encoded as classes, and the data was split into 80% training and 20% test sets. We extracted features from the text with TfidfVectorizer, then compared five machine learning models (Logistic Regression, RandomForestClassifier, DecisionTreeClassifier, LightGBM, and XGBoost) to find the one that gives the best accuracy, using accuracy score as the evaluation metric.
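The split–vectorize–fit–score loop can be sketched with scikit-learn on toy data; the same pattern would be repeated for each of the five models (the texts and labels below are made up for illustration):

```python
# TF-IDF features + one classifier on toy data; in the real pipeline the
# texts and labels come from the earlier preprocessing and labeling steps.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

texts = ["great nft drop", "love this collection", "amazing mint today",
         "scam project avoid", "terrible rug pull", "worst nft ever",
         "new mint announced", "floor price update", "minting opens soon",
         "happy with my nft"]
labels = [1, 1, 1, 0, 0, 0, 2, 2, 2, 1]  # 1=Positive, 0=Negative, 2=Neutral

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

vec = TfidfVectorizer()
X_train_tfidf = vec.fit_transform(X_train)
X_test_tfidf = vec.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, y_train)
acc = accuracy_score(y_test, model.predict(X_test_tfidf))
print(f"accuracy: {acc:.2f}")
```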

Model accuracy.

It was observed that XGBoost, a non-linear gradient-boosting model, performed best with 75.10% accuracy, while Logistic Regression, a linear model, was a close second at 74.69%.

Model Explainability

We interpreted the model using LIME, both to better understand why a text receives a given sentiment and to support model debugging in the long run.

Lime snapshot.

I give a shout-out to DPhi and Nexford University.

DPhi has improved my technical skills in solving data science problems through the boot camps I have attended.

Nexford University has improved my problem-solving skills through the well-crafted educational content it creates for its students.

Important document