Cryptocurrency market capitalization surged from $17 billion in 2017 to over $2.25 trillion in 2021, an increase of roughly 13,000% in four years. Despite this explosive growth, cryptocurrencies remain highly volatile, influenced by factors ranging from market trends and technological developments to political events and social media sentiment—especially on platforms like Twitter.
In a Harvard Extension School data engineering and analytics course project, our team developed a cryptocurrency data lake to analyze trends and assess the impact of social media sentiment on crypto asset volatility, using Bitcoin (BTC) as a primary example. We leveraged the Databricks Lakehouse Platform to ingest unstructured Twitter data via the Tweepy library and structured pricing data from Yahoo Finance, building a machine learning model to predict how investor sentiment affects valuation. The resulting insights were visualized on a Databricks SQL dashboard for stakeholders.
This article explains how we built this pipeline and ML model in a matter of weeks using Databricks and collaborative notebooks.
Project Overview
Cryptocurrency markets operate 24/7, providing continuous data ideal for analyzing correlations between social media activity and price movements. Figure 1 illustrates the high-level architecture of our data and ML pipeline.
The workflow orchestrates a sequence of Databricks notebooks to execute:
- Data ingestion into Bronze tables
- Data cleaning and sentiment analysis in Silver tables
- Aggregation into Gold tables for correlation modeling
- SQL-based business intelligence queries
The Lakehouse architecture combines the scalability of data lakes with the performance of data warehouses, allowing us to build the core pipeline in about a week. It also enabled seamless collaboration among data engineers, ML practitioners, and BI analysts without moving data across systems.
Data and ML Pipeline Implementation
Data Ingestion with Medallion Architecture
We sourced data from Twitter and Yahoo Finance, using a lookup table to map crypto tickers to Twitter hashtags. The yfinance Python library downloaded historical market data at 15-minute intervals, stored in Bronze tables with fields like symbol, timestamp, open/close prices, and volume. Silver tables enriched this data with calculated fields like price change percentages.
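The Silver-layer enrichment described above can be sketched with pandas. This is an illustrative example, not the project's actual code: the column names (`symbol`, `timestamp`, `open`, `close`, `volume`, `price_change_pct`) are assumptions based on the fields listed above, and the rows are synthetic.

```python
import pandas as pd

# Illustrative Bronze-style rows: one row per 15-minute bar (column names are assumptions).
bronze = pd.DataFrame({
    "symbol": ["BTC-USD"] * 3,
    "timestamp": pd.to_datetime(["2021-06-01 00:00", "2021-06-01 00:15", "2021-06-01 00:30"]),
    "open": [100.0, 102.0, 101.0],
    "close": [102.0, 101.0, 103.0],
    "volume": [10, 12, 9],
})

# Silver enrichment: percentage change within each bar, plus bar-over-bar close change.
silver = bronze.copy()
silver["price_change_pct"] = (silver["close"] - silver["open"]) / silver["open"] * 100
silver["close_change_pct"] = silver["close"].pct_change() * 100
```

In the actual pipeline the same transformation would run as a Spark job writing to a Delta Silver table rather than on an in-memory DataFrame.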
Delta Lake ensured atomic operations, schema enforcement, and data quality, simplifying reprocessing. Twitter data, ingested via Tweepy, was stored in Bronze tables before removing non-ASCII characters and irrelevant fields for Silver storage.
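A minimal sketch of the Bronze-to-Silver tweet cleaning step described above: stripping non-ASCII characters and dropping fields not needed downstream. The record structure and field names here are illustrative assumptions, not the project's actual schema.

```python
import re

def clean_tweet(record: dict) -> dict:
    """Strip non-ASCII characters and keep only fields used downstream (field names are assumptions)."""
    text = re.sub(r"[^\x00-\x7F]+", " ", record["text"])  # drop emoji and other non-ASCII runs
    text = re.sub(r"\s+", " ", text).strip()              # collapse leftover whitespace
    keep = {"id", "created_at", "text", "user_followers"}
    cleaned = {k: v for k, v in record.items() if k in keep}
    cleaned["text"] = text
    return cleaned

raw = {"id": 1, "created_at": "2021-06-01T00:00:00Z",
       "text": "Bitcoin to the moon \U0001F680\U0001F680 #BTC", "lang": "en", "user_followers": 120}
print(clean_tweet(raw))
```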
Data Science Workflow
Our data science process included exploratory data analysis (EDA), sentiment modeling, and correlation analysis. We used MLflow for experiment tracking, reproducibility, and model management.
Exploratory Data Analysis
EDA involved visualizing tweet length distributions by sentiment category using Seaborn violin plots. Word clouds (via matplotlib and wordcloud) highlighted frequent terms in positive/negative tweets. An interactive topic modeling dashboard built with Gensim revealed common themes and term frequencies.
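The term-frequency counting behind those word clouds can be illustrated with a stdlib-only sketch; the tweets and stopword list below are toy data, not from the project, and `wordcloud` would consume counts like these to render the image.

```python
from collections import Counter
import re

# Toy labeled tweets (illustrative data, not from the project).
tweets = [
    ("positive", "btc breakout looking strong strong buy"),
    ("positive", "bullish on btc strong momentum"),
    ("negative", "btc dump incoming sell now"),
]

STOPWORDS = {"on", "now", "looking"}  # illustrative stopword list

def term_frequencies(label: str) -> Counter:
    """Count terms across all tweets with the given sentiment label."""
    counts = Counter()
    for lab, text in tweets:
        if lab == label:
            counts.update(w for w in re.findall(r"[a-z#]+", text.lower()) if w not in STOPWORDS)
    return counts

print(term_frequencies("positive").most_common(3))
```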
Sentiment Analysis Model
The sentiment model classified tweets as positive, neutral, or negative. We evaluated four approaches:
- Lexicon-based algorithms: Compare words to labeled databases; simple but limited by dictionary quality.
- Off-the-shelf systems (e.g., Amazon Comprehend): Easy to deploy but less customizable.
- Classical ML algorithms: Logistic Regression, Random Forest, SVM, and Naïve Bayes offered interpretability but required extensive preprocessing.
- Deep learning models: Pre-trained embeddings like BERT via SparkNLP provided state-of-the-art accuracy with transfer learning.
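The simplest of the four approaches, lexicon-based scoring, fits in a few lines. This sketch is illustrative only: the lexicon, tokenization, and thresholds are assumptions, and real lexicons (e.g. VADER's) are far larger and weighted.

```python
import re

# Minimal lexicon-based sentiment classifier (lexicon entries are illustrative).
POSITIVE = {"moon", "bullish", "gain", "pump", "buy"}
NEGATIVE = {"crash", "bearish", "loss", "dump", "sell"}

def lexicon_sentiment(text: str) -> str:
    """Score a tweet by counting positive vs. negative lexicon hits."""
    words = re.findall(r"[a-z]+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("btc looking bullish, time to buy"))
```

As noted above, this approach is limited by dictionary quality: sarcasm, negation ("not bullish"), and crypto slang all slip through, which is why the project moved to ML-based classifiers.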
SparkNLP was chosen for its scalability and accuracy. A classical ML pipeline with SVM achieved 75.7% accuracy, while a DL model pre-trained on IMDb reached 83%, demonstrating the superiority of deep learning for this task.
Correlation Model
A linear regression model (via scikit-learn) quantified relationships between sentiment scores and price changes. Sentiment was scored as -1 (negative), 0 (neutral), or +1 (positive), aggregated at 15-minute intervals. Initial results showed no strong linear correlation, suggesting future work with sentiment polarity or alternative models.
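The aggregation-and-correlation step can be sketched without scikit-learn: average the per-tweet -1/0/+1 scores within each 15-minute interval, then compute the Pearson correlation against that interval's price change. All data below is synthetic; the project's model used scikit-learn on real intervals.

```python
from collections import defaultdict

# Synthetic input: (interval_index, sentiment score in {-1, 0, +1}) per tweet.
tweet_scores = [(0, 1), (0, 1), (0, 0), (1, -1), (1, -1), (2, 0), (2, 1), (3, -1)]
# Synthetic price change (%) per 15-minute interval.
price_change = [0.8, -1.2, 0.3, -0.5]

# Aggregate mean sentiment per interval.
by_interval = defaultdict(list)
for i, s in tweet_scores:
    by_interval[i].append(s)
sentiment = [sum(v) / len(v) for _, v in sorted(by_interval.items())]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(sentiment, price_change)
print(f"Pearson r = {r:.3f}")
```

The synthetic data above is deliberately correlated; on the project's real data, as noted, no strong linear relationship emerged.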
Business Intelligence and Visualization
Databricks SQL enabled efficient querying, visualization, and dashboard creation. Alerts notified users of significant events like price swings or sentiment shifts.
Dashboard Creation
The SQL Editor simplified query development and visualization. Dashboards included:
- Overview: Twitter influencer impact, stock movements, and tweet frequency.
- Sentiment Analysis: Sentiment distribution per cryptocurrency over time.
- Stock Volatility: Price trends and historical data for each crypto asset.
Key Findings
- Tweet volume correlates with volatility: High tweet activity often precedes or follows price changes.
- Follower count doesn’t equate to influence: While influencers like Elon Musk cause market swings, follower count alone doesn’t predict impact.
- Retweets show a negative correlation: influence may propagate through secondary channels, such as news articles, rather than direct resharing.
Databricks streamlined pipeline orchestration, ML modeling, and visualization, enabling project completion in under four weeks.
Frequently Asked Questions
What is a Lakehouse architecture?
A Lakehouse combines data lake scalability with data warehouse performance, supporting both BI and AI workloads. It uses Delta Lake for ACID transactions and schema enforcement, eliminating data silos.
How does sentiment analysis work for crypto tweets?
Sentiment analysis classifies tweets as positive, neutral, or negative using NLP models. We used SparkNLP for deep learning-based classification, achieving 83% accuracy with pre-trained models.
Can social media sentiment predict crypto prices?
Our correlation model found no strong linear relationship, but tweet volume correlates with volatility. Future models may use alternative approaches like sentiment polarity or time-series analysis.
What tools are needed to replicate this project?
Databricks, Python libraries (Tweepy, yfinance, SparkNLP), and MLflow for model management. Basic knowledge of data engineering and ML is required.
How is data ingested from Twitter and Yahoo Finance?
We used Tweepy for Twitter API access and yfinance for market data, storing raw data in Delta Bronze tables before processing.
What are the benefits of using Databricks for such projects?
Databricks provides an integrated environment for data ingestion, processing, ML, and visualization, reducing technical overhead and accelerating development.
Conclusion
Our project demonstrated the power of Databricks in building end-to-end data pipelines for crypto analysis. While social media sentiment alone may not predict prices, it provides valuable insights into market volatility. The Lakehouse architecture facilitated collaboration and efficiency, making it ideal for similar analytics projects.
Note: This content is for educational purposes only and does not constitute financial advice.