Bitcoin Transaction Graph Dataset: A Foundation for Advanced Research


Introduction

Bitcoin represents a revolutionary set of concepts and technologies that power a decentralized digital economy. Introduced in 2008 by the pseudonymous entity Satoshi Nakamoto, it enables the storage and transfer of value between participants without relying on any central authority. The term encompasses multiple dimensions: the protocol governing the economy’s rules, the peer-to-peer network, the blockchain (a public ledger of all transactions), and the unit of value itself.

Unlike traditional economies, Bitcoin operates without centralized control over inflation or transaction validation. It relies on a distributed system that ensures rule adherence through cryptographic proof and consensus mechanisms. Since its launch, Bitcoin has seen substantial adoption. By 2023, the network averaged 270,000 daily users and facilitated approximately $8.6 trillion in transfers. It has also attracted significant scientific interest, with over 30,000 research papers indexed annually on Google Scholar in recent years.

The Critical Need for Public Bitcoin Datasets

Although all Bitcoin transaction data is publicly available, there is a notable scarcity of well-structured, curated datasets for researchers. The Bitcoin community has largely focused on enhancing network security, scalability, and utility, while also tackling risks like financial crimes and security vulnerabilities. However, the absence of accessible datasets impedes deeper understanding of the ecosystem and its broader implications.

Analyzing Bitcoin transaction graphs has become a vital research area. These graphs offer crucial insights into network health, growth, and various transaction patterns. They allow researchers to detect anomalies, uncover risks, and identify behaviors linked to criminal activities like money laundering and fraud. The decentralized and transparent nature of Bitcoin provides a unique opportunity for analysis, though the volume and complexity of data pose significant challenges.

Most existing datasets provide labeled sets of addresses, requiring researchers to build transaction graphs and associate addresses with nodes. This process demands advanced expertise and substantial effort—downloading blockchain data, developing graph construction methods, and applying curation techniques. While a few open-source graph datasets exist, such as Elliptic 1 and Elliptic 2, their scope is limited and primarily suited for money laundering research.

Introducing a Large-Scale Bitcoin Transaction Graph Dataset

To address these limitations, we present a large-scale, temporally resolved Bitcoin transaction graph. This dataset represents real entities—such as organizations, individuals, or institutions—as nodes, with directed edges symbolizing value transfers between them. It spans nearly 13 years of data, comprising 252 million nodes, 785 million edges, and 670 million transactions, making it the largest publicly available dataset of its kind.

Each node and edge is timestamped, enabling detailed temporal analysis. The graph is also richly labeled, with 34,000 nodes annotated by entity type and 100,000 Bitcoin addresses linked to named entities. This structure supports a wide range of applications, from studying transaction patterns to detecting malicious activities.

To demonstrate its utility, we trained graph neural networks (GNNs) to predict entity types, such as ransomware operators or Ponzi schemes. This classification, combined with graph analysis, enables deeper insights and supports efforts like early threat detection and regulatory compliance. Our results establish baselines for future research and underscore the dataset’s potential to enhance the security and transparency of the Bitcoin ecosystem.

Understanding the Bitcoin Framework

Bitcoin relies on asymmetric cryptography, where users control private keys to secure and spend their funds. Instead of revealing these keys, users interact with the network through pseudonyms derived from them, commonly referred to as addresses. The network’s value units, bitcoins, are stored in transaction outputs (TXOs). Each TXO is defined by a value and a locking script that sets the spending conditions.

A Bitcoin transaction consumes existing TXOs as inputs and creates new ones as outputs. A transaction is valid if its input TXOs exist and have not already been spent, and if the total input value is at least equal to the total output value; any excess is collected by the miner as a transaction fee. Unspent TXOs (UTXOs) can be used as inputs to future transactions. In this way, each transaction redistributes value among the entities that control the corresponding scripts.
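The validity rule can be illustrated with a minimal sketch. It assumes a simplified model in which the UTXO set is a plain dictionary; script execution, signatures, and coinbase transactions are deliberately ignored, and the function name `apply_transaction` is hypothetical rather than part of any Bitcoin library.

```python
def apply_transaction(utxos, txid, inputs, outputs):
    """Validate a transaction against `utxos` and update the set in place.

    utxos   : dict mapping (txid, output_index) -> (value, script)
    inputs  : list of (txid, output_index) keys being spent
    outputs : list of (value, script) pairs being created
    Returns the input value minus the output value (the implied fee).
    """
    if len(set(inputs)) != len(inputs):
        raise ValueError("the same TXO is referenced twice")
    missing = [key for key in inputs if key not in utxos]
    if missing:
        raise ValueError(f"unknown or already spent TXOs: {missing}")
    total_in = sum(utxos[key][0] for key in inputs)
    total_out = sum(value for value, _ in outputs)
    if total_in < total_out:
        raise ValueError("outputs exceed inputs")
    for key in inputs:                      # spent TXOs leave the UTXO set
        del utxos[key]
    for index, out in enumerate(outputs):   # new TXOs become spendable
        utxos[(txid, index)] = out
    return total_in - total_out


# Toy example (values in satoshis, scripts as opaque strings).
utxos = {("tx0", 0): (50_000, "script_alice")}
fee = apply_transaction(
    utxos,
    "tx1",
    inputs=[("tx0", 0)],
    outputs=[(30_000, "script_bob"), (19_000, "script_alice")],
)
```

In this toy example the second output returns change to the sender's own script, which is the situation the edge construction step later has to account for.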

Methodology: Building the Graph Dataset

Raw Data Extraction

The complete Bitcoin transaction history is stored in a public ledger known as the blockchain, maintained by a decentralized network of peers. New blocks of transactions are added approximately every ten minutes. We installed Bitcoin Core version 24.0 to set up a node, connect to the network, and download the entire transaction ledger. The data was stored locally and parsed to extract all transaction details.

In this work, we focused on the first 700,000 blocks of the blockchain, preceding the Taproot upgrade. This upgrade introduced significant changes to Bitcoin’s transaction structure, and by focusing on earlier blocks, we ensured methodological consistency.

Node Definition

All bitcoins in circulation are held within UTXOs, each protected by a locking script. Multiple TXOs can be locked by the same script, meaning they can be spent by the same address or group of addresses. Scripts naturally serve as candidates for graph nodes, as they represent the owners of the bitcoins.

However, since users often control multiple private keys and scripts, it is more meaningful to analyze value transfers between real entities rather than scripts. We employed heuristics from previous research to cluster scripts likely belonging to a single entity. This process identified approximately 252 million clusters, each representing a node in the graph.
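The text does not name the specific heuristics used. As an illustration only, the sketch below assumes the widely used common-input-ownership heuristic (scripts spent together in one transaction are attributed to the same entity) and merges scripts with a union-find structure. The class and function names are hypothetical.

```python
class UnionFind:
    """Disjoint-set structure with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:       # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster_scripts(transactions):
    """Group scripts that appear together as inputs of a transaction.

    transactions: iterable of lists, each holding the input scripts of
    one transaction. Returns a list of sets, one per cluster.
    """
    uf = UnionFind()
    for input_scripts in transactions:
        if not input_scripts:
            continue
        first = input_scripts[0]
        uf.find(first)                      # register single-input scripts too
        for script in input_scripts[1:]:
            uf.union(first, script)
    clusters = {}
    for script in list(uf.parent):
        clusters.setdefault(uf.find(script), set()).add(script)
    return list(clusters.values())


# "a" and "b" are co-spent, then "b" and "c": all three end up in one cluster.
clusters = cluster_scripts([["a", "b"], ["b", "c"], ["d"]])
```

Because merges are transitive, a single transaction that mixes inputs from unrelated parties can wrongly fuse large clusters, which is one reason privacy-oriented transactions need special handling.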

Edge Construction

Transactions transform input TXOs into output TXOs. An entity (alias) can appear on both sides of a transaction, for example when it receives change from a payment. The net value an alias receives in a transaction is therefore the value it receives in outputs minus the value it spends in inputs. Aliases with a negative net value are senders, those with a positive net value are recipients, and directed edges are drawn from senders to recipients based on these value transfers.
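A small sketch of the net-value rule follows. The text does not specify how value is split when a transaction has several senders and several recipients, so the proportional allocation used here is an assumption, and the function names are hypothetical.

```python
from collections import defaultdict


def net_values(inputs, outputs):
    """Net value received per entity: outputs received minus inputs spent.

    inputs, outputs: lists of (entity, value) pairs.
    """
    net = defaultdict(int)
    for entity, value in inputs:
        net[entity] -= value
    for entity, value in outputs:
        net[entity] += value
    return dict(net)


def transaction_edges(inputs, outputs):
    """Directed (sender, recipient) -> value edges for one transaction.

    Assumption: each recipient's net gain is attributed to the senders
    in proportion to their net outflow.
    """
    net = net_values(inputs, outputs)
    senders = {e: -v for e, v in net.items() if v < 0}
    recipients = {e: v for e, v in net.items() if v > 0}
    total_sent = sum(senders.values())
    edges = {}
    for sender, sent in senders.items():
        for recipient, received in recipients.items():
            edges[(sender, recipient)] = received * sent / total_sent
    return edges


# A pays B 60 and gets 39 back as change; the missing 1 is the fee.
single = transaction_edges([("A", 100)], [("B", 60), ("A", 39)])
# A and C jointly fund a payment of 40 to B.
joint = transaction_edges([("A", 30), ("C", 10)], [("B", 40)])
```

Working with net values means that change returned to the sender does not create a self-loop or inflate the transferred amount.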

Handling Special Transactions

We excluded CoinJoin transactions, which enhance privacy by combining multiple transactions, as they can obscure value flows and disrupt clustering heuristics. Similarly, colored coin transactions, used for transferring assets other than bitcoin, were identified and excluded to maintain analysis integrity.

Attributes

Edges and nodes are enriched with attributes that provide insights into transactional behavior. Edge attributes include the total value transferred, transaction count, and timestamps. Node attributes cover aspects like total received value, degree, and clustering coefficients, all derived from the transaction data.
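As a rough illustration of how per-transaction transfers could be rolled up into edge attributes, the sketch below aggregates total value, transaction count, and first and last timestamps. The attribute names are hypothetical and are not taken from the dataset's schema.

```python
def aggregate_edges(transfers):
    """Aggregate transfers into one attribute record per directed edge.

    transfers: iterable of (source, target, value, timestamp) tuples.
    """
    edges = {}
    for source, target, value, timestamp in transfers:
        attrs = edges.get((source, target))
        if attrs is None:
            edges[(source, target)] = {
                "total_value": value,
                "tx_count": 1,
                "first_seen": timestamp,
                "last_seen": timestamp,
            }
        else:
            attrs["total_value"] += value
            attrs["tx_count"] += 1
            attrs["first_seen"] = min(attrs["first_seen"], timestamp)
            attrs["last_seen"] = max(attrs["last_seen"], timestamp)
    return edges


edge_table = aggregate_edges([
    ("A", "B", 60, 1_500_000_000),
    ("A", "B", 15, 1_400_000_000),
    ("B", "C", 20, 1_600_000_000),
])
```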

Dataset Construction Overview

The graph dataset is stored in PostgreSQL, with separate tables for nodes and edges. The construction process involved multiple steps:

  1. Block Indexing: Organizing blocks chronologically for efficient data reading.
  2. Transaction Processing: Storing created and spent TXOs, while excluding special transactions.
  3. Address Clustering: Merging scripts into clusters representing entities.
  4. Inter-Cluster Edges: Calculating value transfers between entities.
  5. Intra-Cluster Edges: Recording value transfers within the same cluster.
  6. Node Attribute Computation: Deriving features from transactional data.

All code is written in Python and publicly available to ensure reproducibility.

Node Labeling Strategy

Bitcoin is used by diverse entities, including individuals, corporations, service providers, and criminal organizations. Research often focuses on understanding their behaviors and motivations. However, users are identified by randomly generated addresses, making it challenging to determine their true nature without external data.

We focused on labeling entities such as:

These categories were selected for their relevance and prevalence within the cryptocurrency ecosystem.

Labeling Pipeline

We leveraged BitcoinTalk, a prominent online forum, to extract and analyze Bitcoin-related data. Using a Python-based scraper with Selenium, we collected 14 million messages from 546,000 threads, focusing on posts mentioning Bitcoin addresses. These addresses were associated with entities using ChatGPT, which was prompted to identify deposit addresses, hot/cold wallets, and withdrawal transactions based on post content, transaction IDs, and converted USD amounts.
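Selecting posts that mention addresses can be approximated with a pattern match. The sketch below is an assumption about how such a filter might look, not the scraper's actual code: it recognizes the general shape of legacy Base58 and Bech32 addresses and does not verify checksums, so its matches are only candidates.

```python
import re

# Shape-only patterns; checksum validation is intentionally left out.
ADDRESS_RE = re.compile(
    r"\b(?:"
    r"[13][1-9A-HJ-NP-Za-km-z]{25,34}"    # legacy Base58 (P2PKH / P2SH)
    r"|bc1[ac-hj-np-z02-9]{11,71}"        # Bech32 (SegWit)
    r")\b"
)


def extract_addresses(post_text):
    """Return distinct candidate addresses in order of first appearance."""
    seen = []
    for match in ADDRESS_RE.finditer(post_text):
        if match.group(0) not in seen:
            seen.append(match.group(0))
    return seen


post = (
    "Deposits go to 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa, "
    "withdrawals come from bc1qar0srrr7xfkvy5l643lydnw9re59gtzzwf5mdq."
)
candidates = extract_addresses(post)
```

A filter of this kind only narrows down the posts; deciding which entity an address belongs to is the part delegated to the language model.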

This approach allowed us to label 34,000 nodes and 100,000 Bitcoin addresses with entity types. The pipeline involved:

  1. Data Collection: Scraping BitcoinTalk for relevant posts.
  2. Information Extraction: Using ChatGPT to parse posts and associate addresses with entities.
  3. Label Mapping: Assigning entity categories based on extracted information.
  4. Validation: Manually verifying labels using historical web snapshots when necessary.

Despite its effectiveness, this method has limitations, including potential inaccuracies from user-generated content, biases toward English-speaking entities, and challenges in extracting precise information from unstructured text.

Additional Data Sources

To enrich the dataset, we incorporated addresses from:

These sources added breadth and depth to our labeled data.

Data Records and Accessibility

The dataset is available under a Creative Commons Attribution 4.0 International license. It includes:

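The graph is extremely sparse: with 252 million nodes and 785 million directed edges, its density (the number of edges divided by the number of possible ordered node pairs) is on the order of 10^-8. Out of 252 million nodes, 34,098 are labeled, with distributions across categories like exchanges, miners, and gambling platforms.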

Technical Validation: Predicting Entity Types

Predicting node labels from node attributes and graph structure serves as a check on the dataset's reliability. We trained several graph neural networks (GCN, GraphSAGE, GAT, GIN) and a gradient boosting classifier (GBC) to predict node labels. The task focused on labels with sufficient representation, such as 'exchange', 'mining', and 'ransomware'.

Preprocessing and Training

Features were processed to handle power law distributions, including logarithmic transformation, normalization, and clipping. We added derived features like average amounts received/sent and node age. The labeled nodes were split into training, validation, and test sets, with oversampling and undersampling to address class imbalance.

Models were trained using the Adam optimizer with a weighted cross-entropy loss. For scalability, we used neighborhood sampling, constructing subgraphs around labeled nodes to reduce computational overhead.
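To make the weighted loss concrete, here is a library-free sketch. The inverse-frequency weighting scheme is an assumption, since the text does not say how the class weights were chosen; the normalization by the sum of weights matches the mean reduction used by PyTorch's CrossEntropyLoss when class weights are supplied.

```python
import math
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: rarer classes get larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class).

    probs: list of dicts mapping class -> predicted probability.
    """
    numerator = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    denominator = sum(weights[y] for y in labels)
    return numerator / denominator


labels = ["exchange", "exchange", "exchange", "ransomware"]
weights = class_weights(labels)

# A model that is right about exchanges but unsure about the ransomware node.
probs = [
    {"exchange": 0.9, "ransomware": 0.1},
    {"exchange": 0.9, "ransomware": 0.1},
    {"exchange": 0.9, "ransomware": 0.1},
    {"exchange": 0.8, "ransomware": 0.2},
]
loss = weighted_cross_entropy(probs, labels, weights)
unweighted = sum(-math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)
```

With these weights the single ransomware node contributes as much to the loss as the three exchange nodes combined, so the weighted loss is higher than the unweighted one for this prediction.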

Results and Analysis

The GAT and GIN models achieved the best performance, with macro-F1 scores of 0.64 and 0.63, respectively. The GBC achieved a score of 0.57, indicating that node features alone are useful for prediction. However, GNNs struggled with the 'ransomware' class, suggesting room for improvement.
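Macro-F1 averages the per-class F1 scores with equal weight, which is why a weak rare class pulls the score down even when overall accuracy looks acceptable. The sketch below computes it from scratch on made-up labels; the example data is illustrative and unrelated to the reported results.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


y_true = ["exchange", "exchange", "exchange", "mining", "mining", "ransomware"]
y_pred = ["exchange", "exchange", "mining", "mining", "mining", "exchange"]

score = macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Here four of six predictions are correct, yet the missed ransomware node contributes an F1 of zero for its class and drags the macro average below the accuracy.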

Feature importance analysis for the GBC showed that outgoing degree normalized by age was the most important feature, highlighting the distinction between entities with high connectivity (e.g., exchanges) and those with lower connectivity (e.g., ransomware operators).

Enhancing Ransomware Detection

Malicious actors often form dense subgraphs in transactional networks. Methods like Fraudar could be adapted to identify these clusters by using suspicion scores derived from model probabilities. Temporal algorithms like Spade+ could enable real-time detection, offering proactive identification of criminal activities.
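The idea of combining density with suspicion scores can be sketched with a greedy peeling procedure in the spirit of Fraudar. This is a simplified illustration, not the published algorithm: it treats the graph as undirected and unweighted, runs in quadratic time, and scores a node set as (sum of node suspicion + number of internal edges) divided by the set size. The function name and the toy scores are hypothetical.

```python
def densest_suspicious_subgraph(edges, suspicion):
    """Greedy peeling: repeatedly drop the node that contributes least.

    edges     : iterable of (u, v) pairs, treated as undirected
    suspicion : dict mapping node -> non-negative suspicion score
    Returns (best_node_set, best_score).
    """
    neighbors = {node: set() for node in suspicion}
    for u, v in edges:
        if u != v:
            neighbors.setdefault(u, set()).add(v)
            neighbors.setdefault(v, set()).add(u)
    alive = set(neighbors)
    if not alive:
        return set(), 0.0
    total = sum(suspicion.get(n, 0.0) for n in alive)
    total += sum(len(nb) for nb in neighbors.values()) / 2   # each edge counted once
    best_score, best_set = total / len(alive), set(alive)
    while len(alive) > 1:
        # A node's contribution is its suspicion plus its degree in the current set.
        victim = min(
            alive,
            key=lambda n: (suspicion.get(n, 0.0) + len(neighbors[n]), str(n)),
        )
        total -= suspicion.get(victim, 0.0) + len(neighbors[victim])
        for nb in neighbors[victim]:
            neighbors[nb].discard(victim)
        neighbors[victim] = set()
        alive.remove(victim)
        score = total / len(alive)
        if score > best_score:
            best_score, best_set = score, set(alive)
    return best_set, best_score


# A tightly connected group (a, b, c, d) with a loosely attached tail (e, f).
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"),
         ("c", "d"), ("d", "e"), ("e", "f")]
suspicion = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.9, "e": 0.1, "f": 0.0}
group, group_score = densest_suspicious_subgraph(edges, suspicion)
```

In practice the suspicion values could come from the predicted ransomware probability of each node, so that dense regions made of likely malicious entities score highest.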

Practical Usage and Applications

Restoring the Database

After decompressing the archive, the tables can be restored into a PostgreSQL database using the pg_restore utility. The database requires substantial storage—40GB for node features and 80GB for transaction edges—so adequate provisioning is essential. Configuring parameters like shared_buffers and work_mem can optimize performance.

Research Applications

This dataset supports various research directions:


Frequently Asked Questions

What makes this Bitcoin transaction graph dataset unique?
This dataset is the largest publicly available Bitcoin transaction graph, with 252 million nodes and 785 million edges. It is temporally resolved, richly labeled, and designed to represent real entities, making it invaluable for a wide range of research applications.

How can researchers use this dataset for fraud detection?
Researchers can train machine learning models, such as graph neural networks, to predict entity types and detect anomalous patterns. The dataset’s scale and labeling support tasks like ransomware detection and money laundering analysis.

What are the limitations of the node labeling process?
Labeling relies on data from BitcoinTalk and other public sources, which may contain inaccuracies or biases. The focus on English-language sources may underrepresent non-English entities. However, automated extraction with ChatGPT showed high accuracy in evaluations.

Can this dataset be used for real-time analysis?
While the dataset is historical, its temporal structure allows for the study of evolution over time. Real-time analysis would require integrating live blockchain data, but the methodologies demonstrated here can be adapted for such purposes.

How does this dataset compare to existing Bitcoin datasets?
Unlike most datasets that provide labeled addresses, this offering includes a pre-constructed graph with entity clusters, reducing the preprocessing burden on researchers. It also covers a broader range of entity types and a longer time span.

What computational resources are needed to work with this dataset?
The database requires about 120GB of storage, including indexes. Processing and modeling may require significant memory and computational power, especially for full-graph analyses. Neighborhood sampling and distributed computing are recommended for scalability.
