This article assumes that you are a data analyst who is new to web3, starting to build your web3 analysis team, or just getting interested in web3 data. In any of these cases, you should already be roughly familiar with how APIs, databases, transformations, and models work in web2. In this guide, I will try to lay out my three main points as succinctly as possible:

1. Thinking: why open data pipelines will change how data is used
2. Tools: an overview of the tools in the web3 data stack and how to use them
3. Team: basic considerations and skills for a web3 data team

Data Thinking

Let's first summarize how data is built, queried, and accessed in web2 (e.g. via Twitter's API). Simplified, the data pipeline has four steps:

1. An API event is triggered (some tweets are sent)
2. The database is updated (connected to existing user models/state changes)
3. Data is transformed for specific product/analytics use cases
4. Models are trained and deployed (to curate your Twitter feed)

When data is open-sourced in web2, it is usually only released after step 3, once the transformations are complete. Communities such as Kaggle (1,000+ data science/feature engineering competitions) and Hugging Face (26,000 top NLP models) use these publicly released subsets of data to help companies build better models. There are some domain-specific exceptions, such as OpenStreetMap, which opens up data in the earlier steps, but even it places restrictions on write permissions.

To be clear, I am only talking about data here; I am not saying that web2 is devoid of open source. Like most other engineering roles, web2 data teams rely on a ton of open source tools to build their pipelines (dbt, Apache projects, TensorFlow), and we still use all of these tools in web3. In short, their tools are open, but their data is closed.

Web3 makes the data itself open source as well, which means it is no longer just data scientists working in the open, but also analytics engineers and data engineers. Everyone participates in a more continuous workflow rather than an almost black-box data loop. The work has moved from web2 data dams to web3 data rivers, deltas, and oceans. It's also important to note that every product in the ecosystem is affected by this cycle simultaneously.

Let's look at an example of how web3 analysts can work together. There are dozens of decentralized exchanges (DEXs) that let you swap token A for token B, each with different trading mechanisms and fees. If these were traditional exchanges like NASDAQ, each one would report its own data in 10-K filings or through some API, and then some other service like CapIQ would aggregate all the exchange data and charge a fee for access to its API. Maybe occasionally they would run an innovation competition so they could charge for additional data/charting features later. In web3, the data flow for exchanges looks like this:

1. dex.trades is a table on Dune (built up over time by many community analytics engineers) where trade data from all DEXs is aggregated together, so you can easily query the trading volume of a single token across every exchange (see the query sketch at the end of this section).
2. A data analyst creates a dashboard from community open source queries, so now we have a public overview of the entire DEX industry. Even though all the queries appear to be written by one person, you can bet it took a lot of debate on Discord to piece them together accurately.
3. DAO scientists look at the dashboard and start slicing the data in their own queries, focusing on specific pairs such as stablecoins.
They observe user behavior and business models, then start to build hypotheses. Since they can see which DEX has a larger share of trading volume, they might come up with a new model and propose changes to governance parameters, to be voted on and executed on-chain.
4. Afterwards, we can always check the public queries/dashboards to see how the proposal made the product more competitive.
5. In the future, if another DEX launches (or an existing one upgrades to a new version), this process repeats. Someone writes the insert queries to update the table, and the change flows through to every dashboard and model built on it (without anyone having to go back and manually fix or change anything).

Any other analyst or scientist can build on the work that others have done. Because of the shared ecosystem, discussion, collaboration, and learning happen in a much tighter feedback loop. I admit this can be overwhelming at times, and the analysts I know are basically cycling in and out of data exhaustion. However, as long as one of us keeps pushing the data forward (e.g. someone creates an insert query for a new DEX), everyone else benefits. It doesn't always have to be a complex abstract view; sometimes it's just a utility feature that makes it easy to look up ENS reverse resolvers (sketched at the end of this section), or a tooling improvement like automatically generating most graphQL mappings with one CLI command. All of it can be reused by everyone and exposed for API consumption in a product frontend or your own personal transaction model.

While the possibilities opened up here are amazing, I admit the wheels are not yet running smoothly. The analyst/data science side of the ecosystem is still very immature compared to data engineering. I think there are several reasons for this:

Data engineering has been a core focus of web3 for many years, from improvements to client RPC APIs to basic SQL/graphQL aggregation. Products like theGraph and Dune are examples of these efforts.

It has been incredibly difficult for analysts to understand web3's unique cross-protocol relationship tables. For example, an analyst might understand how to analyze just Uniswap, but struggle once aggregators, other DEXs, and different token types enter the mix. On top of that, the tools to make this work didn't really emerge until the last year.

Data scientists are typically used to collecting raw data and doing all the legwork on their own (building their own pipelines). I don't think they are used to working so closely and openly with analysts and engineers early in development. For me personally, this took a while to get used to.

In addition to learning how to work together, the web3 data community is also learning how to work across this new data stack. You no longer need to control the infrastructure or slowly grow from Excel to a data lake or data warehouse; as soon as your product goes live, its data is live everywhere. Your team is basically thrown into the deep end of data infrastructure.
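To make the dex.trades step above concrete, here is a minimal sketch of what "querying the trading volume of a single token across all exchanges" can look like. The SQL, the query ID, and the column names are my own illustrative assumptions rather than anything prescribed here (they follow the public dex.trades schema as I remember it), and the call uses Dune's REST API for saved queries, so check both against Dune's current docs before relying on them.

```python
"""Minimal sketch: pull aggregated DEX volume for one token from Dune's dex.trades.

Assumptions (not from this guide): a Dune API key, a saved query containing the
SQL below, and that query's ID. Verify column names against the dex.trades table
definition on Dune.
"""
import os
import requests

# Illustrative SQL you might save as a Dune query: per-DEX volume for one token.
EXAMPLE_SQL = """
select
    project,                       -- which DEX the trade happened on
    sum(amount_usd) as volume_usd  -- USD volume across all of that DEX's versions
from dex.trades
where token_bought_symbol = 'WETH'
   or token_sold_symbol  = 'WETH'
group by project
order by volume_usd desc
"""

DUNE_API_KEY = os.environ["DUNE_API_KEY"]  # key from your Dune account settings
QUERY_ID = 1234567                         # hypothetical ID of the saved query above

# Fetch the latest cached results of the saved query.
resp = requests.get(
    f"https://api.dune.com/api/v1/query/{QUERY_ID}/results",
    headers={"X-Dune-API-Key": DUNE_API_KEY},
    timeout=30,
)
resp.raise_for_status()

for row in resp.json()["result"]["rows"]:
    print(row["project"], row["volume_usd"])
```

The point is less the specific query and more that the table itself is a shared community artifact: when someone adds a new DEX to dex.trades, a query like this picks it up without anyone changing a line.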
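The ENS reverse-resolver utility mentioned above is another small, reusable building block. Here is a minimal sketch using web3.py, which is my tooling choice for illustration; the RPC endpoint and the example address are assumptions, not details from this guide.

```python
"""Minimal sketch: reverse-resolving an address to its ENS name with web3.py.

Assumptions (not from this guide): web3.py v6 and an Ethereum mainnet RPC
endpoint supplied via the RPC_URL environment variable.
"""
import os
from web3 import Web3

w3 = Web3(Web3.HTTPProvider(os.environ["RPC_URL"]))

# An example mainnet address (purely illustrative).
address = Web3.to_checksum_address("0xd8da6bf26964af9d7eed9e03e53415d37aa96045")

# Reverse resolution: address -> primary ENS name (None if no reverse record exists).
name = w3.ens.name(address)

# Good practice: confirm the name forward-resolves back to the same address,
# since anyone can point a reverse record at a name they do not own.
if name and w3.ens.address(name) == address:
    print(f"{address} -> {name}")
else:
    print(f"{address} has no verified reverse record")
```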
Data Tools

Here is a summary of some data tools. Let's look at each type and how to use it:

1. Interaction + data sources: mainly used for front ends, wallets, and lower-level data ingestion.
1.1. Clients: although the underlying Ethereum protocol is the same, each client offers different additional features. For example, Erigon has heavily optimized data storage/synchronization, and Quorum supports privacy chains.
1.2. Node as a service: you don't get to choose which client they run, but these services save you the trouble of keeping nodes and APIs up and running. The complexity of the node depends on how much data you want to capture (light node → full node → archive node).
2. Query + data mapping: data in this layer is either referenced in the contract as a URI, or comes from mapping transaction data from raw bytes into table schemas using the contract ABI. The ABI tells us which functions and events a contract contains; otherwise, all we can see is the deployed bytecode (without the ABI, you cannot reverse engineer or decode contract transactions; see the decoding sketch at the end of this section).
2.1. Transaction data: the most commonly used layer, mainly for dashboards and reports. TheGraph and Flipside APIs are also used in front ends (see the subgraph query sketch at the end of this section). Some tables are 1:1 mappings of contracts, and some allow additional transformations in the schema.
2.2. Metadata “protocols”: not really data products, but used for DIDs or file storage. Most NFTs use one or more of these data sources, and I think this year we will increasingly use them to enrich our queries.
2.3. Specialized providers: some are very robust data streaming products, such as Blocknative for mempool data and Parsec for on-chain transaction data. Others aggregate on-chain and off-chain data, such as DAO governance or treasury data.
2.4. Higher-level data providers: you cannot query or transform their data yourself, but they have done all the heavy lifting for you.

Without strong, outstanding communities supporting these tools, web3 wouldn't exist. Here are the standout communities for each type:

1. Flashbots: focuses on MEV, providing everything from custom RPCs that protect your transactions to professional white-hat services. MEV mainly refers to the front-running problem, where someone pays more gas than you (paid directly to the miner) so that their transaction executes first.
2. Dune Data Elite: data analysts focused on contributing to Dune's data ecosystem.
3. Flipside Data Elite: data analysts focused on advancing web3 data.
4. MetricsDAO: works across ecosystems and handles various data bounties on multiple chains.
5. DiamondDAO: focuses on data science work for Stellar, mainly in governance, treasury, and token management.
6. IndexCoop: focuses on analysis of specific sectors and tokens in order to build the best indices in crypto.
7. OurNetwork: weekly data coverage of various protocols and web3 at large.

Note: for contact information for the communities above, see the original article.

Each of these communities has done a ton of work to improve the web3 ecosystem. There is no doubt that products with communities behind them grow 100x faster. This is still a severely underrated competitive advantage, and I don't think people appreciate it until they have built something within these communities.
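Since point 2 above hinges on the ABI, here is a minimal sketch of what decoding a transaction actually looks like once you have one. I am assuming web3.py v6, a mainnet RPC endpoint, and the hash of a plain ERC-20 transfer; none of these specifics come from this guide, and the ABI fragment covers only transfer() and the Transfer event.

```python
"""Minimal sketch: why the ABI matters -- decoding a raw transaction with web3.py.

Assumptions (not from this guide): web3.py v6, a mainnet RPC endpoint in RPC_URL,
and the hash of a plain ERC-20 transfer in TX_HASH.
"""
import os
from web3 import Web3

# Minimal ERC-20 ABI fragment: just transfer() and the Transfer event.
ERC20_ABI = [
    {
        "name": "transfer",
        "type": "function",
        "stateMutability": "nonpayable",
        "inputs": [
            {"name": "to", "type": "address"},
            {"name": "amount", "type": "uint256"},
        ],
        "outputs": [{"name": "", "type": "bool"}],
    },
    {
        "name": "Transfer",
        "type": "event",
        "anonymous": False,
        "inputs": [
            {"name": "from", "type": "address", "indexed": True},
            {"name": "to", "type": "address", "indexed": True},
            {"name": "value", "type": "uint256", "indexed": False},
        ],
    },
]

w3 = Web3(Web3.HTTPProvider(os.environ["RPC_URL"]))
tx_hash = os.environ["TX_HASH"]

tx = w3.eth.get_transaction(tx_hash)
receipt = w3.eth.get_transaction_receipt(tx_hash)

# Without the ABI, tx["input"] is opaque calldata. With it, we recover the
# function that was called and its arguments...
token = w3.eth.contract(address=tx["to"], abi=ERC20_ABI)
func, args = token.decode_function_input(tx["input"])
print(func.fn_name, args)

# ...and the receipt's logs become typed Transfer events -- exactly the
# byte-to-schema mapping that Dune tables and subgraphs are built from.
for event in token.events.Transfer().process_receipt(receipt):
    print(event["args"]["from"], "->", event["args"]["to"], event["args"]["value"])
```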
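The subgraph usage mentioned in 2.1 is just as easy to sketch: a front end (or an analyst's notebook) simply posts a GraphQL query to a subgraph endpoint. The endpoint and the entity/field names below are assumptions for illustration and may have moved or changed, so substitute the subgraph you actually care about.

```python
"""Minimal sketch: reading indexed data from a subgraph, the way a front end would.

Assumptions (not from this guide): the public Uniswap v3 subgraph endpoint and
its entity/field names -- treat both as placeholders.
"""
import requests

SUBGRAPH_URL = "https://api.thegraph.com/subgraphs/name/uniswap/uniswap-v3"

# GraphQL lets you ask only for the fields you need -- here, the top pools by fees.
QUERY = """
{
  pools(first: 5, orderBy: feesUSD, orderDirection: desc) {
    id
    feesUSD
    volumeUSD
  }
}
"""

resp = requests.post(SUBGRAPH_URL, json={"query": QUERY}, timeout=30)
resp.raise_for_status()

for pool in resp.json()["data"]["pools"]:
    print(pool["id"], pool["feesUSD"], pool["volumeUSD"])
```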
Data Team

It goes without saying that you should also be looking for people in these communities to join your team. Let's break down the important web3 data skills and experience further so you know exactly what to search for. If you are the one looking to be hired, treat this as the list of skills and experience to seek out.

At a minimum, an analyst should be an Etherscan detective and know how to read a Dune dashboard. Getting comfortable with this takes about a month of leisurely learning, or two weeks if you really want to study like crazy. Beyond that, there is a lot more to consider, especially around time allocation and skill transferability:

1. Time: in web3, roughly 30-40% of a data analyst's time is spent keeping up with other analysts and protocols in the ecosystem. Make sure you don't overwhelm them with other work; otherwise, it becomes a long-term detriment to everyone. Learning, contributing, and building with the larger data community is necessary.
2. Transferability: in this field, both skills and domain knowledge are highly transferable. Moving between protocols takes less ramp-up time because the table schemas for on-chain data stay the same.

Remember, knowing how to use the tools is not the point; every analyst should be able to write SQL or build a data dashboard at some point. What matters is how they contribute to and work with the community. If the person you're interviewing is not part of any web3 data community (and doesn't seem to have any interest in joining one), you might want to ask yourself whether that is a red flag.