
Commit 02b65f2

Merge pull request #588 from GraphScope/refactor_workflow
feat: Refactor the workflow
2 parents 1c00f67 + 54d7715 commit 02b65f2


203 files changed: +7091 -2009 lines changed


python/graphy/README.md

Lines changed: 51 additions & 5 deletions
# Graphy'ourData

Have you heard the buzz about the incredible power of large language models (LLMs) and their advanced applications, like Retrieval-Augmented Generation (RAG) or AI Agents? It’s exciting, right? But here’s the real challenge:

> How can you truly empower your existing data with these cutting-edge techniques, especially when your data is mostly unstructured?

Preprocessing unstructured data is often tedious and time-consuming. And building a practical, LLM-based system that can fully leverage the potential of your data can be an even bigger hurdle.

**Graphy** is an end-to-end platform that transforms unstructured data into actionable insights. Valuable information in such data often remains hidden and hard to utilize. Graphy bridges this gap by leveraging LLMs to extract meaningful structures from unstructured data and organize them into a graph, enabling intuitive visualization, seamless exploration, and powerful LLM-based analysis.

![graphy](inputs/figs/workflow.png "The pipeline of Graphy")

This repository introduces the initial prototype of the Graphy platform, as illustrated above, with a focus on academic papers, which are often publicly accessible. In this prototype, the primary unstructured data consists of research paper PDFs. Graphy’s workflow is built upon two key abstractions:

- **Inspector**: The Inspector defines the structured information to be extracted from papers. It uses an inner Directed Acyclic Graph (DAG), where each node carries specific instructions for LLMs to extract targeted information from the paper. This DAG mirrors the ["Tree of Thought"](https://arxiv.org/abs/2305.10601) pipeline commonly referenced in the LLM literature.
- **Navigator**: The Navigator determines how related papers are fetched and then processed by the Inspector. Currently, two navigators are available:
  - [Arxiv Fetcher](./utils/arxiv_fetcher.py) for retrieving PDFs from ArXiv.
  - [Google Scholar Fetcher](./utils/scholar_fetcher.py) for fetching PDFs via Google Scholar.

These navigators enable the creation of a rich, interconnected database of academic papers.
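
To make the Inspector's inner DAG concrete, below is a minimal, hypothetical sketch of such a workflow expressed as a Python dict. The field names (`nodes`, `edges`, `query`) are illustrative assumptions, not the actual schema of `config/workflow.json`:

```python
# A hypothetical Inspector DAG: each node holds an LLM instruction, and
# edges let later nodes build on earlier extractions ("Tree of Thought" style).
# Field names are illustrative, not the actual config/workflow.json schema.
paper_inspector = {
    "nodes": [
        {"name": "Paper", "query": "Extract the title, authors, and abstract."},
        {"name": "Contribution", "query": "Summarize the main contributions."},
        {"name": "Challenge", "query": "What challenges motivate each contribution?"},
        {"name": "Solution", "query": "How does the paper address each challenge?"},
    ],
    # Directed edges of the DAG: "Contribution" is extracted after "Paper", etc.
    "edges": [
        ("Paper", "Contribution"),
        ("Contribution", "Challenge"),
        ("Challenge", "Solution"),
    ],
}
```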
## Workflow to Graph Mapping

As illustrated in the figure above, the workflow maps naturally to a structured graph model (a minimal code sketch follows the list). In this graph:

- Primary nodes (or "Fact" nodes) represent papers, containing key extracted information.
- Connected nodes (or "Dimension" nodes) represent specific pieces of information extracted from the papers by the Inspector.
- The Navigator links papers to related papers, forming an interconnected web of academic resources.
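
A rough illustration of this mapping, using `networkx` purely for demonstration; the node kinds and edge labels are assumptions, not Graphy's actual data model:

```python
import networkx as nx

# Illustrative only: one paper as a "Fact" node, an extracted field as a
# "Dimension" node, and a Navigator-discovered reference as a paper-to-paper
# edge. Labels here are assumptions, not Graphy's actual schema.
g = nx.DiGraph()

g.add_node("paper:2305.10601", kind="Fact", title="Tree of Thoughts")
g.add_node("dim:contribution:1", kind="Dimension", text="Deliberate search over reasoning steps")
g.add_edge("paper:2305.10601", "dim:contribution:1", label="HAS_CONTRIBUTION")

# A reference fetched by a Navigator becomes another Fact node plus a CITES edge.
g.add_node("paper:2201.11903", kind="Fact", title="Chain-of-Thought Prompting")
g.add_edge("paper:2305.10601", "paper:2201.11903", label="CITES")

print(g.number_of_nodes(), g.number_of_edges())  # 3 2
```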

With this structured database in place, various analyses can be conducted. Our [frontend server](../../examples/graphy/README.md) demonstrates data visualizations, exploration tools, and analytics that support numerous downstream tasks, including tracking research trends, drafting related work sections, and generating prompts for slide creation, all with just a few clicks.

## Potential Extensions

- **Customized Inspector**: The Inspector can be tailored to extract any type of information from paper documents. It can also be extended to handle other types of unstructured data, such as legal documents, medical records, or financial reports.
- **Customized Navigator**: The Navigator can be expanded to fetch data from additional sources, such as PubMed, IEEE, or Springer. Furthermore, navigators could be developed to connect papers to supplementary sources like GitHub repositories, enabling even richer datasets and analyses.

# Install Dependencies

```bash
source venv/bin/activate
pip install -r requirements.txt
```

## Setting Python Environment

We have not built and installed the Python package yet, so it is important to add the package path to `PYTHONPATH` before running the server:

```bash
export PYTHONPATH=$PYTHONPATH:$(pwd)
```
# Run Offline Paper Scraper

The provided utility allows you to scrape research papers from arXiv. Starting from a set of seed papers, the scraper iteratively fetches the papers referenced by those already processed, continuing until a specified number of papers (`max_inspectors`) has been downloaded and processed.

**Usage**:
```bash
python paper_scrapper.py --max-workers 4 --max-inspectors 500 --workflow <path_to_workflow> <path_to_seed_papers>
```

- `--max-workers` (optional): the maximum number of parallel workers (default: 4).
- `--max-inspectors` (optional): the maximum number of papers to fetch and process (default: 100).
- `--workflow` (optional): the path to a workflow configuration file. If not provided, the default configuration `config/workflow.json` is used.
- `<path_to_seed_papers>`: the path containing the seed papers. Each paper is a PDF document.

> Ensure that the workflow configuration contains your custom LLM model settings by modifying the `llm_model` field.
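Conceptually, this is a bounded reference crawl. The following is a minimal sketch of that loop under assumed callables (`inspect`, `fetch_references`); it is not the actual `paper_scrapper.py` implementation:

```python
from collections import deque

def crawl(seed_pdfs, fetch_references, inspect, max_inspectors=100):
    """Hypothetical sketch of the bounded reference crawl described above.

    `inspect` stands in for running the Inspector's extraction DAG on a PDF,
    and `fetch_references` for a Navigator fetching referenced PDFs; the real
    logic lives in paper_scrapper.py and may differ.
    """
    queue = deque(seed_pdfs)
    done = set()
    while queue and len(done) < max_inspectors:
        pdf = queue.popleft()
        if pdf in done:
            continue
        inspect(pdf)           # extract structured info from this paper
        done.add(pdf)
        queue.extend(fetch_references(pdf))  # enqueue its referenced papers
    return done
```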
# Run Backend Server

A backend demo application is included in this project, accessible as a standalone server.

**Usage**:
```bash
python apps/demo_app.py
```
