Show HN: FastGraphRAG – Better RAG using good old PageRank

github.com

446 points · liukidar · 2 days ago

Hey there HN! We’re Antonio, Luca, and Yuhang, and we’re excited to introduce Fast GraphRAG, an open-source RAG approach that leverages knowledge graphs and the 25 years old PageRank for better information retrieval and reasoning.

Building a good RAG pipeline these days takes a lot of manual optimizations. Most engineers intuitively start from naive RAG: throw everything in a vector database and hope that semantic search is powerful enough. This can work for use cases where accuracy isn’t too important and hallucinations are tolerable, but it doesn’t work for more difficult queries that involve multi-hop reasoning or more advanced domain understanding. Also, it’s impossible to debug it.

To address these limitations, many engineers find themselves adding extra layers like agent-based preprocessing, custom embeddings, reranking mechanisms, and hybrid search strategies. Much like the early days of machine learning when we manually crafted feature vectors to squeeze out marginal gains, building an effective RAG system often becomes an exercise in crafting engineering “hacks.”

Earlier this year, Microsoft seeded the idea of using Knowledge Graphs for RAG and published GraphRAG - i.e. RAG with Knowledge Graphs. We believe that there is an incredible potential in this idea, but existing implementations are naive in the way they create and explore the graph. That’s why we developed Fast GraphRAG with a new algorithmic approach using good old PageRank.

There are two main challenges when building a reliable RAG system:

(1) Data Noise: Real-world data is often messy. Customer support tickets, chat logs, and other conversational data can include a lot of irrelevant information. If you push noisy data into a vector database, you’re likely to get noisy results.

(2) Domain Specialization: For complex use cases, a RAG system must understand the domain-specific context. This requires creating representations that capture not just the words but the deeper relationships and structures within the data.

Our solution builds on these insights by incorporating knowledge graphs into the RAG pipeline. Knowledge graphs store entities and their relationships, and can help structure data in a way that enables more accurate and context-aware information retrieval. 12 years ago Google announced the knowledge graph we all know about [1]. It was a pioneering move. Now we have LLMs, meaning that people can finally do RAG on their own data with tools that can be as powerful as Google’s original idea.

Before we built this, Antonio was at Amazon, while Luca and Yuhang were finishing their PhDs at Oxford. We had been thinking about this problem for years and we always loved the parallel between pagerank and the human memory [2]. We believe that searching for memories is incredibly similar to searching the web.

Here’s how it works:

- Entity and Relationship Extraction: Fast GraphRAG uses LLMs to extract entities and their relationships from your data and stores them in a graph format [3].

- Query Processing: When you make a query, Fast GraphRAG starts by finding the most relevant entities using vector search, then runs a personalized PageRank algorithm to determine the most important “memories” or pieces of information related to the query [4].

- Incremental Updates: Unlike other graph-based RAG systems, Fast GraphRAG natively supports incremental data insertions. This means you can continuously add new data without reprocessing the entire graph.

- Faster: These design choices make our algorithm faster and more affordable to run than other graph-based RAG systems because we eliminate the need for communities and clustering.

Suppose you’re analyzing a book and want to focus on character interactions, locations, and significant events:

  from fast_graphrag import GraphRAG
  
  DOMAIN = "Analyze this story and identify the characters. Focus on how they interact with each other, the locations they explore, and their relationships."
  
  EXAMPLE_QUERIES = [
      "What is the significance of Christmas Eve in A Christmas Carol?",
      "How does the setting of Victorian London contribute to the story's themes?",
      "Describe the chain of events that leads to Scrooge's transformation.",
      "How does Dickens use the different spirits (Past, Present, and Future) to guide Scrooge?",
      "Why does Dickens choose to divide the story into \"staves\" rather than chapters?"
  ]
  
  ENTITY_TYPES = ["Character", "Animal", "Place", "Object", "Activity", "Event"]
  
  grag = GraphRAG(
      working_dir="./book_example",
      domain=DOMAIN,
      example_queries="\n".join(EXAMPLE_QUERIES),
      entity_types=ENTITY_TYPES
  )
  
  with open("./book.txt") as f:
      grag.insert(f.read())
  
  print(grag.query("Who is Scrooge?").response)
This code creates a domain-specific knowledge graph based on your data, example queries, and specified entity types. Then you can query it in plain English while it automatically handles all the data fetching, entity extractions, co-reference resolutions, memory elections, etc. When you add new data, locking and checkpointing is handled for you as well.

This is the kind of infrastructure that GenAI apps need to handle large-scale real-world data. Our goal is to give you this infrastructure so that you can focus on what’s important: building great apps for your users without having to care about manually engineering a retrieval pipeline. In the managed service, we also have a suite of UI tools for you to explore and debug your knowledge graph.

We have a free hosted solution with up to 100 monthly requests. When you’re ready to grow, we have paid plans that scale with you. And of course you can self host our open-source engine.

Give us a spin today at https://circlemind.co and see our code at https://github.com/circlemind-ai/fast-graphrag

We’d love feedback :)

[1] https://blog.google/products/search/introducing-knowledge-gr...

[2] Griffiths, T. L., Steyvers, M., & Firl, A. (2007). Google and the Mind: Predicting Fluency with PageRank. Psychological Science, 18(12), 1069–1076. http://www.jstor.org/stable/40064705

[3] Similarly to Microsoft’s GraphRAG: https://github.com/microsoft/graphrag

[4] Similarly to OSU’s HippoRAG: https://github.com/OSU-NLP-Group/HippoRAG

https://vhs.charm.sh/vhs-4fCicgsbsc7UX0pemOcsMp.gif


118 comments
michelpp · 2 days ago
PageRank is one of several interesting centrality metrics that could be applied to a graph to influence RAG on structural data, another one is Triangle Centrality which counts triangles around nodes to figure out their centrality based on the concept that triangles close relationships into a strong bond, where open bonds dilute centrality by drawing weight away from the center:

https://arxiv.org/abs/2105.00110

The paper shows high efficiency compared to other centralities like PageRank, however in some research using the GraphBLAS I and my coauthors found that TC was slower on a variety of sparse graphs than our sparse formulation of PR for graphs up to 1.8 billion edges, but that TC appears to scale better as graphs get larger and is likely more efficient in the trillion edge realm.

https://fossies.org/linux/SuiteSparse/GraphBLAS/Doc/The_Grap...

Show replies

LASR · 2 days ago
So I've done a ton of work in this area.

Few learnings I've collected:

1. Lexical search with BM25 alone gives you very relevant results if you can do some work during ingestion time with an LLM.

2. Embeddings work well only when the size of the query is roughly on the same order of what you're actually storing in the embedding store.

3. Hypothetical answer generation from a query using an LLM, and then using that hypothetical answer to query for embeddings works really well.

So combining all 3 learnings, we landed on a knowledge decomposition and extraction step very similar to yours. But we stick a metaprompter to essentially auto-generate the domain / entity types.

LLMs are naively bad at identifying the correct level of granularity for the decomposed knowledge. One trick we found is to ask the LLM to output a mermaid.js mindmap to hierarchically break down the input into a tree. At the end of that output, ask the LLM to state which level is the appropriate root for a knowledge node.

Then the node is used to generate questions that could be answered from the knowledge contained in this node. We then index the text of these questions and also embed them.

You can directly match the user's query from these questions using purely BM25 and get good outputs. But a hybrid approach works even better, though not by that much.

Not using LLMs are query time also means we can hierarchically walk down the root into deeper and deeper nodes, using the embedding similiarity as a cost function for the traversal.

Show replies

AIorNot · 2 days ago
This is very cool, I signed up and uploaded a few docs (PDFs) to the dashboard

Our Use case: We have been looking at farming out this work (analyzing complaince documents (manufacturing paperwork) for our AI Startup however we need to understand the potential scale this can operate under and the cost model for it to be useful to us

We will have about 300K PDF documents per client and expect about a 10% change in that document set, month to month -any GraphRag system has to handle documents at scale - we can use S3 as an igestion mechanism but have to understand the cost and processing time needed for the system to be ready to use duiring:

1. inital loading 2. regular updates -how do we delete data from system for example

cool framework btw..

Show replies

inboulder · 1 days ago
PageRank for better centrality seems neat, but it still doesn't address the probably unsolvable flaw with RAG, the reason why RAG basically can't work. All RAG DBs under-perform expectations because RAG fundamentally can't find relationships between words necessary to find the information the user cares about. Weird right, isn't this what the 'attention' mechanism is supposed to be good for? It just isn't good enough.

Example: Say you're searching an article and you want to know what occupation a mentioned person has, let's say the person 'Sharon,' is mentioned to have attended several physical chemistry conferences but her occupation is never explicitly mentioned. There's a very good chance every single rag approach will fail to return correct results, will fail to make this connection between 'occupation' attends conference, type of conference and infers 'chemist'. I could go on, but this sort of error is pervasive along all types of information when trying to retrieve with RAG. In the end, solutions like the above seem to just sort of reinvent other query methods, SQL, pagerank etc, with extra steps... there's little point in vectorization at that point...

Show replies

jillesvangurp · 2 days ago
Cool idea. IMHO traditional information retrieval is the way to go with RAG. Vector search is nice but also slow and expensive and people seem to use it as magic pixie dust. It works nice for unstructured data but not necessarily that well for structured data.

And unless tuned very well, vector search is not actually a whole lot better than a good old well tuned query. Putting everything together, the practice of turning structured data into unstructured data just so you can do vector search or prompt engineering on it, which I've seen teams do, feels a bit backwards. It kind of works but there are probably smarter ways to get the same results. Graph RAG is essentially about making use of structure of data. Whether that's through SQL joins or by querying some graph database doesn't really matter much.

There is probably some value into teaching LLMs how to query as well; or letting them interface with existing search/query APIs. And you can compensate for poor ranking with larger context sizes and simply fetch a few hundred or even more results with multiple queries. It's going to be a lot faster and cheaper than vector search to scale that.