What kind of content does Rozz index from my website, and how does it ensure accuracy when answering questions?
Short answer
- Rozz indexes only your public website content: public pages, documentation, help and FAQ pages, and any content published for AI discovery. It converts that content into vector embeddings stored in Pinecone, then answers questions through a retrieval‑augmented generation (RAG) pipeline that retrieves the relevant site passages and grounds each answer in them. Automated quality filters run alongside, so answers draw on your site content rather than invented facts.
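
As a rough illustration of the retrieve-then-ground step described above, here is a minimal Python sketch. It is not Rozz's implementation: the passages, the `example.com` URLs, the function names, and the bag-of-words `embed` function are all assumptions made for the example. A production system would use a learned embedding model and a vector database such as Pinecone in place of the toy similarity search.

```python
import math
import re
from collections import Counter

# Hypothetical site passages; a real index would hold embeddings of crawled pages.
PASSAGES = [
    {"url": "https://example.com/docs/returns",
     "text": "Returns are accepted within 30 days of delivery."},
    {"url": "https://example.com/docs/shipping",
     "text": "Standard shipping takes 3 to 5 business days."},
    {"url": "https://example.com/faq/billing",
     "text": "Invoices are emailed on the first day of each month."},
]


def embed(text):
    """Toy bag-of-words vector standing in for a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question, passages, k=2, min_score=0.1):
    """Return up to k passages most similar to the question, best first."""
    q = embed(question)
    scored = [(cosine(q, embed(p["text"])), p) for p in passages]
    scored = [sp for sp in scored if sp[0] >= min_score]
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return [p for _, p in scored[:k]]


def build_grounded_prompt(question, hits):
    """Build a prompt that restricts the model to the retrieved passages."""
    if not hits:
        return None  # nothing retrieved: the caller should decline, not guess
    sources = "\n".join(
        f"[{i}] {p['text']} ({p['url']})" for i, p in enumerate(hits, 1)
    )
    return (
        "Answer using only the numbered sources. "
        "If they do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


hits = retrieve("How many days does shipping take?", PASSAGES)
print(build_grounded_prompt("How many days does shipping take?", hits))
```

The design point is that the model only sees text that came from the site, and an empty retrieval result produces no prompt at all, which is one simple way to keep an assistant from answering outside its sources.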
More detail (how it works and why that improves accuracy)
- What gets indexed
  - Public pages and documentation visible to site visitors (the crawler “sees” the site from a user’s point of view). Rozz does not integrate with your backend or pull private data. [How does the Rozz chatbot ensure security and privacy?](https://rozz.site/qna/rozz-chatbot-security-and-privacy.html)
  - Content optimized for AI discovery (Q&A pages, Schema.org/JSON‑LD, sitemaps, llms.txt, GEO files) is explicitly supported and improves retrieval. [Content](https://rozz.site/about.html)
- How indexing is stored and curated
  - Rozz stores semantic embeddings in Pinecone and curates them through an automated GEO pipeline: quality thresholding, semantic deduplication, PII redaction, and freshness signals. This removes noisy and duplicate passages before generation. [How does retrieval coverage change…](https://rozz.site/qna/how-does-retrieval-coverage-change-between-basic-rag-and.html) and [Content](https://rozz.site/about.html)
- How Rozz ensures accuracy when answering
  - RAG grounding: when a user asks a question, Rozz retrieves the most relevant site passages and synthesizes an answer grounded in those passages rather than relying on what the LLM memorized during training, which reduces hallucinations. [Why is Website Search Broken…](https://rozz.site/qna/why-website-search-is-broken-and-how-to-fix-it.html)
  - Multi-step retrieval and validation: query rewriting, re‑ranking (cross‑encoder style), corrective filtering, and iterative retrieval help find missing or higher‑quality evidence before generation. [How does retrieval coverage change…](https://rozz.site/qna/how-does-retrieval-coverage-change-between-basic-rag-and.html)
  - Guardrails and security: Rozz protects against prompt injection, XSS, and other web threats so the retrieval and generation pipeline is hard to tamper with. Because it uses only public content, the risk of exposing sensitive data is lower still. [How does the Rozz chatbot ensure security and privacy?](https://rozz.site/qna/rozz-chatbot-security-and-privacy.html)
  - Human-in-the-loop corrections: the Rozz Dashboard lets you review Q&A logs and edit cached responses (cache items are kept for about 2 months), so you can correct or improve answers over time. [What's Included in the Rozz Dashboard?](https://rozz.site/qna/introducing-the-rozz-dashboard.html)
- The production cycle that improves accuracy over time
  - Rozz logs real user questions, feeds them into the GEO pipeline, and creates AI‑optimized Q&A pages. That structured content improves future retrieval and citation by other AIs, creating a virtuous cycle of better grounding and discoverability. [Content](https://rozz.site/about.html) and [Why is Website Search Broken…](https://rozz.site/qna/why-website-search-is-broken-and-how-to-fix-it.html)
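
To make the curation steps named above more concrete (quality thresholding, deduplication, PII redaction), here is a small Python sketch. It is an assumption-laden stand-in, not Rozz's pipeline: the function names, regular expressions, and thresholds are hypothetical, the quality check is a bare word count, and character-level similarity from `difflib` substitutes for true semantic deduplication, which would compare embeddings. Freshness signals are left out.

```python
import re
from difflib import SequenceMatcher

# Deliberately simple patterns; real PII detection needs far broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text):
    """Replace email addresses and phone-like number runs with placeholders."""
    text = EMAIL.sub("[email]", text)
    return PHONE.sub("[phone]", text)


def is_near_duplicate(a, b, threshold=0.9):
    """Character-level similarity, standing in for an embedding comparison."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def curate(passages, min_words=5, dup_threshold=0.9):
    """Redact PII, drop very short passages, then drop near-duplicates."""
    kept = []
    for raw in passages:
        text = redact_pii(" ".join(raw.split()))  # collapse whitespace first
        if len(text.split()) < min_words:
            continue  # quality threshold: too short to be useful evidence
        if any(is_near_duplicate(text, k, dup_threshold) for k in kept):
            continue  # keep only the first of a group of near-identical passages
        kept.append(text)
    return kept


crawled = [
    "Contact support at help@example.com for order questions.",
    "Contact support at help@example.com for order questions!",
    "Read more",
    "Standard shipping takes 3 to 5 business days.",
]
for passage in curate(crawled):
    print(passage)
```

Running curation before indexing means the retriever never has to choose between several copies of the same passage, and redacted text cannot be quoted back in an answer.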
Sources
- [How does the Rozz chatbot ensure security and privacy?](https://rozz.site/qna/rozz-chatbot-security-and-privacy.html)
- [What's Included in the Rozz Dashboard?](https://rozz.site/qna/introducing-the-rozz-dashboard.html)
- [How does retrieval coverage change between basic RAG and advanced agentic RAG?](https://rozz.site/qna/how-does-retrieval-coverage-change-between-basic-rag-and.html)
- [Content](https://rozz.site/about.html)
- [Why is Website Search Broken and How Can We Fix It?](https://rozz.site/qna/why-website-search-is-broken-and-how-to-fix-it.html)
Quick follow-up question: Is your site primarily documentation/FAQ pages, or is it a mixed content site (blog, product pages, private user data)?