
Indexify: Compute Engine for Building and Serving Multi-Stage Data-Intensive Workflows


Processing and analyzing large amounts of data efficiently is crucial in today's data-driven world. Enter Indexify, a compute engine for building and serving multi-stage data-intensive workflows. In this blog post, we'll explore Indexify's capabilities and walk through a practical example: building a news analysis workflow.

What is Indexify?

Indexify is a compute engine that allows you to create durable data-intensive workflows and serve them as APIs. Its key features include:

  1. Elastic workflows that can scale across multiple machines
  2. Parallel function execution for improved performance
  3. Automatic data movement between dependent functions
  4. Serving workflows as live API endpoints for easy integration

Building a News Analysis Workflow with Indexify

Let's dive into a practical example of how Indexify can be used to create a multi-stage workflow for news analysis. Our workflow will fetch news articles, summarize them, and perform sentiment analysis.

Step 1: Setting Up the Environment

First, we'll import the necessary modules and define our custom types:

from indexify import indexify_function, Graph, RemoteGraph
from typing import List, Dict
from pydantic import BaseModel

# Custom types and imports...
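
The custom types aren't shown here, but given the workflow below, one plausible shape is a small Pydantic model for a scraped article. Note that the model name and fields are illustrative guesses, not the original code:

class Article(BaseModel):
    url: str
    title: str
    content: str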

Step 2: Defining Workflow Functions

Our workflow consists of three main functions:

  1. fetch_news_urls: Retrieves a list of news article URLs based on a given topic.
  2. scrape_and_extract: Scrapes the content of each article and extracts key information.
  3. generate_summary: Summarizes the articles and provides sentiment analysis using GPT-3.5.

Here's an example of one of these functions:

@indexify_function()
def fetch_news_urls(topic: str, num_articles: int = 5) -> List[str]:
    import requests

    # Query NewsAPI for recent articles on the topic.
    # Replace YOUR_API_KEY with your own NewsAPI key.
    response = requests.get(
        f"https://newsapi.org/v2/everything?q={topic}&pageSize={num_articles}&apiKey=YOUR_API_KEY"
    )
    articles = response.json()["articles"]
    return [article["url"] for article in articles]
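
To make the rest of the workflow easier to follow, here are minimal sketches of the other two functions. The scraping approach (requests plus BeautifulSoup), the prompt, and the OpenAI client usage are illustrative assumptions rather than the original implementations:

@indexify_function()
def scrape_and_extract(url: str) -> Article:
    import requests
    from bs4 import BeautifulSoup

    # Fetch the article page and reduce it to plain text.
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string if soup.title and soup.title.string else url
    text = soup.get_text(separator=" ", strip=True)

    # Truncate so the summarization prompt stays within model limits.
    return Article(url=url, title=title, content=text[:4000])

@indexify_function()
def generate_summary(article: Article) -> Dict:
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "user",
                "content": (
                    "Summarize this article in 2-3 sentences and label its "
                    "sentiment as positive, negative, or neutral.\n\n"
                    f"{article.title}\n\n{article.content}"
                ),
            }
        ],
    )
    return {"url": article.url, "summary": response.choices[0].message.content}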

Step 3: Creating the Workflow Graph

With our functions defined, we can now create the workflow graph:

g: Graph = Graph(name="news-analyzer-2", start_node=fetch_news_urls)
g.add_edge(fetch_news_urls, scrape_and_extract)
g.add_edge(scrape_and_extract, generate_summary)
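
Because fetch_news_urls returns a list of URLs, Indexify can invoke scrape_and_extract once per URL, with those invocations running in parallel across machines; this is the parallel function execution highlighted earlier.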

Step 4: Deploying and Running the Workflow

Finally, we can deploy our workflow and run it:

if __name__ == "__main__":
    # Register the graph with the Indexify server.
    remote_graph = RemoteGraph.deploy(g, server_url="http://localhost:8900")

    # Get a handle to the deployed graph by name.
    graph = RemoteGraph.by_name(
        name="news-analyzer-2", server_url="http://localhost:8900"
    )

    # Invoke the workflow and wait for it to finish.
    invocation_id = graph.run(
        block_until_done=True, topic="artificial intelligence", num_articles=5
    )

    # Retrieve the output of the final function in the graph.
    results = graph.get_output(invocation_id, fn_name="generate_summary")
    print(results)
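
With the per-article sketch above, results would be a list of summary dictionaries, one per fetched article.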

Benefits of Using Indexify

  1. Scalability: Indexify's elastic workflows can handle large-scale data processing tasks efficiently.
  2. Modularity: Each function in the workflow can be developed and tested independently.
  3. Parallelism: Functions can run in parallel across multiple machines, improving overall performance.
  4. API Integration: Workflows are served as live API endpoints, making it easy to integrate with existing systems.
  5. Flexibility: Indexify can be used for a wide range of data-intensive tasks, from ETL processes to machine learning pipelines.

Conclusion

Indexify provides a powerful and flexible framework for building and serving multi-stage data-intensive workflows. By handling the complexities of distributed computing and data movement, it lets developers focus on building robust, efficient data processing pipelines. Whether you're working on news analysis, as in our example, or tackling other data-intensive tasks, Indexify can help streamline your workflow and improve your productivity.

Give Indexify a try and experience the benefits of this innovative compute engine for yourself!
