Data Engineer
This is a remote position.

Data Engineer - MCA Corporate Filings Pipeline

About Verizol
Verizol is building India's most comprehensive new-company intelligence platform. Every day, thousands of companies register with MCA and GST across India. We capture this data the same day it becomes available, enrich it with director contact details and financial intelligence, and deliver it to CA firms, agencies, NBFCs, and businesses through our subscription platform and white-label reseller network.

This role is the technical core of our product. The data pipeline you build is what subscribers pay for every month.

About This Role
We are looking for a Data Engineer to own and build the MCA Corporate Filings Intelligence Pipeline: the system that converts unstructured government filings (PDFs, XBRL, scanned documents, and web data) into clean, structured, queryable business intelligence.

MCA does not provide an official public API. This role requires building a resilient data acquisition system using a combination of unofficial endpoints, web scraping, third-party enrichment APIs, and AI-based document extraction, and making it run reliably, every single day, without breaking.

If you enjoy the challenge of "the data is out there, but it's a mess; go make it useful," this role is built for you.

What You Will Build
Daily Company Ingestion Pipeline: Build and maintain the pipeline that fetches newly incorporated companies (Private Limited, LLP, OPC) from MCA every day, using a combination of MCA's unofficial v3 endpoints, monthly ROC bulk files, and Selenium-based scraping as a fallback. This pipeline must run every morning before 8 AM and handle rate limits, CAPTCHAs, and IP rotation gracefully.

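For candidates who want a feel for the problem, the fallback idea can be sketched roughly as below. This is only an illustration under stated assumptions, not Verizol's actual pipeline: `SourceError`, `fetch_with_fallback`, and the fetcher callables are hypothetical names, and each fetcher is assumed to raise `SourceError` when its source is unavailable or rate limited.

```python
import time
from typing import Callable, List, Sequence, Tuple


class SourceError(Exception):
    """Hypothetical error a fetcher raises when its source fails or is rate limited."""


# A fetcher takes a date string and returns a list of company records.
Fetcher = Callable[[str], List[dict]]


def fetch_with_fallback(
    day: str,
    sources: Sequence[Tuple[str, Fetcher]],
    retries: int = 2,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, List[dict]]:
    """Try each source in priority order, retrying with exponential backoff.

    Returns the name of the source that succeeded together with its records.
    Raises SourceError if every source fails on every attempt.
    """
    errors: List[str] = []
    for name, fetch in sources:
        for attempt in range(retries + 1):
            try:
                return name, fetch(day)
            except SourceError as exc:
                errors.append(f"{name} attempt {attempt + 1}: {exc}")
                if attempt < retries:
                    # Wait 1x, 2x, 4x ... the base delay before retrying.
                    sleep(base_delay * 2 ** attempt)
    raise SourceError("all sources failed: " + "; ".join(errors))
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting. In practice the source list might be ordered as endpoint fetcher, bulk-file fetcher, then browser-based scraper, but that ordering is an assumption here.
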
Director and Contact Enrichment: Enrich every new company with director details (name, DIN, designation) and, where possible, director mobile numbers and emails, using a chained fallback across multiple third-party APIs (Sandbox, CompData, Apollo) and GST cross-referencing.

Financial Filings Extraction Pipeline: Build the system that downloads AOC-4, MGT-7, CHG-1, DIR-12, and PAS-3 filings and extracts structured financial data from them, using XBRL parsing for structured filings and a combination of OCR (Tesseract) and AI extraction (Claude API) for scanned PDFs.

Data Transformation and Intelligence Layer: Normalise extracted financial data (currency units, date formats, validation), compute financial ratios (debt-to-equity, current ratio, profit margins), generate a 0-100 financial health score per company, and detect business signals (growth companies, loan opportunities, financial distress, recent funding).

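As a rough illustration of the normalisation and ratio step, the sketch below converts amounts reported in different Indian units to rupees and computes the three ratios named above. The `Financials` fields, the unit table, and the function names are assumptions made for this example; the health score is left out because its formula is not described here.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Multipliers for units commonly seen in Indian financial statements.
UNIT_MULTIPLIERS = {
    "rupees": 1,
    "thousands": 1_000,
    "lakhs": 100_000,
    "crores": 10_000_000,
}


def to_rupees(amount: float, unit: str) -> float:
    """Convert an amount reported in the given unit to plain rupees."""
    return amount * UNIT_MULTIPLIERS[unit.strip().lower()]


def safe_ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    if denominator == 0:
        return None
    return numerator / denominator


@dataclass
class Financials:
    """Hypothetical normalised figures for one company, all in rupees."""
    total_debt: float
    shareholder_equity: float
    current_assets: float
    current_liabilities: float
    revenue: float
    net_profit: float


def compute_ratios(f: Financials) -> Dict[str, Optional[float]]:
    """Compute debt-to-equity, current ratio, and net profit margin."""
    return {
        "debt_to_equity": safe_ratio(f.total_debt, f.shareholder_equity),
        "current_ratio": safe_ratio(f.current_assets, f.current_liabilities),
        "net_profit_margin": safe_ratio(f.net_profit, f.revenue),
    }
```

Returning `None` for an undefined ratio, rather than zero or infinity, lets downstream scoring treat missing data explicitly.
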
Director Network Graph: Build and maintain a graph of director-to-company relationships, used to detect connected companies, serial founders, and director disqualification risks (MCA Section 164).

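One simple way to picture such a graph is as two lookup tables, director to companies and company to directors. The sketch below is an in-memory illustration only; the class and method names are hypothetical, the identifiers are placeholders, and a production system might well use a graph database instead.

```python
from collections import defaultdict
from typing import Dict, List, Set


class DirectorGraph:
    """Bipartite director/company graph keyed by DIN and CIN strings."""

    def __init__(self) -> None:
        self.companies_by_director: Dict[str, Set[str]] = defaultdict(set)
        self.directors_by_company: Dict[str, Set[str]] = defaultdict(set)

    def add_appointment(self, din: str, cin: str) -> None:
        """Record that the director with this DIN sits on this company's board."""
        self.companies_by_director[din].add(cin)
        self.directors_by_company[cin].add(din)

    def connected_companies(self, cin: str) -> Set[str]:
        """Companies that share at least one director with the given company."""
        linked: Set[str] = set()
        for din in self.directors_by_company.get(cin, set()):
            linked |= self.companies_by_director[din]
        linked.discard(cin)
        return linked

    def serial_directors(self, min_companies: int = 3) -> List[str]:
        """DINs of directors appointed to at least min_companies companies."""
        return sorted(
            din
            for din, companies in self.companies_by_director.items()
            if len(companies) >= min_companies
        )
```

The threshold of three companies for a "serial" director is an arbitrary default chosen for the example.
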
Pipeline Orchestration and Monitoring: Schedule and monitor all jobs using AWS Step Functions / Bull queues with cron scheduling. Build comprehensive failure handling, retry logic, and real-time WhatsApp alerting when pipelines fail or quality drops.

Data Quality and Compliance: Build validation rules, quality scoring, duplicate detection, and DPDPA-compliant data handling (stripping prohibited personal data fields, honeypot record management, opt-out suppression).

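A minimal sketch of three of these steps (opt-out suppression, duplicate detection, and field stripping) might look like the following. The field names in `PROHIBITED_FIELDS` and the choice of `(cin, din)` as the duplicate key are placeholders for illustration, not a statement of what DPDPA requires or of Verizol's actual rules.

```python
from typing import Dict, Iterable, List, Set

# Placeholder list; the real set of prohibited fields is a compliance decision.
PROHIBITED_FIELDS = frozenset({"aadhaar", "pan", "date_of_birth"})


def clean_records(
    records: Iterable[Dict[str, object]],
    suppressed_dins: Set[str],
    prohibited: Iterable[str] = PROHIBITED_FIELDS,
) -> List[Dict[str, object]]:
    """Drop opted-out directors and duplicates, then strip prohibited fields."""
    blocked = set(prohibited)
    seen = set()
    cleaned: List[Dict[str, object]] = []
    for record in records:
        # Opt-out suppression: skip anyone on the suppression list.
        if record.get("din") in suppressed_dins:
            continue
        # Duplicate detection: keep the first record per (company, director) pair.
        key = (record.get("cin"), record.get("din"))
        if key in seen:
            continue
        seen.add(key)
        # Field stripping: copy the record without prohibited keys.
        cleaned.append({k: v for k, v in record.items() if k not in blocked})
    return cleaned
```

The function returns new dictionaries and leaves the input records unmodified, which makes it safe to run before and after other pipeline stages.
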
What Makes This Role Interesting
You are solving a real puzzle, not following a spec. MCA has no official API. There is no documentation. You will be reverse-engineering endpoints, building fallback chains, and constantly adapting when government websites change without notice. This is data engineering at its most hands-on.

Your output is the product. Every subscriber's morning data alert, every financial health score shown on the dashboard, every "company registered yesterday" notification: all of it comes from the pipeline you build and maintain.

You will work with cutting-edge AI extraction. Using Claude to turn messy scanned PDFs of Indian balance sheets into clean structured JSON is a genuinely novel application. You will be designing and refining prompts that directly affect data accuracy for thousands of users.

High ownership, fast feedback loops. If the pipeline breaks at 6 AM, you will know by 6:15 AM, fix it, and see it reflected for subscribers within the hour. No multi-week deployment cycles.

A Typical Week Might Include

Interview Process
We aim to complete the entire process within 7 to 10 days.

How to Apply
Send your resume and GitHub/portfolio link to careers@verizol.ai with the subject line "Data Engineer Application - [Your Name]". Include a short note (3 to 4 lines) on a data pipeline or scraping project you have built, especially if it involved messy, unstructured, or unofficial data sources.

We read every application personally.

Verizol is an equal opportunity employer. We welcome applications from all backgrounds and experience levels that meet the must-have criteria.

Requirements
Tech Stack You Will Use
You do not need prior experience with every item on this list, but you should be excited to learn government data systems, OCR, and AI-based extraction if you haven't worked with them before.

What We Are Looking For
Must-Have
Good to Have
Not Required

Benefits
Compensation and Benefits