The consistent naming of crops is essential to building an accurate and reliable model. But the many naming conventions for crops and districts in such a large and diverse country were just one of the many challenges AIR researchers faced when building the AIR Multiple-Peril Crop Insurance (MPCI) Model for India, anticipated for release later this year.
While English is still an important language in India, there are 22 additional officially recognized languages. Hindi is noted as the official language of the country, but individual states have the option to make a regional language their official language. There are hundreds of living languages in the subcontinent, 30 of which had more than 1 million native speakers according to the 2001 census. As a result, the “official” languages at the state and central levels may not be Hindi, and official state reports, including those related to crops, are written in various regional languages.
Multiple Names for a Single Crop
We discussed in a previous blog post how AIR leveraged the National Informatics Center (NIC) and Village Dynamics in South Asia (VDSA) yield databases as two main sources of reported crop yield values by crop and district in India. While these two data sets include a wealth of information, their reports differ considerably in the naming conventions used for crops and districts.
For example, the crop commonly referred to as pearl millet was listed as Bajra in the NIC database, and the crop typically called sorghum was listed as Jowar. Both Bajra and Jowar are local names, but alternative English names can also be used by different agencies. In another case, VDSA uses the name chickpea, which is labeled as Gram in the NIC database and is also known as Bengal Gram.
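One simple way to reconcile names like these is a lookup table that maps every known alias to a single canonical name. The sketch below is only an illustration built from the examples in this post; the table `CROP_ALIASES`, the function `canonical_crop`, and the choice of the English name as the canonical form are assumptions, not a description of how the AIR model handles it.

```python
# Hypothetical alias table, limited to the crop names mentioned in this post.
# Keys are lowercase aliases; values are the (assumed) canonical English names.
CROP_ALIASES = {
    "bajra": "pearl millet",
    "pearl millet": "pearl millet",
    "jowar": "sorghum",
    "sorghum": "sorghum",
    "gram": "chickpea",
    "bengal gram": "chickpea",
    "chickpea": "chickpea",
}


def canonical_crop(name: str) -> str:
    """Return the canonical crop name for a reported name.

    Case and repeated whitespace are ignored. Unknown names raise KeyError
    so that they get reviewed by hand instead of being silently passed on.
    """
    key = " ".join(name.lower().split())
    if key not in CROP_ALIASES:
        raise KeyError(f"unmapped crop name: {name!r}")
    return CROP_ALIASES[key]


print(canonical_crop("Bajra"))        # pearl millet
print(canonical_crop("Bengal Gram"))  # chickpea
```

Raising an error on unmapped names is a deliberate choice here: a silent pass-through would let a new spelling create a duplicate crop in the merged data set.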
Varied Growing Seasons
Discrepancies in crop reports can be caused by the growing season as well. The two major crop seasons are Kharif, which takes place from May to October and is characterized by the monsoon season, and Rabi, which takes place from October to April. But some states choose to report their yields differently. For example, West Bengal, the top rice-producing state, categorizes its rice production under three seasons: Summer, Autumn, and Winter. In this case, Autumn and Winter refer to the Kharif season, while Summer indicates the Rabi season.
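State-specific season labels can be folded into the two major seasons with a per-state override table. This is a minimal sketch that encodes only the West Bengal rice example above; the names `STATE_SEASON_OVERRIDES` and `canonical_season` are hypothetical, and other states would need their own entries.

```python
# Season labels that are already in the two-season scheme.
STANDARD_SEASONS = {"kharif": "Kharif", "rabi": "Rabi"}

# Hypothetical per-state overrides. Only the West Bengal mapping described
# in the text is included.
STATE_SEASON_OVERRIDES = {
    "west bengal": {
        "summer": "Rabi",
        "autumn": "Kharif",
        "winter": "Kharif",
    },
}


def canonical_season(state: str, season: str) -> str:
    """Map a reported season label to Kharif or Rabi for the given state."""
    state_key = state.strip().lower()
    season_key = season.strip().lower()
    overrides = STATE_SEASON_OVERRIDES.get(state_key, {})
    if season_key in overrides:
        return overrides[season_key]
    if season_key in STANDARD_SEASONS:
        return STANDARD_SEASONS[season_key]
    raise KeyError(f"unmapped season {season!r} for state {state!r}")


print(canonical_season("West Bengal", "Autumn"))  # Kharif
print(canonical_season("West Bengal", "Summer"))  # Rabi
```

Keying the overrides by state matters because the same label, such as Summer, need not mean the same thing in every state's reports.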
Evolving District Names
District names have changed significantly over time, usually due to the splitting of districts with the geographic evolution of administrative boundaries. There were 593 districts in 2001; 640 in 2011; and 722 as of April 2019.
Maharajganj in Uttar Pradesh, for example, split from Gorakhpur to become a separate district in October 1989. VDSA only reports yield for Gorakhpur, while there are separate yield time series for each of these two districts in the NIC database. Chengalpattu district in Tamil Nadu split into the two separate districts of Tiruvallur and Kanchipuram in July 1997, and Kanchipuram was further subdivided to create a new, but much smaller, Chengalpattu district in July 2019. But VDSA only has data listed for the former Chengalpattu, while NIC has data listed for the 1997-2019 boundaries of Tiruvallur and Kanchipuram.
There are also cases in which district names have changed for other reasons. Kadapa in Andhra Pradesh, which was spelled Cuddapah by the British, was renamed as YSR District to honor former chief minister Y S Rajasekhara Reddy. All three versions of the district’s name are used by VDSA, while NIC uses only the Kadapa spelling to report the crop yield in that district.
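When one source reports on pre-split boundaries and the other on post-split boundaries, one plausible way to compare them is to roll the newer districts up to their parent and recompute yield as total production divided by total area. The sketch below assumes each record carries area and production (not just yield); the mapping `PARENT_DISTRICT`, the function `aggregate_to_parent`, and the numbers in the usage example are all made up for illustration and are not taken from NIC or VDSA.

```python
from collections import defaultdict

# Hypothetical mapping from (state, post-split district) to the pre-split
# parent district, using only the two splits described in the text.
PARENT_DISTRICT = {
    ("uttar pradesh", "gorakhpur"): "gorakhpur",
    ("uttar pradesh", "maharajganj"): "gorakhpur",
    ("tamil nadu", "tiruvallur"): "chengalpattu",
    ("tamil nadu", "kanchipuram"): "chengalpattu",
}


def aggregate_to_parent(records):
    """Compute yield on pre-split boundaries.

    records: iterable of (state, district, area, production) tuples.
    Returns {(state, parent_district): production / area}, where area and
    production are summed over all districts sharing the same parent.
    Districts without a mapping are treated as their own parent.
    """
    totals = defaultdict(lambda: [0.0, 0.0])  # [area, production]
    for state, district, area, production in records:
        key = (state.strip().lower(), district.strip().lower())
        parent = PARENT_DISTRICT.get(key, key[1])
        totals[(key[0], parent)][0] += area
        totals[(key[0], parent)][1] += production
    return {
        key: production / area
        for key, (area, production) in totals.items()
        if area > 0
    }


# Made-up placeholder values, only to show the call shape.
example = [
    ("Uttar Pradesh", "Gorakhpur", 100.0, 200.0),
    ("Uttar Pradesh", "Maharajganj", 50.0, 150.0),
]
print(aggregate_to_parent(example))
```

Summing area and production before dividing gives an area-weighted yield; averaging the two district yields directly would overweight the smaller district.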
Conversion to Latin Script
Lastly, discrepancies in district names can be caused by different spellings that result from slightly different ways of Romanizing local languages. For example, the Deogarh district in Odisha is sometimes spelled Debagarh, and the Kendujhar district can also be spelled Keonjhar. If a district name includes a cardinal direction, such as West Medinipur or East Medinipur in West Bengal, sometimes the directions will be translated. In the Bengali language, which is the official language of West Bengal, West is Paschim and East is Purba. So the official names of those districts are Paschim Medinipur and Purba Medinipur; however, they are labeled as West and East in some reports.
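Renamed districts, alternate Romanizations, and translated directions can all be handled by normalizing names per state before the data sets are joined. The following is a rough sketch covering only the examples in this post; `DISTRICT_ALIASES`, `DIRECTION_PREFIXES`, and `canonical_district` are hypothetical names, and which spelling counts as canonical (Kadapa, Deogarh, Kendujhar) is an arbitrary choice made here for illustration.

```python
# Hypothetical alias table keyed by (state, reported name), all lowercase.
DISTRICT_ALIASES = {
    ("odisha", "debagarh"): "deogarh",
    ("odisha", "keonjhar"): "kendujhar",
    ("andhra pradesh", "cuddapah"): "kadapa",
    ("andhra pradesh", "ysr"): "kadapa",
    ("andhra pradesh", "ysr district"): "kadapa",
}

# Translated direction prefixes, applied only in states where the official
# name uses the local-language word. Only West Bengal is covered here.
DIRECTION_PREFIXES = {
    "west bengal": {"west": "paschim", "east": "purba"},
}


def canonical_district(state: str, name: str) -> str:
    """Return a normalized, lowercase district name for joining data sets."""
    state_key = " ".join(state.lower().split())
    words = name.lower().split()
    prefixes = DIRECTION_PREFIXES.get(state_key, {})
    if words and words[0] in prefixes:
        words[0] = prefixes[words[0]]
    key = " ".join(words)
    return DISTRICT_ALIASES.get((state_key, key), key)


print(canonical_district("West Bengal", "West Medinipur"))  # paschim medinipur
print(canonical_district("Odisha", "Keonjhar"))             # kendujhar
print(canonical_district("Andhra Pradesh", "Cuddapah"))     # kadapa
```

The direction table is keyed by state on purpose: applying the Bengali translation everywhere would wrongly rename districts in other states whose official names contain the English words West or East.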
With so many discrepancies across crop and district names, AIR researchers combed through ample data to verify and remove duplicate or inconsistent district and crop names to ensure consistency across data sets and generate the most comprehensive district-level crop yield database for India.