and huge potential of our proposed data generation method
in enhancing chart comprehension.
1. Introduction
In the past year, the field of artificial intelligence has undergone remarkable advancements. A key highlight is the emergence of large language models (LLMs) like GPT-4 [23]. These models [3, 24, 29–31, 35] have demonstrated a strong capability to comprehend and generate intricate textual data, opening doors to myriad applications in both academia and industry. Taking this progress a step further, the introduction of GPT-4V [33] marked another milestone: it endows LLMs with the ability to interpret visual information, essentially giving them vision. As a result, they can now extract and analyze data from images, marking a significant evolution in the capabilities of these models.
However, despite the achievements and potential of models like GPT-4V, the details of GPT-4V's architecture remain undisclosed. This opacity has raised questions within the academic community about the best practices for designing multi-modal LLMs. Notably, pioneering research initiatives such as LLaVA [17, 18] and MiniGPT [4, 40] provide insightful directions in this regard. Their findings suggest that by incorporating visual encoders into existing LLMs and then fine-tuning them on multi-modal instruction-tuning datasets, LLMs can be effectively transformed into multi-modal LLMs. These multi-modal datasets are typically derived from established benchmarks, which makes them a cost-effective way to accumulate the data required for instruction tuning.
Datasets grounded in established benchmarks, such as COCO [13], have significantly enhanced the ability of multi-modal LLMs to interpret everyday photographs. However, when confronted with specialized visual representations, such as charts, these models reveal a noticeable limitation [16, 33]. Charts are important visual instruments that translate complex data sets into digestible visual narratives, playing a crucial role in facilitating understanding, shaping insights, and efficiently conveying information. Their pervasive presence, from academic publications to corporate presentations, underscores the importance of enhancing the capability of multi-modal LLMs to interpret charts. However, collecting instruction-tuning data for chart understanding presents several challenges, which typically stem from two areas: understanding and generation. An effective chart understanding model should be capable of extracting and summarizing data from various types of charts and making predictions based on this information.
However, most existing datasets [8, 20–22] only support simple question answering or captioning, primarily because they lack detailed chart information and annotations that capture a high-level understanding of the raw data. Their heavy reliance on manually annotated charts gathered by web crawlers further limits their quality, so previous annotation methods could only yield chart datasets with lower quality and less comprehensive annotations. Compared with chart understanding, generating chart figures is an even more challenging task, because existing deep-learning-based generation methods [26, 27] struggle to accurately create images from instructions. Generating charts with Python code appears promising, but it requires corresponding code annotations to supervise the model. Most charts obtained from the web lack detailed annotations, which makes it difficult to annotate the code that produced them; without such code annotations, models cannot be supervised for code generation. Together, these issues impede a model's ability to jointly learn chart understanding and chart generation.
To address this, we introduce an adaptive and innovative data collection approach tailored specifically to chart understanding and generation. At the heart of our methodology is the strategic use of GPT-4's strong linguistic and coding capabilities, which facilitate the creation of rich multi-modal datasets. This integration not only improves data accuracy but also ensures wide-ranging diversity. Specifically, our method comprises three main phases:
1) Chart Data Generation. Our strategy for data collection stands out for its flexibility. Rather than restricting ourselves to conventional sources such as the web or existing datasets, we harness GPT-4 to synthesize data. By specifying characteristics such as topics, distributions, and trends, we guide GPT-4 to produce data that is both diverse and precise.
2) Chart Figure Generation. Subsequently, GPT-4's coding skills are used to script chart plots with open-source libraries such as Matplotlib, given the data and the relevant function documentation. The result is a collection of meticulously rendered charts spanning various forms, each accurately representing its underlying data.
3) Instruction Data Generation. Beyond chart rendering, GPT-4 is further employed to interpret and narrate chart content, ensuring a holistic understanding. It is also prompted to construct relevant question-answer pairs grounded in the charts. This results in a comprehensive instruction-tuning corpus combining the narrative texts, question-answer pairs, and the source or modified code of each chart; a minimal sketch of the full pipeline is given below.
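To make the pipeline concrete, the following minimal Python sketch walks through the three phases with hard-coded stand-ins for GPT-4's outputs. The variable names, file names, and record format shown here are illustrative assumptions rather than the exact format of our released dataset; in the actual pipeline, the chart data, the plotting script, and the annotations are all produced by prompting GPT-4.

# A minimal, self-contained sketch of the three phases. All names are
# illustrative; in the real pipeline GPT-4 is prompted to produce the data,
# the plotting script, and the annotations instead of hard-coding them.
import json

# Phase 1 -- Chart Data Generation: data GPT-4 might synthesize when given
# a topic ("quarterly revenue"), a value range, and an upward trend.
chart_data = {
    "title": "Quarterly Revenue of a Retail Store",
    "x": ["Q1", "Q2", "Q3", "Q4"],
    "y": [12.4, 15.1, 17.8, 21.3],  # thousand USD
}

# Phase 2 -- Chart Figure Generation: a Matplotlib script of the kind GPT-4
# is prompted to write, given the data and the library documentation.
plot_code = """
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.bar({x!r}, {y!r}, color="steelblue")
ax.set_title({title!r})
ax.set_ylabel("Revenue (thousand USD)")
fig.savefig("chart.png", dpi=150)
""".format(**chart_data)
exec(plot_code)  # executing the generated script renders chart.png

# Phase 3 -- Instruction Data Generation: one training record combining a
# narrative summary, question-answer pairs, and the chart's source code.
record = {
    "image": "chart.png",
    "summary": "Revenue grows steadily each quarter, from 12.4k to 21.3k USD.",
    "qa_pairs": [
        {"question": "Which quarter has the highest revenue?", "answer": "Q4"},
        {"question": "What is the overall trend?", "answer": "Increasing."},
    ],
    "code": plot_code,
}
with open("instruction_sample.json", "w") as f:
    json.dump(record, f, indent=2)

Because the figure is rendered by executing the generated script, the chart is guaranteed to match its underlying data, and the script itself can be stored as a code annotation for supervising chart generation.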
A standout feature of our methodology is its flexibility, which reduces the potential for bias while offering scalability. Building on this methodology, we construct a benchmark dataset, which we make publicly available. This dataset is distinguished not only by its high quality but also by its diversity.