In 2011, Congress created the CFPB to protect consumer interests in many financial markets. The CFPB receives and processes consumer complaints pertaining to various financial services, including credit cards, mortgages, bank accounts, student loans, consumer loans, credit reports, payday loans, and debt collection. The CFPB updates the raw consumer complaint data every night and makes it publicly available for download from its website. This database, believed to be the largest public collection of consumer financial complaints, includes basic information about each complaint, such as the submission date, the consumer’s zip code, the company, the product type, the relevant issue, the consumer narrative, and how the company has addressed the complaint. Making the data publicly available not only allows financial institutions and their consumers to view the overall quality of financial products and services, but also encourages public users such as analysts, data scientists, and academics to explore this information and build valuable insights on it. Nonetheless, very few published works document a formal analysis of the CFPB database (Ayres, I., Lingwall, J., & Steinway, S. 2016. Skeletons in the Database: An Early Analysis of the CFPB’s Consumer Complaints. Fordham J. Corp. & Fin. L., 19, 343; Littwin, A. K. 2015. Examination as a Method of Consumer Protection. Temple Law Review, 87(807)).

Although the aforementioned studies reach conclusions that should be of interest to the CFPB, financial companies, and consumers, they do not analyze the consumer complaint narratives, which would provide more value to the financial community. The narratives give context to the complaints, and more interesting insights can be built on that context. It can be extracted by a thorough analysis of the narrative data using text mining approaches. To that end, this post demonstrates the utility of text mining approaches in analyzing CFPB complaint narratives.

We propose an analytic framework for CFPB complaint narratives based on a probabilistic topic modeling approach known as latent Dirichlet allocation (LDA) (Blei, D. M., Ng, A. Y., & Jordan, M. I. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022). LDA is an automated approach that uses a hierarchical Bayesian analysis of the original texts to discover patterns of word use (i.e. topics) in the narratives and to connect narratives exhibiting similar patterns. The LDA approach enables us to automatically summarize unstructured consumer complaints as mixtures of topics, through which the underlying semantic structure is revealed. This task would not be practical with human annotation, as it requires reading a large volume of consumer narratives and identifying semantic similarities among them, a process both difficult and time-consuming.

The first step in the proposed framework is data preprocessing, which prepares the text data for LDA analysis. Typically, text data (especially in the realm of consumer narratives) is noisy and unstructured. Data preprocessing puts the data into a consistent format and removes non-critical words that do not contribute to the text analysis. Our data preprocessing includes the following five tasks: (1) convert the text to lowercase, (2) remove special characters and tokenize the text into terms, (3) remove stop words, (4) stem the terms, and (5) construct the term-document matrix. The output of preprocessing is the term-document matrix, which records the frequency of the critical terms (words) in the documents (consumer complaints).
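The five preprocessing tasks can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the stop-word list is a tiny hypothetical sample, `crude_stem` is a toy suffix stripper standing in for a real stemmer (such as Porter's), and the function names and example complaints are made up for this sketch rather than taken from our pipeline. The matrix is stored with one row per document.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real pipelines use a fuller list).
STOP_WORDS = {"a", "an", "and", "the", "i", "my", "me", "to", "of", "on", "in",
              "is", "was", "it", "that", "they", "for", "not", "do"}


def crude_stem(term):
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def preprocess(text):
    """Apply tasks (1) to (4) and return the list of stemmed terms."""
    text = text.lower()                                   # (1) lowercase
    text = re.sub(r"[^a-z\s]", " ", text)                 # (2) drop special characters
    tokens = text.split()                                 #     and tokenize on whitespace
    tokens = [t for t in tokens if t not in STOP_WORDS]   # (3) remove stop words
    return [crude_stem(t) for t in tokens]                # (4) stem


def term_document_matrix(docs):
    """(5) Return (vocab, matrix), where matrix[d][v] counts term v in document d."""
    tokenized = [preprocess(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    index = {t: i for i, t in enumerate(vocab)}
    matrix = []
    for doc in tokenized:
        row = [0] * len(vocab)
        for term, count in Counter(doc).items():
            row[index[term]] = count
        matrix.append(row)
    return vocab, matrix


# Made-up example narratives, for illustration only.
complaints = [
    "The collector keeps calling me about a debt I do not owe!",
    "My credit report shows a debt that was disputed and removed.",
]
vocab, tdm = term_document_matrix(complaints)
```

In practice, a library stemmer and a standard stop-word list would replace the toy versions here, but the shape of the output (a vocabulary plus per-document term counts) stays the same.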

The second step in the proposed framework is to run LDA on the term-document matrix to discover the patterns of word use in the documents, i.e. topic modeling. The topics extracted from the consumer narratives are presented in our Tableau viz, which can be reached at: !/vizhome/CFPBTopicModeling/CFPBTopicModeling


As each consumer complaint could cover multiple problems (e.g., credit reporting and harassment), LDA is an effective way to capture these issues by summarizing each complaint as a mixture of topics. In other words, for each consumer complaint, LDA assigns a mixture of topics along with their proportions, where the assigned topics represent the problems captured by LDA and the proportions indicate their relevance weights (between zero and one). This is useful because it helps organize and deliver the content of a large collection of narratives. Indeed, one of the primary benefits of LDA is that it helps overcome the shortcomings of the CFPB convention for labeling consumer complaints, which are discussed below.
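To make the idea of a per-complaint topic mixture concrete, here is a small, self-contained sketch of LDA fitted by collapsed Gibbs sampling. Note the assumptions: Gibbs sampling is used here only because it is compact to write down and is not necessarily the inference method behind our results, the hyperparameters `alpha` and `beta` are arbitrary illustrative values, and the three tokenized documents are made up.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenized docs.

    Returns (theta, phi, vocab): theta[d][k] is the proportion of topic k in
    document d, and phi[k][v] is the probability of vocab[v] under topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V, K = len(vocab), num_topics

    doc_topic = [[0] * K for _ in docs]        # tokens in doc d assigned to topic k
    topic_word = [[0] * V for _ in range(K)]   # times word v is assigned to topic k
    topic_total = [0] * K                      # tokens assigned to topic k overall
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(K)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample its topic from the collapsed conditional distribution.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][v] + beta) / (topic_total[j] + V * beta)
                    for j in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    theta = [[(doc_topic[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
             for d, doc in enumerate(docs)]
    phi = [[(topic_word[k][v] + beta) / (topic_total[k] + V * beta) for v in range(V)]
           for k in range(K)]
    return theta, phi, vocab


# Made-up tokenized complaints; the third one mixes both themes.
docs = [
    ["debt", "collector", "call", "harass", "debt"],
    ["credit", "report", "error", "credit", "score"],
    ["collector", "call", "credit", "report", "debt"],
]
theta, phi, vocab = lda_gibbs(docs, num_topics=2)
```

Each row of `theta` is the topic mixture for one complaint: its entries lie between zero and one and sum to one, which is exactly the "relevance weight" interpretation described above.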

One of the fields in CFPB datasets is “Issue”, which describes the consumer’s primary complaint. These issues are predetermined labels defined by CFPB practitioners and provided in a drop-down menu at the time of complaint submission. When a consumer submits a complaint through the CFPB website, she selects the issue using this drop-down menu. This CFPB convention appears to have two shortcomings. First, the consumer must select the label that best describes her issue from the drop-down menu. Due to a lack of knowledge or understanding of the labels, she might select a wrong label (mislabeling), or the issue she would like to complain about might not be described by any of the labels provided. In that case, she is forced to select a label that does not describe her issue well. Second, the complaint might be about multiple issues (labels), but under the CFPB labeling convention she can select only one label. These two shortcomings are addressed by our proposed analytic framework as follows:

(1) Topic modeling is carried out by LDA posterior inferences. LDA has no prior notion of the existence of the topics. Instead, LDA learns the topics and their assignments by analyzing the original texts, with no need for input from the consumers on the issue.

(2) A mixture of topics along with their relevance proportions is assigned to each complaint to account for the possibility that the consumer might be complaining about multiple problems.

Another benefit of LDA analysis on CFPB consumer complaints is that each complaint is now assigned a mixture of topics (as explained above), which facilitates the analysis of the topics with respect to other fields in the CFPB dataset. One of these fields is “Date Received”, which represents the date the CFPB received the consumer complaint. This field is particularly interesting because it makes it possible to describe the frequency/popularity of the topics over time and, accordingly, to analyze the responsiveness of the financial companies with respect to these topics. For example, if a topic trend is decreasing over time, it may indicate that CFPB regulations have taken the relevant topic into account and that companies have been improving their operations with respect to that topic.
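As a rough sketch of how topic proportions can be combined with “Date Received”, the snippet below averages each topic's proportion over the complaints received in each month. Treating "topic popularity" as this monthly average is an assumption made for illustration (other definitions, such as counting each complaint's dominant topic, are equally plausible), and the dates and proportions are made up.

```python
from datetime import date


def monthly_topic_popularity(received, theta):
    """Average each topic's proportion over the complaints received in each month.

    received: list of datetime.date values, one per complaint.
    theta: list of per-complaint topic proportions, aligned with `received`.
    Returns {(year, month): [mean proportion of topic 0, topic 1, ...]}.
    """
    totals = {}   # (year, month) -> running per-topic sums
    counts = {}   # (year, month) -> number of complaints
    for day, proportions in zip(received, theta):
        key = (day.year, day.month)
        if key not in totals:
            totals[key] = [0.0] * len(proportions)
            counts[key] = 0
        for k, p in enumerate(proportions):
            totals[key][k] += p
        counts[key] += 1
    return {key: [s / counts[key] for s in totals[key]] for key in sorted(totals)}


# Made-up example: three complaints, two topics.
received = [date(2016, 4, 3), date(2016, 4, 20), date(2016, 5, 11)]
theta = [[0.8, 0.2], [0.4, 0.6], [0.1, 0.9]]
popularity = monthly_topic_popularity(received, theta)
```

Plotting one topic's value across the sorted months gives a time trend of the kind shown in the viz.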

The topic popularity analysis was carried out and the results are presented in our Tableau viz. The time trends of topic popularity can be increasing, decreasing, or highly variable. For example, the topics “Rewards and Promotions”, “Account Management”, “Credit Reporting”, “CFPB”, “Credit Score” and “Dispute” generally have increasing trends. By contrast, decreasing trends are evident for “Harassment”, “Mortgage/Loan Modification and Foreclosure” and “Loan/Student Loan” (please see the figure below).


“Credit Reporting”, “Harassment”, “Account Management”, and “Communication” have been the most popular topics, in that order. This indicates that consumer complaints have mainly been associated with these topics over time. Hence, the CFPB might consider devoting more effort to these topics, which account for the highest proportion of consumer complaints.

Furthermore, some topics show a sudden increase, decrease, or spike in their time trends. For example, a sharp spike occurred in October 2015 in both “Unauthorized Inquiries” and “Fund and Deposit” (about a 100% increase in computed topic popularity). The time trend of “Rewards and Promotions” shows two consecutive sharp increases: about 100% in May 2016, followed by a much larger increase of about 400% in June 2016. The topic popularity computed for “Mortgage/Loan Modification and Foreclosure” shows a sharp decrease in July 2016 (about a 100% decrease). These sudden changes might be attributable to special events, policies, or new regulations, and a more thorough study is required to reveal their main causes. Some of them are discussed below.
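The percentage figures quoted above are month-over-month changes in topic popularity. A minimal sketch of that arithmetic, and of flagging changes above a threshold, is shown below. The 100% threshold, the function names, and the example series are illustrative assumptions and not values from our dataset.

```python
def percent_changes(series):
    """Month-over-month change in percent; None where the previous value is zero."""
    changes = []
    for prev, cur in zip(series, series[1:]):
        changes.append(None if prev == 0 else 100.0 * (cur - prev) / prev)
    return changes


def flag_spikes(months, series, threshold=100.0):
    """Return (month, change) pairs whose absolute percent change meets the threshold."""
    return [
        (month, change)
        for month, change in zip(months[1:], percent_changes(series))
        if change is not None and abs(change) >= threshold
    ]


# Made-up popularity series (in percent of complaints) for one topic.
months = ["2016-03", "2016-04", "2016-05", "2016-06"]
series = [2.0, 2.0, 4.0, 20.0]
spikes = flag_spikes(months, series)
# spikes == [("2016-05", 100.0), ("2016-06", 400.0)]
```

A change from 2.0 to 4.0 is a 100% increase, and a change from 4.0 to 20.0 is a 400% increase, so both months are flagged.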

The time trend of topic popularity can be used as a quantitative metric to evaluate the quality and effectiveness of CFPB regulations. It can also inform the decision-making process of financial institutions as they look to improve their consumer-oriented cultures. “Harassment”, “Loan/Student Loan”, and “Mortgage/Loan Modification and Foreclosure” generally have decreasing trends (especially within the last 4 months). For instance, “Harassment” essentially refers to consumer complaints about debt collectors’ actions and communication tactics; consumers most commonly complain that the collectors harass them and request amounts that they do not owe. To protect consumers from harassment, federal and state officials made debt collection a top priority, and since October 2012 the CFPB has been using its enforcement authority to routinely supervise debt collectors. The Bureau’s routine supervisory authority over debt collectors might be a reason for the decreasing trend in “Harassment” topic popularity over time. Another example is the decreasing trend of the “Mortgage/Loan Modification and Foreclosure” topic, which can be tied to the CFPB’s introduction of new mortgage-servicing rules in 2014. These rules were issued to expand foreclosure protections for struggling borrowers and homeowners, and they have been consistently updated since then (cfpb servicing-rules summary.pdf).

In contrast to a decreasing trend, an increasing trend or high variability in topic popularity could indicate that CFPB regulations have not been adequately effective in addressing consumer complaints with respect to those topics. For example, the CFPB published an interim final rule to implement Regulation V (Fair Credit Reporting) in November 2012 to enhance the standards for communication and use of information bearing on a consumer’s creditworthiness and credit standing. Since then, the “Credit Reporting” trend has generally increased, becoming one of the top three popular topics over the past few years. “Account Management” is another topic with an increasing trend, showing a dramatic increase in May 2016. Similarly, the “Rewards and Promotions” topic increased significantly in May and June 2016 and may warrant additional CFPB investigation.

Although the effectiveness of CFPB regulations could be the main reason for the overall trend of any topic’s popularity, the financial institutions’ roles should not be ignored. After all, they employ CFPB regulations as part of their organizational decision-making processes. Hence, the level of commitment to and compliance with these regulations could be another reason for decreasing or increasing trends of topic popularity over time. One of the fields in CFPB datasets is “Company,” which indicates the financial institution to which that complaint relates. In the future, it might be interesting to monitor the topic popularity for each financial institution. This would provide noteworthy insights and enable us to investigate how individual institutions comply with CFPB regulations according to the topic popularity computed using their consumer complaints.

As an early step toward that future research, we refer readers to the topic popularity of “Fund and Deposit”, which can be seen in the graphic below. Our analysis reveals a sharp spike in October 2015, which is primarily associated with Empowerment Ventures, LLC (RushCard). In October 2015, thousands of RushCard customers experienced problems as a result of a software conversion. Consumers were frustrated because they were not able to access their cash, check their balances, or deposit any money into their accounts. As a result, they began submitting complaints to the CFPB. An analysis of the complaint narratives similar to the one discussed in this post might have detected the problem more quickly.
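One simple way to trace a spike back to individual institutions is to sum the spiking topic's proportions per company for the month in question and rank the companies by that total. The sketch below assumes each complaint is a record with hypothetical `date`, `company`, and `theta` keys; the company names and numbers are placeholders, not CFPB data.

```python
from collections import defaultdict
from datetime import date


def company_contributions(records, topic, year, month):
    """Rank companies by their summed proportion of one topic in a given month."""
    totals = defaultdict(float)
    for rec in records:
        if rec["date"].year == year and rec["date"].month == month:
            totals[rec["company"]] += rec["theta"][topic]
    # Largest contributor first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Placeholder records: topic 0 stands for the spiking topic.
records = [
    {"date": date(2015, 10, 14), "company": "Company A", "theta": [0.9, 0.1]},
    {"date": date(2015, 10, 15), "company": "Company A", "theta": [0.8, 0.2]},
    {"date": date(2015, 10, 20), "company": "Company B", "theta": [0.3, 0.7]},
    {"date": date(2015, 9, 2), "company": "Company B", "theta": [0.9, 0.1]},
]
ranking = company_contributions(records, topic=0, year=2015, month=10)
top_company = ranking[0][0]
```

Here `top_company` is "Company A", since its October complaints carry most of the weight on topic 0; the September record is ignored because it falls outside the month being examined.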


The proposed framework can be useful in monitoring consumer complaint narratives for emerging topics. It can be operationalized easily (in a few steps: data preprocessing, LDA-based topic modeling, and topic popularity analysis) and run monthly (or even weekly) to automate the prediction of topic assignments for future complaint narratives. Following the procedures explained in this analysis, a monthly (or weekly) topic popularity analysis of new complaints can reveal emerging topics, as “Rewards and Promotions” and “Account Management” were revealed here, based on sharp increases in their most recent time trends. The proposed text mining framework is not intended to replace human analysts. Instead, it is designed as a complementary analytic tool that helps existing CFPB analysts investigate consumer complaints more efficiently and effectively, eventually improving consumer protection from unfair, deceptive or abusive practices in the financial markets.
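Predicting topic assignments for a new narrative does not require refitting the model: the topic-word distributions learned earlier can be held fixed while only the new complaint's topic mixture is estimated. The sketch below does this with a simplified EM-style "fold-in" update, which is one of several possible inference schemes and is an assumption of this example, as are the toy vocabulary, the `phi` values, and the `alpha` smoothing parameter.

```python
def infer_topics(tokens, phi, word_id, alpha=0.1, iterations=50):
    """Estimate a new document's topic mixture with the topic-word matrix phi held fixed.

    tokens: preprocessed terms of the new complaint.
    phi: phi[k][v] is the probability of word v under topic k (all entries > 0).
    word_id: maps each known term to its column in phi; unseen terms are skipped.
    """
    K = len(phi)
    ids = [word_id[t] for t in tokens if t in word_id]
    theta = [1.0 / K] * K                       # start from a uniform mixture
    for _ in range(iterations):
        expected = [0.0] * K                    # expected token count per topic
        for v in ids:
            weights = [theta[k] * phi[k][v] for k in range(K)]
            total = sum(weights)
            for k in range(K):
                expected[k] += weights[k] / total
        # Smoothed re-estimate of the mixture; the entries sum to one.
        theta = [(expected[k] + alpha) / (len(ids) + K * alpha) for k in range(K)]
    return theta


# Toy model: topic 0 is about debt collection calls, topic 1 about credit reports.
vocab = ["call", "collector", "credit", "report"]
word_id = {w: i for i, w in enumerate(vocab)}
phi = [
    [0.45, 0.45, 0.05, 0.05],
    [0.05, 0.05, 0.45, 0.45],
]
new_theta = infer_topics(["collector", "call", "call", "unknownword"], phi, word_id)
```

For this made-up complaint, most of the weight lands on topic 0. Feeding such mixtures into the monthly popularity analysis is what would allow the framework to run on a schedule.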