Meet DeepSeek-Coder-V2 by DeepSeek AI: The First Open-Source AI Model to Surpass GPT4-Turbo in Coding and Math, Supporting 338 Languages and 128K Context Length

Code intelligence focuses on creating advanced models capable of understanding and generating programming code. This interdisciplinary area leverages natural language processing and software engineering to enhance programming efficiency and accuracy. Researchers have developed models to interpret code, generate new code snippets, and debug existing code. These advancements reduce the manual effort required in coding tasks, making the development process faster and more reliable. Code intelligence models have been progressively improving, showing promise in various applications, from software development to education and beyond.

A significant challenge in code intelligence is the performance disparity between open-source code models and cutting-edge closed-source models. Despite the open-source community’s considerable efforts, open-source models still lag behind their closed-source counterparts in specific coding and mathematical reasoning tasks. This gap poses a barrier to the widespread adoption of open-source solutions in professional and educational settings. More powerful and accurate open-source models are crucial to democratizing access to advanced coding tools and fostering innovation in software development.

Existing methods in code intelligence include notable open-source models like StarCoder, CodeLlama, and the original DeepSeek-Coder. These models have shown steady improvement thanks to the contributions of the open-source community. However, they still trail leading closed-source models such as GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro. These closed-source models benefit from extensive proprietary datasets and significant computational resources, enabling them to perform exceptionally well in coding and mathematical reasoning tasks. Given this gap, the need for competitive open-source alternatives remains.

Researchers from DeepSeek AI introduced DeepSeek-Coder-V2, a new open-source code language model. Built upon the foundation of DeepSeek-V2, this model undergoes further pre-training with an additional 6 trillion tokens, enhancing its code and mathematical reasoning capabilities. DeepSeek-Coder-V2 aims to bridge the performance gap with closed-source models, offering an open-source alternative that delivers competitive results in various benchmarks.

DeepSeek-Coder-V2 employs a Mixture-of-Experts (MoE) framework, supporting 338 programming languages and extending the context from 16K to 128K tokens. The model is available in two sizes, with 16 billion and 236 billion parameters, designed to use computational resources efficiently while achieving strong performance in code-specific tasks. The training data for DeepSeek-Coder-V2 consists of 60% source code, 10% math corpus, and 30% natural language corpus, sourced from GitHub and CommonCrawl. This broad dataset is intended to make the model robust and versatile across diverse coding scenarios.
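
To make the efficiency argument behind MoE concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is not DeepSeek's implementation: the scalar input, the four toy experts, the linear gate, and the names `moe_forward` and `gate_weights` are all assumptions made for illustration. It only shows the general idea that a gate picks a few experts per input, so most parameters stay idle on any given call.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_weights, experts, top_k=2):
    """Route scalar input x to the top_k experts and mix their outputs."""
    # One gate logit per expert (a toy linear gate, assumed for illustration).
    probs = softmax([w * x for w in gate_weights])
    # Keep only the top_k experts by gate probability.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    # Only the chosen experts are evaluated; the others are never called.
    output = sum(probs[i] / norm * experts[i](x) for i in chosen)
    return output, chosen


# Four toy "experts"; a real model would use feed-forward networks here.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
gate_weights = [0.1, 0.5, 1.0, -1.0]

y, used = moe_forward(3.0, gate_weights, experts, top_k=2)
print(used, y)
```

With `top_k=2`, only two of the four experts run for this input, which is the sense in which an MoE model can hold many parameters while spending compute on a fraction of them.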

The DeepSeek-Coder-V2 model comes in four distinct variants, each tailored for specific use cases and performance needs:

DeepSeek-Coder-V2-Instruct: Designed for advanced text generation tasks, this variant is optimized for instruction-based coding scenarios, providing robust capabilities for complex code generation and understanding.

DeepSeek-Coder-V2-Base: This variant offers a solid foundation for general text generation, suitable for a wide range of applications, and serves as the core model upon which other variants are built.

DeepSeek-Coder-V2-Lite-Base: This lightweight version of the base model focuses on efficiency, making it ideal for environments with limited computational resources while still delivering strong performance in text generation tasks.

DeepSeek-Coder-V2-Lite-Instruct: Combining the efficiency of the Lite series with the instruction-optimized capabilities, this variant excels in instruction-based tasks, providing a balanced solution for efficient yet powerful code generation and text understanding.

DeepSeek-Coder-V2 outperformed leading closed-source models in coding and math tasks in benchmark evaluations. The model achieved a 90.2% score on the HumanEval benchmark, a notable improvement over its predecessors. Additionally, it scored 75.7% on the MATH benchmark, demonstrating its enhanced mathematical reasoning capabilities. Compared to previous versions, DeepSeek-Coder-V2 showed significant advancements in accuracy and performance, making it a formidable competitor in code intelligence. The model’s ability to handle complex and extensive coding tasks marks an important milestone in developing open-source code models.

This research highlights DeepSeek-Coder-V2’s notable improvements in code intelligence, addressing existing gaps in the field. The model’s superior performance in coding and mathematical tasks positions it as a formidable open-source alternative to state-of-the-art closed-source models. With its expanded support for 338 programming languages and the ability to handle context lengths up to 128K tokens, DeepSeek-Coder-V2 marks a significant step forward in code model development. These advancements enhance the model’s capabilities and democratize access to powerful coding tools, fostering innovation and collaboration in software development.

In conclusion, the introduction of DeepSeek-Coder-V2 represents a significant advancement in code intelligence. By addressing the performance disparity between open-source and closed-source models, this research provides a powerful and accessible tool for coding and mathematical reasoning. The model’s architecture, extensive training dataset, and strong benchmark performance highlight its potential to reshape the landscape of code intelligence. As an open-source alternative, DeepSeek-Coder-V2 enhances coding efficiency and promotes innovation and collaboration within the software development community. This research underscores the importance of continued efforts to improve open-source models, ensuring that advanced coding tools are available to all.


Advances in Bayesian Deep Neural Network Ensembles and Active Learning for Preference Modeling

Machine learning has seen significant advancements in integrating Bayesian approaches and active learning methods. Two notable research papers contribute to this development: “Bayesian vs. PAC-Bayesian Deep Neural Network Ensembles” by University of Copenhagen researchers and “Deep Bayesian Active Learning for Preference Modeling in Large Language Models” by University of Oxford researchers. Let’s synthesize the findings and implications of these works, highlighting their contributions to ensemble learning and active learning for preference modeling.

Bayesian vs. PAC-Bayesian Deep Neural Network Ensembles

University of Copenhagen researchers explore the efficacy of different ensemble methods for deep neural networks, focusing on Bayesian and PAC-Bayesian approaches. Their research addresses the epistemic uncertainty in neural networks by comparing traditional Bayesian neural networks (BNNs) and PAC-Bayesian frameworks, which provide alternative strategies for model weighting and ensemble construction.

Bayesian neural networks aim to quantify uncertainty by learning a posterior distribution over model parameters. This creates a Bayes ensemble, where networks are sampled and weighted according to this posterior. However, the authors argue that this method fails to effectively leverage the cancellation-of-errors effect because it lacks support for error correction among ensemble members. This limitation is highlighted through the Bernstein-von Mises theorem, which indicates that Bayes ensembles converge towards the maximum likelihood estimate rather than exploiting ensemble diversity.

In contrast, the PAC-Bayesian framework optimizes model weights using a PAC-generalization bound, which considers correlations between models. This approach increases the robustness of the ensemble, allowing it to include multiple models from the same learning process without relying on early stopping for weight selection. The study presents empirical results on four classification datasets, demonstrating that PAC-Bayesian weighted ensembles outperform traditional Bayes ensembles, achieving better generalization and predictive performance.
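
As a small illustration of why member weights matter in an ensemble, the sketch below compares a uniform vote with a weighted vote. It does not implement the PAC-Bayesian bound optimization described in the paper: the weights in `learned` are made-up numbers, and `weighted_vote` is a hypothetical helper written for this example.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight across ensemble members."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Three ensemble members; the last two tend to agree (and may share errors).
member_predictions = ["cat", "dog", "dog"]

uniform = [1 / 3, 1 / 3, 1 / 3]
learned = [0.6, 0.2, 0.2]  # hypothetical weights, not derived from any bound

uniform_label = weighted_vote(member_predictions, uniform)
weighted_label = weighted_vote(member_predictions, learned)
print(uniform_label, weighted_label)
```

When two members make correlated predictions, a weighting that accounts for that correlation can overrule them, which a uniform vote cannot do.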

Deep Bayesian Active Learning for Preference Modeling

University of Oxford researchers focus on improving the efficiency of data selection and labeling in preference modeling for large language models (LLMs). They introduce the Bayesian Active Learner for Preference Modeling (BAL-PM). This novel stochastic acquisition policy combines Bayesian active learning with entropy maximization to select the most informative data points for human feedback.

Because of naive epistemic uncertainty estimation, traditional active learning methods often suffer from redundant sample acquisition. BAL-PM addresses this issue by targeting points of high epistemic uncertainty and maximizing the entropy of the acquired prompt distribution in the LLM’s feature space. This approach reduces the number of required preference labels by 33% to 68% in two popular human preference datasets, outperforming previous stochastic Bayesian acquisition policies.

The method leverages task-agnostic uncertainty estimation, encouraging diversity in the acquired training set and preventing redundant exploration. Experiments on Reddit TL;DR and CNN/DM datasets validate BAL-PM’s effectiveness, showing substantial reductions in the data required for training. The method scales well with larger LLMs, maintaining efficiency across different model sizes.
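
The trade-off between uncertainty and diversity can be sketched with a toy scoring rule. This is a simplified stand-in rather than BAL-PM itself: the log-distance bonus is only a crude proxy for an entropy term, and `acquisition_score`, `beta`, the two-dimensional features, and the candidate values are all assumptions for illustration.

```python
import math


def acquisition_score(uncertainty, features, acquired_features, beta=1.0):
    """Uncertainty plus a diversity bonus for being far from acquired prompts."""
    if not acquired_features:
        return uncertainty
    nearest = min(math.dist(features, a) for a in acquired_features)
    # Log-distance to the nearest acquired point: a crude proxy for entropy gain.
    return uncertainty + beta * math.log(nearest + 1e-8)


# Feature vectors of prompts that were already labeled (made-up values).
acquired = [(0.0, 0.0)]

candidates = {
    "near_duplicate": {"uncertainty": 0.9, "features": (0.1, 0.0)},
    "novel_prompt": {"uncertainty": 0.7, "features": (3.0, 4.0)},
}

scores = {
    name: acquisition_score(c["uncertainty"], c["features"], acquired)
    for name, c in candidates.items()
}
best = max(scores, key=scores.get)
print(best, scores)
```

Ranking by uncertainty alone would pick the near-duplicate prompt; adding the diversity term shifts the choice to the prompt in an unexplored region of feature space.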

Synthesis and Implications

Both studies underscore the importance of optimizing ensemble methods and active learning strategies to enhance model performance and efficiency. University of Copenhagen researchers’ work on PAC-Bayesian ensembles highlights the potential of leveraging model correlations and generalization bounds to create more robust ensembles. This approach addresses the limitations of traditional Bayesian methods, providing a pathway to more effective ensemble learning.

The University of Oxford researchers’ BAL-PM demonstrates the practical application of Bayesian active learning in LLM preference modeling. By combining epistemic uncertainty with entropy maximization, BAL-PM significantly improves data acquisition efficiency, which is critical for the scalability of LLMs in real-world applications. The method’s ability to maintain performance across different model sizes further emphasizes its versatility and robustness.

These advancements collectively push the boundaries of machine learning, offering innovative solutions to longstanding challenges in model uncertainty and data efficiency. Integrating PAC-Bayesian principles and advanced active learning techniques sets the stage for further research and application in diverse domains, from NLP to predictive analytics.

In conclusion, these research contributions provide valuable insights into optimizing neural network ensembles and active learning methodologies. Their findings pave the way for more efficient and accurate machine learning models, ultimately enhancing AI systems’ capability to learn from and adapt to complex, real-world data.


Central Florida schools looking to restrict artificial intelligence use by students – WKMG News 6 & ClickOrlando

SEMINOLE COUNTY, Fla. – The School Board in Seminole County is set to sign off on the Code of Conduct and Honor Code for the 2024-2025 school year, and it includes consequences for students who submit Artificial Intelligence-generated work without credit or consent. Submitting assignments or completing

VCU launches new minor in artificial intelligence – WWBT

RICHMOND, Va. (WWBT) - A new and timely education track is being offered at Virginia Commonwealth University. VCU is offering a minor in artificial intelligence and mixed and immersive reality studies. It will cover topics like AI in mass media, the business applications of AI, machine learning and

Not Loving It: McDonald’s Giving Up on Artificial Intelligence at Drive-Thru – The New York Sun

McDonald’s has announced it will discontinue its artificial intelligence drive-thru order-taking program, ceasing a project that had been in testing for more than three years. The AI-powered initiative, which began in 2021, was being tested at more than 100 McDonald’s locations. The program aimed to streamline the

Improving air quality with generative AI

As of this writing, Ghana ranks as the 27th most polluted country in the world, facing significant challenges due to air pollution. Recognizing the crucial role of air quality monitoring, many African countries, including Ghana, are adopting low-cost air quality sensors.
The Sensor Evaluation and Training Centre for West Africa (Afri-SET) aims to use technology to address these challenges. Afri-SET engages with air quality sensor manufacturers, providing crucial evaluations tailored to the African context. Through evaluations of sensors and informed decision-making support, Afri-SET empowers governments and civil society for effective air quality management.
From December 6 to 8, 2023, the non-profit organization Tech to the Rescue, in collaboration with AWS, organized the world’s largest Air Quality Hackathon, aimed at tackling one of the world’s most pressing health and environmental challenges: air pollution. More than 170 tech teams used the latest cloud, machine learning, and artificial intelligence technologies to build 33 solutions. The solution described in this post addresses Afri-SET’s challenge and was ranked among the top three winning solutions.

This post presents a solution that uses generative artificial intelligence (AI) to standardize air quality data from low-cost sensors in Africa, addressing the data integration problem these sensors pose. The solution harnesses generative AI, specifically large language models (LLMs), to handle diverse sensor data and automatically generate Python functions for the various data formats. The fundamental objective is to build a manufacturer-agnostic database, using generative AI’s ability to standardize sensor outputs, synchronize data, and facilitate precise corrections.
Current challenges
Afri-SET currently merges data from numerous sources, employing a bespoke approach for each of the sensor manufacturers. This manual synchronization process, hindered by disparate data formats, is resource-intensive, limiting the potential for widespread data orchestration. The platform, although functional, deals with CSV and JSON files containing hundreds of thousands of rows from various manufacturers, demanding substantial effort for data ingestion.
The objective is to automate data integration from various sensor manufacturers for Accra, Ghana, paving the way for scalability across West Africa. Despite the challenges, Afri-SET, with limited resources, envisions a comprehensive data management solution for stakeholders seeking sensor hosting on its platform, aiming to deliver accurate data from low-cost sensors. The effort is hampered by the current focus on data cleaning, which diverts valuable skills away from building ML models for sensor calibration. Additionally, Afri-SET aims to report corrected data from low-cost sensors, which requires information beyond specific pollutants.
The solution had the following requirements:

Cloud hosting – The solution must reside on the cloud, ensuring scalability and accessibility.
Automated data ingestion – An automated system is essential for recognizing and synchronizing new (unseen), diverse data formats with minimal human intervention.
Format flexibility – The solution should accommodate both CSV and JSON inputs and be flexible on the formatting (any reasonable column names, units of measure, any nested structure, or malformed CSV such as missing columns or extra columns).
Golden copy preservation – Retaining an untouched copy of the data is imperative for reference and validation purposes.
Cost-effective – To be as cost-effective as possible, the solution should invoke the LLM only to generate reusable code on an as-needed basis, instead of having it manipulate the data directly.

The goal was to build a one-click solution that takes different data structures and formats (CSV and JSON) and automatically converts them to be integrated into a database with unified headers, as shown in the following figure. This allows for data to be aggregated for further manufacturer-agnostic analysis.

Figure 1: Convert data with different data formats into a desired data format with unified headers

Overview of solution
The proposed solution uses Anthropic’s Claude 2.1 foundation model through Amazon Bedrock to generate Python code that converts input data into a unified data format. LLMs excel at writing code and reasoning over text, but tend not to perform as well when interacting directly with time-series data. In this solution, we use the reasoning and coding abilities of LLMs to create reusable Extract, Transform, Load (ETL) code, which transforms sensor data files that do not conform to a universal standard so that they can be stored together for downstream calibration and analysis. Additionally, we take advantage of the reasoning capabilities of LLMs to understand what the labels mean in the context of air quality sensors, such as particulate matter (PM), relative humidity, and temperature.
The following diagram shows the conceptual architecture:

Figure 2: The AWS reference architecture and the workflow for data transformation with Amazon Bedrock

Solution walkthrough
The solution reads raw data files (CSV and JSON files) from Amazon Simple Storage Service (Amazon S3) (Step 1) and checks whether it has seen the device type (or data format) before. If so, the solution retrieves and executes the previously generated Python code (Step 2), and the transformed data is stored in Amazon S3 (Step 10). The solution invokes the LLM only for a new device data file type, for which code has not yet been generated. This optimizes performance and minimizes the cost of LLM invocation. If Python code is not available for a given device’s data, the solution notifies the operator to check the new data format (Step 3 and Step 4). The operator then checks the new data format and validates whether it comes from a new manufacturer (Step 5). Next, the solution checks whether the file is CSV or JSON. If it is a CSV file, the data can be converted directly to a Pandas data frame by a Python function without LLM invocation. If it is a JSON file, the LLM is invoked to generate a Python function that creates a Pandas data frame from the JSON payload, taking into account its schema and how deeply it is nested (Step 6).
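
The "generate once, reuse afterwards" logic described above can be sketched as a small lookup keyed on the data format. This is only an illustration under stated assumptions: the post does not say how a format is recognized, so fingerprinting by sorted field names is a guess, and `format_signature`, `get_transform`, and `fake_generate_code` are hypothetical names. The operator notification steps are left out, and the LLM call is replaced by a stand-in function.

```python
import hashlib


def format_signature(column_names):
    """Fingerprint a data format by its sorted field names (an assumed heuristic)."""
    joined = ",".join(sorted(column_names))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def get_transform(column_names, code_store, generate_code):
    """Return stored code for a known format, or generate and store it once."""
    key = format_signature(column_names)
    if key not in code_store:
        # New format: this is the only path that calls the costly generator.
        code_store[key] = generate_code(column_names)
    return code_store[key]


llm_calls = []


def fake_generate_code(column_names):
    """Stand-in for the LLM invocation; records the call and returns a label."""
    llm_calls.append(tuple(column_names))
    return f"transform for {sorted(column_names)}"


store = {}
first = get_transform(["PM1", "temp", "timestamp"], store, fake_generate_code)
second = get_transform(["timestamp", "PM1", "temp"], store, fake_generate_code)
print(len(llm_calls), first == second)
```

The second file has the same fields in a different order, so it reuses the stored code and the generator is called only once.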
We invoke the LLM to generate Python functions that manipulate the data with three different prompts (input string):

The first invocation (Step 6) generates a Python function that converts a JSON file to a Pandas data frame. JSON files from manufacturers have different schemas. Some input data uses a pair of value type and value for each measurement. This format results in data frames containing one column of value type and one column of value. Such columns need to be pivoted.
The second invocation (Step 7) determines if the data needs to be pivoted and generates a Python function for pivoting if needed. Another issue of the input data is that the same air quality measurement can have different names from different manufacturers; for example, “P1” and “PM1” are for the same type of measurement.
The third invocation (Step 8) focuses on data cleaning. It generates a Python function to convert data frames to a common data format. The Python function may include steps for unifying column names for the same type of measurement and dropping columns.
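
The pivoting and column-unifying steps in the list above can be illustrated with a short Pandas sketch. This is hand-written for illustration, not code produced by the solution: the sample readings, the column names `value_type` and `value`, and the `column_map` dictionary are assumptions, while the "P1" to "PM1" mapping follows the example given in the text.

```python
import pandas as pd

# Long format: one row per (timestamp, value type) pair, with made-up readings.
long_df = pd.DataFrame(
    {
        "timestamp": ["t1", "t1", "t2", "t2"],
        "value_type": ["P1", "temperature", "P1", "temperature"],
        "value": [12.0, 30.5, 15.0, 29.8],
    }
)

# Pivot so that each value type becomes its own column.
wide_df = long_df.pivot(
    index="timestamp", columns="value_type", values="value"
).reset_index()
wide_df.columns.name = None

# Unify manufacturer-specific names for the same measurement.
column_map = {"P1": "PM1"}
unified_df = wide_df.rename(columns=column_map)

print(unified_df)
```

After these two steps, files from manufacturers that name the measurement differently end up with the same column headers and can be stored together.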

All LLM-generated Python code is stored in the repository (Step 9) so that it can be used to process daily raw device data files for transformation into a common format.
The data is then stored in Amazon S3 (Step 10) and can be published to OpenAQ so other organizations can use the calibrated air quality data.
The following screenshot shows the proposed frontend for illustrative purposes only, as the solution is designed to integrate with Afri-SET’s existing backend system.

Results
The proposed method minimizes LLM invocations, thus optimizing cost and resources. The solution only invokes the LLM when a new data format is detected. The generated code is stored, so that input data with a previously seen format can reuse it for data processing.
A human-in-the-loop mechanism safeguards data ingestion. This happens only when a new data format is detected to avoid overburdening scarce Afri-SET resources. Having a human-in-the-loop to validate each data transformation step is optional.
Automatic code generation reduces data engineering work from months to days. Afri-SET can use this solution to automatically generate Python code based on the format of the input data. The output data is transformed to a standardized format and stored in a single location in Amazon S3 in Parquet format, a columnar and efficient storage format. If useful, it can be further extended to a data lake platform that uses AWS Glue (a serverless data integration service for data preparation) and Amazon Athena (a serverless and interactive analytics service) to analyze and visualize data. With AWS Glue custom connectors, it’s effortless to transfer data between Amazon S3 and other applications. Additionally, this gives Afri-SET’s software engineers a no-code experience for building their data pipelines.
Conclusion
This solution allows for easy data integration to help expand cost-effective air quality monitoring. It supports data-driven, informed legislation, fostering community empowerment and encouraging innovation.
This initiative, aimed at gathering precise data, is a significant step towards a cleaner and healthier environment. We believe that AWS technology can help address poor air quality through technical solutions similar to the one described here. If you want to prototype similar solutions, apply to the AWS Health Equity initiative.
As always, AWS welcomes your feedback. Please leave your thoughts and questions in the comments section.

About the authors
Sandra Topic is an Environmental Equity Leader at AWS. In this role, she leverages her engineering background to find new ways to use technology for solving the world’s “To Do list” and drive positive social impact. Sandra’s journey includes social entrepreneurship and leading sustainability and AI efforts in tech companies.
Qiong (Jo) Zhang, PhD, is a Senior Partner Solutions Architect at AWS, specializing in AI/ML. Her current areas of interest include federated learning, distributed training, and generative AI.  She holds 30+ patents and has co-authored 100+ journal/conference papers. She is also the recipient of the Best Paper Award at IEEE NetSoft 2016, IEEE ICC 2011, ONDM 2010, and IEEE GLOBECOM 2005.
Gabriel Verreault is a Senior Partner Solutions Architect at AWS for the Industrial Manufacturing segment. Gabriel works with AWS partners to define, build, and evangelize solutions around Smart Manufacturing, Sustainability and AI/ML. Gabriel also has expertise in industrial data platforms, predictive maintenance, and combining AI/ML with industrial workloads.
Venkatavaradhan (Venkat) Viswanathan is a Global Partner Solutions Architect at Amazon Web Services. Venkat is a Technology Strategy Leader in Data, AI, ML, generative AI, and Advanced Analytics. Venkat is a Global SME for Databricks and helps AWS customers design, build, secure, and optimize Databricks workloads on AWS.


Use zero-shot large language models on Amazon Bedrock for custom named entity recognition

Named entity recognition (NER) is the process of extracting information of interest, called entities, from structured or unstructured text. Manually identifying all mentions of specific types of information in documents is extremely time-consuming and labor-intensive. Some examples include extracting players and positions in an NFL game summary, products mentioned in an AWS keynote transcript, or key names from an article on a favorite tech company. This process must be repeated for every new document and entity type, making it impractical for processing large volumes of documents at scale. With more access to vast amounts of reports, books, articles, journals, and research papers than ever before, swiftly identifying desired information in large bodies of text is becoming invaluable.
Traditional neural network models like RNNs and LSTMs and more modern transformer-based models like BERT for NER require costly fine-tuning on labeled data for every custom entity type. This makes adopting and scaling these approaches burdensome for many applications. However, new capabilities of large language models (LLMs) enable high-accuracy NER across diverse entity types without the need for entity-specific fine-tuning. By using the model’s broad linguistic understanding, you can perform NER on the fly for any specified entity type. This capability is called zero-shot NER and enables the rapid deployment of NER across documents and many other use cases. This ability to extract specified entity mentions without costly tuning unlocks scalable entity extraction and downstream document understanding.
In this post, we cover the end-to-end process of using LLMs on Amazon Bedrock for the NER use case. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. In particular, we show how to use Amazon Textract to extract text from documents such as PDFs or image files, and use the extracted text along with user-defined custom entities as input to Amazon Bedrock to conduct zero-shot NER. We also touch on the usefulness of text truncation for prompts using Amazon Comprehend, along with the challenges, opportunities, and future work with LLMs and NER.
Solution overview
In this solution, we implement zero-shot NER with LLMs using the following key services:

Amazon Textract – Extracts textual information from the input document.
Amazon Comprehend (optional) – Identifies predefined entities such as names of people, dates, and numeric values. You can use this feature to limit the context over which the entities of interest are detected.
Amazon Bedrock – Calls an LLM to identify entities of interest from the given context.

The following diagram illustrates the solution architecture.

The main inputs are the document image and target entities. The objective is to find values of the target entities within the document. If the truncation path is chosen, the pipeline uses Amazon Comprehend to reduce the context. The output of the LLM is postprocessed to generate the output as entity-value pairs.
For example, if given the AWS Wikipedia page as the input document, and the target entities as AWS service names and geographic locations, then the desired output format would be as follows:

AWS service names:
Geographic locations:
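
The postprocessing step that turns LLM output into entity-value pairs could look like the following sketch. The post does not show the actual response format or postprocessing code, so this assumes the model returns one `entity: value1, value2` line per target entity; `parse_entity_lines` is a hypothetical helper and `sample_response` is made-up text.

```python
def parse_entity_lines(llm_text, target_entities):
    """Parse 'entity: value1, value2' lines into a dict of value lists."""
    results = {entity: [] for entity in target_entities}
    for line in llm_text.splitlines():
        name, sep, values = line.partition(":")
        name = name.strip()
        # Ignore lines without a colon or with an entity we did not ask for.
        if sep and name in results:
            results[name] = [v.strip() for v in values.split(",") if v.strip()]
    return results


# Made-up response text in the assumed format.
sample_response = (
    "AWS service names: Amazon EC2, Amazon S3\n"
    "Geographic locations: Northern Virginia\n"
    "Unrelated note: ignore me"
)

parsed = parse_entity_lines(
    sample_response, ["AWS service names", "Geographic locations"]
)
print(parsed)
```

Restricting the result to the requested entity names keeps stray lines in the model output from leaking into the final entity-value pairs.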

In the following sections, we describe the three main modules to accomplish this task. For this post, we used Amazon SageMaker notebooks with ml.t3.medium instances along with Amazon Textract, Amazon Comprehend, and Amazon Bedrock.
Extract context
Context is the information taken from the document in which the values of the queried entities are found. Consuming a full document (full context) significantly increases the input token count to the LLM. We provide the option of using either the entire document or local context around the relevant parts of the document, as defined by the user.
First, we extract context from the entire document using Amazon Textract. The code below uses the amazon-textract-caller library as a wrapper for the Textract API calls. You need to install the library first:

python -m pip install amazon-textract-caller

Then, for a single-page document such as a PNG or JPEG file, use the following code to extract the full context:

from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import get_text_from_layout_json

document_name = "sample_data/synthetic_sample_data.png"

# call Textract with the LAYOUT feature enabled
layout_textract_json = call_textract(
    input_document=document_name,
    features=[Textract_Features.LAYOUT]
)

# extract the text from the JSON response
full_context = get_text_from_layout_json(textract_json=layout_textract_json)[1]

Note that PDF input documents must be in an S3 bucket when using the call_textract function. For multi-page TIFF files, make sure to set force_async_api=True.
Truncate context (optional)
When the user-defined custom entities to be extracted are sparse compared to the full context, we provide an option to identify relevant local context and then look for the custom entities within the local context. To do so, we use generic entity extraction with Amazon Comprehend. This is assuming that the user-defined custom entity is a child of one of the default Amazon Comprehend entities, such as “name”, “location”, “date”, or “organization”. For example, “city” is a child of “location”. We extract the default generic entities through the AWS SDK for Python (Boto3) as follows:

import boto3
import pandas as pd

# detect the default (generic) entities in the extracted text
comprehend_client = boto3.client("comprehend")
generic_entities = comprehend_client.detect_entities(Text=full_context,
                                                     LanguageCode="en")
df_entities = pd.DataFrame.from_dict(generic_entities["Entities"])

It outputs a list of dictionaries containing the entity as “Type”, the value as “Text”, along with other information such as “Score”, “BeginOffset”, and “EndOffset”. For more details, see DetectEntities. The following is an example output of Amazon Comprehend entity extraction, which provides the extracted generic entity-value pairs and location of the value within the text.

{
  "Entities": [
    {
      "Text": "AWS",
      "Score": 0.98,
      "Type": "ORGANIZATION",
      "BeginOffset": 21,
      "EndOffset": 24
    },
    {
      "Text": "US East",
      "Score": 0.97,
      "Type": "LOCATION",
      "BeginOffset": 1100,
      "EndOffset": 1107
    }
  ],
  "LanguageCode": "en"
}

The extracted list of generic entities may be more exhaustive than the queried entities, so a filtering step is necessary. For example, a queried entity is “AWS revenue” and generic entities contain “quantity”, “location”, “person”, and so on. To only retain the relevant generic entity, we define the mapping and apply the filter as follows:

query_entities = ['XX']
user_defined_map = {'XX': 'QUANTITY', 'YY': 'PERSON'}
entities_to_keep = [v for k, v in user_defined_map.items() if k in query_entities]
df_filtered = df_entities.loc[df_entities['Type'].isin(entities_to_keep)]

After we identify a subset of generic entity-value pairs, we want to preserve the local context around each pair and mask out everything else. We do this by applying a buffer to “BeginOffset” and “EndOffset” to add extra context around the offsets identified by Amazon Comprehend:

StrBuff, EndBuff = 20, 10
df_offsets = df_filtered.apply(lambda row: pd.Series({
    'BeginOffset': max(0, row['BeginOffset'] - StrBuff),
    'EndOffset': min(row['EndOffset'] + EndBuff, len(full_context))}), axis=1).reset_index(drop=True)

We also merge any overlapping offsets to avoid duplicating context:

for index, _ in df_offsets.iterrows():
    if (index > 0) and (df_offsets.iloc[index]['BeginOffset']
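
The merging code above is cut off in this excerpt. As a hedged illustration of the idea, and not the authors' code, the following sketch shows one common way to merge overlapping character ranges and keep only the local context they cover. The helper names `merge_offsets` and `local_context`, the `" ... "` separator, and the sample text and offsets are assumptions.

```python
def merge_offsets(ranges):
    """Merge overlapping (begin, end) character ranges into disjoint ones."""
    merged = []
    for begin, end in sorted(ranges):
        if merged and begin <= merged[-1][1]:
            # Overlaps the previous range: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((begin, end))
    return merged


def local_context(text, ranges, separator=" ... "):
    """Join the text slices covered by the merged ranges."""
    return separator.join(text[b:e] for b, e in merge_offsets(ranges))


# Made-up text and buffered offsets; the first two ranges overlap.
text = "The quick brown fox jumps over the lazy dog."
offsets = [(4, 12), (10, 19), (35, 43)]

merged = merge_offsets(offsets)
snippet = local_context(text, offsets)
print(merged)
print(snippet)
```

Sorting the ranges first means each range only needs to be compared with the most recently merged one, so shared context is kept once rather than duplicated in the prompt.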
