Commit

clear outputs
willpoulett committed Mar 15, 2024
1 parent 5347b96 commit 9e5ba6d
Showing 1 changed file with 8 additions and 144 deletions.
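
The change in this commit resets each code cell's `outputs` to an empty list and its `execution_count` to `null`. A minimal sketch of doing the same thing programmatically is below, using only Python's standard `json` module. The function name `clear_outputs` and the sample notebook are hypothetical illustrations, not code from this repository, and the commit itself may well have been produced with an editor command instead.

```python
import json


def clear_outputs(notebook: dict) -> dict:
    """Reset outputs and execution counts of code cells (modifies in place).

    Hypothetical helper for illustration; assumes the nbformat v4 layout
    with a top-level "cells" list.
    """
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop stdout/stderr/display data
            cell["execution_count"] = None  # serialised as null in JSON
    return notebook


if __name__ == "__main__":
    # Tiny made-up notebook: one executed code cell, one markdown cell.
    sample = {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": 5,
                "metadata": {},
                "outputs": [
                    {"name": "stdout", "output_type": "stream", "text": ["hi\n"]}
                ],
                "source": ["print('hi')"],
            },
            {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }
    print(json.dumps(clear_outputs(sample), indent=1))
```

To apply this to a real file, one would load it with `json.load`, pass the result through `clear_outputs`, and write it back with `json.dump`. Markdown cells are left untouched because they carry no outputs.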
152 changes: 8 additions & 144 deletions dev.ipynb
@@ -18,7 +18,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -51,18 +51,9 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\Will Poulett\\OneDrive - NHS\\Documents\\RAG\\ds_251_RAG\\.venv\\Lib\\site-packages\\langchain_community\\llms\\anthropic.py:180: UserWarning: This Anthropic LLM is deprecated. Please use `from langchain_community.chat_models import ChatAnthropic` instead\n",
" warnings.warn(\n"
]
}
],
"outputs": [],
"source": [
"rag_pipeline = models.RagPipeline(config['EMBEDDING_MODEL'], config['PERSIST_DIRECTORY'])"
]
@@ -76,7 +67,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -94,38 +85,9 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\Will Poulett\\OneDrive - NHS\\Documents\\RAG\\ds_251_RAG\\.venv\\Lib\\site-packages\\langchain_core\\_api\\deprecation.py:117: LangChainDeprecationWarning: The function `__call__` was deprecated in LangChain 0.1.7 and will be removed in 0.2.0. Use invoke instead.\n",
" warn_deprecated(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"The main benefits of Reproducible Analytical Pipelines (RAP) are:\n",
"\n",
"1. Improved replicability: Using RAPs helps ensure your analysis can be replicated by colleagues or your future self, improving scientific rigor. Automating data processing and analysis steps makes it easier to reproduce results.\n",
"\n",
"2. Reuse and sharing: RAPs make analysis code and workflows reusable by your future self or others in your organization. Packaging code and data together facilitates collaboration and extensions of your work.\n",
"\n",
"3. Efficiency gains: With automated steps for data cleaning, processing, and analysis, RAPs save time compared to manual work. Changes to the analysis flow can be quickly propagated.\n",
"\n",
"4. Better documentation: RAPs explicitly lay out analysis steps and the code underlying them. This helps serve as documentation, refreshing your memory or informing new team members.\n",
"\n",
"5. Error reduction: Automating analysis steps leaves less room for manual error during data manipulation. Automated tracking of data provenance aids debugging.\n",
"\n",
"In summary, adopting RAP principles leads to more reliable, reusable, and efficient analytical workflows that promote transparency, rigor, and collaboration. The investment required pays off in time savings and better output down the road\n"
]
}
],
"outputs": [],
"source": [
"question = \"Explain the main benefits of Reproducible Analytical Pipelines (RAP)\"\n",
"\n",
@@ -143,107 +105,9 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"\u001b[1m> Entering new StuffDocumentsChain chain...\u001b[0m\n",
"\n",
"\n",
"\u001b[1m> Entering new LLMChain chain...\u001b[0m\n",
"Prompt after formatting:\n",
"\u001b[32;1m\u001b[1;3mYou are a helpful assistant that helps people with their questions. You are not a replacement for human judgement, but you can help humansmake more informed decisions. If you are asked a question you cannot answer based on your following instructions, you should say so.Be concise and professional in your responses.\n",
"\n",
" Given the following extracted parts of a long document and a question, create a final answer with references (\"SOURCES\"). If you don't know the answer, just say that you don't know. Don't try to make up an answer. ALWAYS return a \"SOURCES\" part in your answer.\n",
"\n",
"Example 1: \"**RAP** is to be the foundation of analyst training. SOURCES: (goldacre_review.txt)\"\n",
"Example 2: \"Open source code is a good idea because:\n",
"* it's cheap (goldacre_review.txt)\n",
"* it's easy for people to access and use (open_source_guidlines.txt)\n",
"* it's easy to share (goldacre_review.txt)\n",
"\n",
"SOURCES: (goldacre_review.txt, open_source_guidlines.txt)\"\n",
"\n",
"QUESTION: Explain the main benefits of Reproducible Analytical Pipelines (RAP)\n",
"=========\n",
"docs\\goldacre_review.txt:\n",
"By contrast to this manual approach, Reproducible Analytical Pipelines deliver the same work more efficiently and reliably, using commonplace contemporary practices that have been developed over time by the analytic, data science and software development community. The adoption of standard working practices from the software development community is important, as it reflects the reality that data analysis is implemented by writing code. RAPs reflect a modern, open, collaborative and software-driven approach to delivering high quality analytics that are reproducible, re-usable, auditable, efficient, high quality, and more likely to be free from error.\n",
"\n",
"At minimum a RAP will meet the following criteria (adapted from Government Statistical Service):\n",
"\n",
"docs\\goldacre_review.txt:\n",
"In the chapter on Open Working there is a detailed description of Reproducible Analytical Pipelines. This is a powerful brand first developed in 2017 by the Government Digital Service (GDS) to describe a range of contemporary best practices for data management and analysis in the public sector. It is built around a single core principle: ‘At any point in the future we should be able to look back at this work and be able to reproduce everything that we have done today – something that is difficult with manual and semi-manual processes.’ RAP emphasises a range of working principles. It promotes the use of open source languages such as R and Python rather than proprietary tools: this ensures that all subsequent users are guaranteed to have access to the same tools; and reflects the emphasis placed by the open source software community on good documentation, flexibility, and extensibility, which are all powerful principles for all data analysis. RAP is now a very strong, very broad\n",
"\n",
"docs\\goldacre_review.txt:\n",
"regularly re-evaluate and compare all currently performant or realistic mechanisms to achieve the above, and ensure that only the safest are used\n",
"ensure all outputs are checked for potentially disclosive material by a mixture of appropriately validated automated methods, and manual checking\n",
"2. Support RAP and modern, efficient, high quality, reproducible data analysis:\n",
"\n",
"docs\\goldacre_review.txt:\n",
"There is great work across government to use the principles of Reproducible Analytical Pipelines in data analysis. But this is not yet the norm. Changes are needed if this approach is to become the first choice. This will require collaboration and strong leadership within organisations and across government. This is especially important for health and care data, which is characterised by large data sets but dispersed expertise and limited knowledge sharing.\n",
"\n",
"Ed Humpherson, Director General, OSR\n",
"\n",
"docs\\goldacre_review.txt:\n",
"These working practices achieve a range of important outcomes. Minimising manual steps makes analyses faster to execute. This makes it easier to deliver timely outputs, dashboards and reports that reflect the current raw data, rather than out of date information. This speed and low cost also makes it easier to re-execute the whole pipeline swiftly when errors or shortcomings in one aspect of the work are found and addressed, or when modifications have been implemented. Sharing code widely allows others to see the work, and to re-use it in their own identical or related analyses where helpful. Open code also adds an extra layer of assurance, as it allows a wider community of engaged users and experts to help to identify problems, or offer improvements; it also helps build capacity across the system, because people using data can see what others have done with it, and learn from their prior work. Adequate documentation – embedded alongside the code itself – makes the work intelligible\n",
"\n",
"docs\\goldacre_review.txt:\n",
"It is clear from the minimum working practices of RAP, and the descriptions of more advanced computational approaches, that code sharing is a core feature of delivering high quality, sustainable outputs and data infrastructure. The practical value of openness is covered above. In outline, open code helps to drive quality through review that identify errors, and by ensuring all users are fully aware of the operations on which they are dependent. It supports efficient re-use, and iterative improvement, in a modular collaborative ecosystem. It supports capacity building, through easy access to prior related technical work. Open code also helps to build trust in statistics from the public, policymakers and professionals, by sharing a comprehensive description of how the raw data was converted into the final analytic outputs to be acted upon; this may be particularly important on contentious issues around performance monitoring, or the risks and benefits of particular treatments,\n",
"\n",
"docs\\goldacre_review.txt:\n",
"Analytic approaches and Reproducible Analytical Pipelines\n",
"As a consequence of the structural and organisational challenges outlined above, it is clear that there is very substantial variation in analytic approaches taken between different settings. There are many outstanding examples of excellent work, using modern and open approaches computational data science, often driven by a single individual or small group in one setting. But these pockets were largely invisible to those outside of their group or organisation. It is clear that there is also a strong reliance across the system on more outdated and inefficient means of data management and analysis, using ‘point and click’ tools such as Excel which undoubtedly have a role but can commonly obstruct reproducibility, transferability, efficient updates, scaling, real-time analytics, and error-checking in analyses.\n",
"\n",
"docs\\goldacre_review.txt:\n",
"placed by the open source software community on good documentation, flexibility, and extensibility, which are all powerful principles for all data analysis. RAP is now a very strong, very broad movement across government departments with extensive training and deep experience of implementing change in diverse settings.\n",
"\n",
"docs\\goldacre_review.txt:\n",
"The Office of National Statistics (ONS) and the GDS have already developed, over recent years, a set of best practice principles for modern, open, collaborative work with data. This work is branded as ‘Reproducible Analytical Pipelines’ (RAP) with a clear set of design principles to support high quality analytics that are reproducible, re-usable, auditable, efficient, high quality, and more likely to be free from error. At minimum a RAP will meet various criteria. It will minimise manual steps (such as copy-paste, point-click or drag-drop operations; where it is necessary to include them, they must be properly documented). It will be built using open source software for data management, analysis and visualisation (such as R or python) as this is standard, portable, and available to all for checking and re-use. The code will be open to anyone for review and re-use, with all code shared openly through open standard file and code sharing platforms such as GitHub. The code will be well\n",
"\n",
"docs\\goldacre_review.txt:\n",
"The various texts on RAP describe the prior norms around statistics production in government, in similar terms to current working practices in health seen during this review.\n",
"\n",
"Broadly speaking, data are extracted from a datastore (whether it is a data lake, database, spreadsheet, or flat file), and are manipulated in a proprietary statistical software package, and possibly in proprietary spreadsheet software. Formatted tables are often then ‘copy and pasted’ into a word processor, before being converted to PDF format, and finally published… This is quite a simplification, as statistical publications are usually produced by several people, so this process is likely to be happening in parallel many times… [Quality assurance is then] a manual process which can take up a significant portion of the overall production… as any changes will require the manual process of production to be repeated.\n",
"=========\n",
"FINAL ANSWER:\u001b[0m\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"d:\\PycharmProjects\\ds_251_RAG\\.venv\\Lib\\site-packages\\langchain_core\\_api\\deprecation.py:117: LangChainDeprecationWarning: The function `__call__` was deprecated in LangChain 0.1.0 and will be removed in 0.2.0. Use invoke instead.\n",
" warn_deprecated(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\u001b[1m> Finished chain.\u001b[0m\n",
"\n",
"\u001b[1m> Finished chain.\u001b[0m\n",
"\n",
"The main benefits of Reproducible Analytical Pipelines (RAP) are:\n",
"\n",
"1. They deliver work more efficiently and reliably by minimizing manual steps, which speeds up analysis, allows easy re-running of the analysis pipeline, and makes it easier to share code for collaboration\n",
"\n",
"2. They use open source software like R and Python that guarantees all users have access to the tools, and emphasizes documentation, flexibility, and extensibility\n",
"\n",
"3. They produce analyses and code that are reusable, auditable, efficient, high quality, and more likely to be free of errors\n",
"\n",
"4. They support open sharing of code and analyses, which allows wider review to identify problems and offer improvements, drives re-use and capacity building across the system, and builds trust in the analyses and statistics produced\n",
"\n",
"SOURCES: (goldacre_review.txt)\n"
]
}
],
"outputs": [],
"source": [
"result = rag_pipeline.answer_question(question, rag=True)\n",
"\n",