Wednesday, November 28, 2018

What’s Your Data Strategy - 2 Types of Data Strategy

https://hbr.org/2017/05/whats-your-data-strategy?autocomplete=true


Saturday, July 28, 2018

Data Monetization.

https://www.accenture.com/us-en/insight-data-monetization-summary#block-datastrategy


Wednesday, December 20, 2017

Improving Data Quality with Machine Learning Techniques.



https://data.bloomberglp.com/promo/sites/12/750171296-FinDataGovernance.pdf

============================================================
https://www.bis.org/ifc/events/ifc_nbb_workshop/ifc_nbb_workshop_2d3.pdf

Improving Data Quality and Closing Data Gaps
with Machine Learning
Tobias Cagala*
May 5, 2017

=======================================================================


http://conteudo.icmc.usp.br/pessoas/gbatista/files/aai2003.pdf


An Analysis of Four Missing Data
Treatment Methods for Supervised Learning
Gustavo E. A. P. A. Batista and Maria Carolina Monard
University of São Paulo - USP
Institute of Mathematics and Computer Science - ICMC
Department of Computer Science and Statistics - SCE
Laboratory of Computational Intelligence - LABIC
P. O. Box 668, 13560-970 - São Carlos, SP, Brazil
{gbatista, mcmonard}@icmc.usp.br

==================================================================


http://dimacs-algorithmic-mdm.wdfiles.com/local--files/start/Methodologies%20for%20Data%20Quality%20Assessment%20and%20Improvement.pdf

Methodologies for Data Quality Assessment and Improvement
CARLO BATINI
Università di Milano - Bicocca
CINZIA CAPPIELLO
Politecnico di Milano
CHIARA FRANCALANCI
Politecnico di Milano
and
ANDREA MAURINO
Università di Milano - Bicocca


====================================================================

https://www.waterstechnology.com/reference-data-data-management/data-governance/3309956/webinar-making-enterprise-data-quality-a-reality

Webinar: Making Enterprise Data Quality a Reality


======================================================================

https://www.waterstechnology.com/innovationhub


=========================================================

https://www.bloomberg.com/professional/blog/machine-learning-plays-critical-role-improving-data-quality/

=============================================================


Monday, December 11, 2017

DATA QUALITY STRATEGY

DATA QUALITY STRATEGY: A STEP-BY-STEP APPROACH

“Strategy is a cluster of decisions centered on goals that determine what actions to take and how
to apply resources.”

Certainly a cluster of decisions – in this case concerning six specific factors
– will need to be made to effectively improve the data. Corporate goals will determine
how the data is used and the level of quality needed. Actions are the processes improved and invoked
to manage the data. Resources are the people, systems, financing, and the data itself.
We can therefore apply the selected definition in the context of data, and arrive at the definition
of data quality strategy:

“A cluster of decisions centered on organizational data quality goals that determine the data
processes to improve, solutions to implement, and people to engage.”

EXECUTIVE SUMMARY
This paper will discuss:
• Goals that drive a data quality strategy
• Six factors that should be considered when building a strategy
• Decisions within each factor
• Actions stemming from those decisions
• Resources affected by the decisions and needed to support the actions.

You will see how these six factors — when added together in different combinations — provide the
answer as to how people, process and technology are the integral and fundamental elements of
information quality.

GOALS OF DATA QUALITY
Goals drive strategy. Data quality goals must support on-going functional operations, data management
processes, or other initiatives such as the implementation of a new data warehouse (DW), CRM
application, or loan processing system.

THE SIX FACTORS OF DATA QUALITY

When creating a data quality strategy there are six factors, or aspects of an organization’s operations that
must be considered. Those six factors include:
• Context — the type of data being cleansed and the purposes for which it is used
• Storage — where the data resides
• Data Flow — how the data enters and moves through the organization
• Work Flow — how work activities interact with and use the data
• Stewardship — people responsible for managing the data
• Continuous Monitoring — processes for regularly validating the data

Figure 1 depicts the six factors centered on the goals of a data quality initiative, and shows that each factor requires decisions to be made, actions to be carried out, and resources to be allocated.

TYING IT ALL TOGETHER

In order for any strategy framework to be useful and effective, it must be scalable. The strategy framework provided here is scalable from a simple one-field update, such as validating gender codes of male and female, to an enterprise-wide initiative where 97 ERP systems need to be cleansed and consolidated into 1 system. To ensure the success of the strategy, and hence the project, each of the six factors must be evaluated. The size (number of records/rows) and scope (number of databases, tables, and columns) determine the depth to which each factor is evaluated.

Taken all together or in smaller groups, the six factors act as operands in data quality strategy formulas.

• Context by itself = The type of cleansing algorithms needed
• Context + Storage + Data Flow + Work Flow = The types of cleansing and monitoring technology implementations needed
• Stewardship + Work Flow = Near-term personnel impacts
• Stewardship + Work Flow + Continuous Monitoring = Long-term personnel impacts
• Data Flow + Work Flow + Continuous Monitoring = Changes to processes

It is as a result of using these formulas that people come to understand that information quality truly is the integration of people, process, and technology in the pursuit of deriving value from information assets.



To help the practitioner employ the data quality strategy methodology,
the core practices have been extracted from the factors and listed here.
a) A statement of the goals driving the project
b) A list of data sets and elements that support the goal
c) A list of data types and categories to be cleansed(1)
d) A catalog, schema or map of where the data resides(2)
e) A discussion of cleansing solutions per category of data(3)
f) Dataflow diagrams of applicable existing dataflows
g) Work flow diagrams of applicable existing work flows
h) A plan for when and where the data is accessed for cleansing(4)
i) A discussion of how the dataflow will change after project implementation
j) A discussion of how the work flow will change after project implementation
k) A list of stakeholders affected by the project
l) A plan for educating stakeholders as to the benefits of the project
m) A plan for training operators and users
n) A list of data quality measurements and metrics to monitor
o) A plan for when and where to monitor(5)
p) A plan for initial and then regularly scheduled cleansing


Credit goes to Frank Dravis
https://pdfs.semanticscholar.org/5b5b/a15e8ea1bd89fe4d14d5e97ce456436291e0.pdf





Sunday, November 12, 2017

Functional Programming

https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-001-structure-and-interpretation-of-computer-programs-spring-2005/video-lectures/


https://www.youtube.com/watch?v=7Zlp9rKHGD4


https://pragprog.com/magazines/2013-01/functional-programming-basics

https://maryrosecook.com/blog/post/a-practical-introduction-to-functional-programming



===============================================================

Functional programming is based on the lambda calculus.
It does without assignment statements.

A statement has a side effect, whereas an expression has no side effects.

It is an expression-based programming paradigm: expressions are used instead of statements, expressions are combined to form functions, and functions are combined to form complex behaviors.
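
A minimal Python illustration (my own, not from the linked articles): the same computation written with statements that mutate state, and as a single expression with no side effects:

# Statement style: relies on assignment and mutation (side effects on `total`).
def sum_of_squares_statements(numbers):
    total = 0
    for n in numbers:
        total = total + n * n   # reassignment mutates local state
    return total

# Expression style: one expression built from smaller expressions, no mutation.
def sum_of_squares_expression(numbers):
    return sum(n * n for n in numbers)

assert sum_of_squares_statements([1, 2, 3]) == sum_of_squares_expression([1, 2, 3]) == 14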

=================================================================

Functional Programming Jargon


Pure Functions


Immutability


Recursion


No Side Effects


Higher Order Functions


Category Theory


Lambda Calculus


Currying


Type Strictness : in some languages, for example, Int is a primitive type with no backing class, whereas String is a type backed by a class.
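
A small Python sketch (my own example, not from the linked articles) touching several of these terms: pure functions, recursion, higher-order functions, and currying:

from functools import reduce

# Pure function: output depends only on its inputs, no side effects.
def square(x):
    return x * x

# Higher-order function: takes a function as an argument.
def apply_twice(f, x):
    return f(f(x))

# Recursion instead of a loop with mutable state.
def factorial(n):
    return 1 if n <= 1 else n * factorial(n - 1)

# Currying: a function of two arguments expressed as a chain of one-argument functions.
def add(a):
    return lambda b: a + b

increment = add(1)                    # partial application
print(apply_twice(square, 3))         # 81
print(factorial(5))                   # 120
print(increment(41))                  # 42
print(reduce(lambda acc, x: acc + x, map(square, [1, 2, 3])))  # 14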

Saturday, October 28, 2017

Best Content on Spark

https://spark-summit.org/2014/wp-content/uploads/2014/07/A-Deeper-Understanding-of-Spark-Internals-Aaron-Davidson.pdf

https://spark-summit.org/2014/wp-content/uploads/2015/03/SparkSummitEast2015-AdvDevOps-StudentSlides.pdf

Following is a summary of the above 2 PDF documents.

Spark Major core components: 

– Execution Model
– The Shuffle
– Caching

Spark Execution Model 

1. Create a DAG of RDDs to represent the computation
2. Create a logical execution plan for the DAG
   • Pipeline as much as possible
   • Split into “stages” based on the need to reorganize data
3. Schedule and execute individual tasks
   • Split each stage into tasks
   • A task is data + computation
   • Execute all tasks within a stage before moving on
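
A rough PySpark sketch of this model (assuming an existing SparkContext named sc; not taken from the slides): the map and filter steps are pipelined into one stage, and reduceByKey forces a new stage because the data must be reorganized:

# Assumes a running SparkContext, e.g.:
# from pyspark import SparkContext
# sc = SparkContext("local[*]", "execution-model-sketch")

words = sc.parallelize(["spark", "rdd", "spark", "stage", "task", "spark"])

pairs = (words
         .map(lambda w: (w, 1))               # element-wise, pipelined
         .filter(lambda kv: len(kv[0]) > 3))  # still the same stage

counts = pairs.reduceByKey(lambda a, b: a + b)  # shuffle => new stage

# toDebugString() shows the lineage and the stage boundary introduced by the shuffle.
print(counts.toDebugString())
print(counts.collect())   # the action triggers the job: stages -> tasks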
---------------------------------------------







“The main abstraction in Spark is that of a resilient distributed dataset (RDD), which represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.

Users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations.

RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition.”


“We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.

RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools.

In both cases, keeping data in memory can improve performance by an order of magnitude.”


An RDD can be created in 2 ways:

1) Parallelize a collection
2) Read data from an external source (S3, C*, HDFS, etc.)

Life Cycle of a Spark Program

1) Create some input RDDs from external data or parallelize a collection in your driver program.

2) Lazily transform them to define new RDDs using transformations like filter() or map()

3) Ask Spark to cache() any intermediate RDDs that will need to be reused.

4) Launch actions such as count() and collect() to kick off a parallel computation, which is then optimized and executed by Spark.
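
A minimal PySpark sketch of this life cycle (my own example; the commented textFile path is hypothetical):

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-lifecycle-sketch")

# 1) Create input RDDs: parallelize a collection, or read external data, e.g.
#    lines = sc.textFile("hdfs:///tmp/input.txt")   # hypothetical path
lines = sc.parallelize(["INFO start", "ERROR disk full", "INFO done", "ERROR timeout"])

# 2) Lazily define new RDDs with transformations such as filter() and map().
errors = lines.filter(lambda line: "ERROR" in line)
messages = errors.map(lambda line: line.split(" ", 1)[1])

# 3) cache() intermediate RDDs that will be reused by more than one action.
messages.cache()

# 4) Actions kick off the optimized parallel computation.
print(messages.count())     # 2
print(messages.collect())   # ['disk full', 'timeout']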


Transformations (Lazy)

map()
intersection()
cartesian()
flatMap()
distinct()
pipe()
filter()
groupByKey()
coalesce()
mapPartitions()
reduceByKey()
repartition()
mapPartitionsWithIndex()
sortByKey()
partitionBy()
sample()
join() ...
union()
cogroup() ...

(lazy) - Most transformations are element-wise (they work on one element at a time), but this is not true for all transformations

Actions

reduce()
takeOrdered()
collect()
saveAsTextFile()
count()
saveAsSequenceFile()
first()
saveAsObjectFile()
take()
countByKey()
takeSample()
foreach()
saveToCassandra()

Types of RDDs

HadoopRDD • FilteredRDD • MappedRDD • PairRDD • ShuffledRDD • UnionRDD • PythonRDD • DoubleRDD • JdbcRDD • JsonRDD • SchemaRDD • VertexRDD • EdgeRDD


RDD Interface

1) Set of partitions (“splits”)
2) List of dependencies on parent RDDs
3) Function to compute a partition given parents
4) Optional preferred locations
5) Optional partitioning info for k/v RDDs (Partitioner)

Example: HadoopRDD

Partitions = one per HDFS block
Dependencies = none
Compute (partition) = read corresponding block
preferredLocations (part) = HDFS block location
Partitioner = none

Example: FilteredRDD

Partitions = same as parent RDD
Dependencies = “one-to-one” on parent
Compute (partition) = compute parent and filter it
preferredLocations (part) = none (ask parent)
Partitioner = none


Example: JoinedRDD

Partitions = One per reduce task
Dependencies = “shuffle” on each parent
Compute (partition) = read and join shuffled data
preferredLocations (part) = none
Partitioner = HashPartitioner(numTasks)


Memory and Persistence

Worker machine memory:

• Recommended to use at most only 75% of a machine’s memory for Spark
• Minimum Executor heap size should be 8 GB
• Max Executor heap size depends… maybe 40 GB (watch GC)
• Memory usage is greatly affected by storage level and serialization format

- If RDD fits in memory, choose MEMORY_ONLY
- If not, use MEMORY_ONLY_SER w/ fast serialization library
- Don’t spill to disk unless functions that computed the datasets are very expensive or they filter a large amount of data. (recomputing may be as fast as reading from disk)
- Use replicated storage levels sparingly and only if you want fast fault recovery (maybe to serve requests from a web app)
- Intermediate data is automatically persisted during shuffle operations.
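
A small PySpark sketch of persisting and reusing an RDD (my own example; MEMORY_ONLY_SER mentioned above is a JVM-side storage level, so this Python sketch sticks to MEMORY_ONLY and MEMORY_AND_DISK):

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "persistence-sketch")

rdd = sc.parallelize(range(1_000_000)).map(lambda x: (x % 100, x))

# Fits in memory: keep it in memory only.
rdd.persist(StorageLevel.MEMORY_ONLY)

# If it does not fit, a level that spills the remainder to disk is an option:
# rdd.persist(StorageLevel.MEMORY_AND_DISK)

print(rdd.count())           # first action materializes and caches the data
print(rdd.countByKey()[0])   # reuses the cached partitions
rdd.unpersist()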

Spark uses Memory for 

RDD Storage: when you call .persist() or .cache(). Spark will limit the amount of memory used when caching to a certain fraction of the JVM’s overall heap, set by spark.storage.memoryFraction

Shuffle and aggregation buffers: When performing shuffle operations, Spark will create intermediate buffers for storing shuffle output data. These buffers are used to store intermediate results of aggregations in addition to buffering data that is going to be directly output as part of the shuffle.

User code: Spark executes arbitrary user code, so user functions can themselves require substantial memory. For instance, if a user application allocates large arrays or other objects, these will contend for overall memory usage. User code has access to everything “left” in the JVM heap after the space for RDD storage and shuffle storage is allocated.
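
If that storage fraction needs tuning, it is set on the SparkConf. A minimal sketch, noting that spark.storage.memoryFraction is the legacy setting these slides refer to (newer Spark versions use unified memory management via spark.memory.fraction):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("memory-fraction-sketch")
        .setMaster("local[*]")
        # Legacy setting from the slide era: fraction of the heap reserved for cached RDDs.
        .set("spark.storage.memoryFraction", "0.6"))

sc = SparkContext(conf=conf)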


Serialization is used when:

• Data is transferred over the network (for example, during a shuffle)
• Data is spilled to disk
• Data is cached in a serialized storage level (for example, MEMORY_ONLY_SER)
• Broadcast variables are sent to the workers

Jobs -> Stages -> Tasks




Schedulers

• DAGScheduler: builds stages of tasks at shuffle boundaries and submits them once their parent stages are ready
• TaskScheduler: launches the tasks of each stage on executors through the cluster manager

Lineage

“One of the challenges in providing RDDs as an abstraction is choosing a representation for them that can track lineage across a wide range of transformations.”

“The most interesting question in designing this interface is how to represent dependencies between RDDs.”

“We found it both sufficient and useful to classify dependencies into two types:
narrow dependencies, where each partition of the parent RDD is used by at most one partition of the child RDD
wide dependencies, where multiple child partitions may depend on it.”



Stages

A stage is a set of tasks that can be pipelined and executed together without a shuffle; a new stage starts wherever a wide (shuffle) dependency forces the data to be reorganized.

Dependencies: Narrow vs. Wide

“This distinction is useful for two reasons:

1) Narrow dependencies allow for pipelined execution on one cluster node, which can compute all the parent partitions. For example, one can apply a map followed by a filter on an element-by-element basis. In contrast, wide dependencies require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation.

2) Recovery after a node failure is more efficient with a narrow dependency, as only the lost parent partitions need to be recomputed, and they can be recomputed in parallel on different nodes. In contrast, in a lineage graph with wide dependencies, a single failed node might cause the loss of some partition from all the ancestors of an RDD, requiring a complete re-execution.”



How do you know if a shuffle will be called on a Transformation?

-   repartition , join, cogroup, and any of the *By or *ByKey transformations can result in shuffles
-   If you declare a numPartitions parameter, it’ll probably shuffle
-   If a transformation constructs a shuffledRDD, it’ll probably shuffle
-   combineByKey calls a shuffle (so do other transformations like groupByKey, which actually
    end up calling combineByKey)
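
For example (a PySpark sketch of my own, assuming an existing SparkContext named sc), groupByKey and reduceByKey both shuffle, but reduceByKey combines values within each partition first, so it shuffles less data:

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)] * 1000)

# Shuffles every (key, value) pair, then sums on the reduce side.
grouped = pairs.groupByKey().mapValues(sum)

# Combines values within each partition before shuffling (backed by combineByKey).
reduced = pairs.reduceByKey(lambda a, b: a + b)

assert sorted(grouped.collect()) == sorted(reduced.collect())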

Common Performance issue checklist 

1. Ensure enough partitions for concurrency
2. Minimize memory consumption (esp. of sorting and large keys in groupBys)
3. Minimize the amount of data shuffled
4. Know the standard library



Spark supports 2 types of shared variables: 

• Broadcast variables – allows your program to efficiently send a large, read-only value to all the worker nodes for use in one or more Spark operations. Like sending a large, read-only lookup table to all the nodes.

Broadcast variables let the programmer keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

For example, to give every node a copy of a large input dataset efficiently

Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost

• Accumulators – allows you to aggregate values from worker nodes back to the driver program. Can be used to count the # of errors seen in an RDD of lines spread across 100s of nodes. Only the driver can access the value of an accumulator, tasks cannot. For tasks, accumulators are write-only.


Accumulators are variables that can only be “added” to through an associative operation.
Used to implement counters and sums, efficiently in parallel.
Spark natively supports accumulators of numeric value types and standard mutable collections, and programmers can extend for new types.
Only the driver program can read an accumulator’s value, not the tasks
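
A short PySpark sketch of both shared-variable types (assuming an existing SparkContext named sc; the lookup table and codes are made up for illustration):

# Broadcast variable: a read-only lookup table cached once per worker.
country_codes = sc.broadcast({"US": "United States", "DE": "Germany", "IN": "India"})

# Accumulator: workers only add to it; only the driver reads the result.
unknown_codes = sc.accumulator(0)

def expand(code):
    table = country_codes.value
    if code not in table:
        unknown_codes.add(1)
        return "UNKNOWN"
    return table[code]

codes = sc.parallelize(["US", "IN", "XX", "DE", "YY"])
print(codes.map(expand).collect())                  # the action runs the accumulator updates
print("unknown codes seen:", unknown_codes.value)   # read on the driver only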

Spark use cases

















Sunday, November 6, 2016

Types of Data Analytical Techniques.


Original WebPage ->
http://www.forbes.com/sites/bernardmarr/2016/02/04/the-18-best-analytics-tools-every-business-manager-should-know/2/#52e7d9d717ed

Original Author - bernardmarr
  1. Business experiments: Business experiments, experimental design and AB testing are all techniques for testing the validity of something – be that a strategic hypothesis, new product packaging or a marketing approach. It is basically about trying something in one part of the organization and then comparing it with another where the changes were not made (used as a control group). It’s useful if you have two or more options to decide between.
  2. Visual analytics: Data can be analyzed in different ways and the simplest way is to create a visual or graph and look at it to spot patterns. This is an integrated approach that combines data analysis with data visualization and human interaction. It is especially useful when you are trying to make sense of a huge volume of data.
  3. Correlation analysis: This is a statistical technique that allows you to determine whether there is a relationship between two separate variables and how strong that relationship may be. It is most useful when you ‘know’ or suspect that there is a relationship between two variables and you would like to test your assumption.
  4. Regression analysis: Regression analysis is a statistical tool for investigating the relationship between variables; for example, is there a causal relationship between price and product demand? Use it if you believe that one variable is affecting another and you want to establish whether your hypothesis is true.
  5. Scenario analysis: Scenario analysis, also known as horizon analysis or total return analysis, is an analytic process that allows you to analyze a variety of possible future events or scenarios by considering alternative possible outcomes. Use it when you are unsure which decision to take or which course of action to pursue.
  6. Forecasting/time series analysis: Time series data is data that is collected at uniformly spaced intervals. Time series analysis explores this data to extract meaningful statistics or data characteristics. Use it when you want to assess changes over time or predict future events based on what has happened in the past.
  7. Data mining: This is an analytic process designed to explore data, usually very large business-related data sets – also known as ‘big data’ – looking for commercially relevant insights, patterns or relationships between variables that can improve performance. It is therefore useful when you have large data sets that you need to extract insights from.
  8. Text analytics: Also known as text mining, text analytics is a process of extracting value from large quantities of unstructured text data. You can use it in a number of ways, including information retrieval, pattern recognition, tagging and annotation, information extraction, sentiment assessment and predictive analytics.
  9. Sentiment analysis: Sentiment analysis, also known as opinion mining, seeks to extract subjective opinion or sentiment from text, video or audio data. The basic aim is to determine the attitude of an individual or group regarding a particular topic or overall context. Use it when you want to understand stakeholder opinion.
  10. Image analytics: Image analytics is the process of extracting information, meaning and insights from images such as photographs, medical images or graphics. As a process it relies heavily on pattern recognition, digital geometry and signal processing. Image analytics can be used in a number of ways, such as facial recognition for security purposes.
  11. Video analytics: Video analytics is the process of extracting information, meaning and insights from video footage. It includes everything that image analytics can do plus it can also measure and track behavior. You could use it if you wanted to know more about who is visiting your store or premises and what they are doing when they get there.
  12. Voice analytics: Voice analytics, also known as speech analytics, is the process of extracting information from audio recordings of conversations. This form of analytics can analyze the topics or actual words and phrases being used, as well as the emotional content of the conversation. You could use voice analytics in a call center to help identify recurring customer complaints or technical issues.
  13. Monte Carlo Simulation: The Monte Carlo Simulation is a mathematical problem-solving and risk-assessment technique that approximates the probability of certain outcomes, and the risk of certain outcomes, using computerized simulations of random variables. It is useful if you want to better understand the implications and ramifications of a particular course of action or decision (a small simulation sketch follows this list).
  14. Linear programming: Also known as linear optimization, this is a method of identifying the best outcome based on a set of constraints using a linear mathematical model. It allows you to solve problems involving minimizing and maximizing conditions, such as how to maximize profit while minimizing costs. It’s useful if you have a number of constraints such as time, raw materials, etc. and you wanted to know the best combination or where to direct your resources for maximum profit.
  15. Cohort analysis: This is a subset of behavioral analytics, which allows you to study the behavior of a group over time. It is especially useful if you want to know more about the behavior of a group of stakeholders, such as customers or employees.
  16. Factor analysis: This is the collective name given to a group of statistical techniques that are used primarily for data reduction and structure detection. It can reduce the number of variables within data to help make it more useful. Use it if you need to analyze and understand more about the interrelationships among a large number of variables.
  17. Neural network analysis: A neural network is a computer program modeled on the human brain, which can process a huge amount of information and identify patterns in a similar way that we do. Neural network analysis is therefore the process of analyzing the mathematical modeling that makes up a neural network. This technique is particularly useful if you have a large amount of data.
  18. Meta analytics/literature analysis: Meta analysis is the term that describes the synthesis of previous studies in an area in the hope of identifying patterns, trends or interesting relationships among the pre-existing literature and study results. Essentially, it is the study of previous studies. It is useful whenever you want to obtain relevant insights without conducting any studies yourself.
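
As a toy illustration of the Monte Carlo Simulation entry above (item 13), a Python sketch that approximates the probability of a project finishing within budget; the three cost distributions are invented assumptions, not figures from the article:

import random

def simulate_project_cost(trials=100_000, budget=120.0):
    """Approximate the probability that total project cost stays within budget."""
    within_budget = 0
    for _ in range(trials):
        design = random.gauss(30, 5)      # mean 30, std dev 5 (assumed)
        build = random.gauss(60, 15)      # mean 60, std dev 15 (assumed)
        testing = random.uniform(10, 30)  # anywhere between 10 and 30 (assumed)
        if design + build + testing <= budget:
            within_budget += 1
    return within_budget / trials

print(f"P(cost <= budget) ~ {simulate_project_cost():.2%}")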

Monday, November 30, 2015

Data Quality




CPPS -- Cleansing, Profiling, Parsing, and Standardization
MEM -- Matching, Enrichment, and Monitoring


Data profiling is a methodical approach to identifying data issues such as inconsistency, missing data, duplicates, etc. A data profile is carried out to analyze the health of the data.

Data Quality Metrics --

Completeness, Conformity, Consistency, Duplicates, Accuracy, and Validity.

IDQ has two types of transformations: a. general (PowerCenter transformations) and b. data quality transformations.



Profiling Data Overview

Use profiling to find the content, quality, and structure of data sources of an application, schema, or enterprise. The data source content includes value frequencies and data types. The data source structure includes keys and functional dependencies.

Analysts and developers can use these tools to collaborate, identify data quality issues, and analyze data relationships.


Data profiling is often the first step in a project. You can run a profile to evaluate the structure of data and verify that data columns are populated with the types of information you expect. If a profile reveals problems in data, you can define steps in your project to fix those problems. For example, if a profile reveals that a column contains values of greater than expected length, you can design data quality processes to remove or fix the problem values.


A profile that analyzes the data quality of selected columns is called a column profile.

Note: You can also use the Developer tool to discover primary key, foreign key, and functional dependency relationships, and to analyze join conditions on data columns.

A column profile provides the following facts about data:

• The number of unique and null values in each column, expressed as a number and a percentage.
• The patterns of data in each column, and the frequencies with which these values occur.
• Statistics about the column values, such as the maximum and minimum lengths of values and the first and last values in each column.
• For join analysis profiles, the degree of overlap between two data columns, displayed as a Venn diagram and as a percentage value. Use join analysis profiles to identify possible problems with column join conditions.

You can run a column profile at any stage in a project to measure data quality and to verify that changes to the data meet your project objectives. You can run a column profile on a transformation in a mapping to indicate the effect that the transformation will have on data.
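
Outside the Informatica tools, the same column-profile facts can be approximated in a few lines of Python/pandas (a rough sketch; the sample columns are hypothetical):

import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, None],
    "gender": ["M", "F", "f", None, "M"],
})

def column_profile(series: pd.Series) -> dict:
    """Rough per-column profile: null/unique percentages, value frequencies, length stats."""
    non_null = series.dropna().astype(str)
    return {
        "null_pct": series.isna().mean() * 100,
        "unique_pct": series.nunique(dropna=True) / max(len(series), 1) * 100,
        "value_frequencies": non_null.value_counts().to_dict(),
        "min_length": non_null.str.len().min() if not non_null.empty else None,
        "max_length": non_null.str.len().max() if not non_null.empty else None,
    }

for name in df.columns:
    print(name, column_profile(df[name]))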

You can perform the following tasks in both the Developer tool and Analyst tool:

• Perform column profiling. The process includes discovering the number of unique values, null values, and data patterns in a column.
• Perform data domain discovery. You can discover critical data characteristics within an enterprise.
• Curate profile results including data types, data domains, primary keys, and foreign keys.
• Create scorecards to monitor data quality.
• View scorecard lineage for each scorecard metric and metric group.
• Create and assign tags to data objects.
• Look up the meaning of an object name as a business term in the Business Glossary Desktop. For example, you can look up the meaning of a column name or profile name to understand its business requirement and current implementation.


You can perform the following tasks in the Developer tool:

• Discover the degree of potential joins between two data columns in a data source.
• Determine the percentage of overlapping data in pairs of columns within a data source or multiple data sources.
• Compare the results of column profiling.
• Generate a mapping object from a profile.
• Discover primary keys in a data source.
• Discover foreign keys in a set of one or more data sources.
• Discover functional dependency between columns in a data source.
• Run data discovery tasks on a large number of data sources across multiple connections. The data discovery tasks include column profile, inference of primary key and foreign key relationships, data domain discovery, and generating a consolidated graphical summary of the data relationships.

You can perform the following tasks in the Analyst tool:

• Perform enterprise discovery on a large number of data sources across multiple connections. You can view a consolidated discovery results summary of column metadata and data domains.
• Perform discovery search to find where the data and metadata exists in the enterprise. You can search for specific assets, such as data objects, rules, and profiles. Discovery search finds assets and identifies relationships to other assets in the databases and schemas of the enterprise




Informatica Analyst Frequently Asked Questions


Can I use one user account to access the Administrator tool, the Developer tool, and the Analyst tool?

Yes. You can give a user permission to access all three tools. You do not need to create separate user accounts for each client application.


Where is my reference data stored?

You can use the Developer tool and the Analyst tool to create and share reference data objects.
The Model repository stores the reference data object metadata. The reference data database stores reference table data values. Configure the reference data database on the Content Management Service.


Informatica Developer Frequently Asked Questions

What is the difference between a source and target in PowerCenter and a physical data object in the Developer tool?

In PowerCenter, you create a source definition to include as a mapping source. You create a target definition to include as a mapping target. In the Developer tool, you create a physical data object that you can use as a mapping source or target.

What is the difference between a mapping in the Developer tool and a mapping in PowerCenter?

A PowerCenter mapping specifies how to move data between sources and targets.
A Developer tool mapping specifies how to move data between the mapping input and output.
A PowerCenter mapping must include one or more source definitions, source qualifiers, and target definitions. A PowerCenter mapping can also include shortcuts, transformations, and mapplets.
A Developer tool mapping must include mapping input and output. A Developer tool mapping can also include transformations and mapplets.


The Developer tool has the following types of mappings:

• Mapping that moves data between sources and targets. This type of mapping differs from a PowerCenter mapping only in that it cannot use shortcuts and does not use a source qualifier.

• Logical data object mapping. A mapping in a logical data object model. A logical data object mapping can contain a logical data object as the mapping input and a data object as the mapping output. Or, it can contain one or more physical data objects as the mapping input and logical data object as the mapping output.

• Virtual table mapping. A mapping in an SQL data service. It contains a data object as the mapping input and a virtual table as the mapping output.

• Virtual stored procedure mapping. Defines a set of business logic in an SQL data service. It contains an Input Parameter transformation or physical data object as the mapping input and an Output Parameter transformation or physical data object as the mapping output.



What is the difference between a mapplet in PowerCenter and a mapplet in the Developer tool?

A mapplet in PowerCenter and in the Developer tool is a reusable object that contains a set of transformations. You can reuse the transformation logic in multiple mappings.

A PowerCenter mapplet can contain source definitions or Input transformations as the mapplet input. It must contain Output transformations as the mapplet output.

A Developer tool mapplet can contain data objects or Input transformations as the mapplet input. It can contain data objects or Output transformations as the mapplet output.

A mapping in the Developer tool also includes the following features:

• You can validate a mapplet as a rule. You use a rule in a profile.

• A mapplet can contain other mapplets.


What is the difference between a mapplet and a rule?

You can validate a mapplet as a rule. A rule is business logic that defines conditions applied to source data when you run a profile. You can validate a mapplet as a rule when the mapplet meets the following requirements:

• It contains an Input and Output transformation.

• The mapplet does not contain active transformations.

• It does not specify cardinality between input groups.

Sunday, November 22, 2015

Database Session Topics - Newbies

Database Session Topics
What is RDBMS – Abstraction Layers
Concepts
• SQL
• Relationships
• Transactions
• ACID – Atomicity Consistency Isolation Durability
• CRUD (a small sketch follows this outline)
• DATABASE – data file & log file
• ODBC / JDBC
• Database Objects
• Table – column, row
• View
• Transaction Log
• Trigger
• Index
• Stored procedure
• Functions
• Cursor
• Indexes: Primary key, foreign key
Components
• Concurrency control
• Data dictionary
• Query language
• Query optimizer
• Query plan
Conclusion
• Data Models: ER Models / Dimensional Models

• OLTP/OLAP/ETL
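
To make a few of these topics concrete (CRUD, transactions, cursors), a small Python/sqlite3 sketch; the employee table and values are made up:

import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
cur = conn.cursor()

cur.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")

# CRUD inside a single transaction: either all of it commits or none of it does.
try:
    cur.execute("INSERT INTO employee (name, salary) VALUES (?, ?)", ("Alice", 50000))     # Create
    cur.execute("UPDATE employee SET salary = salary * 1.10 WHERE name = ?", ("Alice",))   # Update
    cur.execute("DELETE FROM employee WHERE salary < ?", (10000,))                         # Delete
    conn.commit()      # Durability: the changes are now permanent
except sqlite3.Error:
    conn.rollback()    # Atomicity: undo the partial work on failure

for row in cur.execute("SELECT id, name, salary FROM employee"):  # Read
    print(row)
conn.close()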